add nocmem: auto memory recall + ingest via NuoNuo hippocampal network

- nocmem Python service (mem/): FastAPI wrapper around NuoNuo's Hopfield-Hebbian memory, with /recall, /ingest, /store, /stats endpoints
- NOC integration: auto recall after user message (injected as system msg), async ingest after LLM response (fire-and-forget)
- Recall: cosine pre-filter (threshold 0.35) + Hopfield attention (β=32), top_k=3, KV-cache friendly (appended after user msg, not in system prompt)
- Ingest: LLM extraction + paraphrase augmentation, heuristic fallback
- Wired into main.rs, life.rs (agent done), http.rs (api chat)
- Config: optional `nocmem.endpoint` in config.yaml
- Includes benchmarks: LongMemEval (R@5=94.0%), efficiency, noise vs scale
- Design doc: doc/nocmem.md
2026-04-11 12:24:48 +01:00
parent 688387dac3
commit 7000ccda0f
17 changed files with 4164 additions and 3 deletions
--- a/doc/nocmem.md
+++ b/doc/nocmem.md
@@ -0,0 +1,277 @@
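+# nocmem: NOC automatic memory system
+
+## Motivation
+
+NOC's existing memory consists of 100 text slots (200 characters per slot) plus a sliding-window summary. All of it is packed into the system prompt and sent with every conversation.
+
+Problems:
+- No semantic retrieval, so irrelevant memories waste tokens
+- Slot capacity is limited and does not scale
+- No associative ability (A is mentioned → B comes to mind → C follows)
+
+nocmem replaces the naive text slots with NuoNuo's hybrid Hopfield-Hebbian memory network, providing **automatic recall** and **automatic storage**.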
+
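+## Core technology
+
+### NuoNuo Hippocampal Memory
+
+A biologically inspired two-layer memory architecture (see `../nuonuo/doc/architecture.md` for details):
+
+**Layer 1: Hopfield (single-hop, noise-tolerant)**
+
+Stores (cue, target) embedding pairs. Recall runs in two stages:
+
+1. **NN pre-filter**: cosine similarity selects the top-K candidates (K=20)
+2. **Hopfield settle**: β-scaled softmax attention iterates to convergence (3 steps)
+
+Key property: **paraphrase tolerance**. When a user asks about the same thing in different words, the memory is still recalled. This works by storing cue variants (multiple phrasings of the same memory) and aggregating attention by memory_id.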
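The two-stage recall can be sketched roughly as follows. This is a minimal pure-Python illustration, not NuoNuo's implementation: the function names, the `(memory_id, cue_embedding)` entry layout, and the exact settle update (attention-weighted sum of cues, renormalized) are assumptions made for the example.

```python
import math
from collections import defaultdict


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recall(query, entries, top_k=20, beta=32.0, steps=3):
    """entries: list of (memory_id, cue_embedding) pairs (hypothetical layout).

    Returns {memory_id: attention mass}, summed over all cue variants.
    """
    if not entries:
        return {}
    q = normalize(query)
    # Stage 1: cosine pre-filter keeps the top_k most similar cues.
    unit = [(mem_id, normalize(cue)) for mem_id, cue in entries]
    ranked = sorted(unit, key=lambda e: dot(q, e[1]), reverse=True)[:top_k]
    cues = [cue for _, cue in ranked]
    # Stage 2: Hopfield settle, a fixed number of beta-scaled attention steps.
    state, weights = q, []
    for _ in range(steps):
        weights = softmax([beta * dot(state, cue) for cue in cues])
        mixed = [sum(w * cue[i] for w, cue in zip(weights, cues))
                 for i in range(len(state))]
        state = normalize(mixed)
    # Aggregate the last step's attention by memory_id so that several
    # paraphrased cues of one memory reinforce each other.
    mass = defaultdict(float)
    for (mem_id, _), w in zip(ranked, weights):
        mass[mem_id] += w
    return dict(mass)
```

With a large β the softmax is sharp, so nearly all attention mass lands on the cues closest to the query; summing by `memory_id` is what lets any one of several stored phrasings trigger the same memory.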
+
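+**Layer 2: Hebbian (multi-hop, associative chains)**
+
+WTA pattern separation (384D → 16384D sparse code, k=50, sparsity 0.3%) plus an outer-product weight matrix W.
+
+Once Hopfield finds the starting point, Hebbian follows the association chain via `W @ code`: A → B → C.
+
+Conventional RAG cannot do this: vector search only finds items that are "similar", whereas Hebbian can find items that are "related but not similar".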
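A toy sketch of the Hebbian hop, assuming sparse codes are represented as sets of active unit indices and W as a nested dictionary. The names and the tiny dimensions are illustrative only; the doc's actual sizes are 16384 units with k=50 active, which is where the 0.3% sparsity comes from (50 / 16384).

```python
from collections import defaultdict


def wta_code(activations, k):
    """Winner-take-all: indices of the k largest activations.

    Ties are broken in favour of the lower index.
    """
    order = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    return frozenset(order[:k])


def hebbian_store(W, src_code, dst_code):
    """Add the outer product of dst and src to W.

    W[j][i] holds the weight from source unit j to target unit i.
    """
    for j in src_code:
        for i in dst_code:
            W[j][i] += 1.0


def hebbian_hop(W, code, k, dim):
    """One association step: compute W @ code, then re-sparsify with WTA."""
    act = [0.0] * dim
    for j in code:
        for i, w in W.get(j, {}).items():
            act[i] += w
    return wta_code(act, k)


def follow_chain(W, start, hops, k, dim):
    """Follow the association chain for a fixed number of hops."""
    chain = [start]
    for _ in range(hops):
        chain.append(hebbian_hop(W, chain[-1], k, dim))
    return chain


# Usage: store A -> B and B -> C, then walk two hops starting from A.
W = defaultdict(lambda: defaultdict(float))
a, b, c = frozenset({0, 1, 2}), frozenset({3, 4, 5}), frozenset({6, 7, 8})
hebbian_store(W, a, b)
hebbian_store(W, b, c)
chain = follow_chain(W, a, hops=2, k=3, dim=12)
```

Because A, B, and C share no active units here, nothing about their codes is "similar"; the link exists only in W. This sketch always takes the requested number of hops, so a real implementation would presumably stop when the activations become too weak.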
+
+**Performance metrics**
+
+| Metric | Value |
+|------|------|
+| Paraphrase recall (+augmentation, 2K bg) | 95-100% |
+| Multi-hop (3 hops, 500 bg) | 100% |
+| Scale (20K memories, no augmentation) | 80% |
+| Recall latency @ 20K | 4ms |
+| VRAM | ~1 GB |
+
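+### Embedding
+
+Uses `all-MiniLM-L6-v2` (384 dimensions), which runs on either CPU or GPU. Reasons for the choice:
+
+- NuoNuo experiment (P1) showed that the **gap metric (the score difference between relevant and irrelevant items) matters more than absolute similarity**
+- MiniLM beats larger models such as BGE-large on the gap metric
+- Fast inference: ~1ms on GPU, ~10ms on CPU per query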
+
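+### Memory extraction
+
+After an exchange completes, an LLM extracts (cue, target, importance) triples from (user_msg, assistant_msg):
+
+- **cue**: the situation in which this memory should be recalled (trigger phrase)
+- **target**: the memory content itself
+- **importance**: an importance score from 0 to 1
+
+When the LLM is unavailable, extraction falls back to a heuristic (Q&A pattern detection + technical keyword matching).
+
+After extraction, the LLM generates 3 paraphrases for each cue, which are stored as cue_variants to make recall more robust.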
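The extraction flow, with its fallback, might look like the sketch below. `MemoryTriple`, `heuristic_extract`, and the `llm_extract` / `paraphrase` callables are hypothetical names introduced for illustration, and the heuristic shown (treat a question as the cue) is a deliberately crude stand-in for the Q&A and keyword detection the doc mentions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryTriple:
    cue: str            # trigger phrase
    target: str         # memory content
    importance: float   # score in [0, 1]
    cue_variants: List[str] = field(default_factory=list)


def heuristic_extract(user_msg, assistant_msg):
    """Crude stand-in fallback: a question becomes the cue, the reply the target."""
    if user_msg.rstrip().endswith(("?", "？")):
        return [MemoryTriple(user_msg.strip(), assistant_msg.strip(), 0.5)]
    return []


def extract(user_msg, assistant_msg, llm_extract=None, paraphrase=None):
    """Try the LLM first, fall back to the heuristic, then attach paraphrases."""
    triples = None
    if llm_extract is not None:
        try:
            triples = llm_extract(user_msg, assistant_msg)
        except Exception:
            triples = None  # LLM unavailable: use the heuristic instead
    if triples is None:
        triples = heuristic_extract(user_msg, assistant_msg)
    for t in triples:
        t.importance = min(1.0, max(0.0, t.importance))  # clamp to [0, 1]
        if paraphrase is not None:
            try:
                t.cue_variants = list(paraphrase(t.cue))[:3]  # keep 3 variants
            except Exception:
                pass  # store the memory without variants
    return triples
```

Passing the LLM calls in as plain callables keeps the fallback path easy to exercise without a model.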
+
+## Architecture
+
+```
+              ┌──────────────┐
+              │   Telegram   │
+              │     User     │
+              └──────┬───────┘
+                     │ message
+                     ▼
+              ┌──────────────┐
+              │     NOC      │
+              │    (Rust)    │
+              │              │
+              │ 1. Receive   │
+              │    user msg  │
+              │              │
+              │ 2. HTTP POST ├──────────────────┐
+              │    /recall   │                  │
+              │              │                  ▼
+              │              │         ┌─────────────────┐
+              │              │         │     nocmem      │
+              │              │         │    (Python)     │
+              │              │         │                 │
+              │              │         │ embed(query)    │
+              │              │◄────────┤ hippocampus     │
+              │   recalled   │         │   .recall()     │
+              │   memories   │         │ format results  │
+              │              │         └─────────────────┘
+              │ 3. Build messages:
+              │    [...history,
+              │     user_msg,
+              │     {role:system,
+              │      recalled memories}]
+              │              │
+              │ 4. Call LLM  │
+              │    (stream)  │
+              │              │
+              │ 5. Receive   │
+              │    response  │
+              │              │
+              │ 6. Async POST├──────────────────┐
+              │    /ingest   │                  │
+              │              │                  ▼
+              │              │         ┌─────────────────┐
+              │              │         │     nocmem      │
+              │              │         │                 │
+              │              │         │ LLM extract     │
+              │              │         │ embed + store   │
+              │              │         │ save checkpoint │
+              │              │         └─────────────────┘
+              │ 7. Reply to  │
+              │    user      │
+              └──────────────┘
+```
+
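+## Message injection strategy
+
+**Key design**: recalled memories are injected **after** the user message, as a separate system message.
+
+```json
+[
+  {"role": "system", "content": "persona + memory_slots + ..."},   // unchanged
+  {"role": "user", "content": "past message 1"},                   // history
+  {"role": "assistant", "content": "past reply 1"},
+  ...
+  {"role": "user", "content": "current user message"},             // current turn
+  {"role": "system", "content": "[Relevant memories]\n- memory 1\n- memory 2"}    // ← injected by nocmem
+]
+```
+
+Why not put them in the system prompt?
+
+**KV-cache friendly**. The system prompt is a prefix shared by all conversations. If its content changed on every message (to inject different recalled memories), the entire KV cache prefix would be invalidated and the preceding several thousand tokens would have to be recomputed.
+
+Placed after the user message, the prefix (system prompt + history + current user message) stays stable and only the recalled memories at the tail vary, which maximizes the KV cache hit rate.
+
+**Ephemeral**. Recalled memories are not persisted to the conversation history database. Each turn recalls independently: when the next message arrives, the memories relevant at that moment are recalled afresh. This keeps redundant memory injections from piling up in the history.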
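The injection rule can be illustrated with a small sketch. NOC itself is written in Rust, so this Python version is only a model of the message layout; `build_messages` and `shared_prefix_len` are hypothetical helper names.

```python
def build_messages(system_prompt, history, user_msg, recalled):
    """Assemble the request: stable prefix first, recalled memories last."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # persisted turns; never contains recalled memories
    messages.append({"role": "user", "content": user_msg})
    if recalled:  # skip the injection entirely when nothing was recalled
        messages.append({"role": "system", "content": recalled})
    return messages


def shared_prefix_len(a, b):
    """Number of leading messages that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Usage: the same turn built with two different recall results.
history = [
    {"role": "user", "content": "past message 1"},
    {"role": "assistant", "content": "past reply 1"},
]
m1 = build_messages("persona", history, "current message", "[Relevant memories]\n- memory 1")
m2 = build_messages("persona", history, "current message", "[Relevant memories]\n- memory 2")
```

The two requests differ only in their final message, so everything up to and including the current user message can be served from the KV cache. Since `history` is never mutated, the recalled block disappears on the next turn.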
+
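+## HTTP API
+
+### POST /recall
+
+Request:
+```json
+{"text": "Has the database been slow lately?"}
+```
+
+Response:
+```json
+{
+  "memories": "[Relevant memories]\n- Last time the database was slow because of a missing index (hop=1)\n- PostgreSQL runs on port 5432 (hop=2)",
+  "count": 2
+}
+```
+
+- If there are no relevant memories, returns `{"memories": "", "count": 0}`
+- NOC injects only when count > 0, to avoid an empty message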
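A sketch of how the response body could be assembled on the service side, assuming recall yields `(text, hop)` pairs. The function name and the input layout are assumptions, and the header string is left as a parameter because the exact literal the service emits is only known from the example.

```python
def format_recall_response(items, header="[Relevant memories]"):
    """items: list of (text, hop) pairs. Returns the /recall JSON body as a dict."""
    if not items:
        # Empty result: the caller is expected to skip injection when count == 0.
        return {"memories": "", "count": 0}
    lines = [header] + [f"- {text} (hop={hop})" for text, hop in items]
    return {"memories": "\n".join(lines), "count": len(items)}


# Usage: two recalled memories, one direct hit and one reached on a second hop.
body = format_recall_response([
    ("Last time the database was slow because of a missing index", 1),
    ("PostgreSQL runs on port 5432", 2),
])
```

Returning a pre-formatted string keeps the caller trivial: it only has to check `count` and append `memories` as a system message.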
+
+### POST /ingest
+
+Request:
+```json
+{
+  "user_msg": "Help me figure out why the database is slow",
+  "assistant_msg": "I checked: the users table is missing an index on the email column..."
+}
+```
+
+Response:
+```json
+{"stored": 2}
+```
+
+- Fire-and-forget: NOC does not wait for the response
+- Internal flow: LLM extraction → embed → generate paraphrases → store → save checkpoint
+
+### GET /stats
+
+```json
+{
+  "num_memories": 1234,
+  "num_cue_entries": 4500,
+  "augmentation_ratio": 3.6,
+  "vram_mb": 1024,
+  "embedding_model": "all-MiniLM-L6-v2"
+}
+```
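The sample figures are consistent with `augmentation_ratio` being the number of cue entries divided by the number of memories (4500 / 1234 ≈ 3.6). The doc does not state the formula, so the sketch below is an assumption inferred from the example values.

```python
def augmentation_ratio(num_memories, num_cue_entries):
    """Average number of stored cue entries per memory (assumed definition)."""
    if num_memories == 0:
        return 0.0  # avoid dividing by zero on an empty store
    return round(num_cue_entries / num_memories, 1)


ratio = augmentation_ratio(1234, 4500)
```

With one original cue plus 3 paraphrases per memory the ratio would approach 4, so a value below that suggests some memories were stored without a full set of variants.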
+
+## NOC-side changes
+
+### config.yaml
+
+```yaml
+nocmem:
+  endpoint: "http://127.0.0.1:9820"
+```
+
+### Rust changes (minimal)
+
+**`config.rs`**: add one optional field
+
+```rust
+#[serde(default)]
+pub nocmem: Option<NocmemConfig>,
+
+#[derive(Deserialize, Clone)]
+pub struct NocmemConfig {
+    pub endpoint: String,
+}
+```
+
+**`main.rs`** (main message-handling path):
+
+After `api_messages.push(user_msg)` and before `run_openai_with_tools`:
+
+```rust
+// auto recall from nocmem
+if let Some(ref nocmem) = config.nocmem {
+    if let Ok(recalled) = nocmem_recall(&nocmem.endpoint, &prompt).await {
+        if !recalled.is_empty() {
+            api_messages.push(serde_json::json!({
+                "role": "system",
+                "content": recalled
+            }));
+        }
+    }
+}
+```
+
+After the LLM reply (after `push_message`):
+
+```rust
+// async ingest to nocmem (fire-and-forget)
+if let Some(ref nocmem) = config.nocmem {
+    let endpoint = nocmem.endpoint.clone();
+    let u = prompt.clone();
+    let a = response.clone();
+    tokio::spawn(async move {
+        let _ = nocmem_ingest(&endpoint, &u, &a).await;
+    });
+}
+```
+
+`nocmem_recall` and `nocmem_ingest` are two simple HTTP call functions. recall uses a 500ms timeout; on failure it is skipped so the normal conversation is unaffected.
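The same timeout-and-skip behaviour can be mirrored in a small Python client, for example to smoke-test the service from a script. This is not the Rust helper itself: `recall_or_empty` is a hypothetical name, and the request and response shapes are taken from the `/recall` examples in this doc.

```python
import http.client
import json
import urllib.request


def recall_or_empty(endpoint, text, timeout=0.5):
    """POST to /recall; return the memories string, or "" on any failure."""
    req = urllib.request.Request(
        endpoint.rstrip("/") + "/recall",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed JSON:
        # treat all of them as "nothing recalled".
        return ""
    if not isinstance(body, dict):
        return ""
    count = body.get("count", 0)
    if not isinstance(count, int) or count <= 0:
        return ""
    memories = body.get("memories", "")
    return memories if isinstance(memories, str) else ""
```

Collapsing every failure into an empty string means the caller needs only one check before injecting, which is what keeps the chat path unaffected when nocmem is down.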
+
+### Call sites covered
+
+| Location | Scenario | recall | ingest |
+|------|------|--------|--------|
+| `main.rs` handle_message | User chat | ✅ | ✅ |
+| `life.rs` AgentDone | Sub-agent completion notice | ✅ | ❌ |
+| `life.rs` run_timer | Timer fired | ❌ | ❌ |
+| `http.rs` api_chat | HTTP API chat | ✅ | ✅ |
+| `gitea.rs` | Gitea webhook | ❌ | ❌ |
+
+## Deployment
+
+nocmem runs as a standalone Python service:
+
+```bash
+cd /data/src/noc/mem
+uv run uvicorn server:app --host 127.0.0.1 --port 9820
+```
+
+It can be managed with systemd. Checkpoints are persisted to `./data/hippocampus.pt` (relative to the mem directory).
+
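+## Future directions
+
+- **Importance decay**: memories that go unrecalled for a long time are automatically down-weighted
+- **Contradiction detection**: when a new memory conflicts with an old one, the old one is automatically replaced
+- **Memory consolidation (sleep consolidation)**: periodically merge fragmented memories into a more compact representation
+- **Merging with memory slots**: gradually migrate slot contents into nocmem and eventually retire the slot system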