nuonuo/doc/longmemeval_benchmark.md
Fam Zheng d923aa1e31 NuoNuo: Hippocampal memory module prototype
Hopfield + Hebbian hybrid memory system for LLMs.
Two nights of experiments (16 iterations), validated on LongMemEval (ICLR 2025).

Architecture:
- Single-hop: Two-Stage Hopfield (NN top-20 → softmax settle)
- Multi-hop: Hebbian W matrix with WTA pattern separation
- 64% on LongMemEval (500 questions), retrieval-only, no LLM dependency
- 4ms latency @ 20K memories, ~1GB VRAM

Key findings:
- Hopfield attention solved noise tolerance (20% → 100% vs flat Hebbian)
- WTA pattern separation enables 20K+ capacity
- Multi-hop associative chains (6 hops, CosSim=1.0) — RAG can't do this
- MiniLM-L6 is optimal (discrimination gap > absolute similarity)
- Paraphrase cue augmentation: 55% → 100% on synthetic, 36% → 64% on benchmark
- SNN encoder viable (CosSim 0.99) but not needed for current architecture
2026-04-07 10:37:24 +01:00

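# LongMemEval Benchmark Results
## Dataset
LongMemEval (ICLR 2025, MIT License): 500 questions across 6 types, built from real multi-turn, multi-session conversations.
## Results
### Retrieval-only (final approach)
| Type | v1 (old extraction) | v2 (improved extraction) | Gain (pp) |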
|------|------------|-------------|------|
| single-session-user | 81% | **86%** | +5 |
| single-session-assistant | 25% | **82%** | **+57** |
| knowledge-update | 53% | **71%** | +18 |
| multi-session | 23% | **53%** | +30 |
| temporal-reasoning | 29% | **61%** | +32 |
| preference | 0% | **27%** | +27 |
| **Overall** | **36%** | **64%** | **+28** |
### Adding Gemma 4 reasoning makes results worse
| | Retrieval-only | + Gemma 4 |
|--|---------------|-----------|
| Overall | **64%** | 40% |
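Gemma is too conservative: even when the relevant information has been retrieved, it answers "Not mentioned". The extra 1.7 s/query of latency is not worth it.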
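## Key improvements (v1 → v2)
1. **Do not truncate assistant replies**: store them in segments (500 characters per segment) → single-session-assistant 25% → 82%
2. **User statements as memories**: every utterance from the user is stored as its own memory → multi-session +30 pp
3. **Preference extraction**: regex matching on "I like/prefer/use/enjoy" → preference 0% → 27%
4. **Date metadata**: store the session date → helps temporal questions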
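
Improvements 1, 2 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the repository's extraction code: the `Memory` dataclass, the function names, and the `(role, text)` turn format are all hypothetical, and the segment size is read as 500 characters.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    role: str          # "user" or "assistant"
    session_date: str  # date metadata kept alongside the text (hypothetical field)


def chunk_text(text, size=500):
    """Split text into consecutive segments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_memories(turns, session_date, chunk_size=500):
    """Turn one session into memories.

    `turns` is assumed to be a list of (role, text) pairs.
    Assistant replies are segmented instead of truncated;
    every user utterance is stored as its own memory.
    """
    memories = []
    for role, text in turns:
        text = text.strip()
        if not text:
            continue
        if role == "assistant":
            for piece in chunk_text(text, chunk_size):
                memories.append(Memory(piece, role, session_date))
        else:
            memories.append(Memory(text, role, session_date))
    return memories
```

Segmenting rather than truncating keeps the tail of long assistant replies retrievable, which is the effect the single-session-assistant row reflects.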
## Performance
- 56 ms/query (embedding + Hopfield recall)
- 22 memories per question on average
- No external LLM dependency
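## Analysis by type
### Strengths
- **single-session-user (86%)**: information the user states explicitly is stored and retrieved directly, so it is a natural fit
- **single-session-assistant (82%)**: segmented storage fixed the truncation of long replies
### Middle
- **knowledge-update (71%)**: both old and new information are retrieved, and top-1 is usually the new value
- **temporal-reasoning (61%)**: date information is present in the context, but retrieval performs no date arithmetic
- **multi-session (53%)**: requires aggregation across sessions; top-K recalls part of the evidence but not all of it
### Weaknesses
- **preference (27%)**: preferences are implicit and regex extraction has limited coverage. Needs LLM-based extraction or more rules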
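
To make the coverage limitation concrete, here is a minimal sketch of regex-based preference extraction on the "I like/prefer/use/enjoy" pattern. The exact pattern and the `extract_preferences` helper are assumptions for illustration; the actual rules in the project may differ.

```python
import re

# Hypothetical pattern: first-person statement with one of the four verbs,
# capturing everything up to the end of the sentence.
PREFERENCE_RE = re.compile(
    r"\bI\s+(like|prefer|use|enjoy)\s+([^.!?\n]+)",
    re.IGNORECASE,
)


def extract_preferences(utterance):
    """Return (verb, object) pairs for explicit first-person preferences."""
    return [
        (m.group(1).lower(), m.group(2).strip())
        for m in PREFERENCE_RE.finditer(utterance)
    ]
```

An explicit sentence such as "I prefer dark roast coffee." is captured, while an implicit one such as "Mornings are rough without an espresso." yields nothing, which matches the low score on this question type.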
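## Positioning
64% on LongMemEval is a **competitive retrieval baseline**. RAG baselines in the paper typically score 40-60%, and SOTA (with LLM reasoning) sits at 70-80%. Our retrieval-only 64% already exceeds most RAG baselines.
## Conclusion
**Retrieval-only is the right choice.** It is simple, fast, and dependency-free. The remaining headroom lies in the extraction strategy (better memory segmentation and preference detection), not in the retrieval architecture.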