context-gatekeeper

Author	SHA1	Message	Date
Elaina	97e1ddf138	complete: full ablation + Phase4 quality evaluation + honest blog post Phase2 complete ablation (added missing variants): - Coverage-only: 20% contamination rate (confirms Gate is critical) - Gate-only: +5.2 tokens vs Full (coverage optimization marginal on clean data) - -Recency: 0 effect on clean data - -IDF: 0 effect on clean data Phase4 end-to-end quality evaluation: - CGK vs Last-5 across 5 queries: * CGK: 42.2 tok, purity=1.000, anchor_recall=0.638, term_cov=0.380, contamination=0 * Last-5: 67.6 tok, purity=0.280, anchor_recall=0.066, term_cov=0.080, contamination=5 - All quality metrics CGK >> Last-5 on synthetic clean data Known honest limitations: - Still no real dialogue data (synthetic 4-topic only) - No real LLM calls (quality is rule-estimated) - Parameter sensitivity only on clean data, not noisy real data	2026-04-22 22:48:25 +08:00
Elaina	9e44748f91	fix: anchor stopwords - remove generic question patterns causing cross-topic contamination - Add ANCHOR_STOPWORDS set in anchor.py (truly generic question patterns) - Filter Chinese n-grams against stopwords in extract() - Update sparse.py content_words extraction to use stopword-filtered query - Diagnosis: 'Git rebase vs merge' query now correctly excludes Redis/asyncio blocks - Phase1 results: Full CGK 42.6 tokens avg, 0% contamination (vs Last-5 67.6 tokens, 100%) - Phase2 ablation: Gate-only accounts for most of the benefit - Phase3 sensitivity: OVERLAP/NEW_RATIO thresholds insensitive on clean data; RECENT_WINDOW is the primary token budget control Known honest limitations: - Test set is clean 4-topic synthetic data (no real dirty dialogue) - No strong baselines (BM25 ablation incomplete) - No answer-level evaluation (only retrieval blocks measured) - No parameter sensitivity on noisy real-world data - Zero contamination on 5 queries is not generalizable	2026-04-22 22:30:18 +08:00
Elaina	2064eb7bdf	docs: add DESIGN.md following Google Stitch spec	2026-04-22 19:33:01 +08:00
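Elaina	d18a521f9c	fix: resolve 4 high-priority issues found in review 1. sparse.py: topic-switch filtering now uses continue instead of assigning a score of 0, so old-topic candidates are actually excluded 2. gatekeeper.py: reset() clears the IDF cache to avoid state pollution in new sessions 3. gatekeeper.py: re-estimate the token count after sentence-level trimming 4. sparse.py: content_words extraction now includes all English words (including very short ones such as 'pg') and 2-character Chinese words	2026-04-22 12:21:52 +08:00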
Elaina	c828fceae9	chore: update README with complete algorithm and 100-round 4-topic results	2026-04-22 12:12:04 +08:00
Elaina	07b66d3b58	chore: update README with full algorithm, remove concrete hardware specs	2026-04-22 11:14:19 +08:00
Elaina	9a2b1e3b6a	chore: remove paper, add summary, update README	2026-04-22 10:49:11 +08:00
Elaina	8852f1b1fb	chore: remove paper (unfinished)	2026-04-22 10:43:50 +08:00
Elaina	a8204a50b5	docs: update README.md with algorithm details, limitations, and applicable scenarios	2026-04-22 09:49:17 +08:00
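Elaina	93156cf736	docs: fix inconsistencies between the paper and the docs - recency: 'time decay' → 'freshness bonus (the newer, the larger)' - remove Section 3.6 on sentence-level trimming (not implemented) - add the middle-ground fallback rule (0.20≤overlap≤0.45 defaults to continue) - correct the MS MARCO author: Liu→Nguyen - label the 10ms latency as a theoretical estimate and remove unsupported data - update the limitations description to match the implementation status	2026-04-22 09:46:47 +08:00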
Elaina	224295ccaf	fix: selector gain function uses IDF weighting, consistent with the docs - selector.select() accepts an idf_cache parameter - gain = ΣIDF(t) for t ∈ new_anchors / cost^α (matches the documented formula) - gatekeeper.select() passes anchor_extractor._idf_cache to the selector - sparse.py recency comment clarified as 'freshness bonus' rather than 'time decay' - all tests pass (9/9)	2026-04-22 09:45:30 +08:00
Elaina	7ced5d9a10	docs: add paper "Context Gatekeeper"	2026-04-22 01:22:13 +08:00
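Elaina	64ca67c051	fix: _active_topic not updated after a topic switch Problem: _active_topic was updated only on continue and kept its old value after a switch, which broke the overlap calculation. Fix: - select() now updates _active_topic on every call (whether or not a switch occurred) - also call topic_gate.update_active_topic() to keep the two copies of the state consistent This also updates the active-topic state of the TopicGate instance, resolving the problem of the two states being tracked independently.	2026-04-22 01:14:13 +08:00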
Elaina	bbaab47de4	docs: add SPEC.md specification document	2026-04-22 01:12:03 +08:00
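Elaina	071f9ef418	feat: initial implementation of the context gatekeeper - anchor.py: anchor extraction (Chinese 2/3-grams, English words, code identifiers) - block.py: dialogue block data structure - topic_gate.py: topic gate (switch detection via overlap/new_ratio) - sparse.py: sparse retrieval (BM25/IDF-overlap + exact-match bonus) - selector.py: greedy minimum-cover selection - gatekeeper.py: full pipeline wrapper - tests/: unit tests + end-to-end tests (including MiniMax API verification) Features: - pure Python, no additional model dependencies - runs in a 2-core, 2 GB environment - topic gate + sparse retrieval + minimum-cover selection	2026-04-22 01:09:35 +08:00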

15 Commits
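
The gain formula quoted in commit 224295ccaf (gain = ΣIDF(t) for t ∈ new_anchors / cost^α) together with the "greedy minimum-cover selection" named in the initial commit can be illustrated roughly as follows. This is only a sketch under assumptions, not the repository's selector.py: the `Block` fields, the `greedy_select` name and signature, the default α of 1.0, and the treatment of anchors missing from the IDF table (weight 0) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical dialogue block: an id, its anchors, and a token cost."""
    block_id: int
    anchors: frozenset
    cost: int  # estimated token count


def greedy_select(blocks, query_anchors, idf, alpha=1.0):
    """Greedily pick blocks that cover the query anchors.

    At each step the block with the highest
        gain = sum(IDF(t) for t in new_anchors) / cost**alpha
    is chosen, where new_anchors are query anchors not yet covered.
    Selection stops when nothing uncovered remains or no block helps.
    """
    uncovered = set(query_anchors)
    remaining = list(blocks)
    chosen = []

    def gain(block):
        new_anchors = block.anchors & uncovered
        # Assumption: anchors absent from the IDF table contribute 0.
        weight = sum(idf.get(t, 0.0) for t in new_anchors)
        return weight / (max(block.cost, 1) ** alpha)

    while uncovered and remaining:
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # no remaining block covers a new anchor
        chosen.append(best)
        uncovered -= best.anchors
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    idf = {"git": 1.0, "rebase": 2.0, "merge": 2.0, "redis": 2.0}
    blocks = [
        Block(1, frozenset({"git", "rebase"}), 10),
        Block(2, frozenset({"git"}), 5),
        Block(3, frozenset({"redis"}), 5),
        Block(4, frozenset({"merge"}), 4),
    ]
    picked = greedy_select(blocks, {"git", "rebase", "merge"}, idf)
    print([b.block_id for b in picked])
```

Dividing by cost^α biases the choice toward short blocks that still introduce rare anchors; a block whose anchors are already covered scores zero and is never picked, which is what keeps the selected set small.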