chore: update README with full algorithm, remove concrete hardware specs
README.md
@@ -1,15 +1,11 @@
 # Context Gatekeeper

-> ⚠️ **Project status**: The code is complete and passes its tests; no paper has been written yet. For academic use, first run controlled experiments on standard datasets such as QuAC or CoQA.
-
-**Inspiration and background**: https://gitea.ephron.ren/elaina/context-gatekeeper/src/branch/main/SUMMARY.md
-
 A lightweight context selector that automatically picks the smallest relevant set of fragments from the conversation history within a single session, reducing topic contamination and keeping context length under control.

 ## Features

 - 🚀 **Pure Python**: no dependency on vectorization models (no embedding, reranker, or classifier)
-- 💻 **Lightweight**: runs smoothly on a 2-core, 2 GB machine
+- 💻 **Very low resource usage**: few dependencies; runs in an ordinary private deployment
 - 🔍 **Topic gating**: decides continue/switch from anchor overlap + new_ratio, with forced inheritance when a deictic word is present
 - 📦 **Sparse recall**: BM25/IDF-overlap scoring, with the user side weighted above the assistant side
 - 🎯 **Minimal cover**: greedy selection based on IDF-weighted set cover
@@ -94,16 +90,31 @@ python test_comparison.py

 ## Algorithm details

+### Anchor extraction
+
+Extracts keyword units with retrieval value from the text. Supported forms:
+
+- **Chinese**: 2-grams and 3-grams (e.g. "分布式锁" (distributed lock), "跨进程通信" (inter-process communication))
+- **English**: whole words
+- **Code**: identifiers and version numbers (e.g. `v1.2.3`)
+- **Quoted phrases**: complete technical terms
+
+Rule-driven, with no word-segmentation library required, so extraction is very fast.
+
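The rules listed above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual extractor: the three patterns, the lowercasing, and the name `extract_anchors` taking a plain string are all assumptions made for the example.

```python
import re

# Illustrative patterns only (assumptions, not the project's actual rules).
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”|「([^」]+)」')
TOKEN = re.compile(r'v?\d+(?:\.\d+)+|[A-Za-z_][A-Za-z0-9_]*')
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def extract_anchors(text):
    """Collect quoted phrases, versions/identifiers/words, and CJK 2- and 3-grams."""
    anchors = set()
    for match in QUOTED.finditer(text):
        phrase = next(g for g in match.groups() if g).strip().lower()
        if phrase:
            anchors.add(phrase)
    # Version numbers are tried first so "v1.2.3" is not split into "v1".
    anchors.update(m.group(0).lower() for m in TOKEN.finditer(text))
    # Slide a window over each run of CJK characters; no segmenter needed.
    for run in CJK_RUN.findall(text):
        for n in (2, 3):
            anchors.update(run[i:i + n] for i in range(len(run) - n + 1))
    return anchors


anchors = extract_anchors('Redis 的 GeoHash 在 v1.2.3 中如何使用 "distributed lock"')
```

Sliding n-grams over-generate (for example "何使" from "如何使用"), which is the usual price of skipping word segmentation; IDF weighting downstream is what keeps such fragments from dominating.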
 ### Topic gating decision

 ```
 overlap = Σ IDF(t) for t ∈ A(q)∩A(T) / Σ IDF(t) for t ∈ A(q)
 new_ratio = Σ IDF(t) for t ∈ A(q)\A(T) / Σ IDF(t) for t ∈ A(q)

-if overlap > 0.45: continue
-elif overlap < 0.20 and new_ratio > 0.70: switch
-elif has_deictic: continue  # deictic word: force inheritance
-else: continue  # middle ground: continue by default
+if overlap > 0.45:  # high overlap: continue the current topic
+    continue
+elif overlap < 0.20 and new_ratio > 0.70:  # mostly new terms: switch topic
+    switch
+elif has_deictic:  # deictic word present: force inheritance
+    continue
+else:
+    continue  # middle ground: continue by default, so a developing line of thought is not cut off
 ```

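As a rough sketch of the gating rule, the function below computes both ratios from anchor sets and applies the thresholds in the same order. The function names, the `default_idf` fallback for unseen anchors, and the handling of an empty query are assumptions for illustration.

```python
def idf_mass(anchors, idf, default_idf=1.0):
    """Sum of IDF weights; anchors missing from `idf` get `default_idf` (assumption)."""
    return sum(idf.get(t, default_idf) for t in anchors)


def gate(query_anchors, topic_anchors, idf, has_deictic=False):
    """Return (decision, overlap, new_ratio) for a query against the active topic."""
    total = idf_mass(query_anchors, idf)
    if total == 0:
        # No anchors in the query: nothing to decide on, so stay put (assumption).
        return "continue", 0.0, 0.0
    overlap = idf_mass(query_anchors & topic_anchors, idf) / total
    new_ratio = idf_mass(query_anchors - topic_anchors, idf) / total
    if overlap > 0.45:
        decision = "continue"
    elif overlap < 0.20 and new_ratio > 0.70:
        decision = "switch"
    elif has_deictic:
        decision = "continue"
    else:
        decision = "continue"
    return decision, overlap, new_ratio


idf = {"redis": 1.0, "geohash": 3.0, "asyncio": 3.0, "python": 1.0}
topic = {"redis", "geohash", "zset"}
print(gate({"python", "asyncio"}, topic, idf)[0])   # all-new anchors
print(gate({"redis", "asyncio"}, topic, idf)[0])    # mixed anchors
```

With these definitions the two ratios split the same IDF mass, so they sum to 1 and `new_ratio > 0.70` already holds whenever `overlap < 0.20`; the second condition is kept here only to mirror the rule as written.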
 ### Sparse recall scoring
@@ -112,12 +123,23 @@ else: continue  # middle ground: continue by default
 score = 1.5·lex(u_b,q) + 0.7·lex(a_b,q) + 1.0·exact(b,q) + 0.2·recency(b)
 ```

+- `lex(u_b,q)`: BM25/IDF overlap between the user turn and the query (weight 1.5)
+- `lex(a_b,q)`: BM25/IDF overlap between the assistant turn and the query (weight 0.7; the assistant side usually carries less information)
+- `exact(b,q)`: exact-match bonus (a keyword is hit verbatim)
+- `recency(b)`: freshness bonus; more recent turns score higher
+
+The top 20 candidates move on to the next step.
+
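A small sketch of the scoring step follows. Only the four weights and the top-20 cutoff come from the formula above; `lex` here is a plain IDF-overlap stand-in rather than full BM25, and the definitions of `exact` (every query anchor present in the block) and `recency` (linear in position) are assumptions chosen to keep the example short.

```python
import heapq


def lex(turn_anchors, query_anchors, idf, default_idf=1.0):
    """IDF-weighted share of query anchors found in one side of a block."""
    total = sum(idf.get(t, default_idf) for t in query_anchors)
    if total == 0:
        return 0.0
    hit = sum(idf.get(t, default_idf) for t in query_anchors & turn_anchors)
    return hit / total


def score_block(user_anchors, assistant_anchors, query_anchors, idf, index, n_blocks):
    # exact: 1.0 only when every query anchor appears in the block (assumption).
    block_anchors = user_anchors | assistant_anchors
    exact = 1.0 if query_anchors and query_anchors <= block_anchors else 0.0
    # recency: linear in position, the newest block gets 1.0 (assumption).
    recency = (index + 1) / n_blocks
    return (1.5 * lex(user_anchors, query_anchors, idf)
            + 0.7 * lex(assistant_anchors, query_anchors, idf)
            + 1.0 * exact
            + 0.2 * recency)


def top_k(blocks, query_anchors, idf, k=20):
    """blocks: list of (user_anchors, assistant_anchors), oldest first. Returns indices."""
    scored = [(score_block(u, a, query_anchors, idf, i, len(blocks)), i)
              for i, (u, a) in enumerate(blocks)]
    return [i for _, i in heapq.nlargest(k, scored)]


blocks = [
    ({"redis", "geohash"}, {"geohash", "坐标"}),
    ({"python", "asyncio"}, {"事件", "循环"}),
    ({"redis", "持久化"}, set()),
]
ranking = top_k(blocks, {"redis", "geohash"}, {"redis": 1.0, "geohash": 3.0})
```

Because the user side is weighted 1.5 against 0.7, a block where the user asked about the query terms outranks one where only the assistant mentioned them.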
 ### Minimal-cover gain

 ```
 gain(b|S) = Σ IDF(t) for t ∈ cov(b)\covered(S) / cost(b)^α, α=0.8
+
+Stop when coverage reaches 85% or the token budget is exhausted.
 ```

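The gain formula and both stopping conditions can be sketched as a greedy loop that recomputes the gain of every remaining block each round. The tuple layout of `blocks`, the `token_budget` default, and measuring coverage as the IDF-weighted share of query anchors are assumptions for this illustration.

```python
def greedy_cover(blocks, query_anchors, idf, alpha=0.8, target=0.85,
                 token_budget=1000, default_idf=1.0):
    """blocks: list of (block_id, anchors, token_cost). Returns (ids, coverage)."""

    def mass(anchors):
        return sum(idf.get(t, default_idf) for t in anchors)

    total = mass(query_anchors)
    if total == 0:
        return [], 0.0
    covered, selected, spent = set(), [], 0
    remaining = list(blocks)
    while remaining and mass(covered) / total < target:
        best, best_gain = None, 0.0
        for block in remaining:
            _, anchors, cost = block
            if spent + cost > token_budget:
                continue  # this block would exceed the budget
            new = (anchors & query_anchors) - covered
            gain = mass(new) / max(cost, 1) ** alpha
            if gain > best_gain:
                best, best_gain = block, gain
        if best is None:
            break  # nothing affordable adds coverage
        selected.append(best[0])
        covered |= best[1] & query_anchors
        spent += best[2]
        remaining.remove(best)
    return selected, mass(covered) / total


idf = {"redis": 1.0, "geohash": 3.0, "半径": 2.0}
blocks = [("t46", {"redis", "geohash", "半径"}, 120),
          ("t12", {"redis"}, 40),
          ("t47", {"geohash"}, 60)]
picked, coverage = greedy_cover(blocks, {"redis", "geohash", "半径"}, idf)
```

The exponent α below 1 softens the cost penalty, so a longer block that covers several rare anchors at once can still beat a short block covering one.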
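+**Why IDF weighting**: High-frequency words (e.g. "数据" (data), "系统" (system)) discriminate poorly; low-frequency words (e.g. "GeoHash", "分布式锁" (distributed lock)) are the real semantic anchors. Weighting by IDF ensures that the selected fragments carry real information instead of repeatedly covering common high-frequency words.
+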
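To make the effect concrete, here is one common smoothed IDF variant computed over per-turn anchor sets. The README does not state which IDF formula the project uses, so the smoothing `log((N + 1) / (df + 1)) + 1` and the name `build_idf` are assumptions.

```python
import math
from collections import Counter


def build_idf(anchor_sets):
    """Smoothed IDF per anchor: log((N + 1) / (df + 1)) + 1 (assumed variant)."""
    n = len(anchor_sets)
    # Document frequency: in how many turns each anchor appears.
    df = Counter(t for anchors in anchor_sets for t in set(anchors))
    return {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}


turns = [{"数据", "redis"}, {"数据", "geohash"}, {"数据", "系统"}, {"系统", "分布式锁"}]
idf = build_idf(turns)
```

In this toy history "数据" appears in three of four turns and ends up with the lowest weight, while "geohash" appears once and gets the highest, which is the ordering the paragraph above relies on.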
 ## Controlled experiment (50-turn conversation)

 Using the SiliconFlow Qwen/Qwen3-8B model on a 50-turn conversation (the first 35 turns on Redis, the next 10 on Python, the last 5 on Redis):
@@ -129,13 +151,54 @@ gain(b|S) = Σ IDF(t) for t ∈ cov(b)\covered(S) / cost(b)^α, α=0.8

 With gating enabled, the query "Redis 的 GeoHash 用来做什么?" ("What is Redis GeoHash used for?") recalls only turn 46 (an exact match); all Python asyncio turns are filtered out.

+Full pseudocode:
+
+```
+function select(q, turns):
+    # 1. Anchor extraction
+    anchors_q = extract_anchors(q)
+    active_topic = get_active_topic()
+
+    # 2. Topic gating
+    overlap = compute_overlap(anchors_q, active_topic)
+    new_ratio = compute_new_ratio(anchors_q, active_topic)
+
+    if overlap < 0.20 and new_ratio > 0.70:
+        active_topic = create_new_topic(anchors_q)  # switch
+    elif has_deictic(q):
+        inherit_recent(2)  # deictic word: force-inherit the last 2 turns
+    # otherwise continue the current topic
+
+    # 3. Sparse recall
+    candidates = []
+    for each turn i:
+        score_i = 1.5 * bm25(user_i, q) + 0.7 * bm25(assistant_i, q) + \
+                  1.0 * exact_match(i, q) + 0.2 * recency(i)
+        candidates.append((score_i, i))
+
+    top20 = top_k(candidates, k=20)
+
+    # 4. Greedy minimal-cover selection
+    selected = []
+    covered = empty_set()
+    for each block b in top20 sorted by gain:
+        new_anchors = extract_anchors(b) \ covered
+        if len(new_anchors) == 0: continue
+        gain_b = sum(IDF(t) for t in new_anchors) / cost(b)^0.8
+        selected.append((gain_b, b))
+        covered.update(new_anchors)
+        if coverage(covered) >= 0.85: break
+
+    return selected
+```
+
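The pseudocode calls `has_deictic(q)` and `inherit_recent(2)` without defining them. A possible shape for both is sketched below; the cue-word lists and the extra `turns` argument are hypothetical, since the README does not show the project's own list or signature.

```python
import re

# Hypothetical cue lists; the project's real list is not shown in the README.
ZH_DEICTIC = ("这个", "那个", "它", "上面", "刚才", "前面", "上述")
EN_DEICTIC = re.compile(r"\b(it|this|that|these|those|above|former|latter)\b",
                        re.IGNORECASE)


def has_deictic(query):
    """True when the query contains a word that points back at earlier turns."""
    return any(word in query for word in ZH_DEICTIC) or bool(EN_DEICTIC.search(query))


def inherit_recent(turns, n=2):
    """Return the last n turns so they are always kept in the selected context."""
    return list(turns[-n:]) if n > 0 else []


print(has_deictic("它的时间复杂度是多少?"))
print(inherit_recent(["t1", "t2", "t3", "t4"]))
```

Forcing the last two turns into context matters because a query such as "它的时间复杂度是多少?" ("What is its time complexity?") shares almost no anchors with the turn it refers to, so lexical scoring alone would miss it.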
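 ## Limitations and applicable scenarios

 **Limitations:**
-- Sparse retrieval has limited recall when text is semantically similar but lexically different
-- Chinese anchors have no stop-word filtering, so frequent but meaningless words can distort IDF
+- Sparse retrieval depends on lexical matching, so semantically close text with different wording is easily missed
 - Token counts are a rough estimate (character count × 1.5), off from the real value by a factor of 2-3
-- The smallest unit is a whole block; there is no sentence-level trimming inside a block
+- The smallest unit is a whole block; there is no sentence-level trimming inside a block, so boundaries are coarse
+- No controlled experiments have been run on standard academic datasets such as QuAC, so there is no direct comparison with attention-based methods such as Attentive History
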
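For reference, the character-based token estimate mentioned in the limitations amounts to a one-line heuristic. The function name and the rounding up are assumptions; only the 1.5 factor comes from the text.

```python
import math


def estimate_tokens(text, factor=1.5):
    """Rough token estimate: character count times a fixed factor, rounded up."""
    return math.ceil(len(text) * factor)


print(estimate_tokens("分布式锁"))
print(estimate_tokens("abc"))
```

The same factor is applied to Chinese and ASCII text alike, which is one reason the estimate can drift far from what a real tokenizer reports.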
 **Applicable scenarios:**
 - Resource-constrained production environments (edge devices, private deployments)