fix: anchor stopwords - remove generic question patterns causing cross-topic contamination

- Add ANCHOR_STOPWORDS set in anchor.py (truly generic question patterns)
- Filter Chinese n-grams against stopwords in extract()
- Update sparse.py content_words extraction to use stopword-filtered query
- Diagnosis: 'Git rebase vs merge' query now correctly excludes Redis/asyncio blocks
- Phase 1 results: Full CGK averages 42.6 tokens with 0% contamination (vs. Last-5 at 67.6 tokens, 100% contamination)
- Phase 2 ablation: Gate-only accounts for most of the benefit
- Phase 3 sensitivity: results are insensitive to the OVERLAP/NEW_RATIO thresholds on clean data;
  RECENT_WINDOW is the primary token budget control
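
The stopword filtering described in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code in anchor.py: the stopword entries, the n-gram range (2 to 4 characters), and the helper name chinese_ngrams are all assumptions, since the actual contents of ANCHOR_STOPWORDS and extract() are not shown here.

```python
import re

# Hypothetical entries; the real ANCHOR_STOPWORDS set in anchor.py is not shown.
ANCHOR_STOPWORDS = {"什么", "怎么", "如何", "为什么", "是什么"}

# Matches runs of CJK unified ideographs.
_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def chinese_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams (n_min..n_max) from each CJK run in text."""
    grams = []
    for run in _CJK_RUN.findall(text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                grams.append(run[i:i + n])
    return grams


def extract(query):
    """Return Chinese n-gram anchors, dropping any that are generic question patterns."""
    return [g for g in chinese_ngrams(query) if g not in ANCHOR_STOPWORDS]


anchors = extract("缓存怎么用")
```

Without the filter, an n-gram such as "怎么" would match any earlier question that used the same phrasing, regardless of topic, which is the cross-topic contamination this commit targets.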

Known limitations:
- Test set is clean 4-topic synthetic data (no real dirty dialogue)
- No strong baselines (BM25 ablation incomplete)
- No answer-level evaluation (only retrieval blocks measured)
- No parameter sensitivity on noisy real-world data
- Zero contamination on 5 queries is too small a sample to generalize from
Elaina
2026-04-22 22:30:18 +08:00
parent 2064eb7bdf
commit 9e44748f91
10 changed files with 1461 additions and 12 deletions

@@ -0,0 +1,112 @@
{
"Full CGK": [
{
"pt": 16,
"cont": false
},
{
"pt": 59,
"cont": false
},
{
"pt": 19,
"cont": false
},
{
"pt": 56,
"cont": false
},
{
"pt": 61,
"cont": false
}
],
"-Deictic": [
{
"pt": 16,
"cont": false
},
{
"pt": 59,
"cont": false
},
{
"pt": 19,
"cont": false
},
{
"pt": 56,
"cont": false
},
{
"pt": 61,
"cont": false
}
],
"-Exact Match": [
{
"pt": 16,
"cont": false
},
{
"pt": 59,
"cont": false
},
{
"pt": 19,
"cont": false
},
{
"pt": 56,
"cont": false
},
{
"pt": 61,
"cont": false
}
],
"-Trim": [
{
"pt": 16,
"cont": false
},
{
"pt": 59,
"cont": false
},
{
"pt": 19,
"cont": false
},
{
"pt": 56,
"cont": false
},
{
"pt": 61,
"cont": false
}
],
"Gate-only": [
{
"pt": 16,
"cont": false
},
{
"pt": 59,
"cont": false
},
{
"pt": 45,
"cont": false
},
{
"pt": 56,
"cont": false
},
{
"pt": 61,
"cont": false
}
]
}
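
To compare the variants in the JSON above, a small script can reduce each one to an average and a rate. The sketch below assumes that "pt" is the token count for a query and "cont" flags contamination; neither meaning is stated in the file itself. Only two variants are inlined to keep it short; the values are copied from the JSON.

```python
from statistics import mean


def summarize(results):
    """Map each variant to its mean 'pt' and the fraction of rows with 'cont' true."""
    summary = {}
    for variant, rows in results.items():
        summary[variant] = {
            "avg_pt": mean(row["pt"] for row in rows),
            "contamination_rate": sum(row["cont"] for row in rows) / len(rows),
        }
    return summary


# Values copied from the JSON above (two of the five variants).
results = {
    "Full CGK": [
        {"pt": 16, "cont": False},
        {"pt": 59, "cont": False},
        {"pt": 19, "cont": False},
        {"pt": 56, "cont": False},
        {"pt": 61, "cont": False},
    ],
    "Gate-only": [
        {"pt": 16, "cont": False},
        {"pt": 59, "cont": False},
        {"pt": 45, "cont": False},
        {"pt": 56, "cont": False},
        {"pt": 61, "cont": False},
    ],
}

summary = summarize(results)
for variant, stats in summary.items():
    print(variant, stats)
```

In this file, Gate-only differs from Full CGK only on the third query (45 vs. 19), and the -Deictic, -Exact Match, and -Trim variants are identical to Full CGK on all five queries.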