- Add ANCHOR_STOPWORDS set in anchor.py (truly generic interrogative patterns)
- Filter Chinese n-grams against stopwords in extract()
- Update sparse.py content_words extraction to use the stopword-filtered query
- Diagnosis: the 'Git rebase vs merge' query now correctly excludes Redis/asyncio blocks
- Phase1 results: Full CGK 42.6 tokens avg, 0% contamination (vs Last-5 67.6 tokens, 100% contamination)
- Phase2 ablation: Gate-only accounts for most of the benefit
- Phase3 sensitivity: OVERLAP/NEW_RATIO thresholds are insensitive on clean data; RECENT_WINDOW is the primary token budget control

Known honest limitations:
- Test set is clean 4-topic synthetic data (no real dirty dialogue)
- No strong baselines (BM25 ablation incomplete)
- No answer-level evaluation (only retrieval blocks measured)
- No parameter sensitivity on noisy real-world data
- Zero contamination on 5 queries is not generalizable
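The stopword filtering described in the first three bullets could look roughly like the sketch below. It is not the code from anchor.py or sparse.py: the contents of ANCHOR_STOPWORDS, the helper names (`strip_stopwords`, `chinese_ngrams`, `extract_anchors`), and the 2 to 4 character n-gram range are all assumptions made for illustration. The idea shown is to remove generic interrogative patterns from the query first, then build Chinese n-grams only from what remains.

```python
import re

# Hypothetical stopword set (generic interrogative patterns);
# the real contents of ANCHOR_STOPWORDS in anchor.py are not shown in the commit.
ANCHOR_STOPWORDS = {"什么", "怎么", "如何", "为什么", "是否", "哪些"}

# Runs of CJK Unified Ideographs; Latin words and punctuation act as separators.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def strip_stopwords(query):
    """Replace each stopword with a space, longest first so that
    '为什么' is removed as a whole before '什么' is considered."""
    for stop in sorted(ANCHOR_STOPWORDS, key=len, reverse=True):
        query = query.replace(stop, " ")
    return query


def chinese_ngrams(text, n_min=2, n_max=4):
    """Yield every n-gram of n_min..n_max characters from each CJK run."""
    for run in CJK_RUN.findall(text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                yield run[i:i + n]


def extract_anchors(query):
    """Return deduplicated Chinese n-grams of the stopword-filtered query."""
    seen, anchors = set(), []
    for gram in chinese_ngrams(strip_stopwords(query)):
        if gram not in seen:
            seen.add(gram)
            anchors.append(gram)
    return anchors


print(extract_anchors("如何使用缓存"))
print(extract_anchors("Git rebase 和 merge 有什么区别"))
```

Filtering the query before n-gram generation, rather than discarding n-grams afterwards, avoids leftover fragments that straddle a stopword boundary (for example a bigram made of the last character of a stopword and the first character of the next word).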
112 lines
1.3 KiB
JSON
{
  "Full CGK": [
    {
      "pt": 16,
      "cont": false
    },
    {
      "pt": 59,
      "cont": false
    },
    {
      "pt": 19,
      "cont": false
    },
    {
      "pt": 56,
      "cont": false
    },
    {
      "pt": 61,
      "cont": false
    }
  ],
  "-Deictic": [
    {
      "pt": 16,
      "cont": false
    },
    {
      "pt": 59,
      "cont": false
    },
    {
      "pt": 19,
      "cont": false
    },
    {
      "pt": 56,
      "cont": false
    },
    {
      "pt": 61,
      "cont": false
    }
  ],
  "-Exact Match": [
    {
      "pt": 16,
      "cont": false
    },
    {
      "pt": 59,
      "cont": false
    },
    {
      "pt": 19,
      "cont": false
    },
    {
      "pt": 56,
      "cont": false
    },
    {
      "pt": 61,
      "cont": false
    }
  ],
  "-Trim": [
    {
      "pt": 16,
      "cont": false
    },
    {
      "pt": 59,
      "cont": false
    },
    {
      "pt": 19,
      "cont": false
    },
    {
      "pt": 56,
      "cont": false
    },
    {
      "pt": 61,
      "cont": false
    }
  ],
  "Gate-only": [
    {
      "pt": 16,
      "cont": false
    },
    {
      "pt": 59,
      "cont": false
    },
    {
      "pt": 45,
      "cont": false
    },
    {
      "pt": 56,
      "cont": false
    },
    {
      "pt": 61,
      "cont": false
    }
  ]
}
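To compare the variants in the JSON above at a glance, a short script can reduce each one to an average and a contamination rate. This is only a sketch: it assumes "pt" is the per-query token count and "cont" the per-query contamination flag (the file itself does not define them), and it inlines the values instead of reading the file, whose name is not given here. `summarize` is a hypothetical helper.

```python
from statistics import mean

# "pt" values copied from the JSON above; every "cont" flag in the file is false.
PT = {
    "Full CGK":     [16, 59, 19, 56, 61],
    "-Deictic":     [16, 59, 19, 56, 61],
    "-Exact Match": [16, 59, 19, 56, 61],
    "-Trim":        [16, 59, 19, 56, 61],
    "Gate-only":    [16, 59, 45, 56, 61],
}

# Rebuild the same schema as the JSON file: {variant: [{"pt": ..., "cont": ...}, ...]}.
results = {
    name: [{"pt": pt, "cont": False} for pt in pts]
    for name, pts in PT.items()
}


def summarize(results):
    """Return {variant: (mean pt, fraction of queries with cont == True)}."""
    return {
        name: (
            mean(row["pt"] for row in rows),
            sum(row["cont"] for row in rows) / len(rows),
        )
        for name, rows in results.items()
    }


summary = summarize(results)
for name, (avg_pt, cont_rate) in summary.items():
    print(f"{name:<13} avg pt = {avg_pt:.1f}  contamination = {cont_rate:.0%}")
```

On these five queries the three single-component ablations (-Deictic, -Exact Match, -Trim) are identical to Full CGK; only Gate-only differs, and only on the third query (45 vs 19).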