Phase2 complete ablation (added missing variants):
- Coverage-only: 20% contamination rate (confirms Gate is critical)
- Gate-only: +5.2 tokens vs Full (coverage optimization marginal on clean data)
- -Recency: 0 effect on clean data
- -IDF: 0 effect on clean data

Phase4 end-to-end quality evaluation:
- CGK vs Last-5 across 5 queries:
  * CGK: 42.2 tok, purity=1.000, anchor_recall=0.638, term_cov=0.380, contamination=0
  * Last-5: 67.6 tok, purity=0.280, anchor_recall=0.066, term_cov=0.080, contamination=5
- All quality metrics CGK >> Last-5 on synthetic clean data

Known honest limitations:
- Still no real dialogue data (synthetic 4-topic only)
- No real LLM calls (quality is rule-estimated)
- Parameter sensitivity only on clean data, not noisy real data
{
  "cgk_avg_tokens": 42.2,
  "last5_avg_tokens": 67.6,
  "cgk_avg_purity": 1.0,
  "last5_avg_purity": 0.28,
  "cgk_avg_anchor_recall": 0.6382417582417583,
  "last5_avg_anchor_recall": 0.06598502946329034,
  "cgk_avg_term_coverage": 0.38,
  "last5_avg_term_coverage": 0.08,
  "cgk_contamination_episodes": 0,
  "last5_contamination_episodes": 5
}
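
For readers who want to check the Phase4 comparison, the summary values above can be paired up per metric with a few lines of Python. This is only a sketch: the helper names `compare` and `token_saving` are hypothetical and not part of the evaluation code, and the numbers are copied verbatim from the JSON summary rather than recomputed from raw runs.

```python
import json

# Values copied verbatim from the Phase4 JSON summary above.
SUMMARY = json.loads("""
{
  "cgk_avg_tokens": 42.2,
  "last5_avg_tokens": 67.6,
  "cgk_avg_purity": 1.0,
  "last5_avg_purity": 0.28,
  "cgk_avg_anchor_recall": 0.6382417582417583,
  "last5_avg_anchor_recall": 0.06598502946329034,
  "cgk_avg_term_coverage": 0.38,
  "last5_avg_term_coverage": 0.08,
  "cgk_contamination_episodes": 0,
  "last5_contamination_episodes": 5
}
""")


def compare(summary: dict) -> dict:
    """Pair each cgk_* key with its last5_* counterpart (hypothetical helper)."""
    report = {}
    for key, cgk_value in summary.items():
        if not key.startswith("cgk_"):
            continue
        metric = key[len("cgk_"):]
        last5_value = summary["last5_" + metric]
        report[metric] = {
            "cgk": cgk_value,
            "last5": last5_value,
            "delta": cgk_value - last5_value,  # positive means CGK is higher
        }
    return report


def token_saving(summary: dict) -> float:
    """Fraction of tokens CGK saves relative to Last-5 (hypothetical helper)."""
    return 1.0 - summary["cgk_avg_tokens"] / summary["last5_avg_tokens"]


report = compare(SUMMARY)
saving = token_saving(SUMMARY)

for metric, row in report.items():
    print(f"{metric:24s} cgk={row['cgk']:.3f} last5={row['last5']:.3f} delta={row['delta']:+.3f}")
print(f"token saving vs Last-5: {saving:.1%}")
```

On these numbers CGK uses roughly 38% fewer tokens than Last-5 on average. The same caveats apply as in the limitations list: the inputs are rule-estimated metrics on synthetic clean data.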