Improve AI daily report operations and dedupe observability

Add Stage 2.8 recall, quality gate, retries, and publish idempotency
fix: add cross-day dedupe
2026-06-10 21:55:29 +08:00 · 2026-06-10 21:31:13 +08:00 · 2026-06-08 12:05:45 +08:00 · 2026-06-04 17:42:08 +08:00 · 2026-06-04 17:12:59 +08:00 · 2026-06-04 16:51:12 +08:00
81 changed files with 7771 additions and 1316 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,9 @@
 .env
 .env.*
 !.env.example
 __pycache__/
 *.py[cod]
 .pytest_cache/
 runs/
 runs-*/
 .idea/
--- a/.learnings/ERRORS.md
+++ b/.learnings/ERRORS.md
@@ -0,0 +1,144 @@
 ## [ERR-20260606-001] computer_use_helper_startup
 **Logged**: 2026-06-06T00:00:00+08:00
 **Priority**: medium
 **Status**: pending
 **Area**: infra
 ### Summary
 Computer Use helper failed during Windows automation startup.
 ### Error
 ```text
 node_repl kernel exited unexpectedly
 windows sandbox failed: spawn setup refresh
 ```
 ### Context
 - Operation attempted: initialize Computer Use and list Windows apps.
 - Retried after resetting the JavaScript session.
 - Both attempts failed before any app automation actions were taken.
 ### Suggested Fix
 Investigate the Computer Use Windows helper startup path and sandbox setup; retry after the helper/runtime is refreshed.
 ### Metadata
 - Reproducible: yes
 - Related Files: C:/Users/12256/.codex/plugins/cache/openai-bundled/computer-use/26.602.40724/scripts/computer-use-client.mjs
 ---
 ## [ERR-20260610-001] absolute_path_prefixed_with_workspace
 **Logged**: 2026-06-10T00:00:00+08:00
 **Priority**: low
 **Status**: pending
 **Area**: docs
 ### Summary
 An absolute skill file path was accidentally prefixed with the current workspace path when verifying completion.
 ### Error
 ```text
 Get-Content : Cannot find path 'E:\Codes\ai-daily-report\C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md'
 ```
 ### Context
 - Operation attempted: read `C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md`.
 - The command used a malformed literal path that concatenated the workspace root and the absolute path.
 - Re-running with the actual absolute path succeeded.
 ### Suggested Fix
 When reading skill files or other absolute Windows paths, pass the `C:\...` path directly and do not combine it with the workspace path.
 ### Metadata
 - Reproducible: yes
 - Related Files: C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md
 ---
 ## [ERR-20260608-003] git_push_auth_failed
 **Logged**: 2026-06-08T00:00:00+08:00
 **Priority**: medium
 **Status**: pending
 **Area**: infra
 ### Summary
 `git push origin main` failed because the Gitea remote rejected authentication.
 ### Error
 ```text
 remote: Failed to authenticate user
 fatal: Authentication failed for 'https://gitea.ephron.ren/Elaina/ai-daily-report.git/'
 ```
 ### Context
 - Operation attempted: push committed cross-day dedupe fix to `origin/main`.
 - Local commit exists: `07786e3 fix: add cross-day dedupe`.
 - Test suite passed before commit: `79 passed`.
 ### Suggested Fix
 Refresh Git credentials for `https://gitea.ephron.ren` or switch the remote to an authenticated SSH/HTTPS URL, then rerun `git push origin main`.
 ### Metadata
 - Reproducible: yes
 - Related Files: git remote origin
 ---
 ## [ERR-20260608-002] powershell_convertfromjson_mojibake
 **Logged**: 2026-06-08T00:00:00+08:00
 **Priority**: low
 **Status**: pending
 **Area**: tests
 ### Summary
 PowerShell `ConvertFrom-Json` failed on a generated report containing existing mojibake section labels, while Python `json.loads` parsed the same report successfully.
 ### Error
 ```text
 ConvertFrom-Json : Invalid object passed in, ':' or '}' expected.
 ```
 ### Context
 - Operation attempted: verify CLI dry-run output by piping `run_report.json` through `ConvertFrom-Json`.
 - Follow-up verification with Python `json.loads` succeeded and confirmed `stage2_5` and `stage8` fields.
 ### Suggested Fix
 Use Python's JSON parser for verification in this repository when report content includes mojibake-rendered non-ASCII strings.
 ### Metadata
 - Reproducible: yes
 - Related Files: run_report.json
 ---
 ## [ERR-20260608-001] apply_patch_context_encoding
 **Logged**: 2026-06-08T00:00:00+08:00
 **Priority**: low
 **Status**: pending
 **Area**: tests
 ### Summary
 `apply_patch` failed when matching context lines that contained mojibake-rendered Chinese text.
 ### Error
 ```text
 apply_patch verification failed: Failed to find expected lines
 ```
 ### Context
 - Operation attempted: update `tests/test_stage2_dedupe.py` with a patch anchored on displayed non-ASCII strings.
 - The file content rendered differently enough that the expected context did not match.
 ### Suggested Fix
 Use ASCII-only anchors, line-number inspection, or smaller structural context when patching files that contain mojibake-rendered non-ASCII text.
 ### Metadata
 - Reproducible: yes
 - Related Files: tests/test_stage2_dedupe.py
 ---
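For the `ConvertFrom-Json` entry in the log above (ERR-20260608-002), the suggested Python check might look like the following. This is a minimal sketch, not the repository's verification script: the helper name `check_report` and the sample payload are hypothetical, and only the `run_report.json` file name and the `stage2_5` / `stage8` keys come from the log entry.

```python
import json
import tempfile
from pathlib import Path


def check_report(path: Path, required_keys: tuple[str, ...] = ("stage2_5", "stage8")) -> list[str]:
    """Return the required top-level keys missing from a run report (hypothetical helper)."""
    # Read with an explicit UTF-8 encoding so non-ASCII section labels are
    # decoded the same way regardless of the console code page.
    report = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(report, dict):
        return list(required_keys)
    return [key for key in required_keys if key not in report]


# Illustrative payload only; a real report carries many more fields.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "run_report.json"
    sample.write_text(
        json.dumps({"stage2_5": {}, "stage8": {"status": "dry-run"}, "label": "模型与能力"}, ensure_ascii=False),
        encoding="utf-8",
    )
    missing = check_report(sample)
    print(missing)
```

An empty list means both stage fields are present. Reading the file with an explicit encoding is the point of the sketch, since the log attributes the PowerShell failure to mojibake in the report text.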
--- a/ai_daily_report/__init__.py
+++ b/ai_daily_report/__init__.py
@@ -0,0 +1,2 @@
 """Core package for the AI daily report pipeline."""
--- a/ai_daily_report/assemble.py
+++ b/ai_daily_report/assemble.py
@@ -0,0 +1,91 @@
 from __future__ import annotations
 import re
 from typing import Any
 from .classify import SECTION_ORDER
 from .models import NewsItem
 from .validate import validate_markdown
 END_PUNCTUATION = "。！？；.!?;"
 def _clean_text(text: str) -> str:
    value = re.sub(r"^```(?:\w+)?\s*\n?", "", (text or "").strip())
    value = re.sub(r"\n?```\s*$", "", value)
    value = re.sub(r"^\s*>\s*", "", value)
    value = re.sub(r"\[\d+\]|\[N\]", "", value)
    value = re.sub(r"主线判断[：:]\s*", "", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value
 def _ensure_sentence(text: str) -> str:
    value = _clean_text(text)
    if value and value[-1] not in END_PUNCTUATION:
        value += "。"
    return value
 def _source_link(item: NewsItem) -> str:
    source = item.source_label or item.source_group or "来源"
    if item.url:
        return f"[{source} ↗]({item.url})"
    return source
 def _fallback_intro(items: list[NewsItem]) -> str:
    count = len(items)
    return f"今天共聚合 {count} 条 AI 动态，覆盖模型能力、产品应用、基础设施、资本与治理等方向。"
 def _fallback_conclusion(items: list[NewsItem]) -> str:
    sections = [section for section in SECTION_ORDER if any(item.section == section for item in items)]
    if sections:
        return "总体看，今日 AI 动态主要集中在" + "、".join(sections[:4]) + "等方向，后续仍需持续观察落地进展。"
    return "总体看，今日 AI 动态仍在持续演进，后续需要关注产品落地和生态变化。"
 def assemble_markdown(items: list[NewsItem], guide: dict[str, Any] | None = None) -> tuple[str, dict[str, Any]]:
    guide = guide or {"intro": "", "theme": "", "threads": [], "conclusion": ""}
    lines: list[str] = []
    intro = _ensure_sentence(str(guide.get("intro") or "")) or _fallback_intro(items)
    lines.extend(["## 引言", "", f"> {intro}", ""])
    item_number = 1
    for section in SECTION_ORDER:
        section_items = [item for item in items if item.section == section]
        if not section_items:
            continue
        lines.extend([f"## {section}", ""])
        for item in section_items:
            title = _clean_text(item.title or item.title_raw)
            summary = _ensure_sentence(item.summary or item.summary_raw or "该条目暂无摘要。")
            lines.extend(
                [
                    f"**{item_number}. {title}**",
                    "",
                    f"> {summary}{_source_link(item)}",
                    "",
                ]
            )
            item_number += 1
    threads = guide.get("threads", []) or []
    if threads:
        lines.extend(["## 今日脉络", ""])
        for thread in threads:
            title = _clean_text(str(thread.get("title") or ""))
            text = _ensure_sentence(str(thread.get("text") or ""))
            if not title or not text:
                continue
            lines.extend([f"- **{title}**", f"  {text}", ""])
    conclusion = _ensure_sentence(str(guide.get("conclusion") or "")) or _fallback_conclusion(items)
    lines.extend(["## 总结", "", f"> {conclusion}", ""])
    markdown = "\n".join(lines).strip()
    report = validate_markdown(markdown, items)
    return markdown, report
--- a/ai_daily_report/audit.py
+++ b/ai_daily_report/audit.py
@@ -0,0 +1,89 @@
 from __future__ import annotations
 import json
 from pathlib import Path
 from typing import Any
 def load_run_report(path: Path) -> dict[str, Any] | None:
    report_path = path / "run_report.json" if path.is_dir() else path
    if not report_path.exists():
        return None
    try:
        value = json.loads(report_path.read_text(encoding="utf-8"))
    except Exception:
        return None
    return value if isinstance(value, dict) else None
 def summarize_reports(out_dir: Path, *, limit_days: int = 7) -> dict[str, Any]:
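    # A missing output directory is treated as "no runs" instead of raising.
    candidates = out_dir.iterdir() if out_dir.is_dir() else []
    run_dirs = sorted([path for path in candidates if path.is_dir()], reverse=True)[:limit_days]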
    rows: list[dict[str, Any]] = []
    totals: dict[str, Any] = {
        "source_failures": 0,
        "duplicate_candidates": 0,
        "final_items": 0,
        "fallback_items": 0,
        "quality_warnings": 0,
        "quality_blocks": 0,
    }
    for run_dir in sorted(run_dirs):
        report = load_run_report(run_dir)
        if not report:
            continue
        quality_gate = report.get("quality_gate", {}) or {}
        stage2_8 = report.get("stage2_8", {}) or {}
        stage4 = report.get("stage4", {}) or {}
        stage5 = report.get("stage5", {}) or {}
        stage8 = report.get("stage8", {}) or {}
        fallback_count = int(stage4.get("fallback_count", stage4.get("fallback_item_count", 0)) or 0)
        final_count = int(stage5.get("output_count", stage4.get("output_count", 0)) or 0)
        source_failures = len(quality_gate.get("source_failures", []) or [])
        duplicate_candidates = int(stage2_8.get("candidate_group_count", 0) or 0)
        warnings = len(quality_gate.get("warnings", []) or [])
        blocks = len(quality_gate.get("blocking_errors", []) or [])
        row = {
            "date": run_dir.name,
            "source_failures": source_failures,
            "duplicate_candidates": duplicate_candidates,
            "final_items": final_count,
            "fallback_items": fallback_count,
            "fallback_ratio": round(fallback_count / final_count, 4) if final_count else 0,
            "quality_warnings": warnings,
            "quality_blocks": blocks,
            "publish_status": stage8.get("status"),
            "publish_slug": stage8.get("slug"),
        }
        rows.append(row)
        totals["source_failures"] += source_failures
        totals["duplicate_candidates"] += duplicate_candidates
        totals["final_items"] += final_count
        totals["fallback_items"] += fallback_count
        totals["quality_warnings"] += warnings
        totals["quality_blocks"] += blocks
    totals["fallback_ratio"] = round(totals["fallback_items"] / totals["final_items"], 4) if totals["final_items"] else 0
    return {"run_count": len(rows), "totals": totals, "runs": rows}
 def render_markdown(summary: dict[str, Any]) -> str:
    totals = summary.get("totals", {})
    lines = [
        "# AI日报每周自动审计报告",
        "",
        f"- 覆盖运行数：{summary.get('run_count', 0)}",
        f"- 源失败次数：{totals.get('source_failures', 0)}",
        f"- 重复候选数：{totals.get('duplicate_candidates', 0)}",
        f"- 最终条数：{totals.get('final_items', 0)}",
        f"- fallback ratio：{totals.get('fallback_ratio', 0)}",
        f"- 质量门禁 warning/block：{totals.get('quality_warnings', 0)}/{totals.get('quality_blocks', 0)}",
        "",
        "| 日期 | 源失败 | 重复候选 | 最终条数 | fallback | warning | block | 发布 | slug |",
        "|---|---:|---:|---:|---:|---:|---:|---|---|",
    ]
    for row in summary.get("runs", []) or []:
        lines.append(
            f"| {row['date']} | {row['source_failures']} | {row['duplicate_candidates']} | "
            f"{row['final_items']} | {row['fallback_ratio']} | {row['quality_warnings']} | "
            f"{row['quality_blocks']} | {row.get('publish_status') or ''} | {row.get('publish_slug') or ''} |"
        )
    return "\n".join(lines) + "\n"
--- a/ai_daily_report/candidate_recall.py
+++ b/ai_daily_report/candidate_recall.py
@@ -0,0 +1,162 @@
 from __future__ import annotations
 import difflib
 import re
 from collections import defaultdict
 from typing import Any
 from .dedupe import _jaccard_similarity, _title_tokens
 from .models import NewsItem
 DEFAULT_CONFIG = {
    "enabled": True,
    "max_pairs": 80,
    "max_pairs_per_item": 5,
    "title_similarity_threshold": 0.45,
    "title_jaccard_threshold": 0.25,
    "summary_jaccard_threshold": 0.18,
    "strong_entity_overlap_threshold": 2,
 }
 STOP_ENTITIES = {
    "AI",
    "API",
    "CLI",
    "LLM",
    "Open Source",
    "GitHub",
    "Google",
    "OpenAI",
    "Anthropic",
    "Microsoft",
    "Meta",
    "Amazon",
    "NVIDIA",
 }
 def _config_value(config: dict[str, Any], name: str):
    return (config or {}).get(name, DEFAULT_CONFIG[name])
 def _text_tokens(value: str) -> set[str]:
    return _title_tokens(value)
 def _entity_tokens(value: str) -> set[str]:
    text = value or ""
    entities = set(re.findall(r"\b[A-Z][A-Za-z0-9]*(?:[- ][A-Z0-9][A-Za-z0-9]*)*\b", text))
    entities.update(re.findall(r"[\u4e00-\u9fffA-Za-z0-9]*[A-Za-z]+[0-9]+[A-Za-z0-9-]*", text))
    cleaned = {entity.strip() for entity in entities if len(entity.strip()) >= 3}
    return {entity for entity in cleaned if entity not in STOP_ENTITIES}
 def _pair_key(item_ids: list[str]) -> frozenset[str]:
    return frozenset(item_ids)
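 def _candidate_score(left: NewsItem, right: NewsItem, config: dict[str, Any]) -> tuple[float, str, dict[str, Any]] | None:
    """Score a pair as a possible duplicate; return (score, reason, evidence) or None.

    Rules are tried in order: strong entity overlap, then title similarity,
    then combined title/summary Jaccard. The first rule that matches wins.
    """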
    title_ratio = difflib.SequenceMatcher(None, left.title_norm, right.title_norm).ratio()
    title_jaccard = _jaccard_similarity(_text_tokens(left.title_norm), _text_tokens(right.title_norm))
    summary_jaccard = _jaccard_similarity(_text_tokens(left.summary_raw), _text_tokens(right.summary_raw))
    left_entities = _entity_tokens(f"{left.title_raw} {left.summary_raw}")
    right_entities = _entity_tokens(f"{right.title_raw} {right.summary_raw}")
    shared_entities = sorted(left_entities & right_entities)
    strong_entity_threshold = int(_config_value(config, "strong_entity_overlap_threshold"))
    if len(shared_entities) >= strong_entity_threshold and summary_jaccard > 0:
        score = min(1.0, 0.55 + len(shared_entities) * 0.1 + summary_jaccard * 0.35)
        return score, "strong_entity_overlap", {
            "shared_entities": shared_entities,
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }
    if title_ratio >= float(_config_value(config, "title_similarity_threshold")) and (
        title_jaccard >= float(_config_value(config, "title_jaccard_threshold"))
        or summary_jaccard >= float(_config_value(config, "summary_jaccard_threshold")) * 2
        or shared_entities
    ):
        return title_ratio, "title_similarity", {
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }
    if (
        title_jaccard >= float(_config_value(config, "title_jaccard_threshold"))
        and summary_jaccard >= float(_config_value(config, "summary_jaccard_threshold"))
    ):
        score = (title_jaccard + summary_jaccard) / 2
        return score, "title_summary_jaccard", {
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }
    return None
 def recall_semantic_candidates(
    items: list[NewsItem],
    *,
    existing_candidates: list[dict[str, Any]] | None = None,
    config: dict[str, Any] | None = None,
 ) -> tuple[list[dict[str, Any]], dict[str, Any]]:
    config = {**DEFAULT_CONFIG, **(config or {})}
    existing_candidates = list(existing_candidates or [])
    if not bool(config.get("enabled", True)):
        return existing_candidates, {
            "enabled": False,
            "input_count": len(items),
            "existing_candidate_group_count": len(existing_candidates),
            "added_candidate_group_count": 0,
            "candidate_group_count": len(existing_candidates),
            "candidates": existing_candidates,
        }
    existing_keys = {_pair_key(list(candidate.get("item_ids", []) or [])) for candidate in existing_candidates}
    pair_counts: defaultdict[str, int] = defaultdict(int)
    recalled: list[dict[str, Any]] = []
    for index, left in enumerate(items):
        for right in items[index + 1 :]:
            if pair_counts[left.id] >= int(config["max_pairs_per_item"]):
                continue
            if pair_counts[right.id] >= int(config["max_pairs_per_item"]):
                continue
            key = frozenset({left.id, right.id})
            if key in existing_keys:
                continue
            scored = _candidate_score(left, right, config)
            if scored is None:
                continue
            score, reason, evidence = scored
            recalled.append(
                {
                    "item_ids": [left.id, right.id],
                    "reason": reason,
                    "score": round(score, 3),
                    "confidence": "medium",
                    **evidence,
                }
            )
            pair_counts[left.id] += 1
            pair_counts[right.id] += 1
            if len(recalled) >= int(config["max_pairs"]):
                break
        if len(recalled) >= int(config["max_pairs"]):
            break
    candidates = existing_candidates + recalled
    report = {
        "enabled": True,
        "input_count": len(items),
        "existing_candidate_group_count": len(existing_candidates),
        "added_candidate_group_count": len(recalled),
        "candidate_group_count": len(candidates),
        "candidates": candidates,
    }
    return candidates, report
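The two title signals used by `_candidate_score` can be illustrated in isolation. This is a sketch under stated assumptions: `_title_tokens` and `_jaccard_similarity` live in `dedupe.py` and are not shown in this diff, so `tokens` and `jaccard` below are hypothetical stand-ins (lowercase whitespace split, and 0.0 for two empty sets), and the sample titles are made up.

```python
import difflib


def tokens(text: str) -> set[str]:
    """Hypothetical tokenizer: lowercase whitespace split (stand-in for `_title_tokens`)."""
    return set(text.lower().split())


def jaccard(left: set[str], right: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; assumed to be 0.0 when both sets are empty."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def title_signals(left_title: str, right_title: str) -> dict[str, float]:
    # Character-level similarity, as computed with difflib in candidate_recall.py.
    ratio = difflib.SequenceMatcher(None, left_title.lower(), right_title.lower()).ratio()
    return {
        "title_similarity": round(ratio, 3),
        "title_jaccard": round(jaccard(tokens(left_title), tokens(right_title)), 3),
    }


signals = title_signals("OpenAI releases GPT-5 API", "GPT-5 API released by OpenAI")
print(signals)
```

With this tokenizer the two titles share 3 of 6 distinct tokens, so `title_jaccard` is 0.5, above the default `title_jaccard_threshold` of 0.25. Whether the pair is recalled by the real code also depends on the other thresholds and on the actual tokenizer.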
--- a/ai_daily_report/classify.py
+++ b/ai_daily_report/classify.py
@@ -0,0 +1,118 @@
 from __future__ import annotations
 from collections import Counter
 from typing import Any
 from .models import NewsItem
 SECTION_ORDER = [
    "模型与能力",
    "产品与应用",
    "开发与基础设施",
    "公司与资本",
    "政策与安全",
    "论文与研究",
    "观点与教程",
    "人物与动态",
 ]
 SECTION_ALIASES = {
    "模型发布/更新": "模型与能力",
    "产品发布/更新": "产品与应用",
    "产品与工具": "产品与应用",
    "开发与工程": "开发与基础设施",
    "行业动态": "公司与资本",
    "行业与公司": "公司与资本",
    "论文研究": "论文与研究",
    "论文与研究": "论文与研究",
    "技巧与观点": "观点与教程",
    "观点与教程": "观点与教程",
    "人物与花絮": "人物与动态",
 }
 RULES = [
    ("政策与安全", ("监管", "政策", "安全", "风险", "滥用", "攻击", "合规", "版权")),
    ("论文与研究", ("论文", "研究", "arxiv", "cvpr", "benchmark", "评测", "实验")),
    ("开发与基础设施", ("sdk", "api", "mcp", "kubernetes", "框架", "开源", "github", "部署", "基础设施")),
    ("公司与资本", ("融资", "ipo", "上市", "招股书", "合作", "估值", "收购", "资本")),
    ("模型与能力", ("模型", "gpt", "claude", "gemini", "grok", "token", "参数", "多模态", "语音", "推理")),
    ("产品与应用", ("agent", "应用", "产品", "平台", "上线", "工具", "智能体")),
    ("观点与教程", ("教程", "观点", "方法论", "guide", "实践", "技巧")),
    ("人物与动态", ("黄仁勋", "纳德拉", "访谈", "演讲", "人物")),
 ]
 def normalize_section_hint(section_hint: str) -> str:
    hint = (section_hint or "").strip()
    if hint in SECTION_ORDER:
        return hint
    return SECTION_ALIASES.get(hint, "")
 def rule_classify(item: NewsItem) -> str:
    text = f"{item.title or item.title_raw} {item.summary or item.summary_raw}".lower()
    for section, keywords in RULES:
        if any(keyword.lower() in text for keyword in keywords):
            return section
    return "公司与资本"
 def rank_score(item: NewsItem) -> int:
    text = f"{item.title or item.title_raw} {item.summary or item.summary_raw}"
    score = max(0, 200 - item.source_priority)
    if item.source_role == "primary":
        score += 10
    if item.canonical_url:
        score += 10
    if any(ch.isdigit() for ch in text):
        score += 10
    if item.duplicate_sources:
        score += min(20, len(item.duplicate_sources) * 5)
    score -= len(item.quality_flags) * 10
    return score
 def classify_and_order_items(items: list[NewsItem]) -> tuple[list[NewsItem], dict[str, Any]]:
    llm_classified = 0
    hint_classified = 0
    rule_classified = 0
    invalid_llm_section_count = 0
    for item in items:
        if item.section:
            if item.section in SECTION_ORDER:
                llm_classified += 1
                continue
            invalid_llm_section_count += 1
        mapped = normalize_section_hint(item.section_hint)
        if mapped:
            item.section = mapped
            hint_classified += 1
        else:
            item.section = rule_classify(item)
            rule_classified += 1
    section_index = {section: index for index, section in enumerate(SECTION_ORDER)}
    ordered = sorted(
        items,
        key=lambda item: (
            section_index.get(item.section or "", len(SECTION_ORDER)),
            -rank_score(item),
            item.title or item.title_raw,
        ),
    )
    section_counts = Counter(item.section for item in ordered if item.section)
    report = {
        "input_count": len(items),
        "section_counts": dict(section_counts),
        "hint_classified": hint_classified,
        "rule_classified": rule_classified,
        "llm_classified": llm_classified,
        "fallback_classified": hint_classified + rule_classified,
        "invalid_llm_section_count": invalid_llm_section_count,
        "invalid_section_count": sum(1 for item in ordered if item.section not in SECTION_ORDER),
    }
    return ordered, report
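Because `rule_classify` returns on the first matching rule, the order of `RULES` decides which section wins when a headline matches several keyword sets. A minimal sketch of that precedence, using a shortened copy of the rule table; `first_match` and the sample headlines are hypothetical.

```python
# Shortened copy of RULES from classify.py, kept in the same relative order.
RULES = [
    ("政策与安全", ("监管", "政策", "安全", "风险")),
    ("模型与能力", ("模型", "gpt", "claude", "gemini")),
    ("产品与应用", ("agent", "应用", "产品")),
]
DEFAULT_SECTION = "公司与资本"  # same fallback section as rule_classify


def first_match(text: str) -> str:
    """Return the section of the first rule whose keyword occurs in the text."""
    lowered = text.lower()
    for section, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return section
    return DEFAULT_SECTION


# The headline mentions both "模型" and "安全"; the earlier rule takes it.
print(first_match("新模型发布引发安全风险讨论"))
```

Here the headline lands in "政策与安全" even though it also contains a model keyword, which is worth keeping in mind when reordering or extending the table.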
--- a/ai_daily_report/cli.py
+++ b/ai_daily_report/cli.py
@@ -0,0 +1,50 @@
 from __future__ import annotations
 import argparse
 from pathlib import Path
 from .audit import render_markdown, summarize_reports
 from .runner import run_daily_report
 def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ai-daily-report")
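    # Require a subcommand so a bare invocation reports a usage error instead of exiting silently.
    subcommands = parser.add_subparsers(dest="command", required=True)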
    run = subcommands.add_parser("run")
    run.add_argument("--date", default="today")
    run.add_argument("--mode", choices=["dry-run", "draft", "publish"], default="dry-run")
    run.add_argument("--source-mode", choices=["mock", "live"], default="mock")
    run.add_argument("--llm-mode", choices=["mock", "live"], default="mock")
    run.add_argument("--out-dir", default="runs")
    run.add_argument("--base-url", default="https://blog.ephron.ren")
    run.add_argument("--sources-path", default=None)
    run.add_argument("--pipeline-path", default=None)
    run.add_argument("--history-path", default=None)
    audit = subcommands.add_parser("audit")
    audit.add_argument("--out-dir", default=str(Path.home() / ".hermes" / "scripts" / "ai_morning_out"))
    audit.add_argument("--limit-days", type=int, default=7)
    return parser
 def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command == "run":
        run_daily_report(
            run_date=args.date,
            mode=args.mode,
            source_mode=args.source_mode,
            llm_mode=args.llm_mode,
            out_dir=Path(args.out_dir),
            base_url=args.base_url,
            sources_path=Path(args.sources_path) if args.sources_path else None,
            pipeline_path=Path(args.pipeline_path) if args.pipeline_path else None,
            history_path=Path(args.history_path) if args.history_path else None,
        )
    elif args.command == "audit":
        print(render_markdown(summarize_reports(Path(args.out_dir), limit_days=args.limit_days)))
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/ai_daily_report/clients.py
+++ b/ai_daily_report/clients.py
@@ -0,0 +1,164 @@
 from __future__ import annotations
 import json
 import socket
 import time
 from dataclasses import dataclass
 from urllib.error import HTTPError, URLError
 from urllib.parse import urlencode
 import urllib.request
 from typing import Any
 UA = "Mozilla/5.0 (compatible; ai-daily-report/1.0)"
 @dataclass
 class FetchTextError(Exception):
    error_type: str
    message: str
    http_status: int | None = None
    attempts: int = 1
    def __str__(self) -> str:
        return self.message
 def _classify_fetch_exception(exc: Exception) -> tuple[str, int | None, bool]:
    """Map a fetch exception to (error_type, http_status, retryable)."""
    if isinstance(exc, HTTPError):
        if exc.code == 404:
            return "http_404", exc.code, False
        if exc.code in {429, 500, 502, 503, 504}:
            return f"http_{exc.code}", exc.code, True
        return f"http_{exc.code}", exc.code, False
    if isinstance(exc, TimeoutError | socket.timeout):
        return "timeout", None, True
    if isinstance(exc, URLError):
        reason = exc.reason
        if isinstance(reason, TimeoutError | socket.timeout):
            return "timeout", None, True
        return "network_error", None, True
    return "fetch_error", None, False
 def fetch_text(
    url: str,
    timeout_seconds: int,
    *,
    retries: int = 0,
    backoff_seconds: float = 0.5,
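 ) -> str:
    """Fetch ``url`` and return the body decoded as UTF-8 (undecodable bytes are dropped).

    Retryable failures (timeouts, network errors, HTTP 429/500/502/503/504) are
    retried up to ``retries`` times with exponential backoff; any other failure
    raises immediately. The raised ``FetchTextError`` records the attempt count.
    """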
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    attempts = max(1, retries + 1)
    last_error: FetchTextError | None = None
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
                return response.read().decode("utf-8", "ignore")
        except Exception as exc:
            error_type, http_status, retryable = _classify_fetch_exception(exc)
            last_error = FetchTextError(
                error_type=error_type,
                message=f"{type(exc).__name__}: {exc}",
                http_status=http_status,
                attempts=attempt,
            )
            if not retryable or attempt >= attempts:
                raise last_error from exc
            if backoff_seconds > 0:
                time.sleep(backoff_seconds * (2 ** (attempt - 1)))
    raise last_error or FetchTextError("fetch_error", "unknown fetch error", attempts=attempts)
 class OpenAICompatibleClient:
    def __init__(self, *, api_key: str, base_url: str, model: str, timeout_seconds: int = 600):
        self.api_key = api_key
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.timeout_seconds = timeout_seconds
    def chat(self, prompt: str) -> str:
        payload = json.dumps(
            {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.2,
                "max_tokens": 8000,
            },
            ensure_ascii=False,
        ).encode("utf-8")
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=payload,
            headers={"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=self.timeout_seconds) as response:
            data = json.loads(response.read().decode("utf-8"))
        return data["choices"][0]["message"]["content"].strip()
 class BlogApiClient:
    def __init__(self, *, base_url: str, token: str, timeout_seconds: int = 25):
        self.base_url = base_url.rstrip("/")
        self.token = token
        self.timeout_seconds = timeout_seconds
    def _request(self, method: str, path: str, payload: dict[str, Any] | None = None) -> dict[str, Any]:
        data = None
        headers = {"Authorization": f"Bearer {self.token}", "User-Agent": UA}
        if payload is not None:
            data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
            headers["Content-Type"] = "application/json"
        req = urllib.request.Request(f"{self.base_url}{path}", data=data, headers=headers, method=method)
        with urllib.request.urlopen(req, timeout=self.timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))
    def create_post(self, payload: dict[str, Any]) -> dict[str, Any]:
        return self._request("POST", "/api/service/posts", payload)
    def _normalize_post_response(self, value: Any, slug: str) -> dict[str, Any] | None:
        if isinstance(value, dict):
            if isinstance(value.get("post"), dict):
                value = value["post"]
            elif isinstance(value.get("data"), dict):
                value = value["data"]
            elif isinstance(value.get("items"), list):
                for item in value["items"]:
                    if isinstance(item, dict) and item.get("slug") == slug:
                        return item
                return None
            if value.get("slug") == slug or value.get("id") or value.get("content") or value.get("markdown"):
                return value
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and item.get("slug") == slug:
                    return item
        return None
    def _request_optional(self, method: str, path: str, payload: dict[str, Any] | None = None) -> dict[str, Any] | list[Any] | None:
        try:
            return self._request(method, path, payload)
        except HTTPError as exc:
            if exc.code in {403, 404}:
                return None
            raise
        except FetchTextError as exc:
            if exc.error_type in {"http_403", "http_404"}:
                return None
            raise
    def get_post_by_slug(self, slug: str) -> dict[str, Any] | None:
        paths = [
            f"/api/service/posts/{slug}",
            f"/api/service/posts?{urlencode({'slug': slug})}",
            f"/api/service/posts/slug/{slug}",
        ]
        for path in paths:
            value = self._request_optional("GET", path)
            post = self._normalize_post_response(value, slug)
            if post is not None:
                return post
        return None
    def publish_post(self, slug: str) -> None:
        self._request("POST", f"/api/service/posts/{slug}/publish")
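The retry loop in `fetch_text` sleeps `backoff_seconds * 2 ** (attempt - 1)` after each failed retryable attempt except the last. The worst-case wait added by backoff can be worked out with a small sketch; `backoff_delays` is a hypothetical helper, not part of `clients.py`, and it ignores the time spent in the requests themselves.

```python
def backoff_delays(retries: int, backoff_seconds: float = 0.5) -> list[float]:
    """Sleep durations between attempts when every attempt fails with a retryable error."""
    attempts = max(1, retries + 1)
    # No sleep follows the final attempt, so the range stops before `attempts`.
    return [backoff_seconds * (2 ** (attempt - 1)) for attempt in range(1, attempts)]


delays = backoff_delays(retries=3)
print(delays)       # [0.5, 1.0, 2.0]
print(sum(delays))  # 3.5
```

With `retries=3` and the default `backoff_seconds`, a source that keeps timing out adds 3.5 seconds of sleep on top of four request timeouts, which is useful when sizing `timeout_seconds` per source.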
--- a/ai_daily_report/collect.py
+++ b/ai_daily_report/collect.py
@@ -0,0 +1,114 @@
 from __future__ import annotations
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from datetime import datetime, timezone
 from time import perf_counter
 from typing import Callable, Iterable, Any
 from .clients import FetchTextError
 from .models import SourceConfig, SourceResult
 Fetcher = Callable[[SourceConfig, str], list[dict[str, Any]]]
 def _status_from_exception(exc: Exception) -> str:
    if isinstance(exc, FetchTextError):
        return exc.error_type
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "error"
 def _retry_count_from_exception(exc: Exception) -> int:
    if isinstance(exc, FetchTextError):
        return max(0, exc.attempts - 1)
    return 0
 def _collect_one(config: SourceConfig, run_date: str, fetcher: Fetcher) -> SourceResult:
    fetched_at = datetime.now(timezone.utc).isoformat()
    if not config.enabled:
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=False,
            status="disabled",
            fetched_at=fetched_at,
            error=f"failure_policy={config.failure_policy}; min_items={config.min_items}",
        )
    started = perf_counter()
    try:
        items = fetcher(config, run_date)
        elapsed_ms = int((perf_counter() - started) * 1000)
        status = "ok" if items else "empty"
        if status == "ok" and config.min_items and len(items) < config.min_items:
            status = "below_min_items"
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=status == "ok",
            status=status,
            items=items,
            error=None if status == "ok" else f"items={len(items)}; min_items={config.min_items}; failure_policy={config.failure_policy}",
            elapsed_ms=elapsed_ms,
            fetched_at=fetched_at,
        )
    except Exception as exc:
        elapsed_ms = int((perf_counter() - started) * 1000)
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=False,
            status=_status_from_exception(exc),
            error=f"{type(exc).__name__}: {exc}; failure_policy={config.failure_policy}; min_items={config.min_items}",
            elapsed_ms=elapsed_ms,
            retry_count=_retry_count_from_exception(exc),
            fetched_at=fetched_at,
        )
 def collect_sources(
    configs: Iterable[SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    max_workers: int | None = None,
 ) -> tuple[list[SourceResult], dict[str, Any]]:
    ordered_configs = list(configs)
    if not ordered_configs:
        return [], {
            "input_source_count": 0,
            "ok_source_count": 0,
            "failed_source_count": 0,
            "raw_item_count": 0,
            "source_counts": {},
            "statuses": {},
            "error_types": {},
        }
    workers = max_workers or min(8, len(ordered_configs))
    result_by_name: dict[str, SourceResult] = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {
            executor.submit(_collect_one, config, run_date, fetcher): config
            for config in ordered_configs
        }
        for future in as_completed(futures):
            config = futures[future]
            result_by_name[config.name] = future.result()
    results = [result_by_name[config.name] for config in ordered_configs]
    report = {
        "input_source_count": len(results),
        "ok_source_count": sum(1 for result in results if result.ok),
        "failed_source_count": sum(1 for result in results if not result.ok),
        "raw_item_count": sum(len(result.items) for result in results),
        "source_counts": {result.source: len(result.items) for result in results},
        "statuses": {result.source: result.status for result in results},
        "error_types": {
            result.source: result.status
            for result in results
            if not result.ok and result.status != "disabled"
        },
    }
    return results, report
--- a/ai_daily_report/config.py
+++ b/ai_daily_report/config.py
@@ -0,0 +1,28 @@
 from __future__ import annotations
 import json
 from pathlib import Path
 from typing import Any
 from .models import SourceConfig
 from .pipeline import _source_config_from_dict
 def load_json(path: Path) -> Any:
    return json.loads(path.read_text(encoding="utf-8"))
 def load_source_configs(path: Path) -> list[SourceConfig]:
    raw = load_json(path)
    if not isinstance(raw, list):
        raise ValueError("sources config must be a list")
    return [_source_config_from_dict(item) for item in raw]
 def load_pipeline_config(path: Path) -> dict[str, Any]:
    if not path.exists():
        return {}
    raw = load_json(path)
    if not isinstance(raw, dict):
        raise ValueError("pipeline config must be an object")
    return raw
--- a/ai_daily_report/dedupe.py
+++ b/ai_daily_report/dedupe.py
@@ -0,0 +1,182 @@
 from __future__ import annotations
 import difflib
 import re
 from datetime import date, datetime
 from typing import Any
 from .models import NewsItem, PublishedUrlEntry, PublishedUrls
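 # A pair is flagged as a possible duplicate when the SequenceMatcher ratio of the
 # normalized titles reaches TITLE_SIMILARITY_THRESHOLD, or when that ratio reaches
 # TOKEN_EDIT_DISTANCE_THRESHOLD and the token Jaccard reaches TOKEN_JACCARD_THRESHOLD.
 # Despite its name, TOKEN_EDIT_DISTANCE_THRESHOLD is compared against the same
 # SequenceMatcher ratio, not against an edit distance.
 TITLE_SIMILARITY_THRESHOLD = 0.50
 TOKEN_JACCARD_THRESHOLD = 0.40
 TOKEN_EDIT_DISTANCE_THRESHOLD = 0.40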
 def _item_score(item: NewsItem) -> int:
    score = 0
    # Lower source_priority values rank higher: a priority of 0 earns the full 200 points.
    score += max(0, 200 - item.source_priority)
    if item.canonical_url:
        score += 20
    if item.summary_raw:
        score += min(40, len(item.summary_raw))
    if item.section_hint:
        score += 10
    if item.source_role == "primary":
        score += 10
    score -= len(item.quality_flags) * 10
    return score
 def _merge_group(group: list[NewsItem], reason: str) -> tuple[NewsItem, list[NewsItem], dict[str, Any]]:
    keep = max(group, key=_item_score)
    removed = [item for item in group if item is not keep]
    for removed_item in removed:
        keep.duplicate_sources.append(
            {
                "id": removed_item.id,
                "source_group": removed_item.source_group,
                "source_label": removed_item.source_label,
                "url": removed_item.url,
                "reason": reason,
            }
        )
    report_group = {
        "reason": reason,
        "keep_id": keep.id,
        "removed_ids": [item.id for item in removed],
        "confidence": "high",
    }
    return keep, removed, report_group
 def _group_by_key(items: list[NewsItem], key_name: str) -> dict[str, list[NewsItem]]:
    groups: dict[str, list[NewsItem]] = {}
    for item in items:
        key = getattr(item, key_name)
        if key:
            groups.setdefault(key, []).append(item)
    return {key: group for key, group in groups.items() if len(group) > 1}
 def _title_tokens(value: str) -> set[str]:
    if not value:
        return set()
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", value.lower()))
 def _jaccard_similarity(left: set[str], right: set[str]) -> float:
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)
 def _possible_duplicates(items: list[NewsItem]) -> list[dict[str, Any]]:
    possible: list[dict[str, Any]] = []
    for index, left in enumerate(items):
        for right in items[index + 1 :]:
            if not left.title_norm or not right.title_norm:
                continue
            ratio = difflib.SequenceMatcher(None, left.title_norm, right.title_norm).ratio()
            jaccard = _jaccard_similarity(_title_tokens(left.title_norm), _title_tokens(right.title_norm))
            if ratio >= TITLE_SIMILARITY_THRESHOLD or (
                ratio >= TOKEN_EDIT_DISTANCE_THRESHOLD and jaccard >= TOKEN_JACCARD_THRESHOLD
            ):
                possible.append(
                    {
                        "item_ids": [left.id, right.id],
                        "reason": "title_similarity",
                        "similarity": round(ratio, 3),
                        "token_jaccard": round(jaccard, 3),
                        "confidence": "medium",
                    }
                )
    return possible
 def hard_dedup_items(items: list[NewsItem]) -> tuple[list[NewsItem], dict[str, Any]]:
    remaining = list(items)
    removed_object_ids: set[int] = set()
    groups_report: list[dict[str, Any]] = []
    for key_name, reason in (
        ("canonical_url", "same_canonical_url"),
        ("title_norm", "same_title_norm"),
    ):
        grouped = _group_by_key([item for item in remaining if id(item) not in removed_object_ids], key_name)
        for group in grouped.values():
            active_group = [item for item in group if id(item) not in removed_object_ids]
            if len(active_group) < 2:
                continue
            keep, removed, report_group = _merge_group(active_group, reason)
            removed_object_ids.update(id(item) for item in removed)
            groups_report.append(report_group)
    deduped = [item for item in remaining if id(item) not in removed_object_ids]
    report = {
        "input_count": len(items),
        "output_count": len(deduped),
        "removed_count": len(removed_object_ids),
        "groups": groups_report,
        "possible_duplicates": _possible_duplicates(deduped),
    }
    return deduped, report
 def _parse_date(value: str | None) -> date | None:
    if not value:
        return None
    text = value.strip()
    try:
        return date.fromisoformat(text[:10])
    except ValueError:
        try:
            return datetime.fromisoformat(text).date()
        except ValueError:
            return None
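 def _entry_within_window(entry: PublishedUrlEntry, *, run_date: str, max_age_days: int) -> bool:
    # A negative max_age_days disables the window, so every history entry counts as a match.
    if max_age_days < 0:
        return True
    current = _parse_date(run_date)
    previous = _parse_date(entry.last_published) or _parse_date(entry.first_seen)
    # Unparseable dates are treated as inside the window, so the item is still deduplicated.
    if current is None or previous is None:
        return True
    return (current - previous).days <= max_age_days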
 def cross_day_dedup_items(
    items: list[NewsItem],
    published_urls: PublishedUrls | None,
    *,
    run_date: str,
    max_age_days: int = 7,
 ) -> tuple[list[NewsItem], dict[str, Any]]:
    history = published_urls or PublishedUrls()
    deduped: list[NewsItem] = []
    removed: list[dict[str, Any]] = []
    for item in items:
        entry = history.urls.get(item.canonical_url) if item.canonical_url else None
        if entry and _entry_within_window(entry, run_date=run_date, max_age_days=max_age_days):
            removed.append(
                {
                    "item_id": item.id,
                    "canonical_url": item.canonical_url,
                    "title": item.title or item.title_raw,
                    "first_seen": entry.first_seen,
                    "last_published": entry.last_published,
                }
            )
            continue
        deduped.append(item)
    report = {
        "input_count": len(items),
        "output_count": len(deduped),
        "removed_count": len(removed),
        "removed": removed,
        "max_age_days": max_age_days,
    }
    return deduped, report
--- a/ai_daily_report/env.py
+++ b/ai_daily_report/env.py
@@ -0,0 +1,143 @@
 from __future__ import annotations
 import os
 import json
 from pathlib import Path
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 def read_env_file(env_path: Path) -> dict[str, str]:
    env: dict[str, str] = {}
    if not env_path.exists():
        return env
    text = env_path.read_text(encoding="utf-8", errors="ignore")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
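 def load_env() -> dict[str, str]:
    # Later sources override earlier ones: project .env, then ~/.hermes/.env,
    # then non-empty process environment variables.
    env: dict[str, str] = {}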
    env.update(read_env_file(PROJECT_ROOT / ".env"))
    env.update(read_env_file(Path.home() / ".hermes" / ".env"))
    env.update({key: value for key, value in os.environ.items() if value})
    return env
 def first_env(env: dict[str, str], *names: str) -> str:
    for name in names:
        value = (env.get(name) or "").strip()
        if value:
            return value
    return ""
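 def _load_simple_yaml(path: Path) -> dict[str, object]:
    # Minimal YAML subset: nested mappings with scalar string values only.
    # Lists are not parsed and inline comments are not stripped.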
    if not path.exists():
        return {}
    root: dict[str, object] = {}
    stack: list[tuple[int, dict[str, object]]] = [(-1, root)]
    for raw_line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        if not raw_line.strip() or raw_line.lstrip().startswith("#") or ":" not in raw_line:
            continue
        indent = len(raw_line) - len(raw_line.lstrip(" "))
        key, value = raw_line.strip().split(":", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        while stack and indent <= stack[-1][0]:
            stack.pop()
        current = stack[-1][1]
        if value:
            current[key] = value
        else:
            child: dict[str, object] = {}
            current[key] = child
            stack.append((indent, child))
    return root
 def _env_with_hermes(env: dict[str, str], hermes_dir: Path) -> dict[str, str]:
    merged = dict(read_env_file(hermes_dir / ".env"))
    merged.update(env)
    return merged
 def _provider_env_names(provider: str) -> tuple[str, str, str]:
    prefix = provider.upper().replace("-", "_")
    return f"{prefix}_API_KEY", f"{prefix}_BASE_URL", f"{prefix}_MODEL"
 def _auth_json_key(env: dict[str, str], hermes_dir: Path, provider: str) -> str:
    auth_path = hermes_dir / "auth.json"
    if not auth_path.exists() or not provider:
        return ""
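    try:
        auth = json.loads(auth_path.read_text(encoding="utf-8"))
    except Exception:
        # Unreadable or malformed auth.json: behave as if no key is stored.
        return ""
    # Valid JSON that is not an object (for example a list) has no credential pool.
    if not isinstance(auth, dict):
        return ""
    pool = auth.get("credential_pool", {}) or {}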
    provider_keys = [provider, provider.replace("-", "_")]
    for key in provider_keys:
        creds = pool.get(key, []) or []
        if not creds:
            continue
        cred = creds[0]
        source = str(cred.get("source") or "")
        if source.startswith("env:"):
            resolved = first_env(env, source[4:])
            if resolved:
                return resolved
        token = str(cred.get("access_token") or "").strip()
        if token:
            return token
    return ""
 def resolve_llm_config(env: dict[str, str], *, hermes_dir: Path | None = None) -> dict[str, str]:
    hermes_dir = hermes_dir or Path.home() / ".hermes"
    env = _env_with_hermes(env, hermes_dir)
    hermes_config = _load_simple_yaml(hermes_dir / "config.yaml")
    model_config = hermes_config.get("model", {}) if isinstance(hermes_config.get("model"), dict) else {}
    provider = str(model_config.get("provider") or "").strip()
    provider_key, provider_base_url, provider_model = _provider_env_names(provider) if provider else ("", "", "")
    api_key = first_env(env, "LLM_API_KEY")
    base_url = first_env(env, "LLM_BASE_URL")
    model = first_env(env, "LLM_MODEL")
    if not api_key and provider:
        api_key = first_env(env, provider_key) or _auth_json_key(env, hermes_dir, provider)
    if not base_url and provider:
        base_url = first_env(env, provider_base_url) or str(model_config.get("base_url") or "").strip()
    if not model and provider:
        model = first_env(env, provider_model) or str(model_config.get("default") or "").strip()
    if not api_key:
        api_key = first_env(env, "SUB2API_API_KEY", "XIAOMI_API_KEY", "OPENROUTER_API_KEY")
    if not base_url:
        base_url = first_env(env, "SUB2API_BASE_URL", "XIAOMI_BASE_URL", "OPENROUTER_BASE_URL")
    if not model:
        model = first_env(env, "SUB2API_MODEL", "XIAOMI_MODEL")
    missing = [
        name
        for name, value in (
            ("LLM_API_KEY", api_key),
            ("LLM_BASE_URL", base_url),
            ("LLM_MODEL", model),
        )
        if not value
    ]
    if missing:
        raise ValueError("missing_llm_config: " + ",".join(missing))
    return {"api_key": api_key, "base_url": base_url, "model": model}
 def resolve_blog_token(env: dict[str, str]) -> str:
    return first_env(env, "BLOG_SERVICE_TOKEN", "EPHRON_SERVICE_TOKEN")
--- a/ai_daily_report/guide.py
+++ b/ai_daily_report/guide.py
@@ -0,0 +1,123 @@
 from __future__ import annotations
 import json
 import re
 from typing import Any, Callable
 from .llm import parse_json_object
 from .models import NewsItem
 GuideLlmCall = Callable[[str], str]
 def _clean_text(text: str, limit: int | None = None) -> str:
    value = re.sub(r"^\s*>\s*", "", text or "").strip()
    value = re.sub(r"\[\d+\]|\[N\]", "", value)
    value = re.sub(r"\s+", " ", value).strip()
    if limit and len(value) > limit:
        value = value[:limit].rstrip()
    return value
 def _build_prompt(items: list[NewsItem]) -> str:
    payload = {
        "task": (
            "Generate a concise Chinese AI daily report guide. Return JSON only. "
            "Do not use 强信号/中信号/待验证. Do not add facts. "
            "Write one opening intro, a short theme, 2-4 daily threads, and one closing conclusion. "
            "Every thread must reference existing item_ids."
        ),
        "items": [
            {
                "id": item.id,
                "title": item.title or item.title_raw,
                "summary": item.summary or item.summary_raw,
                "section": item.section,
                "source": item.source_label,
            }
            for item in items
        ],
        "output_schema": {
            "intro": "one opening paragraph under 160 Chinese characters",
            "theme": "one sentence under 120 Chinese characters",
            "threads": [
                {
                    "title": "thread title",
                    "text": "one or two sentences",
                    "item_ids": ["existing item id"],
                    "kind": "thread|uncertain",
                }
            ],
            "conclusion": "one closing paragraph under 180 Chinese characters",
        },
    }
    return json.dumps(payload, ensure_ascii=False)
 def _empty_guide() -> dict[str, Any]:
    return {"intro": "", "theme": "", "threads": [], "conclusion": ""}
 def generate_guide(
    items: list[NewsItem],
    *,
    llm_call: GuideLlmCall,
 ) -> tuple[dict[str, Any], dict[str, Any]]:
    if not items:
        return _empty_guide(), {
            "input_count": 0,
            "intro_present": False,
            "theme_present": False,
            "conclusion_present": False,
            "thread_count": 0,
            "dropped_thread_count": 0,
            "fallback_used": False,
            "errors": [],
        }
    try:
        obj = parse_json_object(llm_call(_build_prompt(items)))
    except Exception as exc:
        return _empty_guide(), {
            "input_count": len(items),
            "intro_present": False,
            "theme_present": False,
            "conclusion_present": False,
            "thread_count": 0,
            "dropped_thread_count": 0,
            "fallback_used": True,
            "errors": [f"{type(exc).__name__}: {exc}"],
        }
    valid_ids = {item.id for item in items}
    threads: list[dict[str, Any]] = []
    dropped = 0
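    for thread in obj.get("threads", []) or []:
        # Drop malformed entries instead of raising on non-dict threads or null item_ids.
        if not isinstance(thread, dict):
            dropped += 1
            continue
        item_ids = [item_id for item_id in (thread.get("item_ids") or []) if item_id in valid_ids]
        if not item_ids:
            dropped += 1
            continue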
        title = _clean_text(str(thread.get("title") or ""), limit=80)
        text = _clean_text(str(thread.get("text") or ""), limit=220)
        if not title or not text:
            dropped += 1
            continue
        kind = thread.get("kind") if thread.get("kind") in ("thread", "uncertain") else "thread"
        threads.append({"title": title, "text": text, "item_ids": item_ids, "kind": kind})
    intro = _clean_text(str(obj.get("intro") or ""), limit=160)
    theme = _clean_text(str(obj.get("theme") or ""), limit=120)
    conclusion = _clean_text(str(obj.get("conclusion") or ""), limit=180)
    guide = {"intro": intro, "theme": theme, "threads": threads, "conclusion": conclusion}
    report = {
        "input_count": len(items),
        "intro_present": bool(intro),
        "theme_present": bool(theme),
        "conclusion_present": bool(conclusion),
        "thread_count": len(threads),
        "dropped_thread_count": dropped,
        "fallback_used": False,
        "errors": [],
    }
    return guide, report
--- a/ai_daily_report/llm.py
+++ b/ai_daily_report/llm.py
@@ -0,0 +1,18 @@
 from __future__ import annotations
 import json
 import re
 from typing import Any, Callable
 LlmCall = Callable[[str], str]
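 def parse_json_object(text: str) -> dict[str, Any]:
    # Strip an optional Markdown code fence, then take the span from the first "{"
    # to a closing "}" at the end of the text; trailing prose after the JSON is rejected.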
    text = re.sub(r"^```(?:json)?\s*\n?", "", text.strip())
    text = re.sub(r"\n?```\s*$", "", text)
    match = re.search(r"\{.*\}\s*$", text, re.S)
    if not match:
        raise ValueError("LLM output does not contain a JSON object")
    return json.loads(match.group(0))
--- a/ai_daily_report/models.py
+++ b/ai_daily_report/models.py
@@ -0,0 +1,69 @@
 from __future__ import annotations
 from dataclasses import dataclass, field
 from typing import Any
 @dataclass(frozen=True)
 class SourceConfig:
    name: str
    type: str
    role: str = "supplement"
    priority: int = 100
    required: bool = False
    enabled: bool = True
    timeout_seconds: int = 25
    retries: int = 0
    min_items: int = 0
    url: str = ""
    max_item_age_days: int | None = None
    failure_policy: str = "warn"
 @dataclass
 class SourceResult:
    source: str
    role: str
    ok: bool
    status: str
    items: list[dict[str, Any]] = field(default_factory=list)
    error: str | None = None
    elapsed_ms: int = 0
    retry_count: int = 0
    fetched_at: str = ""
 @dataclass
 class NewsItem:
    id: str
    source_group: str
    source_label: str
    source_role: str
    source_priority: int
    title_raw: str
    title_norm: str
    summary_raw: str
    url: str
    canonical_url: str
    published_at: str | None = None
    collected_at: str = ""
    origin_type: str = ""
    section_hint: str = ""
    language_hint: str = ""
    title: str | None = None
    summary: str | None = None
    section: str | None = None
    quality_flags: list[str] = field(default_factory=list)
    duplicate_sources: list[dict[str, Any]] = field(default_factory=list)
 @dataclass
 class PublishedUrlEntry:
    first_seen: str
    last_published: str
    titles: list[str] = field(default_factory=list)
 @dataclass
 class PublishedUrls:
    version: int = 1
    urls: dict[str, PublishedUrlEntry] = field(default_factory=dict)
    updated_at: str = ""
--- a/ai_daily_report/normalize.py
+++ b/ai_daily_report/normalize.py
@@ -0,0 +1,132 @@
 from __future__ import annotations
 import hashlib
 import html
 import re
 import unicodedata
 from collections import Counter
 from datetime import datetime, timezone
 from typing import Any
 from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse
 from .models import NewsItem, SourceResult
 TRACKING_QUERY_PREFIXES = ("utm_",)
 TRACKING_QUERY_KEYS = {"fbclid", "gclid", "spm", "from", "ref"}
 def clean_text(value: str) -> str:
    text = html.unescape(value or "")
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
 def canonicalize_url(url: str) -> str:
    if not url:
        return ""
    parsed = urlparse(url.strip())
    scheme = (parsed.scheme or "https").lower()
    host = (parsed.netloc or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "twitter.com":
        host = "x.com"
    query = []
    for key, value in parse_qsl(parsed.query, keep_blank_values=True):
        key_lower = key.lower()
        if key_lower in TRACKING_QUERY_KEYS:
            continue
        if any(key_lower.startswith(prefix) for prefix in TRACKING_QUERY_PREFIXES):
            continue
        query.append((key, value))
    path = parsed.path or ""
    if len(path) > 1:
        path = path.rstrip("/")
    return urlunparse((scheme, host, path, "", urlencode(query), ""))
 def normalize_title(title: str) -> str:
    text = unicodedata.normalize("NFKC", title or "").lower()
    text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)
    return text
 def _item_id(canonical_url: str, source_group: str, title_norm: str, published_at: str | None) -> str:
    seed = canonical_url or "|".join([source_group, title_norm, published_at or ""])
    digest = hashlib.sha1(seed.encode("utf-8")).hexdigest()[:16]
    return f"item_{digest}"
 def _quality_flags(title: str, summary: str, url: str) -> list[str]:
    flags: list[str] = []
    if not url:
        flags.append("missing_url")
    if not summary:
        flags.append("missing_summary")
    if len(normalize_title(title)) < 3:
        flags.append("short_title")
    return flags
 def normalize_items(
    source_results: list[SourceResult],
    *,
    run_date: str,
    source_priorities: dict[str, int] | None = None,
 ) -> tuple[list[NewsItem], dict[str, Any]]:
    source_priorities = source_priorities or {}
    collected_at = datetime.now(timezone.utc).isoformat()
    items: list[NewsItem] = []
    flag_counts: Counter[str] = Counter()
    id_counts: Counter[str] = Counter()
    input_count = 0
    for source_result in source_results:
        for raw in source_result.items:
            input_count += 1
            title = clean_text(str(raw.get("title_raw") or raw.get("title") or ""))
            summary = clean_text(str(raw.get("summary_raw") or raw.get("summary") or ""))
            url = str(raw.get("url") or "").strip()
            canonical_url = canonicalize_url(url)
            title_norm = normalize_title(title)
            flags = _quality_flags(title, summary, canonical_url)
            flag_counts.update(flags)
            source_label = clean_text(str(raw.get("source_label") or source_result.source))
            published_at = raw.get("published_at")
            base_id = _item_id(canonical_url, source_result.source, title_norm, published_at)
            id_counts[base_id] += 1
            item_id = base_id if id_counts[base_id] == 1 else f"{base_id}_{id_counts[base_id]}"
            items.append(
                NewsItem(
                    id=item_id,
                    source_group=source_result.source,
                    source_label=source_label,
                    source_role=source_result.role,
                    source_priority=source_priorities.get(source_result.source, 100),
                    title_raw=title,
                    title_norm=title_norm,
                    summary_raw=summary,
                    url=url,
                    canonical_url=canonical_url,
                    published_at=published_at,
                    collected_at=collected_at,
                    origin_type=str(raw.get("origin_type") or ""),
                    section_hint=str(raw.get("section_hint") or ""),
                    language_hint=str(raw.get("language_hint") or ""),
                    quality_flags=flags,
                )
            )
    report = {
        "run_date": run_date,
        "input_count": input_count,
        "output_count": len(items),
        "quality_flag_counts": dict(flag_counts),
    }
    return items, report
--- a/ai_daily_report/observability.py
+++ b/ai_daily_report/observability.py
@@ -0,0 +1,54 @@
 from __future__ import annotations
 import hashlib
 from dataclasses import dataclass, field
 from typing import Any, Callable
 def sha256_text(value: str) -> str:
    return hashlib.sha256((value or "").encode("utf-8")).hexdigest()
 def truncate_text(value: str, limit: int = 500) -> str:
    text = value or ""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}…[truncated {len(text) - limit} chars]"
 @dataclass
 class LlmCallObserver:
    call: Callable[[str], str]
    stage: str
    records: list[dict[str, Any]] = field(default_factory=list)
    prompt_preview_chars: int = 500
    response_preview_chars: int = 500
    def __call__(self, prompt: str) -> str:
        response = self.call(prompt)
        self.records.append(
            {
                "stage": self.stage,
                "call_index": len(self.records) + 1,
                "prompt_hash": sha256_text(prompt),
                "response_hash": sha256_text(response),
                "prompt_chars": len(prompt or ""),
                "response_chars": len(response or ""),
                "prompt_preview": truncate_text(prompt, self.prompt_preview_chars),
                "response_preview": truncate_text(response, self.response_preview_chars),
            }
        )
        return response
 def summarize_observed_calls(observers: list[LlmCallObserver]) -> dict[str, Any]:
    records: list[dict[str, Any]] = []
    by_stage: dict[str, int] = {}
    for observer in observers:
        records.extend(observer.records)
        by_stage[observer.stage] = by_stage.get(observer.stage, 0) + len(observer.records)
    return {
        "total_calls": len(records),
        "by_stage": by_stage,
        "records": records,
    }
--- a/ai_daily_report/pipeline.py
+++ b/ai_daily_report/pipeline.py
@@ -0,0 +1,386 @@
 from __future__ import annotations
 from typing import Any
 from .assemble import assemble_markdown
 from .candidate_recall import recall_semantic_candidates
 from .classify import classify_and_order_items
 from .collect import Fetcher, collect_sources
 from .dedupe import cross_day_dedup_items, hard_dedup_items
 from .guide import GuideLlmCall, generate_guide
 from .models import PublishedUrls, SourceConfig
 from .normalize import normalize_items
 from .publish import BlogClient, publish_markdown
 from .quality_gate import evaluate_quality_gate
 from .rewrite import RewriteLlmCall, rewrite_items
 from .semantic_dedupe import SemanticLlmCall, semantic_dedup_items
 def _source_config_from_dict(value: dict[str, Any]) -> SourceConfig:
    max_item_age_days = value.get("max_item_age_days")
    return SourceConfig(
        name=value["name"],
        type=value["type"],
        role=value.get("role", "supplement"),
        priority=int(value.get("priority", 100)),
        required=bool(value.get("required", False)),
        enabled=bool(value.get("enabled", True)),
        timeout_seconds=int(value.get("timeout_seconds", 25)),
        retries=int(value.get("retries", 0)),
        min_items=int(value.get("min_items", 0)),
        url=value.get("url", ""),
        max_item_age_days=int(max_item_age_days) if max_item_age_days is not None else None,
        failure_policy=str(value.get("failure_policy") or ("block" if bool(value.get("required", False)) else "warn")),
    )
 def run_stage0_to_stage2(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
 ) -> dict[str, Any]:
    configs = [
        config if isinstance(config, SourceConfig) else _source_config_from_dict(config)
        for config in source_configs
    ]
    source_results, stage0_report = collect_sources(configs, run_date, fetcher=fetcher)
    source_priorities = {config.name: config.priority for config in configs}
    normalized_items, stage1_report = normalize_items(
        source_results,
        run_date=run_date,
        source_priorities=source_priorities,
    )
    deduped_items, stage2_report = hard_dedup_items(normalized_items)
    artifacts = {
        "stage0_sources": source_results,
        "stage1_items": normalized_items,
        "stage2_items": deduped_items,
    }
    return {
        "source_results": source_results,
        "items": deduped_items,
        "reports": {
            "stage0": stage0_report,
            "stage1": stage1_report,
            "stage2": stage2_report,
        },
        "artifacts": artifacts,
    }
 def run_stage0_to_stage2_5(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
 ) -> dict[str, Any]:
    stage2_result = run_stage0_to_stage2(source_configs, run_date, fetcher=fetcher)
    if cross_day_dedup_enabled:
        items, stage2_5_report = cross_day_dedup_items(
            stage2_result["items"],
            published_urls,
            run_date=run_date,
            max_age_days=cross_day_dedup_max_age_days,
        )
    else:
        items = stage2_result["items"]
        stage2_5_report = {
            "input_count": len(items),
            "output_count": len(items),
            "removed_count": 0,
            "removed": [],
            "enabled": False,
            "max_age_days": cross_day_dedup_max_age_days,
        }
    reports = dict(stage2_result["reports"])
    stage2_5_report.setdefault("enabled", cross_day_dedup_enabled)
    reports["stage2_5"] = stage2_5_report
    artifacts = dict(stage2_result.get("artifacts", {}))
    artifacts["stage2_5_items"] = items
    return {
        "source_results": stage2_result["source_results"],
        "items": items,
        "reports": reports,
        "artifacts": artifacts,
    }
 def run_stage0_to_stage4(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    # Note: quality_gate_config is accepted here but not used within this function.
    stage2_5_result = run_stage0_to_stage2_5(
        source_configs,
        run_date,
        fetcher=fetcher,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
    )
    items = stage2_5_result["items"]
    remaining_ids = {item.id for item in items}
    candidates = [
        candidate
        for candidate in stage2_5_result["reports"]["stage2"].get("possible_duplicates", [])
        if set(candidate.get("item_ids", [])).issubset(remaining_ids)
    ]
    candidates, stage2_8_report = recall_semantic_candidates(
        items,
        existing_candidates=candidates,
        config=semantic_candidate_recall_config,
    )
    semantic_items, stage3_report = semantic_dedup_items(
        items,
        candidates,
        llm_call=semantic_llm_call,
        max_deletion_ratio=semantic_dedup_max_deletion_ratio,
    )
    rewritten_items, stage4_report = rewrite_items(
        semantic_items,
        llm_call=rewrite_llm_call,
        batch_size=rewrite_batch_size,
    )
    reports = dict(stage2_5_result["reports"])
    reports["stage2_8"] = stage2_8_report
    reports["stage3"] = stage3_report
    reports["stage4"] = stage4_report
    artifacts = dict(stage2_5_result.get("artifacts", {}))
    artifacts["stage2_8_candidates"] = candidates
    artifacts["stage3_items"] = semantic_items
    artifacts["stage4_items"] = rewritten_items
    return {
        "source_results": stage2_5_result["source_results"],
        "items": rewritten_items,
        "reports": reports,
        "artifacts": artifacts,
    }
 def run_stage0_to_stage5(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    stage4_result = run_stage0_to_stage4(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    classified_items, stage5_report = classify_and_order_items(stage4_result["items"])
    reports = dict(stage4_result["reports"])
    reports["stage5"] = stage5_report
    return {
        "source_results": stage4_result["source_results"],
        "items": classified_items,
        "reports": reports,
        "artifacts": stage4_result.get("artifacts", {}),
    }
 def run_stage0_to_stage6(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    stage5_result = run_stage0_to_stage5(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    guide, stage6_report = generate_guide(stage5_result["items"], llm_call=guide_llm_call)
    reports = dict(stage5_result["reports"])
    reports["stage6"] = stage6_report
    return {
        "source_results": stage5_result["source_results"],
        "items": stage5_result["items"],
        "guide": guide,
        "reports": reports,
        "artifacts": stage5_result.get("artifacts", {}),
    }
 def run_stage0_to_stage7(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    stage6_result = run_stage0_to_stage6(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    markdown, stage7_report = assemble_markdown(stage6_result["items"], stage6_result["guide"])
    upstream_blocking_errors: list[str] = []
    for stage_name in ("stage3", "stage4", "stage5", "stage6"):
        for error in stage6_result["reports"].get(stage_name, {}).get("blocking_errors", []) or []:
            upstream_blocking_errors.append(str(error))
    if upstream_blocking_errors:
        existing_errors = list(stage7_report.get("blocking_errors", []) or [])
        stage7_report["blocking_errors"] = existing_errors + upstream_blocking_errors
    reports = dict(stage6_result["reports"])
    quality_gate_report = evaluate_quality_gate(
        stage6_result["items"],
        source_results=stage6_result["source_results"],
        reports=reports,
        config=quality_gate_config,
    )
    if quality_gate_report.get("blocking_errors"):
        existing_errors = list(stage7_report.get("blocking_errors", []) or [])
        stage7_report["blocking_errors"] = existing_errors + list(quality_gate_report["blocking_errors"])
    reports["quality_gate"] = quality_gate_report
    reports["stage7"] = stage7_report
    artifacts = dict(stage6_result.get("artifacts", {}))
    artifacts["quality_gate"] = quality_gate_report
    return {
        "source_results": stage6_result["source_results"],
        "items": stage6_result["items"],
        "guide": stage6_result["guide"],
        "markdown": markdown,
        "reports": reports,
        "artifacts": artifacts,
    }
 def run_stage0_to_stage8(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    mode: str,
    base_url: str,
    client: BlogClient | None,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
    publish_idempotency_config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    stage7_result = run_stage0_to_stage7(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
        quality_gate_config=quality_gate_config,
    )
    slug = f"ai-{run_date}"
    effective_mode = mode
    quality_gate_report = stage7_result["reports"].get("quality_gate", {}) or {}
    required_policy = str(quality_gate_report.get("required_source_failure_policy") or "block")
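    if quality_gate_report.get("required_source_failures") and required_policy in {"draft", "dry_run"}:
        if required_policy == "dry_run":
            effective_mode = "dry-run"
        elif mode != "dry-run":
            # Downgrade only: a requested dry-run must never be escalated into a real draft.
            effective_mode = "draft"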
    publish_result = publish_markdown(
        title=f"AI日报 · {run_date}",
        markdown=stage7_result["markdown"],
        tags=["AI日报", "AI资讯", "人工智能"],
        slug=slug,
        base_url=base_url,
        mode=effective_mode,
        markdown_report=stage7_result["reports"]["stage7"],
        client=client,
        idempotency_config=publish_idempotency_config,
    )
    reports = dict(stage7_result["reports"])
    reports["stage8"] = {
        "requested_mode": mode,
        "mode": publish_result.mode,
        "status": publish_result.status,
        "slug": publish_result.slug,
        "blog_url": publish_result.blog_url,
        "public_ok": publish_result.public_ok,
        "error": publish_result.error,
    }
    return {
        "source_results": stage7_result["source_results"],
        "items": stage7_result["items"],
        "guide": stage7_result["guide"],
        "markdown": stage7_result["markdown"],
        "publish": publish_result,
        "reports": reports,
        "artifacts": stage7_result.get("artifacts", {}),
    }
--- a/ai_daily_report/publish.py
+++ b/ai_daily_report/publish.py
@@ -0,0 +1,261 @@
 from __future__ import annotations
 import json
 import hashlib
 from dataclasses import dataclass
 from datetime import date, datetime, timezone
 from pathlib import Path
 from typing import Any, Protocol
 from .models import NewsItem, PublishedUrlEntry, PublishedUrls
@dataclass
 class PublishResult:
    mode: str
    status: str
    slug: str
    blog_url: str
    public_ok: bool = False
    error: str | None = None
 class BlogClient(Protocol):
    def get_post_by_slug(self, slug: str) -> dict[str, Any] | None:
        ...
    def create_post(self, payload: dict[str, Any]) -> dict[str, Any]:
        ...
    def publish_post(self, slug: str) -> None:
        ...
 def _parse_date(value: str | None) -> date | None:
    if not value:
        return None
    text = value.strip()
    try:
        return date.fromisoformat(text[:10])
    except ValueError:
        try:
            return datetime.fromisoformat(text).date()
        except ValueError:
            return None
 def _published_entry_from_dict(value: Any) -> PublishedUrlEntry | None:
    if not isinstance(value, dict):
        return None
    first_seen = str(value.get("first_seen") or "")
    last_published = str(value.get("last_published") or first_seen)
    titles = [str(title) for title in value.get("titles", []) or [] if str(title)]
    if not first_seen and not last_published:
        return None
    return PublishedUrlEntry(
        first_seen=first_seen or last_published,
        last_published=last_published or first_seen,
        titles=titles,
    )
 def load_published_urls(path: Path) -> PublishedUrls:
    if not path.exists():
        return PublishedUrls()
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except Exception:
        return PublishedUrls()
    if not isinstance(raw, dict):
        return PublishedUrls()
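    urls: dict[str, PublishedUrlEntry] = {}
    raw_urls = raw.get("urls")
    # Tolerate a malformed history file: treat a non-dict "urls" value as empty.
    for canonical_url, value in (raw_urls if isinstance(raw_urls, dict) else {}).items():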
        if not canonical_url:
            continue
        entry = _published_entry_from_dict(value)
        if entry is not None:
            urls[str(canonical_url)] = entry
    return PublishedUrls(
        version=int(raw.get("version") or 1),
        urls=urls,
        updated_at=str(raw.get("updated_at") or ""),
    )
 def _entry_within_window(entry: PublishedUrlEntry, *, run_date: str, max_age_days: int) -> bool:
    if max_age_days < 0:
        return True
    current = _parse_date(run_date)
    previous = _parse_date(entry.last_published) or _parse_date(entry.first_seen)
    if current is None or previous is None:
        return True
    return (current - previous).days <= max_age_days
 def _published_urls_to_dict(history: PublishedUrls) -> dict[str, Any]:
    return {
        "version": history.version,
        "urls": {
            canonical_url: {
                "first_seen": entry.first_seen,
                "last_published": entry.last_published,
                "titles": entry.titles,
            }
            for canonical_url, entry in sorted(history.urls.items())
        },
        "updated_at": history.updated_at,
    }
 def update_published_urls(
    path: Path,
    items: list[NewsItem],
    *,
    run_date: str,
    max_age_days: int = 7,
 ) -> PublishedUrls:
    history = load_published_urls(path)
    history.urls = {
        canonical_url: entry
        for canonical_url, entry in history.urls.items()
        if _entry_within_window(entry, run_date=run_date, max_age_days=max_age_days)
    }
    for item in items:
        if not item.canonical_url:
            continue
        title = item.title or item.title_raw
        entry = history.urls.get(item.canonical_url)
        if entry is None:
            entry = PublishedUrlEntry(
                first_seen=run_date,
                last_published=run_date,
                titles=[],
            )
            history.urls[item.canonical_url] = entry
        entry.last_published = run_date
        if title and title not in entry.titles:
            entry.titles.append(title)
    history.updated_at = datetime.now(timezone.utc).isoformat()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(_published_urls_to_dict(history), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return history
 def dry_run_publish(slug: str, base_url: str) -> PublishResult:
    return PublishResult(
        mode="dry-run",
        status="ok",
        slug=slug,
        blog_url=f"{base_url.rstrip('/')}/posts/{slug}",
        public_ok=True,
    )
 def _content_hash(value: str) -> str:
    return hashlib.sha256((value or "").encode("utf-8")).hexdigest()
 def _get_existing_post(client: BlogClient, slug: str) -> dict[str, Any] | None:
    getter = getattr(client, "get_post_by_slug", None)
    if getter is None:
        return None
    existing = getter(slug)
    return existing if isinstance(existing, dict) else None
 def publish_markdown(
    *,
    title: str,
    markdown: str,
    tags: list[str],
    slug: str,
    base_url: str,
    mode: str,
    markdown_report: dict[str, Any],
    client: BlogClient | None,
    idempotency_config: dict[str, Any] | None = None,
 ) -> PublishResult:
    blocking_errors = markdown_report.get("blocking_errors", []) or []
    blog_url = f"{base_url.rstrip('/')}/posts/{slug}"
    if blocking_errors:
        return PublishResult(
            mode=mode,
            status="blocked",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            error=";".join(str(error) for error in blocking_errors),
        )
    if mode == "dry-run":
        return dry_run_publish(slug, base_url)
    if client is None:
        return PublishResult(
            mode=mode,
            status="failed",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            error="missing_blog_client",
        )
    idempotency_config = idempotency_config or {}
    if bool(idempotency_config.get("enabled", False)):
        try:
            existing_post = _get_existing_post(client, slug)
        except Exception as exc:
            return PublishResult(
                mode=mode,
                status="failed",
                slug=slug,
                blog_url=blog_url,
                public_ok=False,
                error=f"idempotency_check_failed:{type(exc).__name__}: {exc}",
            )
        if existing_post is not None:
            existing_content = str(existing_post.get("content") or existing_post.get("markdown") or "")
            if _content_hash(existing_content) == _content_hash(markdown):
                return PublishResult(
                    mode=mode,
                    status="already_published",
                    slug=slug,
                    blog_url=blog_url,
                    public_ok=True,
                )
            if not bool(idempotency_config.get("allow_republish", False)):
                return PublishResult(
                    mode=mode,
                    status="blocked",
                    slug=slug,
                    blog_url=blog_url,
                    public_ok=False,
                    error="slug_already_exists",
                )
    payload = {"title": title, "content": markdown, "tags": tags, "slug": slug}
    try:
        create_resp = client.create_post(payload)
        created_slug = create_resp.get("slug") or slug
        if mode == "publish":
            client.publish_post(created_slug)
        return PublishResult(
            mode=mode,
            status="ok",
            slug=created_slug,
            blog_url=f"{base_url.rstrip('/')}/posts/{created_slug}",
            public_ok=mode == "publish",
        )
    except Exception as exc:
        return PublishResult(
            mode=mode,
            status="failed",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            error=f"{type(exc).__name__}: {exc}",
        )
--- a/ai_daily_report/quality_gate.py
+++ b/ai_daily_report/quality_gate.py
@@ -0,0 +1,98 @@
 from __future__ import annotations
 import difflib
 from typing import Any
 from .dedupe import _title_tokens
 from .models import NewsItem, SourceResult
 DEFAULT_CONFIG = {
    "required_source_failure_policy": "block",  # block | draft | dry_run | warn
    "block_on_required_source_failure": True,
    "warn_on_enabled_source_failure": True,
    "warn_when_stage3_candidates_zero_min_items": 30,
    "warn_on_final_title_similarity": 0.55,
    "warn_on_entity_frequency": 3,
    "required_sources": [],
 }
 def _config(config: dict[str, Any] | None) -> dict[str, Any]:
    return {**DEFAULT_CONFIG, **(config or {})}
 def _source_failures(source_results: list[SourceResult]) -> list[dict[str, Any]]:
    failures: list[dict[str, Any]] = []
    for result in source_results:
        if result.ok or result.status == "disabled":
            continue
        failures.append(
            {
                "source": result.source,
                "role": result.role,
                "status": result.status,
                "error": result.error,
            }
        )
    return failures
 def _similar_title_warnings(items: list[NewsItem], threshold: float) -> list[str]:
    warnings: list[str] = []
    for index, left in enumerate(items):
        left_title = left.title or left.title_raw
        for right in items[index + 1 :]:
            right_title = right.title or right.title_raw
            if len(_title_tokens(left_title)) < 2 or len(_title_tokens(right_title)) < 2:
                continue
            ratio = difflib.SequenceMatcher(None, left_title.lower(), right_title.lower()).ratio()
            if ratio >= threshold:
                warnings.append(f"final_title_similarity:{left.id}:{right.id}:{ratio:.3f}")
    return warnings
 def evaluate_quality_gate(
    items: list[NewsItem],
    *,
    source_results: list[SourceResult],
    reports: dict[str, Any],
    config: dict[str, Any] | None = None,
 ) -> dict[str, Any]:
    config = _config(config)
    warnings: list[str] = []
    blocking_errors: list[str] = []
    stage3_report = reports.get("stage3", {}) or {}
    min_items = int(config["warn_when_stage3_candidates_zero_min_items"])
    if len(items) >= min_items and int(stage3_report.get("candidate_group_count", 0)) == 0:
        warnings.append("stage3_candidates_zero")
    failures = _source_failures(source_results)
    if bool(config["warn_on_enabled_source_failure"]):
        for failure in failures:
            warnings.append(f"enabled_source_failed:{failure['source']}:{failure['status']}")
    required_sources = set(config.get("required_sources") or [])
    required_failures = [failure for failure in failures if failure["source"] in required_sources]
    policy = str(config.get("required_source_failure_policy") or "block")
    if bool(config["block_on_required_source_failure"]) and policy == "block":
        for failure in required_failures:
            blocking_errors.append(f"required_source_failed:{failure['source']}:{failure['status']}")
    elif required_failures:
        for failure in required_failures:
            warnings.append(f"required_source_failed:{failure['source']}:{failure['status']}:{policy}")
    title_threshold = float(config["warn_on_final_title_similarity"])
    if title_threshold > 0:
        warnings.extend(_similar_title_warnings(items, title_threshold))
    return {
        "input_count": len(items),
        "warnings": warnings,
        "blocking_errors": blocking_errors,
        "source_failures": failures,
        "required_source_failures": required_failures,
        "required_source_failure_policy": policy,
        "quality_gate_failed": bool(blocking_errors),
    }
--- a/ai_daily_report/rewrite.py
+++ b/ai_daily_report/rewrite.py
@@ -0,0 +1,192 @@
 from __future__ import annotations
 import json
 from typing import Any, Callable
 from urllib.error import HTTPError
 from .classify import SECTION_ORDER
 from .llm import parse_json_object
 from .models import NewsItem
 RewriteLlmCall = Callable[[str], str]
 def _chunks(items: list[NewsItem], size: int) -> list[list[NewsItem]]:
    return [items[index : index + size] for index in range(0, len(items), size)]
 def _build_prompt(batch: list[NewsItem]) -> str:
    payload = {
        "task": (
            "For each AI news item, translate when needed, rewrite the title and summary into concise Chinese, "
            "and classify it into exactly one allowed section. Preserve brand/model/API names such as GPT-5, "
            "Codex, Gemini, Claude, API, MCP. Do not add facts."
        ),
        "allowed_sections": SECTION_ORDER,
        "section_guidance": {
            "模型与能力": "model releases, capability upgrades, modalities, context windows, inference, benchmarks tied to model ability",
            "产品与应用": "end-user products, apps, agents, workflows, product launches, practical business or consumer use cases",
            "开发与基础设施": "developer tools, APIs, SDKs, MCP, frameworks, deployment, chips, cloud, infra, open source engineering",
            "公司与资本": "company strategy, financing, IPO, acquisitions, partnerships, revenue, business competition",
            "政策与安全": "policy, regulation, safety, privacy, copyright, misuse, security incidents, governance",
            "论文与研究": "papers, academic research, arXiv, methods, experiments, datasets, evaluations",
            "观点与教程": "opinions, analysis, explainers, tutorials, guides, practices",
            "人物与动态": "people-focused interviews, speeches, career moves, public appearances",
        },
        "items": [
            {
                "id": item.id,
                "title_raw": item.title_raw,
                "summary_raw": item.summary_raw,
                "source": item.source_label,
                "language_hint": item.language_hint,
                "source_section_hint": item.section_hint,
            }
            for item in batch
        ],
        "output_schema": {
            "rewrites": [
                {
                    "id": "item id",
                    "title": "display title",
                    "summary": "display summary",
                    "section": "one allowed section",
                    "confidence": 0.0,
                    "flags": [],
                }
            ]
        },
    }
    return json.dumps(payload, ensure_ascii=False)
 def _fallback(item: NewsItem) -> None:
    item.title = item.title_raw
    item.summary = item.summary_raw or "该条目暂无摘要。"
 def _is_transient_llm_error(exc: Exception) -> bool:
    if isinstance(exc, TimeoutError):
        return True
    if isinstance(exc, HTTPError):
        return exc.code in {429, 500, 502, 503, 504}
    return False
 def _apply_rewrite_results(batch: list[NewsItem], rewrites: list[Any]) -> tuple[int, int]:
    by_id = {item.id: item for item in batch}
    seen_ids: set[str] = set()
    section_count = 0
    for entry in rewrites:
        if not isinstance(entry, dict):
            continue
        item_id = entry.get("id")
        title = str(entry.get("title") or "").strip()
        summary = str(entry.get("summary") or "").strip()
        if item_id in by_id and title and summary:
            by_id[item_id].title = title
            by_id[item_id].summary = summary
            section = str(entry.get("section") or "").strip()
            if section in SECTION_ORDER:
                by_id[item_id].section = section
                section_count += 1
            seen_ids.add(item_id)
    return len(seen_ids), section_count
 def _apply_rewrite_batch(batch: list[NewsItem], llm_call: RewriteLlmCall) -> tuple[int, int]:
    obj = parse_json_object(llm_call(_build_prompt(batch)))
    rewrites = obj.get("rewrites", [])
    if not isinstance(rewrites, list):
        raise ValueError("rewrites is not a list")
    return _apply_rewrite_results(batch, rewrites)
 def rewrite_items(
    items: list[NewsItem],
    *,
    llm_call: RewriteLlmCall,
    batch_size: int = 30,
    retry_batch_size: int = 10,
    max_fallback_ratio: float = 0.2,
    retry_single_items: bool = False,
 ) -> tuple[list[NewsItem], dict[str, Any]]:
    rewritten_count = 0
    llm_section_count = 0
    fallback_count = 0
    missing_rewrite_count = 0
    batch_retry_count = 0
    errors: list[str] = []
    for batch in _chunks(items, max(1, batch_size)):
        try:
            batch_rewritten_count, batch_section_count = _apply_rewrite_batch(batch, llm_call)
            rewritten_count += batch_rewritten_count
            llm_section_count += batch_section_count
            for item in batch:
                if item.title is None or item.summary is None:
                    errors.append(f"missing_rewrite_for_item: {item.id}")
                    _fallback(item)
                    fallback_count += 1
                    missing_rewrite_count += 1
        except Exception as exc:
            errors.append(f"batch:{type(exc).__name__}: {exc}")
            if _is_transient_llm_error(exc):
                for item in batch:
                    _fallback(item)
                    fallback_count += 1
                continue
            if len(batch) > max(1, retry_batch_size):
                for retry_batch in _chunks(batch, max(1, retry_batch_size)):
                    batch_retry_count += 1
                    try:
                        retry_rewritten_count, retry_section_count = _apply_rewrite_batch(retry_batch, llm_call)
                        rewritten_count += retry_rewritten_count
                        llm_section_count += retry_section_count
                        for item in retry_batch:
                            if item.title is None or item.summary is None:
                                errors.append(f"missing_rewrite_for_item: {item.id}")
                                _fallback(item)
                                fallback_count += 1
                                missing_rewrite_count += 1
                    except Exception as retry_exc:
                        errors.append(f"batch_retry:{type(retry_exc).__name__}: {retry_exc}")
                        for item in retry_batch:
                            _fallback(item)
                            fallback_count += 1
                continue
            if not retry_single_items:
                for item in batch:
                    _fallback(item)
                    fallback_count += 1
                continue
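            for item in batch:
                try:
                    item_rewritten_count, item_section_count = _apply_rewrite_batch([item], llm_call)
                    rewritten_count += item_rewritten_count
                    llm_section_count += item_section_count
                    # Mirror the batch paths: a response that omits this item still needs a fallback.
                    if item.title is None or item.summary is None:
                        errors.append(f"missing_rewrite_for_item: {item.id}")
                        _fallback(item)
                        fallback_count += 1
                        missing_rewrite_count += 1
                except Exception as item_exc:
                    errors.append(f"item:{item.id}:{type(item_exc).__name__}: {item_exc}")
                    _fallback(item)
                    fallback_count += 1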
    fallback_ratio = fallback_count / len(items) if items else 0
    blocking_errors: list[str] = []
    if fallback_ratio > max_fallback_ratio:
        blocking_errors.append("rewrite_fallback_ratio_exceeded")
    report = {
        "input_count": len(items),
        "rewritten_count": rewritten_count,
        "llm_section_count": llm_section_count,
        "fallback_count": fallback_count,
        "missing_rewrite_count": missing_rewrite_count,
        "fallback_ratio": round(fallback_ratio, 4),
        "batch_count": len(_chunks(items, max(1, batch_size))),
        "batch_retry_count": batch_retry_count,
        "errors": errors,
        "blocking_errors": blocking_errors,
        "quality_gate_failed": bool(blocking_errors),
    }
    return items, report
--- a/ai_daily_report/runner.py
+++ b/ai_daily_report/runner.py
@@ -0,0 +1,225 @@
 from __future__ import annotations
 import json
 from dataclasses import asdict, is_dataclass
 from pathlib import Path
 from typing import Any
 from .clients import BlogApiClient, OpenAICompatibleClient, fetch_text as default_fetch_text
 from .config import load_pipeline_config, load_source_configs
 from .env import load_env, resolve_blog_token, resolve_llm_config
 from .models import SourceConfig
 from .observability import LlmCallObserver, summarize_observed_calls
 from .pipeline import run_stage0_to_stage8
 from .publish import load_published_urls, update_published_urls
 from .sources.registry import get_source_fetcher
 def _json_default(value: Any):
    if is_dataclass(value):
        return asdict(value)
    raise TypeError(f"Object is not JSON serializable: {type(value).__name__}")
 def _mock_source_configs() -> list[SourceConfig]:
    return [SourceConfig(name="Mock AI HOT", type="mock", role="primary", priority=10)]
 def _mock_fetcher(config: SourceConfig, run_date: str) -> list[dict[str, Any]]:
    return [
        {
            "title_raw": "GPT-5 API 发布",
            "summary_raw": "OpenAI 发布 GPT-5 API，用于本地 mock 测试。",
            "url": "https://example.com/gpt5",
            "source_label": "OpenAI：Blog",
            "section_hint": "模型发布/更新",
            "origin_type": "mock",
            "language_hint": "zh",
        }
    ]
 def _mock_semantic_llm(prompt: str) -> str:
    return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []}, ensure_ascii=False)
 def _mock_rewrite_llm(prompt: str) -> str:
    payload = json.loads(prompt)
    return json.dumps(
        {
            "rewrites": [
                {
                    "id": item["id"],
                    "title": item["title_raw"],
                    "summary": item["summary_raw"],
                    "flags": [],
                }
                for item in payload["items"]
            ]
        },
        ensure_ascii=False,
    )
 def _mock_guide_llm(prompt: str) -> str:
    payload = json.loads(prompt)
    item_ids = [item["id"] for item in payload["items"][:3]]
    return json.dumps(
        {
            "intro": "本地 mock 模式已生成 AI 日报，用于验证流水线。",
            "theme": "本地 mock 模式已生成 AI 日报，用于验证流水线。",
            "threads": [
                {
                    "title": "本地链路验证",
                    "text": "采集、改写、分类、导览、Markdown 和发布报告都已通过 mock 数据串联。",
                    "item_ids": item_ids,
                    "kind": "thread",
                }
            ],
            "conclusion": "本地 mock 结果可用于确认定时任务入口和文件输出是否正常。",
        },
        ensure_ascii=False,
    )
 def run_daily_report(
    *,
    run_date: str,
    mode: str,
    source_mode: str,
    llm_mode: str,
    out_dir: Path,
    base_url: str,
    sources_path: Path | None = None,
    pipeline_path: Path | None = None,
    history_path: Path | None = None,
    fetch_text=None,
    env: dict[str, str] | None = None,
    llm_client_factory=OpenAICompatibleClient,
    blog_client_factory=BlogApiClient,
 ) -> dict[str, Any]:
    fetch_text = fetch_text or default_fetch_text
    env = env if env is not None else load_env()
    pipeline_config_path = pipeline_path or Path("config") / "pipeline.json"
    pipeline_config = load_pipeline_config(pipeline_config_path)
    cross_day_config = pipeline_config.get("cross_day_dedup", {}) or {}
    cross_day_enabled = bool(cross_day_config.get("enabled", True))
    cross_day_max_age_days = int(cross_day_config.get("max_age_days", 7))
    semantic_dedup_max_deletion_ratio = float(pipeline_config.get("semantic_dedup_max_deletion_ratio", 0.5))
    rewrite_batch_size = int(pipeline_config.get("rewrite_batch_size", 30))
    semantic_candidate_recall_config = pipeline_config.get("semantic_candidate_recall", {}) or {}
    quality_gate_config = pipeline_config.get("quality_gate", {}) or {}
    publish_idempotency_config = pipeline_config.get("publish_idempotency", {}) or {}
    configured_history_path = history_path or Path(
        str(cross_day_config.get("history_path") or "~/.hermes/scripts/ai_morning_out/published_urls.json")
    ).expanduser()
    published_urls = load_published_urls(configured_history_path) if cross_day_enabled else None
    if source_mode == "mock":
        source_configs = _mock_source_configs()
        fetcher = _mock_fetcher
    elif source_mode == "live":
        if sources_path is None:
            sources_path = Path("config") / "sources.json"
        source_configs = load_source_configs(sources_path)
        def fetcher(config: SourceConfig, current_date: str) -> list[dict[str, Any]]:
            source_fetcher = get_source_fetcher(config.type)
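            def configured_fetch_text(url: str, timeout_seconds: int) -> str:
                # An injected fetch_text may not accept `retries`; fall back to the two-argument call.
                # Caveat: a TypeError raised inside fetch_text itself also triggers this fallback.
                try:
                    return fetch_text(url, timeout_seconds, retries=config.retries)
                except TypeError:
                    return fetch_text(url, timeout_seconds)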
            return source_fetcher(config, current_date, configured_fetch_text)
    else:
        raise ValueError("source_mode must be 'mock' or 'live'")
    llm_observability_config = pipeline_config.get("llm_observability", {}) or {}
    llm_observers: list[LlmCallObserver] = []
    observe_llm = bool(llm_observability_config.get("enabled", True))
    prompt_preview_chars = int(llm_observability_config.get("prompt_preview_chars", 500))
    response_preview_chars = int(llm_observability_config.get("response_preview_chars", 500))
    def maybe_observe(stage: str, call):
        if not observe_llm:
            return call
        observer = LlmCallObserver(
            call=call,
            stage=stage,
            prompt_preview_chars=prompt_preview_chars,
            response_preview_chars=response_preview_chars,
        )
        llm_observers.append(observer)
        return observer
    if llm_mode == "mock":
        semantic_llm_call = maybe_observe("stage3", _mock_semantic_llm)
        rewrite_llm_call = maybe_observe("stage4", _mock_rewrite_llm)
        guide_llm_call = maybe_observe("stage6", _mock_guide_llm)
    elif llm_mode == "live":
        llm_client = llm_client_factory(**resolve_llm_config(env))
        semantic_llm_call = maybe_observe("stage3", llm_client.chat)
        rewrite_llm_call = maybe_observe("stage4", llm_client.chat)
        guide_llm_call = maybe_observe("stage6", llm_client.chat)
    else:
        raise ValueError("llm_mode must be 'mock' or 'live'")
    blog_client = None
    if mode in ("draft", "publish"):
        token = resolve_blog_token(env)
        if not token:
            raise ValueError("missing_blog_token: set BLOG_SERVICE_TOKEN or EPHRON_SERVICE_TOKEN")
        blog_client = blog_client_factory(base_url=base_url, token=token)
    result = run_stage0_to_stage8(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        mode=mode,
        base_url=base_url,
        client=blog_client,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_enabled,
        cross_day_dedup_max_age_days=cross_day_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
        quality_gate_config=quality_gate_config,
        publish_idempotency_config=publish_idempotency_config,
    )
    if cross_day_enabled and result["publish"].mode == "publish" and result["publish"].status == "ok":
        update_published_urls(
            configured_history_path,
            result["items"],
            run_date=run_date,
            max_age_days=cross_day_max_age_days,
        )
    llm_observability_report = summarize_observed_calls(llm_observers)
    result["reports"]["llm_observability"] = llm_observability_report
    run_dir = out_dir / run_date
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "blog_markdown.md").write_text(result["markdown"], encoding="utf-8")
    (run_dir / "run_report.json").write_text(
        json.dumps(result["reports"], ensure_ascii=False, indent=2, default=_json_default),
        encoding="utf-8",
    )
    for artifact_name, artifact_value in result.get("artifacts", {}).items():
        (run_dir / f"{artifact_name}.json").write_text(
            json.dumps(artifact_value, ensure_ascii=False, indent=2, default=_json_default),
            encoding="utf-8",
        )
    return {
        "run_dir": str(run_dir),
        "markdown": result["markdown"],
        "reports": result["reports"],
        "publish": result["publish"],
        "artifacts": result.get("artifacts", {}),
    }
--- a/ai_daily_report/semantic_dedupe.py
+++ b/ai_daily_report/semantic_dedupe.py
@@ -0,0 +1,224 @@
 from __future__ import annotations
 import json
 from typing import Any, Callable
 from .llm import parse_json_object
 from .models import NewsItem
 SemanticLlmCall = Callable[[str], str]
 def _build_prompt(items: list[NewsItem], candidates: list[dict[str, Any]]) -> str:
    item_payload = [
        {
            "id": item.id,
            "title": item.title or item.title_raw,
            "summary": item.summary or item.summary_raw,
            "source": item.source_label,
            "section_hint": item.section_hint,
        }
        for item in items
    ]
    prompt = {
        "task": "Identify only high-confidence semantic duplicates. Do not curate or remove by importance.",
        "items": item_payload,
        "candidates": candidates,
        "dedupe_policy": [
            "Use duplicate_groups only when items are substantially the same article/event and one can be removed.",
            "Use merge_groups when items cover the same concrete event from different angles; keep the best item and attach the others as supplementary sources instead of dropping the event context.",
            "Do not curate by importance. Do not merge unrelated follow-ups just because they mention the same company/model.",
        ],
        "output_schema": {
            "duplicate_groups": [
                {
                    "keep_id": "item id",
                    "remove_ids": ["item id"],
                    "confidence": "high|medium|low",
                    "reason": "same concrete event reason",
                }
            ],
            "merge_groups": [
                {
                    "keep_id": "item id",
                    "merge_ids": ["item id"],
                    "confidence": "high|medium|low",
                    "reason": "same event, complementary angle/source",
                }
            ],
            "not_duplicates": [],
            "uncertain": [],
        },
    }
    return json.dumps(prompt, ensure_ascii=False)
 def _score(item: NewsItem) -> int:
    # Higher is better: a lower source_priority value yields a larger base score.
    score = max(0, 200 - item.source_priority)
    if item.source_role == "primary":
        score += 10
    if item.summary_raw:
        # Reward a non-empty raw summary, capped at 40 points.
        score += min(40, len(item.summary_raw))
    if item.canonical_url:
        score += 20
    # Each quality flag costs 10 points.
    score -= len(item.quality_flags) * 10
    return score
 def _choose_keep(group_items: list[NewsItem], suggested_keep_id: str) -> NewsItem:
    # Honor the LLM's suggested keep_id unless another item outscores it by more than 10 points.
    best = max(group_items, key=_score)
    suggested = next((item for item in group_items if item.id == suggested_keep_id), None)
    if suggested is not None and _score(suggested) >= _score(best) - 10:
        return suggested
    return best
 def semantic_dedup_items(
    items: list[NewsItem],
    candidates: list[dict[str, Any]],
    *,
    llm_call: SemanticLlmCall,
    max_deletion_ratio: float = 0.5,
 ) -> tuple[list[NewsItem], dict[str, Any]]:
    if not items or not candidates:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": [],
            "merge_groups": [],
            "uncertain": [],
            "errors": [],
            "skipped_for_deletion_ratio": False,
        }
    errors: list[str] = []
    try:
        obj = parse_json_object(llm_call(_build_prompt(items, candidates)))
    except Exception as exc:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": [],
            "merge_groups": [],
            "uncertain": [],
            "errors": [f"{type(exc).__name__}: {exc}"],
            "skipped_for_deletion_ratio": False,
        }
    by_id = {item.id: item for item in items}
    candidate_sets = {
        frozenset(item_id for item_id in candidate.get("item_ids", []) if isinstance(item_id, str))
        for candidate in candidates
    }
    candidate_removals: set[str] = set()
    valid_groups: list[dict[str, Any]] = []
    valid_merge_groups: list[dict[str, Any]] = []
    def _validate_group_ids(group: dict[str, Any], member_key: str) -> tuple[list[str], list[NewsItem]] | None:
        raw_ids = [group.get("keep_id")] + list(group.get(member_key) or [])
        if any(not isinstance(item_id, str) or item_id not in by_id for item_id in raw_ids):
            errors.append(f"invalid_ids_in_group: {group}")
            return None
        ids = [str(item_id) for item_id in raw_ids]
        group_set = frozenset(ids)
        if not any(group_set.issubset(candidate_set) for candidate_set in candidate_sets):
            errors.append(f"group_outside_candidates: {group}")
            return None
        return ids, [by_id[item_id] for item_id in ids]
    for group in obj.get("duplicate_groups", []) or []:
        if group.get("confidence") != "high":
            continue
        validated = _validate_group_ids(group, "remove_ids")
        if validated is None:
            continue
        _, group_items = validated
        keep = _choose_keep(group_items, str(group.get("keep_id")))
        remove_items = [item for item in group_items if item is not keep]
        candidate_removals.update(item.id for item in remove_items)
        valid_groups.append(
            {
                "keep_id": keep.id,
                "remove_ids": [item.id for item in remove_items],
                "confidence": "high",
                "reason": str(group.get("reason") or "semantic_duplicate"),
            }
        )
    for group in obj.get("merge_groups", []) or []:
        if group.get("confidence") != "high":
            continue
        validated = _validate_group_ids(group, "merge_ids")
        if validated is None:
            continue
        _, group_items = validated
        keep = _choose_keep(group_items, str(group.get("keep_id")))
        merge_items = [item for item in group_items if item is not keep]
        valid_merge_groups.append(
            {
                "keep_id": keep.id,
                "merge_ids": [item.id for item in merge_items],
                "confidence": "high",
                "reason": str(group.get("reason") or "semantic_merge"),
            }
        )
    deletion_ratio = len(candidate_removals) / len(items) if items else 0
    if deletion_ratio > max_deletion_ratio:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": valid_groups,
            "merge_groups": valid_merge_groups,
            "uncertain": obj.get("uncertain", []) or [],
            "errors": errors,
            "skipped_for_deletion_ratio": True,
        }
    removed_ids: set[str] = set()
    def append_supplement(keep: NewsItem, source_item: NewsItem, reason: str, action: str) -> None:
        keep.duplicate_sources.append(
            {
                "id": source_item.id,
                "source_group": source_item.source_group,
                "source_label": source_item.source_label,
                "url": source_item.url,
                "title": source_item.title or source_item.title_raw,
                "summary": source_item.summary or source_item.summary_raw,
                "reason": reason,
                "action": action,
            }
        )
    for group in valid_groups:
        keep = by_id[group["keep_id"]]
        for remove_id in group["remove_ids"]:
            removed = by_id[remove_id]
            append_supplement(keep, removed, group["reason"], "dedupe_remove")
            removed_ids.add(remove_id)
    for group in valid_merge_groups:
        keep = by_id[group["keep_id"]]
        for merge_id in group["merge_ids"]:
            if merge_id in removed_ids:
                continue
            append_supplement(keep, by_id[merge_id], group["reason"], "merge_supplement")
    deduped = [item for item in items if item.id not in removed_ids]
    report = {
        "input_count": len(items),
        "candidate_group_count": len(candidates),
        "removed_count": len(removed_ids),
        "duplicate_groups": valid_groups,
        "merge_groups": valid_merge_groups,
        "uncertain": obj.get("uncertain", []) or [],
        "errors": errors,
        "skipped_for_deletion_ratio": False,
    }
    return deduped, report
--- a/ai_daily_report/sources/__init__.py
+++ b/ai_daily_report/sources/__init__.py
@@ -0,0 +1,2 @@
 """Source adapters for the AI daily report pipeline."""
--- a/ai_daily_report/sources/aihot.py
+++ b/ai_daily_report/sources/aihot.py
@@ -0,0 +1,32 @@
 from __future__ import annotations
 import json
 from typing import Any, Callable
 from ai_daily_report.models import SourceConfig
 FetchText = Callable[[str, int], str]
 def fetch_aihot(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
    data = json.loads(fetch_text(f"https://aihot.virxact.com/api/public/daily/{run_date}", config.timeout_seconds))
    items: list[dict[str, Any]] = []
    generated = data.get("generatedAt")
    for section in data.get("sections", []) or []:
        for raw in section.get("items", []) or []:
            items.append(
                {
                    "source_group": config.name,
                    "source_label": raw.get("sourceName") or config.name,
                    "title_raw": raw.get("title") or "",
                    "summary_raw": raw.get("summary") or "",
                    "url": raw.get("sourceUrl") or "",
                    "published_at": generated,
                    "origin_type": "aihot_json",
                    "section_hint": section.get("label") or "",
                    "language_hint": "zh",
                }
            )
    return items
--- a/ai_daily_report/sources/juya.py
+++ b/ai_daily_report/sources/juya.py
@@ -0,0 +1,58 @@
 from __future__ import annotations
 import re
 import xml.etree.ElementTree as ET
 from typing import Any, Callable
 from ai_daily_report.models import SourceConfig
 from ai_daily_report.normalize import clean_text
 from ai_daily_report.sources.labels import source_label_from_url
 FetchText = Callable[[str, int], str]
 def parse_juya_rss(config: SourceConfig, xml_text: str, run_date: str) -> list[dict[str, Any]]:
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    raw_items = channel.findall("item") if channel is not None else []
    article_html = ""
    for raw in raw_items:
        if (raw.findtext("title") or "").strip() != run_date:
            continue
        content_el = raw.find("{http://purl.org/rss/1.0/modules/content/}encoded")
        article_html = content_el.text if content_el is not None and content_el.text else ""
        break
    if not article_html:
        return []
    block_pattern = re.compile(
        r'<h2[^>]*>\s*(?:<a[^>]*href="(?P<title_url>[^"]+)"[^>]*>)?(?P<title_html>[^<]*?)(?:</a>)?\s*<code>#(?P<num>\d+)</code>\s*</h2>(?P<body>.*?)(?=<hr\s*/?>\s*<h2|<p><strong>提示</strong>|$)',
        re.S | re.I,
    )
    items: list[dict[str, Any]] = []
    for match in block_pattern.finditer(article_html):
        title = clean_text(match.group("title_html") or "")
        body_html = match.group("body") or ""
        links = re.findall(r'<a[^>]*href="([^"]+)"[^>]*>', body_html, re.I)
        url = links[0].replace("&amp;", "&").strip() if links else (match.group("title_url") or "")
        summary = clean_text(re.sub(r"<[^>]+>", " ", body_html))
        if title:
            items.append(
                {
                    "source_group": config.name,
                    "source_label": source_label_from_url(url, fallback=config.name),
                    "title_raw": title,
                    "summary_raw": summary[:500],
                    "url": url,
                    "published_at": None,
                    "origin_type": "juya_issue",
                    "section_hint": "",
                    "language_hint": "zh",
                }
            )
    return items
 def fetch_juya(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
    return parse_juya_rss(config, fetch_text(config.url, config.timeout_seconds), run_date)
--- a/ai_daily_report/sources/labels.py
+++ b/ai_daily_report/sources/labels.py
@@ -0,0 +1,78 @@
 from __future__ import annotations
 from urllib.parse import urlparse
 DOMAIN_LABELS = {
    "anthropic.com": "Anthropic",
    "arxiv.org": "arXiv",
    "bloomberg.com": "Bloomberg",
    "deepseek.com": "DeepSeek",
    "github.blog": "GitHub Blog",
    "github.com": "GitHub",
    "huggingface.co": "Hugging Face",
    "infoq.com": "InfoQ",
    "mp.weixin.qq.com": "微信公众号",
    "openai.com": "OpenAI",
    "platform.minimaxi.com": "MiniMax：Docs",
    "qbitai.com": "量子位",
    "techcrunch.com": "TechCrunch",
    "technologyreview.com": "MIT科技评论AI",
    "theverge.com": "The Verge",
    "x.com": "X",
    "twitter.com": "X",
 }
 X_DISPLAY_NAMES = {
    "MiniMax_AI": "MiniMax",
    "OpenAIDevs": "OpenAI Developers",
    "openai": "OpenAI",
    "openclaw": "OpenClaw",
    "xai": "xAI",
    "krea_ai": "Krea AI",
    "nvidia": "NVIDIA",
    "NVIDIAAI": "NVIDIA AI",
    "alibaba_cloud": "阿里云 / Alibaba Cloud",
    "cb_doge": "cb_doge",
 }
 def _host(url: str) -> str:
    host = (urlparse(url).netloc or "").lower()
    return host[4:] if host.startswith("www.") else host
 def _domain_label(host: str) -> str:
    for domain, label in DOMAIN_LABELS.items():
        if host == domain or host.endswith("." + domain):
            return label
    return host
 def _x_handle(url: str) -> str:
    parts = [part for part in urlparse(url).path.split("/") if part]
    if not parts:
        return ""
    handle = parts[0]
    if handle in {"i", "search", "explore", "settings", "notifications", "home", "compose"}:
        return ""
    return handle
 def source_label_from_url(url: str, *, fallback: str = "来源") -> str:
    if not url:
        return fallback
    host = _host(url)
    if host in {"x.com", "twitter.com"}:
        handle = _x_handle(url)
        if handle:
            display = X_DISPLAY_NAMES.get(handle, handle)
            return f"X：{display} (@{handle})"
        return "X"
    label = _domain_label(host)
    parsed = urlparse(url)
    path = (parsed.path or "").lower()
    if label and ("blog" in host or "/blog" in path or "/research" in path):
        return f"{label}：Blog"
    return label or fallback
--- a/ai_daily_report/sources/registry.py
+++ b/ai_daily_report/sources/registry.py
@@ -0,0 +1,24 @@
 from __future__ import annotations
 from typing import Callable
 from ai_daily_report.models import SourceConfig
 from ai_daily_report.sources.aihot import fetch_aihot
 from ai_daily_report.sources.juya import fetch_juya
 from ai_daily_report.sources.rss import fetch_rss
 SourceFetcher = Callable[[SourceConfig, str, Callable[[str, int], str]], list[dict]]
 SOURCE_FETCHERS: dict[str, SourceFetcher] = {
    "aihot": fetch_aihot,
    "rss": fetch_rss,
    "juya_rss": fetch_juya,
 }
 def get_source_fetcher(source_type: str) -> SourceFetcher:
    if source_type not in SOURCE_FETCHERS:
        raise KeyError(f"Unknown source type: {source_type}")
    return SOURCE_FETCHERS[source_type]
--- a/ai_daily_report/sources/rss.py
+++ b/ai_daily_report/sources/rss.py
@@ -0,0 +1,94 @@
 from __future__ import annotations
 import xml.etree.ElementTree as ET
 from datetime import date, datetime
 from email.utils import parsedate_to_datetime
 from typing import Any, Callable
 from ai_daily_report.models import SourceConfig
 from ai_daily_report.normalize import clean_text
 FetchText = Callable[[str, int], str]
 def _parse_pubdate(value: str) -> str | None:
    if not value:
        return None
    try:
        return parsedate_to_datetime(value).isoformat()
    except Exception:
        return None
 def _parse_run_date(value: str | None) -> date | None:
    if not value:
        return None
    try:
        return date.fromisoformat(value[:10])
    except ValueError:
        return None
 def _parse_iso_date(value: str | None) -> date | None:
    if not value:
        return None
    try:
        return datetime.fromisoformat(value).date()
    except ValueError:
        return None
 def _within_max_item_age(published_at: str | None, *, run_date: str | None, max_item_age_days: int | None) -> bool:
    if max_item_age_days is None:
        return True
    published_date = _parse_iso_date(published_at)
    current_date = _parse_run_date(run_date)
    if published_date is None or current_date is None:
        return True
    return (current_date - published_date).days <= max_item_age_days
 def parse_rss_items(
    config: SourceConfig,
    xml_text: str,
    *,
    limit: int = 20,
    run_date: str | None = None,
 ) -> list[dict[str, Any]]:
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    raw_items = channel.findall("item") if channel is not None else []
    items: list[dict[str, Any]] = []
    for raw in raw_items:
        title = clean_text(raw.findtext("title") or "")
        if not title:
            continue
        summary = clean_text(raw.findtext("description") or "")
        published_at = _parse_pubdate(raw.findtext("pubDate") or "")
        if not _within_max_item_age(
            published_at,
            run_date=run_date,
            max_item_age_days=config.max_item_age_days,
        ):
            continue
        items.append(
            {
                "source_group": config.name,
                "source_label": config.name,
                "title_raw": title,
                "summary_raw": summary,
                "url": (raw.findtext("link") or "").strip(),
                "published_at": published_at,
                "origin_type": "rss",
                "section_hint": "",
                "language_hint": "en" if title.encode("utf-8").isascii() else "zh",
            }
        )
        if len(items) >= limit:
            break
    return items
 def fetch_rss(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
    return parse_rss_items(config, fetch_text(config.url, config.timeout_seconds), run_date=run_date)
--- a/ai_daily_report/validate.py
+++ b/ai_daily_report/validate.py
@@ -0,0 +1,46 @@
 from __future__ import annotations
 import re
 from typing import Any
 from .classify import SECTION_ORDER
 from .models import NewsItem
 def validate_report_markdown(markdown: str, items: list[NewsItem]) -> dict[str, Any]:
    return validate_markdown(markdown, items)
 def validate_markdown(markdown: str, items: list[NewsItem]) -> dict[str, Any]:
    blocking_errors: list[str] = []
    auto_fixes: list[str] = []
    warnings: list[dict[str, str]] = []
    if not items:
        blocking_errors.append("no_items")
    if len((markdown or "").strip()) < 80:
        blocking_errors.append("markdown_too_short")
    if items and "## " not in (markdown or ""):
        blocking_errors.append("no_sections")
    if re.search(r"\{[^{}]*\}", markdown or ""):
        blocking_errors.append("json_fragment_detected")
    if "> >" in (markdown or ""):
        auto_fixes.append("double_blockquote_detected")
    if re.search(r"\[\d+\]|\[N\]", markdown or ""):
        auto_fixes.append("reference_marker_detected")
    for item in items:
        if not item.url:
            warnings.append({"type": "missing_url", "item_id": item.id})
        if item.section not in SECTION_ORDER:
            blocking_errors.append("invalid_section")
            break
    return {
        "item_count": len(items),
        "section_count": len({item.section for item in items if item.section}),
        "markdown_length": len(markdown or ""),
        "auto_fixes": auto_fixes,
        "warnings": warnings,
        "blocking_errors": blocking_errors,
    }
--- a/config/pipeline.json
+++ b/config/pipeline.json
@@ -0,0 +1,52 @@
 {
  "sections": [
    "模型与能力",
    "产品与应用",
    "开发与基础设施",
    "公司与资本",
    "政策与安全",
    "论文与研究",
    "观点与教程",
    "人物与动态"
  ],
  "rewrite_batch_size": 10,
  "semantic_dedup_max_deletion_ratio": 0.5,
  "default_mode": "dry-run",
  "cross_day_dedup": {
    "enabled": true,
    "max_age_days": 7,
    "history_path": "~/.hermes/scripts/ai_morning_out/published_urls.json"
  },
  "semantic_candidate_recall": {
    "enabled": true,
    "max_pairs": 80,
    "max_pairs_per_item": 5,
    "title_similarity_threshold": 0.45,
    "title_jaccard_threshold": 0.25,
    "summary_jaccard_threshold": 0.18,
    "strong_entity_overlap_threshold": 2
  },
  "quality_gate": {
    "required_source_failure_policy": "block",
    "block_on_required_source_failure": true,
    "warn_on_enabled_source_failure": true,
    "warn_when_stage3_candidates_zero_min_items": 30,
    "warn_on_final_title_similarity": 0.55,
    "warn_on_entity_frequency": 3,
    "required_sources": ["AI HOT"]
  },
  "publish_idempotency": {
    "enabled": true,
    "allow_republish": false,
    "slug_lookup_paths": [
      "/api/service/posts/{slug}",
      "/api/service/posts?slug={slug}",
      "/api/service/posts/slug/{slug}"
    ]
  },
  "llm_observability": {
    "enabled": true,
    "prompt_preview_chars": 500,
    "response_preview_chars": 500
  }
 }
--- a/config/sources.json
+++ b/config/sources.json
@@ -0,0 +1,68 @@
 [
  {
    "name": "AI HOT",
    "type": "aihot",
    "role": "primary",
    "required": true,
    "failure_policy": "block",
    "priority": 10,
    "timeout_seconds": 25,
    "retries": 2,
    "min_items": 10,
    "enabled": true
  },
  {
    "name": "橘鸦AI早报",
    "type": "juya_rss",
    "url": "https://imjuya.github.io/juya-ai-daily/rss.xml",
    "role": "supplement",
    "required": false,
    "failure_policy": "warn",
    "priority": 20,
    "timeout_seconds": 45,
    "retries": 2,
    "min_items": 0,
    "enabled": true
  },
  {
    "name": "量子位",
    "type": "rss",
    "url": "https://www.qbitai.com/feed",
    "role": "supplement",
    "required": false,
    "failure_policy": "warn",
    "priority": 30,
    "timeout_seconds": 25,
    "retries": 1,
    "min_items": 0,
    "enabled": true
  },
  {
    "name": "InfoQ AI",
    "type": "rss",
    "url": "https://feed.infoq.com/ai-ml-data-eng/",
    "role": "supplement",
    "required": false,
    "failure_policy": "warn",
    "priority": 40,
    "timeout_seconds": 25,
    "retries": 1,
    "min_items": 0,
    "max_item_age_days": 3,
    "enabled": true
  },
  {
    "name": "MIT科技评论AI",
    "type": "rss",
    "url": "https://www.technologyreview.com/topic/artificial-intelligence/feed",
    "role": "supplement",
    "required": false,
    "failure_policy": "warn",
    "priority": 50,
    "timeout_seconds": 25,
    "retries": 1,
    "min_items": 0,
    "max_item_age_days": 5,
    "enabled": true
  }
 ]
--- a/docs/ops-thresholds.generated.md
+++ b/docs/ops-thresholds.generated.md
@@ -0,0 +1,33 @@
 # AI Daily Report Ops Thresholds (auto-generated)
 > Generated by `scripts/generate_ops_docs.py` from `config/pipeline.json` and `config/sources.json`; do not edit this file by hand.
 ## Quality Gate
 - `block_on_required_source_failure`: `True`
 - `required_source_failure_policy`: `block`
 - `required_sources`: `['AI HOT']`
 - `warn_on_enabled_source_failure`: `True`
 - `warn_on_entity_frequency`: `3`
 - `warn_on_final_title_similarity`: `0.55`
 - `warn_when_stage3_candidates_zero_min_items`: `30`
 ## Semantic Candidate Recall
 - `enabled`: `True`
 - `max_pairs`: `80`
 - `max_pairs_per_item`: `5`
 - `strong_entity_overlap_threshold`: `2`
 - `summary_jaccard_threshold`: `0.18`
 - `title_jaccard_threshold`: `0.25`
 - `title_similarity_threshold`: `0.45`
 ## Sources
 | source | required | failure_policy | min_items | retries | timeout_seconds |
 |---|---:|---|---:|---:|---:|
 | AI HOT | True | block | 10 | 2 | 25 |
 | 橘鸦AI早报 | False | warn | 0 | 2 | 45 |
 | 量子位 | False | warn | 0 | 1 | 25 |
 | InfoQ AI | False | warn | 0 | 1 | 25 |
 | MIT科技评论AI | False | warn | 0 | 1 | 25 |
--- a/docs/pipeline-optimization-plan.md
+++ b/docs/pipeline-optimization-plan.md
@@ -0,0 +1,786 @@
 # AI Daily Report Pipeline Optimization Plan
 ## Objective
 This project should become a stable, long-running AI daily report system for Hermes, OpenClaw, and similar agents. The goal is not only to keep the current script runnable, but to make the whole pipeline observable, replayable, maintainable, and safe to run on a daily schedule.
 The recommended direction is:
 ```text
 stable core library + CLI + skill wrapper
 ```
 Core business logic should live in deterministic code. The skill should describe how agents run, diagnose, replay, publish, and extend the pipeline.
 ## Stage Model
 Use this stage model going forward:
 ```text
 Stage 0: Collect Sources
 Stage 1: Normalize Items
 Stage 2: Hard Dedup
 Stage 3: Semantic Dedup
 Stage 4: Rewrite Titles and Summaries
 Stage 5: Classify and Order
 Stage 6: Guide and Daily Threads
 Stage 7: Assemble and Validate Markdown
 Stage 8: Publish and Deliver
 ```
 The current script labels script-level deduplication as Stage 0. Treat that as legacy terminology: in the long-term pipeline, Stage 0 is source collection.
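One way to keep the stage numbering unambiguous in code is an enumeration, so that logs and reports refer to stages by the names in the stage model rather than by legacy labels. The sketch below is illustrative only: `Stage` and `stage_label` are hypothetical names, not existing modules or functions in this repository, and the `stageN` label format is assumed from the observer labels used when wrapping LLM calls.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical enum mirroring the stage model (not part of the codebase)."""

    COLLECT_SOURCES = 0
    NORMALIZE_ITEMS = 1
    HARD_DEDUP = 2
    SEMANTIC_DEDUP = 3
    REWRITE = 4
    CLASSIFY_AND_ORDER = 5
    GUIDE_AND_THREADS = 6
    ASSEMBLE_AND_VALIDATE = 7
    PUBLISH_AND_DELIVER = 8


def stage_label(stage: Stage) -> str:
    """Build a report key of the form 'stage<N>' from the stage number."""
    return f"stage{int(stage)}"
```

With such an enum, `stage_label(Stage.SEMANTIC_DEDUP)` yields the same `stage3` key that the LLM observer uses for semantic dedup.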
 ## Architecture
 Recommended structure:
 ```text
 ai-daily-report/
 ├── ai_daily_report/
 │   ├── models.py
 │   ├── sources/
 │   │   ├── aihot.py
 │   │   ├── rss.py
 │   │   ├── juya.py
 │   │   └── registry.py
 │   ├── collect.py
 │   ├── normalize.py
 │   ├── dedupe.py
 │   ├── llm.py
 │   ├── rewrite.py
 │   ├── classify.py
 │   ├── assemble.py
 │   ├── validate.py
 │   ├── publish.py
 │   └── cli.py
 ├── config/
 │   ├── sources.json
 │   └── pipeline.json
 ├── docs/
 ├── skill/
 │   ├── SKILL.md
 │   ├── scripts/
 │   └── references/
 ├── tests/
 │   └── fixtures/
 └── script/
    └── ai_daily_blog_pipeline.py
 ```
 Keep `script/ai_daily_blog_pipeline.py` as a compatibility entrypoint during migration, but move implementation into importable modules.
 ## Data Model
 ### SourceResult
 Every data source should return a structured result:
 ```json
 {
  "source": "AI HOT",
  "role": "primary",
  "ok": true,
  "status": "ok",
  "items": [],
  "error": null,
  "elapsed_ms": 820,
  "retry_count": 0,
  "fetched_at": "2026-06-04T10:00:00+08:00"
 }
 ```
 Supported statuses:
 ```text
 ok
 empty
 not_ready
 timeout
 http_error
 parse_error
 disabled
 ```
 ### NewsItem
 All raw source items should be normalized into one structure:
 ```json
 {
  "id": "item_...",
  "source_group": "AI HOT",
  "source_label": "OpenAI: Blog",
  "source_role": "primary",
  "source_priority": 10,
  "title_raw": "...",
  "title_norm": "...",
  "summary_raw": "...",
  "title": null,
  "summary": null,
  "url": "...",
  "canonical_url": "...",
  "published_at": "...",
  "collected_at": "...",
  "origin_type": "aihot_json",
  "section_hint": "...",
  "section": null,
  "language_hint": "zh",
  "quality_flags": [],
  "duplicate_sources": []
 }
 ```
 Do not overwrite raw fields with LLM output. Keep display fields separate.
 ## Stage 0: Collect Sources
 ### Goal
 Collect candidate news from all configured sources in a stable, observable, and recoverable way.
 ### Design
 Use a primary-plus-supplement model at the quality layer, and parallel execution at the scheduling layer.
 ```text
 Quality layer:
 AI HOT = primary source
 RSS / Juya / InfoQ / QbitAI / MIT = supplement sources
 Execution layer:
 start all sources concurrently with per-source timeout, retry, and reporting
 ```
 ### Source Config
 Example:
 ```json
 {
  "name": "AI HOT",
  "type": "aihot",
  "role": "primary",
  "required": true,
  "priority": 10,
  "timeout_seconds": 20,
  "retries": 2,
  "min_items": 10,
  "enabled": true
 }
 ```
 Supplement source example:
 ```json
 {
  "name": "Juya AI Daily",
  "type": "juya_rss",
  "url": "https://imjuya.github.io/juya-ai-daily/rss.xml",
  "role": "supplement",
  "required": false,
  "priority": 20,
  "timeout_seconds": 45,
  "retries": 2,
  "enabled": true
 }
 ```
 ### Optimizations
 - Run supplement sources concurrently.
 - Do not let one slow source block the whole pipeline.
 - Replace the fixed Juya `sleep(120)` with bounded short retries and a clear `not_ready` or `timeout` status.
 - Treat AI HOT 404 as "not ready" rather than a generic failure.
 - Allow degraded generation if the primary source has a temporary network failure and supplement sources are usable.
 - Persist raw source results for replay.
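The concurrency and status points above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `collect_one`, `collect_all`, and the exception-to-status mapping are hypothetical, and each `fetch` callable is assumed to enforce its own network timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def collect_one(name, fetch, retries):
    """Call one fetcher with bounded retries; errors become a status, not an exception."""
    started = time.monotonic()
    items, status, error, attempts = [], "empty", None, 0
    for attempts in range(1, retries + 2):
        try:
            items = list(fetch())
            status, error = ("ok" if items else "empty"), None
            break
        except TimeoutError as exc:
            status, error = "timeout", repr(exc)
        except OSError as exc:  # assumption: network-level failures surface as OSError
            status, error = "http_error", repr(exc)
        except Exception as exc:  # assumption: anything else is treated as a parse problem
            status, error = "parse_error", repr(exc)
    return {
        "source": name,
        "ok": status == "ok",
        "status": status,
        "items": items,
        "error": error,
        "elapsed_ms": int((time.monotonic() - started) * 1000),
        "retry_count": max(attempts - 1, 0),
    }


def collect_all(sources, max_workers=8):
    """Run every (name, fetch, retries) triple concurrently and key results by source name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(collect_one, name, fetch, retries)
            for name, fetch, retries in sources
        }
        return {name: future.result() for name, future in futures.items()}
```

Because every failure is folded into a `SourceResult`-shaped dict, a slow or broken supplement source cannot abort the other sources in this sketch.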
 ### Artifacts
 ```text
 source_results.json
 raw_items.json
 stage0_collect_report.json
 ```
 ## Stage 1: Normalize Items
 ### Goal
 Convert heterogeneous source output into clean, comparable, traceable `NewsItem` objects.
 ### Optimizations
 - Normalize text with HTML stripping, entity decoding, whitespace cleanup, and RSS boilerplate removal.
 - Generate stable `id` values from canonical URL when possible, otherwise from source, normalized title, and date.
 - Canonicalize URLs:
  - Lowercase scheme and host.
  - Remove `utm_*`, `fbclid`, `gclid`, `spm`, `from`, and fragments.
  - Normalize trailing slashes.
  - Normalize `twitter.com` and `x.com` URLs.
 - Generate `title_norm`:
  - Unicode NFKC normalization.
  - Lowercase English text.
  - Normalize whitespace and weak punctuation.
  - Preserve numbers, versions, model names, and product names.
 - Standardize source labels:
  - X links as `X:@username`.
  - Official blogs as `OpenAI: Blog`, `Google Research: Blog`, etc.
  - Avoid generic labels such as "technology media" when a domain label is available.
 - Add `quality_flags` instead of silently dropping items:
  - `missing_url`
  - `missing_summary`
  - `short_title`
  - `bad_url`
  - `old_item`
  - `parse_suspect`
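A rough sketch of the URL and title normalization rules listed above, using only the standard library. The function names, the punctuation set treated as "weak", and the 12-character hash length are assumptions made for illustration.

```python
import hashlib
import re
import unicodedata
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid", "spm", "from"}
WEAK_PUNCTUATION = re.compile(r"[!?,;:\"'“”‘’·]+")  # assumed set; periods stay so "v4.0" survives


def canonicalize_url(url):
    """Lowercase scheme and host, drop tracking params and fragment, trim trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host in {"twitter.com", "www.twitter.com", "www.x.com"}:
        host = "x.com"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def normalize_title(title):
    """NFKC-normalize, lowercase, and collapse weak punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = WEAK_PUNCTUATION.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def stable_id(canonical_url, source, title_norm, date):
    """Prefer the canonical URL as the hash key; fall back to source, title, and date."""
    key = canonical_url or f"{source}|{title_norm}|{date}"
    return "item_" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```

NFKC folds full-width punctuation such as `：` and `！` into ASCII before the weak-punctuation pass, so one character class covers both forms.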
 ### Non-goals
 - Do not dedupe.
 - Do not rewrite content.
 - Do not call the LLM.
 - Do not remove items based on importance.
 ### Artifacts
 ```text
 normalized_items.json
 stage1_normalize_report.json
 ```
 ## Stage 2: Hard Dedup
 ### Goal
 Remove only high-confidence duplicates with deterministic rules. Mark uncertain similarities for Stage 3.
 ### Rules
 High-confidence removal:
 - Same canonical URL.
 - Same normalized title.
 - Same platform entity, such as the same X status ID.
 - Same source and same exact normalized title.
 Uncertain cases:
 - Similar title but different URL.
 - Same company or model, but unclear whether the event is identical.
 - Same topic across multiple sources with different factual details.
 Uncertain cases should go to `possible_duplicates`, not be removed.
 ### Replacement for Current Logic
 The current `SequenceMatcher > 0.7` direct deletion is too risky. Replace it with:
 - Exact deterministic deletion.
 - Similarity-based candidate marking only.
 ### Keep Item Selection
 When merging a duplicate group, keep the item with the highest local score:
 ```text
 official source bonus
 + primary source bonus
 + source priority
 + has URL
 + has summary
 + has section hint
 + newer published_at
 - quality flag penalty
 ```
 Attach removed items to `duplicate_sources` on the kept item.
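One possible shape for the keep-item selection, as a hedged sketch. The weights, the `is_official` field, and the helper names are assumptions; the source-priority term is left out because the plan does not say whether a lower or higher `priority` number wins.

```python
def keep_score(item):
    """Sum illustrative bonuses and penalties; the weights are placeholders."""
    score = 0
    score += 3 if item.get("is_official") else 0  # hypothetical flag for official sources
    score += 2 if item.get("source_role") == "primary" else 0
    score += 1 if item.get("url") else 0
    score += 1 if item.get("summary_raw") else 0
    score += 1 if item.get("section_hint") else 0
    score -= len(item.get("quality_flags") or [])
    return score


def merge_group(group):
    """Keep the best-scoring item and record the others under duplicate_sources."""
    # Ties fall back to published_at; comparing ISO strings assumes one shared UTC offset.
    ranked = sorted(
        group,
        key=lambda item: (keep_score(item), item.get("published_at") or ""),
        reverse=True,
    )
    kept = dict(ranked[0])
    kept["duplicate_sources"] = list(kept.get("duplicate_sources") or []) + [
        {"id": item["id"], "source_label": item.get("source_label"), "url": item.get("url")}
        for item in ranked[1:]
    ]
    return kept
```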
 ### Artifacts
 ```text
 deduped_items.json
 stage2_dedupe_report.json
 ```
 ## Stage 3: Semantic Dedup
 ### Goal
 Use the LLM to identify semantic duplicates that deterministic rules cannot safely remove.
 ### Principles
 - The LLM judges duplicate candidates; local code enforces safety.
 - The LLM must not select, curate, or remove items by importance.
 - Only remove `confidence = high` duplicate groups.
 - Treat medium or uncertain results as non-removal.
 ### Input
 Prefer candidate groups from Stage 2. Avoid sending all items at once unless the item count is small.
 Example item payload:
 ```json
 {
  "id": "item_123",
  "title": "...",
  "summary": "...",
  "source": "QbitAI",
  "url_host": "qbitai.com",
  "published_at": "...",
  "section_hint": "Company and Capital"
 }
 ```
 ### Output Schema
 ```json
 {
  "duplicate_groups": [
    {
      "keep_id": "item_123",
      "remove_ids": ["item_456"],
      "confidence": "high",
      "reason": "Both items report the same concrete event."
    }
  ],
  "not_duplicates": [],
  "uncertain": []
 }
 ```
 ### Safety Checks
 - Validate all IDs exist.
 - Validate confidence values.
 - Apply local keep-item scoring instead of blindly trusting `keep_id`.
 - Skip deletion if the deletion ratio exceeds a configured threshold.
 - Skip deletion when versions, product names, or dates conflict.
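A minimal sketch of how local code might enforce some of these checks on the LLM output. `safe_removals` is a hypothetical helper; it covers ID validation, the high-confidence rule, and the deletion-ratio cap, and it omits local keep-item scoring and the version, product name, and date conflict checks.

```python
def safe_removals(output, items_by_id, max_deletion_ratio):
    """Return {removed_id: reason} for groups that pass the local safety checks."""
    removals = {}
    for group in output.get("duplicate_groups", []):
        if group.get("confidence") != "high":
            continue  # medium, uncertain, or unknown values never delete anything
        keep_id = group.get("keep_id")
        remove_ids = group.get("remove_ids") or []
        if any(item_id not in items_by_id for item_id in [keep_id, *remove_ids]):
            continue  # an unknown ID invalidates the whole group
        for item_id in remove_ids:
            if item_id != keep_id:
                removals[item_id] = group.get("reason", "")
    if len(removals) > max_deletion_ratio * len(items_by_id):
        return {}  # too many deletions at once: keep the Stage 2 output untouched
    return removals
```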
 ### Failure Behavior
 If timeout, JSON parse failure, or schema validation failure occurs, keep Stage 2 output and continue.
 ### Artifacts
 ```text
 semantic_dedup_input.json
 semantic_dedup_output.json
 stage3_semantic_dedup_report.json
 ```
 ## Stage 4: Rewrite Titles and Summaries
 ### Goal
 Produce concise, accurate Chinese display titles and summaries.
 ### Rules
 - Keep `title_raw` and `summary_raw` unchanged.
 - Write display fields to `title` and `summary`.
 - Preserve brand names, model names, API names, and common technical acronyms in English.
 - Translate the rest into natural Chinese.
 - Avoid marketing words such as "heavyweight", "explosive", or "just now" unless they are factual and necessary.
 - Summaries should be factual, concise, and usually 80-140 Chinese characters.
 - Do not add facts not present in the raw title or summary.
 - Do not write advice or commentary.
 ### Batch Strategy
 - Process 8-12 items per batch.
 - Allow limited parallel batches.
 - Retry a failed batch once.
 - Fall back per item or per batch if needed.
 ### Validation
 Check:
 - Non-empty title and summary.
 - No markdown links in title.
 - No URL in summary.
 - No `[N]` or reference markers.
 - No emoji.
 - Summary length under limit.
 - Key numbers, versions, and model names are preserved when present in raw input.
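The checks above could be expressed as a small validator like the one below. This is only a sketch: the problem codes, the 140-character default, the emoji ranges, and the number-matching heuristic are assumptions, and model-name preservation is not covered.

```python
import re

MD_LINK_RE = re.compile(r"\[[^\]]*\]\([^)]*\)")
URL_RE = re.compile(r"https?://\S+")
REF_RE = re.compile(r"\[(?:N|\d+)\]")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough heuristic, not exhaustive
NUMBER_RE = re.compile(r"\d+(?:\.\d+)*")


def validate_rewrite(item, max_summary_chars=140):
    """Return a list of problem codes; an empty list means the rewrite passed."""
    title = (item.get("title") or "").strip()
    summary = (item.get("summary") or "").strip()
    combined = title + " " + summary
    problems = []
    if not title:
        problems.append("empty_title")
    if not summary:
        problems.append("empty_summary")
    if MD_LINK_RE.search(title):
        problems.append("title_has_markdown_link")
    if URL_RE.search(summary):
        problems.append("summary_has_url")
    if REF_RE.search(combined):
        problems.append("reference_marker")
    if EMOJI_RE.search(combined):
        problems.append("emoji")
    if len(summary) > max_summary_chars:
        problems.append("summary_too_long")
    # Numbers and versions in the raw title should survive somewhere in the rewrite.
    for number in NUMBER_RE.findall(item.get("title_raw") or ""):
        if number not in combined:
            problems.append(f"missing_number:{number}")
    return problems
```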
 ### Artifacts
 ```text
 rewritten_items.json
 rewrite_llm_outputs.json
 stage4_rewrite_report.json
 ```
 ## Stage 5: Classify and Order
 ### Goal
 Place each item into a stable section and order items for readable scanning.
 ### Recommended Sections
 Use a fixed section whitelist:
 ```text
 模型与能力
 产品与应用
 开发与基础设施
 公司与资本
 政策与安全
 论文与研究
 观点与教程
 人物与动态
 ```
 Hide empty sections. Do not create dynamic section names.
 ### Classification Strategy
 Use a three-layer approach:
 1. Source hint mapping.
 2. Local rule fallback.
 3. LLM classification for ambiguous items only.
 Example alias mapping:
 ```text
 模型发布/更新 -> 模型与能力
 产品发布/更新 -> 产品与应用
 产品与工具 -> 产品与应用
 开发与工程 -> 开发与基础设施
 行业动态 -> 公司与资本
 行业与公司 -> 公司与资本
 论文研究 -> 论文与研究
 技巧与观点 -> 观点与教程
 人物与花絮 -> 人物与动态
 ```
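The whitelist and alias table above translate directly into a lookup. In this sketch, `resolve_section` is a hypothetical name, and returning `None` for unknown hints is an assumed way of handing the item to the rule or LLM layer.

```python
SECTIONS = [
    "模型与能力", "产品与应用", "开发与基础设施", "公司与资本",
    "政策与安全", "论文与研究", "观点与教程", "人物与动态",
]

SECTION_ALIASES = {
    "模型发布/更新": "模型与能力",
    "产品发布/更新": "产品与应用",
    "产品与工具": "产品与应用",
    "开发与工程": "开发与基础设施",
    "行业动态": "公司与资本",
    "行业与公司": "公司与资本",
    "论文研究": "论文与研究",
    "技巧与观点": "观点与教程",
    "人物与花絮": "人物与动态",
}


def resolve_section(section_hint):
    """Map a source hint to a whitelisted section, or None when it is ambiguous."""
    hint = (section_hint or "").strip()
    if hint in SECTIONS:
        return hint
    return SECTION_ALIASES.get(hint)
```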
 ### Ordering Strategy
 Do not let the LLM freely order all items. Use local scoring:
 ```text
 rank_score =
  source priority
  + official source bonus
  + primary source bonus
  + recency score
  + key metric bonus
  + duplicate source bonus
  - quality flag penalty
 ```
 Ordering is for readability only. It must not remove items.
 ### Artifacts
 ```text
 classified_items.json
 stage5_classify_order_report.json
 ```
 ## Stage 6: Guide and Daily Threads
 ### Goal
 Generate a concise top guide and a bottom "daily threads" section that helps readers understand the day's shape without turning the report into an investment memo.
 ### Replace Current Summary Style
 Do not use:
 ```text
 强信号 / 中信号 / 待验证
 ```
 This style feels too much like an industry rating or investment brief.
 Use:
 ```text
 导览
 今日脉络
 仍待确认, when needed
 ```
 ### Output Schema
 The LLM should output structured JSON, not Markdown:
 ```json
 {
  "theme": "One concise daily theme.",
  "threads": [
    {
      "title": "模型能力继续向长上下文、实时语音、多模态生成推进",
      "text": "MiniMax M3、Miso One、Ideogram v4.0 分别从长上下文解码、语音克隆和图像生成质量上更新能力边界。",
      "item_ids": ["item_1", "item_2", "item_3"],
      "kind": "thread"
    },
    {
      "title": "仍待确认",
      "text": "融资传闻、排行榜和单源爆料类消息需要等待官方或更多来源确认。",
      "item_ids": ["item_8"],
      "kind": "uncertain"
    }
  ]
 }
 ```
 ### Rules
 - Theme should be one paragraph under 120 Chinese characters.
 - The `threads` array should contain 2-4 entries.
 - Each thread must bind to existing `item_ids`.
 - Do not add facts absent from the item list.
 - Do not write advice.
 - Do not include reference numbers.
 - Do not include Markdown blockquote syntax. Stage 7 will render Markdown.
 ### Failure Behavior
 - If theme generation fails, omit the guide or use a conservative fallback.
 - If threads fail, omit `今日脉络`.
 - Drop any thread that references an invalid item ID.
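The failure behavior above might look like this in code. `clean_guide` is a hypothetical helper; treating a thread with no `item_ids` as invalid is an assumption derived from the binding rule.

```python
def clean_guide(guide, known_ids):
    """Keep only threads whose item_ids all exist; an empty theme becomes None."""
    theme = (guide.get("theme") or "").strip() or None
    threads = [
        thread
        for thread in guide.get("threads") or []
        if thread.get("item_ids") and all(i in known_ids for i in thread["item_ids"])
    ]
    return {"theme": theme, "threads": threads}
```

When `threads` comes back empty, the renderer can simply skip the `今日脉络` section.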
 ### Artifacts
 ```text
 guide_input.json
 guide_output.json
 stage6_guide_report.json
 ```
 ## Stage 7: Assemble and Validate Markdown
 ### Goal
 Render final Markdown deterministically and validate it before publishing.
 ### Recommended Structure
 ```markdown
 ## 导览
 > 一句话主线。
 ## 模型与能力
 **1. 新闻标题**
 > 新闻摘要。[来源 ↗](https://example.com)
 ## 今日脉络
 - **主题**
  说明...
 ```
 ### Rendering Rules
 - Render Markdown in code only.
 - Use global continuous numbering.
 - Hide empty sections.
 - Add blockquote syntax for the guide in code.
 - Strip any leading `>` from LLM-provided theme text before rendering.
 - Use source links consistently:
 ```markdown
 [OpenAI: Blog ↗](https://example.com)
 ```
 If the URL is unavailable, render the source label without a link.
 ### Auto-fixes
 - Remove `> >`.
 - Remove `[N]` and numeric reference markers.
 - Remove code fences from guide/thread text.
 - Normalize extra blank lines.
 - Add missing Chinese punctuation to summaries.
 - Remove `主线判断:` prefixes if present.
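Several of these auto-fixes reduce to a few regular expressions. The sketch below is illustrative only: the function names are hypothetical, and the missing-punctuation fix is omitted.

```python
import re

FENCE = "`" * 3  # built indirectly so the example does not contain a literal fence


def clean_guide_text(text):
    """Strip fences, leading blockquote markers, the 主线判断 prefix, and reference markers."""
    text = re.sub(re.escape(FENCE) + r"\w*", "", text)
    text = re.sub(r"^\s*(?:>\s*)+", "", text)
    text = re.sub(r"^主线判断[:：]\s*", "", text)
    text = re.sub(r"\[(?:N|\d+)\]", "", text)
    return text.strip()


def render_guide(text):
    """Add the blockquote marker in code, so a doubled marker cannot appear."""
    return "## 导览\n> " + clean_guide_text(text)


def normalize_blank_lines(markdown):
    """Collapse runs of three or more newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown)
```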
 ### Blocking Checks
 Block publish or downgrade to draft when:
 - Item count is zero.
 - No sections are rendered.
 - Markdown is abnormally short.
 - Section name is outside the whitelist.
 - JSON fragments remain in Markdown.
 - Link formatting is broadly broken.
 - Forbidden advisory language appears in guide/thread text.
 ### Artifacts
 ```text
 blog_markdown.md
 stage7_markdown_report.json
 ```
 ## Stage 8: Publish and Deliver
 ### Goal
 Publish only validated Markdown, verify the public page, and make the operation idempotent and recoverable.
 ### Modes
 ```text
 dry-run
 draft
 publish
 ```
 ### Requirements
 - Do not publish when Stage 7 has blocking errors.
 - Use a deterministic slug such as `ai-YYYY-MM-DD`.
 - Check whether the slug already exists before creating a new post.
 - Support existence strategies:
  - `skip`
  - `update-draft`
  - `replace`
  - `republish`
 - Verify the public URL with retries.
 - Preserve Markdown and reports when publishing fails.
 - Support publishing from an existing run directory.
 ### Artifacts
 ```text
 stage8_publish_report.json
 run_report.json
 ```
 ## Run Directory
 Every run should write to an isolated directory:
 ```text
 runs/2026-06-04/
  source_results.json
  raw_items.json
  stage0_collect_report.json
  normalized_items.json
  stage1_normalize_report.json
  deduped_items.json
  stage2_dedupe_report.json
  semantic_dedup_output.json
  stage3_semantic_dedup_report.json
  rewritten_items.json
  stage4_rewrite_report.json
  classified_items.json
  stage5_classify_order_report.json
  guide_output.json
  stage6_guide_report.json
  blog_markdown.md
  stage7_markdown_report.json
  stage8_publish_report.json
  run_report.json
 ```
 This makes the pipeline replayable and debuggable.
 ## CLI
 Provide agent-friendly commands:
 ```bash
 ai-daily-report run --date today --mode publish
 ai-daily-report run --date today --mode dry-run
 ai-daily-report run --date 2026-06-04 --mode draft
 ai-daily-report replay --run-id 2026-06-04 --from-stage 4
 ai-daily-report publish --from-run 2026-06-04
 ai-daily-report status --date 2026-06-04
 ```
 The current cron can keep invoking the compatibility script, which should delegate to the CLI.
 ## Skill Strategy
 Create or update an `ai-daily-report` skill for Hermes/OpenClaw. The skill should not contain business logic. It should provide:
 - How to run daily generation.
 - How to dry-run.
 - How to replay from an existing run.
 - How to publish already generated Markdown.
 - How to diagnose source, LLM, Markdown, or publish failures.
 - How to add a new RSS source.
 - How to adjust output style without breaking the pipeline.
 Suggested skill references:
 ```text
 skill/references/sources.md
 skill/references/output-style.md
 skill/references/troubleshooting.md
 skill/references/llm-config.md
 ```
 ## Testing
 Add fixtures and tests for:
 - AI HOT sample parsing.
 - RSS parsing.
 - Juya `content:encoded` parsing.
 - URL canonicalization.
 - Title normalization.
 - Deterministic deduplication.
 - LLM JSON schema validation.
 - Rewrite output validation.
 - Section alias mapping.
 - Markdown rendering.
 - Markdown validation.
 - Publish dry-run behavior.
 Start with local fixture tests. They will give most of the stability benefit without needing live network calls.
 ## Migration Plan
 ### Phase 1: Stabilize Current Script
 - Add run directories.
 - Add SourceResult and stage reports.
 - Add URL canonicalization.
 - Replace risky Stage 0 dedupe with hard dedup.
 - Add Markdown validation and auto-fixes.
 ### Phase 2: Improve Quality
 - Add semantic dedup schema and safety checks.
 - Batch rewrite title and summary.
 - Add section alias mapping and rule-first classification.
 - Replace the current summary with `今日脉络`.
 ### Phase 3: Modularize
 - Extract modules under `ai_daily_report/`.
 - Add CLI.
 - Keep old script as compatibility entrypoint.
 - Add fixture tests.
 ### Phase 4: Skill Integration
 - Update `skill/SKILL.md`.
 - Add references for sources, style, troubleshooting, and LLM config.
 - Make Hermes/OpenClaw call the CLI.
 ## Success Criteria
 The optimized pipeline should satisfy:
 - A usable Markdown report is generated whenever enough source data exists.
 - Optional source failures degrade the run but do not stop it.
 - LLM failures degrade individual stages but do not destroy the whole report.
 - No non-duplicate item is removed by importance or editorial selection.
 - Every removed duplicate has a reason.
 - Every stage writes inspectable artifacts.
 - A failed publish can be retried from an existing run.
 - Agents can run, diagnose, replay, and publish via stable commands.
--- a/docs/plans/2026-06-04-local-dry-run-foundation.md
+++ b/docs/plans/2026-06-04-local-dry-run-foundation.md
@@ -0,0 +1,159 @@
 # Local Dry-Run Foundation Implementation Plan
 > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
 **Goal:** Make the current pipeline testable on a local machine without Hermes credentials, blog credentials, or live LLM calls.
 **Architecture:** Keep the existing single script as the compatibility entrypoint. Add small, tested helpers for project `.env` loading, dry-run token behavior, and mock LLM responses. This creates a safe base for later Stage 0-8 modularization.
 **Tech Stack:** Python standard library, `unittest`, current `script/ai_daily_blog_pipeline.py`.
 ---
 ### Task 1: Add Local `.env` Loading
 **Files:**
 - Modify: `script/ai_daily_blog_pipeline.py`
 - Create: `tests/test_env_loading.py`
 **Step 1: Write the failing test**
 Test that `load_env()` reads project-root `.env` values when Hermes env is absent, and that real process environment variables override file values.
 **Step 2: Run test to verify it fails**
 Run: `python -m unittest tests.test_env_loading -v`
 Expected: FAIL because the script currently only reads `~/.hermes/.env`.
 **Step 3: Implement minimal code**
 Add a helper to parse env files and update `load_env()` to read:
 1. Project `.env`
 2. `~/.hermes/.env`
 3. process environment
 Later sources override earlier ones.
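A minimal sketch of the layered loading described in this step. The signature is hypothetical (the real `load_env()` may take no arguments), and the parser handles only plain `KEY=value` lines.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=value lines; a missing file yields an empty dict."""
    values = {}
    file_path = Path(path)
    if not file_path.is_file():
        return values
    for line in file_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env(project_env, hermes_env, process_env=None):
    """Merge the three layers; later layers override earlier ones."""
    env = {}
    env.update(parse_env_file(project_env))
    env.update(parse_env_file(hermes_env))
    env.update(os.environ if process_env is None else process_env)
    return env
```

Passing `process_env` explicitly keeps the test independent of the machine's real environment.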
 **Step 4: Run test to verify it passes**
 Run: `python -m unittest tests.test_env_loading -v`
 Expected: PASS.
 ### Task 2: Let Dry-Run Skip Blog Token Requirement
 **Files:**
 - Modify: `script/ai_daily_blog_pipeline.py`
 - Create: `tests/test_dry_run_config.py`
 **Step 1: Write the failing test**
 Extract small helpers such as `is_dry_run(env)` and `require_blog_token(env)`, then test:
 - `AI_DAILY_DRY_RUN=1` does not require `BLOG_SERVICE_TOKEN`.
 - normal publish mode still requires a token.
 **Step 2: Run test to verify it fails**
 Run: `python -m unittest tests.test_dry_run_config -v`
 Expected: FAIL because no helper exists and `main()` checks token before dry-run.
 **Step 3: Implement minimal code**
 Move dry-run detection before token validation in `main()`.
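The two helpers named in Step 1 might look like the following. This is a sketch under assumptions: accepting `true` and `yes` in addition to `1`, and raising `RuntimeError`, are illustrative choices rather than documented behavior.

```python
def is_dry_run(env):
    """Treat AI_DAILY_DRY_RUN=1 (and, by assumption, true/yes) as dry-run mode."""
    return str(env.get("AI_DAILY_DRY_RUN", "")).strip().lower() in {"1", "true", "yes"}


def require_blog_token(env):
    """Return the blog token, or None in dry-run mode where no token is needed."""
    if is_dry_run(env):
        return None
    token = env.get("BLOG_SERVICE_TOKEN")
    if not token:
        raise RuntimeError("BLOG_SERVICE_TOKEN is required outside dry-run mode")
    return token
```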
 **Step 4: Run test to verify it passes**
 Run: `python -m unittest tests.test_dry_run_config -v`
 Expected: PASS.
 ### Task 3: Add Mock LLM Mode
 **Files:**
 - Modify: `script/ai_daily_blog_pipeline.py`
 - Create: `tests/test_mock_llm.py`
 **Step 1: Write the failing test**
 Test that `llm_call(prompt, {"AI_DAILY_LLM_MODE": "mock"})` returns valid JSON for:
 - semantic dedup prompts
 - summary rewrite prompts
 - classify prompts
 Also test that guide generation can get a non-empty mock response.
 **Step 2: Run test to verify it fails**
 Run: `python -m unittest tests.test_mock_llm -v`
 Expected: FAIL because mock mode does not exist.
 **Step 3: Implement minimal code**
 Add `AI_DAILY_LLM_MODE=mock` support in `llm_call()`.
 **Step 4: Run test to verify it passes**
 Run: `python -m unittest tests.test_mock_llm -v`
 Expected: PASS.
 ### Task 4: Add Markdown Smoke Test
 **Files:**
 - Create: `tests/test_markdown_rendering.py`
 - Modify: `script/ai_daily_blog_pipeline.py` only if necessary.
 **Step 1: Write the failing or characterization test**
 Test that `blog_markdown()` renders:
 - `## 导览`
 - at least one section
 - source links
 - no `> >`
 - no `[N]`
 **Step 2: Run test**
 Run: `python -m unittest tests.test_markdown_rendering -v`
 Expected: If it already passes, keep it as characterization coverage. If it fails because of `> >`, implement a focused fix.
 **Step 3: Implement minimal fix if needed**
 Strip leading `>` from guide text before adding blockquote syntax.
 **Step 4: Run test to verify it passes**
 Run: `python -m unittest tests.test_markdown_rendering -v`
 Expected: PASS.
 ### Task 5: Run Full Verification
 **Files:**
 - No new files.
 **Step 1: Run unit tests**
 Run: `python -m unittest discover -s tests -v`
 Expected: PASS.
 **Step 2: Run compile check**
 Run: `python -m py_compile script/ai_daily_blog_pipeline.py`
 Expected: exit code 0.
 **Step 3: Check git status**
 Run: `git status --short`
 Expected: only intended files are modified or added.
--- a/docs/plans/2026-06-10-ai-daily-full-chain-optimization.md
+++ b/docs/plans/2026-06-10-ai-daily-full-chain-optimization.md
@@ -0,0 +1,130 @@
 # AI Daily Full Chain Optimization Implementation Plan
 > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
 **Goal:** Add the first quality safety layer for the AI daily report pipeline: semantic candidate recall, quality gate reporting, stage snapshots, and effective pipeline configuration.
 **Architecture:** Keep the existing stage functions and add a rule-based Stage 2.8 between cross-day URL dedupe and LLM semantic dedupe. Quality gate stays deterministic and report-only for dry-run visibility, while publish blocking can consume its `blocking_errors` through the existing Stage 7/8 guard path. Runner persists stage artifacts from the pipeline result without changing generated content.
 **Tech Stack:** Python standard library, `unittest`, existing dataclass models and pipeline modules.
 ---
 ### Task 1: Make Pipeline Config Effective
 **Files:**
 - Modify: `ai_daily_report/pipeline.py`
 - Modify: `ai_daily_report/runner.py`
 - Test: `tests/test_stage0_to_4_pipeline.py`
 - Test: `tests/test_runner.py`
 **Step 1: Write failing tests**
 Use existing tests that call `run_stage0_to_stage4(..., semantic_dedup_max_deletion_ratio=0.1, rewrite_batch_size=1)` and expect Stage 4 `batch_count == 3`.
 **Step 2: Run tests to verify failure**
 Run: `python -m pytest tests/test_stage0_to_4_pipeline.py tests/test_runner.py -q`
 Expected: failure from unexpected keyword arguments or ignored config.
 **Step 3: Implement minimal code**
 Thread `semantic_dedup_max_deletion_ratio` into `semantic_dedup_items()` and `rewrite_batch_size` into `rewrite_items()`. Read both from `pipeline.json` in `runner.py`.
 **Step 4: Verify**
 Run the same tests and expect pass.
 ### Task 2: Add Stage 2.8 Candidate Recall
 **Files:**
 - Create: `ai_daily_report/candidate_recall.py`
 - Modify: `ai_daily_report/pipeline.py`
 - Test: `tests/test_candidate_recall.py`
 - Test: `tests/test_stage0_to_4_pipeline.py`
 **Step 1: Write failing tests**
 Add tests proving related Claude Fable/Mythos items are recalled even when Stage 2 title candidates are empty, while unrelated Gemini/Gemma items are not grouped by company name alone.
 **Step 2: Run tests to verify failure**
 Run: `python -m pytest tests/test_candidate_recall.py tests/test_stage0_to_4_pipeline.py -q`
 Expected: import failure for the new module or zero recalled candidates.
 **Step 3: Implement minimal code**
 Use deterministic title similarity, token Jaccard, summary Jaccard, and strong entity overlap to produce candidate groups with `item_ids`, `reason`, `score`, and evidence fields.
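As a rough illustration of the Jaccard part of this step, the sketch below pairs items whose title or summary token sets overlap enough. The default thresholds mirror the `title_jaccard_threshold`, `summary_jaccard_threshold`, and `max_pairs` values from the pipeline config; the tokenizer, the either-threshold rule, and the function names are assumptions, and entity overlap is not modeled.

```python
import re
from itertools import combinations

TOKEN_RE = re.compile(r"[a-z0-9]+(?:[.\-][a-z0-9]+)*|[\u4e00-\u9fff]")


def tokens(text):
    """Lowercased ASCII words (versions kept whole) plus single CJK characters."""
    return set(TOKEN_RE.findall((text or "").lower()))


def jaccard(left, right):
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def recall_candidates(items, title_jaccard_threshold=0.25,
                      summary_jaccard_threshold=0.18, max_pairs=80):
    """Return candidate pairs with item_ids, reason, and score, best first."""
    candidates = []
    for left, right in combinations(items, 2):
        title_score = jaccard(tokens(left.get("title")), tokens(right.get("title")))
        summary_score = jaccard(tokens(left.get("summary")), tokens(right.get("summary")))
        if title_score >= title_jaccard_threshold:
            reason = "title_jaccard"
        elif summary_score >= summary_jaccard_threshold:
            reason = "summary_jaccard"
        else:
            continue
        candidates.append({
            "item_ids": [left["id"], right["id"]],
            "reason": reason,
            "score": max(title_score, summary_score),
        })
    candidates.sort(key=lambda candidate: candidate["score"], reverse=True)
    return candidates[:max_pairs]
```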
 **Step 4: Verify**
 Run targeted tests and expect pass.
 ### Task 3: Add Quality Gate Reporting
 **Files:**
 - Create: `ai_daily_report/quality_gate.py`
 - Modify: `ai_daily_report/pipeline.py`
 - Test: `tests/test_quality_gate.py`
 **Step 1: Write failing tests**
 Add tests that expect warnings when Stage 3 has zero candidates for a large item set, when an enabled source fails, and when a required source fails.
 **Step 2: Run tests to verify failure**
 Run: `python -m pytest tests/test_quality_gate.py -q`
 Expected: import failure for the new module.
 **Step 3: Implement minimal code**
 Return a report with `warnings`, `blocking_errors`, `source_failures`, and `quality_gate_failed`. Run the gate after Stage 7 and merge its blocking errors into the Stage 7 result before publish.
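A hedged sketch of what such a report builder could look like. The input shapes, the `large_item_threshold` default, and the choice to make a failed required source blocking (suggested by the `block` failure policy in the source config) are assumptions, not the module's actual behavior.

```python
def quality_gate(source_results, item_count, candidate_count, large_item_threshold=30):
    """Build a deterministic report; only required-source failures block in this sketch."""
    warnings, blocking_errors, source_failures = [], [], []
    for result in source_results:
        if result.get("ok"):
            continue
        name = result.get("source", "unknown")
        source_failures.append(name)
        warnings.append(f"source failed: {name}")
        if result.get("required"):
            blocking_errors.append(f"required source failed: {name}")
    if item_count >= large_item_threshold and candidate_count == 0:
        warnings.append("no dedupe candidates for a large item set")
    return {
        "warnings": warnings,
        "blocking_errors": blocking_errors,
        "source_failures": source_failures,
        "quality_gate_failed": bool(blocking_errors),
    }
```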
 **Step 4: Verify**
 Run quality gate and publish-path tests.
 ### Task 4: Persist Stage Snapshots
 **Files:**
 - Modify: `ai_daily_report/pipeline.py`
 - Modify: `ai_daily_report/runner.py`
 - Test: `tests/test_runner.py`
 **Step 1: Write failing tests**
 Assert that a mock run writes `stage0_sources.json`, `stage1_items.json`, `stage2_items.json`, `stage2_5_items.json`, `stage2_8_candidates.json`, `stage3_items.json`, `stage4_items.json`, and `quality_gate.json`.
 **Step 2: Run tests to verify failure**
 Run: `python -m pytest tests/test_runner.py -q`
 Expected: snapshot files are missing.
 **Step 3: Implement minimal code**
 Have pipeline results carry an `artifacts` dict and have runner serialize the requested JSON files using the existing dataclass serializer.
 **Step 4: Verify**
 Run runner tests and inspect generated files through assertions.
 ### Task 5: Full Regression
 **Files:**
 - All touched files
 **Step 1: Run targeted tests**
 Run: `python -m pytest tests/test_candidate_recall.py tests/test_quality_gate.py tests/test_stage0_to_4_pipeline.py tests/test_runner.py -q`
 **Step 2: Run full test suite**
 Run: `python -m pytest -q`
 **Step 3: Fix regressions**
 Fix only issues caused by this change set.
--- a/script/ai_daily_blog_pipeline.py
+++ b/script/ai_daily_blog_pipeline.py
--- a/script/blog_markdown.md
+++ b/script/blog_markdown.md
@@ -1,198 +0,0 @@
 ## 导览
 > > 微软与OpenAI正式分家、Anthropic提交招股书、DeepSeek计划融500亿——AI行业正在从“联盟军”转向“诸侯争霸”。
 ## 模型发布/更新
 **1. Grok Imagine 1.5 预览版发布**
 > Grok Imagine 1.5 预览版即日起在 API 中上线，SpaceXAI 持续发力。[X：@cb_doge ↗](https://x.com/cb_doge/status/2062242490745594085)
 **2. MiniMax M3 1M token 解码加速 15.6 倍**
 > MiniMax M3 在 1M token 下解码加速 15.6 倍，FireworksAI_HQ 提供推理支持。[X：@MiniMax_AI ↗](https://x.com/MiniMax_AI/status/2062316914618388758)
 **3. Miso One 开源语音模型：8B 参数、110ms 延迟、一次语音克隆**
 > Miso One 发布 8B 参数开源语音模型，支持一次语音克隆（短样本），推理延迟 110ms，权重已开源，可自托管，API 即将推出，演示已上线。[X：@kimmonismus ↗](https://x.com/kimmonismus/status/2062210845308780639)
 **4. Ideogram v4.0 发布：2K 分辨率和 JSON 提示支持**
 > Ideogram v4.0 发布，原生 2K 分辨率，文字渲染出色，支持 JSON 提示词，可在 Krea 中体验。[X：@krea_ai ↗](https://x.com/krea_ai/status/2062227837130887567)
 ## 产品与工具
 **5. Meta 面向 WhatsApp Business 的 AI 智能体现已全球上线**
 > Meta 为 WhatsApp Business 推出的 AI 智能体面向全球商家开放，按模型 token 使用量收费。[TechCrunch ↗](https://techcrunch.com/2026/06/03/metas-ai-agent-for-whatsapp-business-is-now-available-globally)
 **6. NousResearch 发布 Hermes Agent 桌面应用公测版**
 > NousResearch 推出 Hermes Agent 桌面应用公测版。[X：@SiliconFlowAI ↗](https://x.com/SiliconFlowAI/status/2062042813852995899)
 **7. xAI Grok 语音模型上线 Vapi 平台**
 > xAI 的 Grok STT 和 TTS 语音模型登陆企业语音 AI 平台 Vapi，可用于构建自定义语音智能体。[X：@xai ↗](https://x.com/xai/status/2062209374039499178)
 **8. Grok 模型登陆 Cloudflare AI Gateway**
 > Grok 模型现已可在 Cloudflare AI Gateway 上试用。[X：@xai ↗](https://x.com/xai/status/2062294202625696081)
 **9. OpenShell v0.0.55 发布：新增 Vertex AI 推理支持**
 > OpenShell v0.0.55 发布，新增 Google Vertex AI 推理支持，改进策略可见性、Podman 检测和 GPU 沙箱行为。[X：@NVIDIAAI ↗](https://x.com/NVIDIAAI/status/2062210034109677665)
 **10. Replit 上线 SEO Agent 助应用被发现**
 > Replit 推出 SEO Agent，扫描应用并提供修复建议，帮助应用在网页和 AI 搜索中被发现。[X：@Replit ↗](https://x.com/Replit/status/2062211976995188871)
 **11. OpenClaw 2026.6.1 发布：新增 Windows 节点与技能工坊**
 > OpenClaw 2026.6.1 发布，新增原生 Windows 节点主机、技能工坊和工作板编排，支持 MiniMax M3。[X：@openclaw ↗](https://x.com/openclaw/status/2062288421406785710)
 **12. Reachy Mini 添加 MCP 工具**
 > Reachy Mini 推出公开 MCP canary Space，支持远程工具调用。[Hugging Face：Blog ↗](https://huggingface.co/blog/adding-mcp-tools-to-reachy-mini)
 **13. 刚刚，Meta Skill 来了**
 > GitHub 热门仓库 OpenSquilla 发布，代表 Meta Skill 新动向。[量子位 ↗](https://www.qbitai.com/2026/06/428335.html)
 ## 开发与工程
 **14. Qwen Cloud 全球 AI 黑客马拉松启动**
 > 首届 Qwen Cloud 全球 AI 黑客马拉松启动，5 大赛道，总奖金超 7 万美元（赛道冠军 1 万美元），Devpost 报名。[X：@alibaba_cloud ↗](https://x.com/alibaba_cloud/status/2062113338994172169)
 **15. 洪水韧性新篇章：Google 开源水文建模框架**
 > Google Research 开源基于 PyTorch 的水文建模框架，采用 Flood Hub 相同架构，允许各国气象部门在本地训练 AI 洪水预报模型。[Google Research：Blog ↗](https://research.google/blog/the-next-chapter-in-flood-resilience-open-sourcing-googles-hydrology-framework)
 **16. 文章：导致 Spark 在 Kubernetes 上 OOM 失败的两个错误配置**
 > 迁移 Spark 到 AKS 后，两个配置交互导致 OOM：spark.kubernetes.local.dirs.tmpfs 使 shuffle spill 改用 RAM 而非磁盘。[InfoQ AI ↗](https://www.infoq.com/articles/spark-oom-kubernetes-misconfigurations/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=AI%2C+ML+%26+Data+Engineering)
 ## 行业与公司
 **17. 微软与 OpenAI 分道扬镳——如今双方准备正面交锋**
 > 微软与 OpenAI 合作关系破裂，进入直接竞争。微软 AI 主管 Mustafa Suleyman 称微软需独立证明能力。[The Verge ↗](https://www.theverge.com/ai-artificial-intelligence/942242/microsoft-build-ai-agents-openai-competition)
 **18. 欧盟公布全面技术主权计划，推动芯片与 AI 自主发展**
 > 欧盟推出技术主权计划，扩大本土半导体、AI 和云计算供应链，减少对美亚依赖。[Bloomberg ↗](https://www.bloomberg.com/news/articles/2026-06-03/europe-unveils-sweeping-tech-sovereignty-plan-to-boost-chips-ai)
 **19. Sensor Tower：OpenAI 旗下 ChatGPT 月活已破 10 亿，史上最快**
 > Sensor Tower 估计 ChatGPT 月活于 2025 年 5 月突破 10 亿，增速史上最快；Claude 月活 5600 万，同比增 640%。[IT之家 ↗](https://www.ithome.com/0/959/083.htm)
 **20. 消息称 DeepSeek 首轮融资拟筹集 500 亿元，腾讯、宁德时代等参投**
 > DeepSeek 首轮拟融资 500 亿元，投后估值 3500-4000 亿元。创始人梁文峰出资 200 亿，腾讯拟投 100 亿，宁德时代 50 亿。[IT之家 ↗](https://www.ithome.com/0/959/249.htm)
 **21. Suno 完成 4 亿美元 D 轮融资**
 > Suno 完成 4 亿美元 D 轮融资，估值 54 亿美元，致力于让更多人体验音乐制作。[X：@suno ↗](https://x.com/suno/status/2062183524887675243)
 **22. 宏利香港与阿里云达成 AI 战略合作**
 > 宏利香港与阿里云建立战略合作，共建负责任 AI 创新框架，加速 AI 部署。[X：@alibaba_cloud ↗](https://x.com/alibaba_cloud/status/2062006591377829922)
 **23. 优步每月 1,500 美元的 AI 使用上限为 AI 工具定价提供参考**
 > 优步将 AI 工具月使用上限设为 1500 美元，为行业 AI 定价提供参考信号。[Simon Willison ↗](https://simonwillison.net/2026/Jun/3/uber-caps-usage)
 **24. 世界模型榜首易主！跨维智能登顶 WorldArena**
 > 跨维智能在 WorldArena 上登顶，成为世界模型新榜首。[量子位 ↗](https://www.qbitai.com/2026/06/428435.html)
 **25. 刚刚，Anthropic 提交了招股书！**
 > Anthropic 已提交招股书，预计最快 Q4 上市。[量子位 ↗](https://www.qbitai.com/2026/06/428407.html)
 ## 论文与研究
 **26. 斯坦福大学法学院研究：人工智能的表现优于法学教授**
 > 斯坦福大学法学院研究显示，AI 表现优于法学教授，该结果在 Hacker News 获 104 个 Points。[law.stanford.edu ↗](https://law.stanford.edu/press/ai-outperforms-law-professors-in-stanford-law-study)
 **27. NVIDIA Research 在 CVPR 2026 发表三篇论文：规模化训练实现抓取、自动驾驶与智能体泛化**
 > NVIDIA Research 在 CVPR 2026 发表三篇论文：零样本抓取模型 GraspGen-X、自动驾驶 LCDrive、具身智能体 NitroGen，均基于大规模训练。[blogs.nvidia.com：Blog ↗](https://blogs.nvidia.com/blog/cvpr-research-grasping-driving-agent-training)
 **28. Anthropic 分析 832 个 AI 恶意账户：中高风险攻击者半年从 33% 跃至 56%**
 > Anthropic 分析 832 个被封恶意账户，67.3% 使用 AI 编写恶意软件，中高风险占比半年内从 33% 升至 56%，传统威胁评估失效。[Anthropic ↗](https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack)
 **29. 微软研究：装瓶厂 AI 从聊天到决策**
 > 微软在中西部装瓶厂试点三个月显示，AI 超越聊天进入决策领域，需应对真实风险和可靠性要求。[X：@MSFTResearch ↗](https://x.com/MSFTResearch/status/2062204914223169635)
 **30. 世界模型的功能分类**
 > World Labs 与李飞飞发文梳理“世界模型”概念，基于 POMDP 框架分类，指出当前所谓世界模型本质是同一循环的不同投影（如渲染器）。[X：@drfeifei ↗](https://x.com/drfeifei/status/2062247238143996275)
 **31. 从看懂世界到做对动作，卧安机器人 OneModel 1.7 用一条「隐式通路」打通了具身智能的关键断层**
 > 卧安机器人 OneModel 1.7 通过隐式通路在潜在空间完成信息传导，打通具身智能关键断层。[量子位 ↗](https://www.qbitai.com/2026/06/428703.html)
 ## 人物与花絮
 **32. 黄仁勋与纳德拉共议智能体 AI 时代**
 > 黄仁勋与纳德拉在台北 MSBuild 同台，展示 NVIDIA 与微软从 Windows 到 AI 工厂的协作。[X：@nvidia ↗](https://x.com/nvidia/status/2062228974273716457)
 **33. Satya Nadella 谈微软 Build 大会主旨演讲**
 > Satya Nadella 在 Microsoft Build 主旨演讲，强调共同构建前沿智能生态系统。[X：@satyanadella ↗](https://x.com/satyanadella/status/2062022060176801826)
 **34. Karpathy 的 llm-wiki 项目获超五千星**
 > @karpathy 的 llm-wiki 项目几周内获 5000+ 星，理念是让 LLM 构建并维护可持续进化的维基知识库。[X：@SiliconFlowAI ↗](https://x.com/SiliconFlowAI/status/2062054848762450324)
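 ## Opinions and Tutorials
 **35. A complete collection of practical agent-engineering tips**
 > @mvanhorn shares an agent-engineering methodology: humans set the direction and agents execute, with plan.md at the core to constrain behavior; the post sums up 22 practical tips and a complete tool stack. [X：@shao__meng ↗](https://x.com/shao__meng/status/2061974983094755575)
 **36. Anthropic uses Claude to power self-service data analytics**
 > Anthropic uses Claude to automate 95% of business analytics queries at roughly 95% accuracy, using an agentic analytics stack to address three major error sources, including concept-entity ambiguity. [Claude：Blog ↗](https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude)
 **37. Direct preference optimization beyond chatbots**
 > Dharma-AI writes on the Hugging Face blog about the broad applications of direct preference optimization (DPO) beyond chatbots. [Hugging Face：Blog ↗](https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots)
 **38. Talk: Choosing your AI copilot: maximizing developer productivity**
 > Sepehr Khosravi discusses the evolution of developer-productivity tools, assesses the strengths of tools such as Cursor and Claude Code, and offers actionable tips for senior engineers. [InfoQ AI ↗](https://www.infoq.com/presentations/choosing-ai-copilot/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=AI%2C+ML+%26+Data+Engineering)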
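 ## Summary
 **Strong signals**
 - **Microsoft and OpenAI part ways and begin competing head-on**
  With the partnership over, Microsoft AI chief Mustafa Suleyman said the company must prove its capabilities independently; this means Microsoft will no longer rely on OpenAI's models and will bet fully on in-house development, while OpenAI loses its largest cloud ally.
 - **Anthropic files its prospectus, expected to list as early as Q4**
  This marks the safety-focused AI company's formal entry into capital markets, where it will compete with OpenAI for investor attention; Claude's 640% year-over-year growth in monthly active users also underpins its valuation.
 - **ChatGPT passes 1 billion monthly active users, becoming the fastest-growing app in history**
  Sensor Tower data shows ChatGPT reached this milestone in May 2025, while Claude has 56 million monthly active users; the gap in user stickiness between the two leading consumer AI apps is widening.
 **Medium signals**
 - **Miso One releases an 8B open-source voice model with one-shot voice cloning and only 110 ms latency**
  The weights are open and self-hostable, which lowers the barrier to real-time voice cloning from proprietary APIs to personal deployment and may speed adoption of voice interaction among developers.
 - **EU unveils a comprehensive technology-sovereignty plan to advance chip and AI autonomy**
  The plan expands domestic semiconductor, AI, and cloud-computing supply chains with the goal of reducing dependence on the US and Asia; it will materially affect compliance, market access, and data sovereignty for AI companies worldwide.
 **To be verified**
 - **DeepSeek seeks to raise 50 billion yuan in its first funding round, with Tencent and CATL participating**
  The post-money valuation is as high as 350-400 billion yuan, but the funding report comes from IT之家 and has not been officially confirmed. Whether an AI round of this size can close smoothly in the domestic market remains uncertain.
 - **跨维智能 tops the WorldArena world-model leaderboard**
  WorldArena's authority as a benchmark has not yet been widely validated, and the "world model" concept itself lacks a unified standard; it remains to be seen whether an independent third party reproduces its capabilities.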
--- a/script/run_meta.json
+++ b/script/run_meta.json
@@ -1,35 +0,0 @@
 {
  "date": "2026-06-04",
  "slug": "ai-2026-06-04",
  "blog_url": "https://blog.ephron.ren/posts/ai-2026-06-04",
  "public_ok": true,
  "errors": [
    "橘鸦AI早报(重试): TimeoutError"
  ],
  "aihot_sections": [
    "模型发布/更新",
    "产品发布/更新",
    "行业动态",
    "论文研究",
    "技巧与观点"
  ],
  "raw_item_count": 39,
  "stage0_count": 39,
  "final_item_count": 38,
  "has_juya": false,
  "source_counts": {
    "AI HOT": 32,
    "InfoQ AI": 2,
    "MIT科技评论AI": 0,
    "量子位": 5,
    "橘鸦AI早报": 0
  },
  "featured_titles": [
    "Grok Imagine 1.5 预览版发布",
    "MiniMax M3 1M token 解码加速 15.6 倍",
    "Miso One 开源语音模型：8B 参数、110ms 延迟、一次语音克隆",
    "Ideogram v4.0 发布：2K 分辨率和 JSON 提示支持",
    "Meta 面向 WhatsApp Business 的 AI 智能体现已全球上线",
    "NousResearch 发布 Hermes Agent 桌面应用公测版"
  ]
 }
--- a/scripts/generate_ops_docs.py
+++ b/scripts/generate_ops_docs.py
@@ -0,0 +1,41 @@
 #!/usr/bin/env python3
 from __future__ import annotations
 import json
 from pathlib import Path
 ROOT = Path(__file__).resolve().parents[1]
 PIPELINE = json.loads((ROOT / "config" / "pipeline.json").read_text(encoding="utf-8"))
 SOURCES = json.loads((ROOT / "config" / "sources.json").read_text(encoding="utf-8"))
 DOC = ROOT / "docs" / "ops-thresholds.generated.md"
 def main() -> int:
    quality = PIPELINE.get("quality_gate", {})
    recall = PIPELINE.get("semantic_candidate_recall", {})
    lines = [
        "# AI日报运维阈值（自动生成）",
        "",
        "> 由 `scripts/generate_ops_docs.py` 从 `config/pipeline.json` 和 `config/sources.json` 生成；不要手改本文件。",
        "",
        "## Quality Gate",
        "",
    ]
    for key in sorted(quality):
        lines.append(f"- `{key}`: `{quality[key]}`")
    lines.extend(["", "## Semantic Candidate Recall", ""])
    for key in sorted(recall):
        lines.append(f"- `{key}`: `{recall[key]}`")
    lines.extend(["", "## Sources", "", "| source | required | failure_policy | min_items | retries | timeout_seconds |", "|---|---:|---|---:|---:|---:|"])
    for source in SOURCES:
        lines.append(
            f"| {source['name']} | {source.get('required', False)} | {source.get('failure_policy', '')} | "
            f"{source.get('min_items', 0)} | {source.get('retries', 0)} | {source.get('timeout_seconds', '')} |"
        )
    DOC.write_text("\n".join(lines) + "\n", encoding="utf-8")
    print(DOC)
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
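The Sources table in `generate_ops_docs.py` depends on `.get()` defaults for any key a source entry omits. The sketch below isolates that row-rendering step so the defaults are easy to see. The helper name `render_source_rows` and the example entries are hypothetical; the real values live in `config/sources.json`.

```python
from __future__ import annotations

from typing import Any


def render_source_rows(sources: list[dict[str, Any]]) -> list[str]:
    """Render a Markdown table (header, alignment row, one row per source)."""
    rows = [
        "| source | required | failure_policy | min_items | retries | timeout_seconds |",
        "|---|---:|---|---:|---:|---:|",
    ]
    for source in sources:
        # Missing keys fall back to the same defaults the generator script uses.
        rows.append(
            f"| {source['name']} | {source.get('required', False)} | {source.get('failure_policy', '')} | "
            f"{source.get('min_items', 0)} | {source.get('retries', 0)} | {source.get('timeout_seconds', '')} |"
        )
    return rows


# Hypothetical entries for illustration only.
example_sources = [
    {"name": "AI HOT", "required": True, "failure_policy": "block", "min_items": 5, "retries": 2, "timeout_seconds": 20},
    {"name": "InfoQ AI"},
]
rows = render_source_rows(example_sources)
print("\n".join(rows))
```

An entry that sets only `name` still yields a complete row, with empty cells for `failure_policy` and `timeout_seconds`.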
--- a/skill/scripts/.gitkeep
+++ b/skill/scripts/.gitkeep
@@ -0,0 +1 @@
--- a/skill/scripts/run_daily_report.py
+++ b/skill/scripts/run_daily_report.py
@@ -0,0 +1,7 @@
 #!/usr/bin/env python3
 from ai_daily_report.cli import main
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/skill/scripts/weekly_audit.py
+++ b/skill/scripts/weekly_audit.py
@@ -0,0 +1,24 @@
 #!/usr/bin/env python3
 from __future__ import annotations
 import sys
 from pathlib import Path
 REPO_DIR = Path(__file__).resolve().parents[2]
 if str(REPO_DIR) not in sys.path:
    sys.path.insert(0, str(REPO_DIR))
 from ai_daily_report.audit import render_markdown, summarize_reports
 def main() -> int:
    out_dir = Path.home() / ".hermes" / "scripts" / "ai_morning_out"
    if not out_dir.exists():
        print("AI日报每周审计：未找到输出目录")
        return 1
    print(render_markdown(summarize_reports(out_dir, limit_days=7)))
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/tests/fixtures/.gitkeep
+++ b/tests/fixtures/.gitkeep
@@ -0,0 +1 @@
--- a/tests/fixtures/history_replay_2026_06_04_2026_06_10.json
+++ b/tests/fixtures/history_replay_2026_06_04_2026_06_10.json
@@ -0,0 +1,74 @@
 {
  "date_range": ["2026-06-04", "2026-06-10"],
  "purpose": "Historical replay fixtures for semantic candidate recall, Stage 3 merge_groups, and cross-day regression tests.",
  "events": [
    {
      "event_id": "claude-fable-mythos",
      "title": "Claude Fable/Mythos",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {
          "date": "2026-06-04",
          "id": "claude-fable-1",
          "source": "AI HOT",
          "title_raw": "Anthropic 推出 Claude Fable，用长篇叙事测试模型记忆",
          "summary_raw": "Claude Fable 面向长篇故事生成，强调角色一致性和上下文管理。",
          "url": "https://example.com/claude-fable"
        },
        {
          "date": "2026-06-05",
          "id": "claude-mythos-1",
          "source": "InfoQ AI",
          "title_raw": "Claude Mythos/Fable 项目扩展到多角色故事工作流",
          "summary_raw": "报道从创作流程角度补充 Anthropic Fable/Mythos 的应用场景。",
          "url": "https://example.com/claude-mythos"
        }
      ]
    },
    {
      "event_id": "openclaw-suno",
      "title": "OpenClaw/Suno",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-05", "id": "openclaw-suno-1", "source": "AI HOT", "title_raw": "OpenClaw 集成 Suno 音乐生成能力", "summary_raw": "OpenClaw 新版加入 Suno 风格的音乐生成入口。", "url": "https://example.com/openclaw-suno-a"},
        {"date": "2026-06-05", "id": "openclaw-suno-2", "source": "量子位", "title_raw": "Suno 能力进入 OpenClaw，开源智能体开始做音乐", "summary_raw": "量子位从开源智能体生态角度报道 OpenClaw 与 Suno 相关能力。", "url": "https://example.com/openclaw-suno-b"}
      ]
    },
    {
      "event_id": "magenta-realtime-2",
      "title": "Magenta RealTime 2",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-06", "id": "magenta-rt2-1", "source": "AI HOT", "title_raw": "Google 发布 Magenta RealTime 2，主打实时音乐生成", "summary_raw": "Magenta RealTime 2 降低延迟，支持互动式音乐创作。", "url": "https://example.com/magenta-rt2-a"},
        {"date": "2026-06-06", "id": "magenta-rt2-2", "source": "MIT科技评论AI", "title_raw": "Magenta RealTime 2 shows live AI music co-creation", "summary_raw": "MIT Tech Review explains the latency and interaction improvements in Magenta RealTime 2.", "url": "https://example.com/magenta-rt2-b"}
      ]
    },
    {
      "event_id": "open-code-review",
      "title": "Open Code Review",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-07", "id": "open-code-review-1", "source": "AI HOT", "title_raw": "Open Code Review 发布，开源代码审查智能体上线", "summary_raw": "Open Code Review 面向 GitHub/Gitea 仓库自动生成审查意见。", "url": "https://example.com/open-code-review-a"},
        {"date": "2026-06-07", "id": "open-code-review-2", "source": "InfoQ AI", "title_raw": "Open Code Review brings agentic review to open-source repos", "summary_raw": "InfoQ focuses on CI integration and review workflows for Open Code Review.", "url": "https://example.com/open-code-review-b"}
      ]
    },
    {
      "event_id": "openai-chip-talent-move",
      "title": "OpenAI 芯片成员跳槽",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-08", "id": "openai-chip-1", "source": "AI HOT", "title_raw": "OpenAI 定制芯片核心成员跳槽 Anthropic", "summary_raw": "OpenAI 芯片团队关键工程师在量产前离职加入 Anthropic。", "url": "https://example.com/openai-chip-a"},
        {"date": "2026-06-08", "id": "openai-chip-2", "source": "量子位", "title_raw": "OpenAI 芯片核心叛逃 Anthropic，就在量产前夜", "summary_raw": "量子位强调人才流动对 OpenAI 自研芯片进度的潜在影响。", "url": "https://example.com/openai-chip-b"}
      ]
    },
    {
      "event_id": "amap-abot",
      "title": "高德 ABot",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-10", "id": "amap-abot-1", "source": "AI HOT", "title_raw": "高德推出 ABot，地图入口接入智能体服务", "summary_raw": "高德 ABot 将出行、搜索和本地生活任务整合到地图智能体。", "url": "https://example.com/amap-abot-a"},
        {"date": "2026-06-10", "id": "amap-abot-2", "source": "橘鸦AI早报", "title_raw": "高德 ABot 上线，本地生活智能体开始进入地图", "summary_raw": "橘鸦从产品入口角度记录高德 ABot 的上线。", "url": "https://example.com/amap-abot-b"}
      ]
    }
  ]
 }
--- a/tests/test_audit.py
+++ b/tests/test_audit.py
@@ -0,0 +1,42 @@
 import json
 import tempfile
 import unittest
 from pathlib import Path
 from ai_daily_report.audit import render_markdown, summarize_reports
 class AuditTests(unittest.TestCase):
    def test_summarizes_weekly_metrics(self):
        with tempfile.TemporaryDirectory() as tmp:
            run_dir = Path(tmp) / "2026-06-10"
            run_dir.mkdir()
            (run_dir / "run_report.json").write_text(
                json.dumps(
                    {
                        "quality_gate": {
                            "source_failures": [{"source": "橘鸦AI早报"}],
                            "warnings": ["enabled_source_failed:橘鸦AI早报:error"],
                            "blocking_errors": [],
                        },
                        "stage2_8": {"candidate_group_count": 6},
                        "stage4": {"fallback_count": 2, "output_count": 20},
                        "stage5": {"output_count": 20},
                        "stage8": {"status": "ok", "slug": "ai-2026-06-10"},
                    }
                ),
                encoding="utf-8",
            )
            summary = summarize_reports(Path(tmp), limit_days=7)
            markdown = render_markdown(summary)
        self.assertEqual(summary["run_count"], 1)
        self.assertEqual(summary["totals"]["source_failures"], 1)
        self.assertEqual(summary["totals"]["duplicate_candidates"], 6)
        self.assertEqual(summary["totals"]["fallback_ratio"], 0.1)
        self.assertIn("AI日报每周自动审计报告", markdown)
 if __name__ == "__main__":
    unittest.main()
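The assertions in `test_audit.py` imply how the weekly totals could be derived from each `run_report.json`. The sketch below is one aggregation that satisfies them. It is an assumption about the behavior, not the actual `summarize_reports` implementation, and `summarize_totals` is a hypothetical name.

```python
def summarize_totals(reports):
    """Aggregate per-run report dicts into weekly totals (illustrative only)."""
    fallback = sum(r.get("stage4", {}).get("fallback_count", 0) for r in reports)
    output = sum(r.get("stage4", {}).get("output_count", 0) for r in reports)
    return {
        # One failure per entry in quality_gate.source_failures.
        "source_failures": sum(
            len(r.get("quality_gate", {}).get("source_failures", [])) for r in reports
        ),
        # Stage 2.8 candidate groups are counted as duplicate candidates.
        "duplicate_candidates": sum(
            r.get("stage2_8", {}).get("candidate_group_count", 0) for r in reports
        ),
        # Guard against division by zero when no items were produced.
        "fallback_ratio": fallback / output if output else 0.0,
    }


report = {
    "quality_gate": {"source_failures": [{"source": "橘鸦AI早报"}]},
    "stage2_8": {"candidate_group_count": 6},
    "stage4": {"fallback_count": 2, "output_count": 20},
}
totals = summarize_totals([report])
print(totals)
```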
--- a/tests/test_candidate_recall.py
+++ b/tests/test_candidate_recall.py
@@ -0,0 +1,79 @@
 import unittest
 from ai_daily_report.candidate_recall import recall_semantic_candidates
 from ai_daily_report.models import NewsItem
 from ai_daily_report.normalize import normalize_title
 def item(item_id, title, summary):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=normalize_title(title),
        summary_raw=summary,
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
    )
 class CandidateRecallTests(unittest.TestCase):
    def test_recalls_shared_event_entities_when_titles_are_not_stage2_similar(self):
        items = [
            item(
                "a",
                "Anthropic 被曝开发 Claude Fable",
                "Anthropic 正在开发名为 Claude Fable 和 Claude Mythos 的新产品。",
            ),
            item(
                "b",
                "Claude Mythos 进入内部测试",
                "Anthropic 的 Claude Mythos 与 Claude Fable 面向内容生成场景。",
            ),
            item(
                "c",
                "Gemini CLI 发布更新",
                "Google 为 Gemini CLI 增加新的开发者命令。",
            ),
        ]
        candidates, report = recall_semantic_candidates(items, existing_candidates=[])
        candidate_sets = [set(candidate["item_ids"]) for candidate in candidates]
        self.assertIn({"a", "b"}, candidate_sets)
        self.assertNotIn({"a", "c"}, candidate_sets)
        self.assertEqual(report["candidate_group_count"], 1)
        self.assertEqual(candidates[0]["reason"], "strong_entity_overlap")
    def test_does_not_group_same_company_different_products_without_event_overlap(self):
        items = [
            item("gemini", "Google 发布 Gemini CLI", "Google 发布面向开发者的 Gemini CLI 工具。"),
            item("gemma", "Google 开源 Gemma 3n", "Google 开源 Gemma 3n 模型，面向端侧部署。"),
        ]
        candidates, report = recall_semantic_candidates(items, existing_candidates=[])
        self.assertEqual(candidates, [])
        self.assertEqual(report["candidate_group_count"], 0)
    def test_preserves_existing_candidates_and_adds_new_ones_without_duplicates(self):
        items = [
            item("a", "Anthropic 发布 Claude Fable", "Claude Fable 与 Claude Mythos 同时曝光。"),
            item("b", "Claude Mythos 新功能曝光", "Claude Mythos 和 Claude Fable 是 Anthropic 新项目。"),
        ]
        candidates, report = recall_semantic_candidates(
            items,
            existing_candidates=[{"item_ids": ["a", "b"], "reason": "title_similarity"}],
        )
        self.assertEqual(len(candidates), 1)
        self.assertEqual(candidates[0]["reason"], "title_similarity")
        self.assertEqual(report["existing_candidate_group_count"], 1)
 if __name__ == "__main__":
    unittest.main()
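The tests in `test_candidate_recall.py` expect items to be grouped when they share product-level entities such as "Claude Fable", and not when they share only a company name. A minimal sketch of that idea follows. It assumes entities are multi-word Latin-script names, and the regex and the `recall_pairs` helper are hypothetical, not the logic in `ai_daily_report.candidate_recall`.

```python
import re
from itertools import combinations

# Assumption: an "entity" is two or more space-separated Latin tokens, the first
# capitalized (e.g. "Claude Fable", "Gemma 3n"). Single words such as "Google"
# are ignored on purpose so that a shared company name alone never groups items.
ENTITY_RE = re.compile(r"[A-Z][A-Za-z0-9]*(?: [A-Z0-9][A-Za-z0-9]*)+")


def extract_entities(*texts):
    """Collect entities from each text separately, so names never span two fields."""
    entities = set()
    for text in texts:
        entities.update(ENTITY_RE.findall(text))
    return entities


def recall_pairs(items, min_overlap=1):
    """Return candidate pairs whose entity sets share at least min_overlap names."""
    entities = {i["id"]: extract_entities(i["title"], i["summary"]) for i in items}
    pairs = []
    for left, right in combinations(items, 2):
        shared = entities[left["id"]] & entities[right["id"]]
        if len(shared) >= min_overlap:
            pairs.append(
                {
                    "item_ids": [left["id"], right["id"]],
                    "reason": "strong_entity_overlap",
                    "shared": sorted(shared),
                }
            )
    return pairs


items = [
    {"id": "a", "title": "Anthropic 被曝开发 Claude Fable", "summary": "Anthropic 正在开发名为 Claude Fable 和 Claude Mythos 的新产品。"},
    {"id": "b", "title": "Claude Mythos 进入内部测试", "summary": "Anthropic 的 Claude Mythos 与 Claude Fable 面向内容生成场景。"},
    {"id": "c", "title": "Gemini CLI 发布更新", "summary": "Google 为 Gemini CLI 增加新的开发者命令。"},
]
pairs = recall_pairs(items)
print(pairs)
```

Requiring multi-word names is a deliberately blunt filter. It also ignores entities written entirely in Chinese, so a real implementation would likely need a richer extractor.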
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -0,0 +1,47 @@
 import unittest
 from pathlib import Path
 from tempfile import TemporaryDirectory
 from ai_daily_report.cli import build_parser, main
 class CliTests(unittest.TestCase):
    def test_run_command_parses_date_and_mode(self):
        parser = build_parser()
        args = parser.parse_args(["run", "--date", "2026-06-04", "--mode", "dry-run", "--source-mode", "live", "--llm-mode", "live", "--sources-path", "config/sources.json"])
        self.assertEqual(args.command, "run")
        self.assertEqual(args.date, "2026-06-04")
        self.assertEqual(args.mode, "dry-run")
        self.assertEqual(args.source_mode, "live")
        self.assertEqual(args.llm_mode, "live")
        self.assertEqual(args.sources_path, "config/sources.json")
    def test_main_returns_zero_for_parseable_command(self):
        self.assertEqual(main(["run", "--date", "2026-06-04", "--mode", "dry-run"]), 0)
    def test_main_mock_run_writes_outputs(self):
        with TemporaryDirectory() as temp_dir:
            exit_code = main(
                [
                    "run",
                    "--date",
                    "2026-06-04",
                    "--mode",
                    "dry-run",
                    "--source-mode",
                    "mock",
                    "--llm-mode",
                    "mock",
                    "--out-dir",
                    temp_dir,
                ]
            )
            self.assertEqual(exit_code, 0)
            self.assertTrue((Path(temp_dir) / "2026-06-04" / "blog_markdown.md").exists())
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_clients.py
+++ b/tests/test_clients.py
@@ -0,0 +1,85 @@
 import json
 import unittest
 from email.message import Message
 from urllib.error import HTTPError
 from unittest.mock import patch
 from ai_daily_report.clients import FetchTextError, BlogApiClient, OpenAICompatibleClient, fetch_text
 class FakeResponse:
    status = 200
    def __init__(self, body):
        self.body = body
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return False
    def read(self):
        return self.body
 class ClientTests(unittest.TestCase):
    def test_fetch_text_decodes_response(self):
        with patch("urllib.request.urlopen", return_value=FakeResponse("ok".encode("utf-8"))):
            self.assertEqual(fetch_text("https://example.com", 1), "ok")
    def test_fetch_text_retries_transient_http_errors(self):
        responses = [
            HTTPError("https://example.com", 503, "Service Unavailable", {}, None),
            FakeResponse("ok".encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses) as urlopen:
            self.assertEqual(fetch_text("https://example.com", 1, retries=1, backoff_seconds=0), "ok")
        self.assertEqual(urlopen.call_count, 2)
    def test_fetch_text_does_not_retry_404_and_classifies_error(self):
        with patch(
            "urllib.request.urlopen",
            side_effect=HTTPError("https://example.com", 404, "Not Found", {}, None),
        ) as urlopen:
            with self.assertRaises(FetchTextError) as context:
                fetch_text("https://example.com", 1, retries=2, backoff_seconds=0)
        self.assertEqual(urlopen.call_count, 1)
        self.assertEqual(context.exception.error_type, "http_404")
        self.assertEqual(context.exception.http_status, 404)
    def test_openai_compatible_client_returns_message_content(self):
        body = json.dumps({"choices": [{"message": {"content": "hello"}}]}).encode("utf-8")
        with patch("urllib.request.urlopen", return_value=FakeResponse(body)):
            client = OpenAICompatibleClient(api_key="key", base_url="https://llm.example/v1", model="model")
            self.assertEqual(client.chat("prompt"), "hello")
    def test_blog_api_client_create_and_publish(self):
        responses = [
            FakeResponse(json.dumps({"slug": "ai-2026-06-04"}).encode("utf-8")),
            FakeResponse(json.dumps({"ok": True}).encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses):
            client = BlogApiClient(base_url="https://blog.example", token="token")
            self.assertEqual(client.create_post({"title": "t"})["slug"], "ai-2026-06-04")
            client.publish_post("ai-2026-06-04")
    def test_blog_api_client_slug_lookup_falls_back_to_query_endpoint(self):
        responses = [
            HTTPError("https://blog.example/api/service/posts/ai-2026-06-10", 404, "Not Found", Message(), None),
            FakeResponse(json.dumps({"items": [{"slug": "ai-2026-06-10", "content": "body"}]}).encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses) as urlopen:
            client = BlogApiClient(base_url="https://blog.example", token="token")
            post = client.get_post_by_slug("ai-2026-06-10")
        self.assertIsNotNone(post)
        assert post is not None
        self.assertEqual(post["slug"], "ai-2026-06-10")
        self.assertEqual(urlopen.call_count, 2)
 if __name__ == "__main__":
    unittest.main()
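The `fetch_text` tests describe a retry policy: a 503 is retried, while a 404 fails immediately with a classified error. The sketch below shows one way to express that policy, taking the opener as a parameter so it runs without network access. The set of transient status codes, the linear backoff, and the names `FetchError` and `fetch_with_retry` are assumptions, not the code in `ai_daily_report.clients`.

```python
import time
from email.message import Message
from urllib.error import HTTPError

# Assumption: these statuses are treated as transient and worth retrying.
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


class FetchError(Exception):
    """Carries a classified error type such as "http_404"."""

    def __init__(self, error_type, http_status=None):
        super().__init__(error_type)
        self.error_type = error_type
        self.http_status = http_status


def fetch_with_retry(url, opener, retries=0, backoff_seconds=1.0):
    """Call opener(url) -> bytes, retrying only transient HTTP errors."""
    attempt = 0
    while True:
        try:
            return opener(url).decode("utf-8")
        except HTTPError as exc:
            # Permanent errors, or an exhausted retry budget, are raised at once.
            if exc.code not in TRANSIENT_STATUSES or attempt >= retries:
                raise FetchError(f"http_{exc.code}", http_status=exc.code) from exc
        attempt += 1
        time.sleep(backoff_seconds * attempt)  # linear backoff; 0 disables waiting


calls = []


def flaky_opener(url):
    """Fails once with 503, then succeeds."""
    calls.append(url)
    if len(calls) == 1:
        raise HTTPError(url, 503, "Service Unavailable", Message(), None)
    return b"ok"


result = fetch_with_retry("https://example.com", flaky_opener, retries=1, backoff_seconds=0)
print(result, len(calls))
```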
--- a/tests/test_config_loading.py
+++ b/tests/test_config_loading.py
@@ -0,0 +1,33 @@
 import unittest
 from pathlib import Path
 from ai_daily_report.config import load_source_configs
 from ai_daily_report.sources.registry import get_source_fetcher
 ROOT = Path(__file__).resolve().parents[1]
 class ConfigLoadingTests(unittest.TestCase):
    def test_load_source_configs_from_json(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")
        self.assertGreaterEqual(len(configs), 5)
        self.assertEqual(configs[0].name, "AI HOT")
        self.assertEqual(configs[0].type, "aihot")
    def test_rss_configs_can_set_max_item_age_days(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")
        by_name = {config.name: config for config in configs}
        self.assertEqual(by_name["InfoQ AI"].max_item_age_days, 3)
    def test_all_configured_source_types_are_registered(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")
        for config in configs:
            self.assertTrue(callable(get_source_fetcher(config.type)))
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_dry_run_config.py
+++ b/tests/test_dry_run_config.py
@@ -0,0 +1,33 @@
 import importlib.util
 import unittest
 from pathlib import Path
 ROOT = Path(__file__).resolve().parents[1]
 SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"
 def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
 class DryRunConfigTests(unittest.TestCase):
    def test_dry_run_does_not_require_blog_token(self):
        module = load_pipeline_module()
        self.assertTrue(module.is_dry_run({"AI_DAILY_DRY_RUN": "1"}))
        self.assertFalse(module.requires_blog_token({"AI_DAILY_DRY_RUN": "1"}))
    def test_publish_mode_requires_blog_token(self):
        module = load_pipeline_module()
        self.assertFalse(module.is_dry_run({}))
        self.assertTrue(module.requires_blog_token({}))
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_env_config.py
+++ b/tests/test_env_config.py
@@ -0,0 +1,88 @@
 import unittest
 from pathlib import Path
 from tempfile import TemporaryDirectory
 from ai_daily_report.env import resolve_blog_token, resolve_llm_config
 class EnvConfigTests(unittest.TestCase):
    def test_resolve_llm_config_prefers_generic_values(self):
        config = resolve_llm_config(
            {
                "LLM_API_KEY": "generic-key",
                "LLM_BASE_URL": "https://generic.example/v1",
                "LLM_MODEL": "generic-model",
                "SUB2API_API_KEY": "sub-key",
                "SUB2API_BASE_URL": "https://sub.example/v1",
                "SUB2API_MODEL": "sub-model",
            }
        )
        self.assertEqual(
            config,
            {
                "api_key": "generic-key",
                "base_url": "https://generic.example/v1",
                "model": "generic-model",
            },
        )
    def test_resolve_llm_config_reports_missing_fields(self):
        with TemporaryDirectory() as temp_dir:
            with self.assertRaisesRegex(ValueError, "missing_llm_config: LLM_BASE_URL,LLM_MODEL"):
                resolve_llm_config({"LLM_API_KEY": "key"}, hermes_dir=Path(temp_dir))
    def test_resolve_llm_config_follows_hermes_provider_config(self):
        with TemporaryDirectory() as temp_dir:
            hermes_dir = Path(temp_dir)
            (hermes_dir / "config.yaml").write_text(
                """
 model:
  provider: sub2api
  default: findmini/gpt-5.5
  base_url: http://sub2api.example/v1
 """.strip(),
                encoding="utf-8",
            )
            (hermes_dir / ".env").write_text("SUB2API_API_KEY=hermes-key\n", encoding="utf-8")
            config = resolve_llm_config({}, hermes_dir=hermes_dir)
        self.assertEqual(
            config,
            {
                "api_key": "hermes-key",
                "base_url": "http://sub2api.example/v1",
                "model": "findmini/gpt-5.5",
            },
        )
    def test_resolve_llm_config_uses_hermes_auth_json_env_source(self):
        with TemporaryDirectory() as temp_dir:
            hermes_dir = Path(temp_dir)
            (hermes_dir / "config.yaml").write_text(
                """
 model:
  provider: sub2api
  default: findmini/gpt-5.5
  base_url: http://sub2api.example/v1
 """.strip(),
                encoding="utf-8",
            )
            (hermes_dir / "auth.json").write_text(
                '{"credential_pool": {"sub2api": [{"source": "env:SUB2API_API_KEY"}]}}',
                encoding="utf-8",
            )
            config = resolve_llm_config({"SUB2API_API_KEY": "auth-env-key"}, hermes_dir=hermes_dir)
        self.assertEqual(config["api_key"], "auth-env-key")
        self.assertEqual(config["base_url"], "http://sub2api.example/v1")
        self.assertEqual(config["model"], "findmini/gpt-5.5")
    def test_resolve_blog_token_uses_supported_names(self):
        self.assertEqual(resolve_blog_token({"EPHRON_SERVICE_TOKEN": "token"}), "token")
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_env_loading.py
+++ b/tests/test_env_loading.py
@@ -0,0 +1,39 @@
 import importlib.util
 import os
 import unittest
 from pathlib import Path
 from unittest.mock import patch
 ROOT = Path(__file__).resolve().parents[1]
 SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"
 def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
 class EnvLoadingTests(unittest.TestCase):
    def test_project_env_is_loaded_and_process_env_wins(self):
        module = load_pipeline_module()
        env_text = "LLM_MODEL=file-model\nLLM_BASE_URL=https://file.example/v1\n"
        with patch.object(module.Path, "home", return_value=ROOT / "missing-home"):
            with patch.dict(os.environ, {"LLM_MODEL": "process-model"}, clear=False):
                with patch.object(module, "PROJECT_ENV_PATH", ROOT / ".env.test"):
                    (ROOT / ".env.test").write_text(env_text, encoding="utf-8")
                    try:
                        env = module.load_env()
                    finally:
                        (ROOT / ".env.test").unlink(missing_ok=True)
        self.assertEqual(env["LLM_BASE_URL"], "https://file.example/v1")
        self.assertEqual(env["LLM_MODEL"], "process-model")
 if __name__ == "__main__":
    unittest.main()
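`test_project_env_is_loaded_and_process_env_wins` pins down a precedence rule: values from the project `.env` file are loaded, but the process environment overrides them. A minimal sketch of that merge order follows. The helpers `parse_env_text` and `merge_env` are hypothetical and do not reproduce `load_env` in `script/ai_daily_blog_pipeline.py`.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def merge_env(file_text, process_env):
    """File values form the base layer; process values are applied last and win."""
    merged = parse_env_text(file_text)
    merged.update(process_env)
    return merged


env = merge_env(
    "LLM_MODEL=file-model\nLLM_BASE_URL=https://file.example/v1\n",
    {"LLM_MODEL": "process-model"},
)
print(env)
```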
--- a/tests/test_generated_docs.py
+++ b/tests/test_generated_docs.py
@@ -0,0 +1,17 @@
 import subprocess
 import sys
 import unittest
 from pathlib import Path
 class GeneratedDocsTests(unittest.TestCase):
    def test_ops_threshold_doc_is_up_to_date(self):
        root = Path(__file__).resolve().parents[1]
        before = (root / "docs" / "ops-thresholds.generated.md").read_text(encoding="utf-8")
        subprocess.run([sys.executable, "scripts/generate_ops_docs.py"], cwd=root, check=True, capture_output=True, text=True)
        after = (root / "docs" / "ops-thresholds.generated.md").read_text(encoding="utf-8")
        self.assertEqual(after, before)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_history_replay_fixtures.py
+++ b/tests/test_history_replay_fixtures.py
@@ -0,0 +1,67 @@
 import json
 import unittest
 from pathlib import Path
 from ai_daily_report.candidate_recall import recall_semantic_candidates
 from ai_daily_report.models import NewsItem
 FIXTURE_PATH = Path(__file__).parent / "fixtures" / "history_replay_2026_06_04_2026_06_10.json"
 def make_item(raw, index):
    return NewsItem(
        id=raw["id"],
        source_group=raw["source"],
        source_label=raw["source"],
        source_role="primary" if raw["source"] == "AI HOT" else "supplement",
        source_priority=10 if raw["source"] == "AI HOT" else 50,
        title_raw=raw["title_raw"],
        title_norm=raw["title_raw"].lower(),
        summary_raw=raw["summary_raw"],
        url=raw["url"],
        canonical_url=raw["url"],
        published_at=raw["date"],
    )
 class HistoryReplayFixtureTests(unittest.TestCase):
    def test_fixture_covers_required_incidents(self):
        data = json.loads(FIXTURE_PATH.read_text(encoding="utf-8"))
        event_ids = {event["event_id"] for event in data["events"]}
        self.assertEqual(
            event_ids,
            {
                "claude-fable-mythos",
                "openclaw-suno",
                "magenta-realtime-2",
                "open-code-review",
                "openai-chip-talent-move",
                "amap-abot",
            },
        )
    def test_candidate_recall_finds_fixture_event_pairs(self):
        data = json.loads(FIXTURE_PATH.read_text(encoding="utf-8"))
        misses = []
        for event in data["events"]:
            items = [make_item(item, index) for index, item in enumerate(event["items"])]
            candidates, report = recall_semantic_candidates(
                items,
                config={
                    "enabled": True,
                    "title_similarity_threshold": 0.25,
                    "title_jaccard_threshold": 0.10,
                    "summary_jaccard_threshold": 0.05,
                    "strong_entity_overlap_threshold": 1,
                },
            )
            if not candidates:
                misses.append(event["event_id"])
        self.assertEqual(misses, [])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_legacy_script_delegation.py
+++ b/tests/test_legacy_script_delegation.py
@@ -0,0 +1,70 @@
 import importlib.util
 import unittest
 from pathlib import Path
 from unittest.mock import patch
 ROOT = Path(__file__).resolve().parents[1]
 SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"
 def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
 class LegacyScriptDelegationTests(unittest.TestCase):
    def test_main_delegates_to_new_pipeline_by_default(self):
        module = load_pipeline_module()
        calls = []
        def fake_run_daily_report(**kwargs):
            calls.append(kwargs)
            return {"reports": {"stage8": {"status": "ok"}}}
        with patch.object(module, "load_env", return_value={"AI_DAILY_DRY_RUN": "1"}):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                module.main()
        self.assertEqual(len(calls), 1)
        self.assertEqual(calls[0]["mode"], "dry-run")
        self.assertEqual(calls[0]["source_mode"], "live")
        self.assertEqual(calls[0]["llm_mode"], "live")
    def test_main_allows_mock_modes_for_local_test(self):
        module = load_pipeline_module()
        calls = []
        def fake_run_daily_report(**kwargs):
            calls.append(kwargs)
            return {"reports": {"stage8": {"status": "ok"}}}
        with patch.object(
            module,
            "load_env",
            return_value={"AI_DAILY_DRY_RUN": "1", "AI_DAILY_SOURCE_MODE": "mock", "AI_DAILY_LLM_MODE": "mock"},
        ):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                module.main()
        self.assertEqual(calls[0]["source_mode"], "mock")
        self.assertEqual(calls[0]["llm_mode"], "mock")
    def test_main_exits_nonzero_when_new_pipeline_blocks_publish(self):
        module = load_pipeline_module()
        def fake_run_daily_report(**kwargs):
            return {"reports": {"stage8": {"status": "blocked", "error": "rewrite_fallback_ratio_exceeded"}}}
        with patch.object(module, "load_env", return_value={}):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                with self.assertRaises(SystemExit) as raised:
                    module.main()
        self.assertEqual(raised.exception.code, 2)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_llm_utils.py
+++ b/tests/test_llm_utils.py
@@ -0,0 +1,17 @@
 import unittest
 from ai_daily_report.llm import parse_json_object
 class LlmUtilsTests(unittest.TestCase):
    def test_parse_json_object_strips_markdown_fence(self):
        self.assertEqual(parse_json_object('```json\n{"ok": true}\n```'), {"ok": True})
    def test_parse_json_object_raises_without_json(self):
        with self.assertRaises(ValueError):
            parse_json_object("not json")
 if __name__ == "__main__":
    unittest.main()
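The two `parse_json_object` tests cover a common LLM-output case: a JSON object wrapped in a Markdown code fence, and input with no JSON at all. The sketch below is one way to satisfy both. It is an assumption about the approach rather than the code in `ai_daily_report.llm`, and `parse_json_object_sketch` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example can itself sit in a fence
FENCE_RE = re.compile(rf"^{FENCE}[A-Za-z0-9_-]*\s*|\s*{FENCE}$")


def parse_json_object_sketch(text):
    """Strip an optional Markdown fence, then parse the outermost {...} span."""
    cleaned = FENCE_RE.sub("", text.strip())
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    # json.JSONDecodeError subclasses ValueError, so callers can catch one type.
    return json.loads(cleaned[start : end + 1])


fenced = FENCE + 'json\n{"ok": true}\n' + FENCE
parsed = parse_json_object_sketch(fenced)
print(parsed)
```

Slicing from the first `{` to the last `}` also tolerates stray prose around the object. Input containing several separate objects would still fail to parse.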
--- a/tests/test_markdown_rendering.py
+++ b/tests/test_markdown_rendering.py
@@ -0,0 +1,39 @@
 import unittest
 from ai_daily_report.assemble import assemble_markdown
 from ai_daily_report.models import NewsItem
 class MarkdownRenderingTests(unittest.TestCase):
    def test_blog_markdown_strips_double_blockquote_and_reference_markers(self):
        items = [
            NewsItem(
                id="a",
                source_group="AI HOT",
                source_label="OpenAI：Blog",
                source_role="primary",
                source_priority=10,
                title_raw="测试模型发布",
                title_norm="测试模型发布",
                summary_raw="测试摘要",
                title="测试模型发布",
                summary="测试摘要",
                url="https://openai.com/blog/test",
                canonical_url="https://openai.com/blog/test",
                section="模型与能力",
            )
        ]
        guide = {"theme": "> 主线判断：测试主线[1]", "threads": []}
        md, _ = assemble_markdown(items, guide)
        self.assertNotIn("## 导览", md)
        self.assertIn("## 模型与能力", md)
        self.assertIn("[OpenAI：Blog ↗](https://openai.com/blog/test)", md)
        self.assertNotIn("> >", md)
        self.assertNotIn("[1]", md)
        self.assertNotIn("主线判断", md)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_observability.py
+++ b/tests/test_observability.py
@@ -0,0 +1,34 @@
 import json
 import unittest
 from ai_daily_report.observability import LlmCallObserver, summarize_observed_calls
 class ObservabilityTests(unittest.TestCase):
    def test_records_prompt_and_response_hashes(self):
        observer = LlmCallObserver(lambda prompt: json.dumps({"ok": True}), stage="stage3")
        response = observer("prompt")
        self.assertEqual(response, '{"ok": true}')
        self.assertEqual(len(observer.records), 1)
        self.assertEqual(observer.records[0]["stage"], "stage3")
        self.assertEqual(observer.records[0]["prompt_chars"], len("prompt"))
        self.assertEqual(observer.records[0]["response_chars"], len(response))
        self.assertRegex(observer.records[0]["prompt_hash"], r"^[0-9a-f]{64}$")
        self.assertRegex(observer.records[0]["response_hash"], r"^[0-9a-f]{64}$")
    def test_summarizes_observed_calls(self):
        left = LlmCallObserver(lambda prompt: "a", stage="stage3")
        right = LlmCallObserver(lambda prompt: "b", stage="stage4")
        left("x")
        right("y")
        right("z")
        report = summarize_observed_calls([left, right])
        self.assertEqual(report["total_calls"], 3)
        self.assertEqual(report["by_stage"], {"stage3": 1, "stage4": 2})
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_project_structure.py
+++ b/tests/test_project_structure.py
@@ -0,0 +1,33 @@
 import unittest
 from pathlib import Path
 ROOT = Path(__file__).resolve().parents[1]
 class ProjectStructureTests(unittest.TestCase):
    def test_pipeline_plan_structure_exists(self):
        expected_paths = [
            "ai_daily_report/sources/__init__.py",
            "ai_daily_report/sources/aihot.py",
            "ai_daily_report/sources/rss.py",
            "ai_daily_report/sources/juya.py",
            "ai_daily_report/sources/registry.py",
            "ai_daily_report/llm.py",
            "ai_daily_report/validate.py",
            "ai_daily_report/publish.py",
            "ai_daily_report/cli.py",
            "config/sources.json",
            "config/pipeline.json",
            "tests/fixtures/.gitkeep",
            "skill/scripts/.gitkeep",
            "skill/scripts/run_daily_report.py",
        ]
        missing = [path for path in expected_paths if not (ROOT / path).exists()]
        self.assertEqual(missing, [])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_quality_gate.py
+++ b/tests/test_quality_gate.py
@@ -0,0 +1,78 @@
 import unittest
 from ai_daily_report.models import NewsItem, SourceResult
 from ai_daily_report.quality_gate import evaluate_quality_gate
 def news_item(item_id, title="Story"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=f"{title} {item_id}",
        title_norm=f"{title} {item_id}".lower(),
        summary_raw="summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
    )
 class QualityGateTests(unittest.TestCase):
    def test_warns_when_stage3_candidates_zero_for_large_item_set(self):
        items = [news_item(str(index)) for index in range(31)]
        report = evaluate_quality_gate(
            items,
            source_results=[],
            reports={"stage3": {"candidate_group_count": 0}},
            config={"warn_when_stage3_candidates_zero_min_items": 30},
        )
        self.assertIn("stage3_candidates_zero", report["warnings"])
        self.assertEqual(report["blocking_errors"], [])
    def test_warns_on_enabled_source_failure(self):
        report = evaluate_quality_gate(
            [news_item("a")],
            source_results=[
                SourceResult(
                    source="橘鸦AI早报",
                    role="supplement",
                    ok=False,
                    status="error",
                    error="HTTPError: 404",
                )
            ],
            reports={"stage3": {"candidate_group_count": 1}},
            config={"warn_on_enabled_source_failure": True},
        )
        self.assertIn("enabled_source_failed:橘鸦AI早报:error", report["warnings"])
        self.assertEqual(report["source_failures"][0]["source"], "橘鸦AI早报")
    def test_blocks_required_source_failure_when_configured(self):
        report = evaluate_quality_gate(
            [news_item("a")],
            source_results=[
                SourceResult(
                    source="AI HOT",
                    role="primary",
                    ok=False,
                    status="timeout",
                    error="TimeoutError",
                )
            ],
            reports={"stage3": {"candidate_group_count": 1}},
            config={
                "block_on_required_source_failure": True,
                "required_sources": ["AI HOT"],
            },
        )
        self.assertIn("required_source_failed:AI HOT:timeout", report["blocking_errors"])
        self.assertTrue(report["quality_gate_failed"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_rss.py
+++ b/tests/test_rss.py
@@ -0,0 +1,58 @@
 import unittest
 from ai_daily_report.models import SourceConfig
 from ai_daily_report.sources.rss import parse_rss_items
 class RssSourceTests(unittest.TestCase):
    def test_parse_rss_items_filters_entries_older_than_configured_age(self):
        config = SourceConfig(
            name="InfoQ AI",
            type="rss",
            url="https://feed.example/rss",
            max_item_age_days=3,
        )
        xml = """<?xml version="1.0"?>
 <rss><channel>
  <item>
    <title>Fresh item</title>
    <link>https://example.com/fresh</link>
    <description>Fresh summary</description>
    <pubDate>Sun, 07 Jun 2026 06:25:00 GMT</pubDate>
  </item>
  <item>
    <title>Old item</title>
    <link>https://example.com/old</link>
    <description>Old summary</description>
    <pubDate>Mon, 01 Jun 2026 06:25:00 GMT</pubDate>
  </item>
 </channel></rss>"""
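        items = parse_rss_items(config, xml, run_date="2026-06-08")
        # Relative to run_date, the 07 Jun entry is within max_item_age_days=3;
        # the 01 Jun entry is 7 days old and should be filtered out.
        self.assertEqual([item["title_raw"] for item in items], ["Fresh item"])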
    def test_parse_rss_items_keeps_unparseable_dates_to_avoid_false_drops(self):
        config = SourceConfig(
            name="InfoQ AI",
            type="rss",
            url="https://feed.example/rss",
            max_item_age_days=3,
        )
        xml = """<?xml version="1.0"?>
 <rss><channel>
  <item>
    <title>No date item</title>
    <link>https://example.com/no-date</link>
    <description>No date summary</description>
    <pubDate>not a date</pubDate>
  </item>
 </channel></rss>"""
        items = parse_rss_items(config, xml, run_date="2026-06-08")
        self.assertEqual([item["title_raw"] for item in items], ["No date item"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_runner.py
+++ b/tests/test_runner.py
@@ -0,0 +1,283 @@
 import json
 import unittest
 from pathlib import Path
 from tempfile import TemporaryDirectory
 from ai_daily_report.publish import load_published_urls
 from ai_daily_report.runner import run_daily_report
 class RunnerTests(unittest.TestCase):
    def test_run_daily_report_mock_mode_writes_markdown_and_reports(self):
        with TemporaryDirectory() as temp_dir:
            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="mock",
                llm_mode="mock",
                out_dir=Path(temp_dir),
                base_url="https://blog.example",
            )
            run_dir = Path(result["run_dir"])
            self.assertTrue((run_dir / "blog_markdown.md").exists())
            self.assertTrue((run_dir / "run_report.json").exists())
            for filename in [
                "stage0_sources.json",
                "stage1_items.json",
                "stage2_items.json",
                "stage2_5_items.json",
                "stage2_8_candidates.json",
                "stage3_items.json",
                "stage4_items.json",
                "quality_gate.json",
            ]:
                self.assertTrue((run_dir / filename).exists(), filename)
            self.assertEqual(result["reports"]["stage8"]["status"], "ok")
    def test_run_daily_report_passes_pipeline_config_to_stage_functions(self):
        class FakeLlmClient:
            def chat(self, prompt):
                payload = json.loads(prompt)
                if "candidates" in payload:
                    first_candidate = payload["candidates"][0]["item_ids"]
                    return json.dumps(
                        {
                            "duplicate_groups": [
                                {
                                    "keep_id": first_candidate[0],
                                    "remove_ids": [first_candidate[1]],
                                    "confidence": "high",
                                    "reason": "same event",
                                }
                            ],
                            "not_duplicates": [],
                            "uncertain": [],
                        }
                    )
                if "allowed_sections" in payload:
                    return json.dumps(
                        {
                            "rewrites": [
                                {
                                    "id": item["id"],
                                    "title": item["title_raw"],
                                    "summary": item["summary_raw"],
                                    "flags": [],
                                }
                                for item in payload["items"]
                            ]
                        }
                    )
                return json.dumps(
                    {
                        "intro": "Daily intro.",
                        "theme": "Pipeline config.",
                        "threads": [
                            {
                                "title": "Config thread",
                                "text": "Config values reached the pipeline.",
                                "item_ids": [payload["items"][0]["id"]],
                                "kind": "thread",
                            }
                        ],
                        "conclusion": "Done.",
                    }
                )
        with TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)
            pipeline_config = temp_path / "pipeline.json"
            pipeline_config.write_text(
                json.dumps(
                    {
                        "semantic_dedup_max_deletion_ratio": 0.1,
                        "rewrite_batch_size": 1,
                        "cross_day_dedup": {"enabled": False},
                    }
                ),
                encoding="utf-8",
            )
            source_config = temp_path / "sources.json"
            source_config.write_text(
                json.dumps(
                    [
                        {
                            "name": "AI HOT",
                            "type": "rss",
                            "url": "https://feed.example/rss",
                            "role": "primary",
                            "priority": 10,
                            "enabled": True,
                        }
                    ]
                ),
                encoding="utf-8",
            )
            def fetch_text(url, timeout):
                return """<?xml version="1.0"?><rss><channel>
 <item><title>Anthropic launches Claude Code</title><link>https://example.com/a</link><description>Anthropic launches Claude Code for developers.</description></item>
 <item><title>Anthropic launch Claude Code</title><link>https://example.com/b</link><description>Anthropic launch Claude Code for coding.</description></item>
 <item><title>Gemini CLI update</title><link>https://example.com/c</link><description>Google updates Gemini CLI.</description></item>
 </channel></rss>"""
            result = run_daily_report(
                run_date="2026-06-10",
                mode="dry-run",
                source_mode="live",
                llm_mode="live",
                out_dir=temp_path / "out",
                base_url="https://blog.example",
                sources_path=source_config,
                pipeline_path=pipeline_config,
                fetch_text=fetch_text,
                env={
                    "LLM_API_KEY": "test-key",
                    "LLM_BASE_URL": "https://llm.example/v1",
                    "LLM_MODEL": "test-model",
                },
                llm_client_factory=lambda **config: FakeLlmClient(),
            )
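        # The fake LLM flags 1 of 3 items as a duplicate (ratio ~0.33), which exceeds
        # semantic_dedup_max_deletion_ratio=0.1, so Stage 3 should skip the deletion.
        self.assertTrue(result["reports"]["stage3"]["skipped_for_deletion_ratio"])
        # With all 3 items kept and rewrite_batch_size=1, Stage 4 should run 3 batches.
        self.assertEqual(result["reports"]["stage4"]["batch_count"], 3)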
        self.assertIn("quality_gate", result["reports"])
    def test_run_daily_report_live_sources_can_use_config_and_fetch_text(self):
        with TemporaryDirectory() as temp_dir:
            out_dir = Path(temp_dir) / "out"
            source_config = Path(temp_dir) / "sources.json"
            source_config.write_text(
                json.dumps(
                    [
                        {
                            "name": "InfoQ AI",
                            "type": "rss",
                            "url": "https://feed.example/rss",
                            "role": "supplement",
                            "priority": 40,
                            "enabled": True,
                        }
                    ]
                ),
                encoding="utf-8",
            )
            def fetch_text(url, timeout):
                return """<?xml version="1.0"?><rss><channel><item><title>GPT-5 API 发布</title><link>https://example.com/gpt5</link><description>OpenAI 发布 GPT-5 API。</description></item></channel></rss>"""
            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="live",
                llm_mode="mock",
                out_dir=out_dir,
                base_url="https://blog.example",
                sources_path=source_config,
                fetch_text=fetch_text,
            )
            self.assertEqual(result["reports"]["stage0"]["raw_item_count"], 1)
            self.assertTrue((out_dir / "2026-06-04" / "blog_markdown.md").exists())
    def test_run_daily_report_live_llm_uses_env_config_in_dry_run(self):
        class FakeLlmClient:
            def __init__(self):
                self.prompts = []
            def chat(self, prompt):
                self.prompts.append(prompt)
                if "duplicate_groups" in prompt:
                    return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
                if "rewrites" in prompt:
                    payload = json.loads(prompt)
                    return json.dumps(
                        {
                            "rewrites": [
                                {
                                    "id": item["id"],
                                    "title": item["title_raw"],
                                    "summary": item["summary_raw"],
                                    "flags": [],
                                }
                                for item in payload["items"]
                            ]
                        }
                    )
                return json.dumps(
                    {
                        "theme": "模型能力继续进入产品入口。",
                        "threads": [
                            {
                                "title": "模型 API 更新",
                                "text": "GPT-5 API 发布，说明模型能力继续进入产品入口。",
                                "item_ids": [json.loads(prompt)["items"][0]["id"]],
                                "kind": "thread",
                            }
                        ],
                    }
                )
        fake_client = FakeLlmClient()
        captured_config = {}
        def llm_client_factory(**config):
            captured_config.update(config)
            return fake_client
        with TemporaryDirectory() as temp_dir:
            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="mock",
                llm_mode="live",
                out_dir=Path(temp_dir),
                base_url="https://blog.example",
                env={
                    "LLM_API_KEY": "test-key",
                    "LLM_BASE_URL": "https://llm.example/v1",
                    "LLM_MODEL": "test-model",
                },
                llm_client_factory=llm_client_factory,
            )
        self.assertEqual(captured_config["api_key"], "test-key")
        self.assertEqual(captured_config["base_url"], "https://llm.example/v1")
        self.assertEqual(captured_config["model"], "test-model")
        self.assertGreaterEqual(len(fake_client.prompts), 2)
        self.assertEqual(result["reports"]["stage8"]["status"], "ok")
    def test_run_daily_report_publish_updates_published_url_history(self):
        class FakeBlogClient:
            def __init__(self, **kwargs):
                self.kwargs = kwargs
            def create_post(self, payload):
                return {"slug": payload["slug"]}
            def publish_post(self, slug):
                self.slug = slug
        with TemporaryDirectory() as temp_dir:
            history_path = Path(temp_dir) / "published_urls.json"
            result = run_daily_report(
                run_date="2026-06-08",
                mode="publish",
                source_mode="mock",
                llm_mode="mock",
                out_dir=Path(temp_dir) / "out",
                base_url="https://blog.example",
                env={"BLOG_SERVICE_TOKEN": "token"},
                blog_client_factory=FakeBlogClient,
                history_path=history_path,
            )
            history = load_published_urls(history_path)
        self.assertEqual(result["reports"]["stage8"]["status"], "ok")
        self.assertIn("https://example.com/gpt5", history.urls)
        self.assertEqual(history.urls["https://example.com/gpt5"].last_published, "2026-06-08")
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_source_labels.py
+++ b/tests/test_source_labels.py
@@ -0,0 +1,55 @@
 import unittest
 from ai_daily_report.models import SourceConfig
 from ai_daily_report.sources.juya import parse_juya_rss
 from ai_daily_report.sources.labels import source_label_from_url
 class SourceLabelTests(unittest.TestCase):
    def test_source_label_from_x_url_includes_handle(self):
        self.assertEqual(
            source_label_from_url("https://x.com/MiniMax_AI/status/123", fallback="橘鸦AI早报"),
            "X：MiniMax (@MiniMax_AI)",
        )
    def test_source_label_from_blog_url_marks_blog(self):
        self.assertEqual(
            source_label_from_url("https://openai.com/blog/example", fallback="橘鸦AI早报"),
            "OpenAI：Blog",
        )
    def test_source_label_from_known_non_blog_domains(self):
        self.assertEqual(
            source_label_from_url("https://mp.weixin.qq.com/s/example", fallback="橘鸦AI早报"),
            "微信公众号",
        )
        self.assertEqual(
            source_label_from_url("https://platform.minimaxi.com/docs/token-plan/migration", fallback="橘鸦AI早报"),
            "MiniMax：Docs",
        )
    def test_parse_juya_rss_uses_item_url_as_source_label(self):
        config = SourceConfig(name="橘鸦AI早报", type="juya_rss", url="https://juya.example/rss")
        xml = """<?xml version="1.0"?>
 <rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>2026-06-04</title>
      <content:encoded><![CDATA[
        <h2><a href="https://x.com/MiniMax_AI/status/123">MiniMax M3 加速</a> <code>#1</code></h2>
        <p>MiniMax M3 加速。</p>
        <p><a href="https://x.com/MiniMax_AI/status/123">来源</a></p>
        <hr/>
      ]]></content:encoded>
    </item>
  </channel>
 </rss>"""
        items = parse_juya_rss(config, xml, "2026-06-04")
        self.assertEqual(items[0]["source_label"], "X：MiniMax (@MiniMax_AI)")
        self.assertNotEqual(items[0]["source_label"], "橘鸦AI早报")
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_collect.py
+++ b/tests/test_stage0_collect.py
@@ -0,0 +1,62 @@
 import unittest
 from ai_daily_report.clients import FetchTextError
 from ai_daily_report.collect import collect_sources
 from ai_daily_report.models import SourceConfig
 class Stage0CollectTests(unittest.TestCase):
    def test_collect_sources_returns_structured_results_for_each_source(self):
        configs = [
            SourceConfig(name="Primary", type="fake", role="primary", priority=10),
            SourceConfig(name="Supplement", type="fake", role="supplement", priority=20),
        ]
        def fetcher(config, run_date):
            return [{"title_raw": f"{config.name} item", "url": f"https://example.com/{config.name}"}]
        results, report = collect_sources(configs, "2026-06-04", fetcher=fetcher)
        self.assertEqual([r.source for r in results], ["Primary", "Supplement"])
        self.assertTrue(all(r.ok for r in results))
        self.assertEqual(sum(len(r.items) for r in results), 2)
        self.assertEqual(report["input_source_count"], 2)
        self.assertEqual(report["ok_source_count"], 2)
        self.assertEqual(report["raw_item_count"], 2)
    def test_collect_sources_records_failed_source_without_blocking_others(self):
        configs = [
            SourceConfig(name="Broken", type="fake", role="supplement", priority=20),
            SourceConfig(name="Healthy", type="fake", role="supplement", priority=30),
        ]
        def fetcher(config, run_date):
            if config.name == "Broken":
                raise TimeoutError("timed out")
            return [{"title_raw": "healthy item", "url": "https://example.com/healthy"}]
        results, report = collect_sources(configs, "2026-06-04", fetcher=fetcher)
        by_source = {r.source: r for r in results}
        self.assertFalse(by_source["Broken"].ok)
        self.assertEqual(by_source["Broken"].status, "timeout")
        self.assertIn("TimeoutError", by_source["Broken"].error)
        self.assertTrue(by_source["Healthy"].ok)
        self.assertEqual(report["failed_source_count"], 1)
        self.assertEqual(report["raw_item_count"], 1)
    def test_collect_sources_records_fetch_text_error_metadata(self):
        configs = [SourceConfig(name="RSS", type="rss", retries=2)]
        def fetcher(config, run_date):
            raise FetchTextError("http_404", "HTTPError: 404", http_status=404, attempts=1)
        results, report = collect_sources(configs, "2026-06-10", fetcher=fetcher)
        self.assertEqual(results[0].status, "http_404")
        # attempts=1 means only the initial attempt ran, so no retries are expected.
        self.assertEqual(results[0].retry_count, 0)
        self.assertIn("http_404", report["error_types"]["RSS"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_2_pipeline.py
+++ b/tests/test_stage0_to_2_pipeline.py
@@ -0,0 +1,32 @@
 import unittest
 from ai_daily_report.pipeline import run_stage0_to_stage2
 class Stage0To2PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage2_returns_deduped_items_and_reports(self):
        configs = [
            {"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10},
            {"name": "RSS", "type": "fake", "role": "supplement", "priority": 50},
        ]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": f"{config.name} summary",
                    "url": "https://openai.com/blog/gpt-5?utm_source=test",
                    "source_label": config.name,
                }
            ]
        result = run_stage0_to_stage2(configs, "2026-06-04", fetcher=fetcher)
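        # Both fake sources return the same URL: Stage 0 collects 2 raw items,
        # Stage 1 outputs 2 items, and Stage 2 removes 1 as a duplicate.
        self.assertEqual(len(result["items"]), 1)
        self.assertEqual(result["reports"]["stage0"]["raw_item_count"], 2)
        self.assertEqual(result["reports"]["stage1"]["output_count"], 2)
        self.assertEqual(result["reports"]["stage2"]["removed_count"], 1)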
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_4_pipeline.py
+++ b/tests/test_stage0_to_4_pipeline.py
@@ -0,0 +1,268 @@
 import json
 import unittest
 from ai_daily_report.models import PublishedUrlEntry, PublishedUrls
 from ai_daily_report.pipeline import run_stage0_to_stage4
 class Stage0To4PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage4_passes_semantic_and_rewrite_config(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen = {}
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic launches Claude Code",
                    "summary_raw": "Anthropic launches Claude Code for developers.",
                    "url": "https://example.com/a",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Anthropic launch Claude Code",
                    "summary_raw": "Anthropic launch Claude Code for coding.",
                    "url": "https://example.com/b",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Gemini CLI update",
                    "summary_raw": "Google updates Gemini CLI.",
                    "url": "https://example.com/c",
                    "source_label": config.name,
                },
            ]
        def semantic_llm_call(prompt):
            payload = json.loads(prompt)
            seen["semantic_prompt"] = payload
            first_candidate = payload["candidates"][0]["item_ids"]
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": first_candidate[0],
                            "remove_ids": [first_candidate[1]],
                            "confidence": "high",
                            "reason": "same event",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            seen.setdefault("rewrite_batches", []).append(len(payload["items"]))
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": item["id"],
                            "title": item["title_raw"],
                            "summary": item["summary_raw"],
                            "flags": [],
                        }
                        for item in payload["items"]
                    ]
                }
            )
        result = run_stage0_to_stage4(
            configs,
            "2026-06-10",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            semantic_dedup_max_deletion_ratio=0.1,
            rewrite_batch_size=1,
        )
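        # Removing 1 of 3 items (ratio ~0.33) exceeds the 0.1 limit, so the
        # semantic deletion should be skipped and all 3 items kept.
        self.assertTrue(result["reports"]["stage3"]["skipped_for_deletion_ratio"])
        # rewrite_batch_size=1 should yield one single-item rewrite batch per item.
        self.assertEqual(seen["rewrite_batches"], [1, 1, 1])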
    def test_run_stage0_to_stage4_semantic_dedupes_and_rewrites(self):
        configs = [
            {"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10},
            {"name": "RSS", "type": "fake", "role": "supplement", "priority": 50},
        ]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": f"{config.name} Anthropic IPO",
                    "summary_raw": f"{config.name} reports Anthropic IPO filing.",
                    "url": f"https://example.com/{config.name}",
                    "source_label": config.name,
                }
            ]
        def semantic_llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": "Anthropic 提交 IPO 文件",
                            "summary": "Anthropic 被报道提交 IPO 文件。",
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage4(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )
        self.assertEqual(len(result["items"]), 2)
        self.assertEqual(result["items"][0].title, "Anthropic 提交 IPO 文件")
        self.assertIn("stage3", result["reports"])
        self.assertIn("stage4", result["reports"])
        self.assertEqual(result["reports"]["stage4"]["rewritten_count"], 2)
    def test_run_stage0_to_stage4_filters_published_urls_before_semantic_dedupe(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen_semantic_payloads = []
        seen_rewrite_payloads = []
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Already published",
                    "summary_raw": "Old summary",
                    "url": "https://example.com/already",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Fresh story",
                    "summary_raw": "Fresh summary",
                    "url": "https://example.com/fresh",
                    "source_label": config.name,
                },
            ]
        def semantic_llm_call(prompt):
            seen_semantic_payloads.append(json.loads(prompt))
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            seen_rewrite_payloads.append(payload)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                }
            )
        published_urls = PublishedUrls(
            urls={
                "https://example.com/already": PublishedUrlEntry(
                    first_seen="2026-06-07",
                    last_published="2026-06-07",
                    titles=["Already published"],
                )
            }
        )
        result = run_stage0_to_stage4(
            configs,
            "2026-06-08",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            published_urls=published_urls,
        )
        self.assertEqual([entry.title_raw for entry in result["items"]], ["Fresh story"])
        self.assertEqual(result["reports"]["stage2_5"]["removed_count"], 1)
        self.assertEqual([entry["title_raw"] for entry in seen_rewrite_payloads[0]["items"]], ["Fresh story"])
    def test_run_stage0_to_stage4_uses_stage2_8_recalled_candidates(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen = {}
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic 被曝开发 Claude Fable",
                    "summary_raw": "Anthropic 正在开发名为 Claude Fable 和 Claude Mythos 的新产品。",
                    "url": "https://example.com/fable",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Claude Mythos 进入内部测试",
                    "summary_raw": "Anthropic 的 Claude Mythos 与 Claude Fable 面向内容生成场景。",
                    "url": "https://example.com/mythos",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Google 开源 Gemma 3n",
                    "summary_raw": "Google 开源 Gemma 3n 模型，面向端侧部署。",
                    "url": "https://example.com/gemma",
                    "source_label": config.name,
                },
            ]
        def semantic_llm_call(prompt):
            payload = json.loads(prompt)
            seen["candidate_count"] = len(payload["candidates"])
            seen["candidate_reasons"] = [candidate["reason"] for candidate in payload["candidates"]]
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage4(
            configs,
            "2026-06-10",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )
        self.assertEqual(seen["candidate_count"], 1)
        self.assertIn("strong_entity_overlap", seen["candidate_reasons"])
        self.assertEqual(result["reports"]["stage2_8"]["added_candidate_group_count"], 1)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_5_pipeline.py
+++ b/tests/test_stage0_to_5_pipeline.py
@@ -0,0 +1,63 @@
 import json
 import unittest
 from ai_daily_report.pipeline import run_stage0_to_stage5
 class Stage0To5PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage5_classifies_and_orders_items(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic 提交 IPO 文件",
                    "summary_raw": "Anthropic 被报道提交 IPO 文件。",
                    "url": "https://example.com/ipo",
                    "source_label": config.name,
                },
                {
                    "title_raw": "GPT-5 API 发布，延迟降低 30%",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": config.name,
                    "section_hint": "模型发布/更新",
                },
            ]
        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "section": "模型与能力" if "GPT-5" in entry["title_raw"] else "公司与资本",
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage5(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )
        self.assertEqual([item.section for item in result["items"]], ["模型与能力", "公司与资本"])
        self.assertEqual(result["reports"]["stage5"]["section_counts"]["模型与能力"], 1)
        self.assertEqual(result["reports"]["stage5"]["section_counts"]["公司与资本"], 1)
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_6_pipeline.py
+++ b/tests/test_stage0_to_6_pipeline.py
@@ -0,0 +1,75 @@
 import json
 import unittest
 from ai_daily_report.pipeline import run_stage0_to_stage6
 class Stage0To6PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage6_generates_guide(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": config.name,
                    "section_hint": "模型发布/更新",
                }
            ]
        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力更新",
                            "text": "GPT-5 API 发布，体现模型能力继续产品化。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage6(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
        )
        self.assertEqual(result["guide"]["theme"], "模型 API 能力继续更新。")
        self.assertEqual(len(result["guide"]["threads"]), 1)
        self.assertTrue(result["reports"]["stage6"]["theme_present"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_7_pipeline.py
+++ b/tests/test_stage0_to_7_pipeline.py
@@ -0,0 +1,76 @@
 import json
 import unittest
 from ai_daily_report.pipeline import run_stage0_to_stage7
 class Stage0To7PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage7_assembles_markdown(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": "OpenAI：Blog",
                    "section_hint": "模型发布/更新",
                }
            ]
        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力产品化",
                            "text": "GPT-5 API 发布，说明模型能力继续进入产品入口。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage7(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
        )
        self.assertNotIn("## 导览", result["markdown"])
        self.assertIn("## 模型与能力", result["markdown"])
        self.assertIn("## 今日脉络", result["markdown"])
        self.assertEqual(result["reports"]["stage7"]["blocking_errors"], [])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage0_to_8_pipeline.py
+++ b/tests/test_stage0_to_8_pipeline.py
@@ -0,0 +1,139 @@
 import json
 import unittest
 from urllib.error import HTTPError
 from ai_daily_report.pipeline import run_stage0_to_stage8
 class Stage0To8PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage8_dry_run_publishes_report(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": "OpenAI：Blog",
                    "section_hint": "模型发布/更新",
                }
            ]
        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )
        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力产品化",
                            "text": "GPT-5 API 发布，说明模型能力继续进入产品入口。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )
        result = run_stage0_to_stage8(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
            mode="dry-run",
            base_url="https://blog.example",
            client=None,
        )
        self.assertEqual(result["publish"].status, "ok")
        self.assertEqual(result["publish"].blog_url, "https://blog.example/posts/ai-2026-06-04")
        self.assertIn("stage8", result["reports"])
        self.assertEqual(result["reports"]["stage8"]["status"], "ok")
    def test_run_stage0_to_stage8_blocks_publish_when_rewrite_quality_gate_fails(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        def fetcher(config, run_date):
            return [
                {
                    "title_raw": f"News {index}",
                    "summary_raw": f"Summary {index}",
                    "url": f"https://example.com/{index}",
                    "source_label": "Example",
                    "section_hint": "模型发布/更新",
                }
                for index in range(6)
            ]
        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
        def rewrite_llm_call(prompt):
            raise HTTPError(
                url="https://llm.example/v1/chat/completions",
                code=503,
                msg="Service Unavailable",
                hdrs=None,
                fp=None,
            )
        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "theme": "模型能力继续更新。",
                    "threads": [
                        {
                            "title": "模型更新",
                            "text": "多条模型新闻更新。",
                            "item_ids": [payload["items"][0]["id"]],
                            "kind": "thread",
                        }
                    ],
                }
            )
        result = run_stage0_to_stage8(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
            mode="publish",
            base_url="https://blog.example",
            client=None,
        )
        self.assertEqual(result["publish"].status, "blocked")
        self.assertIn("rewrite_fallback_ratio_exceeded", result["reports"]["stage7"]["blocking_errors"])
        self.assertIn("rewrite_fallback_ratio_exceeded", result["reports"]["stage8"]["error"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage1_normalize.py
+++ b/tests/test_stage1_normalize.py
@@ -0,0 +1,85 @@
 import unittest
 from ai_daily_report.models import SourceResult
 from ai_daily_report.normalize import canonicalize_url, normalize_items, normalize_title
 class Stage1NormalizeTests(unittest.TestCase):
    def test_canonicalize_url_removes_tracking_and_normalizes_x_host(self):
        url = "HTTPS://Twitter.com/OpenAI/status/123/?utm_source=newsletter&fbclid=abc#fragment"
        self.assertEqual(canonicalize_url(url), "https://x.com/OpenAI/status/123")
    def test_normalize_items_builds_news_items_with_ids_and_norms(self):
        source_result = SourceResult(
            source="AI HOT",
            role="primary",
            ok=True,
            status="ok",
            items=[
                {
                    "title_raw": "  GPT-5 发布：速度提升 2x！ ",
                    "summary_raw": " <p>OpenAI 发布更新。</p> ",
                    "url": "https://openai.com/blog/gpt-5?utm_campaign=test",
                    "source_label": "OpenAI：Blog",
                    "section_hint": "模型发布/更新",
                }
            ],
        )
        items, report = normalize_items([source_result], run_date="2026-06-04")
        self.assertEqual(len(items), 1)
        self.assertTrue(items[0].id.startswith("item_"))
        self.assertEqual(items[0].canonical_url, "https://openai.com/blog/gpt-5")
        self.assertEqual(items[0].title_norm, normalize_title("GPT-5 发布：速度提升 2x！"))
        self.assertEqual(items[0].summary_raw, "OpenAI 发布更新。")
        self.assertEqual(items[0].source_role, "primary")
        self.assertEqual(report["input_count"], 1)
        self.assertEqual(report["output_count"], 1)
    def test_normalize_items_marks_quality_flags_without_dropping_item(self):
        source_result = SourceResult(
            source="RSS",
            role="supplement",
            ok=True,
            status="ok",
            items=[{"title_raw": "短", "summary_raw": "", "url": ""}],
        )
        items, report = normalize_items([source_result], run_date="2026-06-04")
        self.assertEqual(len(items), 1)
        self.assertIn("missing_url", items[0].quality_flags)
        self.assertIn("missing_summary", items[0].quality_flags)
        self.assertIn("short_title", items[0].quality_flags)
        self.assertEqual(report["quality_flag_counts"]["missing_url"], 1)
    def test_normalize_items_keeps_ids_unique_for_same_canonical_url(self):
        source_result = SourceResult(
            source="AI HOT",
            role="primary",
            ok=True,
            status="ok",
            items=[
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": "summary a",
                    "url": "https://example.com/news?utm_source=a",
                },
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": "summary b",
                    "url": "https://example.com/news",
                },
            ],
        )
        items, _ = normalize_items([source_result], run_date="2026-06-04")
        self.assertEqual(len({item.id for item in items}), 2)
        self.assertEqual(items[0].canonical_url, items[1].canonical_url)
 if __name__ == "__main__":
    unittest.main()
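Reviewer note on `tests/test_stage1_normalize.py`: the first test pins down the expected behavior of `canonicalize_url` (lowercase host, `twitter.com` mapped to `x.com`, tracking parameters and fragment dropped, trailing slash removed). The diff section shown here does not include the implementation, so the following is only a minimal sketch of a function that would satisfy those assertions. The name `canonicalize_url_sketch`, the tracking-key list, and the host alias table are assumptions for illustration, not the contents of `ai_daily_report.normalize`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters and host aliases; the real lists may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}
HOST_ALIASES = {"twitter.com": "x.com", "www.twitter.com": "x.com"}


def canonicalize_url_sketch(url: str) -> str:
    """Hypothetical stand-in for canonicalize_url, written from the test's expectations."""
    if not url:
        return ""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    host = HOST_ALIASES.get(host, host)
    # Keep only query parameters that are not known tracking keys.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/")
    # The fragment is always dropped (last tuple element is empty).
    return urlunsplit((scheme, host, path, urlencode(query), ""))
```

Under these assumptions, two URLs that differ only in `utm_*` parameters map to the same canonical URL, which is what `test_normalize_items_keeps_ids_unique_for_same_canonical_url` relies on when it compares `items[0].canonical_url` with `items[1].canonical_url`.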
--- a/tests/test_stage2_dedupe.py
+++ b/tests/test_stage2_dedupe.py
@@ -0,0 +1,129 @@
 import unittest
 from ai_daily_report.dedupe import cross_day_dedup_items, hard_dedup_items
 from ai_daily_report.models import NewsItem, PublishedUrlEntry, PublishedUrls
 def item(
    item_id,
    title,
    title_norm,
    url,
    canonical_url,
    source_group="AI HOT",
    source_label="AI HOT",
    source_priority=100,
    summary="summary",
 ):
    return NewsItem(
        id=item_id,
        source_group=source_group,
        source_label=source_label,
        source_role="primary" if source_group == "AI HOT" else "supplement",
        source_priority=source_priority,
        title_raw=title,
        title_norm=title_norm,
        summary_raw=summary,
        url=url,
        canonical_url=canonical_url,
    )
 class Stage2DedupeTests(unittest.TestCase):
    def test_hard_dedup_merges_same_canonical_url_and_keeps_better_item(self):
        items = [
            item("a", "OpenAI 发布 GPT-5", "openai发布gpt5", "https://example.com/a?utm_source=x", "https://example.com/a", source_group="RSS", source_priority=50, summary="short"),
            item("b", "OpenAI 发布 GPT-5", "openai发布gpt5", "https://example.com/a", "https://example.com/a", source_group="AI HOT", source_priority=10, summary="longer summary"),
        ]
        deduped, report = hard_dedup_items(items)
        self.assertEqual([i.id for i in deduped], ["b"])
        self.assertEqual(report["input_count"], 2)
        self.assertEqual(report["output_count"], 1)
        self.assertEqual(report["removed_count"], 1)
        self.assertEqual(report["groups"][0]["reason"], "same_canonical_url")
        self.assertEqual(deduped[0].duplicate_sources[0]["source_group"], "RSS")
    def test_hard_dedup_marks_similar_titles_without_removing(self):
        items = [
            item("a", "Grok API 上线 Cloudflare Gateway", "grokapi上线cloudflaregateway", "https://x.com/a", "https://x.com/a"),
            item("b", "Grok 模型登陆 Cloudflare AI Gateway", "grok模型登陆cloudflareaigateway", "https://x.com/b", "https://x.com/b"),
        ]
        deduped, report = hard_dedup_items(items)
        self.assertEqual(len(deduped), 2)
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(len(report["possible_duplicates"]), 1)
        self.assertEqual(set(report["possible_duplicates"][0]["item_ids"]), {"a", "b"})
    def test_hard_dedup_marks_lower_similarity_mixed_language_titles_as_candidates(self):
        items = [
            item("a", "OpenAI custom chip lead Clive Chan joins Anthropic", "openai定制芯片核心成员clivechan跳槽至anthropic", "https://example.com/a", "https://example.com/a"),
            item("b", "OpenAI chip core member defects to Anthropic before mass production", "openai芯片核心叛逃anthropic就在量产前夜", "https://example.com/b", "https://example.com/b"),
        ]
        deduped, report = hard_dedup_items(items)
        self.assertEqual(len(deduped), 2)
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(len(report["possible_duplicates"]), 1)
        self.assertEqual(set(report["possible_duplicates"][0]["item_ids"]), {"a", "b"})
    def test_cross_day_dedup_filters_recently_published_canonical_urls_only(self):
        items = [
            item("old", "Old URL", "oldurl", "https://example.com/old", "https://example.com/old"),
            item("new", "New URL", "newurl", "https://example.com/new", "https://example.com/new"),
            item("missing", "Missing URL", "missingurl", "", ""),
        ]
        published_urls = PublishedUrls(
            urls={
                "https://example.com/old": PublishedUrlEntry(
                    first_seen="2026-06-07",
                    last_published="2026-06-07",
                    titles=["Old URL"],
                )
            }
        )
        deduped, report = cross_day_dedup_items(
            items,
            published_urls,
            run_date="2026-06-08",
            max_age_days=7,
        )
        self.assertEqual([entry.id for entry in deduped], ["new", "missing"])
        self.assertEqual(report["input_count"], 3)
        self.assertEqual(report["output_count"], 2)
        self.assertEqual(report["removed_count"], 1)
        self.assertEqual(report["removed"][0]["item_id"], "old")
    def test_cross_day_dedup_ignores_urls_outside_history_window(self):
        items = [
            item("stale", "Stale URL", "staleurl", "https://example.com/stale", "https://example.com/stale"),
        ]
        published_urls = PublishedUrls(
            urls={
                "https://example.com/stale": PublishedUrlEntry(
                    first_seen="2026-05-01",
                    last_published="2026-05-01",
                    titles=["Stale URL"],
                )
            }
        )
        deduped, report = cross_day_dedup_items(
            items,
            published_urls,
            run_date="2026-06-08",
            max_age_days=7,
        )
        self.assertEqual([entry.id for entry in deduped], ["stale"])
        self.assertEqual(report["removed_count"], 0)
 if __name__ == "__main__":
    unittest.main()
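Reviewer note on `tests/test_stage2_dedupe.py`: the two cross-day tests describe a history window. A canonical URL published within `max_age_days` of `run_date` is removed, an older one is kept, and items without a URL pass through. A minimal sketch of that rule follows. It uses plain dicts instead of `NewsItem` and `PublishedUrls`, and the function name, the report's `age_days` field, and the inclusive window boundary are assumptions that the tests shown here do not settle.

```python
from datetime import date


def cross_day_filter_sketch(items, last_published_by_url, run_date, max_age_days=7):
    """Hypothetical illustration of cross-day dedupe.

    items: list of dicts with "id" and "canonical_url".
    last_published_by_url: dict mapping canonical URL to an ISO date string.
    """
    today = date.fromisoformat(run_date)
    kept, removed = [], []
    for entry in items:
        url = entry.get("canonical_url") or ""
        last = last_published_by_url.get(url) if url else None
        if last is not None:
            age = (today - date.fromisoformat(last)).days
            # Assumption: the window is inclusive on both ends.
            if 0 <= age <= max_age_days:
                removed.append({"item_id": entry["id"], "canonical_url": url, "age_days": age})
                continue
        kept.append(entry)
    report = {
        "input_count": len(items),
        "output_count": len(kept),
        "removed_count": len(removed),
        "removed": removed,
    }
    return kept, report
```

Items with an empty canonical URL are never looked up, so a missing URL cannot accidentally match a history entry.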
--- a/tests/test_stage3_semantic_dedupe.py
+++ b/tests/test_stage3_semantic_dedupe.py
@@ -0,0 +1,163 @@
 import json
 import unittest
 from ai_daily_report.models import NewsItem
 from ai_daily_report.semantic_dedupe import semantic_dedup_items
 def news_item(item_id, title, source_group="AI HOT"):
    return NewsItem(
        id=item_id,
        source_group=source_group,
        source_label=source_group,
        source_role="primary" if source_group == "AI HOT" else "supplement",
        source_priority=10 if source_group == "AI HOT" else 50,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
    )
 class Stage3SemanticDedupeTests(unittest.TestCase):
    def test_semantic_dedup_removes_only_high_confidence_duplicates(self):
        items = [
            news_item("a", "Anthropic 提交 IPO 招股书", "AI HOT"),
            news_item("b", "刚刚，Anthropic 提交了招股书", "量子位"),
            news_item("c", "Grok 上线 Cloudflare Gateway", "AI HOT"),
        ]
        candidates = [{"item_ids": ["a", "b"], "reason": "title_similarity"}]
        def llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": "a",
                            "remove_ids": ["b"],
                            "confidence": "high",
                            "reason": "same IPO filing event",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)
        self.assertEqual([item.id for item in deduped], ["a", "c"])
        self.assertEqual(report["removed_count"], 1)
        self.assertEqual(report["duplicate_groups"][0]["reason"], "same IPO filing event")
        self.assertEqual(deduped[0].duplicate_sources[0]["id"], "b")
    def test_semantic_dedup_skips_deletion_when_ratio_exceeds_limit(self):
        items = [
            news_item("a", "A"),
            news_item("b", "B"),
            news_item("c", "C"),
        ]
        candidates = [{"item_ids": ["a", "b", "c"], "reason": "llm_candidate"}]
        def llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": "a",
                            "remove_ids": ["b", "c"],
                            "confidence": "high",
                            "reason": "too broad",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        deduped, report = semantic_dedup_items(
            items,
            candidates,
            llm_call=llm_call,
            max_deletion_ratio=0.5,
        )
        self.assertEqual(len(deduped), 3)
        self.assertEqual(report["removed_count"], 0)
        self.assertTrue(report["skipped_for_deletion_ratio"])
    def test_semantic_dedup_supports_merge_groups_as_supplementary_sources(self):
        items = [
            news_item("a", "高德推出 ABot", "AI HOT"),
            news_item("b", "高德 ABot 进入本地生活入口", "橘鸦AI早报"),
            news_item("c", "Meta 发布新眼镜", "InfoQ AI"),
        ]
        candidates = [{"item_ids": ["a", "b"], "reason": "same_event_complementary"}]
        def llm_call(prompt):
            self.assertIn("merge_groups", prompt)
            return json.dumps(
                {
                    "duplicate_groups": [],
                    "merge_groups": [
                        {
                            "keep_id": "a",
                            "merge_ids": ["b"],
                            "confidence": "high",
                            "reason": "same ABot launch, different angle",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)
        self.assertEqual([item.id for item in deduped], ["a", "b", "c"])
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(report["merge_groups"][0]["merge_ids"], ["b"])
        self.assertEqual(deduped[0].duplicate_sources[0]["action"], "merge_supplement")
        self.assertEqual(deduped[0].duplicate_sources[0]["id"], "b")
    def test_semantic_dedup_ignores_groups_outside_candidate_sets(self):
        items = [
            news_item("a", "Suno 完成融资"),
            news_item("b", "Suno 完成 D 轮融资"),
            news_item("c", "Ideogram 发布 v4"),
            news_item("d", "OpenClaw 发布新版"),
        ]
        candidates = [{"item_ids": ["a", "b"], "reason": "title_similarity"}]
        def llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": "a",
                            "remove_ids": ["b"],
                            "confidence": "high",
                            "reason": "same Suno event",
                        },
                        {
                            "keep_id": "c",
                            "remove_ids": ["d"],
                            "confidence": "high",
                            "reason": "not part of candidates",
                        },
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )
        deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)
        self.assertEqual([item.id for item in deduped], ["a", "c", "d"])
        self.assertEqual(report["removed_count"], 1)
        self.assertIn("group_outside_candidates", report["errors"][0])
 if __name__ == "__main__":
    unittest.main()
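Reviewer note on `tests/test_stage3_semantic_dedupe.py`: taken together, the tests imply three guards on the LLM's answer. Only high-confidence groups remove items, a group must fall inside a single candidate set, and the whole deletion is skipped when it would exceed `max_deletion_ratio`. The sketch below shows one way those guards could compose. It works on bare ids, the function name and the 0.5 default are assumptions, and it leaves out `merge_groups` and `duplicate_sources` bookkeeping.

```python
def apply_duplicate_groups_sketch(item_ids, candidates, duplicate_groups, max_deletion_ratio=0.5):
    """Hypothetical illustration of the guards the Stage 3 tests describe."""
    candidate_sets = [set(candidate["item_ids"]) for candidate in candidates]
    to_remove, errors = set(), []
    for group in duplicate_groups:
        if group.get("confidence") != "high":
            continue  # only high-confidence groups may delete anything
        members = {group["keep_id"], *group["remove_ids"]}
        # Reject groups the LLM invented outside the recalled candidates.
        if not any(members <= candidate_set for candidate_set in candidate_sets):
            errors.append(f"group_outside_candidates: {sorted(members)}")
            continue
        to_remove.update(group["remove_ids"])
    # Safety valve: refuse to delete more than the allowed share of items.
    skipped = bool(item_ids) and len(to_remove) / len(item_ids) > max_deletion_ratio
    if skipped:
        to_remove = set()
    kept = [item_id for item_id in item_ids if item_id not in to_remove]
    report = {
        "removed_count": len(to_remove),
        "skipped_for_deletion_ratio": skipped,
        "errors": errors,
    }
    return kept, report
```

Checking the ratio once over the combined removal set, rather than per group, matches `test_semantic_dedup_skips_deletion_when_ratio_exceeds_limit`, where a single broad group would delete two of three items and nothing is removed.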
--- a/tests/test_stage4_rewrite.py
+++ b/tests/test_stage4_rewrite.py
@@ -0,0 +1,242 @@
 import json
 import unittest
 from urllib.error import HTTPError
 from ai_daily_report.models import NewsItem
 from ai_daily_report.rewrite import rewrite_items
 def news_item(item_id="a"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw="OpenAI launches GPT-5 API",
        title_norm="openailaunchesgpt5api",
        summary_raw="OpenAI launched the GPT-5 API with better latency.",
        url="https://example.com/a",
        canonical_url="https://example.com/a",
    )
 class Stage4RewriteTests(unittest.TestCase):
    def test_rewrite_items_writes_display_fields_without_overwriting_raw(self):
        items = [news_item("a")]
        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": "a",
                            "title": "OpenAI 发布 GPT-5 API",
                            "summary": "OpenAI 发布 GPT-5 API，延迟表现更好。",
                            "flags": [],
                        }
                    ]
                },
                ensure_ascii=False,
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)
        self.assertEqual(rewritten[0].title, "OpenAI 发布 GPT-5 API")
        self.assertEqual(rewritten[0].summary, "OpenAI 发布 GPT-5 API，延迟表现更好。")
        self.assertEqual(rewritten[0].title_raw, "OpenAI launches GPT-5 API")
        self.assertEqual(report["rewritten_count"], 1)
        self.assertEqual(report["fallback_count"], 0)
    def test_rewrite_items_accepts_llm_section_classification(self):
        items = [news_item("a")]
        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": "a",
                            "title": "OpenAI 发布 GPT-5 API",
                            "summary": "OpenAI 发布 GPT-5 API，延迟表现更好。",
                            "section": "模型与能力",
                            "confidence": 0.92,
                            "flags": [],
                        }
                    ]
                },
                ensure_ascii=False,
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)
        self.assertEqual(rewritten[0].section, "模型与能力")
        self.assertEqual(report["llm_section_count"], 1)
    def test_rewrite_items_falls_back_when_llm_fails(self):
        items = [news_item("a")]
        def llm_call(prompt):
            raise TimeoutError("slow")
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)
        self.assertEqual(rewritten[0].title, "OpenAI launches GPT-5 API")
        self.assertEqual(rewritten[0].summary, "OpenAI launched the GPT-5 API with better latency.")
        self.assertEqual(report["rewritten_count"], 0)
        self.assertEqual(report["fallback_count"], 1)
        self.assertIn("TimeoutError", report["errors"][0])
    def test_rewrite_items_can_retry_failed_batch_as_single_items_when_enabled(self):
        items = [news_item("a"), news_item("b")]
        calls = []
        def llm_call(prompt):
            payload = json.loads(prompt)
            ids = [item["id"] for item in payload["items"]]
            calls.append(ids)
            if len(ids) > 1:
                return "not json"
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": ids[0],
                            "title": f"title {ids[0]}",
                            "summary": f"summary {ids[0]}",
                            "flags": [],
                        }
                    ]
                }
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2, retry_single_items=True)
        self.assertEqual([item.title for item in rewritten], ["title a", "title b"])
        self.assertEqual(report["rewritten_count"], 2)
        self.assertEqual(report["fallback_count"], 0)
        self.assertEqual(calls, [["a", "b"], ["a"], ["b"]])
    def test_rewrite_items_does_not_retry_single_items_by_default(self):
        items = [news_item("a"), news_item("b")]
        calls = []
        def llm_call(prompt):
            payload = json.loads(prompt)
            calls.append([item["id"] for item in payload["items"]])
            return "not json"
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2)
        self.assertEqual(calls, [["a", "b"]])
        self.assertEqual([item.title for item in rewritten], ["OpenAI launches GPT-5 API", "OpenAI launches GPT-5 API"])
        self.assertEqual(report["fallback_count"], 2)
    def test_rewrite_items_retries_failed_large_batch_as_smaller_batches_by_default(self):
        items = [news_item(str(index)) for index in range(30)]
        calls = []
        def llm_call(prompt):
            payload = json.loads(prompt)
            ids = [item["id"] for item in payload["items"]]
            calls.append(ids)
            if len(ids) == 30:
                return "not json"
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": item_id,
                            "title": f"title {item_id}",
                            "summary": f"summary {item_id}",
                            "section": "模型与能力",
                            "flags": [],
                        }
                        for item_id in ids
                    ]
                },
                ensure_ascii=False,
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call)
        self.assertEqual([len(call) for call in calls], [30, 10, 10, 10])
        self.assertEqual(report["rewritten_count"], 30)
        self.assertEqual(report["llm_section_count"], 30)
        self.assertEqual(report["fallback_count"], 0)
        self.assertEqual(report["batch_retry_count"], 3)
        self.assertEqual(report["blocking_errors"], [])
        self.assertEqual(rewritten[0].title, "title 0")
    def test_rewrite_items_keeps_partial_batch_rewrites_when_some_ids_are_missing(self):
        items = [news_item("a"), news_item("b"), news_item("c")]
        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {"id": "a", "title": "title a", "summary": "summary a", "flags": []},
                        {"id": "c", "title": "title c", "summary": "summary c", "flags": []},
                    ]
                }
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=3, max_fallback_ratio=0.5)
        self.assertEqual([item.title for item in rewritten], ["title a", "OpenAI launches GPT-5 API", "title c"])
        self.assertEqual(report["rewritten_count"], 2)
        self.assertEqual(report["fallback_count"], 1)
        self.assertEqual(report["missing_rewrite_count"], 1)
        self.assertEqual(report["blocking_errors"], [])
    def test_rewrite_items_defaults_to_large_batches_to_reduce_llm_requests(self):
        items = [news_item(str(index)) for index in range(61)]
        batch_sizes = []
        def llm_call(prompt):
            payload = json.loads(prompt)
            batch_sizes.append(len(payload["items"]))
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                }
            )
        rewrite_items(items, llm_call=llm_call)
        self.assertEqual(batch_sizes, [30, 30, 1])
    def test_rewrite_items_does_not_retry_single_items_after_transient_http_error(self):
        items = [news_item("a"), news_item("b")]
        calls = 0
        def llm_call(prompt):
            nonlocal calls
            calls += 1
            raise HTTPError(
                url="https://llm.example/v1/chat/completions",
                code=503,
                msg="Service Unavailable",
                hdrs=None,
                fp=None,
            )
        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2)
        self.assertEqual(calls, 1)
        self.assertEqual([item.title for item in rewritten], ["OpenAI launches GPT-5 API", "OpenAI launches GPT-5 API"])
        self.assertEqual(report["fallback_count"], 2)
        self.assertTrue(report["quality_gate_failed"])
        self.assertIn("rewrite_fallback_ratio_exceeded", report["blocking_errors"])
 if __name__ == "__main__":
    unittest.main()
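Review note: the tests above pin down the retry behaviour (a failed batch of 30 is retried as three batches of 10, and a small failed batch falls back without retry), but `rewrite_items` itself is not shown in this part of the diff. The following is a minimal sketch of that control flow under stated assumptions, not the actual implementation: `chunk`, `call_batch`, `rewrite_titles`, and `retry_batch_size` are hypothetical names, and only parse failures are treated as retryable here.

```python
import json


def chunk(ids, size):
    """Split a list of ids into consecutive batches of at most `size`."""
    return [ids[start:start + size] for start in range(0, len(ids), size)]


def call_batch(ids, llm_call):
    """Send one batch to the LLM and return {id: title}; raises on bad output."""
    prompt = json.dumps({"items": [{"id": item_id} for item_id in ids]})
    payload = json.loads(llm_call(prompt))  # JSONDecodeError is a ValueError
    return {entry["id"]: entry["title"] for entry in payload["rewrites"]}


def rewrite_titles(ids, llm_call, batch_size=30, retry_batch_size=10):
    """Hypothetical helper: retry a failed large batch as smaller batches."""
    titles = {}
    report = {"batch_retry_count": 0, "fallback_count": 0}
    parse_errors = (ValueError, KeyError, TypeError)
    for batch in chunk(ids, batch_size):
        try:
            titles.update(call_batch(batch, llm_call))
            continue
        except parse_errors:
            pass
        if len(batch) <= retry_batch_size:
            # Already small: no retry, the caller keeps the raw titles.
            report["fallback_count"] += len(batch)
            continue
        for small in chunk(batch, retry_batch_size):
            report["batch_retry_count"] += 1
            try:
                titles.update(call_batch(small, llm_call))
            except parse_errors:
                report["fallback_count"] += len(small)
    return titles, report
```

With a fake `llm_call` that returns `"not json"` only for 30-item batches, this sketch produces the same call pattern the test asserts: one call of 30 followed by three calls of 10.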
--- a/tests/test_stage5_classify.py
+++ b/tests/test_stage5_classify.py
@@ -0,0 +1,88 @@
 import unittest
 from ai_daily_report.classify import SECTION_ORDER, classify_and_order_items
 from ai_daily_report.models import NewsItem
 def news_item(item_id, title, summary="", section_hint="", source_priority=50):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=source_priority,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=summary or f"{title} summary",
        title=title,
        summary=summary or f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section_hint=section_hint,
    )
 class Stage5ClassifyTests(unittest.TestCase):
    def test_classify_maps_legacy_section_hints_to_new_sections(self):
        items = [news_item("a", "GPT-5 发布", section_hint="模型发布/更新")]
        classified, report = classify_and_order_items(items)
        self.assertEqual(classified[0].section, "模型与能力")
        self.assertEqual(report["hint_classified"], 1)
        self.assertIn("模型与能力", SECTION_ORDER)
    def test_classify_uses_rules_when_hint_is_missing(self):
        items = [
            news_item("a", "Anthropic 提交 IPO 文件", summary="Anthropic 计划上市并提交文件。"),
            news_item("b", "MCP SDK 发布新版", summary="开发者可用新版 SDK 构建工具。"),
        ]
        classified, report = classify_and_order_items(items)
        by_id = {item.id: item for item in classified}
        self.assertEqual(by_id["a"].section, "公司与资本")
        self.assertEqual(by_id["b"].section, "开发与基础设施")
        self.assertEqual(report["rule_classified"], 2)
    def test_classify_prefers_valid_llm_section_from_rewrite_stage(self):
        item = news_item(
            "a",
            "API 发布",
            summary="这其实是一个面向开发者的基础设施能力更新。",
            section_hint="产品发布/更新",
        )
        item.section = "开发与基础设施"
        classified, report = classify_and_order_items([item])
        self.assertEqual(classified[0].section, "开发与基础设施")
        self.assertEqual(report["llm_classified"], 1)
        self.assertEqual(report["hint_classified"], 0)
        self.assertEqual(report["rule_classified"], 0)
    def test_classify_falls_back_when_llm_section_is_invalid(self):
        item = news_item("a", "GPT-5 发布", section_hint="模型发布/更新")
        item.section = "热点新闻"
        classified, report = classify_and_order_items([item])
        self.assertEqual(classified[0].section, "模型与能力")
        self.assertEqual(report["llm_classified"], 0)
        self.assertEqual(report["hint_classified"], 1)
        self.assertEqual(report["invalid_llm_section_count"], 1)
    def test_classify_orders_items_by_local_rank_score_within_sections(self):
        items = [
            news_item("low", "普通模型更新", section_hint="模型发布/更新", source_priority=80),
            news_item("high", "GPT-5 API 发布，延迟降低 30%", section_hint="模型发布/更新", source_priority=10),
        ]
        classified, report = classify_and_order_items(items)
        self.assertEqual([item.id for item in classified], ["high", "low"])
        self.assertEqual(report["section_counts"]["模型与能力"], 2)
 if __name__ == "__main__":
    unittest.main()
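Review note: the classification tests above imply a precedence order: a valid section set by the rewrite stage wins, then the legacy `section_hint` mapping, then keyword rules. The sketch below illustrates that order only. `pick_section`, `VALID_SECTIONS`, `LEGACY_HINTS`, `KEYWORD_RULES`, and `DEFAULT_SECTION` are hypothetical names; the section values are the ones that appear in the tests and are likely a subset of the real `SECTION_ORDER`, and the default is an assumption.

```python
# Subset of sections seen in the tests; the real SECTION_ORDER may hold more.
VALID_SECTIONS = ("模型与能力", "公司与资本", "开发与基础设施")
# Legacy hint mapping taken from the test data.
LEGACY_HINTS = {"模型发布/更新": "模型与能力"}
# Illustrative keyword rules; the real rule set is not shown in this diff.
KEYWORD_RULES = (("IPO", "公司与资本"), ("SDK", "开发与基础设施"))
DEFAULT_SECTION = "模型与能力"  # assumption, not confirmed by the source


def pick_section(llm_section, section_hint, text):
    """Return (section, origin) using llm > hint > rule > default precedence."""
    if llm_section in VALID_SECTIONS:
        return llm_section, "llm"
    if section_hint in LEGACY_HINTS:
        return LEGACY_HINTS[section_hint], "hint"
    for keyword, section in KEYWORD_RULES:
        if keyword in text:
            return section, "rule"
    return DEFAULT_SECTION, "default"
```

An invalid LLM section such as `"热点新闻"` falls through to the hint mapping, which matches what `test_classify_falls_back_when_llm_section_is_invalid` expects.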
--- a/tests/test_stage6_guide.py
+++ b/tests/test_stage6_guide.py
@@ -0,0 +1,83 @@
 import json
 import unittest
 from ai_daily_report.guide import generate_guide
 from ai_daily_report.models import NewsItem
 def news_item(item_id, title, section="模型与能力"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=f"{title} summary",
        title=title,
        summary=f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section=section,
    )
 class Stage6GuideTests(unittest.TestCase):
    def test_generate_guide_returns_intro_theme_threads_and_conclusion(self):
        items = [
            news_item("a", "GPT-5 API 发布"),
            news_item("b", "Miso One 开源语音模型"),
        ]
        def llm_call(prompt):
            return json.dumps(
                {
                    "intro": "今天的 AI 行业继续围绕模型能力、Agent 产品和基础设施演进展开。",
                    "theme": "模型能力继续向 API 和实时语音两端推进。",
                    "threads": [
                        {
                            "title": "模型能力继续推进",
                            "text": "GPT-5 API 和 Miso One 分别代表 API 能力和语音模型更新。",
                            "item_ids": ["a", "b"],
                            "kind": "thread",
                        },
                        {
                            "title": "无效脉络",
                            "text": "这条引用了不存在的条目。",
                            "item_ids": ["missing"],
                            "kind": "thread",
                        },
                    ],
                    "conclusion": "总体看，模型能力正在进入更多产品入口，生态竞争也在继续加速。",
                },
                ensure_ascii=False,
            )
        guide, report = generate_guide(items, llm_call=llm_call)
        self.assertEqual(guide["intro"], "今天的 AI 行业继续围绕模型能力、Agent 产品和基础设施演进展开。")
        self.assertEqual(guide["theme"], "模型能力继续向 API 和实时语音两端推进。")
        self.assertEqual(guide["conclusion"], "总体看，模型能力正在进入更多产品入口，生态竞争也在继续加速。")
        self.assertEqual(len(guide["threads"]), 1)
        self.assertEqual(guide["threads"][0]["item_ids"], ["a", "b"])
        self.assertEqual(report["dropped_thread_count"], 1)
    def test_generate_guide_falls_back_when_llm_fails(self):
        items = [news_item("a", "GPT-5 API 发布")]
        def llm_call(prompt):
            raise TimeoutError("slow")
        guide, report = generate_guide(items, llm_call=llm_call)
        self.assertEqual(guide["intro"], "")
        self.assertEqual(guide["theme"], "")
        self.assertEqual(guide["conclusion"], "")
        self.assertEqual(guide["threads"], [])
        self.assertTrue(report["fallback_used"])
        self.assertIn("TimeoutError", report["errors"][0])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage7_assemble.py
+++ b/tests/test_stage7_assemble.py
@@ -0,0 +1,69 @@
 import unittest
 from ai_daily_report.assemble import assemble_markdown, validate_markdown
 from ai_daily_report.models import NewsItem
 def news_item(item_id, title, section):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="OpenAI：Blog",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=f"{title} summary",
        title=title,
        summary=f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section=section,
    )
 class Stage7AssembleTests(unittest.TestCase):
    def test_assemble_markdown_renders_intro_sections_daily_threads_and_conclusion(self):
        items = [
            news_item("a", "GPT-5 API 发布", "模型与能力"),
            news_item("b", "Anthropic 提交 IPO 文件", "公司与资本"),
        ]
        guide = {
            "intro": "今天的 AI 行业继续围绕模型、产品和资本展开。",
            "theme": "> 模型和资本两条线都在推进。[1]",
            "threads": [
                {
                    "title": "模型能力产品化",
                    "text": "GPT-5 API 发布，说明模型能力继续进入产品入口。",
                    "item_ids": ["a"],
                    "kind": "thread",
                }
            ],
            "conclusion": "总体看，AI 竞争继续从单点模型能力转向产品、基础设施和资本协同。",
        }
        md, report = assemble_markdown(items, guide)
        self.assertTrue(md.startswith("## 引言\n\n> 今天的 AI 行业继续围绕模型、产品和资本展开。"))
        self.assertNotIn("## 导览", md)
        self.assertNotIn("> 模型和资本两条线都在推进。", md)
        self.assertIn("## 模型与能力", md)
        self.assertIn("**1. GPT-5 API 发布**", md)
        self.assertIn("**2. Anthropic 提交 IPO 文件**", md)
        self.assertIn("## 今日脉络", md)
        self.assertIn("- **模型能力产品化**", md)
        self.assertTrue(md.endswith("## 总结\n\n> 总体看，AI 竞争继续从单点模型能力转向产品、基础设施和资本协同。"))
        self.assertNotIn("> >", md)
        self.assertNotIn("[1]", md)
        self.assertEqual(report["item_count"], 2)
        self.assertEqual(report["blocking_errors"], [])
    def test_validate_markdown_blocks_empty_report(self):
        report = validate_markdown("", [])
        self.assertIn("no_items", report["blocking_errors"])
        self.assertIn("markdown_too_short", report["blocking_errors"])
 if __name__ == "__main__":
    unittest.main()
--- a/tests/test_stage8_publish.py
+++ b/tests/test_stage8_publish.py
@@ -0,0 +1,162 @@
 import unittest
 from pathlib import Path
 from tempfile import TemporaryDirectory
 from ai_daily_report.models import NewsItem
 from ai_daily_report.publish import load_published_urls, publish_markdown, update_published_urls
 class FakeBlogClient:
    def __init__(self, existing_post=None):
        self.created_payload = None
        self.published_slug = None
        self.existing_post = existing_post
    def create_post(self, payload):
        self.created_payload = payload
        return {"slug": "ai-2026-06-04"}
    def publish_post(self, slug):
        self.published_slug = slug
    def get_post_by_slug(self, slug):
        return self.existing_post
 class Stage8PublishTests(unittest.TestCase):
    def test_publish_markdown_dry_run_does_not_call_client(self):
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="## 导览\n\n> ok",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="dry-run",
            markdown_report={"blocking_errors": []},
            client=None,
        )
        self.assertEqual(result.status, "ok")
        self.assertEqual(result.mode, "dry-run")
        self.assertEqual(result.blog_url, "https://blog.example/posts/ai-2026-06-04")
        self.assertTrue(result.public_ok)
    def test_publish_markdown_blocks_when_markdown_has_errors(self):
        client = FakeBlogClient()
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="bad",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": ["markdown_too_short"]},
            client=client,
        )
        self.assertEqual(result.status, "blocked")
        self.assertIsNone(client.created_payload)
        self.assertIn("markdown_too_short", result.error)
    def test_publish_markdown_publish_mode_calls_client(self):
        client = FakeBlogClient()
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="## 导览\n\n> ok",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
        )
        self.assertEqual(result.status, "ok")
        self.assertEqual(client.created_payload["title"], "AI日报 · 2026-06-04")
        self.assertEqual(client.published_slug, "ai-2026-06-04")
        self.assertEqual(result.blog_url, "https://blog.example/posts/ai-2026-06-04")
    def test_publish_markdown_returns_already_published_for_same_slug_and_content(self):
        markdown = "## 导览\n\n> ok"
        client = FakeBlogClient(existing_post={"slug": "ai-2026-06-04", "content": markdown})
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown=markdown,
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
            idempotency_config={"enabled": True},
        )
        self.assertEqual(result.status, "already_published")
        self.assertIsNone(client.created_payload)
        self.assertIsNone(client.published_slug)
    def test_publish_markdown_blocks_existing_slug_with_different_content(self):
        client = FakeBlogClient(existing_post={"slug": "ai-2026-06-04", "content": "old"})
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="new",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
            idempotency_config={"enabled": True},
        )
        self.assertEqual(result.status, "blocked")
        self.assertIn("slug_already_exists", result.error)
        self.assertIsNone(client.created_payload)
    def test_update_published_urls_writes_canonical_urls_for_final_items(self):
        with TemporaryDirectory() as temp_dir:
            history_path = Path(temp_dir) / "published_urls.json"
            items = [
                NewsItem(
                    id="a",
                    source_group="AI HOT",
                    source_label="AI HOT",
                    source_role="primary",
                    source_priority=10,
                    title_raw="Fresh story",
                    title_norm="freshstory",
                    summary_raw="summary",
                    url="https://example.com/fresh?utm_source=x",
                    canonical_url="https://example.com/fresh",
                    title="Fresh story",
                ),
                NewsItem(
                    id="missing",
                    source_group="AI HOT",
                    source_label="AI HOT",
                    source_role="primary",
                    source_priority=10,
                    title_raw="Missing URL",
                    title_norm="missingurl",
                    summary_raw="summary",
                    url="",
                    canonical_url="",
                ),
            ]
            update_published_urls(history_path, items, run_date="2026-06-08", max_age_days=7)
            loaded = load_published_urls(history_path)
        self.assertIn("https://example.com/fresh", loaded.urls)
        self.assertNotIn("", loaded.urls)
        self.assertEqual(loaded.urls["https://example.com/fresh"].first_seen, "2026-06-08")
        self.assertEqual(loaded.urls["https://example.com/fresh"].last_published, "2026-06-08")
        self.assertEqual(loaded.urls["https://example.com/fresh"].titles, ["Fresh story"])
 if __name__ == "__main__":
    unittest.main()
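Review note: the two idempotency tests above describe a three-way decision on the existing post for a slug. A minimal sketch of that decision, assuming an exact string comparison on the `content` field (the real `publish_markdown` may normalize or hash the content instead); `decide_publish_action` is a hypothetical name.

```python
from typing import Optional


def decide_publish_action(existing_post: Optional[dict], markdown: str) -> str:
    """Hypothetical helper mirroring the outcomes asserted in the tests."""
    if existing_post is None:
        # Nothing published under this slug yet: safe to create and publish.
        return "create"
    if existing_post.get("content", "") == markdown:
        # Same slug, same content: a re-run, so skip without calling the client.
        return "already_published"
    # Same slug, different content: refuse rather than overwrite silently.
    return "blocked: slug_already_exists"
```

Blocking on a content mismatch keeps a retried run from overwriting a post that was edited or published by another run on the same day.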
--- a/tests/test_validate.py
+++ b/tests/test_validate.py
@@ -0,0 +1,14 @@
 import unittest
 from ai_daily_report.validate import validate_report_markdown
 class ValidateTests(unittest.TestCase):
    def test_validate_report_markdown_delegates_markdown_checks(self):
        report = validate_report_markdown("", [])
        self.assertIn("no_items", report["blocking_errors"])
 if __name__ == "__main__":
    unittest.main()
Author	SHA1	Message	Date
Ubuntu	2159ee733b	Improve AI daily report operations and dedupe observability	2026-06-10 21:55:29 +08:00
Mimikko-zeus	b46cef2c7b	Add Stage 2.8 recall, quality gate, retries, and publish idempotency	2026-06-10 21:31:13 +08:00
Mimikko-zeus	07786e3bc0	fix: add cross-day dedupe	2026-06-08 12:05:45 +08:00
Mimikko-zeus	2671aee850	Retry failed rewrite batches in smaller chunks	2026-06-04 17:42:08 +08:00
Mimikko-zeus	22cdd71a08	Improve LLM rewrite classification pipeline	2026-06-04 17:12:59 +08:00
Mimikko-zeus	dd12755ff1	Keep partial rewrite results from LLM batches	2026-06-04 16:51:12 +08:00
Mimikko-zeus	6eca615f42	Reduce LLM rewrite calls and add report intro conclusion	2026-06-04 16:41:05 +08:00
Mimikko-zeus	f7e4c9722b	Block publish when LLM rewrite quality degrades	2026-06-04 16:29:40 +08:00
Mimikko-zeus	5a98696255	Refactor AI daily report pipeline	2026-06-04 15:21:56 +08:00
@@ -0,0 +1,2 @@
 """Core package for the AI daily report pipeline."""
@@ -0,0 +1,2 @@
 """Source adapters for the AI daily report pipeline."""