Compare commits
9 Commits
94e18ce22d...main

| Author | SHA1 | Date |
|---|---|---|
| | 2159ee733b | |
| | b46cef2c7b | |
| | 07786e3bc0 | |
| | 2671aee850 | |
| | 22cdd71a08 | |
| | dd12755ff1 | |
| | 6eca615f42 | |
| | f7e4c9722b | |
| | 5a98696255 | |
9
.gitignore
vendored
Normal file
@@ -0,0 +1,9 @@
.env
.env.*
!.env.example
__pycache__/
*.py[cod]
.pytest_cache/
runs/
runs-*/
.idea/
144
.learnings/ERRORS.md
Normal file
@@ -0,0 +1,144 @@
## [ERR-20260606-001] computer_use_helper_startup

**Logged**: 2026-06-06T00:00:00+08:00
**Priority**: medium
**Status**: pending
**Area**: infra

### Summary
Computer Use helper failed during Windows automation startup.

### Error
```text
node_repl kernel exited unexpectedly
windows sandbox failed: spawn setup refresh
```

### Context
- Operation attempted: initialize Computer Use and list Windows apps.
- Retried after resetting the JavaScript session.
- Both attempts failed before any app automation actions were taken.

### Suggested Fix
Investigate the Computer Use Windows helper startup path and sandbox setup; retry after the helper/runtime is refreshed.

### Metadata
- Reproducible: yes
- Related Files: C:/Users/12256/.codex/plugins/cache/openai-bundled/computer-use/26.602.40724/scripts/computer-use-client.mjs

---

## [ERR-20260610-001] absolute_path_prefixed_with_workspace

**Logged**: 2026-06-10T00:00:00+08:00
**Priority**: low
**Status**: pending
**Area**: docs

### Summary
An absolute skill file path was accidentally prefixed with the current workspace path when verifying completion.

### Error
```text
Get-Content : Cannot find path 'E:\Codes\ai-daily-report\C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md'
```

### Context
- Operation attempted: read `C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md`.
- The command used a malformed literal path that concatenated the workspace root and the absolute path.
- Re-running with the actual absolute path succeeded.

### Suggested Fix
When reading skill files or other absolute Windows paths, pass the `C:\...` path directly and do not combine it with the workspace path.

### Metadata
- Reproducible: yes
- Related Files: C:\Users\12256\.codex\superpowers\skills\verification-before-completion\SKILL.md

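The failure mode in this entry can be reproduced, and guarded against, with `pathlib`. This is only a sketch: `resolve_target` is a hypothetical helper that does not exist in this repository, and `C:\Users\example\SKILL.md` is a shortened placeholder path.

```python
from pathlib import PureWindowsPath


def resolve_target(workspace: str, target: str) -> PureWindowsPath:
    """Return target unchanged when it is absolute; otherwise join it to workspace."""
    target_path = PureWindowsPath(target)
    if target_path.is_absolute():
        return target_path
    return PureWindowsPath(workspace) / target_path


workspace = r"E:\Codes\ai-daily-report"
absolute = r"C:\Users\example\SKILL.md"  # placeholder path for illustration

# Naive string concatenation reproduces the malformed path shape from the error.
malformed = workspace + "\\" + absolute

resolved = resolve_target(workspace, absolute)
relative = resolve_target(workspace, r"tests\test_stage2_dedupe.py")
```

`PureWindowsPath` is used so the sketch behaves the same on any operating system; it only manipulates path strings and never touches the filesystem.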
---

## [ERR-20260608-003] git_push_auth_failed

**Logged**: 2026-06-08T00:00:00+08:00
**Priority**: medium
**Status**: pending
**Area**: infra

### Summary
`git push origin main` failed because the Gitea remote rejected authentication.

### Error
```text
remote: Failed to authenticate user
fatal: Authentication failed for 'https://gitea.ephron.ren/Elaina/ai-daily-report.git/'
```

### Context
- Operation attempted: push committed cross-day dedupe fix to `origin/main`.
- Local commit exists: `07786e3 fix: add cross-day dedupe`.
- Test suite passed before commit: `79 passed`.

### Suggested Fix
Refresh Git credentials for `https://gitea.ephron.ren` or switch the remote to an authenticated SSH/HTTPS URL, then rerun `git push origin main`.

### Metadata
- Reproducible: yes
- Related Files: git remote origin

---

## [ERR-20260608-002] powershell_convertfromjson_mojibake

**Logged**: 2026-06-08T00:00:00+08:00
**Priority**: low
**Status**: pending
**Area**: tests

### Summary
PowerShell `ConvertFrom-Json` failed on a generated report containing existing mojibake section labels, while Python `json.loads` parsed the same report successfully.

### Error
```text
ConvertFrom-Json : Invalid object passed in, ':' or '}' expected.
```

### Context
- Operation attempted: verify CLI dry-run output by piping `run_report.json` through `ConvertFrom-Json`.
- Follow-up verification with Python `json.loads` succeeded and confirmed `stage2_5` and `stage8` fields.

### Suggested Fix
Use Python's JSON parser for verification in this repository when report content includes mojibake-rendered non-ASCII strings.

### Metadata
- Reproducible: yes
- Related Files: run_report.json

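The Python-side verification suggested in this entry might look like the sketch below. It is an assumption-laden illustration: `load_report` is a hypothetical helper, and the sample report is a stand-in with placeholder values, since the real `run_report.json` is not shown here. Only the `stage2_5` and `stage8` keys come from the entry itself.

```python
import json
import tempfile
from pathlib import Path


def load_report(path: Path) -> dict:
    """Parse a report file as UTF-8 JSON and require a top-level object."""
    value = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(value, dict):
        raise ValueError("run report must be a JSON object")
    return value


# Stand-in report with placeholder values; the real file has other content.
sample = {"stage2_5": {"status": "ok"}, "stage8": {"status": "skipped"}, "label": "模型与能力"}

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "run_report.json"
    # ensure_ascii=False keeps non-ASCII text as raw UTF-8 bytes in the file.
    report_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    report = load_report(report_path)

# Names of expected top-level fields that are absent from the parsed report.
missing = [key for key in ("stage2_5", "stage8") if key not in report]
```

Reading the file with an explicit `encoding="utf-8"` avoids depending on the console or platform default encoding.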
---

## [ERR-20260608-001] apply_patch_context_encoding

**Logged**: 2026-06-08T00:00:00+08:00
**Priority**: low
**Status**: pending
**Area**: tests

### Summary
`apply_patch` failed when matching context lines that contained mojibake-rendered Chinese text.

### Error
```text
apply_patch verification failed: Failed to find expected lines
```

### Context
- Operation attempted: update `tests/test_stage2_dedupe.py` with a patch anchored on displayed non-ASCII strings.
- The file content rendered differently enough that the expected context did not match.

### Suggested Fix
Use ASCII-only anchors, line-number inspection, or smaller structural context when patching files that contain mojibake-rendered non-ASCII text.

### Metadata
- Reproducible: yes
- Related Files: tests/test_stage2_dedupe.py

---
2
ai_daily_report/__init__.py
Normal file
@@ -0,0 +1,2 @@
"""Core package for the AI daily report pipeline."""

91
ai_daily_report/assemble.py
Normal file
@@ -0,0 +1,91 @@
from __future__ import annotations

import re
from typing import Any

from .classify import SECTION_ORDER
from .models import NewsItem
from .validate import validate_markdown


END_PUNCTUATION = "。!?;.!?;"


def _clean_text(text: str) -> str:
    value = re.sub(r"^```(?:\w+)?\s*\n?", "", (text or "").strip())
    value = re.sub(r"\n?```\s*$", "", value)
    value = re.sub(r"^\s*>\s*", "", value)
    value = re.sub(r"\[\d+\]|\[N\]", "", value)
    value = re.sub(r"主线判断[::]\s*", "", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value


def _ensure_sentence(text: str) -> str:
    value = _clean_text(text)
    if value and value[-1] not in END_PUNCTUATION:
        value += "。"
    return value


def _source_link(item: NewsItem) -> str:
    source = item.source_label or item.source_group or "来源"
    if item.url:
        return f"[{source} ↗]({item.url})"
    return source


def _fallback_intro(items: list[NewsItem]) -> str:
    count = len(items)
    return f"今天共聚合 {count} 条 AI 动态,覆盖模型能力、产品应用、基础设施、资本与治理等方向。"


def _fallback_conclusion(items: list[NewsItem]) -> str:
    sections = [section for section in SECTION_ORDER if any(item.section == section for item in items)]
    if sections:
        return "总体看,今日 AI 动态主要集中在" + "、".join(sections[:4]) + "等方向,后续仍需持续观察落地进展。"
    return "总体看,今日 AI 动态仍在持续演进,后续需要关注产品落地和生态变化。"


def assemble_markdown(items: list[NewsItem], guide: dict[str, Any] | None = None) -> tuple[str, dict[str, Any]]:
    guide = guide or {"intro": "", "theme": "", "threads": [], "conclusion": ""}
    lines: list[str] = []

    intro = _ensure_sentence(str(guide.get("intro") or "")) or _fallback_intro(items)
    lines.extend(["## 引言", "", f"> {intro}", ""])

    item_number = 1
    for section in SECTION_ORDER:
        section_items = [item for item in items if item.section == section]
        if not section_items:
            continue
        lines.extend([f"## {section}", ""])
        for item in section_items:
            title = _clean_text(item.title or item.title_raw)
            summary = _ensure_sentence(item.summary or item.summary_raw or "该条目暂无摘要。")
            lines.extend(
                [
                    f"**{item_number}. {title}**",
                    "",
                    f"> {summary}{_source_link(item)}",
                    "",
                ]
            )
            item_number += 1

    threads = guide.get("threads", []) or []
    if threads:
        lines.extend(["## 今日脉络", ""])
        for thread in threads:
            title = _clean_text(str(thread.get("title") or ""))
            text = _ensure_sentence(str(thread.get("text") or ""))
            if not title or not text:
                continue
            lines.extend([f"- **{title}**", f" {text}", ""])

    conclusion = _ensure_sentence(str(guide.get("conclusion") or "")) or _fallback_conclusion(items)
    lines.extend(["## 总结", "", f"> {conclusion}", ""])

    markdown = "\n".join(lines).strip()
    report = validate_markdown(markdown, items)
    return markdown, report
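The text normalization that `assemble_markdown` applies to LLM output can be tried in isolation. The sketch below re-declares simplified, standalone versions of the two helpers so it runs without the package. The names `clean_text` and `ensure_sentence`, the reduced set of substitutions, and the punctuation set are assumptions for illustration, not the module's exact behavior.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable
END_PUNCTUATION = "。!?;.!?;"  # assumed set of sentence-ending marks


def clean_text(text: str) -> str:
    """Strip a wrapping code fence and citation markers, then collapse whitespace."""
    value = re.sub(rf"^{FENCE}(?:\w+)?\s*\n?", "", (text or "").strip())  # opening fence
    value = re.sub(rf"\n?{FENCE}\s*$", "", value)  # closing fence
    value = re.sub(r"\[\d+\]|\[N\]", "", value)  # citation markers such as [1] or [N]
    return re.sub(r"\s+", " ", value).strip()


def ensure_sentence(text: str) -> str:
    """Append a full stop when cleaned text does not already end a sentence."""
    value = clean_text(text)
    if value and value[-1] not in END_PUNCTUATION:
        value += "。"
    return value


fenced = f"{FENCE}text\nhello [1]\n{FENCE}"
cleaned = ensure_sentence(fenced)
```

Because an empty string stays empty, a caller can chain `ensure_sentence(...) or fallback` to substitute default text, which is the pattern `assemble_markdown` uses for the intro and conclusion.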
89
ai_daily_report/audit.py
Normal file
@@ -0,0 +1,89 @@
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_run_report(path: Path) -> dict[str, Any] | None:
    report_path = path / "run_report.json" if path.is_dir() else path
    if not report_path.exists():
        return None
    try:
        value = json.loads(report_path.read_text(encoding="utf-8"))
    except Exception:
        return None
    return value if isinstance(value, dict) else None


def summarize_reports(out_dir: Path, *, limit_days: int = 7) -> dict[str, Any]:
    run_dirs = sorted([path for path in out_dir.iterdir() if path.is_dir()], reverse=True)[:limit_days]
    rows: list[dict[str, Any]] = []
    totals: dict[str, Any] = {
        "source_failures": 0,
        "duplicate_candidates": 0,
        "final_items": 0,
        "fallback_items": 0,
        "quality_warnings": 0,
        "quality_blocks": 0,
    }
    for run_dir in sorted(run_dirs):
        report = load_run_report(run_dir)
        if not report:
            continue
        quality_gate = report.get("quality_gate", {}) or {}
        stage2_8 = report.get("stage2_8", {}) or {}
        stage4 = report.get("stage4", {}) or {}
        stage5 = report.get("stage5", {}) or {}
        stage8 = report.get("stage8", {}) or {}
        fallback_count = int(stage4.get("fallback_count", stage4.get("fallback_item_count", 0)) or 0)
        final_count = int(stage5.get("output_count", stage4.get("output_count", 0)) or 0)
        source_failures = len(quality_gate.get("source_failures", []) or [])
        duplicate_candidates = int(stage2_8.get("candidate_group_count", 0) or 0)
        warnings = len(quality_gate.get("warnings", []) or [])
        blocks = len(quality_gate.get("blocking_errors", []) or [])
        row = {
            "date": run_dir.name,
            "source_failures": source_failures,
            "duplicate_candidates": duplicate_candidates,
            "final_items": final_count,
            "fallback_items": fallback_count,
            "fallback_ratio": round(fallback_count / final_count, 4) if final_count else 0,
            "quality_warnings": warnings,
            "quality_blocks": blocks,
            "publish_status": stage8.get("status"),
            "publish_slug": stage8.get("slug"),
        }
        rows.append(row)
        totals["source_failures"] += source_failures
        totals["duplicate_candidates"] += duplicate_candidates
        totals["final_items"] += final_count
        totals["fallback_items"] += fallback_count
        totals["quality_warnings"] += warnings
        totals["quality_blocks"] += blocks
    totals["fallback_ratio"] = round(totals["fallback_items"] / totals["final_items"], 4) if totals["final_items"] else 0
    return {"run_count": len(rows), "totals": totals, "runs": rows}


def render_markdown(summary: dict[str, Any]) -> str:
    totals = summary.get("totals", {})
    lines = [
        "# AI日报每周自动审计报告",
        "",
        f"- 覆盖运行数:{summary.get('run_count', 0)}",
        f"- 源失败次数:{totals.get('source_failures', 0)}",
        f"- 重复候选数:{totals.get('duplicate_candidates', 0)}",
        f"- 最终条数:{totals.get('final_items', 0)}",
        f"- fallback ratio:{totals.get('fallback_ratio', 0)}",
        f"- 质量门禁 warning/block:{totals.get('quality_warnings', 0)}/{totals.get('quality_blocks', 0)}",
        "",
        "| 日期 | 源失败 | 重复候选 | 最终条数 | fallback | warning | block | 发布 | slug |",
        "|---|---:|---:|---:|---:|---:|---:|---|---|",
    ]
    for row in summary.get("runs", []) or []:
        lines.append(
            f"| {row['date']} | {row['source_failures']} | {row['duplicate_candidates']} | "
            f"{row['final_items']} | {row['fallback_ratio']} | {row['quality_warnings']} | "
            f"{row['quality_blocks']} | {row.get('publish_status') or ''} | {row.get('publish_slug') or ''} |"
        )
    return "\n".join(lines) + "\n"
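`summarize_reports` computes the overall `fallback_ratio` from summed counts rather than by averaging the per-run ratios, so runs with more items weigh more. A small standalone sketch of the difference follows; `fallback_ratio` is a hypothetical helper here and the run counts are made up.

```python
def fallback_ratio(fallback_items: int, final_items: int) -> float:
    """Share of final items that came from fallback; 0.0 when there are no final items."""
    return round(fallback_items / final_items, 4) if final_items else 0.0


# Hypothetical per-run counts: (fallback_items, final_items).
runs = [(2, 10), (0, 30), (0, 0)]

per_run = [fallback_ratio(fallback, final) for fallback, final in runs]

# Pooled ratio: divide the summed counts, as the totals row does.
pooled = fallback_ratio(sum(f for f, _ in runs), sum(n for _, n in runs))

# Unweighted mean of per-run ratios, shown only for comparison.
mean_of_ratios = round(sum(per_run) / len(per_run), 4)
```

The guard on `final_items` matters for empty runs: without it the third run would raise `ZeroDivisionError`.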
162
ai_daily_report/candidate_recall.py
Normal file
@@ -0,0 +1,162 @@
from __future__ import annotations

import difflib
import re
from collections import defaultdict
from typing import Any

from .dedupe import _jaccard_similarity, _title_tokens
from .models import NewsItem


DEFAULT_CONFIG = {
    "enabled": True,
    "max_pairs": 80,
    "max_pairs_per_item": 5,
    "title_similarity_threshold": 0.45,
    "title_jaccard_threshold": 0.25,
    "summary_jaccard_threshold": 0.18,
    "strong_entity_overlap_threshold": 2,
}

STOP_ENTITIES = {
    "AI",
    "API",
    "CLI",
    "LLM",
    "Open Source",
    "GitHub",
    "Google",
    "OpenAI",
    "Anthropic",
    "Microsoft",
    "Meta",
    "Amazon",
    "NVIDIA",
}


def _config_value(config: dict[str, Any], name: str):
    return (config or {}).get(name, DEFAULT_CONFIG[name])


def _text_tokens(value: str) -> set[str]:
    return _title_tokens(value)


def _entity_tokens(value: str) -> set[str]:
    text = value or ""
    entities = set(re.findall(r"\b[A-Z][A-Za-z0-9]*(?:[- ][A-Z0-9][A-Za-z0-9]*)*\b", text))
    entities.update(re.findall(r"[\u4e00-\u9fffA-Za-z0-9]*[A-Za-z]+[0-9]+[A-Za-z0-9-]*", text))
    cleaned = {entity.strip() for entity in entities if len(entity.strip()) >= 3}
    return {entity for entity in cleaned if entity not in STOP_ENTITIES}


def _pair_key(item_ids: list[str]) -> frozenset[str]:
    return frozenset(item_ids)


def _candidate_score(left: NewsItem, right: NewsItem, config: dict[str, Any]) -> tuple[float, str, dict[str, Any]] | None:
    title_ratio = difflib.SequenceMatcher(None, left.title_norm, right.title_norm).ratio()
    title_jaccard = _jaccard_similarity(_text_tokens(left.title_norm), _text_tokens(right.title_norm))
    summary_jaccard = _jaccard_similarity(_text_tokens(left.summary_raw), _text_tokens(right.summary_raw))
    left_entities = _entity_tokens(f"{left.title_raw} {left.summary_raw}")
    right_entities = _entity_tokens(f"{right.title_raw} {right.summary_raw}")
    shared_entities = sorted(left_entities & right_entities)
    strong_entity_threshold = int(_config_value(config, "strong_entity_overlap_threshold"))

    if len(shared_entities) >= strong_entity_threshold and summary_jaccard > 0:
        score = min(1.0, 0.55 + len(shared_entities) * 0.1 + summary_jaccard * 0.35)
        return score, "strong_entity_overlap", {
            "shared_entities": shared_entities,
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }

    if title_ratio >= float(_config_value(config, "title_similarity_threshold")) and (
        title_jaccard >= float(_config_value(config, "title_jaccard_threshold"))
        or summary_jaccard >= float(_config_value(config, "summary_jaccard_threshold")) * 2
        or shared_entities
    ):
        return title_ratio, "title_similarity", {
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }

    if (
        title_jaccard >= float(_config_value(config, "title_jaccard_threshold"))
        and summary_jaccard >= float(_config_value(config, "summary_jaccard_threshold"))
    ):
        score = (title_jaccard + summary_jaccard) / 2
        return score, "title_summary_jaccard", {
            "title_similarity": round(title_ratio, 3),
            "title_jaccard": round(title_jaccard, 3),
            "summary_jaccard": round(summary_jaccard, 3),
        }

    return None


def recall_semantic_candidates(
    items: list[NewsItem],
    *,
    existing_candidates: list[dict[str, Any]] | None = None,
    config: dict[str, Any] | None = None,
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
    config = {**DEFAULT_CONFIG, **(config or {})}
    existing_candidates = list(existing_candidates or [])
    if not bool(config.get("enabled", True)):
        return existing_candidates, {
            "enabled": False,
            "input_count": len(items),
            "existing_candidate_group_count": len(existing_candidates),
            "added_candidate_group_count": 0,
            "candidate_group_count": len(existing_candidates),
            "candidates": existing_candidates,
        }

    existing_keys = {_pair_key(list(candidate.get("item_ids", []) or [])) for candidate in existing_candidates}
    pair_counts: defaultdict[str, int] = defaultdict(int)
    recalled: list[dict[str, Any]] = []

    for index, left in enumerate(items):
        for right in items[index + 1 :]:
            if pair_counts[left.id] >= int(config["max_pairs_per_item"]):
                continue
            if pair_counts[right.id] >= int(config["max_pairs_per_item"]):
                continue
            key = frozenset({left.id, right.id})
            if key in existing_keys:
                continue
            scored = _candidate_score(left, right, config)
            if scored is None:
                continue
            score, reason, evidence = scored
            recalled.append(
                {
                    "item_ids": [left.id, right.id],
                    "reason": reason,
                    "score": round(score, 3),
                    "confidence": "medium",
                    **evidence,
                }
            )
            pair_counts[left.id] += 1
            pair_counts[right.id] += 1
            if len(recalled) >= int(config["max_pairs"]):
                break
        if len(recalled) >= int(config["max_pairs"]):
            break

    candidates = existing_candidates + recalled
    report = {
        "enabled": True,
        "input_count": len(items),
        "existing_candidate_group_count": len(existing_candidates),
        "added_candidate_group_count": len(recalled),
        "candidate_group_count": len(candidates),
        "candidates": candidates,
    }
    return candidates, report
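The signals that `_candidate_score` combines can be explored with a standalone sketch. `_jaccard_similarity` and `_title_tokens` live in `dedupe.py`, which is not shown in this section, so the `jaccard` function and the whitespace tokenization below are assumptions standing in for them. The `strong_entity_score` formula is copied from the `strong_entity_overlap` branch above.

```python
import difflib


def jaccard(left: set[str], right: set[str]) -> float:
    """Intersection over union; 0.0 when both sets are empty (assumed convention)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def strong_entity_score(shared_entities: int, summary_jaccard: float) -> float:
    """Score for the strong_entity_overlap branch, capped at 1.0."""
    return min(1.0, 0.55 + shared_entities * 0.1 + summary_jaccard * 0.35)


# Made-up, already-normalized titles for illustration.
left_title = "openai releases gpt-5 api"
right_title = "gpt-5 api released by openai"

# Character-level similarity is sensitive to word order.
title_ratio = difflib.SequenceMatcher(None, left_title, right_title).ratio()

# Token-set similarity ignores word order entirely.
title_jaccard = jaccard(set(left_title.split()), set(right_title.split()))
```

Using both measures is a reasonable hedge: `SequenceMatcher` rewards long shared substrings, while Jaccard still matches when the same words are reordered.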
118
ai_daily_report/classify.py
Normal file
@@ -0,0 +1,118 @@
from __future__ import annotations

from collections import Counter
from typing import Any

from .models import NewsItem


SECTION_ORDER = [
    "模型与能力",
    "产品与应用",
    "开发与基础设施",
    "公司与资本",
    "政策与安全",
    "论文与研究",
    "观点与教程",
    "人物与动态",
]

SECTION_ALIASES = {
    "模型发布/更新": "模型与能力",
    "产品发布/更新": "产品与应用",
    "产品与工具": "产品与应用",
    "开发与工程": "开发与基础设施",
    "行业动态": "公司与资本",
    "行业与公司": "公司与资本",
    "论文研究": "论文与研究",
    "论文与研究": "论文与研究",
    "技巧与观点": "观点与教程",
    "观点与教程": "观点与教程",
    "人物与花絮": "人物与动态",
}


RULES = [
    ("政策与安全", ("监管", "政策", "安全", "风险", "滥用", "攻击", "合规", "版权")),
    ("论文与研究", ("论文", "研究", "arxiv", "cvpr", "benchmark", "评测", "实验")),
    ("开发与基础设施", ("sdk", "api", "mcp", "kubernetes", "框架", "开源", "github", "部署", "基础设施")),
    ("公司与资本", ("融资", "ipo", "上市", "招股书", "合作", "估值", "收购", "资本")),
    ("模型与能力", ("模型", "gpt", "claude", "gemini", "grok", "token", "参数", "多模态", "语音", "推理")),
    ("产品与应用", ("agent", "应用", "产品", "平台", "上线", "工具", "智能体")),
    ("观点与教程", ("教程", "观点", "方法论", "guide", "实践", "技巧")),
    ("人物与动态", ("黄仁勋", "纳德拉", "访谈", "演讲", "人物")),
]


def normalize_section_hint(section_hint: str) -> str:
    hint = (section_hint or "").strip()
    if hint in SECTION_ORDER:
        return hint
    return SECTION_ALIASES.get(hint, "")


def rule_classify(item: NewsItem) -> str:
    text = f"{item.title or item.title_raw} {item.summary or item.summary_raw}".lower()
    for section, keywords in RULES:
        if any(keyword.lower() in text for keyword in keywords):
            return section
    return "公司与资本"


def rank_score(item: NewsItem) -> int:
    text = f"{item.title or item.title_raw} {item.summary or item.summary_raw}"
    score = max(0, 200 - item.source_priority)
    if item.source_role == "primary":
        score += 10
    if item.canonical_url:
        score += 10
    if any(ch.isdigit() for ch in text):
        score += 10
    if item.duplicate_sources:
        score += min(20, len(item.duplicate_sources) * 5)
    score -= len(item.quality_flags) * 10
    return score


def classify_and_order_items(items: list[NewsItem]) -> tuple[list[NewsItem], dict[str, Any]]:
    llm_classified = 0
    hint_classified = 0
    rule_classified = 0
    invalid_llm_section_count = 0

    for item in items:
        if item.section:
            if item.section in SECTION_ORDER:
                llm_classified += 1
                continue
            invalid_llm_section_count += 1

        mapped = normalize_section_hint(item.section_hint)
        if mapped:
            item.section = mapped
            hint_classified += 1
        else:
            item.section = rule_classify(item)
            rule_classified += 1

    section_index = {section: index for index, section in enumerate(SECTION_ORDER)}
    ordered = sorted(
        items,
        key=lambda item: (
            section_index.get(item.section or "", len(SECTION_ORDER)),
            -rank_score(item),
            item.title or item.title_raw,
        ),
    )
    section_counts = Counter(item.section for item in ordered if item.section)
    report = {
        "input_count": len(items),
        "section_counts": dict(section_counts),
        "hint_classified": hint_classified,
        "rule_classified": rule_classified,
        "llm_classified": llm_classified,
        "fallback_classified": hint_classified + rule_classified,
        "invalid_llm_section_count": invalid_llm_section_count,
        "invalid_section_count": sum(1 for item in ordered if item.section not in SECTION_ORDER),
    }
    return ordered, report
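`rule_classify` returns the first section whose keywords appear in the text, so the order of `RULES` decides ties. The standalone sketch below uses a trimmed copy of the rule table and takes a plain string instead of a `NewsItem`; both simplifications are assumptions made so the example runs on its own.

```python
# Trimmed copy of the rule table; order matters because the first match wins.
RULES = [
    ("政策与安全", ("安全", "风险", "合规")),
    ("开发与基础设施", ("sdk", "api", "开源")),
    ("模型与能力", ("模型", "gpt", "推理")),
]
DEFAULT_SECTION = "公司与资本"


def rule_classify(text: str) -> str:
    """Return the first section with a keyword that occurs as a substring of text."""
    lowered = text.lower()
    for section, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return section
    return DEFAULT_SECTION


# Mentions both a model and a risk; the policy rule is listed first, so it wins.
both = rule_classify("GPT 模型存在安全风险")

# Substring matching also fires inside longer words: "rapid" contains "api".
accidental = rule_classify("Rapid growth")
```

The second case suggests that short ASCII keywords such as `api` may over-match in English text; whether that matters depends on the sources being classified.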
50
ai_daily_report/cli.py
Normal file
@@ -0,0 +1,50 @@
from __future__ import annotations

import argparse
from pathlib import Path

from .audit import render_markdown, summarize_reports
from .runner import run_daily_report


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ai-daily-report")
    subcommands = parser.add_subparsers(dest="command")
    run = subcommands.add_parser("run")
    run.add_argument("--date", default="today")
    run.add_argument("--mode", choices=["dry-run", "draft", "publish"], default="dry-run")
    run.add_argument("--source-mode", choices=["mock", "live"], default="mock")
    run.add_argument("--llm-mode", choices=["mock", "live"], default="mock")
    run.add_argument("--out-dir", default="runs")
    run.add_argument("--base-url", default="https://blog.ephron.ren")
    run.add_argument("--sources-path", default=None)
    run.add_argument("--pipeline-path", default=None)
    run.add_argument("--history-path", default=None)
    audit = subcommands.add_parser("audit")
    audit.add_argument("--out-dir", default=str(Path.home() / ".hermes" / "scripts" / "ai_morning_out"))
    audit.add_argument("--limit-days", type=int, default=7)
    return parser


def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command == "run":
        run_daily_report(
            run_date=args.date,
            mode=args.mode,
            source_mode=args.source_mode,
            llm_mode=args.llm_mode,
            out_dir=Path(args.out_dir),
            base_url=args.base_url,
            sources_path=Path(args.sources_path) if args.sources_path else None,
            pipeline_path=Path(args.pipeline_path) if args.pipeline_path else None,
            history_path=Path(args.history_path) if args.history_path else None,
        )
    elif args.command == "audit":
        print(render_markdown(summarize_reports(Path(args.out_dir), limit_days=args.limit_days)))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
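How the two subcommands parse can be checked without running the pipeline. The sketch below rebuilds a trimmed copy of the parser (only a few of the options above) so that it is self-contained; it is not a substitute for the real `build_parser`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Trimmed copy of the CLI parser, keeping a subset of its options."""
    parser = argparse.ArgumentParser(prog="ai-daily-report")
    subcommands = parser.add_subparsers(dest="command")
    run = subcommands.add_parser("run")
    run.add_argument("--date", default="today")
    run.add_argument("--mode", choices=["dry-run", "draft", "publish"], default="dry-run")
    run.add_argument("--source-mode", choices=["mock", "live"], default="mock")
    audit = subcommands.add_parser("audit")
    audit.add_argument("--limit-days", type=int, default=7)
    return parser


parser = build_parser()
run_args = parser.parse_args(["run", "--mode", "draft"])
audit_args = parser.parse_args(["audit", "--limit-days", "14"])

# Subparsers are optional by default, so an empty argv leaves command as None.
no_command = parser.parse_args([])
```

With the `main` shown above, an invocation without a subcommand matches neither branch and returns 0 silently. Passing `required=True` to `add_subparsers` would make argparse report an error instead, if that behavior is preferred.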
164
ai_daily_report/clients.py
Normal file
@@ -0,0 +1,164 @@
from __future__ import annotations

import json
import socket
import time
from dataclasses import dataclass
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
import urllib.request
from typing import Any


UA = "Mozilla/5.0 (compatible; ai-daily-report/1.0)"


@dataclass
class FetchTextError(Exception):
    error_type: str
    message: str
    http_status: int | None = None
    attempts: int = 1

    def __str__(self) -> str:
        return self.message


def _classify_fetch_exception(exc: Exception) -> tuple[str, int | None, bool]:
    if isinstance(exc, HTTPError):
        if exc.code == 404:
            return "http_404", exc.code, False
        if exc.code in {429, 500, 502, 503, 504}:
            return f"http_{exc.code}", exc.code, True
        return f"http_{exc.code}", exc.code, False
    if isinstance(exc, TimeoutError | socket.timeout):
        return "timeout", None, True
    if isinstance(exc, URLError):
        reason = exc.reason
        if isinstance(reason, TimeoutError | socket.timeout):
            return "timeout", None, True
        return "network_error", None, True
    return "fetch_error", None, False


def fetch_text(
    url: str,
    timeout_seconds: int,
    *,
    retries: int = 0,
    backoff_seconds: float = 0.5,
) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    attempts = max(1, retries + 1)
    last_error: FetchTextError | None = None
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
                return response.read().decode("utf-8", "ignore")
        except Exception as exc:
            error_type, http_status, retryable = _classify_fetch_exception(exc)
            last_error = FetchTextError(
                error_type=error_type,
                message=f"{type(exc).__name__}: {exc}",
                http_status=http_status,
                attempts=attempt,
            )
            if not retryable or attempt >= attempts:
                raise last_error from exc
            if backoff_seconds > 0:
                time.sleep(backoff_seconds * (2 ** (attempt - 1)))
    raise last_error or FetchTextError("fetch_error", "unknown fetch error", attempts=attempts)


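The retry schedule in `fetch_text` doubles the wait after each failed attempt. The sketch below only computes that schedule, without any network calls; `backoff_delays` is a hypothetical helper introduced for illustration and is not part of `clients.py`.

```python
def backoff_delays(retries: int, backoff_seconds: float = 0.5) -> list[float]:
    """Delays slept between attempts when every attempt fails with a retryable error."""
    attempts = max(1, retries + 1)
    # No sleep follows the final attempt, so there are attempts - 1 delays.
    return [backoff_seconds * (2 ** (attempt - 1)) for attempt in range(1, attempts)]


schedule = backoff_delays(3)
worst_case_wait = sum(schedule)
```

With the default `backoff_seconds`, three retries add up to a few seconds of sleeping in the worst case, on top of up to four request timeouts. Non-retryable errors such as `http_404` skip the schedule entirely because `fetch_text` raises immediately.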
class OpenAICompatibleClient:
    def __init__(self, *, api_key: str, base_url: str, model: str, timeout_seconds: int = 600):
        self.api_key = api_key
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.timeout_seconds = timeout_seconds

    def chat(self, prompt: str) -> str:
        payload = json.dumps(
            {
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.2,
                "max_tokens": 8000,
            },
            ensure_ascii=False,
        ).encode("utf-8")
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=payload,
            headers={"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=self.timeout_seconds) as response:
            data = json.loads(response.read().decode("utf-8"))
        return data["choices"][0]["message"]["content"].strip()


class BlogApiClient:
    def __init__(self, *, base_url: str, token: str, timeout_seconds: int = 25):
        self.base_url = base_url.rstrip("/")
        self.token = token
        self.timeout_seconds = timeout_seconds

    def _request(self, method: str, path: str, payload: dict[str, Any] | None = None) -> dict[str, Any]:
        data = None
        headers = {"Authorization": f"Bearer {self.token}", "User-Agent": UA}
        if payload is not None:
            data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
            headers["Content-Type"] = "application/json"
        req = urllib.request.Request(f"{self.base_url}{path}", data=data, headers=headers, method=method)
        with urllib.request.urlopen(req, timeout=self.timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))

    def create_post(self, payload: dict[str, Any]) -> dict[str, Any]:
        return self._request("POST", "/api/service/posts", payload)

|
||||||
|
def _normalize_post_response(self, value: Any, slug: str) -> dict[str, Any] | None:
|
||||||
|
if isinstance(value, dict):
|
||||||
|
if isinstance(value.get("post"), dict):
|
||||||
|
value = value["post"]
|
||||||
|
elif isinstance(value.get("data"), dict):
|
||||||
|
value = value["data"]
|
||||||
|
elif isinstance(value.get("items"), list):
|
||||||
|
for item in value["items"]:
|
||||||
|
if isinstance(item, dict) and item.get("slug") == slug:
|
||||||
|
return item
|
||||||
|
return None
|
||||||
|
if value.get("slug") == slug or value.get("id") or value.get("content") or value.get("markdown"):
|
||||||
|
return value
|
||||||
|
if isinstance(value, list):
|
||||||
|
for item in value:
|
||||||
|
if isinstance(item, dict) and item.get("slug") == slug:
|
||||||
|
return item
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _request_optional(self, method: str, path: str, payload: dict[str, Any] | None = None) -> dict[str, Any] | list[Any] | None:
|
||||||
|
try:
|
||||||
|
return self._request(method, path, payload)
|
||||||
|
except HTTPError as exc:
|
||||||
|
if exc.code in {403, 404}:
|
||||||
|
return None
|
||||||
|
raise
|
||||||
|
except FetchTextError as exc:
|
||||||
|
if exc.error_type in {"http_403", "http_404"}:
|
||||||
|
return None
|
||||||
|
raise
|
||||||
|
|
||||||
|
def get_post_by_slug(self, slug: str) -> dict[str, Any] | None:
|
||||||
|
paths = [
|
||||||
|
f"/api/service/posts/{slug}",
|
||||||
|
f"/api/service/posts?{urlencode({'slug': slug})}",
|
||||||
|
f"/api/service/posts/slug/{slug}",
|
||||||
|
]
|
||||||
|
for path in paths:
|
||||||
|
value = self._request_optional("GET", path)
|
||||||
|
post = self._normalize_post_response(value, slug)
|
||||||
|
if post is not None:
|
||||||
|
return post
|
||||||
|
return None
|
||||||
|
|
||||||
|
def publish_post(self, slug: str) -> None:
|
||||||
|
self._request("POST", f"/api/service/posts/{slug}/publish")
|
||||||
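Note on `fetch_text`: between retryable failures it sleeps `backoff_seconds * 2 ** (attempt - 1)` and never sleeps after the final attempt. The sketch below only illustrates the resulting delay schedule. `backoff_schedule` is a hypothetical helper written for this note and is not part of the change.

```python
def backoff_schedule(retries: int, backoff_seconds: float = 0.5) -> list[float]:
    """Delays slept before each retry, mirroring the formula in fetch_text.

    Hypothetical helper for illustration; it performs no network calls.
    """
    attempts = max(1, retries + 1)
    if backoff_seconds <= 0:
        # fetch_text skips time.sleep entirely when the base delay is not positive.
        return []
    # One delay per failed attempt except the last, which raises instead of sleeping.
    return [backoff_seconds * (2 ** (attempt - 1)) for attempt in range(1, attempts)]


print(backoff_schedule(3))  # delays for retries=3 with the default base of 0.5 s
```

With the defaults, the worst-case added wait grows geometrically with `retries`, which may be worth keeping in mind when setting per-source retry counts.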
114
ai_daily_report/collect.py
Normal file
@@ -0,0 +1,114 @@
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone
from time import perf_counter
from typing import Callable, Iterable, Any

from .clients import FetchTextError
from .models import SourceConfig, SourceResult


Fetcher = Callable[[SourceConfig, str], list[dict[str, Any]]]


def _status_from_exception(exc: Exception) -> str:
    if isinstance(exc, FetchTextError):
        return exc.error_type
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "error"


def _retry_count_from_exception(exc: Exception) -> int:
    if isinstance(exc, FetchTextError):
        return max(0, exc.attempts - 1)
    return 0


def _collect_one(config: SourceConfig, run_date: str, fetcher: Fetcher) -> SourceResult:
    fetched_at = datetime.now(timezone.utc).isoformat()
    if not config.enabled:
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=False,
            status="disabled",
            fetched_at=fetched_at,
            error=f"failure_policy={config.failure_policy}; min_items={config.min_items}",
        )

    started = perf_counter()
    try:
        items = fetcher(config, run_date)
        elapsed_ms = int((perf_counter() - started) * 1000)
        status = "ok" if items else "empty"
        if status == "ok" and config.min_items and len(items) < config.min_items:
            status = "below_min_items"
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=status == "ok",
            status=status,
            items=items,
            error=None if status == "ok" else f"items={len(items)}; min_items={config.min_items}; failure_policy={config.failure_policy}",
            elapsed_ms=elapsed_ms,
            fetched_at=fetched_at,
        )
    except Exception as exc:
        elapsed_ms = int((perf_counter() - started) * 1000)
        return SourceResult(
            source=config.name,
            role=config.role,
            ok=False,
            status=_status_from_exception(exc),
            error=f"{type(exc).__name__}: {exc}; failure_policy={config.failure_policy}; min_items={config.min_items}",
            elapsed_ms=elapsed_ms,
            retry_count=_retry_count_from_exception(exc),
            fetched_at=fetched_at,
        )


def collect_sources(
    configs: Iterable[SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    max_workers: int | None = None,
) -> tuple[list[SourceResult], dict[str, Any]]:
    ordered_configs = list(configs)
    if not ordered_configs:
        # Keep the same report shape as the non-empty path so callers can
        # read every key unconditionally.
        return [], {
            "input_source_count": 0,
            "ok_source_count": 0,
            "failed_source_count": 0,
            "raw_item_count": 0,
            "source_counts": {},
            "statuses": {},
            "error_types": {},
        }

    workers = max_workers or min(8, len(ordered_configs))
    result_by_name: dict[str, SourceResult] = {}

    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {
            executor.submit(_collect_one, config, run_date, fetcher): config
            for config in ordered_configs
        }
        for future in as_completed(futures):
            config = futures[future]
            result_by_name[config.name] = future.result()

    results = [result_by_name[config.name] for config in ordered_configs]
    report = {
        "input_source_count": len(results),
        "ok_source_count": sum(1 for result in results if result.ok),
        "failed_source_count": sum(1 for result in results if not result.ok),
        "raw_item_count": sum(len(result.items) for result in results),
        "source_counts": {result.source: len(result.items) for result in results},
        "statuses": {result.source: result.status for result in results},
        "error_types": {
            result.source: result.status
            for result in results
            if not result.ok and result.status != "disabled"
        },
    }
    return results, report
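Note on `collect_sources`: results arrive in completion order from `as_completed`, are stored by source name, and are then re-read in input order. A minimal sketch of that pattern follows, with a hypothetical `run_in_input_order` helper and plain strings standing in for `SourceConfig` objects. It assumes names are unique; duplicate names would overwrite each other in the lookup dict, which appears to hold for `result_by_name` as well.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_in_input_order(names: list[str], work: Callable[[str], str]) -> list[str]:
    """Run work(name) concurrently and return results in the order of names.

    Hypothetical helper for illustration; names are assumed to be unique.
    """
    results: dict[str, str] = {}
    # ThreadPoolExecutor rejects max_workers=0, so fall back to 1 for empty input.
    workers = min(8, len(names)) or 1
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(work, name): name for name in names}
        for future in as_completed(futures):
            # Completion order is arbitrary; the dict restores the association.
            results[futures[future]] = future.result()
    return [results[name] for name in names]


print(run_in_input_order(["b", "a", "c"], str.upper))
```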
28
ai_daily_report/config.py
Normal file
@@ -0,0 +1,28 @@
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

from .models import SourceConfig
from .pipeline import _source_config_from_dict


def load_json(path: Path) -> Any:
    return json.loads(path.read_text(encoding="utf-8"))


def load_source_configs(path: Path) -> list[SourceConfig]:
    raw = load_json(path)
    if not isinstance(raw, list):
        raise ValueError("sources config must be a list")
    return [_source_config_from_dict(item) for item in raw]


def load_pipeline_config(path: Path) -> dict[str, Any]:
    if not path.exists():
        return {}
    raw = load_json(path)
    if not isinstance(raw, dict):
        raise ValueError("pipeline config must be an object")
    return raw
182
ai_daily_report/dedupe.py
Normal file
@@ -0,0 +1,182 @@
from __future__ import annotations

import difflib
import re
from datetime import date, datetime
from typing import Any

from .models import NewsItem, PublishedUrlEntry, PublishedUrls


TITLE_SIMILARITY_THRESHOLD = 0.50
TOKEN_JACCARD_THRESHOLD = 0.40
TOKEN_EDIT_DISTANCE_THRESHOLD = 0.40


def _item_score(item: NewsItem) -> int:
    score = 0
    score += max(0, 200 - item.source_priority)
    if item.canonical_url:
        score += 20
    if item.summary_raw:
        score += min(40, len(item.summary_raw))
    if item.section_hint:
        score += 10
    if item.source_role == "primary":
        score += 10
    score -= len(item.quality_flags) * 10
    return score


def _merge_group(group: list[NewsItem], reason: str) -> tuple[NewsItem, list[NewsItem], dict[str, Any]]:
    keep = max(group, key=_item_score)
    removed = [item for item in group if item is not keep]
    for removed_item in removed:
        keep.duplicate_sources.append(
            {
                "id": removed_item.id,
                "source_group": removed_item.source_group,
                "source_label": removed_item.source_label,
                "url": removed_item.url,
                "reason": reason,
            }
        )
    report_group = {
        "reason": reason,
        "keep_id": keep.id,
        "removed_ids": [item.id for item in removed],
        "confidence": "high",
    }
    return keep, removed, report_group


def _group_by_key(items: list[NewsItem], key_name: str) -> dict[str, list[NewsItem]]:
    groups: dict[str, list[NewsItem]] = {}
    for item in items:
        key = getattr(item, key_name)
        if key:
            groups.setdefault(key, []).append(item)
    return {key: group for key, group in groups.items() if len(group) > 1}


def _title_tokens(value: str) -> set[str]:
    if not value:
        return set()
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", value.lower()))


def _jaccard_similarity(left: set[str], right: set[str]) -> float:
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def _possible_duplicates(items: list[NewsItem]) -> list[dict[str, Any]]:
    possible: list[dict[str, Any]] = []
    for index, left in enumerate(items):
        for right in items[index + 1 :]:
            if not left.title_norm or not right.title_norm:
                continue
            ratio = difflib.SequenceMatcher(None, left.title_norm, right.title_norm).ratio()
            jaccard = _jaccard_similarity(_title_tokens(left.title_norm), _title_tokens(right.title_norm))
            if ratio >= TITLE_SIMILARITY_THRESHOLD or (
                ratio >= TOKEN_EDIT_DISTANCE_THRESHOLD and jaccard >= TOKEN_JACCARD_THRESHOLD
            ):
                possible.append(
                    {
                        "item_ids": [left.id, right.id],
                        "reason": "title_similarity",
                        "similarity": round(ratio, 3),
                        "token_jaccard": round(jaccard, 3),
                        "confidence": "medium",
                    }
                )
    return possible


def hard_dedup_items(items: list[NewsItem]) -> tuple[list[NewsItem], dict[str, Any]]:
    remaining = list(items)
    removed_object_ids: set[int] = set()
    groups_report: list[dict[str, Any]] = []

    for key_name, reason in (
        ("canonical_url", "same_canonical_url"),
        ("title_norm", "same_title_norm"),
    ):
        grouped = _group_by_key([item for item in remaining if id(item) not in removed_object_ids], key_name)
        for group in grouped.values():
            active_group = [item for item in group if id(item) not in removed_object_ids]
            if len(active_group) < 2:
                continue
            keep, removed, report_group = _merge_group(active_group, reason)
            removed_object_ids.update(id(item) for item in removed)
            groups_report.append(report_group)

    deduped = [item for item in remaining if id(item) not in removed_object_ids]
    report = {
        "input_count": len(items),
        "output_count": len(deduped),
        "removed_count": len(removed_object_ids),
        "groups": groups_report,
        "possible_duplicates": _possible_duplicates(deduped),
    }
    return deduped, report


def _parse_date(value: str | None) -> date | None:
    if not value:
        return None
    text = value.strip()
    try:
        return date.fromisoformat(text[:10])
    except ValueError:
        try:
            return datetime.fromisoformat(text).date()
        except ValueError:
            return None


def _entry_within_window(entry: PublishedUrlEntry, *, run_date: str, max_age_days: int) -> bool:
    if max_age_days < 0:
        return True
    current = _parse_date(run_date)
    previous = _parse_date(entry.last_published) or _parse_date(entry.first_seen)
    if current is None or previous is None:
        return True
    return (current - previous).days <= max_age_days


def cross_day_dedup_items(
    items: list[NewsItem],
    published_urls: PublishedUrls | None,
    *,
    run_date: str,
    max_age_days: int = 7,
) -> tuple[list[NewsItem], dict[str, Any]]:
    history = published_urls or PublishedUrls()
    deduped: list[NewsItem] = []
    removed: list[dict[str, Any]] = []

    for item in items:
        entry = history.urls.get(item.canonical_url) if item.canonical_url else None
        if entry and _entry_within_window(entry, run_date=run_date, max_age_days=max_age_days):
            removed.append(
                {
                    "item_id": item.id,
                    "canonical_url": item.canonical_url,
                    "title": item.title or item.title_raw,
                    "first_seen": entry.first_seen,
                    "last_published": entry.last_published,
                }
            )
            continue
        deduped.append(item)

    report = {
        "input_count": len(items),
        "output_count": len(deduped),
        "removed_count": len(removed),
        "removed": removed,
        "max_age_days": max_age_days,
    }
    return deduped, report
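Note on `_possible_duplicates`: the token Jaccard score is computed on `title_norm`, and `normalize_title` in `normalize.py` removes whitespace and punctuation first. Read from the diff alone, that seems to mean a purely Latin-script title collapses into a single token, so its Jaccard score can only be 0 or 1, while CJK titles are still split per character. The sketch below copies the two regexes from this change into standalone functions to show the effect. The sample titles are made up, and this is an observation rather than a confirmed defect.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    # Same steps as normalize_title in normalize.py (copied for illustration).
    text = unicodedata.normalize("NFKC", title or "").lower()
    return re.sub(r"[^\w\u4e00-\u9fff]+", "", text)


def title_tokens(value: str) -> set[str]:
    # Same pattern as _title_tokens in dedupe.py (copied for illustration).
    if not value:
        return set()
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", value.lower()))


def jaccard(left: set[str], right: set[str]) -> float:
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


# Made-up sample titles: one English pair and one mixed Chinese pair.
english_a = title_tokens(normalize_title("OpenAI releases new model"))
english_b = title_tokens(normalize_title("OpenAI launches new model"))
chinese_a = title_tokens(normalize_title("OpenAI 发布新模型"))
chinese_b = title_tokens(normalize_title("OpenAI 推出新模型"))

print(english_a, jaccard(english_a, english_b))
print(sorted(chinese_a), jaccard(chinese_a, chinese_b))
```

If word-level overlap is wanted for English titles, one option would be to tokenize the cleaned title before whitespace is stripped; whether that matches the intended behaviour is for the author to decide.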
143
ai_daily_report/env.py
Normal file
@@ -0,0 +1,143 @@
from __future__ import annotations

import os
import json
from pathlib import Path


PROJECT_ROOT = Path(__file__).resolve().parents[1]


def read_env_file(env_path: Path) -> dict[str, str]:
    env: dict[str, str] = {}
    if not env_path.exists():
        return env
    text = env_path.read_text(encoding="utf-8", errors="ignore")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def load_env() -> dict[str, str]:
    env: dict[str, str] = {}
    env.update(read_env_file(PROJECT_ROOT / ".env"))
    env.update(read_env_file(Path.home() / ".hermes" / ".env"))
    env.update({key: value for key, value in os.environ.items() if value})
    return env


def first_env(env: dict[str, str], *names: str) -> str:
    for name in names:
        value = (env.get(name) or "").strip()
        if value:
            return value
    return ""


def _load_simple_yaml(path: Path) -> dict[str, object]:
    if not path.exists():
        return {}
    root: dict[str, object] = {}
    stack: list[tuple[int, dict[str, object]]] = [(-1, root)]
    for raw_line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        if not raw_line.strip() or raw_line.lstrip().startswith("#") or ":" not in raw_line:
            continue
        indent = len(raw_line) - len(raw_line.lstrip(" "))
        key, value = raw_line.strip().split(":", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        while stack and indent <= stack[-1][0]:
            stack.pop()
        current = stack[-1][1]
        if value:
            current[key] = value
        else:
            child: dict[str, object] = {}
            current[key] = child
            stack.append((indent, child))
    return root


def _env_with_hermes(env: dict[str, str], hermes_dir: Path) -> dict[str, str]:
    merged = dict(read_env_file(hermes_dir / ".env"))
    merged.update(env)
    return merged


def _provider_env_names(provider: str) -> tuple[str, str, str]:
    prefix = provider.upper().replace("-", "_")
    return f"{prefix}_API_KEY", f"{prefix}_BASE_URL", f"{prefix}_MODEL"


def _auth_json_key(env: dict[str, str], hermes_dir: Path, provider: str) -> str:
    auth_path = hermes_dir / "auth.json"
    if not auth_path.exists() or not provider:
        return ""
    try:
        auth = json.loads(auth_path.read_text(encoding="utf-8"))
    except Exception:
        return ""
    pool = auth.get("credential_pool", {}) or {}
    provider_keys = [provider, provider.replace("-", "_")]
    for key in provider_keys:
        creds = pool.get(key, []) or []
        if not creds:
            continue
        cred = creds[0]
        source = str(cred.get("source") or "")
        if source.startswith("env:"):
            resolved = first_env(env, source[4:])
            if resolved:
                return resolved
        token = str(cred.get("access_token") or "").strip()
        if token:
            return token
    return ""


def resolve_llm_config(env: dict[str, str], *, hermes_dir: Path | None = None) -> dict[str, str]:
    hermes_dir = hermes_dir or Path.home() / ".hermes"
    env = _env_with_hermes(env, hermes_dir)
    hermes_config = _load_simple_yaml(hermes_dir / "config.yaml")
    model_config = hermes_config.get("model", {}) if isinstance(hermes_config.get("model"), dict) else {}
    provider = str(model_config.get("provider") or "").strip()
    provider_key, provider_base_url, provider_model = _provider_env_names(provider) if provider else ("", "", "")

    api_key = first_env(env, "LLM_API_KEY")
    base_url = first_env(env, "LLM_BASE_URL")
    model = first_env(env, "LLM_MODEL")

    if not api_key and provider:
        api_key = first_env(env, provider_key) or _auth_json_key(env, hermes_dir, provider)
    if not base_url and provider:
        base_url = first_env(env, provider_base_url) or str(model_config.get("base_url") or "").strip()
    if not model and provider:
        model = first_env(env, provider_model) or str(model_config.get("default") or "").strip()

    if not api_key:
        api_key = first_env(env, "SUB2API_API_KEY", "XIAOMI_API_KEY", "OPENROUTER_API_KEY")
    if not base_url:
        base_url = first_env(env, "SUB2API_BASE_URL", "XIAOMI_BASE_URL", "OPENROUTER_BASE_URL")
    if not model:
        model = first_env(env, "SUB2API_MODEL", "XIAOMI_MODEL")

    missing = [
        name
        for name, value in (
            ("LLM_API_KEY", api_key),
            ("LLM_BASE_URL", base_url),
            ("LLM_MODEL", model),
        )
        if not value
    ]
    if missing:
        raise ValueError("missing_llm_config: " + ",".join(missing))
    return {"api_key": api_key, "base_url": base_url, "model": model}


def resolve_blog_token(env: dict[str, str]) -> str:
    return first_env(env, "BLOG_SERVICE_TOKEN", "EPHRON_SERVICE_TOKEN")
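Note on `load_env`: it layers three sources with `dict.update`, so later layers win: the project `.env`, then `~/.hermes/.env`, then non-empty process environment variables. The sketch below reproduces the line parsing of `read_env_file` on in-memory strings to show that precedence. `parse_env_text` and `layered` are hypothetical names, and the values and URL are placeholders.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines with the same rules as read_env_file, from a string."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def layered(*layers: dict[str, str]) -> dict[str, str]:
    """Merge layers left to right; a later layer overrides an earlier one."""
    merged: dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Placeholder contents standing in for the three real sources.
project = parse_env_text('LLM_MODEL="model-a"\n# comment\nLLM_BASE_URL=https://example.invalid/v1?x=1')
hermes = parse_env_text("LLM_MODEL='model-b'")
process = {"LLM_MODEL": "model-c"}

print(layered(project, hermes, process))
```

Note that `resolve_llm_config` later calls `_env_with_hermes`, which merges the other way round (the Hermes file first, the passed-in `env` on top), so values already present in `env` are not overridden at that stage.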
123
ai_daily_report/guide.py
Normal file
@@ -0,0 +1,123 @@
from __future__ import annotations

import json
import re
from typing import Any, Callable

from .llm import parse_json_object
from .models import NewsItem


GuideLlmCall = Callable[[str], str]


def _clean_text(text: str, limit: int | None = None) -> str:
    value = re.sub(r"^\s*>\s*", "", text or "").strip()
    value = re.sub(r"\[\d+\]|\[N\]", "", value)
    value = re.sub(r"\s+", " ", value).strip()
    if limit and len(value) > limit:
        value = value[:limit].rstrip()
    return value


def _build_prompt(items: list[NewsItem]) -> str:
    payload = {
        "task": (
            "Generate a concise Chinese AI daily report guide. Return JSON only. "
            "Do not use 强信号/中信号/待验证. Do not add facts. "
            "Write one opening intro, a short theme, 2-4 daily threads, and one closing conclusion. "
            "Every thread must reference existing item_ids."
        ),
        "items": [
            {
                "id": item.id,
                "title": item.title or item.title_raw,
                "summary": item.summary or item.summary_raw,
                "section": item.section,
                "source": item.source_label,
            }
            for item in items
        ],
        "output_schema": {
            "intro": "one opening paragraph under 160 Chinese characters",
            "theme": "one sentence under 120 Chinese characters",
            "threads": [
                {
                    "title": "thread title",
                    "text": "one or two sentences",
                    "item_ids": ["existing item id"],
                    "kind": "thread|uncertain",
                }
            ],
            "conclusion": "one closing paragraph under 180 Chinese characters",
        },
    }
    return json.dumps(payload, ensure_ascii=False)


def _empty_guide() -> dict[str, Any]:
    return {"intro": "", "theme": "", "threads": [], "conclusion": ""}


def generate_guide(
    items: list[NewsItem],
    *,
    llm_call: GuideLlmCall,
) -> tuple[dict[str, Any], dict[str, Any]]:
    if not items:
        return _empty_guide(), {
            "input_count": 0,
            "intro_present": False,
            "theme_present": False,
            "conclusion_present": False,
            "thread_count": 0,
            "dropped_thread_count": 0,
            "fallback_used": False,
            "errors": [],
        }

    try:
        obj = parse_json_object(llm_call(_build_prompt(items)))
    except Exception as exc:
        return _empty_guide(), {
            "input_count": len(items),
            "intro_present": False,
            "theme_present": False,
            "conclusion_present": False,
            "thread_count": 0,
            "dropped_thread_count": 0,
            "fallback_used": True,
            "errors": [f"{type(exc).__name__}: {exc}"],
        }

    valid_ids = {item.id for item in items}
    threads: list[dict[str, Any]] = []
    dropped = 0
    for thread in obj.get("threads", []) or []:
        # LLM output is untrusted: drop entries that are not JSON objects
        # instead of raising outside the try block above.
        if not isinstance(thread, dict):
            dropped += 1
            continue
        # "item_ids" may be missing or null, so fall back to an empty list.
        item_ids = [item_id for item_id in thread.get("item_ids") or [] if item_id in valid_ids]
        if not item_ids:
            dropped += 1
            continue
        title = _clean_text(str(thread.get("title") or ""), limit=80)
        text = _clean_text(str(thread.get("text") or ""), limit=220)
        if not title or not text:
            dropped += 1
            continue
        kind = thread.get("kind") if thread.get("kind") in ("thread", "uncertain") else "thread"
        threads.append({"title": title, "text": text, "item_ids": item_ids, "kind": kind})

    intro = _clean_text(str(obj.get("intro") or ""), limit=160)
    theme = _clean_text(str(obj.get("theme") or ""), limit=120)
    conclusion = _clean_text(str(obj.get("conclusion") or ""), limit=180)
    guide = {"intro": intro, "theme": theme, "threads": threads, "conclusion": conclusion}
    report = {
        "input_count": len(items),
        "intro_present": bool(intro),
        "theme_present": bool(theme),
        "conclusion_present": bool(conclusion),
        "thread_count": len(threads),
        "dropped_thread_count": dropped,
        "fallback_used": False,
        "errors": [],
    }
    return guide, report
18
ai_daily_report/llm.py
Normal file
@@ -0,0 +1,18 @@
from __future__ import annotations

import json
import re
from typing import Any, Callable


LlmCall = Callable[[str], str]


def parse_json_object(text: str) -> dict[str, Any]:
    text = re.sub(r"^```(?:json)?\s*\n?", "", text.strip())
    text = re.sub(r"\n?```\s*$", "", text)
    match = re.search(r"\{.*\}\s*$", text, re.S)
    if not match:
        raise ValueError("LLM output does not contain a JSON object")
    return json.loads(match.group(0))
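Note on `parse_json_object`: it strips one surrounding Markdown code fence, then takes everything from the first `{` to a closing `}` at the very end of the text. Leading chatter before the object is therefore tolerated, while trailing text after the final `}` is rejected. The sketch below repeats the function body from this change so it runs standalone; the sample inputs are made up.

```python
import json
import re
from typing import Any


def parse_json_object(text: str) -> dict[str, Any]:
    # Body copied from llm.py in this change so the example runs on its own.
    text = re.sub(r"^```(?:json)?\s*\n?", "", text.strip())
    text = re.sub(r"\n?```\s*$", "", text)
    match = re.search(r"\{.*\}\s*$", text, re.S)
    if not match:
        raise ValueError("LLM output does not contain a JSON object")
    return json.loads(match.group(0))


fenced = parse_json_object('```json\n{"a": 1}\n```')
prefixed = parse_json_object('Here you go: {"a": {"b": 2}}')

try:
    parse_json_object('{"a": 1} trailing words')
    trailing_rejected = False
except ValueError:
    # Raised because no "}" sits at the end of the text.
    trailing_rejected = True

print(fenced, prefixed, trailing_rejected)
```

Callers such as `generate_guide` already wrap the call in `try`/`except`, so a rejected response falls back to the empty guide rather than aborting the run.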
69
ai_daily_report/models.py
Normal file
@@ -0,0 +1,69 @@
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class SourceConfig:
    name: str
    type: str
    role: str = "supplement"
    priority: int = 100
    required: bool = False
    enabled: bool = True
    timeout_seconds: int = 25
    retries: int = 0
    min_items: int = 0
    url: str = ""
    max_item_age_days: int | None = None
    failure_policy: str = "warn"


@dataclass
class SourceResult:
    source: str
    role: str
    ok: bool
    status: str
    items: list[dict[str, Any]] = field(default_factory=list)
    error: str | None = None
    elapsed_ms: int = 0
    retry_count: int = 0
    fetched_at: str = ""


@dataclass
class NewsItem:
    id: str
    source_group: str
    source_label: str
    source_role: str
    source_priority: int
    title_raw: str
    title_norm: str
    summary_raw: str
    url: str
    canonical_url: str
    published_at: str | None = None
    collected_at: str = ""
    origin_type: str = ""
    section_hint: str = ""
    language_hint: str = ""
    title: str | None = None
    summary: str | None = None
    section: str | None = None
    quality_flags: list[str] = field(default_factory=list)
    duplicate_sources: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class PublishedUrlEntry:
    first_seen: str
    last_published: str
    titles: list[str] = field(default_factory=list)


@dataclass
class PublishedUrls:
    version: int = 1
    urls: dict[str, PublishedUrlEntry] = field(default_factory=dict)
    updated_at: str = ""
132
ai_daily_report/normalize.py
Normal file
@@ -0,0 +1,132 @@
from __future__ import annotations

import hashlib
import html
import re
import unicodedata
from collections import Counter
from datetime import datetime, timezone
from typing import Any
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

from .models import NewsItem, SourceResult


TRACKING_QUERY_PREFIXES = ("utm_",)
TRACKING_QUERY_KEYS = {"fbclid", "gclid", "spm", "from", "ref"}


def clean_text(value: str) -> str:
    text = html.unescape(value or "")
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def canonicalize_url(url: str) -> str:
    if not url:
        return ""
    parsed = urlparse(url.strip())
    scheme = (parsed.scheme or "https").lower()
    host = (parsed.netloc or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "twitter.com":
        host = "x.com"

    query = []
    for key, value in parse_qsl(parsed.query, keep_blank_values=True):
        key_lower = key.lower()
        if key_lower in TRACKING_QUERY_KEYS:
            continue
        if any(key_lower.startswith(prefix) for prefix in TRACKING_QUERY_PREFIXES):
            continue
        query.append((key, value))

    path = parsed.path or ""
    if len(path) > 1:
        path = path.rstrip("/")

    return urlunparse((scheme, host, path, "", urlencode(query), ""))


def normalize_title(title: str) -> str:
    text = unicodedata.normalize("NFKC", title or "").lower()
    text = re.sub(r"[^\w\u4e00-\u9fff]+", "", text)
    return text


def _item_id(canonical_url: str, source_group: str, title_norm: str, published_at: str | None) -> str:
    seed = canonical_url or "|".join([source_group, title_norm, published_at or ""])
    digest = hashlib.sha1(seed.encode("utf-8")).hexdigest()[:16]
    return f"item_{digest}"


def _quality_flags(title: str, summary: str, url: str) -> list[str]:
    flags: list[str] = []
    if not url:
        flags.append("missing_url")
    if not summary:
        flags.append("missing_summary")
    if len(normalize_title(title)) < 3:
        flags.append("short_title")
    return flags
|
||||||
|
|
||||||
|
def normalize_items(
    source_results: list[SourceResult],
    *,
    run_date: str,
    source_priorities: dict[str, int] | None = None,
) -> tuple[list[NewsItem], dict[str, Any]]:
    source_priorities = source_priorities or {}
    collected_at = datetime.now(timezone.utc).isoformat()
    items: list[NewsItem] = []
    flag_counts: Counter[str] = Counter()
    id_counts: Counter[str] = Counter()
    input_count = 0

    for source_result in source_results:
        for raw in source_result.items:
            input_count += 1
            title = clean_text(str(raw.get("title_raw") or raw.get("title") or ""))
            summary = clean_text(str(raw.get("summary_raw") or raw.get("summary") or ""))
            url = str(raw.get("url") or "").strip()
            canonical_url = canonicalize_url(url)
            title_norm = normalize_title(title)
            flags = _quality_flags(title, summary, canonical_url)
            flag_counts.update(flags)
            source_label = clean_text(str(raw.get("source_label") or source_result.source))
            published_at = raw.get("published_at")
            base_id = _item_id(canonical_url, source_result.source, title_norm, published_at)
            # Repeated base IDs get a numeric suffix (_2, _3, ...) so every item ID stays unique.
            id_counts[base_id] += 1
            item_id = base_id if id_counts[base_id] == 1 else f"{base_id}_{id_counts[base_id]}"

            items.append(
                NewsItem(
                    id=item_id,
                    source_group=source_result.source,
                    source_label=source_label,
                    source_role=source_result.role,
                    source_priority=source_priorities.get(source_result.source, 100),
                    title_raw=title,
                    title_norm=title_norm,
                    summary_raw=summary,
                    url=url,
                    canonical_url=canonical_url,
                    published_at=published_at,
                    collected_at=collected_at,
                    origin_type=str(raw.get("origin_type") or ""),
                    section_hint=str(raw.get("section_hint") or ""),
                    language_hint=str(raw.get("language_hint") or ""),
                    quality_flags=flags,
                )
            )

    report = {
        "run_date": run_date,
        "input_count": input_count,
        "output_count": len(items),
        "quality_flag_counts": dict(flag_counts),
    }
    return items, report
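The normalization rules in this file can be exercised in isolation. The following is a minimal, self-contained sketch that re-implements the URL canonicalization and ID derivation so the behavior can be checked without the package. The `_sketch` function names and the example URLs are illustrative assumptions, not part of `ai_daily_report`.

```python
import hashlib
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_QUERY_PREFIXES = ("utm_",)
TRACKING_QUERY_KEYS = {"fbclid", "gclid", "spm", "from", "ref"}


def canonicalize_url_sketch(url: str) -> str:
    """Hypothetical stand-alone copy of the canonicalization rules."""
    if not url:
        return ""
    parsed = urlparse(url.strip())
    scheme = (parsed.scheme or "https").lower()
    host = (parsed.netloc or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "twitter.com":
        host = "x.com"
    # Keep only non-tracking query parameters, in their original order.
    query = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if key.lower() not in TRACKING_QUERY_KEYS
        and not key.lower().startswith(TRACKING_QUERY_PREFIXES)
    ]
    path = parsed.path or ""
    if len(path) > 1:
        path = path.rstrip("/")
    return urlunparse((scheme, host, path, "", urlencode(query), ""))


def item_id_sketch(canonical_url: str, fallback_seed: str = "") -> str:
    """Hash the canonical URL (or a fallback seed) into a short stable ID."""
    seed = canonical_url or fallback_seed
    return "item_" + hashlib.sha1(seed.encode("utf-8")).hexdigest()[:16]


# Two spellings of the same link (both URLs are made up for illustration).
a = canonicalize_url_sketch("https://www.twitter.com/user/status/1/?utm_source=rss&id=7#top")
b = canonicalize_url_sketch("https://x.com/user/status/1?id=7&ref=home")

# Repeated base IDs receive a numeric suffix, mirroring normalize_items.
id_counts = Counter()
ids = []
for canonical in (a, b):
    base_id = item_id_sketch(canonical)
    id_counts[base_id] += 1
    ids.append(base_id if id_counts[base_id] == 1 else f"{base_id}_{id_counts[base_id]}")

print(a)
print(ids)
```

Both spellings collapse to one canonical URL, so they hash to the same base ID and the second one receives a `_2` suffix. Normalization itself keeps both items; removing such duplicates is presumably the job of the later dedup stages.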
54
ai_daily_report/observability.py
Normal file
@@ -0,0 +1,54 @@
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from typing import Any, Callable


def sha256_text(value: str) -> str:
    return hashlib.sha256((value or "").encode("utf-8")).hexdigest()


def truncate_text(value: str, limit: int = 500) -> str:
    text = value or ""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}…[truncated {len(text) - limit} chars]"


@dataclass
class LlmCallObserver:
    call: Callable[[str], str]
    stage: str
    records: list[dict[str, Any]] = field(default_factory=list)
    prompt_preview_chars: int = 500
    response_preview_chars: int = 500

    def __call__(self, prompt: str) -> str:
        response = self.call(prompt)
        self.records.append(
            {
                "stage": self.stage,
                "call_index": len(self.records) + 1,
                "prompt_hash": sha256_text(prompt),
                "response_hash": sha256_text(response),
                "prompt_chars": len(prompt or ""),
                "response_chars": len(response or ""),
                "prompt_preview": truncate_text(prompt, self.prompt_preview_chars),
                "response_preview": truncate_text(response, self.response_preview_chars),
            }
        )
        return response


def summarize_observed_calls(observers: list[LlmCallObserver]) -> dict[str, Any]:
    records: list[dict[str, Any]] = []
    by_stage: dict[str, int] = {}
    for observer in observers:
        records.extend(observer.records)
        by_stage[observer.stage] = by_stage.get(observer.stage, 0) + len(observer.records)
    return {
        "total_calls": len(records),
        "by_stage": by_stage,
        "records": records,
    }
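A minimal usage sketch of the observer pattern in this file, assuming a stub `fake_llm` callable in place of a real model client. `ObserverSketch` is a trimmed, hypothetical stand-in for `LlmCallObserver` that keeps only a few of the recorded fields, so the example runs without the package.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Callable


def truncate_text(value: str, limit: int = 500) -> str:
    text = value or ""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}…[truncated {len(text) - limit} chars]"


@dataclass
class ObserverSketch:
    """Trimmed stand-in for LlmCallObserver (illustrative only)."""

    call: Callable[[str], str]
    stage: str
    records: list[dict[str, Any]] = field(default_factory=list)
    preview_chars: int = 500

    def __call__(self, prompt: str) -> str:
        # Delegate to the wrapped callable, then record metadata about the call.
        response = self.call(prompt)
        self.records.append(
            {
                "stage": self.stage,
                "call_index": len(self.records) + 1,
                "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
                "response_preview": truncate_text(response, self.preview_chars),
            }
        )
        return response


def fake_llm(prompt: str) -> str:
    # Stub model: echoes the prompt in upper case.
    return prompt.upper()


observed = ObserverSketch(call=fake_llm, stage="rewrite", preview_chars=8)
first = observed("summarize this headline")
second = observed("ok")

# Per-stage call count, in the spirit of summarize_observed_calls.
by_stage = {observed.stage: len(observed.records)}
print(first, by_stage, observed.records[0]["response_preview"])
```

Because the wrapper is itself a `Callable[[str], str]`, it can be passed wherever a plain LLM callable is expected, and the records are collected as a side effect of normal use.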
386
ai_daily_report/pipeline.py
Normal file
@@ -0,0 +1,386 @@
from __future__ import annotations

from typing import Any

from .assemble import assemble_markdown
from .candidate_recall import recall_semantic_candidates
from .classify import classify_and_order_items
from .collect import Fetcher, collect_sources
from .dedupe import cross_day_dedup_items, hard_dedup_items
from .guide import GuideLlmCall, generate_guide
from .models import PublishedUrls, SourceConfig
from .normalize import normalize_items
from .publish import BlogClient, publish_markdown
from .quality_gate import evaluate_quality_gate
from .rewrite import RewriteLlmCall, rewrite_items
from .semantic_dedupe import SemanticLlmCall, semantic_dedup_items


def _source_config_from_dict(value: dict[str, Any]) -> SourceConfig:
    max_item_age_days = value.get("max_item_age_days")
    return SourceConfig(
        name=value["name"],
        type=value["type"],
        role=value.get("role", "supplement"),
        priority=int(value.get("priority", 100)),
        required=bool(value.get("required", False)),
        enabled=bool(value.get("enabled", True)),
        timeout_seconds=int(value.get("timeout_seconds", 25)),
        retries=int(value.get("retries", 0)),
        min_items=int(value.get("min_items", 0)),
        url=value.get("url", ""),
        max_item_age_days=int(max_item_age_days) if max_item_age_days is not None else None,
        failure_policy=str(value.get("failure_policy") or ("block" if bool(value.get("required", False)) else "warn")),
    )


def run_stage0_to_stage2(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
) -> dict[str, Any]:
    configs = [
        config if isinstance(config, SourceConfig) else _source_config_from_dict(config)
        for config in source_configs
    ]
    source_results, stage0_report = collect_sources(configs, run_date, fetcher=fetcher)
    source_priorities = {config.name: config.priority for config in configs}
    normalized_items, stage1_report = normalize_items(
        source_results,
        run_date=run_date,
        source_priorities=source_priorities,
    )
    deduped_items, stage2_report = hard_dedup_items(normalized_items)
    artifacts = {
        "stage0_sources": source_results,
        "stage1_items": normalized_items,
        "stage2_items": deduped_items,
    }
    return {
        "source_results": source_results,
        "items": deduped_items,
        "reports": {
            "stage0": stage0_report,
            "stage1": stage1_report,
            "stage2": stage2_report,
        },
        "artifacts": artifacts,
    }


def run_stage0_to_stage2_5(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
) -> dict[str, Any]:
    stage2_result = run_stage0_to_stage2(source_configs, run_date, fetcher=fetcher)
    if cross_day_dedup_enabled:
        items, stage2_5_report = cross_day_dedup_items(
            stage2_result["items"],
            published_urls,
            run_date=run_date,
            max_age_days=cross_day_dedup_max_age_days,
        )
    else:
        items = stage2_result["items"]
        stage2_5_report = {
            "input_count": len(items),
            "output_count": len(items),
            "removed_count": 0,
            "removed": [],
            "enabled": False,
            "max_age_days": cross_day_dedup_max_age_days,
        }
    reports = dict(stage2_result["reports"])
    stage2_5_report.setdefault("enabled", cross_day_dedup_enabled)
    reports["stage2_5"] = stage2_5_report
    artifacts = dict(stage2_result.get("artifacts", {}))
    artifacts["stage2_5_items"] = items
    return {
        "source_results": stage2_result["source_results"],
        "items": items,
        "reports": reports,
        "artifacts": artifacts,
    }


def run_stage0_to_stage4(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,  # not used by this function
) -> dict[str, Any]:
    stage2_5_result = run_stage0_to_stage2_5(
        source_configs,
        run_date,
        fetcher=fetcher,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
    )
    items = stage2_5_result["items"]
    # Keep only stage 2 duplicate candidates whose items all survived cross-day dedup.
    remaining_ids = {item.id for item in items}
    candidates = [
        candidate
        for candidate in stage2_5_result["reports"]["stage2"].get("possible_duplicates", [])
        if set(candidate.get("item_ids", [])).issubset(remaining_ids)
    ]
    candidates, stage2_8_report = recall_semantic_candidates(
        items,
        existing_candidates=candidates,
        config=semantic_candidate_recall_config,
    )
    semantic_items, stage3_report = semantic_dedup_items(
        items,
        candidates,
        llm_call=semantic_llm_call,
        max_deletion_ratio=semantic_dedup_max_deletion_ratio,
    )
    rewritten_items, stage4_report = rewrite_items(
        semantic_items,
        llm_call=rewrite_llm_call,
        batch_size=rewrite_batch_size,
    )
    reports = dict(stage2_5_result["reports"])
    reports["stage2_8"] = stage2_8_report
    reports["stage3"] = stage3_report
    reports["stage4"] = stage4_report
    artifacts = dict(stage2_5_result.get("artifacts", {}))
    artifacts["stage2_8_candidates"] = candidates
    artifacts["stage3_items"] = semantic_items
    artifacts["stage4_items"] = rewritten_items
    return {
        "source_results": stage2_5_result["source_results"],
        "items": rewritten_items,
        "reports": reports,
        "artifacts": artifacts,
    }


def run_stage0_to_stage5(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,  # not used or forwarded here
) -> dict[str, Any]:
    stage4_result = run_stage0_to_stage4(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    classified_items, stage5_report = classify_and_order_items(stage4_result["items"])
    reports = dict(stage4_result["reports"])
    reports["stage5"] = stage5_report
    return {
        "source_results": stage4_result["source_results"],
        "items": classified_items,
        "reports": reports,
        "artifacts": stage4_result.get("artifacts", {}),
    }


def run_stage0_to_stage6(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
) -> dict[str, Any]:
    stage5_result = run_stage0_to_stage5(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    guide, stage6_report = generate_guide(stage5_result["items"], llm_call=guide_llm_call)
    reports = dict(stage5_result["reports"])
    reports["stage6"] = stage6_report
    return {
        "source_results": stage5_result["source_results"],
        "items": stage5_result["items"],
        "guide": guide,
        "reports": reports,
        "artifacts": stage5_result.get("artifacts", {}),
    }


def run_stage0_to_stage7(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
) -> dict[str, Any]:
    stage6_result = run_stage0_to_stage6(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
    )
    markdown, stage7_report = assemble_markdown(stage6_result["items"], stage6_result["guide"])
    # Fold blocking errors from stages 3 to 6 into the stage 7 report,
    # which is the report publish_markdown later inspects.
    upstream_blocking_errors: list[str] = []
    for stage_name in ("stage3", "stage4", "stage5", "stage6"):
        for error in stage6_result["reports"].get(stage_name, {}).get("blocking_errors", []) or []:
            upstream_blocking_errors.append(str(error))
    if upstream_blocking_errors:
        existing_errors = list(stage7_report.get("blocking_errors", []) or [])
        stage7_report["blocking_errors"] = existing_errors + upstream_blocking_errors
    reports = dict(stage6_result["reports"])
    quality_gate_report = evaluate_quality_gate(
        stage6_result["items"],
        source_results=stage6_result["source_results"],
        reports=reports,
        config=quality_gate_config,
    )
    # Quality-gate blocking errors are appended to the same stage 7 list.
    if quality_gate_report.get("blocking_errors"):
        existing_errors = list(stage7_report.get("blocking_errors", []) or [])
        stage7_report["blocking_errors"] = existing_errors + list(quality_gate_report["blocking_errors"])
    reports["quality_gate"] = quality_gate_report
    reports["stage7"] = stage7_report
    artifacts = dict(stage6_result.get("artifacts", {}))
    artifacts["quality_gate"] = quality_gate_report
    return {
        "source_results": stage6_result["source_results"],
        "items": stage6_result["items"],
        "guide": stage6_result["guide"],
        "markdown": markdown,
        "reports": reports,
        "artifacts": artifacts,
    }


def run_stage0_to_stage8(
    source_configs: list[dict[str, Any] | SourceConfig],
    run_date: str,
    *,
    fetcher: Fetcher,
    semantic_llm_call: SemanticLlmCall,
    rewrite_llm_call: RewriteLlmCall,
    guide_llm_call: GuideLlmCall,
    mode: str,
    base_url: str,
    client: BlogClient | None,
    published_urls: PublishedUrls | None = None,
    cross_day_dedup_enabled: bool = True,
    cross_day_dedup_max_age_days: int = 7,
    semantic_dedup_max_deletion_ratio: float = 0.5,
    rewrite_batch_size: int = 30,
    semantic_candidate_recall_config: dict[str, Any] | None = None,
    quality_gate_config: dict[str, Any] | None = None,
    publish_idempotency_config: dict[str, Any] | None = None,
) -> dict[str, Any]:
    stage7_result = run_stage0_to_stage7(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_dedup_enabled,
        cross_day_dedup_max_age_days=cross_day_dedup_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
        quality_gate_config=quality_gate_config,
    )
    slug = f"ai-{run_date}"
    # Downgrade to draft or dry-run when a required source failed and the policy requests it.
    effective_mode = mode
    quality_gate_report = stage7_result["reports"].get("quality_gate", {}) or {}
    required_policy = str(quality_gate_report.get("required_source_failure_policy") or "block")
    if quality_gate_report.get("required_source_failures") and required_policy in {"draft", "dry_run"}:
        effective_mode = "dry-run" if required_policy == "dry_run" else "draft"

    publish_result = publish_markdown(
        title=f"AI日报 · {run_date}",  # "AI Daily Report · <run_date>"
        markdown=stage7_result["markdown"],
        tags=["AI日报", "AI资讯", "人工智能"],  # "AI Daily Report", "AI News", "Artificial Intelligence"
        slug=slug,
        base_url=base_url,
        mode=effective_mode,
        markdown_report=stage7_result["reports"]["stage7"],
        client=client,
        idempotency_config=publish_idempotency_config,
    )
    reports = dict(stage7_result["reports"])
    reports["stage8"] = {
        "requested_mode": mode,
        "mode": publish_result.mode,
        "status": publish_result.status,
        "slug": publish_result.slug,
        "blog_url": publish_result.blog_url,
        "public_ok": publish_result.public_ok,
        "error": publish_result.error,
    }
    return {
        "source_results": stage7_result["source_results"],
        "items": stage7_result["items"],
        "guide": stage7_result["guide"],
        "markdown": stage7_result["markdown"],
        "publish": publish_result,
        "reports": reports,
        "artifacts": stage7_result.get("artifacts", {}),
    }
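Two pieces of inline logic in this file decide whether and how a run gets published: the aggregation of blocking errors in `run_stage0_to_stage7` and the mode downgrade in `run_stage0_to_stage8`. The sketch below pulls both out into small functions so they can be tried on their own. The helper names and the sample report contents are assumptions for illustration and do not exist in the package.

```python
from typing import Any


def collect_upstream_blocking_errors(reports: dict[str, Any]) -> list[str]:
    """Gather blocking errors from stages 3 to 6 as strings."""
    errors: list[str] = []
    for stage_name in ("stage3", "stage4", "stage5", "stage6"):
        # A missing stage, a missing key, or an explicit None all count as "no errors".
        for error in reports.get(stage_name, {}).get("blocking_errors", []) or []:
            errors.append(str(error))
    return errors


def effective_publish_mode(requested_mode: str, quality_gate_report: dict[str, Any]) -> str:
    """Downgrade the mode when a required source failed and the policy asks for it."""
    policy = str(quality_gate_report.get("required_source_failure_policy") or "block")
    if quality_gate_report.get("required_source_failures") and policy in {"draft", "dry_run"}:
        return "dry-run" if policy == "dry_run" else "draft"
    return requested_mode


# Sample reports (contents are made up for illustration).
reports = {
    "stage3": {"blocking_errors": ["deletion_ratio_exceeded"]},
    "stage4": {"blocking_errors": None},
    "stage6": {},
}
gate = {
    "required_source_failure_policy": "draft",
    "required_source_failures": [{"source": "example_feed"}],
}

upstream = collect_upstream_blocking_errors(reports)
downgraded = effective_publish_mode("publish", gate)
unchanged = effective_publish_mode("publish", {"required_source_failures": []})
print(upstream, downgraded, unchanged)
```

Note that the policy value uses an underscore (`dry_run`) while the publish mode uses a hyphen (`dry-run`), so the downgrade step also translates between the two spellings.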
261
ai_daily_report/publish.py
Normal file
@@ -0,0 +1,261 @@
from __future__ import annotations

import json
import hashlib
from dataclasses import dataclass
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Any, Protocol

from .models import NewsItem, PublishedUrlEntry, PublishedUrls


@dataclass
class PublishResult:
    mode: str
    status: str
    slug: str
    blog_url: str
    public_ok: bool = False
    error: str | None = None


class BlogClient(Protocol):
    def get_post_by_slug(self, slug: str) -> dict[str, Any] | None:
        ...

    def create_post(self, payload: dict[str, Any]) -> dict[str, Any]:
        ...

    def publish_post(self, slug: str) -> None:
        ...


def _parse_date(value: str | None) -> date | None:
    if not value:
        return None
    text = value.strip()
    try:
        return date.fromisoformat(text[:10])
    except ValueError:
        try:
            return datetime.fromisoformat(text).date()
        except ValueError:
            return None


def _published_entry_from_dict(value: Any) -> PublishedUrlEntry | None:
    if not isinstance(value, dict):
        return None
    first_seen = str(value.get("first_seen") or "")
    last_published = str(value.get("last_published") or first_seen)
    titles = [str(title) for title in value.get("titles", []) or [] if str(title)]
    if not first_seen and not last_published:
        return None
    return PublishedUrlEntry(
        first_seen=first_seen or last_published,
        last_published=last_published or first_seen,
        titles=titles,
    )


def load_published_urls(path: Path) -> PublishedUrls:
    """Load the published-URL history; a missing or unreadable file yields an empty history."""
    if not path.exists():
        return PublishedUrls()
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except Exception:
        return PublishedUrls()
    if not isinstance(raw, dict):
        return PublishedUrls()

    raw_urls = raw.get("urls")
    if not isinstance(raw_urls, dict):
        # Tolerate a missing or malformed "urls" field instead of failing on .items().
        raw_urls = {}
    urls: dict[str, PublishedUrlEntry] = {}
    for canonical_url, value in raw_urls.items():
        if not canonical_url:
            continue
        entry = _published_entry_from_dict(value)
        if entry is not None:
            urls[str(canonical_url)] = entry
    return PublishedUrls(
        version=int(raw.get("version") or 1),
        urls=urls,
        updated_at=str(raw.get("updated_at") or ""),
    )


def _entry_within_window(entry: PublishedUrlEntry, *, run_date: str, max_age_days: int) -> bool:
    if max_age_days < 0:
        return True
    current = _parse_date(run_date)
    previous = _parse_date(entry.last_published) or _parse_date(entry.first_seen)
    if current is None or previous is None:
        return True
    return (current - previous).days <= max_age_days


def _published_urls_to_dict(history: PublishedUrls) -> dict[str, Any]:
    return {
        "version": history.version,
        "urls": {
            canonical_url: {
                "first_seen": entry.first_seen,
                "last_published": entry.last_published,
                "titles": entry.titles,
            }
            for canonical_url, entry in sorted(history.urls.items())
        },
        "updated_at": history.updated_at,
    }


def update_published_urls(
    path: Path,
    items: list[NewsItem],
    *,
    run_date: str,
    max_age_days: int = 7,
) -> PublishedUrls:
    history = load_published_urls(path)
    history.urls = {
        canonical_url: entry
        for canonical_url, entry in history.urls.items()
        if _entry_within_window(entry, run_date=run_date, max_age_days=max_age_days)
    }

    for item in items:
        if not item.canonical_url:
            continue
        title = item.title or item.title_raw
        entry = history.urls.get(item.canonical_url)
        if entry is None:
            entry = PublishedUrlEntry(
                first_seen=run_date,
                last_published=run_date,
                titles=[],
            )
            history.urls[item.canonical_url] = entry
        entry.last_published = run_date
        if title and title not in entry.titles:
            entry.titles.append(title)

    history.updated_at = datetime.now(timezone.utc).isoformat()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(_published_urls_to_dict(history), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return history


def dry_run_publish(slug: str, base_url: str) -> PublishResult:
    return PublishResult(
        mode="dry-run",
        status="ok",
        slug=slug,
        blog_url=f"{base_url.rstrip('/')}/posts/{slug}",
        public_ok=True,
    )


def _content_hash(value: str) -> str:
    return hashlib.sha256((value or "").encode("utf-8")).hexdigest()


def _get_existing_post(client: BlogClient, slug: str) -> dict[str, Any] | None:
    getter = getattr(client, "get_post_by_slug", None)
    if getter is None:
        return None
    existing = getter(slug)
    return existing if isinstance(existing, dict) else None


def publish_markdown(
    *,
    title: str,
    markdown: str,
    tags: list[str],
    slug: str,
    base_url: str,
    mode: str,
    markdown_report: dict[str, Any],
    client: BlogClient | None,
    idempotency_config: dict[str, Any] | None = None,
) -> PublishResult:
    blocking_errors = markdown_report.get("blocking_errors", []) or []
    blog_url = f"{base_url.rstrip('/')}/posts/{slug}"
    if blocking_errors:
        return PublishResult(
            mode=mode,
            status="blocked",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            # str() keeps the join safe if a report carries non-string errors.
            error=";".join(str(error) for error in blocking_errors),
        )
    if mode == "dry-run":
        return dry_run_publish(slug, base_url)
    if client is None:
        return PublishResult(
            mode=mode,
            status="failed",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            error="missing_blog_client",
        )

    # Optional idempotency check: look up the slug before creating a post.
    idempotency_config = idempotency_config or {}
    if bool(idempotency_config.get("enabled", False)):
        try:
            existing_post = _get_existing_post(client, slug)
        except Exception as exc:
            return PublishResult(
                mode=mode,
                status="failed",
                slug=slug,
                blog_url=blog_url,
                public_ok=False,
                error=f"idempotency_check_failed:{type(exc).__name__}: {exc}",
            )
        if existing_post is not None:
            existing_content = str(existing_post.get("content") or existing_post.get("markdown") or "")
            # Identical content means the post is already published; nothing to do.
            if _content_hash(existing_content) == _content_hash(markdown):
                return PublishResult(
                    mode=mode,
                    status="already_published",
                    slug=slug,
                    blog_url=blog_url,
                    public_ok=True,
                )
            # Different content under the same slug is blocked unless republishing is allowed.
            if not bool(idempotency_config.get("allow_republish", False)):
                return PublishResult(
                    mode=mode,
                    status="blocked",
                    slug=slug,
                    blog_url=blog_url,
                    public_ok=False,
                    error="slug_already_exists",
                )

    payload = {"title": title, "content": markdown, "tags": tags, "slug": slug}
    try:
        create_resp = client.create_post(payload)
        created_slug = create_resp.get("slug") or slug
        if mode == "publish":
            client.publish_post(created_slug)
        return PublishResult(
            mode=mode,
            status="ok",
            slug=created_slug,
            blog_url=f"{base_url.rstrip('/')}/posts/{created_slug}",
            public_ok=mode == "publish",
        )
    except Exception as exc:
        return PublishResult(
            mode=mode,
            status="failed",
            slug=slug,
            blog_url=blog_url,
            public_ok=False,
            error=f"{type(exc).__name__}: {exc}",
        )
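The idempotency branch of `publish_markdown` reduces to a three-way decision on the existing post. The following is a minimal sketch of that decision, assuming an in-memory client; `InMemoryBlogClient`, `idempotency_decision`, and the example slug are hypothetical and serve only to make the example runnable without a real blog API.

```python
import hashlib
from typing import Any, Optional


def content_hash(value: str) -> str:
    return hashlib.sha256((value or "").encode("utf-8")).hexdigest()


class InMemoryBlogClient:
    """Hypothetical stand-in providing the three BlogClient methods."""

    def __init__(self) -> None:
        self.posts: dict[str, dict[str, Any]] = {}
        self.published: set[str] = set()

    def get_post_by_slug(self, slug: str) -> Optional[dict[str, Any]]:
        return self.posts.get(slug)

    def create_post(self, payload: dict[str, Any]) -> dict[str, Any]:
        self.posts[payload["slug"]] = dict(payload)
        return {"slug": payload["slug"]}

    def publish_post(self, slug: str) -> None:
        self.published.add(slug)


def idempotency_decision(existing: Optional[dict[str, Any]], markdown: str, allow_republish: bool) -> str:
    """Return "create", "already_published", or "blocked" for a slug."""
    if existing is None:
        return "create"
    existing_content = str(existing.get("content") or existing.get("markdown") or "")
    if content_hash(existing_content) == content_hash(markdown):
        return "already_published"
    return "create" if allow_republish else "blocked"


client = InMemoryBlogClient()
slug = "ai-2026-06-10"  # example slug in the f"ai-{run_date}" shape
decisions = []

# 1) No post yet: creation proceeds.
decisions.append(idempotency_decision(client.get_post_by_slug(slug), "# v1", False))
client.create_post({"title": "t", "content": "# v1", "tags": [], "slug": slug})
# 2) Same content again: treated as already published.
decisions.append(idempotency_decision(client.get_post_by_slug(slug), "# v1", False))
# 3) Changed content without allow_republish: blocked.
decisions.append(idempotency_decision(client.get_post_by_slug(slug), "# v2", False))
# 4) Changed content with allow_republish: creation proceeds.
decisions.append(idempotency_decision(client.get_post_by_slug(slug), "# v2", True))
print(decisions)
```

Comparing content hashes rather than only checking for the slug lets a rerun of the same day's report succeed quietly, while a rerun that produced different Markdown is stopped unless republishing is explicitly allowed.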
98
ai_daily_report/quality_gate.py
Normal file
@@ -0,0 +1,98 @@
from __future__ import annotations

import difflib
from typing import Any

from .dedupe import _title_tokens
from .models import NewsItem, SourceResult


DEFAULT_CONFIG = {
    "required_source_failure_policy": "block",  # block | draft | dry_run | warn
    "block_on_required_source_failure": True,
    "warn_on_enabled_source_failure": True,
    "warn_when_stage3_candidates_zero_min_items": 30,
    "warn_on_final_title_similarity": 0.55,
    "warn_on_entity_frequency": 3,
    "required_sources": [],
}


def _config(config: dict[str, Any] | None) -> dict[str, Any]:
    return {**DEFAULT_CONFIG, **(config or {})}


def _source_failures(source_results: list[SourceResult]) -> list[dict[str, Any]]:
|
||||||
|
failures: list[dict[str, Any]] = []
|
||||||
|
for result in source_results:
|
||||||
|
if result.ok or result.status == "disabled":
|
||||||
|
continue
|
||||||
|
failures.append(
|
||||||
|
{
|
||||||
|
"source": result.source,
|
||||||
|
"role": result.role,
|
||||||
|
"status": result.status,
|
||||||
|
"error": result.error,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return failures
|
||||||
|
|
||||||
|
|
||||||
|
def _similar_title_warnings(items: list[NewsItem], threshold: float) -> list[str]:
|
||||||
|
warnings: list[str] = []
|
||||||
|
for index, left in enumerate(items):
|
||||||
|
left_title = left.title or left.title_raw
|
||||||
|
for right in items[index + 1 :]:
|
||||||
|
right_title = right.title or right.title_raw
|
||||||
|
if len(_title_tokens(left_title)) < 2 or len(_title_tokens(right_title)) < 2:
|
||||||
|
continue
|
||||||
|
ratio = difflib.SequenceMatcher(None, left_title.lower(), right_title.lower()).ratio()
|
||||||
|
if ratio >= threshold:
|
||||||
|
warnings.append(f"final_title_similarity:{left.id}:{right.id}:{ratio:.3f}")
|
||||||
|
return warnings
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate_quality_gate(
|
||||||
|
items: list[NewsItem],
|
||||||
|
*,
|
||||||
|
source_results: list[SourceResult],
|
||||||
|
reports: dict[str, Any],
|
||||||
|
config: dict[str, Any] | None = None,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
config = _config(config)
|
||||||
|
warnings: list[str] = []
|
||||||
|
blocking_errors: list[str] = []
|
||||||
|
|
||||||
|
stage3_report = reports.get("stage3", {}) or {}
|
||||||
|
min_items = int(config["warn_when_stage3_candidates_zero_min_items"])
|
||||||
|
if len(items) > min_items and int(stage3_report.get("candidate_group_count", 0)) == 0:
|
||||||
|
warnings.append("stage3_candidates_zero")
|
||||||
|
|
||||||
|
failures = _source_failures(source_results)
|
||||||
|
if bool(config["warn_on_enabled_source_failure"]):
|
||||||
|
for failure in failures:
|
||||||
|
warnings.append(f"enabled_source_failed:{failure['source']}:{failure['status']}")
|
||||||
|
|
||||||
|
required_sources = set(config.get("required_sources") or [])
|
||||||
|
required_failures = [failure for failure in failures if failure["source"] in required_sources]
|
||||||
|
policy = str(config.get("required_source_failure_policy") or "block")
|
||||||
|
if bool(config["block_on_required_source_failure"]) and policy == "block":
|
||||||
|
for failure in required_failures:
|
||||||
|
blocking_errors.append(f"required_source_failed:{failure['source']}:{failure['status']}")
|
||||||
|
elif required_failures:
|
||||||
|
for failure in required_failures:
|
||||||
|
warnings.append(f"required_source_failed:{failure['source']}:{failure['status']}:{policy}")
|
||||||
|
|
||||||
|
title_threshold = float(config["warn_on_final_title_similarity"])
|
||||||
|
if title_threshold > 0:
|
||||||
|
warnings.extend(_similar_title_warnings(items, title_threshold))
|
||||||
|
|
||||||
|
return {
|
||||||
|
"input_count": len(items),
|
||||||
|
"warnings": warnings,
|
||||||
|
"blocking_errors": blocking_errors,
|
||||||
|
"source_failures": failures,
|
||||||
|
"required_source_failures": required_failures,
|
||||||
|
"required_source_failure_policy": policy,
|
||||||
|
"quality_gate_failed": bool(blocking_errors),
|
||||||
|
}
|
||||||
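Review note on `quality_gate.py`: the title-similarity warning compares every pair of final titles with `difflib.SequenceMatcher` and flags pairs at or above the configured threshold (0.55 by default). The sketch below isolates that pairwise check so the threshold is easier to reason about. It is a minimal illustration rather than the module's code: `similar_title_pairs` and the sample titles are hypothetical, and the token-count filter from `_title_tokens` is left out.

```python
import difflib
from itertools import combinations


def similar_title_pairs(titles: list[str], threshold: float = 0.55) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for each pair of titles whose similarity reaches the threshold."""
    pairs: list[tuple[int, int, float]] = []
    for (i, left), (j, right) in combinations(enumerate(titles), 2):
        # Case-insensitive character-level similarity, as in the gate above.
        ratio = difflib.SequenceMatcher(None, left.lower(), right.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs


# Hypothetical titles: the first two describe the same release, the third is unrelated.
sample_titles = [
    "OpenAI releases GPT-5 API",
    "OpenAI releases the GPT-5 API today",
    "EU votes on chip export rules",
]
flagged = similar_title_pairs(sample_titles)
print(flagged)
```

Because the comparison is quadratic in the number of items, it is probably only suitable for the final, already-deduplicated list, which is where the gate applies it.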
192
ai_daily_report/rewrite.py
Normal file
@@ -0,0 +1,192 @@
from __future__ import annotations

import json
from typing import Any, Callable
from urllib.error import HTTPError

from .classify import SECTION_ORDER
from .llm import parse_json_object
from .models import NewsItem


RewriteLlmCall = Callable[[str], str]


def _chunks(items: list[NewsItem], size: int) -> list[list[NewsItem]]:
    return [items[index : index + size] for index in range(0, len(items), size)]


def _build_prompt(batch: list[NewsItem]) -> str:
    payload = {
        "task": (
            "For each AI news item, translate when needed, rewrite the title and summary into concise Chinese, "
            "and classify it into exactly one allowed section. Preserve brand/model/API names such as GPT-5, "
            "Codex, Gemini, Claude, API, MCP. Do not add facts."
        ),
        "allowed_sections": SECTION_ORDER,
        "section_guidance": {
            "模型与能力": "model releases, capability upgrades, modalities, context windows, inference, benchmarks tied to model ability",
            "产品与应用": "end-user products, apps, agents, workflows, product launches, practical business or consumer use cases",
            "开发与基础设施": "developer tools, APIs, SDKs, MCP, frameworks, deployment, chips, cloud, infra, open source engineering",
            "公司与资本": "company strategy, financing, IPO, acquisitions, partnerships, revenue, business competition",
            "政策与安全": "policy, regulation, safety, privacy, copyright, misuse, security incidents, governance",
            "论文与研究": "papers, academic research, arXiv, methods, experiments, datasets, evaluations",
            "观点与教程": "opinions, analysis, explainers, tutorials, guides, practices",
            "人物与动态": "people-focused interviews, speeches, career moves, public appearances",
        },
        "items": [
            {
                "id": item.id,
                "title_raw": item.title_raw,
                "summary_raw": item.summary_raw,
                "source": item.source_label,
                "language_hint": item.language_hint,
                "source_section_hint": item.section_hint,
            }
            for item in batch
        ],
        "output_schema": {
            "rewrites": [
                {
                    "id": "item id",
                    "title": "display title",
                    "summary": "display summary",
                    "section": "one allowed section",
                    "confidence": 0.0,
                    "flags": [],
                }
            ]
        },
    }
    return json.dumps(payload, ensure_ascii=False)


def _fallback(item: NewsItem) -> None:
    item.title = item.title_raw
    item.summary = item.summary_raw or "该条目暂无摘要。"


def _is_transient_llm_error(exc: Exception) -> bool:
    if isinstance(exc, TimeoutError):
        return True
    if isinstance(exc, HTTPError):
        return exc.code in {429, 500, 502, 503, 504}
    return False


def _apply_rewrite_results(batch: list[NewsItem], rewrites: list[Any]) -> tuple[int, int]:
    by_id = {item.id: item for item in batch}
    seen_ids: set[str] = set()
    section_count = 0
    for entry in rewrites:
        if not isinstance(entry, dict):
            continue
        item_id = entry.get("id")
        title = str(entry.get("title") or "").strip()
        summary = str(entry.get("summary") or "").strip()
        if item_id in by_id and title and summary:
            by_id[item_id].title = title
            by_id[item_id].summary = summary
            section = str(entry.get("section") or "").strip()
            if section in SECTION_ORDER:
                by_id[item_id].section = section
                section_count += 1
            seen_ids.add(item_id)
    return len(seen_ids), section_count


def _apply_rewrite_batch(batch: list[NewsItem], llm_call: RewriteLlmCall) -> tuple[int, int]:
    obj = parse_json_object(llm_call(_build_prompt(batch)))
    rewrites = obj.get("rewrites", [])
    if not isinstance(rewrites, list):
        raise ValueError("rewrites is not a list")
    return _apply_rewrite_results(batch, rewrites)


def rewrite_items(
    items: list[NewsItem],
    *,
    llm_call: RewriteLlmCall,
    batch_size: int = 30,
    retry_batch_size: int = 10,
    max_fallback_ratio: float = 0.2,
    retry_single_items: bool = False,
) -> tuple[list[NewsItem], dict[str, Any]]:
    rewritten_count = 0
    llm_section_count = 0
    fallback_count = 0
    missing_rewrite_count = 0
    batch_retry_count = 0
    errors: list[str] = []

    for batch in _chunks(items, max(1, batch_size)):
        try:
            batch_rewritten_count, batch_section_count = _apply_rewrite_batch(batch, llm_call)
            rewritten_count += batch_rewritten_count
            llm_section_count += batch_section_count
            for item in batch:
                if item.title is None or item.summary is None:
                    errors.append(f"missing_rewrite_for_item: {item.id}")
                    _fallback(item)
                    fallback_count += 1
                    missing_rewrite_count += 1
        except Exception as exc:
            errors.append(f"batch:{type(exc).__name__}: {exc}")
            if _is_transient_llm_error(exc):
                for item in batch:
                    _fallback(item)
                    fallback_count += 1
                continue
            if len(batch) > max(1, retry_batch_size):
                for retry_batch in _chunks(batch, max(1, retry_batch_size)):
                    batch_retry_count += 1
                    try:
                        retry_rewritten_count, retry_section_count = _apply_rewrite_batch(retry_batch, llm_call)
                        rewritten_count += retry_rewritten_count
                        llm_section_count += retry_section_count
                        for item in retry_batch:
                            if item.title is None or item.summary is None:
                                errors.append(f"missing_rewrite_for_item: {item.id}")
                                _fallback(item)
                                fallback_count += 1
                                missing_rewrite_count += 1
                    except Exception as retry_exc:
                        errors.append(f"batch_retry:{type(retry_exc).__name__}: {retry_exc}")
                        for item in retry_batch:
                            _fallback(item)
                            fallback_count += 1
                continue
            if not retry_single_items:
                for item in batch:
                    _fallback(item)
                    fallback_count += 1
                continue
            for item in batch:
                try:
                    item_rewritten_count, item_section_count = _apply_rewrite_batch([item], llm_call)
                    rewritten_count += item_rewritten_count
                    llm_section_count += item_section_count
                except Exception as item_exc:
                    errors.append(f"item:{item.id}:{type(item_exc).__name__}: {item_exc}")
                    _fallback(item)
                    fallback_count += 1

    fallback_ratio = fallback_count / len(items) if items else 0
    blocking_errors: list[str] = []
    if fallback_ratio > max_fallback_ratio:
        blocking_errors.append("rewrite_fallback_ratio_exceeded")

    report = {
        "input_count": len(items),
        "rewritten_count": rewritten_count,
        "llm_section_count": llm_section_count,
        "fallback_count": fallback_count,
        "missing_rewrite_count": missing_rewrite_count,
        "fallback_ratio": round(fallback_ratio, 4),
        "batch_count": len(_chunks(items, max(1, batch_size))),
        "batch_retry_count": batch_retry_count,
        "errors": errors,
        "blocking_errors": blocking_errors,
        "quality_gate_failed": bool(blocking_errors),
    }
    return items, report
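Review note on `rewrite.py`: when a batch fails for a non-transient reason, `rewrite_items` retries it in smaller sub-batches, falls back to the raw title and summary for whatever still fails, and then blocks the run if the fallback ratio exceeds `max_fallback_ratio`. The sketch below reproduces only that control flow on plain string ids. It is a simplified illustration under stated assumptions: `rewrite_with_retry`, `fallback_gate`, and `flaky_call` are hypothetical names, and the transient-error shortcut and the single-item retry path are omitted.

```python
from typing import Callable


def chunks(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    size = max(1, size)
    return [items[index : index + size] for index in range(0, len(items), size)]


def rewrite_with_retry(
    ids: list[str],
    call: Callable[[list[str]], list[str]],
    batch_size: int = 30,
    retry_batch_size: int = 10,
) -> tuple[list[str], list[str]]:
    """Try each batch once, then retry a failed batch in smaller sub-batches."""
    rewritten: list[str] = []
    fallen_back: list[str] = []
    for batch in chunks(ids, batch_size):
        try:
            rewritten.extend(call(batch))
        except ValueError:
            for small in chunks(batch, retry_batch_size):
                try:
                    rewritten.extend(call(small))
                except ValueError:
                    # Only the sub-batch that still fails keeps its raw text.
                    fallen_back.extend(small)
    return rewritten, fallen_back


def fallback_gate(total: int, fallback_count: int, max_fallback_ratio: float = 0.2) -> dict:
    """Block the run when too large a share of items fell back to raw text."""
    ratio = fallback_count / total if total else 0.0
    blocking = ["rewrite_fallback_ratio_exceeded"] if ratio > max_fallback_ratio else []
    return {
        "fallback_ratio": round(ratio, 4),
        "blocking_errors": blocking,
        "quality_gate_failed": bool(blocking),
    }


def flaky_call(batch: list[str]) -> list[str]:
    """Hypothetical LLM stand-in that rejects any batch containing item-7."""
    if "item-7" in batch:
        raise ValueError("malformed JSON for this batch")
    return batch


ids = [f"item-{n}" for n in range(12)]
rewritten, fallen_back = rewrite_with_retry(ids, flaky_call, batch_size=12, retry_batch_size=4)
report = fallback_gate(len(ids), len(fallen_back))
print(fallen_back, report)
```

Splitting before falling back limits the damage of one malformed response to a single sub-batch, at the cost of extra LLM calls.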
225
ai_daily_report/runner.py
Normal file
@@ -0,0 +1,225 @@
from __future__ import annotations

import json
from dataclasses import asdict, is_dataclass
from pathlib import Path
from typing import Any

from .clients import BlogApiClient, OpenAICompatibleClient, fetch_text as default_fetch_text
from .config import load_pipeline_config, load_source_configs
from .env import load_env, resolve_blog_token, resolve_llm_config
from .models import SourceConfig
from .observability import LlmCallObserver, summarize_observed_calls
from .pipeline import run_stage0_to_stage8
from .publish import load_published_urls, update_published_urls
from .sources.registry import get_source_fetcher


def _json_default(value: Any):
    if is_dataclass(value):
        return asdict(value)
    raise TypeError(f"Object is not JSON serializable: {type(value).__name__}")


def _mock_source_configs() -> list[SourceConfig]:
    return [SourceConfig(name="Mock AI HOT", type="mock", role="primary", priority=10)]


def _mock_fetcher(config: SourceConfig, run_date: str) -> list[dict[str, Any]]:
    return [
        {
            "title_raw": "GPT-5 API 发布",
            "summary_raw": "OpenAI 发布 GPT-5 API,用于本地 mock 测试。",
            "url": "https://example.com/gpt5",
            "source_label": "OpenAI:Blog",
            "section_hint": "模型发布/更新",
            "origin_type": "mock",
            "language_hint": "zh",
        }
    ]


def _mock_semantic_llm(prompt: str) -> str:
    return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []}, ensure_ascii=False)


def _mock_rewrite_llm(prompt: str) -> str:
    payload = json.loads(prompt)
    return json.dumps(
        {
            "rewrites": [
                {
                    "id": item["id"],
                    "title": item["title_raw"],
                    "summary": item["summary_raw"],
                    "flags": [],
                }
                for item in payload["items"]
            ]
        },
        ensure_ascii=False,
    )


def _mock_guide_llm(prompt: str) -> str:
    payload = json.loads(prompt)
    item_ids = [item["id"] for item in payload["items"][:3]]
    return json.dumps(
        {
            "intro": "本地 mock 模式已生成 AI 日报,用于验证流水线。",
            "theme": "本地 mock 模式已生成 AI 日报,用于验证流水线。",
            "threads": [
                {
                    "title": "本地链路验证",
                    "text": "采集、改写、分类、导览、Markdown 和发布报告都已通过 mock 数据串联。",
                    "item_ids": item_ids,
                    "kind": "thread",
                }
            ],
            "conclusion": "本地 mock 结果可用于确认定时任务入口和文件输出是否正常。",
        },
        ensure_ascii=False,
    )


def run_daily_report(
    *,
    run_date: str,
    mode: str,
    source_mode: str,
    llm_mode: str,
    out_dir: Path,
    base_url: str,
    sources_path: Path | None = None,
    pipeline_path: Path | None = None,
    history_path: Path | None = None,
    fetch_text=None,
    env: dict[str, str] | None = None,
    llm_client_factory=OpenAICompatibleClient,
    blog_client_factory=BlogApiClient,
) -> dict[str, Any]:
    fetch_text = fetch_text or default_fetch_text
    env = env if env is not None else load_env()
    pipeline_config_path = pipeline_path or Path("config") / "pipeline.json"
    pipeline_config = load_pipeline_config(pipeline_config_path)
    cross_day_config = pipeline_config.get("cross_day_dedup", {}) or {}
    cross_day_enabled = bool(cross_day_config.get("enabled", True))
    cross_day_max_age_days = int(cross_day_config.get("max_age_days", 7))
    semantic_dedup_max_deletion_ratio = float(pipeline_config.get("semantic_dedup_max_deletion_ratio", 0.5))
    rewrite_batch_size = int(pipeline_config.get("rewrite_batch_size", 30))
    semantic_candidate_recall_config = pipeline_config.get("semantic_candidate_recall", {}) or {}
    quality_gate_config = pipeline_config.get("quality_gate", {}) or {}
    publish_idempotency_config = pipeline_config.get("publish_idempotency", {}) or {}
    configured_history_path = history_path or Path(
        str(cross_day_config.get("history_path") or "~/.hermes/scripts/ai_morning_out/published_urls.json")
    ).expanduser()
    published_urls = load_published_urls(configured_history_path) if cross_day_enabled else None

    if source_mode == "mock":
        source_configs = _mock_source_configs()
        fetcher = _mock_fetcher
    elif source_mode == "live":
        if sources_path is None:
            sources_path = Path("config") / "sources.json"
        source_configs = load_source_configs(sources_path)

        def fetcher(config: SourceConfig, current_date: str) -> list[dict[str, Any]]:
            source_fetcher = get_source_fetcher(config.type)
            def configured_fetch_text(url: str, timeout_seconds: int) -> str:
                try:
                    return fetch_text(url, timeout_seconds, retries=config.retries)
                except TypeError:
                    return fetch_text(url, timeout_seconds)

            return source_fetcher(config, current_date, configured_fetch_text)

    else:
        raise ValueError("source_mode must be 'mock' or 'live'")

    llm_observability_config = pipeline_config.get("llm_observability", {}) or {}
    llm_observers: list[LlmCallObserver] = []
    observe_llm = bool(llm_observability_config.get("enabled", True))
    prompt_preview_chars = int(llm_observability_config.get("prompt_preview_chars", 500))
    response_preview_chars = int(llm_observability_config.get("response_preview_chars", 500))

    def maybe_observe(stage: str, call):
        if not observe_llm:
            return call
        observer = LlmCallObserver(
            call=call,
            stage=stage,
            prompt_preview_chars=prompt_preview_chars,
            response_preview_chars=response_preview_chars,
        )
        llm_observers.append(observer)
        return observer

    if llm_mode == "mock":
        semantic_llm_call = maybe_observe("stage3", _mock_semantic_llm)
        rewrite_llm_call = maybe_observe("stage4", _mock_rewrite_llm)
        guide_llm_call = maybe_observe("stage6", _mock_guide_llm)
    elif llm_mode == "live":
        llm_client = llm_client_factory(**resolve_llm_config(env))
        semantic_llm_call = maybe_observe("stage3", llm_client.chat)
        rewrite_llm_call = maybe_observe("stage4", llm_client.chat)
        guide_llm_call = maybe_observe("stage6", llm_client.chat)
    else:
        raise ValueError("llm_mode must be 'mock' or 'live'")

    blog_client = None
    if mode in ("draft", "publish"):
        token = resolve_blog_token(env)
        if not token:
            raise ValueError("missing_blog_token: set BLOG_SERVICE_TOKEN or EPHRON_SERVICE_TOKEN")
        blog_client = blog_client_factory(base_url=base_url, token=token)

    result = run_stage0_to_stage8(
        source_configs,
        run_date,
        fetcher=fetcher,
        semantic_llm_call=semantic_llm_call,
        rewrite_llm_call=rewrite_llm_call,
        guide_llm_call=guide_llm_call,
        mode=mode,
        base_url=base_url,
        client=blog_client,
        published_urls=published_urls,
        cross_day_dedup_enabled=cross_day_enabled,
        cross_day_dedup_max_age_days=cross_day_max_age_days,
        semantic_dedup_max_deletion_ratio=semantic_dedup_max_deletion_ratio,
        rewrite_batch_size=rewrite_batch_size,
        semantic_candidate_recall_config=semantic_candidate_recall_config,
        quality_gate_config=quality_gate_config,
        publish_idempotency_config=publish_idempotency_config,
    )

    if cross_day_enabled and result["publish"].mode == "publish" and result["publish"].status == "ok":
        update_published_urls(
            configured_history_path,
            result["items"],
            run_date=run_date,
            max_age_days=cross_day_max_age_days,
        )

    llm_observability_report = summarize_observed_calls(llm_observers)
    result["reports"]["llm_observability"] = llm_observability_report

    run_dir = out_dir / run_date
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "blog_markdown.md").write_text(result["markdown"], encoding="utf-8")
    (run_dir / "run_report.json").write_text(
        json.dumps(result["reports"], ensure_ascii=False, indent=2, default=_json_default),
        encoding="utf-8",
    )
    for artifact_name, artifact_value in result.get("artifacts", {}).items():
        (run_dir / f"{artifact_name}.json").write_text(
            json.dumps(artifact_value, ensure_ascii=False, indent=2, default=_json_default),
            encoding="utf-8",
        )
    return {
        "run_dir": str(run_dir),
        "markdown": result["markdown"],
        "reports": result["reports"],
        "publish": result["publish"],
        "artifacts": result.get("artifacts", {}),
    }
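Review note on `runner.py`: `configured_fetch_text` passes `retries=` and falls back to the two-argument call on `TypeError`. That also catches a `TypeError` raised inside `fetch_text` itself, in which case the request is silently repeated without retries. One possible alternative, sketched below, is to inspect the callable's signature once and bind the keyword only when it is accepted. This is a suggestion rather than the commit's behaviour; `accepts_keyword`, `bind_retries`, and the two sample fetchers are hypothetical.

```python
import inspect
from typing import Callable

FetchText = Callable[[str, int], str]


def accepts_keyword(func: Callable[..., str], name: str) -> bool:
    """Return True if `func` can be called with `name=` as a keyword argument."""
    try:
        parameters = inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some builtins expose no signature; treat them as not accepting it.
        return False
    for parameter in parameters.values():
        if parameter.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if parameter.name == name and parameter.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            return True
    return False


def bind_retries(fetch_text: Callable[..., str], retries: int) -> FetchText:
    """Wrap `fetch_text` as a two-argument callable, forwarding retries if supported."""
    if accepts_keyword(fetch_text, "retries"):
        return lambda url, timeout_seconds: fetch_text(url, timeout_seconds, retries=retries)
    return lambda url, timeout_seconds: fetch_text(url, timeout_seconds)


def fetch_with_retries(url: str, timeout_seconds: int, retries: int = 0) -> str:
    """Hypothetical fetcher that understands the retries keyword."""
    return f"{url}|{timeout_seconds}|{retries}"


def fetch_plain(url: str, timeout_seconds: int) -> str:
    """Hypothetical test double without a retries parameter."""
    return f"{url}|{timeout_seconds}"


with_retries = bind_retries(fetch_with_retries, 3)("https://example.com", 10)
without_retries = bind_retries(fetch_plain, 3)("https://example.com", 10)
print(with_retries, without_retries)
```

The signature check runs once per source instead of on every request, and errors raised while fetching propagate unchanged.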
224
ai_daily_report/semantic_dedupe.py
Normal file
@@ -0,0 +1,224 @@
from __future__ import annotations

import json
from typing import Any, Callable

from .llm import parse_json_object
from .models import NewsItem


SemanticLlmCall = Callable[[str], str]


def _build_prompt(items: list[NewsItem], candidates: list[dict[str, Any]]) -> str:
    item_payload = [
        {
            "id": item.id,
            "title": item.title or item.title_raw,
            "summary": item.summary or item.summary_raw,
            "source": item.source_label,
            "section_hint": item.section_hint,
        }
        for item in items
    ]
    prompt = {
        "task": "Identify only high-confidence semantic duplicates. Do not curate or remove by importance.",
        "items": item_payload,
        "candidates": candidates,
        "dedupe_policy": [
            "Use duplicate_groups only when items are substantially the same article/event and one can be removed.",
            "Use merge_groups when items cover the same concrete event from different angles; keep the best item and attach the others as supplementary sources instead of dropping the event context.",
            "Do not curate by importance. Do not merge unrelated follow-ups just because they mention the same company/model.",
        ],
        "output_schema": {
            "duplicate_groups": [
                {
                    "keep_id": "item id",
                    "remove_ids": ["item id"],
                    "confidence": "high|medium|low",
                    "reason": "same concrete event reason",
                }
            ],
            "merge_groups": [
                {
                    "keep_id": "item id",
                    "merge_ids": ["item id"],
                    "confidence": "high|medium|low",
                    "reason": "same event, complementary angle/source",
                }
            ],
            "not_duplicates": [],
            "uncertain": [],
        },
    }
    return json.dumps(prompt, ensure_ascii=False)


def _score(item: NewsItem) -> int:
    score = max(0, 200 - item.source_priority)
    if item.source_role == "primary":
        score += 10
    if item.summary_raw:
        score += min(40, len(item.summary_raw))
    if item.canonical_url:
        score += 20
    score -= len(item.quality_flags) * 10
    return score


def _choose_keep(group_items: list[NewsItem], suggested_keep_id: str) -> NewsItem:
    suggested = [item for item in group_items if item.id == suggested_keep_id]
    if suggested:
        best = max(group_items, key=_score)
        if _score(suggested[0]) >= _score(best) - 10:
            return suggested[0]
    return max(group_items, key=_score)


def semantic_dedup_items(
    items: list[NewsItem],
    candidates: list[dict[str, Any]],
    *,
    llm_call: SemanticLlmCall,
    max_deletion_ratio: float = 0.5,
) -> tuple[list[NewsItem], dict[str, Any]]:
    if not items or not candidates:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": [],
            "merge_groups": [],
            "uncertain": [],
            "errors": [],
            "skipped_for_deletion_ratio": False,
        }

    errors: list[str] = []
    try:
        obj = parse_json_object(llm_call(_build_prompt(items, candidates)))
    except Exception as exc:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": [],
            "merge_groups": [],
            "uncertain": [],
            "errors": [f"{type(exc).__name__}: {exc}"],
            "skipped_for_deletion_ratio": False,
        }

    by_id = {item.id: item for item in items}
    candidate_sets = {
        frozenset(item_id for item_id in candidate.get("item_ids", []) if isinstance(item_id, str))
        for candidate in candidates
    }
    candidate_removals: set[str] = set()
    valid_groups: list[dict[str, Any]] = []
    valid_merge_groups: list[dict[str, Any]] = []

    def _validate_group_ids(group: dict[str, Any], member_key: str) -> tuple[list[str], list[NewsItem]] | None:
        raw_ids = [group.get("keep_id")] + list(group.get(member_key) or [])
        if any(not isinstance(item_id, str) or item_id not in by_id for item_id in raw_ids):
            errors.append(f"invalid_ids_in_group: {group}")
            return None
        ids = [str(item_id) for item_id in raw_ids]
        group_set = frozenset(ids)
        if not any(group_set.issubset(candidate_set) for candidate_set in candidate_sets):
            errors.append(f"group_outside_candidates: {group}")
            return None
        return ids, [by_id[item_id] for item_id in ids]

    for group in obj.get("duplicate_groups", []) or []:
        if group.get("confidence") != "high":
            continue
        validated = _validate_group_ids(group, "remove_ids")
        if validated is None:
            continue
        ids, group_items = validated
        keep = _choose_keep(group_items, str(group.get("keep_id")))
        remove_items = [item for item in group_items if item is not keep]
        candidate_removals.update(item.id for item in remove_items)
        valid_groups.append(
            {
                "keep_id": keep.id,
                "remove_ids": [item.id for item in remove_items],
                "confidence": "high",
                "reason": str(group.get("reason") or "semantic_duplicate"),
            }
        )

    for group in obj.get("merge_groups", []) or []:
        if group.get("confidence") != "high":
            continue
        validated = _validate_group_ids(group, "merge_ids")
        if validated is None:
            continue
        ids, group_items = validated
        keep = _choose_keep(group_items, str(group.get("keep_id")))
        merge_items = [item for item in group_items if item is not keep]
        valid_merge_groups.append(
            {
                "keep_id": keep.id,
                "merge_ids": [item.id for item in merge_items],
                "confidence": "high",
                "reason": str(group.get("reason") or "semantic_merge"),
            }
        )

    deletion_ratio = len(candidate_removals) / len(items) if items else 0
    if deletion_ratio > max_deletion_ratio:
        return items, {
            "input_count": len(items),
            "candidate_group_count": len(candidates),
            "removed_count": 0,
            "duplicate_groups": valid_groups,
            "merge_groups": valid_merge_groups,
            "uncertain": obj.get("uncertain", []) or [],
            "errors": errors,
            "skipped_for_deletion_ratio": True,
        }

    removed_ids: set[str] = set()

    def append_supplement(keep: NewsItem, source_item: NewsItem, reason: str, action: str) -> None:
        keep.duplicate_sources.append(
            {
                "id": source_item.id,
                "source_group": source_item.source_group,
                "source_label": source_item.source_label,
                "url": source_item.url,
                "title": source_item.title or source_item.title_raw,
                "summary": source_item.summary or source_item.summary_raw,
                "reason": reason,
                "action": action,
            }
        )

    for group in valid_groups:
        keep = by_id[group["keep_id"]]
        for remove_id in group["remove_ids"]:
            removed = by_id[remove_id]
            append_supplement(keep, removed, group["reason"], "dedupe_remove")
            removed_ids.add(remove_id)

    for group in valid_merge_groups:
        keep = by_id[group["keep_id"]]
        for merge_id in group["merge_ids"]:
            if merge_id in removed_ids:
                continue
            append_supplement(keep, by_id[merge_id], group["reason"], "merge_supplement")

    deduped = [item for item in items if item.id not in removed_ids]
    report = {
        "input_count": len(items),
        "candidate_group_count": len(candidates),
        "removed_count": len(removed_ids),
        "duplicate_groups": valid_groups,
        "merge_groups": valid_merge_groups,
        "uncertain": obj.get("uncertain", []) or [],
        "errors": errors,
        "skipped_for_deletion_ratio": False,
    }
    return deduped, report
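Review note on `semantic_dedupe.py`: the LLM's duplicate groups are accepted only when every id is known and the whole group sits inside one recalled candidate set, and the removals are skipped entirely when they would delete more than `max_deletion_ratio` of the input. The sketch below shows those two guards on plain string ids. It is a reduced illustration under assumptions: `validate_groups` and `apply_removals` are hypothetical helpers, and the confidence filter, keep-item scoring, and merge groups are left out.

```python
def validate_groups(
    groups: list[dict],
    candidates: list[list[str]],
    known_ids: set[str],
) -> list[dict]:
    """Keep only groups whose ids are known and fall within a single candidate set."""
    candidate_sets = [frozenset(candidate) for candidate in candidates]
    accepted: list[dict] = []
    for group in groups:
        ids = [group["keep_id"], *group["remove_ids"]]
        if any(item_id not in known_ids for item_id in ids):
            continue  # the model referenced an id that is not in the input
        if not any(frozenset(ids) <= candidate_set for candidate_set in candidate_sets):
            continue  # the model grouped items that were never recalled together
        accepted.append(group)
    return accepted


def apply_removals(
    item_ids: list[str],
    groups: list[dict],
    max_deletion_ratio: float = 0.5,
) -> tuple[list[str], bool]:
    """Return (remaining ids, skipped flag); skip all removals if the ratio is too high."""
    to_remove = {item_id for group in groups for item_id in group["remove_ids"]}
    if item_ids and len(to_remove) / len(item_ids) > max_deletion_ratio:
        return list(item_ids), True
    return [item_id for item_id in item_ids if item_id not in to_remove], False


item_ids = ["a", "b", "c", "d"]
candidates = [["a", "b"], ["c", "d"]]
llm_groups = [
    {"keep_id": "a", "remove_ids": ["b"]},  # valid: inside the first candidate set
    {"keep_id": "a", "remove_ids": ["c"]},  # rejected: spans two candidate sets
    {"keep_id": "x", "remove_ids": ["d"]},  # rejected: unknown id
]
accepted = validate_groups(llm_groups, candidates, set(item_ids))
remaining, skipped = apply_removals(item_ids, accepted)
strict_remaining, strict_skipped = apply_removals(item_ids, accepted, max_deletion_ratio=0.2)
print(remaining, skipped, strict_remaining, strict_skipped)
```

Treating the candidate sets as an allow-list keeps the model from deleting items the recall stage never paired, and the ratio guard turns a runaway response into a no-op rather than an empty report.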
2
ai_daily_report/sources/__init__.py
Normal file
@@ -0,0 +1,2 @@
"""Source adapters for the AI daily report pipeline."""
32
ai_daily_report/sources/aihot.py
Normal file
@@ -0,0 +1,32 @@
from __future__ import annotations

import json
from typing import Any, Callable

from ai_daily_report.models import SourceConfig


FetchText = Callable[[str, int], str]


def fetch_aihot(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
    data = json.loads(fetch_text(f"https://aihot.virxact.com/api/public/daily/{run_date}", config.timeout_seconds))
    items: list[dict[str, Any]] = []
    generated = data.get("generatedAt")
    for section in data.get("sections", []) or []:
        for raw in section.get("items", []) or []:
            items.append(
                {
                    "source_group": config.name,
                    "source_label": raw.get("sourceName") or config.name,
                    "title_raw": raw.get("title") or "",
                    "summary_raw": raw.get("summary") or "",
                    "url": raw.get("sourceUrl") or "",
                    "published_at": generated,
                    "origin_type": "aihot_json",
                    "section_hint": section.get("label") or "",
                    "language_hint": "zh",
                }
            )
    return items
58
ai_daily_report/sources/juya.py
Normal file
@@ -0,0 +1,58 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
import xml.etree.ElementTree as ET
|
||||||
|
from typing import Any, Callable
|
||||||
|
|
||||||
|
from ai_daily_report.models import SourceConfig
|
||||||
|
from ai_daily_report.normalize import clean_text
|
||||||
|
from ai_daily_report.sources.labels import source_label_from_url
|
||||||
|
|
||||||
|
|
||||||
|
FetchText = Callable[[str, int], str]
|
||||||
|
|
||||||
|
|
||||||
|
def parse_juya_rss(config: SourceConfig, xml_text: str, run_date: str) -> list[dict[str, Any]]:
|
||||||
|
root = ET.fromstring(xml_text)
|
||||||
|
channel = root.find("channel")
|
||||||
|
raw_items = channel.findall("item") if channel is not None else []
|
||||||
|
article_html = ""
|
||||||
|
for raw in raw_items:
|
||||||
|
if (raw.findtext("title") or "").strip() != run_date:
|
||||||
|
continue
|
||||||
|
content_el = raw.find("{http://purl.org/rss/1.0/modules/content/}encoded")
|
||||||
|
article_html = content_el.text if content_el is not None and content_el.text else ""
|
||||||
|
break
|
||||||
|
if not article_html:
|
||||||
|
return []
|
||||||
|
|
||||||
|
block_pattern = re.compile(
|
||||||
|
r'<h2[^>]*>\s*(?:<a[^>]*href="(?P<title_url>[^"]+)"[^>]*>)?(?P<title_html>[^<]*?)(?:</a>)?\s*<code>#(?P<num>\d+)</code>\s*</h2>(?P<body>.*?)(?=<hr\s*/?>\s*<h2|<p><strong>提示</strong>|$)',
|
||||||
|
re.S | re.I,
|
||||||
|
)
|
||||||
|
items: list[dict[str, Any]] = []
|
||||||
|
for match in block_pattern.finditer(article_html):
|
||||||
|
title = clean_text(match.group("title_html") or "")
|
||||||
|
body_html = match.group("body") or ""
|
||||||
|
links = re.findall(r'<a[^>]*href="([^"]+)"[^>]*>', body_html, re.I)
|
||||||
|
url = links[0].replace("&amp;", "&").strip() if links else (match.group("title_url") or "")
|
||||||
|
summary = clean_text(re.sub(r"<[^>]+>", " ", body_html))
|
||||||
|
if title:
|
||||||
|
items.append(
|
||||||
|
{
|
||||||
|
"source_group": config.name,
|
||||||
|
"source_label": source_label_from_url(url, fallback=config.name),
|
||||||
|
"title_raw": title,
|
||||||
|
"summary_raw": summary[:500],
|
||||||
|
"url": url,
|
||||||
|
"published_at": None,
|
||||||
|
"origin_type": "juya_issue",
|
||||||
|
"section_hint": "",
|
||||||
|
"language_hint": "zh",
|
||||||
|
}
|
||||||
|
)
|
||||||
|
return items
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_juya(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
|
||||||
|
return parse_juya_rss(config, fetch_text(config.url, config.timeout_seconds), run_date)
|
||||||
78
ai_daily_report/sources/labels.py
Normal file
@@ -0,0 +1,78 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
|
||||||
|
|
||||||
|
DOMAIN_LABELS = {
|
||||||
|
"anthropic.com": "Anthropic",
|
||||||
|
"arxiv.org": "arXiv",
|
||||||
|
"bloomberg.com": "Bloomberg",
|
||||||
|
"deepseek.com": "DeepSeek",
|
||||||
|
"github.blog": "GitHub Blog",
|
||||||
|
"github.com": "GitHub",
|
||||||
|
"huggingface.co": "Hugging Face",
|
||||||
|
"infoq.com": "InfoQ",
|
||||||
|
"mp.weixin.qq.com": "微信公众号",
|
||||||
|
"openai.com": "OpenAI",
|
||||||
|
"platform.minimaxi.com": "MiniMax:Docs",
|
||||||
|
"qbitai.com": "量子位",
|
||||||
|
"techcrunch.com": "TechCrunch",
|
||||||
|
"technologyreview.com": "MIT科技评论AI",
|
||||||
|
"theverge.com": "The Verge",
|
||||||
|
"x.com": "X",
|
||||||
|
"twitter.com": "X",
|
||||||
|
}
|
||||||
|
|
||||||
|
X_DISPLAY_NAMES = {
|
||||||
|
"MiniMax_AI": "MiniMax",
|
||||||
|
"OpenAIDevs": "OpenAI Developers",
|
||||||
|
"openai": "OpenAI",
|
||||||
|
"openclaw": "OpenClaw",
|
||||||
|
"xai": "xAI",
|
||||||
|
"krea_ai": "Krea AI",
|
||||||
|
"nvidia": "NVIDIA",
|
||||||
|
"NVIDIAAI": "NVIDIA AI",
|
||||||
|
"alibaba_cloud": "阿里云 / Alibaba Cloud",
|
||||||
|
"cb_doge": "cb_doge",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _host(url: str) -> str:
|
||||||
|
host = (urlparse(url).netloc or "").lower()
|
||||||
|
return host[4:] if host.startswith("www.") else host
|
||||||
|
|
||||||
|
|
||||||
|
def _domain_label(host: str) -> str:
|
||||||
|
for domain, label in DOMAIN_LABELS.items():
|
||||||
|
if host == domain or host.endswith("." + domain):
|
||||||
|
return label
|
||||||
|
return host
|
||||||
|
|
||||||
|
|
||||||
|
def _x_handle(url: str) -> str:
|
||||||
|
parts = [part for part in urlparse(url).path.split("/") if part]
|
||||||
|
if not parts:
|
||||||
|
return ""
|
||||||
|
handle = parts[0]
|
||||||
|
if handle in {"i", "search", "explore", "settings", "notifications", "home", "compose"}:
|
||||||
|
return ""
|
||||||
|
return handle
|
||||||
|
|
||||||
|
|
||||||
|
def source_label_from_url(url: str, *, fallback: str = "来源") -> str:
|
||||||
|
if not url:
|
||||||
|
return fallback
|
||||||
|
host = _host(url)
|
||||||
|
if host in {"x.com", "twitter.com"}:
|
||||||
|
handle = _x_handle(url)
|
||||||
|
if handle:
|
||||||
|
display = X_DISPLAY_NAMES.get(handle, handle)
|
||||||
|
return f"X:{display} (@{handle})"
|
||||||
|
return "X"
|
||||||
|
|
||||||
|
label = _domain_label(host)
|
||||||
|
parsed = urlparse(url)
|
||||||
|
path = (parsed.path or "").lower()
|
||||||
|
if label and ("blog" in host or "/blog" in path or "/research" in path):
|
||||||
|
return f"{label}:Blog"
|
||||||
|
return label or fallback
|
||||||
24
ai_daily_report/sources/registry.py
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import Callable
|
||||||
|
|
||||||
|
from ai_daily_report.models import SourceConfig
|
||||||
|
from ai_daily_report.sources.aihot import fetch_aihot
|
||||||
|
from ai_daily_report.sources.juya import fetch_juya
|
||||||
|
from ai_daily_report.sources.rss import fetch_rss
|
||||||
|
|
||||||
|
|
||||||
|
SourceFetcher = Callable[[SourceConfig, str, Callable[[str, int], str]], list[dict]]
|
||||||
|
|
||||||
|
SOURCE_FETCHERS: dict[str, SourceFetcher] = {
|
||||||
|
"aihot": fetch_aihot,
|
||||||
|
"rss": fetch_rss,
|
||||||
|
"juya_rss": fetch_juya,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def get_source_fetcher(source_type: str) -> SourceFetcher:
|
||||||
|
if source_type not in SOURCE_FETCHERS:
|
||||||
|
raise KeyError(f"Unknown source type: {source_type}")
|
||||||
|
return SOURCE_FETCHERS[source_type]
|
||||||
|
|
||||||
94
ai_daily_report/sources/rss.py
Normal file
@@ -0,0 +1,94 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import xml.etree.ElementTree as ET
|
||||||
|
from datetime import date, datetime
|
||||||
|
from email.utils import parsedate_to_datetime
|
||||||
|
from typing import Any, Callable
|
||||||
|
|
||||||
|
from ai_daily_report.models import SourceConfig
|
||||||
|
from ai_daily_report.normalize import clean_text
|
||||||
|
|
||||||
|
|
||||||
|
FetchText = Callable[[str, int], str]
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_pubdate(value: str) -> str | None:
|
||||||
|
if not value:
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
return parsedate_to_datetime(value).isoformat()
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_run_date(value: str | None) -> date | None:
|
||||||
|
if not value:
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
return date.fromisoformat(value[:10])
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_iso_date(value: str | None) -> date | None:
|
||||||
|
if not value:
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
return datetime.fromisoformat(value).date()
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _within_max_item_age(published_at: str | None, *, run_date: str | None, max_item_age_days: int | None) -> bool:
|
||||||
|
if max_item_age_days is None:
|
||||||
|
return True
|
||||||
|
published_date = _parse_iso_date(published_at)
|
||||||
|
current_date = _parse_run_date(run_date)
|
||||||
|
if published_date is None or current_date is None:
|
||||||
|
return True
|
||||||
|
return (current_date - published_date).days <= max_item_age_days
|
||||||
|
|
||||||
|
|
||||||
|
def parse_rss_items(
|
||||||
|
config: SourceConfig,
|
||||||
|
xml_text: str,
|
||||||
|
*,
|
||||||
|
limit: int = 20,
|
||||||
|
run_date: str | None = None,
|
||||||
|
) -> list[dict[str, Any]]:
|
||||||
|
root = ET.fromstring(xml_text)
|
||||||
|
channel = root.find("channel")
|
||||||
|
raw_items = channel.findall("item") if channel is not None else []
|
||||||
|
items: list[dict[str, Any]] = []
|
||||||
|
for raw in raw_items:
|
||||||
|
title = clean_text(raw.findtext("title") or "")
|
||||||
|
if not title:
|
||||||
|
continue
|
||||||
|
summary = clean_text(raw.findtext("description") or "")
|
||||||
|
published_at = _parse_pubdate(raw.findtext("pubDate") or "")
|
||||||
|
if not _within_max_item_age(
|
||||||
|
published_at,
|
||||||
|
run_date=run_date,
|
||||||
|
max_item_age_days=config.max_item_age_days,
|
||||||
|
):
|
||||||
|
continue
|
||||||
|
items.append(
|
||||||
|
{
|
||||||
|
"source_group": config.name,
|
||||||
|
"source_label": config.name,
|
||||||
|
"title_raw": title,
|
||||||
|
"summary_raw": summary,
|
||||||
|
"url": (raw.findtext("link") or "").strip(),
|
||||||
|
"published_at": published_at,
|
||||||
|
"origin_type": "rss",
|
||||||
|
"section_hint": "",
|
||||||
|
"language_hint": "en" if title.encode("utf-8").isascii() else "zh",
|
||||||
|
}
|
||||||
|
)
|
||||||
|
if len(items) >= limit:
|
||||||
|
break
|
||||||
|
return items
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_rss(config: SourceConfig, run_date: str, fetch_text: FetchText) -> list[dict[str, Any]]:
|
||||||
|
return parse_rss_items(config, fetch_text(config.url, config.timeout_seconds), run_date=run_date)
|
||||||
46
ai_daily_report/validate.py
Normal file
@@ -0,0 +1,46 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from .classify import SECTION_ORDER
|
||||||
|
from .models import NewsItem
|
||||||
|
|
||||||
|
|
||||||
|
def validate_report_markdown(markdown: str, items: list[NewsItem]) -> dict[str, Any]:
|
||||||
|
return validate_markdown(markdown, items)
|
||||||
|
|
||||||
|
|
||||||
|
def validate_markdown(markdown: str, items: list[NewsItem]) -> dict[str, Any]:
|
||||||
|
blocking_errors: list[str] = []
|
||||||
|
auto_fixes: list[str] = []
|
||||||
|
warnings: list[dict[str, str]] = []
|
||||||
|
|
||||||
|
if not items:
|
||||||
|
blocking_errors.append("no_items")
|
||||||
|
if len((markdown or "").strip()) < 80:
|
||||||
|
blocking_errors.append("markdown_too_short")
|
||||||
|
if items and "## " not in (markdown or ""):
|
||||||
|
blocking_errors.append("no_sections")
|
||||||
|
if re.search(r"\{[^{}]*\}", markdown or ""):
|
||||||
|
blocking_errors.append("json_fragment_detected")
|
||||||
|
if "> >" in (markdown or ""):
|
||||||
|
auto_fixes.append("double_blockquote_detected")
|
||||||
|
if re.search(r"\[\d+\]|\[N\]", markdown or ""):
|
||||||
|
auto_fixes.append("reference_marker_detected")
|
||||||
|
|
||||||
|
for item in items:
|
||||||
|
if not item.url:
|
||||||
|
warnings.append({"type": "missing_url", "item_id": item.id})
|
||||||
|
if item.section not in SECTION_ORDER:
|
||||||
|
blocking_errors.append("invalid_section")
|
||||||
|
break
|
||||||
|
|
||||||
|
return {
|
||||||
|
"item_count": len(items),
|
||||||
|
"section_count": len({item.section for item in items if item.section}),
|
||||||
|
"markdown_length": len(markdown or ""),
|
||||||
|
"auto_fixes": auto_fixes,
|
||||||
|
"warnings": warnings,
|
||||||
|
"blocking_errors": blocking_errors,
|
||||||
|
}
|
||||||
52
config/pipeline.json
Normal file
@@ -0,0 +1,52 @@
|
|||||||
|
{
|
||||||
|
"sections": [
|
||||||
|
"模型与能力",
|
||||||
|
"产品与应用",
|
||||||
|
"开发与基础设施",
|
||||||
|
"公司与资本",
|
||||||
|
"政策与安全",
|
||||||
|
"论文与研究",
|
||||||
|
"观点与教程",
|
||||||
|
"人物与动态"
|
||||||
|
],
|
||||||
|
"rewrite_batch_size": 10,
|
||||||
|
"semantic_dedup_max_deletion_ratio": 0.5,
|
||||||
|
"default_mode": "dry-run",
|
||||||
|
"cross_day_dedup": {
|
||||||
|
"enabled": true,
|
||||||
|
"max_age_days": 7,
|
||||||
|
"history_path": "~/.hermes/scripts/ai_morning_out/published_urls.json"
|
||||||
|
},
|
||||||
|
"semantic_candidate_recall": {
|
||||||
|
"enabled": true,
|
||||||
|
"max_pairs": 80,
|
||||||
|
"max_pairs_per_item": 5,
|
||||||
|
"title_similarity_threshold": 0.45,
|
||||||
|
"title_jaccard_threshold": 0.25,
|
||||||
|
"summary_jaccard_threshold": 0.18,
|
||||||
|
"strong_entity_overlap_threshold": 2
|
||||||
|
},
|
||||||
|
"quality_gate": {
|
||||||
|
"required_source_failure_policy": "block",
|
||||||
|
"block_on_required_source_failure": true,
|
||||||
|
"warn_on_enabled_source_failure": true,
|
||||||
|
"warn_when_stage3_candidates_zero_min_items": 30,
|
||||||
|
"warn_on_final_title_similarity": 0.55,
|
||||||
|
"warn_on_entity_frequency": 3,
|
||||||
|
"required_sources": ["AI HOT"]
|
||||||
|
},
|
||||||
|
"publish_idempotency": {
|
||||||
|
"enabled": true,
|
||||||
|
"allow_republish": false,
|
||||||
|
"slug_lookup_paths": [
|
||||||
|
"/api/service/posts/{slug}",
|
||||||
|
"/api/service/posts?slug={slug}",
|
||||||
|
"/api/service/posts/slug/{slug}"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"llm_observability": {
|
||||||
|
"enabled": true,
|
||||||
|
"prompt_preview_chars": 500,
|
||||||
|
"response_preview_chars": 500
|
||||||
|
}
|
||||||
|
}
|
||||||
68
config/sources.json
Normal file
@@ -0,0 +1,68 @@
|
|||||||
|
[
|
||||||
|
{
|
||||||
|
"name": "AI HOT",
|
||||||
|
"type": "aihot",
|
||||||
|
"role": "primary",
|
||||||
|
"required": true,
|
||||||
|
"failure_policy": "block",
|
||||||
|
"priority": 10,
|
||||||
|
"timeout_seconds": 25,
|
||||||
|
"retries": 2,
|
||||||
|
"min_items": 10,
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "橘鸦AI早报",
|
||||||
|
"type": "juya_rss",
|
||||||
|
"url": "https://imjuya.github.io/juya-ai-daily/rss.xml",
|
||||||
|
"role": "supplement",
|
||||||
|
"required": false,
|
||||||
|
"failure_policy": "warn",
|
||||||
|
"priority": 20,
|
||||||
|
"timeout_seconds": 45,
|
||||||
|
"retries": 2,
|
||||||
|
"min_items": 0,
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "量子位",
|
||||||
|
"type": "rss",
|
||||||
|
"url": "https://www.qbitai.com/feed",
|
||||||
|
"role": "supplement",
|
||||||
|
"required": false,
|
||||||
|
"failure_policy": "warn",
|
||||||
|
"priority": 30,
|
||||||
|
"timeout_seconds": 25,
|
||||||
|
"retries": 1,
|
||||||
|
"min_items": 0,
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "InfoQ AI",
|
||||||
|
"type": "rss",
|
||||||
|
"url": "https://feed.infoq.com/ai-ml-data-eng/",
|
||||||
|
"role": "supplement",
|
||||||
|
"required": false,
|
||||||
|
"failure_policy": "warn",
|
||||||
|
"priority": 40,
|
||||||
|
"timeout_seconds": 25,
|
||||||
|
"retries": 1,
|
||||||
|
"min_items": 0,
|
||||||
|
"max_item_age_days": 3,
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"name": "MIT科技评论AI",
|
||||||
|
"type": "rss",
|
||||||
|
"url": "https://www.technologyreview.com/topic/artificial-intelligence/feed",
|
||||||
|
"role": "supplement",
|
||||||
|
"required": false,
|
||||||
|
"failure_policy": "warn",
|
||||||
|
"priority": 50,
|
||||||
|
"timeout_seconds": 25,
|
||||||
|
"retries": 1,
|
||||||
|
"min_items": 0,
|
||||||
|
"max_item_age_days": 5,
|
||||||
|
"enabled": true
|
||||||
|
}
|
||||||
|
]
|
||||||
33
docs/ops-thresholds.generated.md
Normal file
@@ -0,0 +1,33 @@
|
|||||||
|
# AI日报运维阈值(自动生成)
|
||||||
|
|
||||||
|
> 由 `scripts/generate_ops_docs.py` 从 `config/pipeline.json` 和 `config/sources.json` 生成;不要手改本文件。
|
||||||
|
|
||||||
|
## Quality Gate
|
||||||
|
|
||||||
|
- `block_on_required_source_failure`: `True`
|
||||||
|
- `required_source_failure_policy`: `block`
|
||||||
|
- `required_sources`: `['AI HOT']`
|
||||||
|
- `warn_on_enabled_source_failure`: `True`
|
||||||
|
- `warn_on_entity_frequency`: `3`
|
||||||
|
- `warn_on_final_title_similarity`: `0.55`
|
||||||
|
- `warn_when_stage3_candidates_zero_min_items`: `30`
|
||||||
|
|
||||||
|
## Semantic Candidate Recall
|
||||||
|
|
||||||
|
- `enabled`: `True`
|
||||||
|
- `max_pairs`: `80`
|
||||||
|
- `max_pairs_per_item`: `5`
|
||||||
|
- `strong_entity_overlap_threshold`: `2`
|
||||||
|
- `summary_jaccard_threshold`: `0.18`
|
||||||
|
- `title_jaccard_threshold`: `0.25`
|
||||||
|
- `title_similarity_threshold`: `0.45`
|
||||||
|
|
||||||
|
## Sources
|
||||||
|
|
||||||
|
| source | required | failure_policy | min_items | retries | timeout_seconds |
|---|---:|---|---:|---:|---:|
| AI HOT | True | block | 10 | 2 | 25 |
| 橘鸦AI早报 | False | warn | 0 | 2 | 45 |
| 量子位 | False | warn | 0 | 1 | 25 |
| InfoQ AI | False | warn | 0 | 1 | 25 |
| MIT科技评论AI | False | warn | 0 | 1 | 25 |
|
||||||
786
docs/pipeline-optimization-plan.md
Normal file
@@ -0,0 +1,786 @@
|
|||||||
|
# AI Daily Report Pipeline Optimization Plan
|
||||||
|
|
||||||
|
## Objective
|
||||||
|
|
||||||
|
This project should become a stable, long-running AI daily report system for Hermes, OpenClaw, and similar agents. The goal is not only to keep the current script runnable, but to make the whole pipeline observable, replayable, maintainable, and safe to run on a daily schedule.
|
||||||
|
|
||||||
|
The recommended direction is:
|
||||||
|
|
||||||
|
```text
|
||||||
|
stable core library + CLI + skill wrapper
|
||||||
|
```
|
||||||
|
|
||||||
|
Core business logic should live in deterministic code. The skill should describe how agents run, diagnose, replay, publish, and extend the pipeline.
|
||||||
|
|
||||||
|
## Stage Model
|
||||||
|
|
||||||
|
Use this stage model going forward:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Stage 0: Collect Sources
|
||||||
|
Stage 1: Normalize Items
|
||||||
|
Stage 2: Hard Dedup
|
||||||
|
Stage 3: Semantic Dedup
|
||||||
|
Stage 4: Rewrite Titles and Summaries
|
||||||
|
Stage 5: Classify and Order
|
||||||
|
Stage 6: Guide and Daily Threads
|
||||||
|
Stage 7: Assemble and Validate Markdown
|
||||||
|
Stage 8: Publish and Deliver
|
||||||
|
```
|
||||||
|
|
||||||
|
The current script labels its script-level deduplication as Stage 0. Treat that as legacy terminology: in the long-term pipeline, Stage 0 is source collection.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
Recommended structure:
|
||||||
|
|
||||||
|
```text
|
||||||
|
ai-daily-report/
|
||||||
|
├── ai_daily_report/
|
||||||
|
│ ├── models.py
|
||||||
|
│ ├── sources/
|
||||||
|
│ │ ├── aihot.py
|
||||||
|
│ │ ├── rss.py
|
||||||
|
│ │ ├── juya.py
|
||||||
|
│ │ └── registry.py
|
||||||
|
│ ├── collect.py
|
||||||
|
│ ├── normalize.py
|
||||||
|
│ ├── dedupe.py
|
||||||
|
│ ├── llm.py
|
||||||
|
│ ├── rewrite.py
|
||||||
|
│ ├── classify.py
|
||||||
|
│ ├── assemble.py
|
||||||
|
│ ├── validate.py
|
||||||
|
│ ├── publish.py
|
||||||
|
│ └── cli.py
|
||||||
|
├── config/
|
||||||
|
│ ├── sources.json
|
||||||
|
│ └── pipeline.json
|
||||||
|
├── docs/
|
||||||
|
├── skill/
|
||||||
|
│ ├── SKILL.md
|
||||||
|
│ ├── scripts/
|
||||||
|
│ └── references/
|
||||||
|
├── tests/
|
||||||
|
│ └── fixtures/
|
||||||
|
└── script/
|
||||||
|
└── ai_daily_blog_pipeline.py
|
||||||
|
```
|
||||||
|
|
||||||
|
Keep `script/ai_daily_blog_pipeline.py` as a compatibility entrypoint during migration, but move implementation into importable modules.
|
||||||
|
|
||||||
|
## Data Model
|
||||||
|
|
||||||
|
### SourceResult
|
||||||
|
|
||||||
|
Every data source should return a structured result:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"source": "AI HOT",
|
||||||
|
"role": "primary",
|
||||||
|
"ok": true,
|
||||||
|
"status": "ok",
|
||||||
|
"items": [],
|
||||||
|
"error": null,
|
||||||
|
"elapsed_ms": 820,
|
||||||
|
"retry_count": 0,
|
||||||
|
"fetched_at": "2026-06-04T10:00:00+08:00"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Supported statuses:
|
||||||
|
|
||||||
|
```text
|
||||||
|
ok
|
||||||
|
empty
|
||||||
|
not_ready
|
||||||
|
timeout
|
||||||
|
http_error
|
||||||
|
parse_error
|
||||||
|
disabled
|
||||||
|
```
|
||||||
|
|
||||||
|
### NewsItem
|
||||||
|
|
||||||
|
All raw source items should be normalized into one structure:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"id": "item_...",
|
||||||
|
"source_group": "AI HOT",
|
||||||
|
"source_label": "OpenAI: Blog",
|
||||||
|
"source_role": "primary",
|
||||||
|
"source_priority": 10,
|
||||||
|
"title_raw": "...",
|
||||||
|
"title_norm": "...",
|
||||||
|
"summary_raw": "...",
|
||||||
|
"title": null,
|
||||||
|
"summary": null,
|
||||||
|
"url": "...",
|
||||||
|
"canonical_url": "...",
|
||||||
|
"published_at": "...",
|
||||||
|
"collected_at": "...",
|
||||||
|
"origin_type": "aihot_json",
|
||||||
|
"section_hint": "...",
|
||||||
|
"section": null,
|
||||||
|
"language_hint": "zh",
|
||||||
|
"quality_flags": [],
|
||||||
|
"duplicate_sources": []
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Do not overwrite raw fields with LLM output. Keep display fields separate.
|
||||||
|
|
||||||
|
## Stage 0: Collect Sources
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Collect candidate news from all configured sources in a stable, observable, and recoverable way.
|
||||||
|
|
||||||
|
### Design
|
||||||
|
|
||||||
|
Use a primary-plus-supplement model at the quality layer and concurrent fetching at the execution layer.
|
||||||
|
|
||||||
|
```text
|
||||||
|
Quality layer:
|
||||||
|
AI HOT = primary source
|
||||||
|
RSS / Juya / InfoQ / QbitAI / MIT = supplement sources
|
||||||
|
|
||||||
|
Execution layer:
|
||||||
|
start all sources concurrently with per-source timeout, retry, and reporting
|
||||||
|
```
|
||||||
|
|
||||||
|
### Source Config
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"name": "AI HOT",
|
||||||
|
"type": "aihot",
|
||||||
|
"role": "primary",
|
||||||
|
"required": true,
|
||||||
|
"priority": 10,
|
||||||
|
"timeout_seconds": 20,
|
||||||
|
"retries": 2,
|
||||||
|
"min_items": 10,
|
||||||
|
"enabled": true
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Supplement source example:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"name": "Juya AI Daily",
|
||||||
|
"type": "juya_rss",
|
||||||
|
"url": "https://imjuya.github.io/juya-ai-daily/rss.xml",
|
||||||
|
"role": "supplement",
|
||||||
|
"required": false,
|
||||||
|
"priority": 20,
|
||||||
|
"timeout_seconds": 45,
|
||||||
|
"retries": 2,
|
||||||
|
"enabled": true
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Optimizations
|
||||||
|
|
||||||
|
- Run supplement sources concurrently.
|
||||||
|
- Do not let one slow source block the whole pipeline.
|
||||||
|
- Replace the fixed Juya `sleep(120)` with bounded short retries and a clear `not_ready` or `timeout` status.
|
||||||
|
- Treat an AI HOT 404 as `not_ready` rather than a generic failure.
|
||||||
|
- Allow degraded generation if the primary source has a temporary network failure and supplement sources are usable.
|
||||||
|
- Persist raw source results for replay.
|
||||||
|
|
||||||
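
The bounded-retry idea above could look roughly like the sketch below. It is illustrative only: `NotReady`, `collect_with_retry`, the injected `sleep` parameter, and the choice to stop retrying on parse errors are assumptions, not part of the existing code. The returned dict reuses the `SourceResult` field names from this plan.

```python
import time
from typing import Any, Callable


class NotReady(Exception):
    """Hypothetical signal a fetcher raises when the daily issue is not published yet (e.g. HTTP 404)."""


def collect_with_retry(
    fetch: Callable[[], list],
    *,
    retries: int = 2,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch() at most retries + 1 times and report a SourceResult-like dict."""
    started = time.monotonic()
    status, error, items, attempts = "ok", None, [], 0
    for attempt in range(retries + 1):
        attempts = attempt + 1
        try:
            items = fetch()
            status, error = ("ok" if items else "empty"), None
            break
        except NotReady as exc:
            status, error = "not_ready", str(exc)
        except TimeoutError as exc:
            status, error = "timeout", str(exc)
        except ValueError as exc:
            # Assumption: a parse error will not fix itself, so stop retrying.
            status, error = "parse_error", str(exc)
            break
        if attempt < retries:
            sleep(delay_seconds)  # short, bounded wait instead of a fixed sleep(120)
    return {
        "ok": status == "ok",  # assumption: an empty result is not counted as ok
        "status": status,
        "items": items,
        "error": error,
        "retry_count": attempts - 1,
        "elapsed_ms": int((time.monotonic() - started) * 1000),
    }
```

Injecting `sleep` keeps the function testable without real waiting; a scheduler could run one such call per source in a thread pool so a slow supplement source never blocks the others.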
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
source_results.json
|
||||||
|
raw_items.json
|
||||||
|
stage0_collect_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 1: Normalize Items
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Convert heterogeneous source output into clean, comparable, traceable `NewsItem` objects.
|
||||||
|
|
||||||
|
### Optimizations
|
||||||
|
|
||||||
|
- Normalize text with HTML stripping, entity decoding, whitespace cleanup, and RSS boilerplate removal.
|
||||||
|
- Generate stable `id` values from canonical URL when possible, otherwise from source, normalized title, and date.
|
||||||
|
- Canonicalize URLs:
|
||||||
|
- Lowercase scheme and host.
|
||||||
|
- Remove `utm_*`, `fbclid`, `gclid`, `spm`, `from`, and fragments.
|
||||||
|
- Normalize trailing slashes.
|
||||||
|
- Normalize `twitter.com` and `x.com` URLs.
|
||||||
|
- Generate `title_norm`:
|
||||||
|
- Unicode NFKC normalization.
|
||||||
|
- Lowercase English text.
|
||||||
|
- Normalize whitespace and weak punctuation.
|
||||||
|
- Preserve numbers, versions, model names, and product names.
|
||||||
|
- Standardize source labels:
|
||||||
|
- X links as `X:@username`.
|
||||||
|
- Official blogs as `OpenAI: Blog`, `Google Research: Blog`, etc.
|
||||||
|
- Avoid generic labels such as "technology media" when a domain label is available.
|
||||||
|
- Add `quality_flags` instead of silently dropping items:
|
||||||
|
- `missing_url`
|
||||||
|
- `missing_summary`
|
||||||
|
- `short_title`
|
||||||
|
- `bad_url`
|
||||||
|
- `old_item`
|
||||||
|
- `parse_suspect`
|
||||||
|
|
||||||
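
As a rough illustration of the URL and title rules above, the sketch below uses only the standard library. The function names, the `www.` stripping, the mapping of `twitter.com` to `x.com`, and the exact set of "weak" punctuation are assumptions made for the example, not decisions taken by this plan.

```python
import re
import unicodedata
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Tracking parameters named in the plan; utm_* is handled by prefix.
TRACKING_KEYS = {"fbclid", "gclid", "spm", "from"}


def canonicalize_url(url: str) -> str:
    """Lowercase scheme/host, drop tracking params and fragments, trim trailing slashes."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www. is treated as noise
    if host == "twitter.com":
        host = "x.com"  # assumption: x.com is the canonical host
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def normalize_title(title: str) -> str:
    """NFKC-normalize, lowercase, and drop weak punctuation; keep '.', '-' and digits."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[\"'“”‘’!?,;:、。]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Keeping `.` and `-` out of the punctuation set is what preserves versions and model names such as `gpt-4.5` in `title_norm`.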
|
### Non-goals
|
||||||
|
|
||||||
|
- Do not dedupe.
|
||||||
|
- Do not rewrite content.
|
||||||
|
- Do not call the LLM.
|
||||||
|
- Do not remove items based on importance.
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
normalized_items.json
|
||||||
|
stage1_normalize_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 2: Hard Dedup
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Remove only high-confidence duplicates with deterministic rules. Mark uncertain similarities for Stage 3.
|
||||||
|
|
||||||
|
### Rules
|
||||||
|
|
||||||
|
High-confidence removal:
|
||||||
|
|
||||||
|
- Same canonical URL.
|
||||||
|
- Same normalized title.
|
||||||
|
- Same platform entity, such as the same X status ID.
|
||||||
|
- Same source and same exact normalized title.
|
||||||
|
|
||||||
|
Uncertain cases:
|
||||||
|
|
||||||
|
- Similar title but different URL.
|
||||||
|
- Same company or model, but unclear whether the event is identical.
|
||||||
|
- Same topic across multiple sources with different factual details.
|
||||||
|
|
||||||
|
Uncertain cases should go to `possible_duplicates`, not be removed.
|
||||||
|
|
||||||
|
### Replacement for Current Logic
|
||||||
|
|
||||||
|
The current `SequenceMatcher > 0.7` direct deletion is too risky. Replace it with:
|
||||||
|
|
||||||
|
- Exact deterministic deletion.
|
||||||
|
- Similarity-based candidate marking only.
|
||||||
|
|
||||||
|
### Keep Item Selection
|
||||||
|
|
||||||
|
When merging a duplicate group, choose which item to keep using a local score:
|
||||||
|
|
||||||
|
```text
|
||||||
|
official source bonus
|
||||||
|
+ primary source bonus
|
||||||
|
+ source priority
|
||||||
|
+ has URL
|
||||||
|
+ has summary
|
||||||
|
+ has section hint
|
||||||
|
+ newer published_at
|
||||||
|
- quality flag penalty
|
||||||
|
```
|
||||||
|
|
||||||
|
Attach removed items to `duplicate_sources` on the kept item.
|
||||||
|
|
||||||
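
One possible shape for this scoring is sketched below. The weights, the `is_official` field, and the helper names are hypothetical; only the list of signals comes from the plan above. `published_at` is used as a tie-breaker and compared as a string, which assumes ISO timestamps in a consistent timezone.

```python
from __future__ import annotations

from typing import Any


def keep_score(item: dict[str, Any]) -> float:
    """Score an item for 'keep' selection; the weights are illustrative placeholders."""
    score = 0.0
    if item.get("is_official"):  # hypothetical flag for official sources
        score += 3.0
    if item.get("source_role") == "primary":
        score += 2.0
    # Assumption: a lower source_priority number means a more trusted source.
    score += (100 - item.get("source_priority", 100)) / 100
    for field in ("url", "summary_raw", "section_hint"):
        if item.get(field):
            score += 1.0
    score -= 0.5 * len(item.get("quality_flags") or [])
    return score


def merge_duplicates(group: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a copy of the best item with the others recorded in duplicate_sources."""
    ranked = sorted(
        group,
        key=lambda it: (keep_score(it), it.get("published_at") or ""),
        reverse=True,
    )
    keep, dropped = ranked[0], ranked[1:]
    merged = dict(keep)  # do not mutate the input item
    merged["duplicate_sources"] = list(keep.get("duplicate_sources") or []) + [
        {"source_group": it.get("source_group"), "url": it.get("url")} for it in dropped
    ]
    return merged
```

Because nothing is deleted without being recorded, the Stage 2 report can always explain which source an item absorbed.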
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
deduped_items.json
|
||||||
|
stage2_dedupe_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 3: Semantic Dedup
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Use the LLM to identify semantic duplicates that deterministic rules cannot safely remove.
|
||||||
|
|
||||||
|
### Principles
|
||||||
|
|
||||||
|
- The LLM judges duplicate candidates; local code enforces safety.
|
||||||
|
- The LLM must not select, curate, or remove items by importance.
|
||||||
|
- Only remove `confidence = high` duplicate groups.
|
||||||
|
- Treat medium or uncertain results as non-removal.
|
||||||
|
|
||||||
|
### Input
|
||||||
|
|
||||||
|
Prefer candidate groups from Stage 2. Avoid sending all items at once unless the item count is small.
|
||||||
|
|
||||||
|
Example item payload:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"id": "item_123",
|
||||||
|
"title": "...",
|
||||||
|
"summary": "...",
|
||||||
|
"source": "QbitAI",
|
||||||
|
"url_host": "qbitai.com",
|
||||||
|
"published_at": "...",
|
||||||
|
"section_hint": "Company and Capital"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Output Schema
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"duplicate_groups": [
|
||||||
|
{
|
||||||
|
"keep_id": "item_123",
|
||||||
|
"remove_ids": ["item_456"],
|
||||||
|
"confidence": "high",
|
||||||
|
"reason": "Both items report the same concrete event."
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"not_duplicates": [],
|
||||||
|
"uncertain": []
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Safety Checks
|
||||||
|
|
||||||
|
- Validate all IDs exist.
|
||||||
|
- Validate confidence values.
|
||||||
|
- Apply local keep-item scoring instead of blindly trusting `keep_id`.
|
||||||
|
- Skip deletion if the deletion ratio exceeds a configured threshold.
|
||||||
|
- Skip deletion when versions, product names, or dates conflict.
|
||||||
|
|
||||||
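
The first, second, and fourth checks above can be enforced locally before any item is removed. The sketch below is a minimal version under stated assumptions: `safe_removals` and its error strings are hypothetical, the input follows the output schema shown earlier, and the default ratio mirrors `semantic_dedup_max_deletion_ratio` in `config/pipeline.json`. Conflict detection for versions, product names, and dates is left out.

```python
from __future__ import annotations

from typing import Any


def safe_removals(
    llm_output: dict[str, Any],
    known_ids: set[str],
    *,
    max_deletion_ratio: float = 0.5,
) -> tuple[set[str], list[str]]:
    """Return (ids to remove, errors) after validating the LLM's duplicate groups."""
    removed: set[str] = set()
    errors: list[str] = []
    for group in llm_output.get("duplicate_groups") or []:
        if group.get("confidence") != "high":
            continue  # medium or uncertain results never remove anything
        keep_id = group.get("keep_id")
        remove_ids = group.get("remove_ids") or []
        if keep_id not in known_ids or any(i not in known_ids for i in remove_ids):
            errors.append(f"unknown_id_in_group:{keep_id}")
            continue
        removed.update(i for i in remove_ids if i != keep_id)
    if known_ids and len(removed) / len(known_ids) > max_deletion_ratio:
        # Too aggressive: fall back to the Stage 2 output untouched.
        return set(), errors + ["deletion_ratio_exceeded"]
    return removed, errors
```

Returning an empty set on any doubt matches the failure behavior of this stage: the pipeline continues with the Stage 2 output rather than trusting a suspicious response.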
|
### Failure Behavior
|
||||||
|
|
||||||
|
If timeout, JSON parse failure, or schema validation failure occurs, keep Stage 2 output and continue.
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
semantic_dedup_input.json
|
||||||
|
semantic_dedup_output.json
|
||||||
|
stage3_semantic_dedup_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 4: Rewrite Titles and Summaries
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Produce concise, accurate Chinese display titles and summaries.
|
||||||
|
|
||||||
|
### Rules
|
||||||
|
|
||||||
|
- Keep `title_raw` and `summary_raw` unchanged.
|
||||||
|
- Write display fields to `title` and `summary`.
|
||||||
|
- Preserve brand names, model names, API names, and common technical acronyms in English.
|
||||||
|
- Translate the rest into natural Chinese.
|
||||||
|
- Avoid marketing words such as "heavyweight", "explosive", or "just now" unless they are factual and necessary.
|
||||||
|
- Summaries should be factual, concise, and usually 80-140 Chinese characters.
|
||||||
|
- Do not add facts not present in the raw title or summary.
|
||||||
|
- Do not write advice or commentary.
|
||||||
|
|
||||||
|
### Batch Strategy
|
||||||
|
|
||||||
|
- Process 8-12 items per batch.
|
||||||
|
- Allow limited parallel batches.
|
||||||
|
- Retry a failed batch once.
|
||||||
|
- Fall back per item or per batch if needed.
|
||||||
|
|
||||||
|
### Validation
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- Non-empty title and summary.
|
||||||
|
- No markdown links in title.
|
||||||
|
- No URL in summary.
|
||||||
|
- No `[N]` or reference markers.
|
||||||
|
- No emoji.
|
||||||
|
- Summary length under limit.
|
||||||
|
- Key numbers, versions, and model names are preserved when present in raw input.
|
||||||
|
|
||||||
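
These checks are cheap to run locally after each batch. The sketch below is one way to express them; the function name, the problem codes, the 140-character default, and the emoji ranges (a rough heuristic, not a complete emoji test) are assumptions. The reference-marker pattern is the same one used in `ai_daily_report/validate.py`.

```python
from __future__ import annotations

import re

# Rough emoji heuristic covering common pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def rewrite_problems(
    title: str,
    summary: str,
    *,
    raw_text: str = "",
    max_summary_chars: int = 140,
) -> list[str]:
    """Return a list of problem codes for one rewritten item; an empty list means it passed."""
    problems: list[str] = []
    combined = title + summary
    if not title.strip() or not summary.strip():
        problems.append("empty_field")
    if re.search(r"\[[^\]]+\]\([^)]+\)", title):
        problems.append("markdown_link_in_title")
    if re.search(r"https?://", summary):
        problems.append("url_in_summary")
    if re.search(r"\[\d+\]|\[N\]", combined):
        problems.append("reference_marker")
    if EMOJI_RE.search(combined):
        problems.append("emoji")
    if len(summary) > max_summary_chars:
        problems.append("summary_too_long")
    # Numbers and versions in the raw text should survive the rewrite.
    for token in sorted(set(re.findall(r"\d+(?:\.\d+)*", raw_text))):
        if token not in combined:
            problems.append(f"missing_number:{token}")
    return problems
```

A non-empty result could trigger the single batch retry described above, with a fallback to the raw title and summary if the retry fails as well.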
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
rewritten_items.json
|
||||||
|
rewrite_llm_outputs.json
|
||||||
|
stage4_rewrite_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 5: Classify and Order
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Place each item into a stable section and order items for readable scanning.
|
||||||
|
|
||||||
|
### Recommended Sections
|
||||||
|
|
||||||
|
Use a fixed section whitelist:
|
||||||
|
|
||||||
|
```text
|
||||||
|
模型与能力
|
||||||
|
产品与应用
|
||||||
|
开发与基础设施
|
||||||
|
公司与资本
|
||||||
|
政策与安全
|
||||||
|
论文与研究
|
||||||
|
观点与教程
|
||||||
|
人物与动态
|
||||||
|
```
|
||||||
|
|
||||||
|
Hide empty sections. Do not create dynamic section names.
|
||||||
|
|
||||||
|
### Classification Strategy
|
||||||
|
|
||||||
|
Use a three-layer approach:
|
||||||
|
|
||||||
|
1. Source hint mapping.
|
||||||
|
2. Local rule fallback.
|
||||||
|
3. LLM classification for ambiguous items only.
|
||||||
|
|
||||||
|
Example alias mapping:
|
||||||
|
|
||||||
|
```text
|
||||||
|
模型发布/更新 -> 模型与能力
|
||||||
|
产品发布/更新 -> 产品与应用
|
||||||
|
产品与工具 -> 产品与应用
|
||||||
|
开发与工程 -> 开发与基础设施
|
||||||
|
行业动态 -> 公司与资本
|
||||||
|
行业与公司 -> 公司与资本
|
||||||
|
论文研究 -> 论文与研究
|
||||||
|
技巧与观点 -> 观点与教程
|
||||||
|
人物与花絮 -> 人物与动态
|
||||||
|
```
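The first layer (source hint mapping) might look like the sketch below. The section names and aliases are copied from the lists above; the function name `resolve_section` and the choice to return `None` for unknown hints are assumptions. An unknown hint is meant to pass on to the local rules or the LLM layer.

```python
from typing import Optional

SECTION_WHITELIST = [
    "模型与能力",
    "产品与应用",
    "开发与基础设施",
    "公司与资本",
    "政策与安全",
    "论文与研究",
    "观点与教程",
    "人物与动态",
]

SECTION_ALIASES = {
    "模型发布/更新": "模型与能力",
    "产品发布/更新": "产品与应用",
    "产品与工具": "产品与应用",
    "开发与工程": "开发与基础设施",
    "行业动态": "公司与资本",
    "行业与公司": "公司与资本",
    "论文研究": "论文与研究",
    "技巧与观点": "观点与教程",
    "人物与花絮": "人物与动态",
}


def resolve_section(source_hint: Optional[str]) -> Optional[str]:
    """Map a source-provided section hint to a whitelisted section, or None."""
    hint = (source_hint or "").strip()
    if hint in SECTION_WHITELIST:
        return hint
    return SECTION_ALIASES.get(hint)
```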
|
||||||
|
|
||||||
|
### Ordering Strategy
|
||||||
|
|
||||||
|
Do not let the LLM freely order all items. Use local scoring:
|
||||||
|
|
||||||
|
```text
|
||||||
|
rank_score =
|
||||||
|
source priority
|
||||||
|
+ official source bonus
|
||||||
|
+ primary source bonus
|
||||||
|
+ recency score
|
||||||
|
+ key metric bonus
|
||||||
|
+ duplicate source bonus
|
||||||
|
- quality flag penalty
|
||||||
|
```
|
||||||
|
|
||||||
|
Ordering is for readability only. It must not remove items.
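A sketch of the local scoring under stated assumptions: the field names, the weights, and the 24-hour recency window are all hypothetical placeholders, since the formula above names only the terms. Sorting returns every input item, so ordering cannot drop anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankInputs:
    item_id: str
    source_priority: float = 0.0
    is_official: bool = False
    is_primary: bool = False
    age_hours: float = 0.0
    has_key_metric: bool = False
    duplicate_source_count: int = 0
    quality_flags: int = 0


def rank_score(x: RankInputs) -> float:
    """Combine the terms of the formula; every weight here is illustrative."""
    recency = max(0.0, 1.0 - x.age_hours / 24.0)
    return (
        x.source_priority
        + (2.0 if x.is_official else 0.0)
        + (1.0 if x.is_primary else 0.0)
        + recency
        + (0.5 if x.has_key_metric else 0.0)
        + 0.5 * min(x.duplicate_source_count, 3)
        - 1.0 * x.quality_flags
    )


def order_items(items: List[RankInputs]) -> List[RankInputs]:
    """Sort by descending score; item_id breaks ties so output is deterministic."""
    return sorted(items, key=lambda x: (-rank_score(x), x.item_id))
```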
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
classified_items.json
|
||||||
|
stage5_classify_order_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 6: Guide and Daily Threads
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Generate a concise top guide and a bottom "daily threads" section that helps readers understand the day's shape without turning the report into an investment memo.
|
||||||
|
|
||||||
|
### Replace Current Summary Style
|
||||||
|
|
||||||
|
Do not use:
|
||||||
|
|
||||||
|
```text
|
||||||
|
强信号 / 中信号 / 待验证
|
||||||
|
```
|
||||||
|
|
||||||
|
This style feels too much like an industry rating or investment brief.
|
||||||
|
|
||||||
|
Use:
|
||||||
|
|
||||||
|
```text
|
||||||
|
导览
|
||||||
|
今日脉络
|
||||||
|
仍待确认, when needed
|
||||||
|
```
|
||||||
|
|
||||||
|
### Output Schema
|
||||||
|
|
||||||
|
The LLM should output structured JSON, not Markdown:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"theme": "One concise daily theme.",
|
||||||
|
"threads": [
|
||||||
|
{
|
||||||
|
"title": "模型能力继续向长上下文、实时语音、多模态生成推进",
|
||||||
|
"text": "MiniMax M3、Miso One、Ideogram v4.0 分别从长上下文解码、语音克隆和图像生成质量上更新能力边界。",
|
||||||
|
"item_ids": ["item_1", "item_2", "item_3"],
|
||||||
|
"kind": "thread"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"title": "仍待确认",
|
||||||
|
"text": "融资传闻、排行榜和单源爆料类消息需要等待官方或更多来源确认。",
|
||||||
|
"item_ids": ["item_8"],
|
||||||
|
"kind": "uncertain"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rules
|
||||||
|
|
||||||
|
- Theme should be one paragraph under 120 Chinese characters.
|
||||||
|
- Output 2-4 threads.
|
||||||
|
- Each thread must bind to existing `item_ids`.
|
||||||
|
- Do not add facts absent from the item list.
|
||||||
|
- Do not write advice.
|
||||||
|
- Do not include reference numbers.
|
||||||
|
- Do not include Markdown blockquote syntax. Stage 7 will render Markdown.
|
||||||
|
|
||||||
|
### Failure Behavior
|
||||||
|
|
||||||
|
- If theme generation fails, omit the guide or use a conservative fallback.
|
||||||
|
- If threads fail, omit `今日脉络`.
|
||||||
|
- Drop any thread that references an invalid item ID.
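The rules and failure behavior could be enforced with a small sanitizer like the sketch below. The function name `sanitize_guide` is hypothetical, and treating an over-length theme as a failed theme is an assumption rather than something the design states.

```python
from typing import Any, Dict, List, Set


def sanitize_guide(
    guide: Dict[str, Any],
    known_ids: Set[str],
    max_theme_len: int = 120,
    max_threads: int = 4,
) -> Dict[str, Any]:
    """Keep only a usable theme and threads bound to existing item IDs."""
    theme = str(guide.get("theme", "")).strip()
    if len(theme) > max_theme_len:
        theme = ""  # assumption: treat as failed, so the guide is omitted

    threads: List[Dict[str, Any]] = []
    for thread in guide.get("threads", []):
        ids = thread.get("item_ids", [])
        if not ids or any(item_id not in known_ids for item_id in ids):
            continue  # drop threads with missing or unknown item IDs
        if thread.get("kind") not in {"thread", "uncertain"}:
            continue
        threads.append(thread)
    return {"theme": theme, "threads": threads[:max_threads]}
```

An empty `threads` list in the result signals that `今日脉络` should be omitted at render time.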
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
guide_input.json
|
||||||
|
guide_output.json
|
||||||
|
stage6_guide_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 7: Assemble and Validate Markdown
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Render final Markdown deterministically and validate it before publishing.
|
||||||
|
|
||||||
|
### Recommended Structure
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## 导览
|
||||||
|
|
||||||
|
> 一句话主线。
|
||||||
|
|
||||||
|
## 模型与能力
|
||||||
|
|
||||||
|
**1. 新闻标题**
|
||||||
|
|
||||||
|
> 新闻摘要。[来源 ↗](https://example.com)
|
||||||
|
|
||||||
|
## 今日脉络
|
||||||
|
|
||||||
|
- **主题**
|
||||||
|
说明...
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rendering Rules
|
||||||
|
|
||||||
|
- Render Markdown in code only.
|
||||||
|
- Use global continuous numbering.
|
||||||
|
- Hide empty sections.
|
||||||
|
- Add blockquote syntax for the guide in code.
|
||||||
|
- Strip any leading `>` from LLM-provided theme text before rendering.
|
||||||
|
- Use source links consistently:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
[OpenAI: Blog ↗](https://example.com)
|
||||||
|
```
|
||||||
|
|
||||||
|
If URL is unavailable, render the source label without a link.
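Two of the rendering rules are small enough to sketch directly. The helper names `render_source_link` and `render_guide` are hypothetical.

```python
from typing import Optional


def render_source_link(label: str, url: Optional[str] = None) -> str:
    """Render `[label ↗](url)`, or just the label when no URL is available."""
    label = label.strip()
    if url and url.strip():
        return f"[{label} ↗]({url.strip()})"
    return label


def render_guide(theme: str) -> str:
    """Strip any leading '>' from LLM text, then add blockquote syntax once."""
    text = theme.strip()
    while text.startswith(">"):
        text = text[1:].lstrip()
    return f"> {text}"
```

Adding the blockquote marker only in code is what keeps a stray `>` in the LLM output from producing `> >` in the final Markdown.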
|
||||||
|
|
||||||
|
### Auto-fixes
|
||||||
|
|
||||||
|
- Remove `> >`.
|
||||||
|
- Remove `[N]` and numeric reference markers.
|
||||||
|
- Remove code fences from guide/thread text.
|
||||||
|
- Normalize extra blank lines.
|
||||||
|
- Add missing Chinese punctuation to summaries.
|
||||||
|
- Remove `主线判断:` prefixes if present.
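A partial sketch of the auto-fix pass, covering the stacked blockquote markers, numeric reference markers, the `主线判断` prefix, and blank-line normalization. Code-fence removal and punctuation repair are left out. The function name and regular expressions are assumptions. Numeric labels that are part of a Markdown link are left alone.

```python
import re


def autofix_markdown(markdown: str) -> str:
    """Apply a subset of the deterministic auto-fixes to rendered Markdown."""
    # Collapse stacked blockquote markers such as "> > text" into "> text".
    markdown = re.sub(r"^(?:>[ \t]*){2,}", "> ", markdown, flags=re.MULTILINE)
    # Remove [N] reference markers unless they are the label of a link.
    markdown = re.sub(r"\[\d+\](?!\()", "", markdown)
    # Remove the prefix with either a full-width or an ASCII colon.
    markdown = re.sub(r"主线判断[::]\s*", "", markdown)
    # Normalize runs of three or more newlines to a single blank line.
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)
    return markdown
```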
|
||||||
|
|
||||||
|
### Blocking Checks
|
||||||
|
|
||||||
|
Block publish or downgrade to draft when:
|
||||||
|
|
||||||
|
- Item count is zero.
|
||||||
|
- No sections are rendered.
|
||||||
|
- Markdown is abnormally short.
|
||||||
|
- Section name is outside the whitelist.
|
||||||
|
- JSON fragments remain in Markdown.
|
||||||
|
- Link formatting is broadly broken.
|
||||||
|
- Forbidden advisory language appears in guide/thread text.
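Several of the blocking checks can be expressed as a pure function over the rendered Markdown. In this sketch the function name, the error codes, the 200-character minimum, and the JSON-fragment heuristic are assumptions. The link-format and advisory-language checks are omitted.

```python
import re
from typing import List

SECTION_WHITELIST = {
    "模型与能力", "产品与应用", "开发与基础设施", "公司与资本",
    "政策与安全", "论文与研究", "观点与教程", "人物与动态",
}
STRUCTURAL_HEADINGS = {"导览", "今日脉络"}  # guide and threads, not item sections
JSON_FRAGMENT = re.compile(r'"[A-Za-z_]+"\s*:\s*')


def blocking_errors(markdown: str, item_count: int, min_length: int = 200) -> List[str]:
    """Return blocking error codes; a non-empty list should stop publishing."""
    errors: List[str] = []
    if item_count == 0:
        errors.append("no_items")
    headings = re.findall(r"^## (.+)$", markdown, flags=re.MULTILINE)
    sections = [h.strip() for h in headings if h.strip() not in STRUCTURAL_HEADINGS]
    if not sections:
        errors.append("no_sections")
    for section in sections:
        if section not in SECTION_WHITELIST:
            errors.append(f"unknown_section:{section}")
    if len(markdown) < min_length:
        errors.append("markdown_too_short")
    if JSON_FRAGMENT.search(markdown):
        errors.append("json_fragment")
    return errors
```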
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
blog_markdown.md
|
||||||
|
stage7_markdown_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Stage 8: Publish and Deliver
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Publish only validated Markdown, verify the public page, and make the operation idempotent and recoverable.
|
||||||
|
|
||||||
|
### Modes
|
||||||
|
|
||||||
|
```text
|
||||||
|
dry-run
|
||||||
|
draft
|
||||||
|
publish
|
||||||
|
```
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
- Do not publish when Stage 7 has blocking errors.
|
||||||
|
- Use a deterministic slug such as `ai-YYYY-MM-DD`.
|
||||||
|
- Check whether the slug already exists before creating a new post.
|
||||||
|
- Support existence strategies:
|
||||||
|
- `skip`
|
||||||
|
- `update-draft`
|
||||||
|
- `replace`
|
||||||
|
- `republish`
|
||||||
|
- Verify the public URL with retries.
|
||||||
|
- Preserve Markdown and reports when publishing fails.
|
||||||
|
- Support publishing from an existing run directory.
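A small sketch of the slug, existence-strategy, and retry pieces. The helper names are hypothetical and `check` is any callable that reports whether the public page is reachable, so no HTTP client is assumed. The meaning of each strategy is not defined here, so the sketch only validates the strategy name.

```python
import time
from datetime import date
from typing import Callable

EXISTENCE_STRATEGIES = {"skip", "update-draft", "replace", "republish"}


def slug_for(day: date) -> str:
    """Deterministic slug in the `ai-YYYY-MM-DD` form."""
    return f"ai-{day.isoformat()}"


def decide_action(exists: bool, strategy: str) -> str:
    """Return 'create' for a new slug, else the configured strategy name."""
    if strategy not in EXISTENCE_STRATEGIES:
        raise ValueError(f"unknown existence strategy: {strategy}")
    return strategy if exists else "create"


def verify_with_retries(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Injecting `sleep` keeps the retry loop testable without real delays.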
|
||||||
|
|
||||||
|
### Artifacts
|
||||||
|
|
||||||
|
```text
|
||||||
|
stage8_publish_report.json
|
||||||
|
run_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## Run Directory
|
||||||
|
|
||||||
|
Every run should write to an isolated directory:
|
||||||
|
|
||||||
|
```text
|
||||||
|
runs/2026-06-04/
|
||||||
|
source_results.json
|
||||||
|
raw_items.json
|
||||||
|
stage0_collect_report.json
|
||||||
|
normalized_items.json
|
||||||
|
stage1_normalize_report.json
|
||||||
|
deduped_items.json
|
||||||
|
stage2_dedupe_report.json
|
||||||
|
semantic_dedup_output.json
|
||||||
|
stage3_semantic_dedup_report.json
|
||||||
|
rewritten_items.json
|
||||||
|
stage4_rewrite_report.json
|
||||||
|
classified_items.json
|
||||||
|
stage5_classify_order_report.json
|
||||||
|
guide_output.json
|
||||||
|
stage6_guide_report.json
|
||||||
|
blog_markdown.md
|
||||||
|
stage7_markdown_report.json
|
||||||
|
stage8_publish_report.json
|
||||||
|
run_report.json
|
||||||
|
```
|
||||||
|
|
||||||
|
This makes the pipeline replayable and debuggable.
|
||||||
|
|
||||||
|
## CLI
|
||||||
|
|
||||||
|
Provide agent-friendly commands:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ai-daily-report run --date today --mode publish
|
||||||
|
ai-daily-report run --date today --mode dry-run
|
||||||
|
ai-daily-report run --date 2026-06-04 --mode draft
|
||||||
|
ai-daily-report replay --run-id 2026-06-04 --from-stage 4
|
||||||
|
ai-daily-report publish --from-run 2026-06-04
|
||||||
|
ai-daily-report status --date 2026-06-04
|
||||||
|
```
|
||||||
|
|
||||||
|
The current cron can keep invoking the compatibility script, which should delegate to the CLI.
|
||||||
|
|
||||||
|
## Skill Strategy
|
||||||
|
|
||||||
|
Create or update an `ai-daily-report` skill for Hermes/OpenClaw. The skill should not contain business logic. It should provide:
|
||||||
|
|
||||||
|
- How to run daily generation.
|
||||||
|
- How to dry-run.
|
||||||
|
- How to replay from an existing run.
|
||||||
|
- How to publish already generated Markdown.
|
||||||
|
- How to diagnose source, LLM, Markdown, or publish failures.
|
||||||
|
- How to add a new RSS source.
|
||||||
|
- How to adjust output style without breaking the pipeline.
|
||||||
|
|
||||||
|
Suggested skill references:
|
||||||
|
|
||||||
|
```text
|
||||||
|
skill/references/sources.md
|
||||||
|
skill/references/output-style.md
|
||||||
|
skill/references/troubleshooting.md
|
||||||
|
skill/references/llm-config.md
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
Add fixtures and tests for:
|
||||||
|
|
||||||
|
- AI HOT sample parsing.
|
||||||
|
- RSS parsing.
|
||||||
|
- Juya `content:encoded` parsing.
|
||||||
|
- URL canonicalization.
|
||||||
|
- Title normalization.
|
||||||
|
- Deterministic deduplication.
|
||||||
|
- LLM JSON schema validation.
|
||||||
|
- Rewrite output validation.
|
||||||
|
- Section alias mapping.
|
||||||
|
- Markdown rendering.
|
||||||
|
- Markdown validation.
|
||||||
|
- Publish dry-run behavior.
|
||||||
|
|
||||||
|
Start with local fixture tests. They will give most of the stability benefit without needing live network calls.
|
||||||
|
|
||||||
|
## Migration Plan
|
||||||
|
|
||||||
|
### Phase 1: Stabilize Current Script
|
||||||
|
|
||||||
|
- Add run directories.
|
||||||
|
- Add SourceResult and stage reports.
|
||||||
|
- Add URL canonicalization.
|
||||||
|
- Replace risky Stage 0 dedupe with hard dedup.
|
||||||
|
- Add Markdown validation and auto-fixes.
|
||||||
|
|
||||||
|
### Phase 2: Improve Quality
|
||||||
|
|
||||||
|
- Add semantic dedup schema and safety checks.
|
||||||
|
- Batch rewrite title and summary.
|
||||||
|
- Add section alias mapping and rule-first classification.
|
||||||
|
- Replace the current summary with `今日脉络`.
|
||||||
|
|
||||||
|
### Phase 3: Modularize
|
||||||
|
|
||||||
|
- Extract modules under `ai_daily_report/`.
|
||||||
|
- Add CLI.
|
||||||
|
- Keep old script as compatibility entrypoint.
|
||||||
|
- Add fixture tests.
|
||||||
|
|
||||||
|
### Phase 4: Skill Integration
|
||||||
|
|
||||||
|
- Update `skill/SKILL.md`.
|
||||||
|
- Add references for sources, style, troubleshooting, and LLM config.
|
||||||
|
- Make Hermes/OpenClaw call the CLI.
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
The optimized pipeline should satisfy:
|
||||||
|
|
||||||
|
- A usable Markdown report is generated whenever enough source data exists.
|
||||||
|
- Optional source failures degrade the run but do not stop it.
|
||||||
|
- LLM failures degrade individual stages but do not destroy the whole report.
|
||||||
|
- No non-duplicate item is removed by importance or editorial selection.
|
||||||
|
- Every removed duplicate has a reason.
|
||||||
|
- Every stage writes inspectable artifacts.
|
||||||
|
- A failed publish can be retried from an existing run.
|
||||||
|
- Agents can run, diagnose, replay, and publish via stable commands.
|
||||||
159
docs/plans/2026-06-04-local-dry-run-foundation.md
Normal file
@@ -0,0 +1,159 @@
|
|||||||
|
# Local Dry-Run Foundation Implementation Plan
|
||||||
|
|
||||||
|
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||||
|
|
||||||
|
**Goal:** Make the current pipeline testable on a local machine without Hermes credentials, blog credentials, or live LLM calls.
|
||||||
|
|
||||||
|
**Architecture:** Keep the existing single script as the compatibility entrypoint. Add small, tested helpers for project `.env` loading, dry-run token behavior, and mock LLM responses. This creates a safe base for later Stage 0-8 modularization.
|
||||||
|
|
||||||
|
**Tech Stack:** Python standard library, `unittest`, current `script/ai_daily_blog_pipeline.py`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Add Local `.env` Loading
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `script/ai_daily_blog_pipeline.py`
|
||||||
|
- Create: `tests/test_env_loading.py`
|
||||||
|
|
||||||
|
**Step 1: Write the failing test**
|
||||||
|
|
||||||
|
Test that `load_env()` reads project-root `.env` values when Hermes env is absent, and that real process environment variables override file values.
|
||||||
|
|
||||||
|
**Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_env_loading -v`
|
||||||
|
|
||||||
|
Expected: FAIL because the script currently only reads `~/.hermes/.env`.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Add a helper to parse env files and update `load_env()` to read:
|
||||||
|
|
||||||
|
1. Project `.env`
|
||||||
|
2. `~/.hermes/.env`
|
||||||
|
3. process environment
|
||||||
|
|
||||||
|
Later sources override earlier ones.
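One way the precedence could be implemented is sketched below. The signature is an assumption, since the real `load_env()` in `script/ai_daily_blog_pipeline.py` may take different arguments. `parse_env_file` is a hypothetical helper that handles only simple `KEY=value` lines.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines; skip blanks, comments, and missing files."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(
    project_root: Path,
    hermes_env: Path,
    process_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge project .env, then Hermes .env, then the process environment."""
    env: Dict[str, str] = {}
    env.update(parse_env_file(project_root / ".env"))
    env.update(parse_env_file(hermes_env))
    env.update(os.environ if process_env is None else process_env)
    return env
```

Passing `process_env` explicitly lets the test exercise the override order without touching the real environment.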
|
||||||
|
|
||||||
|
**Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_env_loading -v`
|
||||||
|
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
### Task 2: Let Dry-Run Skip Blog Token Requirement
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `script/ai_daily_blog_pipeline.py`
|
||||||
|
- Create: `tests/test_dry_run_config.py`
|
||||||
|
|
||||||
|
**Step 1: Write the failing test**
|
||||||
|
|
||||||
|
Extract small helpers such as `is_dry_run(env)` and `require_blog_token(env)`, then test:
|
||||||
|
|
||||||
|
- `AI_DAILY_DRY_RUN=1` does not require `BLOG_SERVICE_TOKEN`.
|
||||||
|
- normal publish mode still requires a token.
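The two helpers might be as small as the sketch below. Only `AI_DAILY_DRY_RUN=1` is stated in this plan; accepting `true` and `yes`, the return type, and the error message are assumptions.

```python
from typing import Mapping, Optional


def is_dry_run(env: Mapping[str, str]) -> bool:
    """True when AI_DAILY_DRY_RUN is set to a truthy value."""
    return env.get("AI_DAILY_DRY_RUN", "").strip().lower() in {"1", "true", "yes"}


def require_blog_token(env: Mapping[str, str]) -> Optional[str]:
    """Return the blog token, or None in dry-run mode where it is not needed."""
    if is_dry_run(env):
        return None
    token = env.get("BLOG_SERVICE_TOKEN", "").strip()
    if not token:
        raise RuntimeError("BLOG_SERVICE_TOKEN is required outside dry-run mode")
    return token
```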
|
||||||
|
|
||||||
|
**Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_dry_run_config -v`
|
||||||
|
|
||||||
|
Expected: FAIL because no helper exists and `main()` checks token before dry-run.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Move dry-run detection before token validation in `main()`.
|
||||||
|
|
||||||
|
**Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_dry_run_config -v`
|
||||||
|
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
### Task 3: Add Mock LLM Mode
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `script/ai_daily_blog_pipeline.py`
|
||||||
|
- Create: `tests/test_mock_llm.py`
|
||||||
|
|
||||||
|
**Step 1: Write the failing test**
|
||||||
|
|
||||||
|
Test that `llm_call(prompt, {"AI_DAILY_LLM_MODE": "mock"})` returns valid JSON for:
|
||||||
|
|
||||||
|
- semantic dedup prompts
|
||||||
|
- summary rewrite prompts
|
||||||
|
- classify prompts
|
||||||
|
|
||||||
|
Also test that guide generation can get a non-empty mock response.
|
||||||
|
|
||||||
|
**Step 2: Run test to verify it fails**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_mock_llm -v`
|
||||||
|
|
||||||
|
Expected: FAIL because mock mode does not exist.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Add `AI_DAILY_LLM_MODE=mock` support in `llm_call()`.
|
||||||
|
|
||||||
|
**Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_mock_llm -v`
|
||||||
|
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
### Task 4: Add Markdown Smoke Test
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `tests/test_markdown_rendering.py`
|
||||||
|
- Modify: `script/ai_daily_blog_pipeline.py` only if necessary.
|
||||||
|
|
||||||
|
**Step 1: Write the failing or characterization test**
|
||||||
|
|
||||||
|
Test that `blog_markdown()` renders:
|
||||||
|
|
||||||
|
- `## 导览`
|
||||||
|
- at least one section
|
||||||
|
- source links
|
||||||
|
- no `> >`
|
||||||
|
- no `[N]`
|
||||||
|
|
||||||
|
**Step 2: Run test**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_markdown_rendering -v`
|
||||||
|
|
||||||
|
Expected: If it already passes, keep it as characterization coverage. If it fails because of `> >`, implement a focused fix.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal fix if needed**
|
||||||
|
|
||||||
|
Strip leading `>` from guide text before adding blockquote syntax.
|
||||||
|
|
||||||
|
**Step 4: Run test to verify it passes**
|
||||||
|
|
||||||
|
Run: `python -m unittest tests.test_markdown_rendering -v`
|
||||||
|
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
### Task 5: Run Full Verification
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- No new files.
|
||||||
|
|
||||||
|
**Step 1: Run unit tests**
|
||||||
|
|
||||||
|
Run: `python -m unittest discover -s tests -v`
|
||||||
|
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
**Step 2: Run compile check**
|
||||||
|
|
||||||
|
Run: `python -m py_compile script/ai_daily_blog_pipeline.py`
|
||||||
|
|
||||||
|
Expected: exit code 0.
|
||||||
|
|
||||||
|
**Step 3: Check git status**
|
||||||
|
|
||||||
|
Run: `git status --short`
|
||||||
|
|
||||||
|
Expected: only intended files are modified or added.
|
||||||
130
docs/plans/2026-06-10-ai-daily-full-chain-optimization.md
Normal file
@@ -0,0 +1,130 @@
|
|||||||
|
# AI Daily Full Chain Optimization Implementation Plan
|
||||||
|
|
||||||
|
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||||
|
|
||||||
|
**Goal:** Add the first quality safety layer for the AI daily report pipeline: semantic candidate recall, quality gate reporting, stage snapshots, and effective pipeline configuration.
|
||||||
|
|
||||||
|
**Architecture:** Keep the existing stage functions and add a rule-based Stage 2.8 between cross-day URL dedupe and LLM semantic dedupe. The quality gate stays deterministic and report-only for dry-run visibility, while publish blocking can consume its `blocking_errors` through the existing Stage 7/8 guard path. The runner persists stage artifacts from the pipeline result without changing generated content.
|
||||||
|
|
||||||
|
**Tech Stack:** Python standard library, `unittest`, existing dataclass models and pipeline modules.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Make Pipeline Config Effective
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `ai_daily_report/pipeline.py`
|
||||||
|
- Modify: `ai_daily_report/runner.py`
|
||||||
|
- Test: `tests/test_stage0_to_4_pipeline.py`
|
||||||
|
- Test: `tests/test_runner.py`
|
||||||
|
|
||||||
|
**Step 1: Write failing tests**
|
||||||
|
|
||||||
|
Use existing tests that call `run_stage0_to_stage4(..., semantic_dedup_max_deletion_ratio=0.1, rewrite_batch_size=1)` and expect Stage 4 `batch_count == 3`.
|
||||||
|
|
||||||
|
**Step 2: Run tests to verify failure**
|
||||||
|
|
||||||
|
Run: `python -m pytest tests/test_stage0_to_4_pipeline.py tests/test_runner.py -q`
|
||||||
|
|
||||||
|
Expected: failure from unexpected keyword arguments or ignored config.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Thread `semantic_dedup_max_deletion_ratio` into `semantic_dedup_items()` and `rewrite_batch_size` into `rewrite_items()`. Read both from `pipeline.json` in `runner.py`.
|
||||||
|
|
||||||
|
**Step 4: Verify**
|
||||||
|
|
||||||
|
Run the same tests and expect pass.
|
||||||
|
|
||||||
|
### Task 2: Add Stage 2.8 Candidate Recall
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `ai_daily_report/candidate_recall.py`
|
||||||
|
- Modify: `ai_daily_report/pipeline.py`
|
||||||
|
- Test: `tests/test_candidate_recall.py`
|
||||||
|
- Test: `tests/test_stage0_to_4_pipeline.py`
|
||||||
|
|
||||||
|
**Step 1: Write failing tests**
|
||||||
|
|
||||||
|
Add tests proving related Claude Fable/Mythos items are recalled even when Stage 2 title candidates are empty, while unrelated Gemini/Gemma items are not grouped by company name alone.
|
||||||
|
|
||||||
|
**Step 2: Run tests to verify failure**
|
||||||
|
|
||||||
|
Run: `python -m pytest tests/test_candidate_recall.py tests/test_stage0_to_4_pipeline.py -q`
|
||||||
|
|
||||||
|
Expected: import failure for the new module or zero recalled candidates.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Use deterministic title similarity, token Jaccard, summary Jaccard, and strong entity overlap to produce candidate groups with `item_ids`, `reason`, `score`, and evidence fields.
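As an illustration of one of these signals, the sketch below recalls candidate pairs by token Jaccard on titles only. The tokenizer, the 0.5 threshold, the `reason` string, and the item dict shape are assumptions. Summary Jaccard, entity overlap, and evidence fields are left out.

```python
import re
from itertools import combinations
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased ASCII words and digits, plus individual CJK characters."""
    return set(re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for empty sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recall_candidates(items: List[Dict[str, str]], threshold: float = 0.5) -> List[dict]:
    """Return candidate pairs whose title token Jaccard meets the threshold."""
    candidates = []
    for left, right in combinations(items, 2):
        score = jaccard(tokens(left["title"]), tokens(right["title"]))
        if score >= threshold:
            candidates.append({
                "item_ids": [left["id"], right["id"]],
                "reason": "title_token_jaccard",
                "score": round(score, 3),
            })
    return candidates
```

Sharing only a company name gives a low score, so items are not grouped on that alone.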
|
||||||
|
|
||||||
|
**Step 4: Verify**
|
||||||
|
|
||||||
|
Run targeted tests and expect pass.
|
||||||
|
|
||||||
|
### Task 3: Add Quality Gate Reporting
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Create: `ai_daily_report/quality_gate.py`
|
||||||
|
- Modify: `ai_daily_report/pipeline.py`
|
||||||
|
- Test: `tests/test_quality_gate.py`
|
||||||
|
|
||||||
|
**Step 1: Write failing tests**
|
||||||
|
|
||||||
|
Add tests for warnings when Stage 3 candidates are zero for large item sets, enabled sources fail, and required sources fail.
|
||||||
|
|
||||||
|
**Step 2: Run tests to verify failure**
|
||||||
|
|
||||||
|
Run: `python -m pytest tests/test_quality_gate.py -q`
|
||||||
|
|
||||||
|
Expected: import failure for the new module.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Return a report with `warnings`, `blocking_errors`, `source_failures`, and `quality_gate_failed`. Run the gate after Stage 7 and propagate its blocking errors into the Stage 7 result before publishing.
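A sketch of the report shape. The input format for source results, the threshold of 30 items, the warning strings, and the choice to treat a failed required source as blocking are all assumptions; the plan only names the report fields.

```python
from typing import Any, Dict, List


def quality_gate(
    item_count: int,
    stage3_candidate_count: int,
    source_results: List[Dict[str, Any]],
    large_item_threshold: int = 30,
) -> Dict[str, Any]:
    """Build a deterministic quality report from counts and source statuses."""
    warnings: List[str] = []
    blocking: List[str] = []
    source_failures: List[str] = []
    for source in source_results:
        if source["ok"]:
            continue
        name = source["name"]
        source_failures.append(name)
        warnings.append(f"source_failed:{name}")
        if source.get("required"):
            # Assumption: a failed required source blocks publishing.
            blocking.append(f"required_source_failed:{name}")
    if item_count >= large_item_threshold and stage3_candidate_count == 0:
        warnings.append("no_semantic_candidates_for_large_item_set")
    return {
        "warnings": warnings,
        "blocking_errors": blocking,
        "source_failures": source_failures,
        "quality_gate_failed": bool(blocking),
    }
```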
|
||||||
|
|
||||||
|
**Step 4: Verify**
|
||||||
|
|
||||||
|
Run quality gate and publish-path tests.
|
||||||
|
|
||||||
|
### Task 4: Persist Stage Snapshots
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `ai_daily_report/pipeline.py`
|
||||||
|
- Modify: `ai_daily_report/runner.py`
|
||||||
|
- Test: `tests/test_runner.py`
|
||||||
|
|
||||||
|
**Step 1: Write failing tests**
|
||||||
|
|
||||||
|
Assert that a mock run writes `stage0_sources.json`, `stage1_items.json`, `stage2_items.json`, `stage2_5_items.json`, `stage2_8_candidates.json`, `stage3_items.json`, `stage4_items.json`, and `quality_gate.json`.
|
||||||
|
|
||||||
|
**Step 2: Run tests to verify failure**
|
||||||
|
|
||||||
|
Run: `python -m pytest tests/test_runner.py -q`
|
||||||
|
|
||||||
|
Expected: snapshot files are missing.
|
||||||
|
|
||||||
|
**Step 3: Implement minimal code**
|
||||||
|
|
||||||
|
Have pipeline results carry an `artifacts` dict and have runner serialize the requested JSON files using the existing dataclass serializer.
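The snapshot step could look roughly like this. `to_jsonable` is a stand-in for the existing dataclass serializer, which is not shown in this plan, and `write_snapshots` is a hypothetical name. Keys of the `artifacts` dict are assumed to be file stems such as `stage1_items`.

```python
import json
from dataclasses import asdict, is_dataclass
from pathlib import Path
from typing import Any, Dict, List


def to_jsonable(value: Any) -> Any:
    """Convert dataclass instances, lists, and dicts into JSON-friendly data."""
    if is_dataclass(value) and not isinstance(value, type):
        return asdict(value)
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items()}
    return value


def write_snapshots(run_dir: Path, artifacts: Dict[str, Any]) -> List[Path]:
    """Write each artifact to `<run_dir>/<name>.json` and return the paths."""
    run_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for name, payload in artifacts.items():
        path = run_dir / f"{name}.json"
        text = json.dumps(to_jsonable(payload), ensure_ascii=False, indent=2)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Using `ensure_ascii=False` keeps Chinese titles readable when the snapshots are inspected by hand.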
|
||||||
|
|
||||||
|
**Step 4: Verify**
|
||||||
|
|
||||||
|
Run runner tests and inspect generated files through assertions.
|
||||||
|
|
||||||
|
### Task 5: Full Regression
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- All touched files
|
||||||
|
|
||||||
|
**Step 1: Run targeted tests**
|
||||||
|
|
||||||
|
Run: `python -m pytest tests/test_candidate_recall.py tests/test_quality_gate.py tests/test_stage0_to_4_pipeline.py tests/test_runner.py -q`
|
||||||
|
|
||||||
|
**Step 2: Run full test suite**
|
||||||
|
|
||||||
|
Run: `python -m pytest -q`
|
||||||
|
|
||||||
|
**Step 3: Fix regressions**
|
||||||
|
|
||||||
|
Fix only issues caused by this change set.
|
||||||
File diff suppressed because it is too large
@@ -1,198 +0,0 @@
|
|||||||
## 导览
|
|
||||||
|
|
||||||
> > 微软与OpenAI正式分家、Anthropic提交招股书、DeepSeek计划融500亿——AI行业正在从“联盟军”转向“诸侯争霸”。
|
|
||||||
|
|
||||||
## 模型发布/更新
|
|
||||||
|
|
||||||
**1. Grok Imagine 1.5 预览版发布**
|
|
||||||
|
|
||||||
> Grok Imagine 1.5 预览版即日起在 API 中上线,SpaceXAI 持续发力。[X:@cb_doge ↗](https://x.com/cb_doge/status/2062242490745594085)
|
|
||||||
|
|
||||||
**2. MiniMax M3 1M token 解码加速 15.6 倍**
|
|
||||||
|
|
||||||
> MiniMax M3 在 1M token 下解码加速 15.6 倍,FireworksAI_HQ 提供推理支持。[X:@MiniMax_AI ↗](https://x.com/MiniMax_AI/status/2062316914618388758)
|
|
||||||
|
|
||||||
**3. Miso One 开源语音模型:8B 参数、110ms 延迟、一次语音克隆**
|
|
||||||
|
|
||||||
> Miso One 发布 8B 参数开源语音模型,支持一次语音克隆(短样本),推理延迟 110ms,权重已开源,可自托管,API 即将推出,演示已上线。[X:@kimmonismus ↗](https://x.com/kimmonismus/status/2062210845308780639)
|
|
||||||
|
|
||||||
**4. Ideogram v4.0 发布:2K 分辨率和 JSON 提示支持**
|
|
||||||
|
|
||||||
> Ideogram v4.0 发布,原生 2K 分辨率,文字渲染出色,支持 JSON 提示词,可在 Krea 中体验。[X:@krea_ai ↗](https://x.com/krea_ai/status/2062227837130887567)
|
|
||||||
|
|
||||||
## 产品与工具
|
|
||||||
|
|
||||||
**5. Meta 面向 WhatsApp Business 的 AI 智能体现已全球上线**
|
|
||||||
|
|
||||||
> Meta 为 WhatsApp Business 推出的 AI 智能体面向全球商家开放,按模型 token 使用量收费。[TechCrunch ↗](https://techcrunch.com/2026/06/03/metas-ai-agent-for-whatsapp-business-is-now-available-globally)
|
|
||||||
|
|
||||||
**6. NousResearch 发布 Hermes Agent 桌面应用公测版**
|
|
||||||
|
|
||||||
> NousResearch 推出 Hermes Agent 桌面应用公测版。[X:@SiliconFlowAI ↗](https://x.com/SiliconFlowAI/status/2062042813852995899)
|
|
||||||
|
|
||||||
**7. xAI Grok 语音模型上线 Vapi 平台**
|
|
||||||
|
|
||||||
> xAI 的 Grok STT 和 TTS 语音模型登陆企业语音 AI 平台 Vapi,可用于构建自定义语音智能体。[X:@xai ↗](https://x.com/xai/status/2062209374039499178)
|
|
||||||
|
|
||||||
**8. Grok 模型登陆 Cloudflare AI Gateway**
|
|
||||||
|
|
||||||
> Grok 模型现已可在 Cloudflare AI Gateway 上试用。[X:@xai ↗](https://x.com/xai/status/2062294202625696081)
|
|
||||||
|
|
||||||
**9. OpenShell v0.0.55 发布:新增 Vertex AI 推理支持**
|
|
||||||
|
|
||||||
> OpenShell v0.0.55 发布,新增 Google Vertex AI 推理支持,改进策略可见性、Podman 检测和 GPU 沙箱行为。[X:@NVIDIAAI ↗](https://x.com/NVIDIAAI/status/2062210034109677665)
|
|
||||||
|
|
||||||
**10. Replit 上线 SEO Agent 助应用被发现**
|
|
||||||
|
|
||||||
> Replit 推出 SEO Agent,扫描应用并提供修复建议,帮助应用在网页和 AI 搜索中被发现。[X:@Replit ↗](https://x.com/Replit/status/2062211976995188871)
|
|
||||||
|
|
||||||
**11. OpenClaw 2026.6.1 发布:新增 Windows 节点与技能工坊**
|
|
||||||
|
|
||||||
> OpenClaw 2026.6.1 发布,新增原生 Windows 节点主机、技能工坊和工作板编排,支持 MiniMax M3。[X:@openclaw ↗](https://x.com/openclaw/status/2062288421406785710)
|
|
||||||
|
|
||||||
**12. Reachy Mini 添加 MCP 工具**
|
|
||||||
|
|
||||||
> Reachy Mini 推出公开 MCP canary Space,支持远程工具调用。[Hugging Face:Blog ↗](https://huggingface.co/blog/adding-mcp-tools-to-reachy-mini)
|
|
||||||
|
|
||||||
**13. 刚刚,Meta Skill 来了**
|
|
||||||
|
|
||||||
> GitHub 热门仓库 OpenSquilla 发布,代表 Meta Skill 新动向。[量子位 ↗](https://www.qbitai.com/2026/06/428335.html)
|
|
||||||
|
|
||||||
## 开发与工程
|
|
||||||
|
|
||||||
**14. Qwen Cloud 全球 AI 黑客马拉松启动**
|
|
||||||
|
|
||||||
> 首届 Qwen Cloud 全球 AI 黑客马拉松启动,5 大赛道,总奖金超 7 万美元(赛道冠军 1 万美元),Devpost 报名。[X:@alibaba_cloud ↗](https://x.com/alibaba_cloud/status/2062113338994172169)
|
|
||||||
|
|
||||||
**15. 洪水韧性新篇章:Google 开源水文建模框架**
|
|
||||||
|
|
||||||
> Google Research 开源基于 PyTorch 的水文建模框架,采用 Flood Hub 相同架构,允许各国气象部门在本地训练 AI 洪水预报模型。[Google Research:Blog ↗](https://research.google/blog/the-next-chapter-in-flood-resilience-open-sourcing-googles-hydrology-framework)
|
|
||||||
|
|
||||||
**16. 文章:导致 Spark 在 Kubernetes 上 OOM 失败的两个错误配置**
|
|
||||||
|
|
||||||
> 迁移 Spark 到 AKS 后,两个配置交互导致 OOM:spark.kubernetes.local.dirs.tmpfs 使 shuffle spill 改用 RAM 而非磁盘。[InfoQ AI ↗](https://www.infoq.com/articles/spark-oom-kubernetes-misconfigurations/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=AI%2C+ML+%26+Data+Engineering)
|
|
||||||
|
|
||||||
## 行业与公司
|
|
||||||
|
|
||||||
**17. 微软与 OpenAI 分道扬镳——如今双方准备正面交锋**
|
|
||||||
|
|
||||||
> 微软与 OpenAI 合作关系破裂,进入直接竞争。微软 AI 主管 Mustafa Suleyman 称微软需独立证明能力。[The Verge ↗](https://www.theverge.com/ai-artificial-intelligence/942242/microsoft-build-ai-agents-openai-competition)
|
|
||||||
|
|
||||||
**18. 欧盟公布全面技术主权计划,推动芯片与 AI 自主发展**
|
|
||||||
|
|
||||||
> 欧盟推出技术主权计划,扩大本土半导体、AI 和云计算供应链,减少对美亚依赖。[Bloomberg ↗](https://www.bloomberg.com/news/articles/2026-06-03/europe-unveils-sweeping-tech-sovereignty-plan-to-boost-chips-ai)
|
|
||||||
|
|
||||||
**19. Sensor Tower:OpenAI 旗下 ChatGPT 月活已破 10 亿,史上最快**
|
|
||||||
|
|
||||||
> Sensor Tower 估计 ChatGPT 月活于 2025 年 5 月突破 10 亿,增速史上最快;Claude 月活 5600 万,同比增 640%。[IT之家 ↗](https://www.ithome.com/0/959/083.htm)
|
|
||||||
|
|
||||||
**20. 消息称 DeepSeek 首轮融资拟筹集 500 亿元,腾讯、宁德时代等参投**
|
|
||||||
|
|
||||||
> DeepSeek 首轮拟融资 500 亿元,投后估值 3500-4000 亿元。创始人梁文锋出资 200 亿,腾讯拟投 100 亿,宁德时代 50 亿。[IT之家 ↗](https://www.ithome.com/0/959/249.htm)
|
|
||||||
|
|
||||||
**21. Suno 完成 4 亿美元 D 轮融资**
|
|
||||||
|
|
||||||
> Suno 完成 4 亿美元 D 轮融资,估值 54 亿美元,致力于让更多人体验音乐制作。[X:@suno ↗](https://x.com/suno/status/2062183524887675243)
|
|
||||||
|
|
||||||
**22. 宏利香港与阿里云达成 AI 战略合作**
|
|
||||||
|
|
||||||
> 宏利香港与阿里云建立战略合作,共建负责任 AI 创新框架,加速 AI 部署。[X:@alibaba_cloud ↗](https://x.com/alibaba_cloud/status/2062006591377829922)
|
|
||||||
|
|
||||||
**23. 优步每月 1,500 美元的 AI 使用上限为 AI 工具定价提供参考**
|
|
||||||
|
|
||||||
> 优步将 AI 工具月使用上限设为 1500 美元,为行业 AI 定价提供参考信号。[Simon Willison ↗](https://simonwillison.net/2026/Jun/3/uber-caps-usage)
|
|
||||||
|
|
||||||
**24. 世界模型榜首易主!跨维智能登顶 WorldArena**
|
|
||||||
|
|
||||||
> 跨维智能在 WorldArena 上登顶,成为世界模型新榜首。[量子位 ↗](https://www.qbitai.com/2026/06/428435.html)
|
|
||||||
|
|
||||||
**25. 刚刚,Anthropic 提交了招股书!**
|
|
||||||
|
|
||||||
> Anthropic 已提交招股书,预计最快 Q4 上市。[量子位 ↗](https://www.qbitai.com/2026/06/428407.html)
|
|
||||||
|
|
||||||
## 论文与研究
|
|
||||||
|
|
||||||
**26. 斯坦福大学法学院研究:人工智能的表现优于法学教授**
|
|
||||||
|
|
||||||
> 斯坦福大学法学院研究显示,AI 表现优于法学教授,该结果在 Hacker News 获 104 个 Points。[law.stanford.edu ↗](https://law.stanford.edu/press/ai-outperforms-law-professors-in-stanford-law-study)
|
|
||||||
|
|
||||||
**27. NVIDIA Research 在 CVPR 2026 发表三篇论文:规模化训练实现抓取、自动驾驶与智能体泛化**
|
|
||||||
|
|
||||||
> NVIDIA Research 在 CVPR 2026 发表三篇论文:零样本抓取模型 GraspGen-X、自动驾驶 LCDrive、具身智能体 NitroGen,均基于大规模训练。[blogs.nvidia.com:Blog ↗](https://blogs.nvidia.com/blog/cvpr-research-grasping-driving-agent-training)
|
|
||||||
|
|
||||||
**28. Anthropic 分析 832 个 AI 恶意账户:中高风险攻击者半年从 33% 跃至 56%**
|
|
||||||
|
|
||||||
> Anthropic 分析 832 个被封恶意账户,67.3% 使用 AI 编写恶意软件,中高风险占比半年内从 33% 升至 56%,传统威胁评估失效。[Anthropic ↗](https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack)
|
|
||||||
|
|
||||||
**29. 微软研究:装瓶厂 AI 从聊天到决策**
|
|
||||||
|
|
||||||
> 微软在中西部装瓶厂试点三个月显示,AI 超越聊天进入决策领域,需应对真实风险和可靠性要求。[X:@MSFTResearch ↗](https://x.com/MSFTResearch/status/2062204914223169635)
|
|
||||||
|
|
||||||
**30. 世界模型的功能分类**
|
|
||||||
|
|
||||||
> World Labs 与李飞飞发文梳理“世界模型”概念,基于 POMDP 框架分类,指出当前所谓世界模型本质是同一循环的不同投影(如渲染器)。[X:@drfeifei ↗](https://x.com/drfeifei/status/2062247238143996275)
|
|
||||||
|
|
||||||
**31. 从看懂世界到做对动作,卧安机器人 OneModel 1.7 用一条「隐式通路」打通了具身智能的关键断层**
|
|
||||||
|
|
||||||
> 卧安机器人 OneModel 1.7 通过隐式通路在潜在空间完成信息传导,打通具身智能关键断层。[量子位 ↗](https://www.qbitai.com/2026/06/428703.html)
|
|
||||||
|
|
||||||
## 人物与花絮
|
|
||||||
|
|
||||||
**32. 黄仁勋与纳德拉共议智能体 AI 时代**
|
|
||||||
|
|
||||||
> 黄仁勋与纳德拉在台北 MSBuild 同台,展示 NVIDIA 与微软从 Windows 到 AI 工厂的协作。[X:@nvidia ↗](https://x.com/nvidia/status/2062228974273716457)
|
|
||||||
|
|
||||||
**33. Satya Nadella 谈微软 Build 大会主旨演讲**
|
|
||||||
|
|
||||||
> Satya Nadella 在 Microsoft Build 主旨演讲,强调共同构建前沿智能生态系统。[X:@satyanadella ↗](https://x.com/satyanadella/status/2062022060176801826)
|
|
||||||
|
|
||||||
**34. Karpathy 的 llm-wiki 项目获超五千星**
|
|
||||||
|
|
||||||
> @karpathy 的 llm-wiki 项目几周内获 5000+ 星,理念是让 LLM 构建并维护可持续进化的维基知识库。[X:@SiliconFlowAI ↗](https://x.com/SiliconFlowAI/status/2062054848762450324)
|
|
||||||
|
|
||||||
## Opinions and Tutorials

**35. A full record of practical agent-engineering tips**

> @mvanhorn shares an agent-engineering methodology: humans set the direction and agents execute, with plan.md at the core to constrain behavior; the post summarizes 22 practical tips and a complete tool stack. [X:@shao__meng ↗](https://x.com/shao__meng/status/2061974983094755575)

**36. Anthropic uses Claude to power self-service data analytics**

> Anthropic uses Claude to automate 95% of business analytics queries at roughly 95% accuracy, relying on an agentic analytics stack to address three main sources of error, including concept-entity ambiguity. [Claude:Blog ↗](https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude)

**37. Direct preference optimization beyond chatbots**

> Dharma-AI published a post on the Hugging Face blog exploring broad applications of direct preference optimization (DPO) beyond chatbots. [Hugging Face:Blog ↗](https://huggingface.co/blog/Dharma-AI/direct-preference-optimization-beyond-chatbots)

**38. Talk: Choosing Your AI Copilot: Maximizing Developer Productivity**

> Sepehr Khosravi discusses the evolution of developer-productivity tools, evaluates the strengths of tools such as Cursor and Claude Code, and offers actionable tips for senior engineers. [InfoQ AI ↗](https://www.infoq.com/presentations/choosing-ai-copilot/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=AI%2C+ML+%26+Data+Engineering)

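## Summary

**Strong signals**

- **Microsoft and OpenAI part ways and begin competing head-on**
  After the partnership ended, Microsoft AI chief Mustafa Suleyman said the company must prove its capabilities independently. This means Microsoft will no longer rely on OpenAI's models and will bet fully on in-house development, while OpenAI loses its largest cloud ally.

- **Anthropic files its prospectus, with a listing expected as early as Q4**
  This marks a safety-focused AI company formally entering the capital markets and competing with OpenAI for investor attention; Claude's 640% year-over-year growth in monthly active users also supports its valuation.

- **ChatGPT surpasses 1 billion monthly active users, becoming the fastest-growing app in history**
  Sensor Tower data shows ChatGPT reached this milestone in May 2025, while Claude has 56 million monthly active users; the gap in user stickiness between the two leading consumer AI apps is widening.

**Medium signals**

- **Miso One releases an 8B open-source voice model with one-shot voice cloning and only 110 ms latency**
  The weights are open and self-hostable, which lowers the barrier to real-time voice cloning from proprietary APIs to personal deployment and may accelerate adoption of voice interaction among developers.

- **The EU announces a comprehensive technology sovereignty plan to promote independent chip and AI development**
  The plan expands domestic semiconductor, AI, and cloud computing supply chains with the goal of reducing dependence on the US and Asia, which will materially affect compliance, market access, and data sovereignty for AI companies worldwide.

**To be verified**

- **DeepSeek plans to raise 50 billion yuan in its first funding round, with Tencent and CATL participating**
  The post-money valuation is as high as 350 to 400 billion yuan, but the report comes from IT之家 and has not been officially confirmed. It is uncertain whether an AI funding round of this size can close smoothly in the domestic market.

- **跨维智能 tops the WorldArena world-model leaderboard**
  WorldArena's authority as a benchmark has not been widely validated, and the "world model" concept itself lacks a unified standard; it remains to be seen whether an independent third party reproduces the claimed capabilities.
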
@@ -1,35 +0,0 @@
{
  "date": "2026-06-04",
  "slug": "ai-2026-06-04",
  "blog_url": "https://blog.ephron.ren/posts/ai-2026-06-04",
  "public_ok": true,
  "errors": [
    "橘鸦AI早报(重试): TimeoutError"
  ],
  "aihot_sections": [
    "模型发布/更新",
    "产品发布/更新",
    "行业动态",
    "论文研究",
    "技巧与观点"
  ],
  "raw_item_count": 39,
  "stage0_count": 39,
  "final_item_count": 38,
  "has_juya": false,
  "source_counts": {
    "AI HOT": 32,
    "InfoQ AI": 2,
    "MIT科技评论AI": 0,
    "量子位": 5,
    "橘鸦AI早报": 0
  },
  "featured_titles": [
    "Grok Imagine 1.5 预览版发布",
    "MiniMax M3 1M token 解码加速 15.6 倍",
    "Miso One 开源语音模型:8B 参数、110ms 延迟、一次语音克隆",
    "Ideogram v4.0 发布:2K 分辨率和 JSON 提示支持",
    "Meta 面向 WhatsApp Business 的 AI 智能体现已全球上线",
    "NousResearch 发布 Hermes Agent 桌面应用公测版"
  ]
}
41
scripts/generate_ops_docs.py
Normal file
@@ -0,0 +1,41 @@
#!/usr/bin/env python3
from __future__ import annotations

import json
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]
PIPELINE = json.loads((ROOT / "config" / "pipeline.json").read_text(encoding="utf-8"))
SOURCES = json.loads((ROOT / "config" / "sources.json").read_text(encoding="utf-8"))
DOC = ROOT / "docs" / "ops-thresholds.generated.md"


def main() -> int:
    quality = PIPELINE.get("quality_gate", {})
    recall = PIPELINE.get("semantic_candidate_recall", {})
    lines = [
        "# AI日报运维阈值(自动生成)",
        "",
        "> 由 `scripts/generate_ops_docs.py` 从 `config/pipeline.json` 和 `config/sources.json` 生成;不要手改本文件。",
        "",
        "## Quality Gate",
        "",
    ]
    for key in sorted(quality):
        lines.append(f"- `{key}`: `{quality[key]}`")
    lines.extend(["", "## Semantic Candidate Recall", ""])
    for key in sorted(recall):
        lines.append(f"- `{key}`: `{recall[key]}`")
    lines.extend(["", "## Sources", "", "| source | required | failure_policy | min_items | retries | timeout_seconds |", "|---|---:|---|---:|---:|---:|"])
    for source in SOURCES:
        lines.append(
            f"| {source['name']} | {source.get('required', False)} | {source.get('failure_policy', '')} | "
            f"{source.get('min_items', 0)} | {source.get('retries', 0)} | {source.get('timeout_seconds', '')} |"
        )
    DOC.write_text("\n".join(lines) + "\n", encoding="utf-8")
    print(DOC)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
1
skill/scripts/.gitkeep
Normal file
@@ -0,0 +1 @@
7
skill/scripts/run_daily_report.py
Normal file
@@ -0,0 +1,7 @@
#!/usr/bin/env python3
from ai_daily_report.cli import main


if __name__ == "__main__":
    raise SystemExit(main())

24
skill/scripts/weekly_audit.py
Normal file
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
from __future__ import annotations

import sys
from pathlib import Path

REPO_DIR = Path(__file__).resolve().parents[2]
if str(REPO_DIR) not in sys.path:
    sys.path.insert(0, str(REPO_DIR))

from ai_daily_report.audit import render_markdown, summarize_reports


def main() -> int:
    out_dir = Path.home() / ".hermes" / "scripts" / "ai_morning_out"
    if not out_dir.exists():
        print("AI日报每周审计:未找到输出目录")
        return 1
    print(render_markdown(summarize_reports(out_dir, limit_days=7)))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
1
tests/fixtures/.gitkeep
vendored
Normal file
@@ -0,0 +1 @@
74
tests/fixtures/history_replay_2026_06_04_2026_06_10.json
vendored
Normal file
@@ -0,0 +1,74 @@
{
  "date_range": ["2026-06-04", "2026-06-10"],
  "purpose": "Historical replay fixtures for semantic candidate recall, Stage 3 merge_groups, and cross-day regression tests.",
  "events": [
    {
      "event_id": "claude-fable-mythos",
      "title": "Claude Fable/Mythos",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {
          "date": "2026-06-04",
          "id": "claude-fable-1",
          "source": "AI HOT",
          "title_raw": "Anthropic 推出 Claude Fable,用长篇叙事测试模型记忆",
          "summary_raw": "Claude Fable 面向长篇故事生成,强调角色一致性和上下文管理。",
          "url": "https://example.com/claude-fable"
        },
        {
          "date": "2026-06-05",
          "id": "claude-mythos-1",
          "source": "InfoQ AI",
          "title_raw": "Claude Mythos/Fable 项目扩展到多角色故事工作流",
          "summary_raw": "报道从创作流程角度补充 Anthropic Fable/Mythos 的应用场景。",
          "url": "https://example.com/claude-mythos"
        }
      ]
    },
    {
      "event_id": "openclaw-suno",
      "title": "OpenClaw/Suno",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-05", "id": "openclaw-suno-1", "source": "AI HOT", "title_raw": "OpenClaw 集成 Suno 音乐生成能力", "summary_raw": "OpenClaw 新版加入 Suno 风格的音乐生成入口。", "url": "https://example.com/openclaw-suno-a"},
        {"date": "2026-06-05", "id": "openclaw-suno-2", "source": "量子位", "title_raw": "Suno 能力进入 OpenClaw,开源智能体开始做音乐", "summary_raw": "量子位从开源智能体生态角度报道 OpenClaw 与 Suno 相关能力。", "url": "https://example.com/openclaw-suno-b"}
      ]
    },
    {
      "event_id": "magenta-realtime-2",
      "title": "Magenta RealTime 2",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-06", "id": "magenta-rt2-1", "source": "AI HOT", "title_raw": "Google 发布 Magenta RealTime 2,主打实时音乐生成", "summary_raw": "Magenta RealTime 2 降低延迟,支持互动式音乐创作。", "url": "https://example.com/magenta-rt2-a"},
        {"date": "2026-06-06", "id": "magenta-rt2-2", "source": "MIT科技评论AI", "title_raw": "Magenta RealTime 2 shows live AI music co-creation", "summary_raw": "MIT Tech Review explains the latency and interaction improvements in Magenta RealTime 2.", "url": "https://example.com/magenta-rt2-b"}
      ]
    },
    {
      "event_id": "open-code-review",
      "title": "Open Code Review",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-07", "id": "open-code-review-1", "source": "AI HOT", "title_raw": "Open Code Review 发布,开源代码审查智能体上线", "summary_raw": "Open Code Review 面向 GitHub/Gitea 仓库自动生成审查意见。", "url": "https://example.com/open-code-review-a"},
        {"date": "2026-06-07", "id": "open-code-review-2", "source": "InfoQ AI", "title_raw": "Open Code Review brings agentic review to open-source repos", "summary_raw": "InfoQ focuses on CI integration and review workflows for Open Code Review.", "url": "https://example.com/open-code-review-b"}
      ]
    },
    {
      "event_id": "openai-chip-talent-move",
      "title": "OpenAI 芯片成员跳槽",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-08", "id": "openai-chip-1", "source": "AI HOT", "title_raw": "OpenAI 定制芯片核心成员跳槽 Anthropic", "summary_raw": "OpenAI 芯片团队关键工程师在量产前离职加入 Anthropic。", "url": "https://example.com/openai-chip-a"},
        {"date": "2026-06-08", "id": "openai-chip-2", "source": "量子位", "title_raw": "OpenAI 芯片核心叛逃 Anthropic,就在量产前夜", "summary_raw": "量子位强调人才流动对 OpenAI 自研芯片进度的潜在影响。", "url": "https://example.com/openai-chip-b"}
      ]
    },
    {
      "event_id": "amap-abot",
      "title": "高德 ABot",
      "expected_behavior": "same_event_merge_or_dedupe",
      "items": [
        {"date": "2026-06-10", "id": "amap-abot-1", "source": "AI HOT", "title_raw": "高德推出 ABot,地图入口接入智能体服务", "summary_raw": "高德 ABot 将出行、搜索和本地生活任务整合到地图智能体。", "url": "https://example.com/amap-abot-a"},
        {"date": "2026-06-10", "id": "amap-abot-2", "source": "橘鸦AI早报", "title_raw": "高德 ABot 上线,本地生活智能体开始进入地图", "summary_raw": "橘鸦从产品入口角度记录高德 ABot 的上线。", "url": "https://example.com/amap-abot-b"}
      ]
    }
  ]
}
42
tests/test_audit.py
Normal file
@@ -0,0 +1,42 @@
import json
import tempfile
import unittest
from pathlib import Path

from ai_daily_report.audit import render_markdown, summarize_reports


class AuditTests(unittest.TestCase):
    def test_summarizes_weekly_metrics(self):
        with tempfile.TemporaryDirectory() as tmp:
            run_dir = Path(tmp) / "2026-06-10"
            run_dir.mkdir()
            (run_dir / "run_report.json").write_text(
                json.dumps(
                    {
                        "quality_gate": {
                            "source_failures": [{"source": "橘鸦AI早报"}],
                            "warnings": ["enabled_source_failed:橘鸦AI早报:error"],
                            "blocking_errors": [],
                        },
                        "stage2_8": {"candidate_group_count": 6},
                        "stage4": {"fallback_count": 2, "output_count": 20},
                        "stage5": {"output_count": 20},
                        "stage8": {"status": "ok", "slug": "ai-2026-06-10"},
                    }
                ),
                encoding="utf-8",
            )

            summary = summarize_reports(Path(tmp), limit_days=7)
            markdown = render_markdown(summary)

            self.assertEqual(summary["run_count"], 1)
            self.assertEqual(summary["totals"]["source_failures"], 1)
            self.assertEqual(summary["totals"]["duplicate_candidates"], 6)
            self.assertEqual(summary["totals"]["fallback_ratio"], 0.1)
            self.assertIn("AI日报每周自动审计报告", markdown)


if __name__ == "__main__":
    unittest.main()
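The assertions in `tests/test_audit.py` imply that the weekly audit sums per-run counters and derives `fallback_ratio` as Stage 4 fallbacks divided by Stage 4 outputs (2 / 20 = 0.1). The `ai_daily_report.audit` module is not part of this excerpt, so what follows is only a minimal sketch of such an aggregation. The function name `summarize_totals` and its internals are assumptions; only the report keys are taken from the test.

```python
from __future__ import annotations


def summarize_totals(reports: list[dict]) -> dict:
    """Hypothetical aggregation over run_report.json payloads (not the project's code)."""
    source_failures = 0
    duplicate_candidates = 0
    fallback_count = 0
    output_count = 0
    for report in reports:
        gate = report.get("quality_gate", {})
        source_failures += len(gate.get("source_failures", []))
        duplicate_candidates += report.get("stage2_8", {}).get("candidate_group_count", 0)
        stage4 = report.get("stage4", {})
        fallback_count += stage4.get("fallback_count", 0)
        output_count += stage4.get("output_count", 0)
    # Guard against division by zero when no Stage 4 output was recorded.
    ratio = round(fallback_count / output_count, 4) if output_count else 0.0
    return {
        "source_failures": source_failures,
        "duplicate_candidates": duplicate_candidates,
        "fallback_ratio": ratio,
    }
```

Computing the ratio from summed counts, rather than averaging per-run ratios, keeps days with few items from dominating the weekly figure; whether the real module does this is not visible here.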
79
tests/test_candidate_recall.py
Normal file
@@ -0,0 +1,79 @@
import unittest

from ai_daily_report.candidate_recall import recall_semantic_candidates
from ai_daily_report.models import NewsItem
from ai_daily_report.normalize import normalize_title


def item(item_id, title, summary):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=normalize_title(title),
        summary_raw=summary,
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
    )


class CandidateRecallTests(unittest.TestCase):
    def test_recalls_shared_event_entities_when_titles_are_not_stage2_similar(self):
        items = [
            item(
                "a",
                "Anthropic 被曝开发 Claude Fable",
                "Anthropic 正在开发名为 Claude Fable 和 Claude Mythos 的新产品。",
            ),
            item(
                "b",
                "Claude Mythos 进入内部测试",
                "Anthropic 的 Claude Mythos 与 Claude Fable 面向内容生成场景。",
            ),
            item(
                "c",
                "Gemini CLI 发布更新",
                "Google 为 Gemini CLI 增加新的开发者命令。",
            ),
        ]

        candidates, report = recall_semantic_candidates(items, existing_candidates=[])

        candidate_sets = [set(candidate["item_ids"]) for candidate in candidates]
        self.assertIn({"a", "b"}, candidate_sets)
        self.assertNotIn({"a", "c"}, candidate_sets)
        self.assertEqual(report["candidate_group_count"], 1)
        self.assertEqual(candidates[0]["reason"], "strong_entity_overlap")

    def test_does_not_group_same_company_different_products_without_event_overlap(self):
        items = [
            item("gemini", "Google 发布 Gemini CLI", "Google 发布面向开发者的 Gemini CLI 工具。"),
            item("gemma", "Google 开源 Gemma 3n", "Google 开源 Gemma 3n 模型,面向端侧部署。"),
        ]

        candidates, report = recall_semantic_candidates(items, existing_candidates=[])

        self.assertEqual(candidates, [])
        self.assertEqual(report["candidate_group_count"], 0)

    def test_preserves_existing_candidates_and_adds_new_ones_without_duplicates(self):
        items = [
            item("a", "Anthropic 发布 Claude Fable", "Claude Fable 与 Claude Mythos 同时曝光。"),
            item("b", "Claude Mythos 新功能曝光", "Claude Mythos 和 Claude Fable 是 Anthropic 新项目。"),
        ]

        candidates, report = recall_semantic_candidates(
            items,
            existing_candidates=[{"item_ids": ["a", "b"], "reason": "title_similarity"}],
        )

        self.assertEqual(len(candidates), 1)
        self.assertEqual(candidates[0]["reason"], "title_similarity")
        self.assertEqual(report["existing_candidate_group_count"], 1)


if __name__ == "__main__":
    unittest.main()
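The tests in `tests/test_candidate_recall.py` describe a recall step that pairs items sharing product-level entities (`Claude Fable`, `Claude Mythos`) but not items that only share a company name (`Google`). The implementation of `recall_semantic_candidates` is not shown in this diff. The sketch below illustrates one way such a `strong_entity_overlap` rule could work; the regex, the definition of a "strong entity" as two or more Latin-script tokens, and the names `strong_entities` and `recall_pairs` are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from itertools import combinations

# Assumption: a "strong entity" is a run of two or more space-separated
# Latin-script tokens, e.g. "Claude Fable" or "Gemini CLI". Single words
# such as "Google" or "Anthropic" never match, so a shared company name
# alone does not create a candidate pair.
ENTITY_RE = re.compile(r"[A-Za-z][A-Za-z0-9.]*(?: [A-Za-z0-9][A-Za-z0-9.]*)+")


def strong_entities(text: str) -> set[str]:
    """Return lower-cased multi-token entities found in one text field."""
    return {match.lower() for match in ENTITY_RE.findall(text)}


def recall_pairs(items: dict[str, list[str]], threshold: int = 1) -> list[tuple[str, str]]:
    """Pair item ids whose text fields share at least `threshold` strong entities.

    `items` maps an item id to its text fields (title, summary). Fields are
    scanned separately so an entity cannot span a title/summary boundary.
    """
    entities = {
        item_id: set().union(*(strong_entities(text) for text in texts))
        for item_id, texts in items.items()
    }
    pairs = []
    for left, right in combinations(sorted(entities), 2):
        if len(entities[left] & entities[right]) >= threshold:
            pairs.append((left, right))
    return pairs
```

A rule like this is deliberately coarse: it trades precision for recall, which fits a stage whose output is later confirmed or rejected by a merge step.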
47
tests/test_cli.py
Normal file
@@ -0,0 +1,47 @@
import unittest
from pathlib import Path
from tempfile import TemporaryDirectory

from ai_daily_report.cli import build_parser, main


class CliTests(unittest.TestCase):
    def test_run_command_parses_date_and_mode(self):
        parser = build_parser()

        args = parser.parse_args(["run", "--date", "2026-06-04", "--mode", "dry-run", "--source-mode", "live", "--llm-mode", "live", "--sources-path", "config/sources.json"])

        self.assertEqual(args.command, "run")
        self.assertEqual(args.date, "2026-06-04")
        self.assertEqual(args.mode, "dry-run")
        self.assertEqual(args.source_mode, "live")
        self.assertEqual(args.llm_mode, "live")
        self.assertEqual(args.sources_path, "config/sources.json")

    def test_main_returns_zero_for_parseable_command(self):
        self.assertEqual(main(["run", "--date", "2026-06-04", "--mode", "dry-run"]), 0)

    def test_main_mock_run_writes_outputs(self):
        with TemporaryDirectory() as temp_dir:
            exit_code = main(
                [
                    "run",
                    "--date",
                    "2026-06-04",
                    "--mode",
                    "dry-run",
                    "--source-mode",
                    "mock",
                    "--llm-mode",
                    "mock",
                    "--out-dir",
                    temp_dir,
                ]
            )

            self.assertEqual(exit_code, 0)
            self.assertTrue((Path(temp_dir) / "2026-06-04" / "blog_markdown.md").exists())


if __name__ == "__main__":
    unittest.main()
85
tests/test_clients.py
Normal file
@@ -0,0 +1,85 @@
import json
import unittest
from email.message import Message
from urllib.error import HTTPError
from unittest.mock import patch

from ai_daily_report.clients import FetchTextError, BlogApiClient, OpenAICompatibleClient, fetch_text


class FakeResponse:
    status = 200

    def __init__(self, body):
        self.body = body

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False

    def read(self):
        return self.body


class ClientTests(unittest.TestCase):
    def test_fetch_text_decodes_response(self):
        with patch("urllib.request.urlopen", return_value=FakeResponse("ok".encode("utf-8"))):
            self.assertEqual(fetch_text("https://example.com", 1), "ok")

    def test_fetch_text_retries_transient_http_errors(self):
        responses = [
            HTTPError("https://example.com", 503, "Service Unavailable", {}, None),
            FakeResponse("ok".encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses) as urlopen:
            self.assertEqual(fetch_text("https://example.com", 1, retries=1, backoff_seconds=0), "ok")

        self.assertEqual(urlopen.call_count, 2)

    def test_fetch_text_does_not_retry_404_and_classifies_error(self):
        with patch(
            "urllib.request.urlopen",
            side_effect=HTTPError("https://example.com", 404, "Not Found", {}, None),
        ) as urlopen:
            with self.assertRaises(FetchTextError) as context:
                fetch_text("https://example.com", 1, retries=2, backoff_seconds=0)

        self.assertEqual(urlopen.call_count, 1)
        self.assertEqual(context.exception.error_type, "http_404")
        self.assertEqual(context.exception.http_status, 404)

    def test_openai_compatible_client_returns_message_content(self):
        body = json.dumps({"choices": [{"message": {"content": "hello"}}]}).encode("utf-8")
        with patch("urllib.request.urlopen", return_value=FakeResponse(body)):
            client = OpenAICompatibleClient(api_key="key", base_url="https://llm.example/v1", model="model")
            self.assertEqual(client.chat("prompt"), "hello")

    def test_blog_api_client_create_and_publish(self):
        responses = [
            FakeResponse(json.dumps({"slug": "ai-2026-06-04"}).encode("utf-8")),
            FakeResponse(json.dumps({"ok": True}).encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses):
            client = BlogApiClient(base_url="https://blog.example", token="token")
            self.assertEqual(client.create_post({"title": "t"})["slug"], "ai-2026-06-04")
            client.publish_post("ai-2026-06-04")

    def test_blog_api_client_slug_lookup_falls_back_to_query_endpoint(self):
        responses = [
            HTTPError("https://blog.example/api/service/posts/ai-2026-06-10", 404, "Not Found", Message(), None),
            FakeResponse(json.dumps({"items": [{"slug": "ai-2026-06-10", "content": "body"}]}).encode("utf-8")),
        ]
        with patch("urllib.request.urlopen", side_effect=responses) as urlopen:
            client = BlogApiClient(base_url="https://blog.example", token="token")
            post = client.get_post_by_slug("ai-2026-06-10")

        self.assertIsNotNone(post)
        assert post is not None
        self.assertEqual(post["slug"], "ai-2026-06-10")
        self.assertEqual(urlopen.call_count, 2)


if __name__ == "__main__":
    unittest.main()
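The `fetch_text` tests in `tests/test_clients.py` pin down a retry policy: a 503 is retried, a 404 fails immediately, and the raised error carries `error_type` and `http_status`. The real `ai_daily_report.clients` code is not included in this diff. The sketch below shows one way to get that behavior with an injectable opener so it runs without network access; `fetch_with_retries`, `FetchError`, and the `TRANSIENT_STATUS` set are hypothetical names and values, not the project's API.

```python
from __future__ import annotations

import time
from typing import Callable
from urllib.error import HTTPError

# Assumption: these status codes count as transient. The list the project
# actually uses is not visible in the diff.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


class FetchError(Exception):
    """Hypothetical stand-in for the project's FetchTextError."""

    def __init__(self, error_type: str, http_status: int | None = None):
        super().__init__(error_type)
        self.error_type = error_type
        self.http_status = http_status


def fetch_with_retries(
    open_url: Callable[[str], bytes],
    url: str,
    retries: int = 0,
    backoff_seconds: float = 0.0,
) -> str:
    """Call `open_url` and retry only on transient HTTP errors."""
    attempt = 0
    while True:
        try:
            return open_url(url).decode("utf-8")
        except HTTPError as exc:
            # Permanent errors (such as 404) and exhausted retries are
            # classified and re-raised instead of being retried.
            if exc.code not in TRANSIENT_STATUS or attempt >= retries:
                raise FetchError(f"http_{exc.code}", exc.code) from exc
            attempt += 1
            time.sleep(backoff_seconds * attempt)  # linear backoff
```

Passing the opener as a parameter is a testing convenience in this sketch; the tests above instead patch `urllib.request.urlopen`, which suggests the real function calls it directly.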
33
tests/test_config_loading.py
Normal file
@@ -0,0 +1,33 @@
import unittest
from pathlib import Path

from ai_daily_report.config import load_source_configs
from ai_daily_report.sources.registry import get_source_fetcher


ROOT = Path(__file__).resolve().parents[1]


class ConfigLoadingTests(unittest.TestCase):
    def test_load_source_configs_from_json(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")

        self.assertGreaterEqual(len(configs), 5)
        self.assertEqual(configs[0].name, "AI HOT")
        self.assertEqual(configs[0].type, "aihot")

    def test_rss_configs_can_set_max_item_age_days(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")
        by_name = {config.name: config for config in configs}

        self.assertEqual(by_name["InfoQ AI"].max_item_age_days, 3)

    def test_all_configured_source_types_are_registered(self):
        configs = load_source_configs(ROOT / "config" / "sources.json")

        for config in configs:
            self.assertTrue(callable(get_source_fetcher(config.type)))


if __name__ == "__main__":
    unittest.main()
33
tests/test_dry_run_config.py
Normal file
@@ -0,0 +1,33 @@
import importlib.util
import unittest
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]
SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"


def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


class DryRunConfigTests(unittest.TestCase):
    def test_dry_run_does_not_require_blog_token(self):
        module = load_pipeline_module()

        self.assertTrue(module.is_dry_run({"AI_DAILY_DRY_RUN": "1"}))
        self.assertFalse(module.requires_blog_token({"AI_DAILY_DRY_RUN": "1"}))

    def test_publish_mode_requires_blog_token(self):
        module = load_pipeline_module()

        self.assertFalse(module.is_dry_run({}))
        self.assertTrue(module.requires_blog_token({}))


if __name__ == "__main__":
    unittest.main()

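The two tests in `tests/test_dry_run_config.py` fix the contract between dry-run mode and the blog token: a token is required exactly when the run is not a dry run. The legacy script `script/ai_daily_blog_pipeline.py` is not shown here, so the following is a hedged sketch of helpers that would satisfy those assertions. Treating only the literal value `"1"` as enabling dry-run is an assumption; the real script may accept other truthy values.

```python
from typing import Mapping


def is_dry_run(env: Mapping[str, str]) -> bool:
    # Assumption: only the literal string "1" enables dry-run mode.
    return env.get("AI_DAILY_DRY_RUN", "") == "1"


def requires_blog_token(env: Mapping[str, str]) -> bool:
    # Publishing needs credentials; a dry run never contacts the blog API.
    return not is_dry_run(env)
```

Both helpers take the environment as a plain mapping instead of reading `os.environ`, which is what lets the tests call them with small dictionaries.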
88
tests/test_env_config.py
Normal file
@@ -0,0 +1,88 @@
import unittest
from pathlib import Path
from tempfile import TemporaryDirectory

from ai_daily_report.env import resolve_blog_token, resolve_llm_config


class EnvConfigTests(unittest.TestCase):
    def test_resolve_llm_config_prefers_generic_values(self):
        config = resolve_llm_config(
            {
                "LLM_API_KEY": "generic-key",
                "LLM_BASE_URL": "https://generic.example/v1",
                "LLM_MODEL": "generic-model",
                "SUB2API_API_KEY": "sub-key",
                "SUB2API_BASE_URL": "https://sub.example/v1",
                "SUB2API_MODEL": "sub-model",
            }
        )

        self.assertEqual(
            config,
            {
                "api_key": "generic-key",
                "base_url": "https://generic.example/v1",
                "model": "generic-model",
            },
        )

    def test_resolve_llm_config_reports_missing_fields(self):
        with TemporaryDirectory() as temp_dir:
            with self.assertRaisesRegex(ValueError, "missing_llm_config: LLM_BASE_URL,LLM_MODEL"):
                resolve_llm_config({"LLM_API_KEY": "key"}, hermes_dir=Path(temp_dir))

    def test_resolve_llm_config_follows_hermes_provider_config(self):
        with TemporaryDirectory() as temp_dir:
            hermes_dir = Path(temp_dir)
            (hermes_dir / "config.yaml").write_text(
                """
model:
  provider: sub2api
  default: findmini/gpt-5.5
  base_url: http://sub2api.example/v1
""".strip(),
                encoding="utf-8",
            )
            (hermes_dir / ".env").write_text("SUB2API_API_KEY=hermes-key\n", encoding="utf-8")

            config = resolve_llm_config({}, hermes_dir=hermes_dir)

            self.assertEqual(
                config,
                {
                    "api_key": "hermes-key",
                    "base_url": "http://sub2api.example/v1",
                    "model": "findmini/gpt-5.5",
                },
            )

    def test_resolve_llm_config_uses_hermes_auth_json_env_source(self):
        with TemporaryDirectory() as temp_dir:
            hermes_dir = Path(temp_dir)
            (hermes_dir / "config.yaml").write_text(
                """
model:
  provider: sub2api
  default: findmini/gpt-5.5
  base_url: http://sub2api.example/v1
""".strip(),
                encoding="utf-8",
            )
            (hermes_dir / "auth.json").write_text(
                '{"credential_pool": {"sub2api": [{"source": "env:SUB2API_API_KEY"}]}}',
                encoding="utf-8",
            )

            config = resolve_llm_config({"SUB2API_API_KEY": "auth-env-key"}, hermes_dir=hermes_dir)

            self.assertEqual(config["api_key"], "auth-env-key")
            self.assertEqual(config["base_url"], "http://sub2api.example/v1")
            self.assertEqual(config["model"], "findmini/gpt-5.5")

    def test_resolve_blog_token_uses_supported_names(self):
        self.assertEqual(resolve_blog_token({"EPHRON_SERVICE_TOKEN": "token"}), "token")


if __name__ == "__main__":
    unittest.main()
39
tests/test_env_loading.py
Normal file
@@ -0,0 +1,39 @@
import importlib.util
import os
import unittest
from pathlib import Path
from unittest.mock import patch


ROOT = Path(__file__).resolve().parents[1]
SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"


def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


class EnvLoadingTests(unittest.TestCase):
    def test_project_env_is_loaded_and_process_env_wins(self):
        module = load_pipeline_module()
        env_text = "LLM_MODEL=file-model\nLLM_BASE_URL=https://file.example/v1\n"

        with patch.object(module.Path, "home", return_value=ROOT / "missing-home"):
            with patch.dict(os.environ, {"LLM_MODEL": "process-model"}, clear=False):
                with patch.object(module, "PROJECT_ENV_PATH", ROOT / ".env.test"):
                    (ROOT / ".env.test").write_text(env_text, encoding="utf-8")
                    try:
                        env = module.load_env()
                    finally:
                        (ROOT / ".env.test").unlink(missing_ok=True)

        self.assertEqual(env["LLM_BASE_URL"], "https://file.example/v1")
        self.assertEqual(env["LLM_MODEL"], "process-model")


if __name__ == "__main__":
    unittest.main()

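`test_project_env_is_loaded_and_process_env_wins` encodes a precedence rule: values from the project `.env` file fill in gaps, and anything already set in the process environment overrides them. The `load_env` function itself is not part of this excerpt. A minimal sketch of that precedence, using the hypothetical helpers `parse_env_text` and `merge_env` and a deliberately simple `KEY=VALUE` parser (no quoting or `export` handling, which the real loader may support), could look like this:

```python
from __future__ import annotations

from typing import Mapping


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def merge_env(file_text: str, process_env: Mapping[str, str]) -> dict[str, str]:
    """Start from file values, then let the process environment override them."""
    merged = parse_env_text(file_text)
    merged.update(process_env)
    return merged
```

Letting the process environment win is the usual convention for `.env` files, since it allows a one-off override on the command line without editing the file.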
17
tests/test_generated_docs.py
Normal file
@@ -0,0 +1,17 @@
import subprocess
import sys
import unittest
from pathlib import Path


class GeneratedDocsTests(unittest.TestCase):
    def test_ops_threshold_doc_is_up_to_date(self):
        root = Path(__file__).resolve().parents[1]
        before = (root / "docs" / "ops-thresholds.generated.md").read_text(encoding="utf-8")
        subprocess.run([sys.executable, "scripts/generate_ops_docs.py"], cwd=root, check=True, capture_output=True, text=True)
        after = (root / "docs" / "ops-thresholds.generated.md").read_text(encoding="utf-8")
        self.assertEqual(after, before)


if __name__ == "__main__":
    unittest.main()
67
tests/test_history_replay_fixtures.py
Normal file
@@ -0,0 +1,67 @@
import json
import unittest
from pathlib import Path

from ai_daily_report.candidate_recall import recall_semantic_candidates
from ai_daily_report.models import NewsItem


FIXTURE_PATH = Path(__file__).parent / "fixtures" / "history_replay_2026_06_04_2026_06_10.json"


def make_item(raw, index):
    return NewsItem(
        id=raw["id"],
        source_group=raw["source"],
        source_label=raw["source"],
        source_role="primary" if raw["source"] == "AI HOT" else "supplement",
        source_priority=10 if raw["source"] == "AI HOT" else 50,
        title_raw=raw["title_raw"],
        title_norm=raw["title_raw"].lower(),
        summary_raw=raw["summary_raw"],
        url=raw["url"],
        canonical_url=raw["url"],
        published_at=raw["date"],
    )


class HistoryReplayFixtureTests(unittest.TestCase):
    def test_fixture_covers_required_incidents(self):
        data = json.loads(FIXTURE_PATH.read_text(encoding="utf-8"))
        event_ids = {event["event_id"] for event in data["events"]}

        self.assertEqual(
            event_ids,
            {
                "claude-fable-mythos",
                "openclaw-suno",
                "magenta-realtime-2",
                "open-code-review",
                "openai-chip-talent-move",
                "amap-abot",
            },
        )

    def test_candidate_recall_finds_fixture_event_pairs(self):
        data = json.loads(FIXTURE_PATH.read_text(encoding="utf-8"))
        misses = []
        for event in data["events"]:
            items = [make_item(item, index) for index, item in enumerate(event["items"])]
            candidates, report = recall_semantic_candidates(
                items,
                config={
                    "enabled": True,
                    "title_similarity_threshold": 0.25,
                    "title_jaccard_threshold": 0.10,
                    "summary_jaccard_threshold": 0.05,
                    "strong_entity_overlap_threshold": 1,
                },
            )
            if not candidates:
                misses.append(event["event_id"])

        self.assertEqual(misses, [])


if __name__ == "__main__":
    unittest.main()
70
tests/test_legacy_script_delegation.py
Normal file
@@ -0,0 +1,70 @@
import importlib.util
import unittest
from pathlib import Path
from unittest.mock import patch


ROOT = Path(__file__).resolve().parents[1]
SCRIPT = ROOT / "script" / "ai_daily_blog_pipeline.py"


def load_pipeline_module():
    spec = importlib.util.spec_from_file_location("ai_daily_blog_pipeline", SCRIPT)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


class LegacyScriptDelegationTests(unittest.TestCase):
    def test_main_delegates_to_new_pipeline_by_default(self):
        module = load_pipeline_module()
        calls = []

        def fake_run_daily_report(**kwargs):
            calls.append(kwargs)
            return {"reports": {"stage8": {"status": "ok"}}}

        with patch.object(module, "load_env", return_value={"AI_DAILY_DRY_RUN": "1"}):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                module.main()

        self.assertEqual(len(calls), 1)
        self.assertEqual(calls[0]["mode"], "dry-run")
        self.assertEqual(calls[0]["source_mode"], "live")
        self.assertEqual(calls[0]["llm_mode"], "live")

    def test_main_allows_mock_modes_for_local_test(self):
        module = load_pipeline_module()
        calls = []

        def fake_run_daily_report(**kwargs):
            calls.append(kwargs)
            return {"reports": {"stage8": {"status": "ok"}}}

        with patch.object(
            module,
            "load_env",
            return_value={"AI_DAILY_DRY_RUN": "1", "AI_DAILY_SOURCE_MODE": "mock", "AI_DAILY_LLM_MODE": "mock"},
        ):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                module.main()

        self.assertEqual(calls[0]["source_mode"], "mock")
        self.assertEqual(calls[0]["llm_mode"], "mock")

    def test_main_exits_nonzero_when_new_pipeline_blocks_publish(self):
        module = load_pipeline_module()

        def fake_run_daily_report(**kwargs):
            return {"reports": {"stage8": {"status": "blocked", "error": "rewrite_fallback_ratio_exceeded"}}}

        with patch.object(module, "load_env", return_value={}):
            with patch("ai_daily_report.runner.run_daily_report", side_effect=fake_run_daily_report):
                with self.assertRaises(SystemExit) as raised:
                    module.main()

        self.assertEqual(raised.exception.code, 2)


if __name__ == "__main__":
    unittest.main()
17
tests/test_llm_utils.py
Normal file
@@ -0,0 +1,17 @@
import unittest

from ai_daily_report.llm import parse_json_object


class LlmUtilsTests(unittest.TestCase):
    def test_parse_json_object_strips_markdown_fence(self):
        self.assertEqual(parse_json_object('```json\n{"ok": true}\n```'), {"ok": True})

    def test_parse_json_object_raises_without_json(self):
        with self.assertRaises(ValueError):
            parse_json_object("not json")


if __name__ == "__main__":
    unittest.main()
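The two tests above pin down only the observable contract of `parse_json_object`: a Markdown-fenced reply parses to a dict, and text with no JSON raises `ValueError`. The real implementation in `ai_daily_report/llm.py` is not shown in this diff, so the following is a minimal sketch of one way to satisfy that contract. The name `parse_json_object_sketch` and the `FENCE` constant are hypothetical.

```python
import json

# Three backticks, built at runtime so this listing contains no literal fence.
FENCE = "`" * 3


def parse_json_object_sketch(text):
    """Hypothetical stand-in for parse_json_object; not the project's code."""
    body = text.strip()
    if body.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional language tag).
        body = body.split("\n", 1)[1] if "\n" in body else ""
        # Drop the closing fence if the reply ends with one.
        if body.rstrip().endswith(FENCE):
            body = body.rstrip()[: -len(FENCE)]
    # Take the outermost {...} span so surrounding chatter is ignored.
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    try:
        value = json.loads(body[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON object: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError("top-level JSON value is not an object")
    return value
```

Raising `ValueError` for every failure mode keeps callers to a single `except` clause, which is the behaviour the second test relies on.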
39
tests/test_markdown_rendering.py
Normal file
@@ -0,0 +1,39 @@
import unittest

from ai_daily_report.assemble import assemble_markdown
from ai_daily_report.models import NewsItem


class MarkdownRenderingTests(unittest.TestCase):
    def test_blog_markdown_strips_double_blockquote_and_reference_markers(self):
        items = [
            NewsItem(
                id="a",
                source_group="AI HOT",
                source_label="OpenAI:Blog",
                source_role="primary",
                source_priority=10,
                title_raw="测试模型发布",
                title_norm="测试模型发布",
                summary_raw="测试摘要",
                title="测试模型发布",
                summary="测试摘要",
                url="https://openai.com/blog/test",
                canonical_url="https://openai.com/blog/test",
                section="模型与能力",
            )
        ]
        guide = {"theme": "> 主线判断:测试主线[1]", "threads": []}

        md, _ = assemble_markdown(items, guide)

        self.assertNotIn("## 导览", md)
        self.assertIn("## 模型与能力", md)
        self.assertIn("[OpenAI:Blog ↗](https://openai.com/blog/test)", md)
        self.assertNotIn("> >", md)
        self.assertNotIn("[1]", md)
        self.assertNotIn("主线判断", md)


if __name__ == "__main__":
    unittest.main()
34
tests/test_observability.py
Normal file
@@ -0,0 +1,34 @@
import json
import unittest

from ai_daily_report.observability import LlmCallObserver, summarize_observed_calls


class ObservabilityTests(unittest.TestCase):
    def test_records_prompt_and_response_hashes(self):
        observer = LlmCallObserver(lambda prompt: json.dumps({"ok": True}), stage="stage3")
        response = observer("prompt")

        self.assertEqual(response, '{"ok": true}')
        self.assertEqual(len(observer.records), 1)
        self.assertEqual(observer.records[0]["stage"], "stage3")
        self.assertEqual(observer.records[0]["prompt_chars"], 6)
        self.assertEqual(observer.records[0]["response_chars"], len(response))
        self.assertRegex(observer.records[0]["prompt_hash"], r"^[0-9a-f]{64}$")
        self.assertRegex(observer.records[0]["response_hash"], r"^[0-9a-f]{64}$")

    def test_summarizes_observed_calls(self):
        left = LlmCallObserver(lambda prompt: "a", stage="stage3")
        right = LlmCallObserver(lambda prompt: "b", stage="stage4")
        left("x")
        right("y")
        right("z")

        report = summarize_observed_calls([left, right])

        self.assertEqual(report["total_calls"], 3)
        self.assertEqual(report["by_stage"], {"stage3": 1, "stage4": 2})


if __name__ == "__main__":
    unittest.main()
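The observability tests only require that each call is recorded with its stage, character counts, and 64-character lowercase hex digests. A wrapper along the following lines would satisfy them. This is a sketch, not the code in `ai_daily_report/observability.py`: the class and function names are hypothetical, and SHA-256 is an assumption inferred from the 64-hex-digit pattern.

```python
import hashlib


def _sha256_hex(text):
    # Assumption: SHA-256 over UTF-8 bytes; the tests only check for 64 hex digits.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CallObserverSketch:
    """Hypothetical wrapper that logs metadata for each LLM call it forwards."""

    def __init__(self, call, stage):
        self._call = call
        self.stage = stage
        self.records = []

    def __call__(self, prompt):
        response = self._call(prompt)
        # Store sizes and hashes rather than raw text, so logs stay small
        # and prompts are not copied into reports.
        self.records.append(
            {
                "stage": self.stage,
                "prompt_chars": len(prompt),
                "response_chars": len(response),
                "prompt_hash": _sha256_hex(prompt),
                "response_hash": _sha256_hex(response),
            }
        )
        return response


def summarize_calls_sketch(observers):
    """Count recorded calls per stage across several observers."""
    by_stage = {}
    for observer in observers:
        by_stage[observer.stage] = by_stage.get(observer.stage, 0) + len(observer.records)
    return {"total_calls": sum(by_stage.values()), "by_stage": by_stage}
```

Because the wrapper is itself callable, it can be passed anywhere a plain `prompt -> response` function is expected.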
33
tests/test_project_structure.py
Normal file
@@ -0,0 +1,33 @@
import unittest
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]


class ProjectStructureTests(unittest.TestCase):
    def test_pipeline_plan_structure_exists(self):
        expected_paths = [
            "ai_daily_report/sources/__init__.py",
            "ai_daily_report/sources/aihot.py",
            "ai_daily_report/sources/rss.py",
            "ai_daily_report/sources/juya.py",
            "ai_daily_report/sources/registry.py",
            "ai_daily_report/llm.py",
            "ai_daily_report/validate.py",
            "ai_daily_report/publish.py",
            "ai_daily_report/cli.py",
            "config/sources.json",
            "config/pipeline.json",
            "tests/fixtures/.gitkeep",
            "skill/scripts/.gitkeep",
            "skill/scripts/run_daily_report.py",
        ]

        missing = [path for path in expected_paths if not (ROOT / path).exists()]

        self.assertEqual(missing, [])


if __name__ == "__main__":
    unittest.main()
78
tests/test_quality_gate.py
Normal file
@@ -0,0 +1,78 @@
import unittest

from ai_daily_report.models import NewsItem, SourceResult
from ai_daily_report.quality_gate import evaluate_quality_gate


def news_item(item_id, title="Story"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=f"{title} {item_id}",
        title_norm=f"{title} {item_id}".lower(),
        summary_raw="summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
    )


class QualityGateTests(unittest.TestCase):
    def test_warns_when_stage3_candidates_zero_for_large_item_set(self):
        items = [news_item(str(index)) for index in range(31)]
        report = evaluate_quality_gate(
            items,
            source_results=[],
            reports={"stage3": {"candidate_group_count": 0}},
            config={"warn_when_stage3_candidates_zero_min_items": 30},
        )

        self.assertIn("stage3_candidates_zero", report["warnings"])
        self.assertEqual(report["blocking_errors"], [])

    def test_warns_on_enabled_source_failure(self):
        report = evaluate_quality_gate(
            [news_item("a")],
            source_results=[
                SourceResult(
                    source="橘鸦AI早报",
                    role="supplement",
                    ok=False,
                    status="error",
                    error="HTTPError: 404",
                )
            ],
            reports={"stage3": {"candidate_group_count": 1}},
            config={"warn_on_enabled_source_failure": True},
        )

        self.assertIn("enabled_source_failed:橘鸦AI早报:error", report["warnings"])
        self.assertEqual(report["source_failures"][0]["source"], "橘鸦AI早报")

    def test_blocks_required_source_failure_when_configured(self):
        report = evaluate_quality_gate(
            [news_item("a")],
            source_results=[
                SourceResult(
                    source="AI HOT",
                    role="primary",
                    ok=False,
                    status="timeout",
                    error="TimeoutError",
                )
            ],
            reports={"stage3": {"candidate_group_count": 1}},
            config={
                "block_on_required_source_failure": True,
                "required_sources": ["AI HOT"],
            },
        )

        self.assertIn("required_source_failed:AI HOT:timeout", report["blocking_errors"])
        self.assertTrue(report["quality_gate_failed"])


if __name__ == "__main__":
    unittest.main()
58
tests/test_rss.py
Normal file
@@ -0,0 +1,58 @@
import unittest

from ai_daily_report.models import SourceConfig
from ai_daily_report.sources.rss import parse_rss_items


class RssSourceTests(unittest.TestCase):
    def test_parse_rss_items_filters_entries_older_than_configured_age(self):
        config = SourceConfig(
            name="InfoQ AI",
            type="rss",
            url="https://feed.example/rss",
            max_item_age_days=3,
        )
        xml = """<?xml version="1.0"?>
<rss><channel>
<item>
<title>Fresh item</title>
<link>https://example.com/fresh</link>
<description>Fresh summary</description>
<pubDate>Sun, 07 Jun 2026 06:25:00 GMT</pubDate>
</item>
<item>
<title>Old item</title>
<link>https://example.com/old</link>
<description>Old summary</description>
<pubDate>Mon, 01 Jun 2026 06:25:00 GMT</pubDate>
</item>
</channel></rss>"""

        items = parse_rss_items(config, xml, run_date="2026-06-08")

        self.assertEqual([item["title_raw"] for item in items], ["Fresh item"])

    def test_parse_rss_items_keeps_unparseable_dates_to_avoid_false_drops(self):
        config = SourceConfig(
            name="InfoQ AI",
            type="rss",
            url="https://feed.example/rss",
            max_item_age_days=3,
        )
        xml = """<?xml version="1.0"?>
<rss><channel>
<item>
<title>No date item</title>
<link>https://example.com/no-date</link>
<description>No date summary</description>
<pubDate>not a date</pubDate>
</item>
</channel></rss>"""

        items = parse_rss_items(config, xml, run_date="2026-06-08")

        self.assertEqual([item["title_raw"] for item in items], ["No date item"])


if __name__ == "__main__":
    unittest.main()
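The RSS tests describe an age filter with a deliberate bias: entries older than `max_item_age_days` are dropped, but entries whose `pubDate` cannot be parsed are kept so that a malformed feed does not silently lose items. The filter inside `parse_rss_items` is not shown in this diff. The sketch below illustrates the rule with a hypothetical helper, `is_fresh_sketch`, and assumes the age is measured in whole calendar days between the publication date and `run_date`.

```python
from datetime import date
from email.utils import parsedate_to_datetime


def is_fresh_sketch(pub_date_text, run_date, max_age_days):
    """Return True if an entry should be kept (hypothetical helper)."""
    try:
        published = parsedate_to_datetime(pub_date_text)
    except (TypeError, ValueError, IndexError):
        # Unparseable date: keep the entry rather than risk a false drop.
        return True
    # Assumption: compare calendar dates only and ignore the time of day.
    age_days = (date.fromisoformat(run_date) - published.date()).days
    return age_days <= max_age_days
```

With `run_date="2026-06-08"` and a three-day limit, the 7 June entry from the test is one day old and kept, while the 1 June entry is seven days old and dropped.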
283
tests/test_runner.py
Normal file
@@ -0,0 +1,283 @@
import unittest
import json
from pathlib import Path
from tempfile import TemporaryDirectory

from ai_daily_report.publish import load_published_urls
from ai_daily_report.runner import run_daily_report


class RunnerTests(unittest.TestCase):
    def test_run_daily_report_mock_mode_writes_markdown_and_reports(self):
        with TemporaryDirectory() as temp_dir:
            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="mock",
                llm_mode="mock",
                out_dir=Path(temp_dir),
                base_url="https://blog.example",
            )

            run_dir = Path(result["run_dir"])
            self.assertTrue((run_dir / "blog_markdown.md").exists())
            self.assertTrue((run_dir / "run_report.json").exists())
            for filename in [
                "stage0_sources.json",
                "stage1_items.json",
                "stage2_items.json",
                "stage2_5_items.json",
                "stage2_8_candidates.json",
                "stage3_items.json",
                "stage4_items.json",
                "quality_gate.json",
            ]:
                self.assertTrue((run_dir / filename).exists(), filename)
            self.assertEqual(result["reports"]["stage8"]["status"], "ok")

    def test_run_daily_report_passes_pipeline_config_to_stage_functions(self):
        class FakeLlmClient:
            def chat(self, prompt):
                payload = json.loads(prompt)
                if "candidates" in payload:
                    first_candidate = payload["candidates"][0]["item_ids"]
                    return json.dumps(
                        {
                            "duplicate_groups": [
                                {
                                    "keep_id": first_candidate[0],
                                    "remove_ids": [first_candidate[1]],
                                    "confidence": "high",
                                    "reason": "same event",
                                }
                            ],
                            "not_duplicates": [],
                            "uncertain": [],
                        }
                    )
                if "allowed_sections" in payload:
                    return json.dumps(
                        {
                            "rewrites": [
                                {
                                    "id": item["id"],
                                    "title": item["title_raw"],
                                    "summary": item["summary_raw"],
                                    "flags": [],
                                }
                                for item in payload["items"]
                            ]
                        }
                    )
                return json.dumps(
                    {
                        "intro": "Daily intro.",
                        "theme": "Pipeline config.",
                        "threads": [
                            {
                                "title": "Config thread",
                                "text": "Config values reached the pipeline.",
                                "item_ids": [payload["items"][0]["id"]],
                                "kind": "thread",
                            }
                        ],
                        "conclusion": "Done.",
                    }
                )

        with TemporaryDirectory() as temp_dir:
            temp_path = Path(temp_dir)
            pipeline_config = temp_path / "pipeline.json"
            pipeline_config.write_text(
                json.dumps(
                    {
                        "semantic_dedup_max_deletion_ratio": 0.1,
                        "rewrite_batch_size": 1,
                        "cross_day_dedup": {"enabled": False},
                    }
                ),
                encoding="utf-8",
            )
            source_config = temp_path / "sources.json"
            source_config.write_text(
                json.dumps(
                    [
                        {
                            "name": "AI HOT",
                            "type": "rss",
                            "url": "https://feed.example/rss",
                            "role": "primary",
                            "priority": 10,
                            "enabled": True,
                        }
                    ]
                ),
                encoding="utf-8",
            )

            def fetch_text(url, timeout):
                return """<?xml version="1.0"?><rss><channel>
<item><title>Anthropic launches Claude Code</title><link>https://example.com/a</link><description>Anthropic launches Claude Code for developers.</description></item>
<item><title>Anthropic launch Claude Code</title><link>https://example.com/b</link><description>Anthropic launch Claude Code for coding.</description></item>
<item><title>Gemini CLI update</title><link>https://example.com/c</link><description>Google updates Gemini CLI.</description></item>
</channel></rss>"""

            result = run_daily_report(
                run_date="2026-06-10",
                mode="dry-run",
                source_mode="live",
                llm_mode="live",
                out_dir=temp_path / "out",
                base_url="https://blog.example",
                sources_path=source_config,
                pipeline_path=pipeline_config,
                fetch_text=fetch_text,
                env={
                    "LLM_API_KEY": "test-key",
                    "LLM_BASE_URL": "https://llm.example/v1",
                    "LLM_MODEL": "test-model",
                },
                llm_client_factory=lambda **config: FakeLlmClient(),
            )

            self.assertTrue(result["reports"]["stage3"]["skipped_for_deletion_ratio"])
            self.assertEqual(result["reports"]["stage4"]["batch_count"], 3)
            self.assertIn("quality_gate", result["reports"])

    def test_run_daily_report_live_sources_can_use_config_and_fetch_text(self):
        with TemporaryDirectory() as temp_dir:
            out_dir = Path(temp_dir) / "out"
            source_config = Path(temp_dir) / "sources.json"
            source_config.write_text(
                json.dumps(
                    [
                        {
                            "name": "InfoQ AI",
                            "type": "rss",
                            "url": "https://feed.example/rss",
                            "role": "supplement",
                            "priority": 40,
                            "enabled": True,
                        }
                    ]
                ),
                encoding="utf-8",
            )

            def fetch_text(url, timeout):
                return """<?xml version="1.0"?><rss><channel><item><title>GPT-5 API 发布</title><link>https://example.com/gpt5</link><description>OpenAI 发布 GPT-5 API。</description></item></channel></rss>"""

            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="live",
                llm_mode="mock",
                out_dir=out_dir,
                base_url="https://blog.example",
                sources_path=source_config,
                fetch_text=fetch_text,
            )

            self.assertEqual(result["reports"]["stage0"]["raw_item_count"], 1)
            self.assertTrue((out_dir / "2026-06-04" / "blog_markdown.md").exists())

    def test_run_daily_report_live_llm_uses_env_config_in_dry_run(self):
        class FakeLlmClient:
            def __init__(self):
                self.prompts = []

            def chat(self, prompt):
                self.prompts.append(prompt)
                if "duplicate_groups" in prompt:
                    return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})
                if "rewrites" in prompt:
                    payload = json.loads(prompt)
                    return json.dumps(
                        {
                            "rewrites": [
                                {
                                    "id": item["id"],
                                    "title": item["title_raw"],
                                    "summary": item["summary_raw"],
                                    "flags": [],
                                }
                                for item in payload["items"]
                            ]
                        }
                    )
                return json.dumps(
                    {
                        "theme": "模型能力继续进入产品入口。",
                        "threads": [
                            {
                                "title": "模型 API 更新",
                                "text": "GPT-5 API 发布,说明模型能力继续进入产品入口。",
                                "item_ids": [json.loads(prompt)["items"][0]["id"]],
                                "kind": "thread",
                            }
                        ],
                    }
                )

        fake_client = FakeLlmClient()
        captured_config = {}

        def llm_client_factory(**config):
            captured_config.update(config)
            return fake_client

        with TemporaryDirectory() as temp_dir:
            result = run_daily_report(
                run_date="2026-06-04",
                mode="dry-run",
                source_mode="mock",
                llm_mode="live",
                out_dir=Path(temp_dir),
                base_url="https://blog.example",
                env={
                    "LLM_API_KEY": "test-key",
                    "LLM_BASE_URL": "https://llm.example/v1",
                    "LLM_MODEL": "test-model",
                },
                llm_client_factory=llm_client_factory,
            )

            self.assertEqual(captured_config["api_key"], "test-key")
            self.assertEqual(captured_config["base_url"], "https://llm.example/v1")
            self.assertEqual(captured_config["model"], "test-model")
            self.assertGreaterEqual(len(fake_client.prompts), 2)
            self.assertEqual(result["reports"]["stage8"]["status"], "ok")

    def test_run_daily_report_publish_updates_published_url_history(self):
        class FakeBlogClient:
            def __init__(self, **kwargs):
                self.kwargs = kwargs

            def create_post(self, payload):
                return {"slug": payload["slug"]}

            def publish_post(self, slug):
                self.slug = slug

        with TemporaryDirectory() as temp_dir:
            history_path = Path(temp_dir) / "published_urls.json"
            result = run_daily_report(
                run_date="2026-06-08",
                mode="publish",
                source_mode="mock",
                llm_mode="mock",
                out_dir=Path(temp_dir) / "out",
                base_url="https://blog.example",
                env={"BLOG_SERVICE_TOKEN": "token"},
                blog_client_factory=FakeBlogClient,
                history_path=history_path,
            )
            history = load_published_urls(history_path)

            self.assertEqual(result["reports"]["stage8"]["status"], "ok")
            self.assertIn("https://example.com/gpt5", history.urls)
            self.assertEqual(history.urls["https://example.com/gpt5"].last_published, "2026-06-08")


if __name__ == "__main__":
    unittest.main()
55
tests/test_source_labels.py
Normal file
@@ -0,0 +1,55 @@
import unittest

from ai_daily_report.models import SourceConfig
from ai_daily_report.sources.juya import parse_juya_rss
from ai_daily_report.sources.labels import source_label_from_url


class SourceLabelTests(unittest.TestCase):
    def test_source_label_from_x_url_includes_handle(self):
        self.assertEqual(
            source_label_from_url("https://x.com/MiniMax_AI/status/123", fallback="橘鸦AI早报"),
            "X:MiniMax (@MiniMax_AI)",
        )

    def test_source_label_from_blog_url_marks_blog(self):
        self.assertEqual(
            source_label_from_url("https://openai.com/blog/example", fallback="橘鸦AI早报"),
            "OpenAI:Blog",
        )

    def test_source_label_from_known_non_blog_domains(self):
        self.assertEqual(
            source_label_from_url("https://mp.weixin.qq.com/s/example", fallback="橘鸦AI早报"),
            "微信公众号",
        )
        self.assertEqual(
            source_label_from_url("https://platform.minimaxi.com/docs/token-plan/migration", fallback="橘鸦AI早报"),
            "MiniMax:Docs",
        )

    def test_parse_juya_rss_uses_item_url_as_source_label(self):
        config = SourceConfig(name="橘鸦AI早报", type="juya_rss", url="https://juya.example/rss")
        xml = """<?xml version="1.0"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>2026-06-04</title>
<content:encoded><![CDATA[
<h2><a href="https://x.com/MiniMax_AI/status/123">MiniMax M3 加速</a> <code>#1</code></h2>
<p>MiniMax M3 加速。</p>
<p><a href="https://x.com/MiniMax_AI/status/123">来源</a></p>
<hr/>
]]></content:encoded>
</item>
</channel>
</rss>"""

        items = parse_juya_rss(config, xml, "2026-06-04")

        self.assertEqual(items[0]["source_label"], "X:MiniMax (@MiniMax_AI)")
        self.assertNotEqual(items[0]["source_label"], "橘鸦AI早报")


if __name__ == "__main__":
    unittest.main()
62
tests/test_stage0_collect.py
Normal file
@@ -0,0 +1,62 @@
import unittest

from ai_daily_report.clients import FetchTextError
from ai_daily_report.collect import collect_sources
from ai_daily_report.models import SourceConfig


class Stage0CollectTests(unittest.TestCase):
    def test_collect_sources_returns_structured_results_for_each_source(self):
        configs = [
            SourceConfig(name="Primary", type="fake", role="primary", priority=10),
            SourceConfig(name="Supplement", type="fake", role="supplement", priority=20),
        ]

        def fetcher(config, run_date):
            return [{"title_raw": f"{config.name} item", "url": f"https://example.com/{config.name}"}]

        results, report = collect_sources(configs, "2026-06-04", fetcher=fetcher)

        self.assertEqual([r.source for r in results], ["Primary", "Supplement"])
        self.assertTrue(all(r.ok for r in results))
        self.assertEqual(sum(len(r.items) for r in results), 2)
        self.assertEqual(report["input_source_count"], 2)
        self.assertEqual(report["ok_source_count"], 2)
        self.assertEqual(report["raw_item_count"], 2)

    def test_collect_sources_records_failed_source_without_blocking_others(self):
        configs = [
            SourceConfig(name="Broken", type="fake", role="supplement", priority=20),
            SourceConfig(name="Healthy", type="fake", role="supplement", priority=30),
        ]

        def fetcher(config, run_date):
            if config.name == "Broken":
                raise TimeoutError("timed out")
            return [{"title_raw": "healthy item", "url": "https://example.com/healthy"}]

        results, report = collect_sources(configs, "2026-06-04", fetcher=fetcher)

        by_source = {r.source: r for r in results}
        self.assertFalse(by_source["Broken"].ok)
        self.assertEqual(by_source["Broken"].status, "timeout")
        self.assertIn("TimeoutError", by_source["Broken"].error)
        self.assertTrue(by_source["Healthy"].ok)
        self.assertEqual(report["failed_source_count"], 1)
        self.assertEqual(report["raw_item_count"], 1)

    def test_collect_sources_records_fetch_text_error_metadata(self):
        configs = [SourceConfig(name="RSS", type="rss", retries=2)]

        def fetcher(config, run_date):
            raise FetchTextError("http_404", "HTTPError: 404", http_status=404, attempts=1)

        results, report = collect_sources(configs, "2026-06-10", fetcher=fetcher)

        self.assertEqual(results[0].status, "http_404")
        self.assertEqual(results[0].retry_count, 0)
        self.assertIn("http_404", report["error_types"]["RSS"])


if __name__ == "__main__":
    unittest.main()
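The stage 0 tests exercise one property in particular: a source that raises must be recorded as a failed result without preventing the remaining sources from being collected. The real `collect_sources` is not shown in this diff and returns `SourceResult` objects. The sketch below uses plain dicts and a hypothetical function name to illustrate the per-source `try`/`except` pattern only.

```python
def collect_sources_sketch(configs, run_date, fetcher):
    """Fetch every source, turning exceptions into failed results (illustrative)."""
    results = []
    for config in configs:
        result = {"source": config["name"], "ok": True, "status": "ok", "items": [], "error": None}
        try:
            result["items"] = list(fetcher(config, run_date))
        except TimeoutError as exc:
            # Timeouts get their own status so reports can tell them from other errors.
            result.update(ok=False, status="timeout", error=f"{type(exc).__name__}: {exc}")
        except Exception as exc:
            # Any other failure is recorded and the loop moves on to the next source.
            result.update(ok=False, status="error", error=f"{type(exc).__name__}: {exc}")
        results.append(result)

    report = {
        "input_source_count": len(configs),
        "ok_source_count": sum(1 for r in results if r["ok"]),
        "failed_source_count": sum(1 for r in results if not r["ok"]),
        "raw_item_count": sum(len(r["items"]) for r in results),
    }
    return results, report
```

Catching broadly inside the loop is a deliberate trade-off here: one misbehaving feed degrades the report instead of aborting the whole run, and a later quality gate can still decide whether the failure is blocking.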
32
tests/test_stage0_to_2_pipeline.py
Normal file
@@ -0,0 +1,32 @@
import unittest

from ai_daily_report.pipeline import run_stage0_to_stage2


class Stage0To2PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage2_returns_deduped_items_and_reports(self):
        configs = [
            {"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10},
            {"name": "RSS", "type": "fake", "role": "supplement", "priority": 50},
        ]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": f"{config.name} summary",
                    "url": "https://openai.com/blog/gpt-5?utm_source=test",
                    "source_label": config.name,
                }
            ]

        result = run_stage0_to_stage2(configs, "2026-06-04", fetcher=fetcher)

        self.assertEqual(len(result["items"]), 1)
        self.assertEqual(result["reports"]["stage0"]["raw_item_count"], 2)
        self.assertEqual(result["reports"]["stage1"]["output_count"], 2)
        self.assertEqual(result["reports"]["stage2"]["removed_count"], 1)


if __name__ == "__main__":
    unittest.main()
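In the test above, two sources return the same article URL carrying a `utm_source` tracking parameter, and stage 2 is expected to collapse them into one item. The project's normalization and dedup code is not part of this diff. The following sketch shows one plausible approach, with hypothetical function names and the assumption that only `utm_*` query parameters and the fragment are stripped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_url_sketch(url):
    """Strip utm_* tracking parameters and the fragment (illustrative only)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


def dedupe_by_url_sketch(items):
    """Keep the first item per canonical URL, preserving input order."""
    seen = set()
    kept = []
    for item in items:
        key = canonicalize_url_sketch(item["url"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(item)
    return kept
```

Keeping the first occurrence means input order acts as the tie-breaker, so a caller that wants the primary source to win would sort by priority before deduplicating.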
268
tests/test_stage0_to_4_pipeline.py
Normal file
@@ -0,0 +1,268 @@
import json
import unittest

from ai_daily_report.pipeline import run_stage0_to_stage4
from ai_daily_report.models import PublishedUrlEntry, PublishedUrls


class Stage0To4PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage4_passes_semantic_and_rewrite_config(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen = {}

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic launches Claude Code",
                    "summary_raw": "Anthropic launches Claude Code for developers.",
                    "url": "https://example.com/a",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Anthropic launch Claude Code",
                    "summary_raw": "Anthropic launch Claude Code for coding.",
                    "url": "https://example.com/b",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Gemini CLI update",
                    "summary_raw": "Google updates Gemini CLI.",
                    "url": "https://example.com/c",
                    "source_label": config.name,
                },
            ]

        def semantic_llm_call(prompt):
            payload = json.loads(prompt)
            seen["semantic_prompt"] = payload
            first_candidate = payload["candidates"][0]["item_ids"]
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": first_candidate[0],
                            "remove_ids": [first_candidate[1]],
                            "confidence": "high",
                            "reason": "same event",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            seen.setdefault("rewrite_batches", []).append(len(payload["items"]))
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": item["id"],
                            "title": item["title_raw"],
                            "summary": item["summary_raw"],
                            "flags": [],
                        }
                        for item in payload["items"]
                    ]
                }
            )

        result = run_stage0_to_stage4(
            configs,
            "2026-06-10",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            semantic_dedup_max_deletion_ratio=0.1,
            rewrite_batch_size=1,
        )

        self.assertTrue(result["reports"]["stage3"]["skipped_for_deletion_ratio"])
        self.assertEqual(seen["rewrite_batches"], [1, 1, 1])

    def test_run_stage0_to_stage4_semantic_dedupes_and_rewrites(self):
        configs = [
            {"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10},
            {"name": "RSS", "type": "fake", "role": "supplement", "priority": 50},
        ]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": f"{config.name} Anthropic IPO",
                    "summary_raw": f"{config.name} reports Anthropic IPO filing.",
                    "url": f"https://example.com/{config.name}",
                    "source_label": config.name,
                }
            ]

        def semantic_llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": "Anthropic 提交 IPO 文件",
                            "summary": "Anthropic 被报道提交 IPO 文件。",
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage4(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )

        self.assertEqual(len(result["items"]), 2)
        self.assertEqual(result["items"][0].title, "Anthropic 提交 IPO 文件")
        self.assertIn("stage3", result["reports"])
        self.assertIn("stage4", result["reports"])
        self.assertEqual(result["reports"]["stage4"]["rewritten_count"], 2)

    def test_run_stage0_to_stage4_filters_published_urls_before_semantic_dedupe(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen_semantic_payloads = []
        seen_rewrite_payloads = []

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Already published",
                    "summary_raw": "Old summary",
                    "url": "https://example.com/already",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Fresh story",
                    "summary_raw": "Fresh summary",
                    "url": "https://example.com/fresh",
                    "source_label": config.name,
                },
            ]

        def semantic_llm_call(prompt):
            seen_semantic_payloads.append(json.loads(prompt))
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            seen_rewrite_payloads.append(payload)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                }
            )

        published_urls = PublishedUrls(
            urls={
                "https://example.com/already": PublishedUrlEntry(
                    first_seen="2026-06-07",
                    last_published="2026-06-07",
                    titles=["Already published"],
                )
            }
        )

        result = run_stage0_to_stage4(
            configs,
            "2026-06-08",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            published_urls=published_urls,
        )

        self.assertEqual([entry.title_raw for entry in result["items"]], ["Fresh story"])
        self.assertEqual(result["reports"]["stage2_5"]["removed_count"], 1)
        self.assertEqual([entry["title_raw"] for entry in seen_rewrite_payloads[0]["items"]], ["Fresh story"])

    def test_run_stage0_to_stage4_uses_stage2_8_recalled_candidates(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]
        seen = {}

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic 被曝开发 Claude Fable",
                    "summary_raw": "Anthropic 正在开发名为 Claude Fable 和 Claude Mythos 的新产品。",
                    "url": "https://example.com/fable",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Claude Mythos 进入内部测试",
                    "summary_raw": "Anthropic 的 Claude Mythos 与 Claude Fable 面向内容生成场景。",
                    "url": "https://example.com/mythos",
                    "source_label": config.name,
                },
                {
                    "title_raw": "Google 开源 Gemma 3n",
                    "summary_raw": "Google 开源 Gemma 3n 模型,面向端侧部署。",
                    "url": "https://example.com/gemma",
                    "source_label": config.name,
                },
            ]

        def semantic_llm_call(prompt):
            payload = json.loads(prompt)
            seen["candidate_count"] = len(payload["candidates"])
            seen["candidate_reasons"] = [candidate["reason"] for candidate in payload["candidates"]]
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage4(
            configs,
            "2026-06-10",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )

        self.assertEqual(seen["candidate_count"], 1)
        self.assertIn("strong_entity_overlap", seen["candidate_reasons"])
        self.assertEqual(result["reports"]["stage2_8"]["added_candidate_group_count"], 1)


if __name__ == "__main__":
    unittest.main()
63
tests/test_stage0_to_5_pipeline.py
Normal file
@@ -0,0 +1,63 @@
import json
import unittest

from ai_daily_report.pipeline import run_stage0_to_stage5


class Stage0To5PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage5_classifies_and_orders_items(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "Anthropic 提交 IPO 文件",
                    "summary_raw": "Anthropic 被报道提交 IPO 文件。",
                    "url": "https://example.com/ipo",
                    "source_label": config.name,
                },
                {
                    "title_raw": "GPT-5 API 发布,延迟降低 30%",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": config.name,
                    "section_hint": "模型发布/更新",
                },
            ]

        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "section": "模型与能力" if "GPT-5" in entry["title_raw"] else "公司与资本",
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage5(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
        )

        self.assertEqual([item.section for item in result["items"]], ["模型与能力", "公司与资本"])
        self.assertEqual(result["reports"]["stage5"]["section_counts"]["模型与能力"], 1)
        self.assertEqual(result["reports"]["stage5"]["section_counts"]["公司与资本"], 1)


if __name__ == "__main__":
    unittest.main()
75
tests/test_stage0_to_6_pipeline.py
Normal file
@@ -0,0 +1,75 @@
import json
import unittest

from ai_daily_report.pipeline import run_stage0_to_stage6


class Stage0To6PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage6_generates_guide(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": config.name,
                    "section_hint": "模型发布/更新",
                }
            ]

        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力更新",
                            "text": "GPT-5 API 发布,体现模型能力继续产品化。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage6(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
        )

        self.assertEqual(result["guide"]["theme"], "模型 API 能力继续更新。")
        self.assertEqual(len(result["guide"]["threads"]), 1)
        self.assertTrue(result["reports"]["stage6"]["theme_present"])


if __name__ == "__main__":
    unittest.main()
76
tests/test_stage0_to_7_pipeline.py
Normal file
@@ -0,0 +1,76 @@
import json
import unittest

from ai_daily_report.pipeline import run_stage0_to_stage7


class Stage0To7PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage7_assembles_markdown(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": "OpenAI:Blog",
                    "section_hint": "模型发布/更新",
                }
            ]

        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力产品化",
                            "text": "GPT-5 API 发布,说明模型能力继续进入产品入口。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage7(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
        )

        self.assertNotIn("## 导览", result["markdown"])
        self.assertIn("## 模型与能力", result["markdown"])
        self.assertIn("## 今日脉络", result["markdown"])
        self.assertEqual(result["reports"]["stage7"]["blocking_errors"], [])


if __name__ == "__main__":
    unittest.main()
139
tests/test_stage0_to_8_pipeline.py
Normal file
@@ -0,0 +1,139 @@
import json
import unittest
from urllib.error import HTTPError

from ai_daily_report.pipeline import run_stage0_to_stage8


class Stage0To8PipelineTests(unittest.TestCase):
    def test_run_stage0_to_stage8_dry_run_publishes_report(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": "GPT-5 API 发布",
                    "summary_raw": "OpenAI 发布 GPT-5 API。",
                    "url": "https://example.com/gpt5",
                    "source_label": "OpenAI:Blog",
                    "section_hint": "模型发布/更新",
                }
            ]

        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                },
                ensure_ascii=False,
            )

        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            item_id = payload["items"][0]["id"]
            return json.dumps(
                {
                    "theme": "模型 API 能力继续更新。",
                    "threads": [
                        {
                            "title": "模型能力产品化",
                            "text": "GPT-5 API 发布,说明模型能力继续进入产品入口。",
                            "item_ids": [item_id],
                            "kind": "thread",
                        }
                    ],
                },
                ensure_ascii=False,
            )

        result = run_stage0_to_stage8(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
            mode="dry-run",
            base_url="https://blog.example",
            client=None,
        )

        self.assertEqual(result["publish"].status, "ok")
        self.assertEqual(result["publish"].blog_url, "https://blog.example/posts/ai-2026-06-04")
        self.assertIn("stage8", result["reports"])
        self.assertEqual(result["reports"]["stage8"]["status"], "ok")

    def test_run_stage0_to_stage8_blocks_publish_when_rewrite_quality_gate_fails(self):
        configs = [{"name": "AI HOT", "type": "fake", "role": "primary", "priority": 10}]

        def fetcher(config, run_date):
            return [
                {
                    "title_raw": f"News {index}",
                    "summary_raw": f"Summary {index}",
                    "url": f"https://example.com/{index}",
                    "source_label": "Example",
                    "section_hint": "模型发布/更新",
                }
                for index in range(6)
            ]

        def semantic_llm_call(prompt):
            return json.dumps({"duplicate_groups": [], "not_duplicates": [], "uncertain": []})

        def rewrite_llm_call(prompt):
            raise HTTPError(
                url="https://llm.example/v1/chat/completions",
                code=503,
                msg="Service Unavailable",
                hdrs=None,
                fp=None,
            )

        def guide_llm_call(prompt):
            payload = json.loads(prompt)
            return json.dumps(
                {
                    "theme": "模型能力继续更新。",
                    "threads": [
                        {
                            "title": "模型更新",
                            "text": "多条模型新闻更新。",
                            "item_ids": [payload["items"][0]["id"]],
                            "kind": "thread",
                        }
                    ],
                }
            )

        result = run_stage0_to_stage8(
            configs,
            "2026-06-04",
            fetcher=fetcher,
            semantic_llm_call=semantic_llm_call,
            rewrite_llm_call=rewrite_llm_call,
            guide_llm_call=guide_llm_call,
            mode="publish",
            base_url="https://blog.example",
            client=None,
        )

        self.assertEqual(result["publish"].status, "blocked")
        self.assertIn("rewrite_fallback_ratio_exceeded", result["reports"]["stage7"]["blocking_errors"])
        self.assertIn("rewrite_fallback_ratio_exceeded", result["reports"]["stage8"]["error"])


if __name__ == "__main__":
    unittest.main()
85
tests/test_stage1_normalize.py
Normal file
@@ -0,0 +1,85 @@
import unittest

from ai_daily_report.models import SourceResult
from ai_daily_report.normalize import canonicalize_url, normalize_items, normalize_title


class Stage1NormalizeTests(unittest.TestCase):
    def test_canonicalize_url_removes_tracking_and_normalizes_x_host(self):
        url = "HTTPS://Twitter.com/OpenAI/status/123/?utm_source=newsletter&fbclid=abc#fragment"

        self.assertEqual(canonicalize_url(url), "https://x.com/OpenAI/status/123")

    def test_normalize_items_builds_news_items_with_ids_and_norms(self):
        source_result = SourceResult(
            source="AI HOT",
            role="primary",
            ok=True,
            status="ok",
            items=[
                {
                    "title_raw": " GPT-5 发布:速度提升 2x! ",
                    "summary_raw": " <p>OpenAI 发布更新。</p> ",
                    "url": "https://openai.com/blog/gpt-5?utm_campaign=test",
                    "source_label": "OpenAI:Blog",
                    "section_hint": "模型发布/更新",
                }
            ],
        )

        items, report = normalize_items([source_result], run_date="2026-06-04")

        self.assertEqual(len(items), 1)
        self.assertTrue(items[0].id.startswith("item_"))
        self.assertEqual(items[0].canonical_url, "https://openai.com/blog/gpt-5")
        self.assertEqual(items[0].title_norm, normalize_title("GPT-5 发布:速度提升 2x!"))
        self.assertEqual(items[0].summary_raw, "OpenAI 发布更新。")
        self.assertEqual(items[0].source_role, "primary")
        self.assertEqual(report["input_count"], 1)
        self.assertEqual(report["output_count"], 1)

    def test_normalize_items_marks_quality_flags_without_dropping_item(self):
        source_result = SourceResult(
            source="RSS",
            role="supplement",
            ok=True,
            status="ok",
            items=[{"title_raw": "短", "summary_raw": "", "url": ""}],
        )

        items, report = normalize_items([source_result], run_date="2026-06-04")

        self.assertEqual(len(items), 1)
        self.assertIn("missing_url", items[0].quality_flags)
        self.assertIn("missing_summary", items[0].quality_flags)
        self.assertIn("short_title", items[0].quality_flags)
        self.assertEqual(report["quality_flag_counts"]["missing_url"], 1)

    def test_normalize_items_keeps_ids_unique_for_same_canonical_url(self):
        source_result = SourceResult(
            source="AI HOT",
            role="primary",
            ok=True,
            status="ok",
            items=[
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": "summary a",
                    "url": "https://example.com/news?utm_source=a",
                },
                {
                    "title_raw": "OpenAI 发布 GPT-5",
                    "summary_raw": "summary b",
                    "url": "https://example.com/news",
                },
            ],
        )

        items, _ = normalize_items([source_result], run_date="2026-06-04")

        self.assertEqual(len({item.id for item in items}), 2)
        self.assertEqual(items[0].canonical_url, items[1].canonical_url)


if __name__ == "__main__":
    unittest.main()
129
tests/test_stage2_dedupe.py
Normal file
@@ -0,0 +1,129 @@
import unittest

from ai_daily_report.dedupe import cross_day_dedup_items, hard_dedup_items
from ai_daily_report.models import NewsItem, PublishedUrlEntry, PublishedUrls


def item(
    item_id,
    title,
    title_norm,
    url,
    canonical_url,
    source_group="AI HOT",
    source_label="AI HOT",
    source_priority=100,
    summary="summary",
):
    return NewsItem(
        id=item_id,
        source_group=source_group,
        source_label=source_label,
        source_role="primary" if source_group == "AI HOT" else "supplement",
        source_priority=source_priority,
        title_raw=title,
        title_norm=title_norm,
        summary_raw=summary,
        url=url,
        canonical_url=canonical_url,
    )


class Stage2DedupeTests(unittest.TestCase):
    def test_hard_dedup_merges_same_canonical_url_and_keeps_better_item(self):
        items = [
            item("a", "OpenAI 发布 GPT-5", "openai发布gpt5", "https://example.com/a?utm_source=x", "https://example.com/a", source_group="RSS", source_priority=50, summary="short"),
            item("b", "OpenAI 发布 GPT-5", "openai发布gpt5", "https://example.com/a", "https://example.com/a", source_group="AI HOT", source_priority=10, summary="longer summary"),
        ]

        deduped, report = hard_dedup_items(items)

        self.assertEqual([i.id for i in deduped], ["b"])
        self.assertEqual(report["input_count"], 2)
        self.assertEqual(report["output_count"], 1)
        self.assertEqual(report["removed_count"], 1)
        self.assertEqual(report["groups"][0]["reason"], "same_canonical_url")
        self.assertEqual(deduped[0].duplicate_sources[0]["source_group"], "RSS")

    def test_hard_dedup_marks_similar_titles_without_removing(self):
        items = [
            item("a", "Grok API 上线 Cloudflare Gateway", "grokapi上线cloudflaregateway", "https://x.com/a", "https://x.com/a"),
            item("b", "Grok 模型登陆 Cloudflare AI Gateway", "grok模型登陆cloudflareaigateway", "https://x.com/b", "https://x.com/b"),
        ]

        deduped, report = hard_dedup_items(items)

        self.assertEqual(len(deduped), 2)
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(len(report["possible_duplicates"]), 1)
        self.assertEqual(set(report["possible_duplicates"][0]["item_ids"]), {"a", "b"})

    def test_hard_dedup_marks_lower_similarity_mixed_language_titles_as_candidates(self):
        items = [
            item("a", "OpenAI custom chip lead Clive Chan joins Anthropic", "openai定制芯片核心成员clivechan跳槽至anthropic", "https://example.com/a", "https://example.com/a"),
            item("b", "OpenAI chip core member defects to Anthropic before mass production", "openai芯片核心叛逃anthropic就在量产前夜", "https://example.com/b", "https://example.com/b"),
        ]

        deduped, report = hard_dedup_items(items)

        self.assertEqual(len(deduped), 2)
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(len(report["possible_duplicates"]), 1)
        self.assertEqual(set(report["possible_duplicates"][0]["item_ids"]), {"a", "b"})

    def test_cross_day_dedup_filters_recently_published_canonical_urls_only(self):
        items = [
            item("old", "Old URL", "oldurl", "https://example.com/old", "https://example.com/old"),
            item("new", "New URL", "newurl", "https://example.com/new", "https://example.com/new"),
            item("missing", "Missing URL", "missingurl", "", ""),
        ]
        published_urls = PublishedUrls(
            urls={
                "https://example.com/old": PublishedUrlEntry(
                    first_seen="2026-06-07",
                    last_published="2026-06-07",
                    titles=["Old URL"],
                )
            }
        )

        deduped, report = cross_day_dedup_items(
            items,
            published_urls,
            run_date="2026-06-08",
            max_age_days=7,
        )

        self.assertEqual([entry.id for entry in deduped], ["new", "missing"])
        self.assertEqual(report["input_count"], 3)
        self.assertEqual(report["output_count"], 2)
        self.assertEqual(report["removed_count"], 1)
        self.assertEqual(report["removed"][0]["item_id"], "old")

    def test_cross_day_dedup_ignores_urls_outside_history_window(self):
        items = [
            item("stale", "Stale URL", "staleurl", "https://example.com/stale", "https://example.com/stale"),
        ]
        published_urls = PublishedUrls(
            urls={
                "https://example.com/stale": PublishedUrlEntry(
                    first_seen="2026-05-01",
                    last_published="2026-05-01",
                    titles=["Stale URL"],
                )
            }
        )

        deduped, report = cross_day_dedup_items(
            items,
            published_urls,
            run_date="2026-06-08",
            max_age_days=7,
        )

        self.assertEqual([entry.id for entry in deduped], ["stale"])
        self.assertEqual(report["removed_count"], 0)


if __name__ == "__main__":
    unittest.main()
163
tests/test_stage3_semantic_dedupe.py
Normal file
@@ -0,0 +1,163 @@
|
|||||||
|
import json
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from ai_daily_report.models import NewsItem
|
||||||
|
from ai_daily_report.semantic_dedupe import semantic_dedup_items
|
||||||
|
|
||||||
|
|
||||||
|
def news_item(item_id, title, source_group="AI HOT"):
|
||||||
|
return NewsItem(
|
||||||
|
id=item_id,
|
||||||
|
source_group=source_group,
|
||||||
|
source_label=source_group,
|
||||||
|
source_role="primary" if source_group == "AI HOT" else "supplement",
|
||||||
|
source_priority=10 if source_group == "AI HOT" else 50,
|
||||||
|
title_raw=title,
|
||||||
|
title_norm=title.lower(),
|
||||||
|
summary_raw=f"{title} summary",
|
||||||
|
url=f"https://example.com/{item_id}",
|
||||||
|
canonical_url=f"https://example.com/{item_id}",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class Stage3SemanticDedupeTests(unittest.TestCase):
|
||||||
|
def test_semantic_dedup_removes_only_high_confidence_duplicates(self):
|
||||||
|
items = [
|
||||||
|
news_item("a", "Anthropic 提交 IPO 招股书", "AI HOT"),
|
||||||
|
news_item("b", "刚刚,Anthropic 提交了招股书", "量子位"),
|
||||||
|
news_item("c", "Grok 上线 Cloudflare Gateway", "AI HOT"),
|
||||||
|
]
|
||||||
|
candidates = [{"item_ids": ["a", "b"], "reason": "title_similarity"}]
|
||||||
|
|
||||||
|
def llm_call(prompt):
|
||||||
|
return json.dumps(
|
||||||
|
{
|
||||||
|
"duplicate_groups": [
|
||||||
|
{
|
||||||
|
"keep_id": "a",
|
||||||
|
"remove_ids": ["b"],
|
||||||
|
"confidence": "high",
|
||||||
|
"reason": "same IPO filing event",
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"not_duplicates": [],
|
||||||
|
"uncertain": [],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)
|
||||||
|
|
||||||
|
self.assertEqual([item.id for item in deduped], ["a", "c"])
|
||||||
|
self.assertEqual(report["removed_count"], 1)
|
||||||
|
self.assertEqual(report["duplicate_groups"][0]["reason"], "same IPO filing event")
|
||||||
|
self.assertEqual(deduped[0].duplicate_sources[0]["id"], "b")
|
||||||
|
|
||||||
|
    def test_semantic_dedup_skips_deletion_when_ratio_exceeds_limit(self):
        items = [
            news_item("a", "A"),
            news_item("b", "B"),
            news_item("c", "C"),
        ]
        candidates = [{"item_ids": ["a", "b", "c"], "reason": "llm_candidate"}]

        def llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": "a",
                            "remove_ids": ["b", "c"],
                            "confidence": "high",
                            "reason": "too broad",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )

        deduped, report = semantic_dedup_items(
            items,
            candidates,
            llm_call=llm_call,
            max_deletion_ratio=0.5,
        )

        self.assertEqual(len(deduped), 3)
        self.assertEqual(report["removed_count"], 0)
        self.assertTrue(report["skipped_for_deletion_ratio"])

    def test_semantic_dedup_supports_merge_groups_as_supplementary_sources(self):
        items = [
            news_item("a", "高德推出 ABot", "AI HOT"),
            news_item("b", "高德 ABot 进入本地生活入口", "橘鸦AI早报"),
            news_item("c", "Meta 发布新眼镜", "InfoQ AI"),
        ]
        candidates = [{"item_ids": ["a", "b"], "reason": "same_event_complementary"}]

        def llm_call(prompt):
            self.assertIn("merge_groups", prompt)
            return json.dumps(
                {
                    "duplicate_groups": [],
                    "merge_groups": [
                        {
                            "keep_id": "a",
                            "merge_ids": ["b"],
                            "confidence": "high",
                            "reason": "same ABot launch, different angle",
                        }
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )

        deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)

        self.assertEqual([item.id for item in deduped], ["a", "b", "c"])
        self.assertEqual(report["removed_count"], 0)
        self.assertEqual(report["merge_groups"][0]["merge_ids"], ["b"])
        self.assertEqual(deduped[0].duplicate_sources[0]["action"], "merge_supplement")
        self.assertEqual(deduped[0].duplicate_sources[0]["id"], "b")

    def test_semantic_dedup_ignores_groups_outside_candidate_sets(self):
        items = [
            news_item("a", "Suno 完成融资"),
            news_item("b", "Suno 完成 D 轮融资"),
            news_item("c", "Ideogram 发布 v4"),
            news_item("d", "OpenClaw 发布新版"),
        ]
        candidates = [{"item_ids": ["a", "b"], "reason": "title_similarity"}]

        def llm_call(prompt):
            return json.dumps(
                {
                    "duplicate_groups": [
                        {
                            "keep_id": "a",
                            "remove_ids": ["b"],
                            "confidence": "high",
                            "reason": "same Suno event",
                        },
                        {
                            "keep_id": "c",
                            "remove_ids": ["d"],
                            "confidence": "high",
                            "reason": "not part of candidates",
                        },
                    ],
                    "not_duplicates": [],
                    "uncertain": [],
                }
            )

        deduped, report = semantic_dedup_items(items, candidates, llm_call=llm_call)

        self.assertEqual([item.id for item in deduped], ["a", "c", "d"])
        self.assertEqual(report["removed_count"], 1)
        self.assertIn("group_outside_candidates", report["errors"][0])


if __name__ == "__main__":
    unittest.main()
242
tests/test_stage4_rewrite.py
Normal file
@@ -0,0 +1,242 @@
import json
import unittest
from urllib.error import HTTPError

from ai_daily_report.models import NewsItem
from ai_daily_report.rewrite import rewrite_items


def news_item(item_id="a"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw="OpenAI launches GPT-5 API",
        title_norm="openailaunchesgpt5api",
        summary_raw="OpenAI launched the GPT-5 API with better latency.",
        url="https://example.com/a",
        canonical_url="https://example.com/a",
    )


class Stage4RewriteTests(unittest.TestCase):
    def test_rewrite_items_writes_display_fields_without_overwriting_raw(self):
        items = [news_item("a")]

        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": "a",
                            "title": "OpenAI 发布 GPT-5 API",
                            "summary": "OpenAI 发布 GPT-5 API,延迟表现更好。",
                            "flags": [],
                        }
                    ]
                },
                ensure_ascii=False,
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)

        self.assertEqual(rewritten[0].title, "OpenAI 发布 GPT-5 API")
        self.assertEqual(rewritten[0].summary, "OpenAI 发布 GPT-5 API,延迟表现更好。")
        self.assertEqual(rewritten[0].title_raw, "OpenAI launches GPT-5 API")
        self.assertEqual(report["rewritten_count"], 1)
        self.assertEqual(report["fallback_count"], 0)

    def test_rewrite_items_accepts_llm_section_classification(self):
        items = [news_item("a")]

        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": "a",
                            "title": "OpenAI 发布 GPT-5 API",
                            "summary": "OpenAI 发布 GPT-5 API,延迟表现更好。",
                            "section": "模型与能力",
                            "confidence": 0.92,
                            "flags": [],
                        }
                    ]
                },
                ensure_ascii=False,
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)

        self.assertEqual(rewritten[0].section, "模型与能力")
        self.assertEqual(report["llm_section_count"], 1)

    def test_rewrite_items_falls_back_when_llm_fails(self):
        items = [news_item("a")]

        def llm_call(prompt):
            raise TimeoutError("slow")

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=10)

        self.assertEqual(rewritten[0].title, "OpenAI launches GPT-5 API")
        self.assertEqual(rewritten[0].summary, "OpenAI launched the GPT-5 API with better latency.")
        self.assertEqual(report["rewritten_count"], 0)
        self.assertEqual(report["fallback_count"], 1)
        self.assertIn("TimeoutError", report["errors"][0])

    def test_rewrite_items_can_retry_failed_batch_as_single_items_when_enabled(self):
        items = [news_item("a"), news_item("b")]
        calls = []

        def llm_call(prompt):
            payload = json.loads(prompt)
            ids = [item["id"] for item in payload["items"]]
            calls.append(ids)
            if len(ids) > 1:
                return "not json"
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": ids[0],
                            "title": f"title {ids[0]}",
                            "summary": f"summary {ids[0]}",
                            "flags": [],
                        }
                    ]
                }
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2, retry_single_items=True)

        self.assertEqual([item.title for item in rewritten], ["title a", "title b"])
        self.assertEqual(report["rewritten_count"], 2)
        self.assertEqual(report["fallback_count"], 0)
        self.assertEqual(calls, [["a", "b"], ["a"], ["b"]])

    def test_rewrite_items_does_not_retry_single_items_by_default(self):
        items = [news_item("a"), news_item("b")]
        calls = []

        def llm_call(prompt):
            payload = json.loads(prompt)
            calls.append([item["id"] for item in payload["items"]])
            return "not json"

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2)

        self.assertEqual(calls, [["a", "b"]])
        self.assertEqual([item.title for item in rewritten], ["OpenAI launches GPT-5 API", "OpenAI launches GPT-5 API"])
        self.assertEqual(report["fallback_count"], 2)

    def test_rewrite_items_retries_failed_large_batch_as_smaller_batches_by_default(self):
        items = [news_item(str(index)) for index in range(30)]
        calls = []

        def llm_call(prompt):
            payload = json.loads(prompt)
            ids = [item["id"] for item in payload["items"]]
            calls.append(ids)
            if len(ids) == 30:
                return "not json"
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": item_id,
                            "title": f"title {item_id}",
                            "summary": f"summary {item_id}",
                            "section": "模型与能力",
                            "flags": [],
                        }
                        for item_id in ids
                    ]
                },
                ensure_ascii=False,
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call)

        self.assertEqual([len(call) for call in calls], [30, 10, 10, 10])
        self.assertEqual(report["rewritten_count"], 30)
        self.assertEqual(report["llm_section_count"], 30)
        self.assertEqual(report["fallback_count"], 0)
        self.assertEqual(report["batch_retry_count"], 3)
        self.assertEqual(report["blocking_errors"], [])
        self.assertEqual(rewritten[0].title, "title 0")

    def test_rewrite_items_keeps_partial_batch_rewrites_when_some_ids_are_missing(self):
        items = [news_item("a"), news_item("b"), news_item("c")]

        def llm_call(prompt):
            return json.dumps(
                {
                    "rewrites": [
                        {"id": "a", "title": "title a", "summary": "summary a", "flags": []},
                        {"id": "c", "title": "title c", "summary": "summary c", "flags": []},
                    ]
                }
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=3, max_fallback_ratio=0.5)

        self.assertEqual([item.title for item in rewritten], ["title a", "OpenAI launches GPT-5 API", "title c"])
        self.assertEqual(report["rewritten_count"], 2)
        self.assertEqual(report["fallback_count"], 1)
        self.assertEqual(report["missing_rewrite_count"], 1)
        self.assertEqual(report["blocking_errors"], [])

    def test_rewrite_items_defaults_to_large_batches_to_reduce_llm_requests(self):
        items = [news_item(str(index)) for index in range(61)]
        batch_sizes = []

        def llm_call(prompt):
            payload = json.loads(prompt)
            batch_sizes.append(len(payload["items"]))
            return json.dumps(
                {
                    "rewrites": [
                        {
                            "id": entry["id"],
                            "title": entry["title_raw"],
                            "summary": entry["summary_raw"],
                            "flags": [],
                        }
                        for entry in payload["items"]
                    ]
                }
            )

        rewrite_items(items, llm_call=llm_call)

        self.assertEqual(batch_sizes, [30, 30, 1])

    def test_rewrite_items_does_not_retry_single_items_after_transient_http_error(self):
        items = [news_item("a"), news_item("b")]
        calls = 0

        def llm_call(prompt):
            nonlocal calls
            calls += 1
            raise HTTPError(
                url="https://llm.example/v1/chat/completions",
                code=503,
                msg="Service Unavailable",
                hdrs=None,
                fp=None,
            )

        rewritten, report = rewrite_items(items, llm_call=llm_call, batch_size=2)

        self.assertEqual(calls, 1)
        self.assertEqual([item.title for item in rewritten], ["OpenAI launches GPT-5 API", "OpenAI launches GPT-5 API"])
        self.assertEqual(report["fallback_count"], 2)
        self.assertTrue(report["quality_gate_failed"])
        self.assertIn("rewrite_fallback_ratio_exceeded", report["blocking_errors"])


if __name__ == "__main__":
    unittest.main()
88
tests/test_stage5_classify.py
Normal file
@@ -0,0 +1,88 @@
import unittest

from ai_daily_report.classify import SECTION_ORDER, classify_and_order_items
from ai_daily_report.models import NewsItem


def news_item(item_id, title, summary="", section_hint="", source_priority=50):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=source_priority,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=summary or f"{title} summary",
        title=title,
        summary=summary or f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section_hint=section_hint,
    )


class Stage5ClassifyTests(unittest.TestCase):
    def test_classify_maps_legacy_section_hints_to_new_sections(self):
        items = [news_item("a", "GPT-5 发布", section_hint="模型发布/更新")]

        classified, report = classify_and_order_items(items)

        self.assertEqual(classified[0].section, "模型与能力")
        self.assertEqual(report["hint_classified"], 1)
        self.assertIn("模型与能力", SECTION_ORDER)

    def test_classify_uses_rules_when_hint_is_missing(self):
        items = [
            news_item("a", "Anthropic 提交 IPO 文件", summary="Anthropic 计划上市并提交文件。"),
            news_item("b", "MCP SDK 发布新版", summary="开发者可用新版 SDK 构建工具。"),
        ]

        classified, report = classify_and_order_items(items)
        by_id = {item.id: item for item in classified}

        self.assertEqual(by_id["a"].section, "公司与资本")
        self.assertEqual(by_id["b"].section, "开发与基础设施")
        self.assertEqual(report["rule_classified"], 2)

    def test_classify_prefers_valid_llm_section_from_rewrite_stage(self):
        item = news_item(
            "a",
            "API 发布",
            summary="这其实是一个面向开发者的基础设施能力更新。",
            section_hint="产品发布/更新",
        )
        item.section = "开发与基础设施"

        classified, report = classify_and_order_items([item])

        self.assertEqual(classified[0].section, "开发与基础设施")
        self.assertEqual(report["llm_classified"], 1)
        self.assertEqual(report["hint_classified"], 0)
        self.assertEqual(report["rule_classified"], 0)

    def test_classify_falls_back_when_llm_section_is_invalid(self):
        item = news_item("a", "GPT-5 发布", section_hint="模型发布/更新")
        item.section = "热点新闻"

        classified, report = classify_and_order_items([item])

        self.assertEqual(classified[0].section, "模型与能力")
        self.assertEqual(report["llm_classified"], 0)
        self.assertEqual(report["hint_classified"], 1)
        self.assertEqual(report["invalid_llm_section_count"], 1)

    def test_classify_orders_items_by_local_rank_score_within_sections(self):
        items = [
            news_item("low", "普通模型更新", section_hint="模型发布/更新", source_priority=80),
            news_item("high", "GPT-5 API 发布,延迟降低 30%", section_hint="模型发布/更新", source_priority=10),
        ]

        classified, report = classify_and_order_items(items)

        self.assertEqual([item.id for item in classified], ["high", "low"])
        self.assertEqual(report["section_counts"]["模型与能力"], 2)


if __name__ == "__main__":
    unittest.main()
83
tests/test_stage6_guide.py
Normal file
@@ -0,0 +1,83 @@
import json
import unittest

from ai_daily_report.guide import generate_guide
from ai_daily_report.models import NewsItem


def news_item(item_id, title, section="模型与能力"):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="AI HOT",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=f"{title} summary",
        title=title,
        summary=f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section=section,
    )


class Stage6GuideTests(unittest.TestCase):
    def test_generate_guide_returns_intro_theme_threads_and_conclusion(self):
        items = [
            news_item("a", "GPT-5 API 发布"),
            news_item("b", "Miso One 开源语音模型"),
        ]

        def llm_call(prompt):
            return json.dumps(
                {
                    "intro": "今天的 AI 行业继续围绕模型能力、Agent 产品和基础设施演进展开。",
                    "theme": "模型能力继续向 API 和实时语音两端推进。",
                    "threads": [
                        {
                            "title": "模型能力继续推进",
                            "text": "GPT-5 API 和 Miso One 分别代表 API 能力和语音模型更新。",
                            "item_ids": ["a", "b"],
                            "kind": "thread",
                        },
                        {
                            "title": "无效脉络",
                            "text": "这条引用了不存在的条目。",
                            "item_ids": ["missing"],
                            "kind": "thread",
                        },
                    ],
                    "conclusion": "总体看,模型能力正在进入更多产品入口,生态竞争也在继续加速。",
                },
                ensure_ascii=False,
            )

        guide, report = generate_guide(items, llm_call=llm_call)

        self.assertEqual(guide["intro"], "今天的 AI 行业继续围绕模型能力、Agent 产品和基础设施演进展开。")
        self.assertEqual(guide["theme"], "模型能力继续向 API 和实时语音两端推进。")
        self.assertEqual(guide["conclusion"], "总体看,模型能力正在进入更多产品入口,生态竞争也在继续加速。")
        self.assertEqual(len(guide["threads"]), 1)
        self.assertEqual(guide["threads"][0]["item_ids"], ["a", "b"])
        self.assertEqual(report["dropped_thread_count"], 1)

    def test_generate_guide_falls_back_when_llm_fails(self):
        items = [news_item("a", "GPT-5 API 发布")]

        def llm_call(prompt):
            raise TimeoutError("slow")

        guide, report = generate_guide(items, llm_call=llm_call)

        self.assertEqual(guide["intro"], "")
        self.assertEqual(guide["theme"], "")
        self.assertEqual(guide["conclusion"], "")
        self.assertEqual(guide["threads"], [])
        self.assertTrue(report["fallback_used"])
        self.assertIn("TimeoutError", report["errors"][0])


if __name__ == "__main__":
    unittest.main()
69
tests/test_stage7_assemble.py
Normal file
@@ -0,0 +1,69 @@
import unittest

from ai_daily_report.assemble import assemble_markdown, validate_markdown
from ai_daily_report.models import NewsItem


def news_item(item_id, title, section):
    return NewsItem(
        id=item_id,
        source_group="AI HOT",
        source_label="OpenAI:Blog",
        source_role="primary",
        source_priority=10,
        title_raw=title,
        title_norm=title.lower(),
        summary_raw=f"{title} summary",
        title=title,
        summary=f"{title} summary",
        url=f"https://example.com/{item_id}",
        canonical_url=f"https://example.com/{item_id}",
        section=section,
    )


class Stage7AssembleTests(unittest.TestCase):
    def test_assemble_markdown_renders_intro_sections_daily_threads_and_conclusion(self):
        items = [
            news_item("a", "GPT-5 API 发布", "模型与能力"),
            news_item("b", "Anthropic 提交 IPO 文件", "公司与资本"),
        ]
        guide = {
            "intro": "今天的 AI 行业继续围绕模型、产品和资本展开。",
            "theme": "> 模型和资本两条线都在推进。[1]",
            "threads": [
                {
                    "title": "模型能力产品化",
                    "text": "GPT-5 API 发布,说明模型能力继续进入产品入口。",
                    "item_ids": ["a"],
                    "kind": "thread",
                }
            ],
            "conclusion": "总体看,AI 竞争继续从单点模型能力转向产品、基础设施和资本协同。",
        }

        md, report = assemble_markdown(items, guide)

        self.assertTrue(md.startswith("## 引言\n\n> 今天的 AI 行业继续围绕模型、产品和资本展开。"))
        self.assertNotIn("## 导览", md)
        self.assertNotIn("> 模型和资本两条线都在推进。", md)
        self.assertIn("## 模型与能力", md)
        self.assertIn("**1. GPT-5 API 发布**", md)
        self.assertIn("**2. Anthropic 提交 IPO 文件**", md)
        self.assertIn("## 今日脉络", md)
        self.assertIn("- **模型能力产品化**", md)
        self.assertTrue(md.endswith("## 总结\n\n> 总体看,AI 竞争继续从单点模型能力转向产品、基础设施和资本协同。"))
        self.assertNotIn("> >", md)
        self.assertNotIn("[1]", md)
        self.assertEqual(report["item_count"], 2)
        self.assertEqual(report["blocking_errors"], [])

    def test_validate_markdown_blocks_empty_report(self):
        report = validate_markdown("", [])

        self.assertIn("no_items", report["blocking_errors"])
        self.assertIn("markdown_too_short", report["blocking_errors"])


if __name__ == "__main__":
    unittest.main()
162
tests/test_stage8_publish.py
Normal file
@@ -0,0 +1,162 @@
import unittest
from pathlib import Path
from tempfile import TemporaryDirectory

from ai_daily_report.models import NewsItem
from ai_daily_report.publish import load_published_urls, publish_markdown, update_published_urls


class FakeBlogClient:
    def __init__(self, existing_post=None):
        self.created_payload = None
        self.published_slug = None
        self.existing_post = existing_post

    def create_post(self, payload):
        self.created_payload = payload
        return {"slug": "ai-2026-06-04"}

    def publish_post(self, slug):
        self.published_slug = slug

    def get_post_by_slug(self, slug):
        return self.existing_post


class Stage8PublishTests(unittest.TestCase):
    def test_publish_markdown_dry_run_does_not_call_client(self):
        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="## 导览\n\n> ok",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="dry-run",
            markdown_report={"blocking_errors": []},
            client=None,
        )

        self.assertEqual(result.status, "ok")
        self.assertEqual(result.mode, "dry-run")
        self.assertEqual(result.blog_url, "https://blog.example/posts/ai-2026-06-04")
        self.assertTrue(result.public_ok)

    def test_publish_markdown_blocks_when_markdown_has_errors(self):
        client = FakeBlogClient()

        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="bad",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": ["markdown_too_short"]},
            client=client,
        )

        self.assertEqual(result.status, "blocked")
        self.assertIsNone(client.created_payload)
        self.assertIn("markdown_too_short", result.error)

    def test_publish_markdown_publish_mode_calls_client(self):
        client = FakeBlogClient()

        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="## 导览\n\n> ok",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
        )

        self.assertEqual(result.status, "ok")
        self.assertEqual(client.created_payload["title"], "AI日报 · 2026-06-04")
        self.assertEqual(client.published_slug, "ai-2026-06-04")
        self.assertEqual(result.blog_url, "https://blog.example/posts/ai-2026-06-04")

    def test_publish_markdown_returns_already_published_for_same_slug_and_content(self):
        markdown = "## 导览\n\n> ok"
        client = FakeBlogClient(existing_post={"slug": "ai-2026-06-04", "content": markdown})

        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown=markdown,
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
            idempotency_config={"enabled": True},
        )

        self.assertEqual(result.status, "already_published")
        self.assertIsNone(client.created_payload)
        self.assertIsNone(client.published_slug)

    def test_publish_markdown_blocks_existing_slug_with_different_content(self):
        client = FakeBlogClient(existing_post={"slug": "ai-2026-06-04", "content": "old"})

        result = publish_markdown(
            title="AI日报 · 2026-06-04",
            markdown="new",
            tags=["AI日报"],
            slug="ai-2026-06-04",
            base_url="https://blog.example",
            mode="publish",
            markdown_report={"blocking_errors": []},
            client=client,
            idempotency_config={"enabled": True},
        )

        self.assertEqual(result.status, "blocked")
        self.assertIn("slug_already_exists", result.error)
        self.assertIsNone(client.created_payload)

    def test_update_published_urls_writes_canonical_urls_for_final_items(self):
        with TemporaryDirectory() as temp_dir:
            history_path = Path(temp_dir) / "published_urls.json"
            items = [
                NewsItem(
                    id="a",
                    source_group="AI HOT",
                    source_label="AI HOT",
                    source_role="primary",
                    source_priority=10,
                    title_raw="Fresh story",
                    title_norm="freshstory",
                    summary_raw="summary",
                    url="https://example.com/fresh?utm_source=x",
                    canonical_url="https://example.com/fresh",
                    title="Fresh story",
                ),
                NewsItem(
                    id="missing",
                    source_group="AI HOT",
                    source_label="AI HOT",
                    source_role="primary",
                    source_priority=10,
                    title_raw="Missing URL",
                    title_norm="missingurl",
                    summary_raw="summary",
                    url="",
                    canonical_url="",
                ),
            ]

            update_published_urls(history_path, items, run_date="2026-06-08", max_age_days=7)
            loaded = load_published_urls(history_path)

            self.assertIn("https://example.com/fresh", loaded.urls)
            self.assertNotIn("", loaded.urls)
            self.assertEqual(loaded.urls["https://example.com/fresh"].first_seen, "2026-06-08")
            self.assertEqual(loaded.urls["https://example.com/fresh"].last_published, "2026-06-08")
            self.assertEqual(loaded.urls["https://example.com/fresh"].titles, ["Fresh story"])


if __name__ == "__main__":
    unittest.main()
14
tests/test_validate.py
Normal file
@@ -0,0 +1,14 @@
import unittest

from ai_daily_report.validate import validate_report_markdown


class ValidateTests(unittest.TestCase):
    def test_validate_report_markdown_delegates_markdown_checks(self):
        report = validate_report_markdown("", [])

        self.assertIn("no_items", report["blocking_errors"])


if __name__ == "__main__":
    unittest.main()