Elaina/ai-daily-report

Fork 0

Files

Elaina 94e18ce22d init: AI日报 pipeline 完整代码 + 技能文档 + 运行记录

2026-06-04 10:38:44 +08:00

8.4 KiB

Raw Blame History

name, description, trigger

name

description

trigger

ai-daily-report-pipeline

Maintain and operate the AI daily report cron pipeline (ai_daily_blog_pipeline.py). Covers: script configuration, LLM prompt tuning, data sources, timeout settings, output format, and publishing workflow.

AI日报 / AI daily report / daily briefing

ai_daily_blog_pipeline.py

cron job 76297415d88d

ai_morning_out / 橘鸦 / AI HOT

AI Daily Report Pipeline

Automated daily AI news digest that publishes to blog.ephron.ren.

Architecture

Script: ~/.hermes/scripts/ai_daily_blog_pipeline.py (~1100 lines)
Output dir: ~/.hermes/scripts/ai_morning_out/
Cron job: 76297415d88d, schedule 0 10 * * * (10:00 CST daily)
Delivery: origin (back to the chat that created it)
Mode: no_agent: true (script-only, no LLM wrapper)

Data Sources

Source	Type	Notes
AI HOT	API (aihot skill)	Primary, category-specific
橘鸦 AI 早报	RSS (content:encoded)	Publishes ~09:34. Parsed from RSS `content:encoded` field (no article page fetch). 45s timeout.
InfoQ AI	RSS	`feed.infoq.com/ai-ml-data-eng/`
MIT 科技评论 AI	RSS	`technologyreview.com/topic/.../feed`
量子位	RSS	`qbitai.com/feed`

Processing Pipeline (4-Stage Architecture)

Stages are designed to maximize script work, minimize LLM calls.

Stage 0: Script Dedup (pure Python)

Normalize titles: strip punctuation, lowercase, remove stopwords
Remove exact duplicates (title match or Jaccard > 0.85)
No LLM involved — deterministic

Stage 1: LLM Semantic Dedup

Single LLM call to find semantically equivalent items (e.g. same news from different sources)
Input: {index, title, summary} for each item
Output: {duplicates: [{keep: 0, remove: [1,2]}, ...]}
Removes less-detailed version of each duplicate pair

Stage 2: Parallel Summary Rewrite + Classify (2 concurrent LLM calls)

Stage 2a: Rewrite summaries + translate titles to Chinese
- Brand/model names preserved in English (GPT-5, Codex, etc.)
- Other title text translated to Chinese
- Summary: max 120 chars, concise Chinese
- Output: {summaries: [{index, title, summary}, ...]}
Stage 2b: Classify items into sections
- Sections: 模型与技术, 产品与工具, 开发与工程, 行业与公司, 研究与发现, 观点与评论
- Output: {classifications: [{index, section}, ...]}
Both run in parallel via concurrent.futures.ThreadPoolExecutor

Stage 3: LLM Guide Generation

Single LLM call for "今日观察" (observation/analysis)
Input: all item titles + summaries
Output: {guide: [{type, text}, ...]} JSON array
Types: theme (1), strong (2-3), medium (1-2), risk (1-2)
NO advice type

Stage 4: Script Assemble + Publish (pure Python)

Merge Stage 2a output (titles+summaries) with Stage 2b output (sections)
Assemble markdown: 导览 → 分类新闻 → 总结
Publish via Service API (create → publish → PATCH)

User Preferences (CRITICAL)

NO CURATION / NO SELECTION: Only filter exact duplicates. ALL non-duplicate items must be preserved. Do NOT use words like "精选" (curated/selected) in output. The user explicitly rejected any editorial filtering beyond deduplication.
No emoji in the output
No reference numbers like [1][3] — readers can't see what they point to. Strip all [N] from guide text via clean_guide_text().
No "主线判断：" prefix in 导览 section — strip via regex r'^主线判断[：:]\s*' in clean_guide_text().
No advice/suggestions section — no "开发者应该..." type content. Guide types are: theme, strong, medium, risk ONLY.
Concrete not generic — avoid vague statements like "行业焦点转向XX". Point to specific events.
Plain language — no academic/formal tone, use 大白话
Concise — each guide item 2-3 sentences max
Readable formatting — summary section uses type labels as headers, then bullet-list format:
```
**强信号**
- **标题**
  内容...
```
Guide format: [{type, text}] JSON array. Types: theme (1), strong (2-3), medium (1-2), risk (1-2). NO advice type.
Structure: 导览 (blockquote, no prefix) → 新闻 → 总结 (type labels + bullet list, grouped by type)
Links must be verified accessible before inclusion

Key Configuration

Cron timeout: cron.script_timeout_seconds: 600 (in ~/.hermes/config.yaml)
LLM urllib timeout: 600s (in script, urllib.request.urlopen(req, timeout=600))
RSS fetch timeout: 25s per regular feed. 橘鸦: 45s (GitHub Pages, 262KB RSS).
LLM API: follows the active Hermes model config by default. Current production path is Sub2API (~/.hermes/config.yaml → model.provider: sub2api, model.default: findmini/gpt-5.5, model.base_url: http://sub2api.ephron.ren/v1) with SUB2API_API_KEY from ~/.hermes/.env. Keep API key, base_url, and model from the same provider family; do not mix SUB2API_API_KEY with Xiaomi/MiMo base_url, or the LLM stages will fail with 401 Unauthorized.
max_items: 30 in _prefilter_items — controls LLM prompt size; 38 items worked fine, 30 is conservative

Pitfalls

Config file is protected: ~/.hermes/config.yaml cannot be edited with patch tool. Use sed -i 's/old/new/' ~/.hermes/config.yaml via terminal.
橘鸦 timing: Publishes ~09:34 CST. Script sleeps 120s if empty. Don't run before 10:00.
橘鸦 regex bug (fixed 2026-06-04): The block_pattern regex had \\s* (two backslashes in source = literal backslash in regex) before <code> instead of \s* (one backslash = whitespace class). This caused the regex to never match any 橘鸦 items, silently returning empty results. The first_real_block qwen-ID regex was also dead (site migrated away from Qwen IDs). Fix: (a) split into fetch_juya_rss + parse_juya; (b) parse from RSS content:encoded eliminating the second HTTP fetch; (c) changed escaped backslash to whitespace class; (d) changed .*? to [^<]*? to prevent overview section from leaking into matches (the overview <h2>概览</h2> has no <code>#N</code>, but the lazy .*? would cross the h2 boundary to find it).
橘鸦 timeout: Now uses 45s timeout (up from 25s) because GitHub Pages can be slow and the RSS feed is ~262KB. Content is parsed from RSS content:encoded to avoid a second HTTP request for the article page. Falls back to fetching the article page if content:encoded is unavailable.
MiMo token limit: With the 4-stage architecture, each LLM call handles a smaller prompt (dedup ~3K, summary ~6K, classify ~3K, guide ~5K). max_items=30 is safe. Old single-call approach needed max_items=18.
Gateway restart needed: After config changes, systemctl --user restart hermes-gateway is required.
Timeout tuning (USER IS VERY SENSITIVE): User explicitly demands timeouts set to 1.5-2x of theoretical time. Being conservative causes repeated failures and user frustration. If theoretical time is ~80s, set timeout to 600s. Never start low and increment — go generous from the start. User said: "一直超时太影响体验了".
LLM prompt anti-patterns: Never instruct LLM to "精选" (curate/select). Never ask for [N] reference numbers. Never include "建议" (advice) section. Never include "主线判断：" prefix in theme text. These all produce unwanted output.
Title translation: Stage 2a MUST translate English titles to Chinese. Brand/model names (GPT-5, Codex, Gemini, etc.) are preserved in English. All other title text translated. If titles come back in English, check that the Stage 2a prompt includes explicit title translation instruction and the output format includes "title" field.
patch tool and regex: The patch tool's Escape-drift detection can interfere with multi-backslash regex patterns. For complex regex changes in the pipeline script, use terminal with sed -i or a Python script that reads/writes the file directly.

Files

run_meta.json — last run metadata (date, slug, url, errors, source counts)
raw_items.json — raw fetched items
llm_digest.json — LLM output
blog_markdown.md — rendered blog post

References

references/timeout-config.md — timeout values and tuning rules for all script stages
references/llm-config-auto-follow.md — how the script auto-follows Hermes model config
references/mimo-api-performance.md — MiMo API performance characteristics
references/rendering-guide.md — blog post rendering rules

8.4 KiB Raw Blame History Unescape Escape