ai-daily-report/docs/pipeline-optimization-plan.md
2026-06-04 15:21:56 +08:00

AI Daily Report Pipeline Optimization Plan

Objective

This project should become a stable, long-running AI daily report system for Hermes, OpenClaw, and similar agents. The goal is not only to keep the current script runnable, but to make the whole pipeline observable, replayable, maintainable, and safe to run on a daily schedule.

The recommended direction is:

stable core library + CLI + skill wrapper

Core business logic should live in deterministic code. The skill should describe how agents run, diagnose, replay, publish, and extend the pipeline.

Stage Model

Use this stage model going forward:

Stage 0: Collect Sources
Stage 1: Normalize Items
Stage 2: Hard Dedup
Stage 3: Semantic Dedup
Stage 4: Rewrite Titles and Summaries
Stage 5: Classify and Order
Stage 6: Guide and Daily Threads
Stage 7: Assemble and Validate Markdown
Stage 8: Publish and Deliver

The current script labels its script-level deduplication "Stage 0". Treat that as legacy terminology: in the long-term pipeline, Stage 0 is source collection, and deduplication starts at Stage 2.

Architecture

Recommended structure:

ai-daily-report/
├── ai_daily_report/
│   ├── models.py
│   ├── sources/
│   │   ├── aihot.py
│   │   ├── rss.py
│   │   ├── juya.py
│   │   └── registry.py
│   ├── collect.py
│   ├── normalize.py
│   ├── dedupe.py
│   ├── llm.py
│   ├── rewrite.py
│   ├── classify.py
│   ├── assemble.py
│   ├── validate.py
│   ├── publish.py
│   └── cli.py
├── config/
│   ├── sources.json
│   └── pipeline.json
├── docs/
├── skill/
│   ├── SKILL.md
│   ├── scripts/
│   └── references/
├── tests/
│   └── fixtures/
└── script/
    └── ai_daily_blog_pipeline.py

Keep script/ai_daily_blog_pipeline.py as a compatibility entrypoint during migration, but move implementation into importable modules.

Data Model

SourceResult

Every data source should return a structured result:

{
  "source": "AI HOT",
  "role": "primary",
  "ok": true,
  "status": "ok",
  "items": [],
  "error": null,
  "elapsed_ms": 820,
  "retry_count": 0,
  "fetched_at": "2026-06-04T10:00:00+08:00"
}

Supported statuses:

ok
empty
not_ready
timeout
http_error
parse_error
disabled
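
A minimal sketch of how SourceResult could be typed in models.py. The field names follow the JSON example above; deriving ok from status == "ok" is an assumption, since the plan does not say whether empty counts as ok.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

# Status values listed in this plan.
SOURCE_STATUSES = {
    "ok", "empty", "not_ready", "timeout",
    "http_error", "parse_error", "disabled",
}


@dataclass
class SourceResult:
    source: str
    role: str  # "primary" or "supplement"
    status: str
    items: list[dict[str, Any]] = field(default_factory=list)
    error: Optional[str] = None
    elapsed_ms: int = 0
    retry_count: int = 0
    fetched_at: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject statuses outside the supported set early.
        if self.status not in SOURCE_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    @property
    def ok(self) -> bool:
        # Assumption: "ok" simply mirrors status == "ok".
        return self.status == "ok"

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        data["ok"] = self.ok
        return data
```

Validating the status in the constructor means a typo in a source adapter fails loudly instead of producing an unreadable collect report.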

NewsItem

All raw source items should be normalized into one structure:

{
  "id": "item_...",
  "source_group": "AI HOT",
  "source_label": "OpenAI: Blog",
  "source_role": "primary",
  "source_priority": 10,
  "title_raw": "...",
  "title_norm": "...",
  "summary_raw": "...",
  "title": null,
  "summary": null,
  "url": "...",
  "canonical_url": "...",
  "published_at": "...",
  "collected_at": "...",
  "origin_type": "aihot_json",
  "section_hint": "...",
  "section": null,
  "language_hint": "zh",
  "quality_flags": [],
  "duplicate_sources": []
}

Do not overwrite raw fields with LLM output. Keep display fields separate.

Stage 0: Collect Sources

Goal

Collect candidate news from all configured sources in a stable, observable, and recoverable way.

Design

Use a primary-plus-supplement model at the quality layer, and parallel execution at the scheduling layer.

Quality layer:
AI HOT = primary source
RSS / Juya / InfoQ / QbitAI / MIT = supplement sources

Execution layer:
start all sources concurrently with per-source timeout, retry, and reporting

Source Config

Example:

{
  "name": "AI HOT",
  "type": "aihot",
  "role": "primary",
  "required": true,
  "priority": 10,
  "timeout_seconds": 20,
  "retries": 2,
  "min_items": 10,
  "enabled": true
}

Supplement source example:

{
  "name": "Juya AI Daily",
  "type": "juya_rss",
  "url": "https://imjuya.github.io/juya-ai-daily/rss.xml",
  "role": "supplement",
  "required": false,
  "priority": 20,
  "timeout_seconds": 45,
  "retries": 2,
  "enabled": true
}

Optimizations

  • Run supplement sources concurrently.
  • Do not let one slow source block the whole pipeline.
  • Replace the fixed Juya sleep(120) with bounded short retries and a clear not_ready or timeout status.
  • Treat AI HOT 404 as "not ready" rather than a generic failure.
  • Allow degraded generation if the primary source has a temporary network failure and supplement sources are usable.
  • Persist raw source results for replay.
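
The collection behavior above could be sketched roughly as follows. NotReady, fetch_with_retry, collect_all, and the fetchers mapping are hypothetical names, the exception-to-status mapping is deliberately coarse, and each fetcher is assumed to enforce its own timeout_seconds (a thread pool cannot interrupt a hung call).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

Fetcher = Callable[[dict[str, Any]], list]


class NotReady(Exception):
    """Hypothetical signal: the source has not published today's data yet."""


def fetch_with_retry(config: dict[str, Any], fetch: Fetcher,
                     backoff_seconds: float = 2.0) -> dict[str, Any]:
    started = time.monotonic()
    retries = config.get("retries", 0)
    items: list = []
    status, error, attempt = "ok", None, 0
    for attempt in range(retries + 1):
        try:
            # The fetcher is expected to honor config["timeout_seconds"] itself.
            items = fetch(config)
            status, error = ("ok" if items else "empty"), None
            break
        except NotReady as exc:
            status, error = "not_ready", str(exc)
        except TimeoutError as exc:
            status, error = "timeout", str(exc)
        except OSError as exc:
            status, error = "http_error", str(exc)
        except Exception as exc:  # coarse bucket for everything else
            status, error = "parse_error", str(exc)
        if attempt < retries:
            time.sleep(backoff_seconds)  # short bounded wait, not sleep(120)
    return {
        "source": config["name"], "role": config.get("role", "supplement"),
        "ok": status == "ok", "status": status, "items": items, "error": error,
        "elapsed_ms": int((time.monotonic() - started) * 1000),
        "retry_count": attempt,
    }


def collect_all(configs: list[dict[str, Any]], fetchers: dict[str, Fetcher],
                backoff_seconds: float = 2.0) -> list[dict[str, Any]]:
    enabled = [c for c in configs if c.get("enabled", True)]
    # One worker per source, so a slow source never delays the others' start.
    with ThreadPoolExecutor(max_workers=max(1, len(enabled))) as pool:
        futures = [pool.submit(fetch_with_retry, c, fetchers[c["type"]],
                               backoff_seconds) for c in enabled]
        return [f.result() for f in futures]
```

Because every failure is converted into a status rather than raised, the caller can decide on degraded generation by inspecting the results list.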

Artifacts

source_results.json
raw_items.json
stage0_collect_report.json

Stage 1: Normalize Items

Goal

Convert heterogeneous source output into clean, comparable, traceable NewsItem objects.

Optimizations

  • Normalize text with HTML stripping, entity decoding, whitespace cleanup, and RSS boilerplate removal.
  • Generate stable id values from canonical URL when possible, otherwise from source, normalized title, and date.
  • Canonicalize URLs:
    • Lowercase scheme and host.
    • Remove utm_*, fbclid, gclid, spm, from, and fragments.
    • Normalize trailing slashes.
    • Normalize twitter.com and x.com URLs.
  • Generate title_norm:
    • Unicode NFKC normalization.
    • Lowercase English text.
    • Normalize whitespace and weak punctuation.
    • Preserve numbers, versions, model names, and product names.
  • Standardize source labels:
    • X links as X:@username.
    • Official blogs as OpenAI: Blog, Google Research: Blog, etc.
    • Avoid generic labels such as "technology media" when a domain label is available.
  • Add quality_flags instead of silently dropping items:
    • missing_url
    • missing_summary
    • short_title
    • bad_url
    • old_item
    • parse_suspect
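
As an illustration of the normalization rules above, here is a sketch using only the standard library. The function names are hypothetical, and three choices are assumptions: twitter.com hosts are folded onto x.com, trailing slashes are stripped except at the root, and "weak punctuation" is read as quotes, brackets, and separators (dots and hyphens stay so that versions such as v4.0 survive).

```python
import hashlib
import re
import unicodedata
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid", "spm", "from"}
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com", "www.x.com"}
WEAK_PUNCT = re.compile(r"""["'“”‘’「」『』《》【】\[\]()|:;,!?、。]""")


def canonicalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host in TWITTER_HOSTS:
        host = "x.com"  # assumption: x.com is the canonical host
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith("utm_") and k.lower() not in TRACKING_KEYS]
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the fragment.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def normalize_title(title: str) -> str:
    # NFKC folds full-width punctuation and letters into their ASCII forms.
    text = unicodedata.normalize("NFKC", title).lower()
    text = WEAK_PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def stable_id(canonical_url: Optional[str], source: str,
              title_norm: str, date: str) -> str:
    # Prefer the canonical URL; otherwise fall back to source, title, and date.
    key = canonical_url or f"{source}|{title_norm}|{date}"
    return "item_" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```

Hashing the canonical URL keeps IDs stable across reruns, which is what makes replay from a saved run directory meaningful.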

Non-goals

  • Do not dedupe.
  • Do not rewrite content.
  • Do not call the LLM.
  • Do not remove items based on importance.

Artifacts

normalized_items.json
stage1_normalize_report.json

Stage 2: Hard Dedup

Goal

Remove only high-confidence duplicates with deterministic rules. Mark uncertain similarities for Stage 3.

Rules

High-confidence removal:

  • Same canonical URL.
  • Same normalized title.
  • Same platform entity, such as the same X status ID.
  • Same source and same exact normalized title.

Uncertain cases:

  • Similar title but different URL.
  • Same company or model, but unclear whether the event is identical.
  • Same topic across multiple sources with different factual details.

Record uncertain cases in possible_duplicates. Do not remove them at this stage.

Replacement for Current Logic

The current SequenceMatcher > 0.7 direct deletion is too risky. Replace it with:

  • Exact deterministic deletion.
  • Similarity-based candidate marking only.
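
A sketch of this split between deletion and marking, assuming items already carry id, title_norm, and canonical_url. The hard_dedup name is hypothetical. For brevity it keeps the first-seen item of each exact-match group rather than applying a keep score, and it covers only the URL and title rules, not platform entity IDs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def hard_dedup(items: list[dict], similarity_threshold: float = 0.7):
    kept: list[dict] = []
    removed: list[dict] = []
    seen: dict[tuple[str, str], dict] = {}
    for item in items:
        keys = [("title", item["title_norm"])]
        if item.get("canonical_url"):
            keys.append(("url", item["canonical_url"]))
        match = next((seen[k] for k in keys if k in seen), None)
        if match is None:
            kept.append(item)
            match = item
        else:
            # Deterministic deletion: exact key match only.
            removed.append({"id": item["id"], "kept_id": match["id"],
                            "reason": "same canonical URL or normalized title"})
        for key in keys:
            seen.setdefault(key, match)
    # Similar titles are only marked as candidates for Stage 3, never deleted.
    possible = []
    for a, b in combinations(kept, 2):
        ratio = SequenceMatcher(None, a["title_norm"], b["title_norm"]).ratio()
        if ratio > similarity_threshold:
            possible.append({"ids": [a["id"], b["id"]], "ratio": round(ratio, 3)})
    return kept, removed, possible
```

The pairwise comparison is quadratic in the number of kept items, which should be acceptable for a daily report but would need blocking or bucketing for much larger inputs.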

Keep Item Selection

When merging a duplicate group, choose which item to keep using a local score:

official source bonus
+ primary source bonus
+ source priority
+ has URL
+ has summary
+ has section hint
+ newer published_at
- quality flag penalty

Attach removed items to duplicate_sources on the kept item.
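
One possible shape for this scoring, as a sketch only. All weights are hypothetical, is_official is an assumed field that NewsItem does not define, and a lower source_priority number is assumed to mean higher priority (AI HOT is 10, Juya is 20). Recency is used as a tie-breaker and assumes ISO 8601 timestamps with the same offset so that string comparison works.

```python
def keep_score(item: dict) -> float:
    score = 0.0
    score += 30 if item.get("is_official") else 0  # hypothetical field
    score += 20 if item.get("source_role") == "primary" else 0
    # Assumption: lower priority numbers rank higher.
    score += max(0, 100 - item.get("source_priority", 100)) / 10
    score += 5 if item.get("url") else 0
    score += 5 if item.get("summary_raw") else 0
    score += 2 if item.get("section_hint") else 0
    score -= 5 * len(item.get("quality_flags", []))
    return score


def merge_group(group: list[dict]) -> dict:
    # Highest score wins; newer published_at breaks ties.
    ranked = sorted(
        group,
        key=lambda it: (keep_score(it), it.get("published_at") or ""),
        reverse=True,
    )
    kept, dropped = ranked[0], ranked[1:]
    merged = dict(kept)
    merged["duplicate_sources"] = list(kept.get("duplicate_sources", [])) + [
        {"id": it["id"], "source_label": it.get("source_label"), "url": it.get("url")}
        for it in dropped
    ]
    return merged
```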

Artifacts

deduped_items.json
stage2_dedupe_report.json

Stage 3: Semantic Dedup

Goal

Use the LLM to identify semantic duplicates that deterministic rules cannot safely remove.

Principles

  • The LLM judges duplicate candidates; local code enforces safety.
  • The LLM must not select, curate, or remove items by importance.
  • Remove only duplicate groups with confidence = high.
  • Treat medium or uncertain results as non-removal.

Input

Prefer candidate groups from Stage 2. Avoid sending all items at once unless the item count is small.

Example item payload:

{
  "id": "item_123",
  "title": "...",
  "summary": "...",
  "source": "QbitAI",
  "url_host": "qbitai.com",
  "published_at": "...",
  "section_hint": "Company and Capital"
}

Output Schema

{
  "duplicate_groups": [
    {
      "keep_id": "item_123",
      "remove_ids": ["item_456"],
      "confidence": "high",
      "reason": "Both items report the same concrete event."
    }
  ],
  "not_duplicates": [],
  "uncertain": []
}

Safety Checks

  • Validate all IDs exist.
  • Validate confidence values.
  • Apply local keep-item scoring instead of blindly trusting keep_id.
  • Skip deletion if the deletion ratio exceeds a configured threshold.
  • Skip deletion when versions, product names, or dates conflict.
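
These checks might be enforced locally along the following lines. The function names and the default max_removal_ratio of 0.3 are hypothetical, the version comparison is a crude heuristic over title_raw, and re-scoring the kept item locally is left out of the sketch.

```python
import re

VERSION = re.compile(r"\d+(?:\.\d+)*")


def versions_conflict(a: dict, b: dict) -> bool:
    # Crude heuristic: both titles mention numbers and the sets differ.
    va = set(VERSION.findall(a.get("title_raw", "")))
    vb = set(VERSION.findall(b.get("title_raw", "")))
    return bool(va and vb and va != vb)


def safe_removals(llm_output: dict, items_by_id: dict[str, dict],
                  max_removal_ratio: float = 0.3) -> dict[str, str]:
    """Return {removed_id: reason} for groups that pass every local check."""
    removals: dict[str, str] = {}
    for group in llm_output.get("duplicate_groups", []):
        if group.get("confidence") != "high":
            continue  # medium or uncertain means no removal
        keep_id, remove_ids = group.get("keep_id"), group.get("remove_ids") or []
        ids = [keep_id, *remove_ids]
        if len(set(ids)) != len(ids) or any(i not in items_by_id for i in ids):
            continue  # unknown or repeated IDs
        if any(versions_conflict(items_by_id[keep_id], items_by_id[r])
               for r in remove_ids):
            continue
        for rid in remove_ids:
            removals[rid] = group.get("reason", "")
    # Refuse the whole batch if the LLM wants to delete too much.
    if items_by_id and len(removals) / len(items_by_id) > max_removal_ratio:
        return {}
    return removals
```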

Failure Behavior

If a timeout, JSON parse failure, or schema validation failure occurs, keep the Stage 2 output and continue.

Artifacts

semantic_dedup_input.json
semantic_dedup_output.json
stage3_semantic_dedup_report.json

Stage 4: Rewrite Titles and Summaries

Goal

Produce concise, accurate Chinese display titles and summaries.

Rules

  • Keep title_raw and summary_raw unchanged.
  • Write display fields to title and summary.
  • Preserve brand names, model names, API names, and common technical acronyms in English.
  • Translate the rest into natural Chinese.
  • Avoid marketing words such as "heavyweight", "explosive", or "just now" unless they are factual and necessary.
  • Summaries should be factual, concise, and usually 80-140 Chinese characters.
  • Do not add facts not present in the raw title or summary.
  • Do not write advice or commentary.

Batch Strategy

  • Process 8-12 items per batch.
  • Allow limited parallel batches.
  • Retry a failed batch once.
  • Fall back per item or per batch if needed.

Validation

Check:

  • Non-empty title and summary.
  • No markdown links in title.
  • No URL in summary.
  • No [N] or reference markers.
  • No emoji.
  • Summary length under limit.
  • Key numbers, versions, and model names are preserved when present in raw input.
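
The checks above could be expressed as a function that returns a list of problem codes. This is a sketch: the problem codes and the 140-character limit are assumptions (the plan only says "usually 80-140"), the emoji range is approximate, and the preservation check only looks at dotted version numbers.

```python
import re

MAX_SUMMARY_CHARS = 140  # assumption: upper end of the suggested range
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # approximate
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")
DOTTED_VERSION = re.compile(r"\d+(?:\.\d+)+")


def validate_rewrite(item: dict) -> list[str]:
    problems: list[str] = []
    title = (item.get("title") or "").strip()
    summary = (item.get("summary") or "").strip()
    display = title + summary
    if not title:
        problems.append("empty_title")
    if not summary:
        problems.append("empty_summary")
    if MARKDOWN_LINK.search(title):
        problems.append("markdown_link_in_title")
    if re.search(r"https?://", summary):
        problems.append("url_in_summary")
    if re.search(r"\[\d+\]", display):
        problems.append("reference_marker")
    if EMOJI.search(display):
        problems.append("emoji")
    if len(summary) > MAX_SUMMARY_CHARS:
        problems.append("summary_too_long")
    # Versions present in the raw fields should survive the rewrite.
    raw = (item.get("title_raw") or "") + (item.get("summary_raw") or "")
    for token in sorted(set(DOTTED_VERSION.findall(raw))):
        if token not in display:
            problems.append(f"missing_token:{token}")
    return problems
```

An empty list means the item passes; any other result can trigger the batch retry or the per-item fallback.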

Artifacts

rewritten_items.json
rewrite_llm_outputs.json
stage4_rewrite_report.json

Stage 5: Classify and Order

Goal

Place each item into a stable section and order items for readable scanning.

Use a fixed section whitelist:

模型与能力
产品与应用
开发与基础设施
公司与资本
政策与安全
论文与研究
观点与教程
人物与动态

Hide empty sections. Do not create dynamic section names.

Classification Strategy

Use a three-layer approach:

  1. Source hint mapping.
  2. Local rule fallback.
  3. LLM classification for ambiguous items only.

Example alias mapping:

模型发布/更新 -> 模型与能力
产品发布/更新 -> 产品与应用
产品与工具 -> 产品与应用
开发与工程 -> 开发与基础设施
行业动态 -> 公司与资本
行业与公司 -> 公司与资本
论文研究 -> 论文与研究
技巧与观点 -> 观点与教程
人物与花絮 -> 人物与动态
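
A sketch of the first layer and the fallback chain, using the whitelist and aliases listed in this section. The function names are hypothetical, and rule_fallback and llm_classify are placeholder callables for layers 2 and 3.

```python
from typing import Callable, Optional

SECTIONS = [
    "模型与能力", "产品与应用", "开发与基础设施", "公司与资本",
    "政策与安全", "论文与研究", "观点与教程", "人物与动态",
]

ALIASES = {
    "模型发布/更新": "模型与能力",
    "产品发布/更新": "产品与应用",
    "产品与工具": "产品与应用",
    "开发与工程": "开发与基础设施",
    "行业动态": "公司与资本",
    "行业与公司": "公司与资本",
    "论文研究": "论文与研究",
    "技巧与观点": "观点与教程",
    "人物与花絮": "人物与动态",
}

Layer = Callable[[dict], Optional[str]]


def section_from_hint(hint: Optional[str]) -> Optional[str]:
    if not hint:
        return None
    hint = hint.strip()
    if hint in SECTIONS:
        return hint
    return ALIASES.get(hint)  # None means "ambiguous, try the next layer"


def classify(item: dict, rule_fallback: Layer, llm_classify: Layer) -> Optional[str]:
    layers: list[Layer] = [
        lambda it: section_from_hint(it.get("section_hint")),
        rule_fallback,
        llm_classify,
    ]
    for layer in layers:
        section = layer(item)
        # Only whitelisted names are accepted, whichever layer produced them.
        if section in SECTIONS:
            return section
    return None
```

Checking every layer's answer against the whitelist is what prevents the LLM from introducing dynamic section names.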

Ordering Strategy

Do not let the LLM freely order all items. Use local scoring:

rank_score =
  source priority
  + official source bonus
  + primary source bonus
  + recency score
  + key metric bonus
  + duplicate source bonus
  - quality flag penalty

Ordering is for readability only. It must not remove items.

Artifacts

classified_items.json
stage5_classify_order_report.json

Stage 6: Guide and Daily Threads

Goal

Generate a concise top guide and a bottom "daily threads" section that helps readers understand the day's shape without turning the report into an investment memo.

Replace Current Summary Style

Do not use:

强信号 / 中信号 / 待验证

This style feels too much like an industry rating or investment brief.

Use:

导览
今日脉络
仍待确认, when needed

Output Schema

The LLM should output structured JSON, not Markdown:

{
  "theme": "One concise daily theme.",
  "threads": [
    {
      "title": "模型能力继续向长上下文、实时语音、多模态生成推进",
      "text": "MiniMax M3、Miso One、Ideogram v4.0 分别从长上下文解码、语音克隆和图像生成质量上更新能力边界。",
      "item_ids": ["item_1", "item_2", "item_3"],
      "kind": "thread"
    },
    {
      "title": "仍待确认",
      "text": "融资传闻、排行榜和单源爆料类消息需要等待官方或更多来源确认。",
      "item_ids": ["item_8"],
      "kind": "uncertain"
    }
  ]
}

Rules

  • Theme should be one paragraph under 120 Chinese characters.
  • The threads array should contain 2-4 entries.
  • Each thread must bind to existing item_ids.
  • Do not add facts absent from the item list.
  • Do not write advice.
  • Do not include reference numbers.
  • Do not include Markdown blockquote syntax. Stage 7 will render Markdown.

Failure Behavior

  • If theme generation fails, omit the guide or use a conservative fallback.
  • If threads fail, omit 今日脉络.
  • Drop any thread that references an item ID not present in the item list.
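
The rules and failure behavior for Stage 6 might be applied to the LLM output like this. It is a sketch with a hypothetical clean_guide helper; it omits an over-long theme rather than truncating it, which is one of the two fallbacks the plan allows.

```python
def clean_guide(guide: dict, known_ids: set[str],
                max_theme_chars: int = 120, max_threads: int = 4) -> dict:
    theme = (guide.get("theme") or "").strip()
    # "Under 120 characters": anything longer is treated as a failed theme.
    if not theme or len(theme) >= max_theme_chars:
        theme = None
    threads = [
        t for t in guide.get("threads", [])
        # A thread must bind to at least one item, and every ID must exist.
        if t.get("item_ids") and all(i in known_ids for i in t["item_ids"])
    ]
    return {"theme": theme, "threads": threads[:max_threads]}
```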

Artifacts

guide_input.json
guide_output.json
stage6_guide_report.json

Stage 7: Assemble and Validate Markdown

Goal

Render final Markdown deterministically and validate it before publishing.

Target layout:

## 导览

> 一句话主线。

## 模型与能力

**1. 新闻标题**

> 新闻摘要。[来源 ↗](https://example.com)

## 今日脉络

- **主题**
  说明...

Rendering Rules

  • Render Markdown in code only.
  • Use global continuous numbering.
  • Hide empty sections.
  • Add blockquote syntax for the guide in code.
  • Strip any leading > from LLM-provided theme text before rendering.
  • Use source links consistently:
[OpenAI: Blog ↗](https://example.com)

If URL is unavailable, render the source label without a link.
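
A sketch of a deterministic renderer that follows the target layout and the rendering rules above. The render_report signature is hypothetical: sections is assumed to be an ordered list of (section name, items) pairs, and each item is assumed to carry title, summary, source_label, and url.

```python
from typing import Optional


def render_report(theme: Optional[str],
                  sections: list[tuple[str, list[dict]]],
                  threads: list[dict]) -> str:
    lines: list[str] = []
    if theme:
        # Strip any leading ">" from LLM text, then add the blockquote in code.
        lines += ["## 导览", "", "> " + theme.lstrip("> ").strip(), ""]
    number = 0  # global continuous numbering across sections
    for name, items in sections:
        if not items:
            continue  # hide empty sections
        lines += [f"## {name}", ""]
        for item in items:
            number += 1
            label = item["source_label"]
            source = f"[{label} ↗]({item['url']})" if item.get("url") else label
            lines += [f"**{number}. {item['title']}**", "",
                      f"> {item['summary']}{source}", ""]
    if threads:
        lines += ["## 今日脉络", ""]
        for thread in threads:
            lines += [f"- **{thread['title']}**", f"  {thread['text']}"]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```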

Auto-fixes

  • Collapse doubled blockquote markers (> >) into a single >.
  • Remove [N] and numeric reference markers.
  • Remove code fences from guide/thread text.
  • Normalize extra blank lines.
  • Add missing Chinese punctuation to summaries.
  • Remove 主线判断: prefixes if present.
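
Several of these auto-fixes reduce to small regular-expression passes. The sketch below, with a hypothetical auto_fix helper, covers four of them (doubled blockquotes, reference markers, the 主线判断 prefix, and extra blank lines); code-fence removal and punctuation repair are left out.

```python
import re


def auto_fix(markdown: str) -> str:
    # Collapse "> >" at the start of a line into a single blockquote marker.
    text = re.sub(r"^(>[ \t]*){2,}", "> ", markdown, flags=re.M)
    # Remove numeric reference markers such as [3].
    text = re.sub(r"\[\d+\]", "", text)
    # Remove the "主线判断:" prefix with either an ASCII or full-width colon.
    text = re.sub(r"主线判断[:：][ \t]*", "", text)
    # Normalize runs of blank lines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Using [ \t] instead of \s keeps each substitution on a single line, so two adjacent blockquote lines are never merged by accident.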

Blocking Checks

Block publish or downgrade to draft when:

  • Item count is zero.
  • No sections are rendered.
  • Markdown is abnormally short.
  • Section name is outside the whitelist.
  • JSON fragments remain in Markdown.
  • Link formatting is broadly broken.
  • Forbidden advisory language appears in guide/thread text.
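
A sketch of how part of this gate could look. The error codes, the min_chars default, and the JSON-fragment pattern are assumptions, and the link-format and advisory-language checks are not covered here.

```python
import re

SECTIONS = {
    "模型与能力", "产品与应用", "开发与基础设施", "公司与资本",
    "政策与安全", "论文与研究", "观点与教程", "人物与动态",
}
NON_ITEM_HEADINGS = {"导览", "今日脉络"}
# Heuristic: a quoted key followed by a colon and a JSON-looking value.
JSON_FRAGMENT = re.compile(r'"\w+"\s*:\s*["\[{\d]')


def blocking_errors(markdown: str, item_count: int, min_chars: int = 200) -> list[str]:
    errors: list[str] = []
    if item_count == 0:
        errors.append("no_items")
    headings = [h.strip() for h in re.findall(r"^## (.+)$", markdown, flags=re.M)]
    sections = [h for h in headings if h not in NON_ITEM_HEADINGS]
    if not sections:
        errors.append("no_sections")
    errors += [f"unknown_section:{h}" for h in sections if h not in SECTIONS]
    if len(markdown) < min_chars:
        errors.append("too_short")
    if JSON_FRAGMENT.search(markdown):
        errors.append("json_fragment")
    return errors
```

A non-empty result would block publishing or downgrade the run to draft; Stage 8 only needs to read this list from the Stage 7 report.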

Artifacts

blog_markdown.md
stage7_markdown_report.json

Stage 8: Publish and Deliver

Goal

Publish only validated Markdown, verify the public page, and make the operation idempotent and recoverable.

Modes

dry-run
draft
publish

Requirements

  • Do not publish when Stage 7 has blocking errors.
  • Use a deterministic slug such as ai-YYYY-MM-DD.
  • Check whether the slug already exists before creating a new post.
  • Support existence strategies:
    • skip
    • update-draft
    • replace
    • republish
  • Verify the public URL with retries.
  • Preserve Markdown and reports when publishing fails.
  • Support publishing from an existing run directory.

Artifacts

stage8_publish_report.json
run_report.json

Run Directory

Every run should write to an isolated directory:

runs/2026-06-04/
  source_results.json
  raw_items.json
  stage0_collect_report.json
  normalized_items.json
  stage1_normalize_report.json
  deduped_items.json
  stage2_dedupe_report.json
  semantic_dedup_input.json
  semantic_dedup_output.json
  stage3_semantic_dedup_report.json
  rewritten_items.json
  rewrite_llm_outputs.json
  stage4_rewrite_report.json
  classified_items.json
  stage5_classify_order_report.json
  guide_input.json
  guide_output.json
  stage6_guide_report.json
  blog_markdown.md
  stage7_markdown_report.json
  stage8_publish_report.json
  run_report.json

This makes the pipeline replayable and debuggable.

CLI

Provide agent-friendly commands:

ai-daily-report run --date today --mode publish
ai-daily-report run --date today --mode dry-run
ai-daily-report run --date 2026-06-04 --mode draft
ai-daily-report replay --run-id 2026-06-04 --from-stage 4
ai-daily-report publish --from-run 2026-06-04
ai-daily-report status --date 2026-06-04

The current cron job can keep invoking the compatibility script, which should delegate to the CLI.
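
The command surface above maps naturally onto argparse subcommands. This is a sketch of what cli.py might contain; the defaults (date "today", mode "dry-run") are assumptions, chosen so that an accidental bare run never publishes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ai-daily-report")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run the full pipeline")
    run.add_argument("--date", default="today")
    run.add_argument("--mode", choices=["dry-run", "draft", "publish"],
                     default="dry-run")  # assumption: safe default

    replay = sub.add_parser("replay", help="re-run from a saved stage")
    replay.add_argument("--run-id", required=True)
    replay.add_argument("--from-stage", type=int, choices=range(0, 9),
                        required=True)  # stages 0 through 8

    publish = sub.add_parser("publish", help="publish an existing run")
    publish.add_argument("--from-run", required=True)

    status = sub.add_parser("status", help="show the state of a run")
    status.add_argument("--date", required=True)
    return parser
```

The compatibility script can then call build_parser().parse_args([...]) with fixed arguments and hand the result to the same dispatcher the CLI uses.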

Skill Strategy

Create or update an ai-daily-report skill for Hermes/OpenClaw. The skill should not contain business logic. It should provide:

  • How to run daily generation.
  • How to dry-run.
  • How to replay from an existing run.
  • How to publish already generated Markdown.
  • How to diagnose source, LLM, Markdown, or publish failures.
  • How to add a new RSS source.
  • How to adjust output style without breaking the pipeline.

Suggested skill references:

skill/references/sources.md
skill/references/output-style.md
skill/references/troubleshooting.md
skill/references/llm-config.md

Testing

Add fixtures and tests for:

  • AI HOT sample parsing.
  • RSS parsing.
  • Juya content:encoded parsing.
  • URL canonicalization.
  • Title normalization.
  • Deterministic deduplication.
  • LLM JSON schema validation.
  • Rewrite output validation.
  • Section alias mapping.
  • Markdown rendering.
  • Markdown validation.
  • Publish dry-run behavior.

Start with local fixture tests. They will give most of the stability benefit without needing live network calls.

Migration Plan

Phase 1: Stabilize Current Script

  • Add run directories.
  • Add SourceResult and stage reports.
  • Add URL canonicalization.
  • Replace the risky script-level dedupe (the old "Stage 0") with hard dedup.
  • Add Markdown validation and auto-fixes.

Phase 2: Improve Quality

  • Add semantic dedup schema and safety checks.
  • Batch rewrite title and summary.
  • Add section alias mapping and rule-first classification.
  • Replace the current summary with 今日脉络.

Phase 3: Modularize

  • Extract modules under ai_daily_report/.
  • Add CLI.
  • Keep old script as compatibility entrypoint.
  • Add fixture tests.

Phase 4: Skill Integration

  • Update skill/SKILL.md.
  • Add references for sources, style, troubleshooting, and LLM config.
  • Make Hermes/OpenClaw call the CLI.

Success Criteria

The optimized pipeline should satisfy:

  • A usable Markdown report is generated whenever enough source data exists.
  • Optional source failures degrade the run but do not stop it.
  • LLM failures degrade individual stages but do not destroy the whole report.
  • No non-duplicate item is removed by importance or editorial selection.
  • Every removed duplicate has a reason.
  • Every stage writes inspectable artifacts.
  • A failed publish can be retried from an existing run.
  • Agents can run, diagnose, replay, and publish via stable commands.