first commit

2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions
--- a/content-ops/blog-review-workflow/SKILL.md
+++ b/content-ops/blog-review-workflow/SKILL.md
@@ -0,0 +1,96 @@
+---
+name: blog-review-workflow
+description: Iterative blog review using subagents — write, review, fix, re-review until quality threshold met.
+version: 1.0.0
+author: Hermes Agent
+license: MIT
+metadata:
+  hermes:
+    tags: [blog, review, subagent, quality, content-ops]
+---
+
+# Blog Review Workflow
+
+Use this workflow when publishing a blog post that requires quality assurance. The pattern is: write → subagent review → fix → subagent re-review → micro-adjust → publish.
+
+## When to Use
+- Blog posts based on external data/evaluations (must verify factual accuracy)
+- Posts where the user explicitly asks for quality review
+- Any post where accuracy and fairness matter (comparisons, reviews, analyses)
+
+## Workflow
+
+### Step 1: Write Draft
+Write the blog post, save it locally, and publish it as a draft via the content-ops-agent API.
+
+### Step 2: First Subagent Review
+Delegate to a subagent with NO conversation context — it should only read the source data and the blog draft.
+
+**Critical: The subagent must clone/read the original data source independently.** Do not pass the data through context — let the subagent verify facts against the ground truth.
+
+Review dimensions:
+1. **Factual accuracy** — data, rankings, conclusions match source?
+2. **Analysis depth** — original insights vs just rephrasing?
+3. **Logical coherence** — flow, no contradictions?
+4. **Technical accuracy** — domain concepts correct?
+5. **Readability** — accessible to target audience?
+6. **Fairness** — balanced treatment of all subjects?
+7. **Completeness** — important info not omitted?
+
+Output format:
+- Overall score (1-10)
+- Per-dimension scores
+- Specific issue list (with line numbers/quotes)
+- Actionable fix suggestions
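
The output format above can be captured as a small data structure so a malformed review is rejected before any fixes are applied. This is only a sketch: the names `Issue`, `ReviewReport`, `DIMENSIONS`, and `validate` are hypothetical and not part of the workflow or of any subagent API.

```python
from dataclasses import dataclass, field

# The seven review dimensions listed in Step 2 (identifier spellings are assumptions).
DIMENSIONS = (
    "factual_accuracy", "analysis_depth", "logical_coherence",
    "technical_accuracy", "readability", "fairness", "completeness",
)


@dataclass
class Issue:
    dimension: str  # one of DIMENSIONS
    quote: str      # line number or quoted passage from the draft
    fix: str        # actionable fix suggestion


@dataclass
class ReviewReport:
    overall: float                 # overall score, 1-10
    scores: dict[str, float]       # per-dimension scores, 1-10
    issues: list[Issue] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the report does not follow the output format."""
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimension scores: {missing}")
        for name, value in [("overall", self.overall), *self.scores.items()]:
            if not 1 <= value <= 10:
                raise ValueError(f"{name} score {value} is outside 1-10")
        for issue in self.issues:
            if issue.dimension not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {issue.dimension}")
```

Calling `validate()` on the parsed subagent output before Step 3 makes it obvious when a review skipped a dimension or returned a score outside the scale.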
+
+### Step 3: Fix Based on Review
+Apply fixes. Common patterns:
+- Factual errors → correct data, add caveats
+- Depth issues → add original analysis frameworks (taxonomy, cost/perf, etc.)
+- Fairness issues → equal treatment of all subjects (don't soften one while harshening another)
+- Missing content → add overlooked but important findings
+
+### Step 4: Second Subagent Review
+Re-review with focus on:
+- Are the N issues from round 1 fixed?
+- Any NEW issues introduced?
+- Overall quality improvement?
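
The three re-review questions above amount to a set comparison between the two rounds. The sketch below assumes each issue has been given a stable identifier that is reused across rounds; the identifiers and the `compare_rounds` helper are hypothetical.

```python
def compare_rounds(round1: set[str], round2: set[str]) -> dict[str, list[str]]:
    """Split issues into fixed, still open, and newly introduced."""
    return {
        "fixed": sorted(round1 - round2),       # reported in round 1, gone in round 2
        "still_open": sorted(round1 & round2),  # reported in both rounds
        "new": sorted(round2 - round1),         # first reported in round 2
    }


# Hypothetical issue identifiers, for illustration only.
result = compare_rounds(
    {"opus-sample-size", "api-double-standard", "missing-fatalerror"},
    {"opus-sample-size", "se-proposal-number"},
)
print(result)
```

A non-empty `new` list is the signal that the fixes themselves introduced problems and another pass may be needed.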
+
+### Step 5: Micro-adjustments
+Fix any remaining low-priority issues from round 2. Update the draft.
+
+### Step 6: Confirm with User
+Present the review results and ask if they want to publish or make further changes.
+
+## Pitfalls
+
+### sed for content insertion can duplicate
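+When using `sed` to insert content at a pattern match, remember that the insertion happens at every line the pattern matches, so a marker that appears more than once produces duplicated content. Use Python for complex content modifications instead:
+```python
+# Insert new_block after the marker, but only when the marker is unambiguous.
+# Assumes `content` (the post text) and `new_block` (text to add) are already defined.
+marker = "### Target Section"
+if content.count(marker) != 1:
+    raise ValueError(f"expected exactly one {marker!r}, found {content.count(marker)}")
+head, tail = content.split(marker, 1)
+content = head + marker + "\n" + new_block + tail
+```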
+
+### Subagent file access
+The subagent needs terminal access to clone repos and use curl. Always include `terminal` and `file` in its toolsets. If the blog uses an API, include the `web` toolset as well.
+
+### Community feedback as review signal
+When reviewing blog posts that reference external content, the original source's comment section may be inaccessible (e.g., WeChat requires login). Instead, gather community feedback from:
+- **GitHub API**: `curl https://api.github.com/repos/OWNER/REPO` → stars, forks, issues
+- **mmx search**: `"topic" 评价 OR 反馈 OR 体验 OR 用过` (that is, reviews OR feedback OR experience OR "have used") across platforms
+- **GitHub issues**: specific bug reports or feature requests that reveal user pain points
+
+This data enriches the "Fairness" and "Completeness" review dimensions.
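
As a rough illustration of the GitHub API bullet, the sketch below fetches the repository metadata and keeps only the popularity signals. The endpoint and the `stargazers_count`, `forks_count`, and `open_issues_count` fields belong to the public GitHub REST API; the function names are hypothetical, and unauthenticated requests are rate-limited.

```python
import json
import urllib.request


def summarize_repo(payload: dict) -> dict:
    """Pick the popularity signals out of a GitHub repository API response."""
    return {
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
        # GitHub counts open pull requests in this number as well.
        "open_issues": payload["open_issues_count"],
    }


def fetch_repo_summary(owner: str, repo: str) -> dict:
    """Fetch repository metadata from the public GitHub REST API (no auth)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return summarize_repo(json.load(response))
```

Keeping the parsing in `summarize_repo` separate from the network call means the reviewer can check the extraction logic against a saved response without hitting the API.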
+
+### Token security
+Never hardcode the service token in the subagent task description. Instead, tell the subagent to use environment variables or read from a known location.
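
One way to follow this rule is to refer to the token by variable name only and to check the task description before delegating. The variable name `CONTENT_OPS_TOKEN` and both helper functions are assumptions made for this sketch, not names defined by the content-ops-agent API.

```python
import os

TOKEN_ENV_VAR = "CONTENT_OPS_TOKEN"  # hypothetical variable name


def build_task_description(draft_url: str) -> str:
    """Describe the review task, naming the variable instead of the token."""
    return (
        f"Review the blog draft at {draft_url}. "
        f"Authenticate with the token stored in the ${TOKEN_ENV_VAR} "
        "environment variable."
    )


def assert_no_secret(text: str) -> None:
    """Raise ValueError if the token value itself appears in the text."""
    token = os.environ.get(TOKEN_ENV_VAR)
    if token and token in text:
        raise ValueError("task description contains the service token")
```

Running `assert_no_secret(build_task_description(...))` just before delegation catches the case where a token was pasted into the prompt by accident.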
+
+## Quality Thresholds
+- **≥ 8.0**: Ready to publish
+- **7.0-7.9**: Minor fixes needed
+- **6.0-6.9**: Significant rework required
+- **< 6.0**: Major rewrite needed
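
The thresholds above map directly onto a small lookup. This is a minimal sketch: the function name `verdict` is hypothetical, and scores that fall between the listed bands (for example 7.95) are assigned to the lower band, which is an assumption the list does not state.

```python
def verdict(score: float) -> str:
    """Map an overall review score (1-10) to the action the thresholds call for."""
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} is outside the 1-10 scale")
    if score >= 8.0:
        return "Ready to publish"
    if score >= 7.0:
        return "Minor fixes needed"
    if score >= 6.0:
        return "Significant rework required"
    return "Major rewrite needed"


print(verdict(6.5))  # first-round score from the Reference section
print(verdict(8.2))  # second-round score from the Reference section
```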
+
+## Reference
+This workflow was developed while reviewing a blog post that evaluated 6 AI models' iOS development capabilities. The first review scored 6.5/10 with 21 issues. After fixes, the second review scored 8.2/10 with only 3 low-priority issues remaining.
--- a/content-ops/blog-review-workflow/references/ai-model-blog-review-example.md
+++ b/content-ops/blog-review-workflow/references/ai-model-blog-review-example.md
@@ -0,0 +1,35 @@
+# Example: AI Model Evaluation Blog Post Review
+
+## Context
+Blog post titled "6款AI模型iOS开发能力深度评测" ("In-Depth Evaluation of 6 AI Models' iOS Development Capabilities"), based on @solidus's evaluation data.
+
+## First Review (6.5/10) — Key Issues Found
+
+### Critical Factual Errors
+1. **Opus scoring misleading**: 95/100 based on only 8 core practical questions, while other models scored on 84 questions. Placed in same table without caveat.
+2. **"Two evaluation systems" described as three**: The heading said "两套" ("two sets") but the post listed three.
+3. **GLM highest main score but ranked 3rd**: No explanation of why (XII pressure test only 79 vs Sonnet 87).
+
+### Fairness Issues
+4. **Double standard on API fabrication**: MiMo's fabricated `sending` syntax got bold text plus "最危险的失败模式" ("the most dangerous failure mode"), while Sonnet's fabricated iOS API got only "翻车" (a casual "flopped"). Fix: equal treatment.
+5. **Selective month-end drift comparison**: Only showed Opus (best) vs Kimi (worst), ignoring DeepSeek/GLM also solved it correctly.
+
+### Depth Issues
+6. **5 "deep analysis" questions were just rephrased** from the source report's summary section.
+7. **Scenario recommendations copied verbatim** from source report.
+
+### Missing Content
+8. Kimi's `fatalError` in production code (critical engineering flaw)
+9. GLM's CSV export syntax error (won't compile)
+10. Sonnet's TWO failures in graphics test (API fabrication + ACES formula)
+
+## Second Review (8.2/10) — Remaining Low-Priority Issues
+1. SE proposal number reference (SE-0371 vs SE-0427)
+2. Opus 95-score description could be more precise
+3. Missing "legacy Swift 5 project" recommendation scenario
+
+## Lessons Learned
+- Always add caveats when comparing scores with different sample sizes
+- Equal treatment: if you harshly criticize one model for X, do the same for all models that did X
+- Original analysis frameworks (failure mode taxonomy, cost/perf analysis) add genuine depth
+- Subagent review with NO context forces independent verification against source data