first commit

Hermes Agent
2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions

@@ -0,0 +1,96 @@
---
name: blog-review-workflow
description: Iterative blog review using subagents — write, review, fix, re-review until quality threshold met.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [blog, review, subagent, quality, content-ops]
---
# Blog Review Workflow
Use this workflow when publishing a blog post that requires quality assurance. The pattern is: write → subagent review → fix → subagent re-review → micro-adjust → confirm with user → publish.
## When to Use
- Blog posts based on external data/evaluations (must verify factual accuracy)
- Posts where the user explicitly asks for quality review
- Any post where accuracy and fairness matter (comparisons, reviews, analyses)
## Workflow
### Step 1: Write Draft
Write the blog post, save it locally, and publish it as a draft via the content-ops-agent API.
### Step 2: First Subagent Review
Delegate to a subagent with NO conversation context — it should only read the source data and the blog draft.
**Critical: The subagent must clone/read the original data source independently.** Do not pass the data through context — let the subagent verify facts against the ground truth.
Review dimensions:
1. **Factual accuracy** — data, rankings, conclusions match source?
2. **Analysis depth** — original insights vs just rephrasing?
3. **Logical coherence** — flow, no contradictions?
4. **Technical accuracy** — domain concepts correct?
5. **Readability** — accessible to target audience?
6. **Fairness** — balanced treatment of all subjects?
7. **Completeness** — important info not omitted?
Output format:
- Overall score (1-10)
- Per-dimension scores
- Specific issue list (with line numbers/quotes)
- Actionable fix suggestions
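The review dimensions and output format above can be captured in a small data structure so that each review round returns the same shape. This is a minimal sketch, not part of the workflow itself: the names `DIMENSIONS`, `Issue`, and `ReviewReport` are hypothetical, and the 1-10 range check assumes per-dimension scores use the same scale as the overall score.

```python
from dataclasses import dataclass, field

# The seven review dimensions listed above, as hypothetical snake_case keys.
DIMENSIONS = (
    "factual_accuracy",
    "analysis_depth",
    "logical_coherence",
    "technical_accuracy",
    "readability",
    "fairness",
    "completeness",
)


@dataclass
class Issue:
    dimension: str      # one of DIMENSIONS
    location: str       # line number or quoted passage from the draft
    suggestion: str     # actionable fix
    priority: str = "normal"


@dataclass
class ReviewReport:
    overall: float
    scores: dict[str, float]
    issues: list[Issue] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if a dimension is missing or a score is off-scale."""
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimension scores: {missing}")
        for name, value in {"overall": self.overall, **self.scores}.items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} score {value} is outside 1-10")
```

Calling `validate()` on the parsed subagent output before acting on it catches a review that silently skipped a dimension.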
### Step 3: Fix Based on Review
Apply fixes. Common patterns:
- Factual errors → correct data, add caveats
- Depth issues → add original analysis frameworks (taxonomy, cost/perf, etc.)
- Fairness issues → equal treatment of all subjects (don't soften one while harshening another)
- Missing content → add overlooked but important findings
### Step 4: Second Subagent Review
Re-review with focus on:
- Are the N issues from round 1 fixed?
- Any NEW issues introduced?
- Overall quality improvement?
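The first two questions reduce to a set comparison, provided each issue has a stable identifier across rounds. The sketch below assumes exactly that; the identifiers and the `compare_rounds` helper are hypothetical, and in practice the second reviewer may need to map its findings back to round 1 issues by quote or location.

```python
def compare_rounds(round1_ids, round2_ids):
    """Classify issues as fixed, unresolved, or newly introduced."""
    first, second = set(round1_ids), set(round2_ids)
    return {
        "fixed": sorted(first - second),       # in round 1 only
        "unresolved": sorted(first & second),  # still present in round 2
        "new": sorted(second - first),         # introduced by the fixes
    }
```

A non-empty `new` list is the signal that the fixes in Step 3 introduced regressions.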
### Step 5: Micro-adjustments
Fix any remaining low-priority issues from round 2. Update the draft.
### Step 6: Confirm with User
Present the review results and ask if they want to publish or make further changes.
## Pitfalls
### sed for content insertion can duplicate
When using `sed` to insert content at a pattern match, be aware that if the pattern matches multiple times, the insertion will happen at each match. Use Python for complex content modifications instead:
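```python
# Insert after a heading only when it appears exactly once (avoids sed-style duplicates)
content = "# Post\n### Target Section\nExisting body\n"  # example input
marker = "### Target Section"
addition = "\nInserted paragraph.\n"
if content.count(marker) != 1:
    raise ValueError(f"expected exactly one {marker!r}, found {content.count(marker)}")
head, tail = content.split(marker, 1)
content = head + marker + addition + tail
```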
### Subagent file access
The subagent needs terminal access to clone repos and run curl. Always include `terminal` and `file` in its toolsets. If the blog uses an API, also include the `web` toolset.
### Community feedback as review signal
When reviewing blog posts that reference external content, the original source's comment section may be inaccessible (e.g., WeChat requires login). Instead, gather community feedback from:
- **GitHub API**: `curl https://api.github.com/repos/OWNER/REPO` → stars, forks, issues
- **mmx search**: `"topic" 评价 OR 反馈 OR 体验 OR 用过` across platforms (the Chinese terms mean "reviews", "feedback", "experience", and "have used")
- **GitHub issues**: specific bug reports or feature requests that reveal user pain points
This data enriches the "Fairness" and "Completeness" review dimensions.
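As a rough illustration of the GitHub API bullet, the repository endpoint returns a JSON object whose `stargazers_count`, `forks_count`, and `open_issues_count` fields carry the numbers mentioned above. The helper names below are hypothetical, and the sketch assumes unauthenticated access, which is subject to GitHub's rate limits.

```python
import json
import urllib.request


def summarize_repo(payload: dict) -> dict:
    """Pick the community-signal fields out of a repository API response."""
    return {
        "stars": payload.get("stargazers_count", 0),
        "forks": payload.get("forks_count", 0),
        # Note: GitHub counts open pull requests in this number as well.
        "open_issues": payload.get("open_issues_count", 0),
    }


def fetch_repo_summary(owner: str, repo: str, timeout: float = 10.0) -> dict:
    """Fetch https://api.github.com/repos/OWNER/REPO and summarize it."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_repo(json.load(response))
```

Keeping the parsing in `summarize_repo` separate from the network call makes it possible to check the logic against a saved response without hitting the API.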
### Token security
Never hardcode the service token in the subagent task description. Instead, tell the subagent to use environment variables or read from a known location.
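One way to follow this rule is to pass only the name of the environment variable in the task description and let the subagent resolve the value at run time. This is a sketch under assumptions: the variable name `CONTENT_OPS_TOKEN` and both helper functions are hypothetical and not taken from the content-ops-agent API.

```python
import os


def load_service_token(env_var: str = "CONTENT_OPS_TOKEN") -> str:
    """Read the token from the environment; fail loudly if it is absent."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before starting the subagent")
    return token


def build_task_description(draft_path: str, env_var: str = "CONTENT_OPS_TOKEN") -> str:
    """Reference the token by variable name only, never by value."""
    return (
        f"Review the draft at {draft_path}. "
        f"Read the service token from the ${env_var} environment variable "
        "and never print or log it."
    )
```

The task description then contains `$CONTENT_OPS_TOKEN` as text, so the secret itself never enters the subagent's prompt or any saved transcript.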
## Quality Thresholds
- **≥ 8.0**: Ready to publish
- **7.0-7.9**: Minor fixes needed
- **6.0-6.9**: Significant rework required
- **< 6.0**: Major rewrite needed
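The bands above translate directly into a lookup. The sketch below treats each band as half-open (for example, 7.0 up to but not including 8.0), which is an assumption made here so that scores such as 7.95 still fall into a band; the function name `verdict` is hypothetical.

```python
def verdict(score: float) -> str:
    """Map an overall review score (1-10) to the action it calls for."""
    if score >= 8.0:
        return "Ready to publish"
    if score >= 7.0:
        return "Minor fixes needed"
    if score >= 6.0:
        return "Significant rework required"
    return "Major rewrite needed"
```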
## Reference
This workflow was developed while reviewing a blog post that evaluated 6 AI models' iOS development capabilities. The first review scored 6.5/10 with 21 issues. After fixes, the second review scored 8.2/10 with only 3 low-priority issues remaining.

@@ -0,0 +1,35 @@
# Example: AI Model Evaluation Blog Post Review
## Context
Blog post titled "6款AI模型iOS开发能力深度评测" ("In-Depth Evaluation of 6 AI Models' iOS Development Capabilities") based on @solidus's evaluation data.
## First Review (6.5/10) — Key Issues Found
### Critical Factual Errors
1. **Opus scoring misleading**: 95/100 based on only 8 core practical questions, while other models scored on 84 questions. Placed in same table without caveat.
2. **"Two evaluation systems" described as three**: Title said "两套" ("two sets") but listed three.
3. **GLM highest main score but ranked 3rd**: No explanation of why (XII pressure test only 79 vs Sonnet 87).
### Fairness Issues
4. **Double standard on API fabrication**: MiMo's fabricated `sending` syntax got bold + "最危险的失败模式" ("the most dangerous failure mode"), while Sonnet's fabricated iOS API got only "翻车" (casual slang for "flopped"). Fix: equal treatment.
5. **Selective month-end drift comparison**: Only showed Opus (best) vs Kimi (worst), ignoring that DeepSeek and GLM also solved it correctly.
### Depth Issues
6. **5 "deep analysis" questions were just rephrased** from the source report's summary section.
7. **Scenario recommendations copied verbatim** from source report.
### Missing Content
8. Kimi's `fatalError` in production code (critical engineering flaw)
9. GLM's CSV export syntax error (won't compile)
10. Sonnet's TWO failures in graphics test (API fabrication + ACES formula)
## Second Review (8.2/10) — Remaining Low-Priority Issues
1. SE proposal number reference (SE-0371 vs SE-0427)
2. Opus 95-score description could be more precise
3. Missing "legacy Swift 5 project" recommendation scenario
## Lessons Learned
- Always add caveats when comparing scores with different sample sizes
- Equal treatment: if you harshly criticize one model for X, do the same for all models that did X
- Original analysis frameworks (failure mode taxonomy, cost/perf analysis) add genuine depth
- Subagent review with NO context forces independent verification against source data
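The first lesson can be turned into a mechanical check before a score table is published: flag every entry whose score rests on fewer questions than the largest sample in the table. This is a sketch only; `sample_size_caveats` is a hypothetical helper, and it assumes each result is available as a (score, question count) pair.

```python
def sample_size_caveats(results: dict[str, tuple[float, int]]) -> list[str]:
    """results maps a model name to (score, number of questions scored)."""
    if not results:
        return []
    largest = max(count for _, count in results.values())
    return [
        f"{name}: score {score:g} is based on {count} questions, not {largest}"
        for name, (score, count) in results.items()
        if count != largest
    ]
```

For the Opus case above (95/100 on 8 questions against 84 for the other models), the function would emit one caveat line for Opus that can be attached to the table as a footnote.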