agent-skills/content-ops/blog-review-workflow/references/ai-model-blog-review-example.md
Hermes Agent ccc63d1e70 first commit
2026-05-10 13:52:46 +08:00

Example: AI Model Evaluation Blog Post Review

Context

A blog post titled "6款AI模型iOS开发能力深度评测" ("In-Depth Evaluation of the iOS Development Capabilities of 6 AI Models"), based on @solidus's evaluation data.

First Review (6.5/10) — Key Issues Found

Critical Factual Errors

  1. Misleading Opus score: the 95/100 was based on only 8 core practical questions, while the other models were scored on 84 questions, yet all scores were placed in the same table without a caveat.
  2. "Two evaluation systems" miscounted: the heading said "两套" ("two sets"), but the text listed three.
  3. GLM had the highest main score but was ranked 3rd: the post did not explain why (GLM scored only 79 on the XII pressure test vs. Sonnet's 87).
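
The sample-size problem in item 1 can be caught mechanically before a score table is published. The following is a minimal sketch, not the workflow's actual tooling: `ScoreRow` and `flag_sample_size_mismatches` are hypothetical names, and only the Opus row (95/100 on 8 questions) and the 84-question baseline come from the review. The other rows use placeholder models and scores.

```python
from collections import Counter
from typing import NamedTuple


class ScoreRow(NamedTuple):
    model: str
    score: float
    n_questions: int


def flag_sample_size_mismatches(rows: list[ScoreRow]) -> list[str]:
    """Render one line per row, adding a caveat when the question count differs."""
    # Treat the most common question count as the table's baseline.
    baseline, _ = Counter(r.n_questions for r in rows).most_common(1)[0]
    lines = []
    for r in rows:
        line = f"{r.model}: {r.score:g}/100"
        if r.n_questions != baseline:
            line += f" (caveat: {r.n_questions} questions, others {baseline})"
        lines.append(line)
    return lines


rows = [
    ScoreRow("Opus", 95, 8),      # from the review: 8 core practical questions
    ScoreRow("Model B", 90, 84),  # placeholder score, not from the source
    ScoreRow("Model C", 88, 84),  # placeholder score, not from the source
]
for line in flag_sample_size_mismatches(rows):
    print(line)
```

Using the most common question count as the baseline is a design assumption. A reviewer could equally pass the expected count in explicitly.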

Fairness Issues

  1. Double standard on API fabrication: MiMo's fabricated sending syntax got bold text plus the label "最危险的失败模式" ("the most dangerous failure mode"), while Sonnet's fabricated iOS API got only a casual "翻车" ("flopped"). Fix: treat both equally.
  2. Selective month-end drift comparison: the post showed only Opus (best) vs. Kimi (worst), ignoring that DeepSeek and GLM also solved it correctly.
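
One way to check for the double standard in item 1 is to tabulate how harshly each model's write-up treats the same failure type. This is a rough sketch under stated assumptions: `find_uneven_treatment` is a hypothetical helper, and the "severe" and "casual" labels are a reviewer's own coding of the post's tone, not values from the source report. The "Model X" and "Model Y" rows are placeholders showing a consistently treated failure.

```python
from collections import defaultdict


def find_uneven_treatment(findings):
    """Return failure types whose write-ups differ in severity across models."""
    by_failure = defaultdict(dict)
    for model, failure, severity in findings:
        by_failure[failure][model] = severity
    # Keep only failure types where models received different severity labels.
    return {
        failure: per_model
        for failure, per_model in by_failure.items()
        if len(set(per_model.values())) > 1
    }


findings = [
    ("MiMo", "api_fabrication", "severe"),     # bold + "most dangerous failure mode"
    ("Sonnet", "api_fabrication", "casual"),   # only a casual "flopped"
    ("Model X", "example_failure", "severe"),  # placeholder, not from the source
    ("Model Y", "example_failure", "severe"),  # placeholder, not from the source
]
print(find_uneven_treatment(findings))
```

Any failure type returned here is a candidate for rewording so that every model that made the same mistake is criticized in the same terms.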

Depth Issues

  1. The 5 "deep analysis" questions were merely rephrased from the source report's summary section.
  2. The scenario recommendations were copied verbatim from the source report.

Missing Content

  1. Kimi's fatalError in production code (a critical engineering flaw)
  2. GLM's CSV export syntax error (the code won't compile)
  3. Sonnet's two failures in the graphics test (API fabrication and the ACES formula)

Second Review (8.2/10) — Remaining Low-Priority Issues

  1. SE proposal number reference (SE-0371 vs SE-0427)
  2. Opus 95-score description could be more precise
  3. Missing "legacy Swift 5 project" recommendation scenario

Lessons Learned

  • Always add caveats when comparing scores with different sample sizes
  • Equal treatment: if you harshly criticize one model for X, do the same for all models that did X
  • Original analysis frameworks (failure mode taxonomy, cost/perf analysis) add genuine depth
  • Subagent review with NO context forces independent verification against source data