Example: AI Model Evaluation Blog Post Review
Context
Blog post titled "6款AI模型iOS开发能力深度评测" ("In-Depth Evaluation of 6 AI Models' iOS Development Capabilities"), based on @solidus's evaluation data.
First Review (6.5/10) — Key Issues Found
Critical Factual Errors
- Opus scoring misleading: 95/100 based on only 8 core practical questions, while other models scored on 84 questions. Placed in same table without caveat.
- "Two evaluation systems" described as three: The heading said "两套" ("two sets"), but the text listed three.
- GLM highest main score but ranked 3rd: No explanation of why (XII pressure test only 79 vs Sonnet 87).
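The sample-size problem in the first item is the kind of thing a quick check can catch before a table is published. The following is a minimal sketch, not part of the original review: the function name `caveat_rows` and the caveat wording are hypothetical, and the question counts are the ones quoted above (8 for Opus, 84 for the other models).

```python
from collections import Counter


def caveat_rows(question_counts):
    """Return a caveat for every model scored on an atypical number of questions.

    question_counts maps model name -> number of questions behind its score.
    The "typical" count is simply the most common one in the table.
    """
    typical = Counter(question_counts.values()).most_common(1)[0][0]
    return {
        model: f"scored on {n} questions, not {typical}"
        for model, n in question_counts.items()
        if n != typical
    }


# Question counts as described in the review; the model list is illustrative.
counts = {"Opus": 8, "GLM": 84, "Sonnet": 84, "Kimi": 84}
caveats = caveat_rows(counts)
print(caveats)
```

Any model that appears in the result needs a footnote or a separate table rather than a place in the shared ranking.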
Fairness Issues
- Double standard on API fabrication: MiMo's fabricated `sending` syntax got bold text plus the label "最危险的失败模式" ("the most dangerous failure mode"), while Sonnet's fabricated iOS API got only a casual "翻车" ("flopped"). Fix: equal treatment.
- Selective month-end drift comparison: Only showed Opus (best) vs Kimi (worst), ignoring that DeepSeek and GLM also solved it correctly.
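One way to make the equal-treatment check mechanical is to tag each criticism with a failure type and a severity label, then look for failure types whose labels differ across models. This is only a sketch under assumptions: the tuple layout, the function name `inconsistent_severity`, and the labels "critical" and "casual" are hypothetical encodings of the wording described above, not something the original review used.

```python
from collections import defaultdict


def inconsistent_severity(findings):
    """Return failure types whose severity label differs between models.

    findings is a list of (model, failure_type, severity) tuples.
    """
    by_type = defaultdict(dict)
    for model, failure_type, severity in findings:
        by_type[failure_type][model] = severity
    return {
        failure_type: labels
        for failure_type, labels in by_type.items()
        if len(set(labels.values())) > 1
    }


# Hypothetical encoding of the two API-fabrication criticisms in the post.
findings = [
    ("MiMo", "api_fabrication", "critical"),   # bold + "most dangerous failure mode"
    ("Sonnet", "api_fabrication", "casual"),   # a casual one-word remark only
]
flagged = inconsistent_severity(findings)
print(flagged)
```

A non-empty result points at the passages to reword so that the same failure gets the same weight for every model.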
Depth Issues
- 5 "deep analysis" questions were just rephrased from the source report's summary section.
- Scenario recommendations copied verbatim from source report.
Missing Content
- Kimi's `fatalError` in production code (critical engineering flaw)
- GLM's CSV export syntax error (won't compile)
- Sonnet's TWO failures in graphics test (API fabrication + ACES formula)
Second Review (8.2/10) — Remaining Low-Priority Issues
- Swift Evolution (SE) proposal number reference needs checking (SE-0371 vs SE-0427)
- Opus 95-score description could be more precise
- Missing "legacy Swift 5 project" recommendation scenario
Lessons Learned
- Always add caveats when comparing scores with different sample sizes
- Equal treatment: if you harshly criticize one model for X, do the same for all models that did X
- Original analysis frameworks (failure mode taxonomy, cost/perf analysis) add genuine depth
- Subagent review with NO context forces independent verification against source data