ephron_ren/agent-skills

Hermes Agent ccc63d1e70 first commit

2026-05-10 13:52:46 +08:00

Example: AI Model Evaluation Blog Post Review

Context

Blog post titled "6款AI模型iOS开发能力深度评测" ("In-Depth Evaluation of 6 AI Models' iOS Development Capabilities"), based on @solidus's evaluation data.

First Review (6.5/10) — Key Issues Found

Critical Factual Errors

Misleading Opus score: the 95/100 was based on only 8 core practical questions, while the other models were scored on 84 questions. Both appeared in the same table without a caveat.
"Two evaluation systems" described as three: the title said "两套" ("two sets") but the text listed three.
GLM had the highest main score but was ranked 3rd: the post did not explain why (XII pressure test score of only 79 vs. Sonnet's 87).
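The first issue above can be caught mechanically before a score table is published. The following is a minimal sketch, not part of the original review workflow: the function name `sample_size_caveats` is hypothetical, and the model list assumes that GLM, Sonnet, and Kimi are among the models scored on 84 questions.

```python
from collections import Counter


def sample_size_caveats(question_counts):
    """Return a caveat for each model whose question count differs
    from the most common count in the table."""
    # The most frequent question count is treated as the baseline.
    baseline, _ = Counter(question_counts.values()).most_common(1)[0]
    return {
        model: f"{model}: scored on {n} questions, not {baseline}; not directly comparable"
        for model, n in question_counts.items()
        if n != baseline
    }


# Question counts from the review: Opus on 8, the others on 84.
# Which models count as "the others" is an assumption for illustration.
counts = {"Opus": 8, "GLM": 84, "Sonnet": 84, "Kimi": 84}
caveats = sample_size_caveats(counts)
for note in caveats.values():
    print(note)
```

Any model returned by the function needs a footnote or a separate table rather than a silent place in the main ranking.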

Fairness Issues

Double standard on API fabrication: MiMo's fabricated sending syntax got bold text and the label "最危险的失败模式" ("the most dangerous failure mode"), while Sonnet's fabricated iOS API got only the casual "翻车" (roughly "flopped"). Fix: equal treatment.
Selective month-end drift comparison: only Opus (best) and Kimi (worst) were shown, ignoring that DeepSeek and GLM also solved it correctly.
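A double standard like the one above shows up when findings are grouped by failure type and the wording used for each model is compared. This is a rough sketch under stated assumptions: the function name `inconsistent_treatment`, the tuple layout, and the `"api_fabrication"` key are all hypothetical, and the labels are the English glosses of the wording quoted in the review.

```python
from collections import defaultdict


def inconsistent_treatment(findings):
    """Group findings by failure type and return the types where
    different models received different severity wording."""
    labels_by_failure = defaultdict(dict)
    for model, failure, label in findings:
        labels_by_failure[failure][model] = label
    return {
        failure: labels
        for failure, labels in labels_by_failure.items()
        if len(set(labels.values())) > 1
    }


# Findings taken from the review; the tuple layout is illustrative only.
findings = [
    ("MiMo", "api_fabrication", "most dangerous failure mode"),
    ("Sonnet", "api_fabrication", "flopped"),
]
flagged = inconsistent_treatment(findings)
print(flagged)
```

A flagged failure type does not prove bias, since severity can legitimately differ, but it marks the passages a reviewer should reread for equal treatment.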

Depth Issues

The 5 "deep analysis" questions were merely rephrased from the source report's summary section.
Scenario recommendations were copied verbatim from the source report.

Missing Content

Kimi's fatalError in production code (critical engineering flaw)
GLM's CSV export syntax error (won't compile)
Sonnet's TWO failures in graphics test (API fabrication + ACES formula)

Second Review (8.2/10) — Remaining Low-Priority Issues

SE proposal number reference (SE-0371 vs SE-0427)
Opus 95-score description could be more precise
Missing "legacy Swift 5 project" recommendation scenario

Lessons Learned

Always add caveats when comparing scores with different sample sizes
Equal treatment: if you harshly criticize one model for X, do the same for all models that did X
Original analysis frameworks (failure mode taxonomy, cost/perf analysis) add genuine depth
Subagent review with NO context forces independent verification against source data