---
name: llm-model-comparison
description: Compare LLM models across benchmarks, pricing, and capabilities. For evaluating new models, recommending providers, and maintaining benchmark knowledge.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [llm, benchmark, model-comparison, evaluation, provider-selection]
    triggers:
      - user asks "which model is better" or "compare X vs Y"
      - user asks about a new model they saw in news/早报 (morning news digest)
      - user wants to know if they should switch models
      - user asks "what level is this model" or "is X any good"
      - selecting a model provider for a new project
---

# LLM Model Comparison Skill

## When to Use

- User asks about a model they saw in news, a 早报 (morning news digest), or social media
- User wants to compare two or more models for a specific use case
- User asks "should I switch to X" or "is Y worth it"
- Selecting models for deployment, API integration, or fine-tuning
- **User asks to elaborate on a model or product mentioned in 橘鸦AI早报 or other news digests**

## Comparison Framework

### Step 1: Identify the Question

- "What is it?" → give an overview and positioning
- "Should I use it?" → compare against the user's current stack
- "Which is better?" → build a structured comparison table

### Step 2: Gather Data

Use `mmx search` to find:

1. Official announcements and benchmark numbers
2. Third-party evaluations (non-linear benchmark, LMSYS, Artificial Analysis)
3. Community feedback and real-world usage reports

Search patterns:

```
mmx search query " benchmark MMLU 评测 2026"
mmx search query " vs comparison"
mmx search query " API pricing performance"
mmx search query "<model name in Chinese> 评测 benchmark"
```

For Chinese platform-specific models (SenseNova, Volcengine, Qwen, etc.), search in Chinese:

```
mmx search query "商汤 sensenova 模型 评测"
mmx search query "火山引擎 doubao 模型列表"
```

See `references/chinese-model-platforms.md` for known provider APIs and model catalogs.
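
The search patterns above can be filled in for a specific model with a small helper. This is a minimal sketch, assuming the model name goes at the start of each query (where the patterns appear to leave a gap). The function name `print_model_queries` and the names `ModelA` and `ModelB` are hypothetical. The helper only prints the commands and uses no `mmx` syntax beyond the `mmx search query "..."` form shown above.

```bash
#!/usr/bin/env bash
# Sketch: print the Step 2 search commands for one model name.
# Assumption: the model name leads each query; adjust the templates if not.
# The commands are printed, not executed, so they can be reviewed first.

print_model_queries() {
  local model="$1"        # model the user asked about
  local rival="${2:-}"    # optional second model for the "vs" query

  # "评测" is the Chinese search term for "evaluation", kept from the patterns.
  printf 'mmx search query "%s benchmark MMLU 评测 2026"\n' "$model"
  if [ -n "$rival" ]; then
    printf 'mmx search query "%s vs %s comparison"\n' "$model" "$rival"
  fi
  printf 'mmx search query "%s API pricing performance"\n' "$model"
}

# Example with placeholder names (hypothetical, not real models):
print_model_queries "ModelA" "ModelB"
```

When only one name is passed, the "vs" query is skipped, so the helper also covers plain "what is it?" questions.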

### Step 3: Structure the Comparison

Use this table format for multi-model comparison:

| Dimension | Model A | Model B | Model C |
|-----------|---------|---------|---------|
| **Developer** | Company | Company | Company |
| **Parameter count** | XxB | XxB | XxB |
| **Architecture** | Dense/MoE | Dense/MoE | Dense/MoE |
| **Open source** | ✅/❌ | ✅/❌ | ✅/❌ |
| **Chinese-language ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Coding ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Agent ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Value for money** | Description | Description | Description |

### Step 4: Scenario-Based Recommendation

Always end with a scenario table:

| Scenario | Recommended model | Rationale |
|----------|-------------------|-----------|
| Everyday Chinese conversation | X | Rationale |
| Coding tasks | Y | Rationale |
| Agent development | Z | Rationale |
| Open-source self-hosting | W | Rationale |
| Cost-sensitive use | V | Rationale |

### Step 5: Actionable Next Steps

- If the user already uses a model, compare against their current stack
- Offer to configure the new model in their environment
- Note any migration costs or compatibility issues

## Key Benchmark Sources

| Source | URL | What it measures |
|--------|-----|------------------|
| Artificial Analysis | artificialanalysis.ai | Speed, quality, price |
| LMSYS Chatbot Arena | lmarena.ai | Human preference (Elo) |
| non-linear ReLE | github.com/jeinlee1991/chinese-llm-benchmark | Chinese LLM comprehensive |
| SWE-bench Pro | swebench.com | Coding agent capability |
| BFCL-V3 | gorilla.cs.berkeley.edu | Function calling |
| MMLU | Various | General knowledge |

## Elaborating on 橘鸦AI早报 Items

When the user says "细说X" ("elaborate on X") or "elaborate on item X" about an item from the daily news digest:

### Step 1: Find the source

```bash
# List the last five cron output files (sorted by name)
ls ~/.hermes/cron/output/9733a9cabb44/ | sort | tail -5
# Read the relevant file
cat ~/.hermes/cron/output/9733a9cabb44/.md
```

### Step 2: Extract the specific item

Parse the numbered list and identify the item by number.

### Step 3: Deep research

Use `mmx search` to find:

1. Official announcements and product pages
2. Technical documentation or blog posts
3. Community reactions and early adopter feedback
4. Benchmark data if applicable

### Step 4: Structure the response

- One-line summary of what it is
- Detailed breakdown (features, specs, implications)
- Comparison with alternatives if relevant
- Actionable recommendation (try it? wait? skip?)

## Pitfalls

### Don't compare apples to oranges

- An MoE model (e.g., 400B total, 13B active) is not equivalent to a dense model with the same total parameter count
- Always note activated parameters for MoE models
- Pricing varies wildly: per-token vs per-request vs subscription

### Benchmark ≠ real-world performance

- Benchmark scores don't capture latency, rate limits, or availability
- Chinese benchmark scores may not reflect English performance and vice versa
- Agent benchmarks (SWE-bench, τ³-Bench) are more relevant for agentic use cases than MMLU

### Free tier traps

- "Free" models on platforms may have rate limits, latency, or availability issues
- Check whether the free offer is temporary (e.g., "一周免费", "free for one week") before recommending
- Self-hosted "free" models still have compute costs

### Don't over-hype new releases

- New model announcements often cherry-pick favorable benchmarks
- Wait for third-party evaluations before making strong claims
- If the user saw it in a 早报 (morning news digest) or other news, note that it is worth watching but not necessarily worth switching to

### ALWAYS use mmx search, NOT curl/browser

- **Never** fall back to curl-based scraping (Google, Baidu, DuckDuckGo) for model research: these sites all block the request or return empty results
- **Never** try browser navigation for model research: sandbox issues are common and the pages are SPAs
- `mmx search` is the only reliable research tool.
  If it fails, say so and give your best assessment from training data.
- Do NOT attempt 10+ curl variations hoping one works; one `mmx search` call is worth 20 failed curl attempts

## Current User Stack (Reference)

- Primary model: MiMo 2.5 Pro (via Xiaomi API)
- Also available: MiniMax M2.7
- Hermes Agent: v0.12.0
- Use case: Agent tasks, coding, Chinese content

## References

- See `references/model-benchmarks-2026-05.md` for curated benchmark data
- See `references/chinese-model-platforms.md` for Chinese AI provider APIs, model naming conventions, and research heuristics