---
name: llm-model-comparison
description: Compare LLM models across benchmarks, pricing, and capabilities. For evaluating new models, recommending providers, and maintaining benchmark knowledge.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [llm, benchmark, model-comparison, evaluation, provider-selection]
    triggers:
      - user asks "which model is better" or "compare X vs Y"
      - user asks about a new model they saw in the news or a morning briefing (早报)
      - user wants to know if they should switch models
      - user asks "what level is this model" or "is X any good"
      - selecting a model provider for a new project
---

# LLM Model Comparison Skill

## When to Use
- User asks about a model they saw in the news, a morning briefing (早报), or social media
- User wants to compare two or more models for a specific use case
- User asks "should I switch to X" or "is Y worth it"
- Selecting models for deployment, API integration, or fine-tuning
- **User asks to elaborate on a model or product mentioned in 橘鸦AI早报 or other news digests**

## Comparison Framework

### Step 1: Identify the Question
- "What is it?" question → give an overview and positioning
- "Should I use it?" question → compare against the user's current stack
- "Which is better?" question → give a structured comparison table

### Step 2: Gather Data
Use `mmx search` to find:
1. Official announcements and benchmark numbers
2. Third-party evaluations (non-linear ReLE, LMSYS Chatbot Arena, Artificial Analysis)
3. Community feedback and real-world usage reports

Search patterns:
```
mmx search query "<model name> benchmark MMLU 评测 2026"
mmx search query "<model name> vs <model name> comparison"
mmx search query "<model name> API pricing performance"
mmx search query "<模型中文名> 评测 benchmark"
```

For models from Chinese platforms (SenseNova, Volcengine, Qwen, etc.), search in Chinese:
```
mmx search query "商汤 sensenova 模型 评测"
mmx search query "火山引擎 doubao 模型列表"
```

See `references/chinese-model-platforms.md` for known provider APIs and model catalogs.

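
The search patterns in Step 2 can be generated per model rather than typed by hand. The following is a minimal Python sketch, not part of the skill itself: the helper name `build_search_commands` and its parameters are hypothetical, and it only assembles the `mmx search query "<text>"` form shown in this document (no other `mmx` flags are assumed). `shlex.join` is used so that spaces and non-ASCII characters are quoted safely, which produces single quotes instead of the double quotes used in the patterns.

```python
import shlex


def build_search_commands(model, rival=None, chinese_name=None, year=2026):
    """Return `mmx search query` command strings for one model.

    All parameter names are illustrative; only the command form
    `mmx search query "<text>"` comes from the skill document.
    """
    queries = [
        f"{model} benchmark MMLU 评测 {year}",
        f"{model} API pricing performance",
    ]
    if rival:
        # Head-to-head comparison query, only when a second model is given
        queries.append(f"{model} vs {rival} comparison")
    if chinese_name:
        # Chinese-language query for models with a Chinese product name
        queries.append(f"{chinese_name} 评测 benchmark")
    # shlex.join quotes each argument so the command is safe to paste into a shell
    return [shlex.join(["mmx", "search", "query", q]) for q in queries]


if __name__ == "__main__":
    for command in build_search_commands("ModelA", rival="ModelB"):
        print(command)
```

Keeping the year as a parameter avoids stale queries once the hard-coded `2026` in the patterns is out of date.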
### Step 3: Structure the Comparison

Use this table format for multi-model comparison:

| Dimension | Model A | Model B | Model C |
|------|---------|---------|---------|
| **Developer** | Company | Company | Company |
| **Parameter count** | XxB | XxB | XxB |
| **Architecture** | Dense/MoE | Dense/MoE | Dense/MoE |
| **Open source** | ✅/❌ | ✅/❌ | ✅/❌ |
| **Chinese proficiency** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Coding** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Agent capability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Cost-effectiveness** | Description | Description | Description |

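
When the gathered data is already structured, a table of this shape can be rendered mechanically. This is a rough Python sketch under assumptions: `render_comparison` and its argument layout are hypothetical, and missing values are written as `unknown` rather than guessed, so gaps in the research stay visible.

```python
def render_comparison(models, dimensions):
    """Render a Markdown comparison table.

    models: dict mapping model name -> dict of dimension -> value.
    dimensions: ordered list of dimension names, used as row labels.
    """
    names = list(models)
    lines = [
        "| Dimension | " + " | ".join(names) + " |",
        "|" + "---|" * (len(names) + 1),  # one separator cell per column
    ]
    for dim in dimensions:
        # Fall back to "unknown" instead of inventing a value
        cells = [str(models[name].get(dim, "unknown")) for name in names]
        lines.append(f"| **{dim}** | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "Model A": {"Architecture": "MoE"},
        "Model B": {"Architecture": "Dense", "Open source": "✅"},
    }
    print(render_comparison(example, ["Architecture", "Open source"]))
```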
### Step 4: Scenario-Based Recommendation

Always end with a scenario table:

| Scenario | Recommended model | Rationale |
|------|----------|------|
| Everyday Chinese conversation | X | Rationale |
| Coding tasks | Y | Rationale |
| Agent development | Z | Rationale |
| Open-source self-hosting | W | Rationale |
| Cost-sensitive | V | Rationale |

### Step 5: Actionable Next Steps
- If the user already uses a model, compare the candidate against their current stack
- Offer to configure the new model in their environment
- Note any migration costs or compatibility issues

## Key Benchmark Sources

| Source | URL | What it measures |
|--------|-----|------------------|
| Artificial Analysis | artificialanalysis.ai | Speed, quality, price |
| LMSYS Chatbot Arena | lmarena.ai | Human preference (Elo) |
| non-linear ReLE | github.com/jeinlee1991/chinese-llm-benchmark | Comprehensive Chinese LLM evaluation |
| SWE-bench Pro | swebench.com | Coding agent capability |
| BFCL-V3 | gorilla.cs.berkeley.edu | Function calling |
| MMLU | Various | General knowledge |

## Elaborating on 橘鸦AI早报 Items

When the user says "细说X" or "elaborate on item X" about an item from the daily news digest:

### Step 1: Find the source
```bash
# List the five most recent cron output files
ls ~/.hermes/cron/output/9733a9cabb44/ | sort | tail -5
# Read the relevant file
cat ~/.hermes/cron/output/9733a9cabb44/<date>.md
```

### Step 2: Extract the specific item
Parse the numbered list in the digest and pick out the item whose number the user referred to.

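
Extracting an item by number can be sketched in a few lines of Python. This assumes the digest is a plain numbered list in which each item starts with `N.`, `N)`, or `N、` at the beginning of a line; the actual cron output format is not specified here, so both the pattern and the `extract_item` helper are hypothetical.

```python
import re

# Assumed item marker: "3. ", "3) " or "3、" at the start of a line
ITEM_START = re.compile(r"^\s*(\d+)(?:[.)]\s+|、\s*)(.*)$")


def extract_item(digest_text, number):
    """Return the text of numbered item `number`, or None if it is absent."""
    collected = None
    for line in digest_text.splitlines():
        match = ITEM_START.match(line)
        if match:
            if collected is not None:
                break  # the next item begins, so the wanted item is complete
            if int(match.group(1)) == number:
                collected = [match.group(2)]
        elif collected is not None and line.strip():
            # Continuation line that belongs to the wanted item
            collected.append(line.strip())
    return " ".join(collected) if collected is not None else None


if __name__ == "__main__":
    sample = "1. Model A released\n2. Model B price cut\n   now cheaper\n3. Tool C"
    print(extract_item(sample, 2))
```

Returning `None` for a missing number lets the caller ask the user which item they meant instead of researching the wrong one.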
### Step 3: Deep research
Use `mmx search` to find:
1. Official announcements and product pages
2. Technical documentation or blog posts
3. Community reactions and early adopter feedback
4. Benchmark data if applicable

### Step 4: Structure the response
- One-line summary of what it is
- Detailed breakdown (features, specs, implications)
- Comparison with alternatives if relevant
- Actionable recommendation (try it? wait? skip?)

## Pitfalls

### Don't compare apples to oranges
- MoE models (e.g., 400B total, 13B active) ≠ dense models with the same total parameter count
- Always note the activated parameter count for MoE models
- Pricing models vary widely (per-token, per-request, subscription), so normalize before comparing costs

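
One way to put per-token, per-request, and subscription pricing on a common footing is to estimate the monthly cost for a fixed workload. The sketch below is illustrative only: the plan fields (`kind`, `price_per_million_tokens`, `price_per_request`, `monthly_fee`) and every number are hypothetical, and it ignores separate input and output token prices, which real providers often charge differently.

```python
def monthly_cost(plan, requests_per_month, tokens_per_request):
    """Estimate the monthly cost of a plan for a given workload.

    The result is in whatever currency the plan's prices use.
    """
    kind = plan["kind"]
    if kind == "per_token":
        total_tokens = requests_per_month * tokens_per_request
        return total_tokens / 1_000_000 * plan["price_per_million_tokens"]
    if kind == "per_request":
        return requests_per_month * plan["price_per_request"]
    if kind == "subscription":
        # Flat fee, independent of usage (rate limits are not modeled)
        return plan["monthly_fee"]
    raise ValueError(f"unknown pricing kind: {kind}")


if __name__ == "__main__":
    # Hypothetical plans; none of these prices come from a real provider
    plans = {
        "token-priced": {"kind": "per_token", "price_per_million_tokens": 0.5},
        "request-priced": {"kind": "per_request", "price_per_request": 0.25},
        "flat": {"kind": "subscription", "monthly_fee": 15},
    }
    for name, plan in plans.items():
        print(name, monthly_cost(plan, requests_per_month=10_000, tokens_per_request=2_000))
```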
### Benchmark ≠ real-world performance
- Benchmark scores don't capture latency, rate limits, or availability
- Chinese benchmark scores may not reflect English performance and vice versa
- Agent benchmarks (SWE-bench, τ³-Bench) are more relevant for agentic use cases than MMLU

### Free tier traps
- "Free" models on platforms may have rate limits, latency, or availability issues
- Check whether the free offer is temporary (e.g., "一周免费", free for one week) before recommending
- Self-hosted "free" models still have compute costs

### Don't over-hype new releases
- New model announcements often cherry-pick favorable benchmarks
- Wait for third-party evaluations before making strong claims
- If the user saw it in a morning briefing (早报) or the news, note that it is worth watching but not necessarily worth switching to

### ALWAYS use mmx search, NOT curl/browser
- **Never** fall back to curl-based scraping (Google, Baidu, DuckDuckGo) for model research: these all block the request or return empty results
- **Never** try browser navigation for model research: sandbox issues are common and the pages are SPAs
- `mmx search` is the only reliable research tool. If it fails, say so and give your best assessment from training data
- Do NOT attempt 10+ curl variations hoping one works; one `mmx search` call is worth 20 failed curl attempts

## Current User Stack (Reference)
- Primary model: MiMo 2.5 Pro (via Xiaomi API)
- Also available: MiniMax M2.7
- Hermes Agent: v0.12.0
- Use case: Agent tasks, coding, Chinese content

## References
- See `references/model-benchmarks-2026-05.md` for curated benchmark data
- See `references/chinese-model-platforms.md` for Chinese AI provider APIs, model naming conventions, and research heuristics