first commit
163
research/llm-model-comparison/SKILL.md
Normal file
@@ -0,0 +1,163 @@
---
name: llm-model-comparison
description: Compare LLMs across benchmarks, pricing, and capabilities. Use when evaluating new models, recommending providers, and maintaining benchmark knowledge.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [llm, benchmark, model-comparison, evaluation, provider-selection]
    triggers:
      - user asks "which model is better" or "compare X vs Y"
      - user asks about a new model they saw in the news or a morning briefing (早报)
      - user wants to know if they should switch models
      - user asks "what level is this model" or "is X any good"
      - selecting a model provider for a new project
---

# LLM Model Comparison Skill

## When to Use

- User asks about a model they saw in the news, a morning briefing (早报), or social media
- User wants to compare two or more models for a specific use case
- User asks "should I switch to X" or "is Y worth it"
- Selecting models for deployment, API integration, or fine-tuning
- **User asks to elaborate on a model or product mentioned in 橘鸦AI早报 or other news digests**

## Comparison Framework

### Step 1: Identify the Question

- "What is it?" → give an overview and positioning
- "Should I use it?" → compare against the user's current stack
- "Which is better?" → build a structured comparison table

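The three question types can be routed with a small keyword check. This is a minimal sketch: the labels and keyword patterns are hypothetical and would need tuning to how users actually phrase their questions.

```python
import re

# Hypothetical keyword patterns, checked in order; not part of the skill itself.
QUESTION_PATTERNS = [
    ("which_is_better", re.compile(r"\bvs\b|\bversus\b|\bcompare\b|which .*better", re.I)),
    ("should_i_use", re.compile(r"\bshould i\b|\bworth it\b|\bswitch\b", re.I)),
]


def classify_question(text: str) -> str:
    """Return the question type, falling back to 'what_is_it' when nothing matches."""
    for label, pattern in QUESTION_PATTERNS:
        if pattern.search(text):
            return label
    return "what_is_it"
```

Defaulting to the overview case keeps the response safe when the intent is unclear.
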
### Step 2: Gather Data

Use `mmx search` to find:
1. Official announcements and benchmark numbers
2. Third-party evaluations (non-linear benchmark, LMSYS, Artificial Analysis)
3. Community feedback and real-world usage reports

Search patterns:
```
mmx search query "<model name> benchmark MMLU 评测 2026"
mmx search query "<model name> vs <model name> comparison"
mmx search query "<model name> API pricing performance"
mmx search query "<模型中文名> 评测 benchmark"
```

For Chinese platform-specific models (SenseNova, Volcengine, Qwen, etc.), search in Chinese:
```
mmx search query "商汤 sensenova 模型 评测"
mmx search query "火山引擎 doubao 模型列表"
```

See `references/chinese-model-platforms.md` for known provider APIs and model catalogs.

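The search patterns above can be generated from a model name. The sketch below only builds the command strings and does not run them. `build_search_commands` is a hypothetical helper, and the only thing assumed about `mmx` is the `mmx search query "<text>"` form shown above.

```python
import shlex
from typing import List, Optional


def build_search_commands(model: str, rival: Optional[str] = None) -> List[str]:
    """Build `mmx search query` command lines for one model (and an optional rival)."""
    queries = [
        f"{model} benchmark MMLU 评测 2026",
        f"{model} API pricing performance",
    ]
    if rival:
        queries.append(f"{model} vs {rival} comparison")
    # shlex.quote keeps spaces and non-ASCII text inside a single shell argument.
    return [f"mmx search query {shlex.quote(q)}" for q in queries]
```

For example, `build_search_commands("MiMo", rival="Kimi")` yields three commands, the last one being the head-to-head query.
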
### Step 3: Structure the Comparison

Use this table format for multi-model comparison:

| Dimension | Model A | Model B | Model C |
|-----------|---------|---------|---------|
| **Developer** | Company | Company | Company |
| **Parameter count** | XxB | XxB | XxB |
| **Architecture** | Dense/MoE | Dense/MoE | Dense/MoE |
| **Open source** | ✅/❌ | ✅/❌ | ✅/❌ |
| **Chinese ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Coding ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Agent ability** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Value for money** | Description | Description | Description |

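A comparison table of this shape can be rendered from plain dictionaries. This is a sketch under assumptions: `render_comparison` and its input layout are hypothetical, and a missing value is shown as `unknown` rather than guessed.

```python
from typing import Dict, List


def render_comparison(dimensions: List[str], models: Dict[str, Dict[str, str]]) -> str:
    """Render a Markdown table with one row per dimension and one column per model."""
    names = list(models)
    lines = [
        "| Dimension | " + " | ".join(names) + " |",
        "|" + "---|" * (len(names) + 1),
    ]
    for dim in dimensions:
        # Missing data stays visible as "unknown" instead of being filled in.
        cells = [str(models[name].get(dim, "unknown")) for name in names]
        lines.append(f"| **{dim}** | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping unknown cells explicit makes it obvious which claims still need a source.
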
### Step 4: Scenario-Based Recommendation

Always end with a scenario table:

| Scenario | Recommended model | Reason |
|----------|-------------------|--------|
| Everyday Chinese conversation | X | Reason |
| Coding tasks | Y | Reason |
| Agent development | Z | Reason |
| Open-source self-hosting | W | Reason |
| Cost-sensitive | V | Reason |

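One way to fill the scenario table is to pick, for each scenario, the model with the highest rating on the matching dimension. The sketch below assumes star ratings from 1 to 5 have already been assigned during the comparison. The function name, the scenario-to-dimension mapping, and the ratings are all hypothetical inputs, not measured data.

```python
from typing import Dict, List, Tuple


def recommend(scenarios: Dict[str, str],
              ratings: Dict[str, Dict[str, int]]) -> List[Tuple[str, str, str]]:
    """Return (scenario, model, reason) rows, choosing the top-rated model per dimension."""
    rows = []
    for scenario, dim in scenarios.items():
        # Ties resolve to the first model listed in `ratings`.
        best = max(ratings, key=lambda model: ratings[model].get(dim, 0))
        stars = ratings[best].get(dim, 0)
        rows.append((scenario, best, f"highest {dim} rating ({stars}/5)"))
    return rows
```

A rating-based pick is only a starting point. The reason column should still cite the evidence behind the rating.
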
### Step 5: Actionable Next Steps
- If the user already uses a model, compare the candidate against their current stack
- Offer to configure the new model in their environment
- Note any migration costs or compatibility issues

## Key Benchmark Sources

| Source | URL | What it measures |
|--------|-----|------------------|
| Artificial Analysis | artificialanalysis.ai | Speed, quality, price |
| LMSYS Chatbot Arena | lmarena.ai | Human preference (Elo) |
| non-linear ReLE | github.com/jeinlee1991/chinese-llm-benchmark | Comprehensive Chinese LLM evaluation |
| SWE-bench Pro | swebench.com | Coding agent capability |
| BFCL-V3 | gorilla.cs.berkeley.edu | Function calling |
| MMLU | Various | General knowledge |

## Elaborating on 橘鸦AI早报 Items

When the user says "细说X" ("elaborate on X") or "elaborate on item X" about the daily news digest:

### Step 1: Find the source
```bash
# List the five most recent cron output files
ls ~/.hermes/cron/output/9733a9cabb44/ | sort | tail -5
# Read the relevant file
cat ~/.hermes/cron/output/9733a9cabb44/<date>.md
```

### Step 2: Extract the specific item
Parse the digest's numbered list and pick out the item whose number the user gave.

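Extracting an item by number could look like the sketch below. It assumes each digest item starts a line with its number followed by `.`, `、`, or `)`, and that continuation lines belong to the item above them. The actual digest layout is not shown in this skill, so treat the pattern as a guess to adjust.

```python
import re
from typing import Optional

# Assumed item marker: "3. ", "3、" or "3) " at the start of a line.
ITEM_START = re.compile(r"^\s*(\d+)[.、)]\s*(.*)$")


def extract_item(digest: str, number: int) -> Optional[str]:
    """Return the text of the numbered item, or None if that number is absent."""
    collected = None
    for line in digest.splitlines():
        match = ITEM_START.match(line)
        if match:
            if collected is not None:
                break  # the next item begins, so stop collecting
            if int(match.group(1)) == number:
                collected = [match.group(2)]
        elif collected is not None and line.strip():
            collected.append(line.strip())
    return "\n".join(collected) if collected is not None else None
```

Returning `None` for a missing number lets the caller ask the user which item they meant instead of guessing.
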
### Step 3: Deep research
Use `mmx search` to find:
1. Official announcements and product pages
2. Technical documentation or blog posts
3. Community reactions and early adopter feedback
4. Benchmark data if applicable

### Step 4: Structure the response
- One-line summary of what it is
- Detailed breakdown (features, specs, implications)
- Comparison with alternatives if relevant
- Actionable recommendation (try it? wait? skip?)

## Pitfalls

### Don't compare apples to oranges
- An MoE model (e.g., 400B total, 13B active) is not equivalent to a dense model with the same total parameter count
- Always note activated parameters for MoE models
- Pricing models vary widely: per-token vs per-request vs subscription

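The MoE caveat is easy to make concrete: report the active share next to the total. A minimal sketch with hypothetical helper names, using the 400B total / 13B active example from the list above:

```python
from typing import Optional


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token for an MoE model."""
    if not 0 < active_b <= total_b:
        raise ValueError("active parameters must be positive and no larger than the total")
    return active_b / total_b


def describe_size(name: str, total_b: float, active_b: Optional[float] = None) -> str:
    """Format a size label that always states active parameters for MoE models."""
    if active_b is None:
        return f"{name}: {total_b:g}B dense"
    share = active_fraction(total_b, active_b)
    return f"{name}: {total_b:g}B total, {active_b:g}B active ({share:.2%} per token)"
```

For the 400B/13B case the active share is 3.25%. Per-token compute for such a model is therefore closer to a 13B dense model than to a 400B one.
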
### Benchmark ≠ real-world performance
- Benchmark scores don't capture latency, rate limits, or availability
- Chinese benchmark scores may not reflect English performance and vice versa
- Agent benchmarks (SWE-bench, τ³-Bench) are more relevant for agentic use cases than MMLU

### Free tier traps
- "Free" models on platforms may have rate limits, latency, or availability issues
- Check if the free offer is temporary (e.g., "一周免费", free for one week) before recommending
- Self-hosted "free" models still have compute costs

### Don't over-hype new releases
- New model announcements often cherry-pick favorable benchmarks
- Wait for third-party evaluations before making strong claims
- If the user saw it in a morning briefing (早报) or the news, say it is worth watching but does not necessarily justify switching

### ALWAYS use mmx search, NOT curl/browser
- **Never** fall back to curl-based scraping (Google, Baidu, DuckDuckGo) for model research: these sites block the requests or return empty results
- **Never** try browser navigation for model research: sandbox issues are common and the pages are SPAs
- `mmx search` is the only reliable research tool. If it fails, say so and give your best assessment from training data
- Do NOT attempt 10+ curl variations hoping one works: one `mmx search` call is worth 20 failed curl attempts

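The "one call, then say so" rule can be sketched as a thin wrapper. Assumptions: `mmx` is on `PATH` and accepts the `search query` form used earlier in this skill, a non-zero exit code or empty output counts as failure, and the 60-second timeout is arbitrary. The `run` parameter exists only so the wrapper can be exercised without the real CLI.

```python
import subprocess
from typing import Callable, Optional


def mmx_search(query: str, run: Callable = subprocess.run) -> Optional[str]:
    """Run a single `mmx search query` call; return stdout, or None on failure."""
    try:
        result = run(["mmx", "search", "query", query],
                     capture_output=True, text=True, timeout=60)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return result.stdout


def research(query: str, run: Callable = subprocess.run) -> str:
    """Search once; on failure, state that clearly instead of retrying other tools."""
    output = mmx_search(query, run)
    if output is None:
        return "mmx search failed; the assessment below relies on training data only."
    return output
```

There is deliberately no curl or browser fallback path in the wrapper.
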
## Current User Stack (Reference)
- Primary model: MiMo 2.5 Pro (via Xiaomi API)
- Also available: MiniMax M2.7
- Hermes Agent: v0.12.0
- Use case: Agent tasks, coding, Chinese content

## References
- See `references/model-benchmarks-2026-05.md` for curated benchmark data
- See `references/chinese-model-platforms.md` for Chinese AI provider APIs, model naming conventions, and research heuristics
@@ -0,0 +1,40 @@
# Chinese AI Model Platforms Reference

## Major Providers & Model Families

| Provider | Platform | Model Family | Notes |
|----------|----------|-------------|-------|
| 商汤 SenseTime | cloud.sensenova.cn | SenseNova (6.7B, U1, etc.) | Named as `sensenova-*` in APIs |
| 深度求索 DeepSeek | platform.deepseek.com | DeepSeek-V3/V4, R1, Coder | `deepseek-*` naming |
| 阿里 Alibaba | dashscope.aliyun.com | Qwen (通义千问) | `qwen-*` naming |
| 字节跳动 ByteDance | volcengine.com | Doubao (豆包) | `doubao-*` naming |
| 月之暗面 Moonshot | platform.moonshot.cn | Kimi | `moonshot-*` naming |
| 智谱 Zhipu | open.bigmodel.cn | GLM (ChatGLM) | `glm-*` naming |
| 百度 Baidu | cloud.baidu.com | 文心 ERNIE | `ernie-*` naming |
| 零一万物 01.AI | platform.lingyiwanwu.com | Yi | `yi-*` naming |
| MiniMax | platform.minimaxi.com | MiniMax (M2.7, etc.) | `minimax-*` naming |
| 小米 Xiaomi | mimo.xiaomi.com | MiMo | `mimo-*` naming |

## Common Model Naming Patterns

- `*-flash` / `*-lite` → lightweight/fast inference variants
- `*-fast` → speed-optimized, may sacrifice some quality
- `*-instruct` → instruction-tuned for chat
- `*-coder` / `*-code` → code-specialized
- `*-v1`, `*-v2`, `*-v3` → version iterations
- Parameter count often embedded: `6.7B`, `72B`, etc.

## How to Research an Unknown Model

1. **mmx search** with model name + "评测" or "benchmark"
2. Check the provider's official docs (see table above)
3. Check LMSYS Chatbot Arena leaderboard (lmarena.ai)
4. Check non-linear Chinese LLM benchmark (github.com/jeinlee1991/chinese-llm-benchmark)

## Quick Classification Heuristics

- If name contains a provider prefix (sensenova, deepseek, qwen...) → look up that provider
- If name contains parameter count (6.7B, 7B, 72B) → compare against known models of similar size
- If name contains "flash/lite/fast" → speed variant, likely lower quality than base model
- "Lite" models: often 1B-7B range, good for simple tasks
- "Flash/Fast" models: optimized inference, may use MoE or quantization
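
These heuristics translate into a few string checks. The sketch below is illustrative only: the prefix list is taken from the naming column of the provider table, while `classify_model_name` and its output keys are hypothetical.

```python
import re

# Prefixes from the provider table's naming column.
PROVIDER_PREFIXES = ("sensenova", "deepseek", "qwen", "doubao", "moonshot",
                     "glm", "ernie", "yi", "minimax", "mimo")
PARAM_COUNT = re.compile(r"(\d+(?:\.\d+)?)b\b", re.I)  # e.g. 6.7B, 72B
SPEED_VARIANT = re.compile(r"flash|lite|fast", re.I)


def classify_model_name(name: str) -> dict:
    """Extract provider prefix, parameter count (in billions), and speed-variant flag."""
    lowered = name.lower()
    provider = next((p for p in PROVIDER_PREFIXES if lowered.startswith(p)), None)
    size = PARAM_COUNT.search(lowered)
    return {
        "provider_prefix": provider,
        "params_b": float(size.group(1)) if size else None,
        "speed_variant": bool(SPEED_VARIANT.search(lowered)),
    }
```

A name that matches nothing returns `None` and `False` fields, which signals that the model needs a search rather than a guess.
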
@@ -0,0 +1,86 @@
# Model Benchmark Data (May 2026)

## Chinese LLM Benchmark (non-linear ReLE)
Source: github.com/jeinlee1991/chinese-llm-benchmark

### General Capability
| Rank | Model | Accuracy | Time | Cost per 1,000 runs (CNY) |
|------|-------|----------|------|---------------------------|
| 28 | MiniMax-M2.7 | 65.1% | 110s | 42.7 |
| 35 | MiMo-V2.5-Pro | ~71.4%* | 56s | 64.3 |

*The MiMo-V2.5-Pro figures come from a separate evaluation article; its rank rose from 35th to 7th.

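Accuracy and cost can be folded into one value-for-money figure: cost per correct answer. This sketch assumes the cost column means CNY per 1,000 benchmark queries, which is an interpretation of the column header rather than something the source states.

```python
def cost_per_correct(cost_per_1000_cny: float, accuracy_pct: float) -> float:
    """CNY spent per correctly answered query, given cost per 1,000 queries and accuracy."""
    if not 0 < accuracy_pct <= 100:
        raise ValueError("accuracy must be in (0, 100]")
    correct_per_1000 = 1000 * accuracy_pct / 100
    return cost_per_1000_cny / correct_per_1000
```

With the table's figures this gives roughly 0.066 CNY per correct answer for MiniMax-M2.7 and roughly 0.090 CNY for MiMo-V2.5-Pro. The second number inherits the uncertainty of the approximate 71.4% accuracy.
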
### Chinese Instruction Following
| Rank | Model | Accuracy | Time |
|------|-------|----------|------|
| 30 | MiniMax-M2.7 | 42.9% | 51s |

### BFCL-V3 (Function Calling)
| Rank | Model | Accuracy |
|------|-------|----------|
| 2 | MiniMax-M2.7 | 76.5% |
| 12 | MiniMax-M2.5 | 70.5% |

## MiMo-V2.5-Pro Key Metrics
Source: Xiaomi (official) + Artificial Analysis

- GDPVal-AA (Elo): 1581, first among open-source models worldwide
- ClawEval: 63.8
- τ³-Bench: 72.9
- SWE-bench Pro: close to the level of Claude Opus 4.6 / GPT-5.4
- Token efficiency: 42% better than Kimi
- Parameters: 1T (Pro), 310B (standard version)
- Context: 1M tokens
- License: MIT (fully open source)
- Coding ability: up 8.8 percentage points over the previous generation (53.1% → 61.9%)

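The coding score moves from 53.1% to 61.9%, a difference of 8.8 percentage points. That is not the same as a relative improvement of 8.8%. A short check of both readings (the function name is illustrative):

```python
from typing import Tuple


def gains(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), each rounded to 0.1."""
    points = round(after_pct - before_pct, 1)
    relative = round((after_pct - before_pct) / before_pct * 100, 1)
    return points, relative
```

For 53.1 → 61.9 this returns 8.8 points and a relative gain of about 16.6%. Say which of the two is meant when quoting vendor numbers.
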
## MiniMax M2.7 Key Metrics
Source: MiniMax (official)

- SWE-bench Pro: 56.22%
- Self-evolution: takes part in its own training via Agent Harness; the model can take on 30-50% of the R&D workload
- Core positioning: flagship agent model
- Status: closed-source commercial API
- Hong Kong stock performance: HKD 886 per share (February 2026)

## Arcee Trinity Large Key Metrics
Source: Arcee AI (official) + technical report

- Parameters: 400B total, 13B active per token (MoE)
- Architecture: AFMoE (Attention-First Mixture-of-Experts)
- Experts: 128 experts, 8 active per token
- Context: 131K tokens
- Generation speed: 200+ tokens/s
- Response latency: sub-3s
- License: Apache 2.0 (fully open source, commercial use allowed)
- Performance: comparable to Llama 4 Maverick 400B and GLM-4.5
- Trained by: Arcee AI + Prime Intellect + DatologyAI
- Positioning: one of the largest open-source models released by a US company

## Quick Reference: Model Tier List (May 2026)

### Tier 1: Top closed-source
- GPT-5.4 / GPT-5.5 (OpenAI)
- Claude Opus 4.6 (Anthropic)
- Gemini 3.1 Pro (Google)

### Tier 1.5: Near-top / strongest open-source
- MiMo-V2.5-Pro (Xiaomi): top tier among open-source models
- Kimi-K2-Thinking (Moonshot AI)
- GLM-5.1 (Zhipu AI)

### Tier 2: Strong commercial
- MiniMax M2.7: top-tier Chinese ability, strong agent ability
- Qwen3.5-Plus (Alibaba)
- DeepSeek V4-Pro

### Tier 2.5: Excellent open-source
- Trinity Large (Arcee): 400B MoE, optimized for English
- Qwen3.5-27B / Qwen3.6-35B
- GLM-4.7 (Zhipu AI)

### Tier 3: Efficient / lightweight
- Trinity Mini (26B, 3B active)
- Gemini 3.1 Flash Lite
- Qwen3.5-Flash