Hermes Agent ccc63d1e70 first commit
2026-05-10 13:52:46 +08:00

name: llm-model-comparison
description: Compare LLM models across benchmarks, pricing, and capabilities. For evaluating new models, recommending providers, and maintaining benchmark knowledge.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: llm, benchmark, model-comparison, evaluation, provider-selection
triggers:
  • user asks "which model is better" or "compare X vs Y"
  • user asks about a new model they saw in the news or a morning digest (早报)
  • user wants to know if they should switch models
  • user asks "what level is this model" or "is X any good"
  • selecting a model provider for a new project

LLM Model Comparison Skill

When to Use

  • User asks about a model they saw in the news, a morning digest (早报), or social media
  • User wants to compare two or more models for a specific use case
  • User asks "should I switch to X" or "is Y worth it"
  • Selecting models for deployment, API integration, or fine-tuning
  • User asks to elaborate on a model or product mentioned in 橘鸦AI早报 or other news digests

Comparison Framework

Step 1: Identify the Question

  • Is this a "what is it?" question → give overview + positioning
  • Is this a "should I use it?" question → compare against user's current stack
  • Is this a "which is better?" question → structured comparison table
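
The three question types above can be routed with a simple keyword check. The following is a minimal sketch, not part of the skill itself: the cue phrases and the `classify` helper are hypothetical and would need tuning against real user phrasing.

```python
from enum import Enum


class QuestionType(Enum):
    OVERVIEW = "overview + positioning"
    FIT = "compare against the user's current stack"
    RANKING = "structured comparison table"


# Hypothetical keyword cues. Ranking cues are checked first, so a question
# that mentions both "compare" and "switch" is treated as a ranking question.
_CUES = [
    (QuestionType.RANKING, ("which is better", " vs ", "compare")),
    (QuestionType.FIT, ("should i", "worth it", "switch")),
]


def classify(question: str) -> QuestionType:
    """Return the response shape for a question; default to an overview."""
    text = f" {question.lower()} "
    for qtype, cues in _CUES:
        if any(cue in text for cue in cues):
            return qtype
    return QuestionType.OVERVIEW
```

Anything that matches no cue falls through to the "what is it?" overview, which is the safest default when intent is unclear.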

Step 2: Gather Data

Use mmx search to find:

  1. Official announcements and benchmark numbers
  2. Third-party evaluations (non-linear ReLE, LMSYS Chatbot Arena, Artificial Analysis)
  3. Community feedback and real-world usage reports

Search patterns:

mmx search query "<model name> benchmark MMLU 评测 2026"
mmx search query "<model name> vs <model name> comparison"
mmx search query "<model name> API pricing performance"
mmx search query "<模型中文名> 评测 benchmark"

For Chinese platform-specific models (SenseNova, Volcengine, Qwen, etc.), search in Chinese:

mmx search query "商汤 sensenova 模型 评测"
mmx search query "火山引擎 doubao 模型列表"

See references/chinese-model-platforms.md for known provider APIs and model catalogs.
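
The search patterns above can be generated from a model name instead of typed by hand. This sketch only builds the query strings and the argument list for `mmx search query`; it does not run anything. The helper names `build_queries` and `as_command` are hypothetical, and the default year simply mirrors the example query above.

```python
from typing import List, Optional


def build_queries(
    model: str,
    rival: Optional[str] = None,
    chinese_name: Optional[str] = None,
    year: int = 2026,
) -> List[str]:
    """Fill the search patterns from this skill with concrete names."""
    queries = [
        f"{model} benchmark MMLU 评测 {year}",
        f"{model} API pricing performance",
    ]
    if rival:
        queries.append(f"{model} vs {rival} comparison")
    if chinese_name:
        # Chinese-platform models are searched under their Chinese name too.
        queries.append(f"{chinese_name} 评测 benchmark")
    return queries


def as_command(query: str) -> List[str]:
    """Argument-list form of: mmx search query "<query>"."""
    return ["mmx", "search", "query", query]
```

Passing the query as a single list element avoids shell quoting problems with spaces and non-ASCII characters.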

Step 3: Structure the Comparison

Use this table format for multi-model comparison:

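| Dimension | Model A | Model B | Model C |
|---|---|---|---|
| Developer | Company | Company | Company |
| Parameter count | XxB | XxB | XxB |
| Architecture | Dense/MoE | Dense/MoE | Dense/MoE |
| Open source | Yes/No | Yes/No | Yes/No |
| Chinese capability | | | |
| Coding capability | | | |
| Agent capability | | | |
| Value for money | Description | Description | Description |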
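
A comparison table like the one above can be rendered from collected data so that every model gets the same rows. This is a minimal sketch under assumed inputs: `render_comparison` and the dictionary layout are hypothetical, and missing values are shown as `n/a` rather than guessed.

```python
from typing import Dict, List


def render_comparison(models: Dict[str, Dict[str, str]], dimensions: List[str]) -> str:
    """Render one row per dimension and one column per model."""
    names = list(models)
    lines = [" | ".join(["Dimension", *names])]
    for dim in dimensions:
        # A dimension with no data for a model is marked n/a, never invented.
        cells = [models[name].get(dim, "n/a") for name in names]
        lines.append(" | ".join([dim, *cells]))
    return "\n".join(lines)


# Placeholder data for illustration only; not real model specifications.
example = {
    "Model A": {"Architecture": "MoE"},
    "Model B": {"Architecture": "Dense", "Parameters": "7B"},
}
print(render_comparison(example, ["Parameters", "Architecture"]))
```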

Step 4: Scenario-Based Recommendation

Always end with a scenario table:

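| Scenario | Recommended model | Rationale |
|---|---|---|
| Everyday Chinese conversation | X | Rationale |
| Coding tasks | Y | Rationale |
| Agent development | Z | Rationale |
| Open-source self-hosting | W | Rationale |
| Cost-sensitive | V | Rationale |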

Step 5: Actionable Next Steps

  • If user already uses a model, compare against their current stack
  • Offer to configure the new model in their environment
  • Note any migration costs or compatibility issues

Key Benchmark Sources

| Source | URL | What it measures |
|---|---|---|
| Artificial Analysis | artificialanalysis.ai | Speed, quality, price |
| LMSYS Chatbot Arena | lmarena.ai | Human preference (Elo) |
| non-linear ReLE | github.com/jeinlee1991/chinese-llm-benchmark | Chinese LLM comprehensive |
| SWE-bench Pro | swebench.com | Coding agent capability |
| BFCL-V3 | gorilla.cs.berkeley.edu | Function calling |
| MMLU | Various | General knowledge |

Elaborating on 橘鸦AI早报 Items

When user says "细说X" or "elaborate on item X" from the daily news digest:

Step 1: Find the source

# List the five most recent cron output files
ls ~/.hermes/cron/output/9733a9cabb44/ | sort | tail -5
# Read the relevant file
cat ~/.hermes/cron/output/9733a9cabb44/<date>.md
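
The same lookup can be done in Python when the digest needs to be read programmatically. This sketch assumes the output files are named so that lexical order matches date order (for example ISO dates), which is what the `sort | tail` command above also relies on; `latest_digest` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_digest(output_dir: Path) -> Optional[Path]:
    """Return the lexically last .md file in output_dir, or None if there is none."""
    files = sorted(output_dir.glob("*.md"))
    return files[-1] if files else None


# Directory taken from the commands above; expanduser resolves the "~".
digest_dir = Path("~/.hermes/cron/output/9733a9cabb44").expanduser()
newest = latest_digest(digest_dir)
if newest is not None:
    print(newest.name)
```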

Step 2: Extract the specific item

Parse the numbered list and identify the item by number.
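
A minimal sketch of that parsing step, assuming each digest item starts on its own line with a number followed by `.`, `、`, or `)`. The actual digest layout is not specified here, so the pattern and the `parse_items` helper are assumptions and may need adjusting.

```python
import re
from typing import Dict

# Matches lines such as "1. text", "2、text" or "3) text".
_ITEM = re.compile(r"^\s*(\d+)[.、)]\s*(.+)$")


def parse_items(text: str) -> Dict[int, str]:
    """Map item number to item text; unnumbered lines extend the previous item."""
    items: Dict[int, str] = {}
    current = None
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            current = int(match.group(1))
            items[current] = match.group(2).strip()
        elif current is not None and line.strip():
            items[current] += " " + line.strip()
    return items
```

Looking up `parse_items(digest_text).get(n)` then returns the requested item, or `None` when the number does not exist, in which case it is better to ask the user than to guess. A continuation line that itself begins with a number and a period would be misread as a new item.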

Step 3: Deep research

Use mmx search to find:

  1. Official announcements and product pages
  2. Technical documentation or blog posts
  3. Community reactions and early adopter feedback
  4. Benchmark data if applicable

Step 4: Structure the response

  • One-line summary of what it is
  • Detailed breakdown (features, specs, implications)
  • Comparison with alternatives if relevant
  • Actionable recommendation (try it? wait? skip?)

Pitfalls

Don't compare apples to oranges

  • MoE models (e.g., 400B total, 13B active) are not comparable to dense models with the same total parameter count
  • Always note activated parameters for MoE models
  • Pricing varies wildly: per-token vs per-request vs subscription
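
Two small calculations make these comparisons concrete: the share of an MoE model's parameters that is active per token, and a per-request price converted to a per-million-token price. The function names are hypothetical, and the average tokens per request is an assumption the caller has to supply from real usage.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of an MoE model's parameters that is active per token."""
    return active_params_b / total_params_b


def per_request_to_per_million_tokens(price_per_request: float, avg_tokens_per_request: float) -> float:
    """Convert a per-request price to a price per one million tokens."""
    return price_per_request / avg_tokens_per_request * 1_000_000


# The 400B-total / 13B-active example from the list above.
print(f"{active_fraction(400, 13):.2%}")
```

For the 400B/13B example the active share is 13 / 400, or 3.25%, which is why such a model should be compared with dense models on active rather than total parameters.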

Benchmark ≠ real-world performance

  • Benchmark scores don't capture latency, rate limits, or availability
  • Chinese benchmark scores may not reflect English performance and vice versa
  • Agent benchmarks (SWE-bench, τ³-Bench) are more relevant for agentic use cases than MMLU

Free tier traps

  • "Free" models on platforms may have rate limits, latency, or availability issues
  • Check whether the free offer is temporary (e.g., "一周免费", free for one week) before recommending
  • Self-hosted "free" models still have compute costs
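
The compute cost of a self-hosted model can be estimated from the hourly hardware price and sustained throughput. This is a rough sketch: the function name is hypothetical, the inputs are illustrative rather than measured, and it ignores idle time, storage, and operations overhead.

```python
def self_hosted_cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Hardware cost per one million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Illustrative inputs only: 2.0 currency units per hour at 500 tokens/second.
cost = self_hosted_cost_per_million_tokens(2.0, 500)
print(f"{cost:.2f} per 1M tokens")
```

Because real utilization is usually well below 100%, the effective cost per token is higher than this figure; it is a lower bound to set against hosted API pricing.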

Don't over-hype new releases

  • New model announcements often cherry-pick favorable benchmarks
  • Wait for third-party evaluations before making strong claims
  • If the user saw it in a morning digest (早报) or other news, note that it is worth watching but does not necessarily justify switching

ALWAYS use mmx search, NOT curl/browser

  • Never fall back to curl-based scraping (Google, Baidu, DuckDuckGo) for model research: these sites all block the requests or return empty results
  • Never try browser navigation for model research — sandbox issues are common and pages are SPAs
  • mmx search is the only reliable research tool. If it fails, say so and give your best assessment from training data
  • Do NOT attempt 10+ curl variations hoping one works — one mmx search call is worth 20 failed curl attempts
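
The "one call, no fallback" rule can be sketched as a thin wrapper around a single command invocation. This assumes `mmx search query "<query>"` prints its results to stdout and exits non-zero on failure, which is not confirmed here; `run_search` and its parameters are hypothetical.

```python
import subprocess
from typing import Optional, Sequence


def run_search(
    query: str,
    base_cmd: Sequence[str] = ("mmx", "search", "query"),
    timeout: float = 60.0,
) -> Optional[str]:
    """Run one search. Return its output, or None on any failure (no retries, no curl)."""
    try:
        result = subprocess.run(
            [*base_cmd, query],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or too slow: report failure to the caller.
        return None
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return result.stdout
```

When `run_search` returns `None`, the caller should state that the search failed and give a best assessment from training data rather than trying other tools.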

Current User Stack (Reference)

  • Primary model: MiMo 2.5 Pro (via Xiaomi API)
  • Also available: MiniMax M2.7
  • Hermes Agent: v0.12.0
  • Use case: Agent tasks, coding, Chinese content

References

  • See references/model-benchmarks-2026-05.md for curated benchmark data
  • See references/chinese-model-platforms.md for Chinese AI provider APIs, model naming conventions, and research heuristics