ephron_ren/agent-skills

Hermes Agent ccc63d1e70 first commit

2026-05-10 13:52:46 +08:00

Field	Value
name	llm-model-comparison
description	Compare LLM models across benchmarks, pricing, and capabilities. For evaluating new models, recommending providers, and maintaining benchmark knowledge.
version	1.0.0
author	Hermes Agent
license	MIT
metadata	hermes
triggers	

LLM Model Comparison Skill

When to Use

User asks about a model they saw in the news, a morning news digest (早报), or on social media
User wants to compare two or more models for a specific use case
User asks "should I switch to X" or "is Y worth it"
Selecting models for deployment, API integration, or fine-tuning
User asks to elaborate on a model or product mentioned in 橘鸦AI早报 or other news digests

Comparison Framework

Step 1: Identify the Question

Is this a "what is it?" question → give overview + positioning
Is this a "should I use it?" question → compare against user's current stack
Is this a "which is better?" question → structured comparison table

Step 2: Gather Data

Use mmx search to find:

Official announcements and benchmark numbers
Third-party evaluations (non-linear benchmark, LMSYS, Artificial Analysis)
Community feedback and real-world usage reports

Search patterns:

mmx search query "<model name> benchmark MMLU 评测 2026"
mmx search query "<model name> vs <model name> comparison"
mmx search query "<model name> API pricing performance"
mmx search query "<model name in Chinese> 评测 benchmark"

For Chinese platform-specific models (SenseNova, Volcengine, Qwen, etc.), search in Chinese:

mmx search query "商汤 sensenova 模型 评测"
mmx search query "火山引擎 doubao 模型列表"

See references/chinese-model-platforms.md for known provider APIs and model catalogs.
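
The search patterns above can be generated for any model name. The sketch below only prints the commands instead of running them, so they can be reviewed first. `build_queries` is a hypothetical helper name, and the `mmx search query` invocation is copied from the patterns above rather than verified against the CLI.

```shell
# Hypothetical helper: print the Step 2 search commands for one model,
# plus a head-to-head query when a second model name is given.
build_queries() {
  local model="$1" other="${2:-}"
  printf 'mmx search query "%s benchmark MMLU 评测 2026"\n' "$model"
  if [ -n "$other" ]; then
    printf 'mmx search query "%s vs %s comparison"\n' "$model" "$other"
  fi
  printf 'mmx search query "%s API pricing performance"\n' "$model"
}

# Example: print the three queries for a two-model comparison.
build_queries "Model A" "Model B"
```

Printing rather than executing keeps the helper free of side effects; the output can be piped to `sh` once the queries look right.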

Step 3: Structure the Comparison

Use this table format for multi-model comparison:

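Dimension	Model A	Model B	Model C
Developer	Company	Company	Company
Parameter count	XxB	XxB	XxB
Architecture	Dense/MoE	Dense/MoE	Dense/MoE
Open source	✅/❌	✅/❌	✅/❌
Chinese-language ability	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Coding ability	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Agent capability	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Cost-effectiveness	Description	Description	Description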
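
Data gathered in Step 2 usually arrives one model at a time, while the table above has one column per model. A minimal sketch of the transposition with `awk` follows. The input layout (one tab-separated record per model: name, developer, parameter count, architecture), the row labels, and the function name `transpose_models` are all assumptions made for illustration.

```shell
# Hypothetical helper: read one tab-separated record per model on stdin
# (name, developer, parameter count, architecture) and print a table
# with one row per dimension and one column per model.
transpose_models() {
  awk -F'\t' -v OFS='\t' '
    BEGIN { split("Dimension\tDeveloper\tParameter count\tArchitecture", label, "\t") }
    { for (i = 1; i <= NF; i++) cell[i, NR] = $i }   # store field i of model NR
    END {
      for (i = 1; i <= 4; i++) {
        row = label[i]
        for (j = 1; j <= NR; j++) row = row OFS cell[i, j]
        print row
      }
    }'
}

# Example with placeholder values.
printf 'Model A\tCompany\tXxB\tMoE\nModel B\tCompany\tXxB\tDense\n' | transpose_models
```

Fields missing from a record come out as empty cells, which makes gaps in the gathered data visible in the table.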

Step 4: Scenario-Based Recommendation

Always end with a scenario table:

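Scenario	Recommended model	Rationale
Everyday Chinese conversation	X	Rationale
Coding tasks	Y	Rationale
Agent development	Z	Rationale
Open-source self-hosting	W	Rationale
Cost-sensitive use	V	Rationale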

Step 5: Actionable Next Steps

If user already uses a model, compare against their current stack
Offer to configure the new model in their environment
Note any migration costs or compatibility issues

Key Benchmark Sources

Source	URL	What it measures
Artificial Analysis	artificialanalysis.ai	Speed, quality, price
LMSYS Chatbot Arena	lmarena.ai	Human preference (Elo)
non-linear ReLE	github.com/jeinlee1991/chinese-llm-benchmark	Chinese LLM comprehensive
SWE-bench Pro	swebench.com	Coding agent capability
BFCL-V3	gorilla.cs.berkeley.edu	Function calling
MMLU	Various	General knowledge

Elaborating on 橘鸦AI早报 Items

When the user says "细说X" ("tell me more about X") or "elaborate on item X" about an item in the daily news digest:

Step 1: Find the source

# List the five most recent cron output files
ls ~/.hermes/cron/output/9733a9cabb44/ | sort | tail -5
# Read the relevant file
cat ~/.hermes/cron/output/9733a9cabb44/<date>.md
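
When the user does not name a date, the newest digest is the likely target. The sketch below picks it automatically. It assumes the file names sort chronologically (for example, date-based names), and `latest_digest` is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the newest digest file.
# Assumes file names sort chronologically, e.g. date-based names.
latest_digest() {
  local dir="${1:-$HOME/.hermes/cron/output/9733a9cabb44}"
  local latest
  latest=$(ls "$dir" | sort | tail -n 1)
  if [ -n "$latest" ]; then
    printf '%s/%s\n' "$dir" "$latest"
  fi
}
```

Calling `cat "$(latest_digest)"` then reads the most recent digest; passing a directory as the first argument overrides the default path.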

Step 2: Extract the specific item

Parse the numbered list and identify the item by number.
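
The extraction can be sketched with `awk`. This assumes each digest item starts a line with its number followed by `.` or `)` and a space, which is a guess about the digest format; `extract_item` is a hypothetical helper name.

```shell
# Hypothetical helper: print item N (and its continuation lines) from a digest.
# $1 = item number, $2 = digest file.
# Assumes items start a line as "N. text" or "N) text".
extract_item() {
  awk -v n="$1" '
    /^[0-9]+[.)] / {                  # a new numbered item starts here
      split($0, head, /[.)]/)         # head[1] holds the item number
      printing = (head[1] + 0 == n + 0)
    }
    printing { print }                # emit the item and any lines that follow it
  ' "$2"
}
```

Lines after the requested item are printed until the next numbered item begins, so multi-line items survive intact; a heading that directly follows the item would be included as well.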

Step 3: Deep research