SkillRouter: Key Takeaways for LLM Agent Skill Routing

Paper: https://arxiv.org/abs/2603.22455 (Apr 2026, Alibaba)
Code: https://github.com/zhengyanzhao1997/SkillRouter
Models: https://huggingface.co/pipizhao/SkillRouter-Embedding-0.6B, SkillRouter-Reranker-0.6B

Core Finding

At a scale of ~80K skills with heavy overlap, exposing only name+description to the router causes a 31-44pp drop in Hit@1 compared with exposing the full skill text. The full skill body, not the metadata, is the critical routing signal.

Architecture (1.2B total)

query → SR-Emb-0.6B (bi-encoder) → top-20 from 80K → SR-Rank-0.6B (cross-encoder) → final rank
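
The retrieve-then-rerank flow above can be sketched as follows. This is a minimal illustration, not the released SkillRouter code: `route`, `embed_score`, and `rerank_score` are hypothetical names, and a toy word-overlap scorer stands in for both models so the example runs without weights.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]

def route(query: str, skills: Dict[str, str], embed_score: Scorer,
          rerank_score: Scorer, top_k: int = 20) -> List[str]:
    """Rank skill names: cheap retrieval first, then rerank only the shortlist."""
    # Stage 1 (bi-encoder stand-in): score every skill, keep the top_k.
    shortlist = sorted(skills, key=lambda n: embed_score(query, skills[n]),
                       reverse=True)[:top_k]
    # Stage 2 (cross-encoder stand-in): rescore only the shortlist.
    return sorted(shortlist, key=lambda n: rerank_score(query, skills[n]),
                  reverse=True)

def toy_overlap(query: str, text: str) -> float:
    """Toy scorer (assumption): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

skills = {
    "pdf-merge": "merge several pdf files into one",
    "pdf-split": "split a pdf file into pages",
    "csv-clean": "clean csv data",
}
ranked = route("merge two pdf files", skills, toy_overlap, toy_overlap, top_k=2)
print(ranked)  # ['pdf-merge', 'pdf-split']
```

In the real pipeline the first stage would be a nearest-neighbor lookup over precomputed skill embeddings, so only the 20 shortlisted skills pay the cross-encoder cost.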

Training Recipe

Data: 37,979 synthetic (query, skill) pairs

Skills sampled with category stratification from ~80K pool
Queries generated by GPT-4o-mini; prompt forbids revealing skill name
Benchmark skills excluded from training

Hard Negative Mining (10 per query)

4 semantic neighbors (embedding NN)
3 BM25 lexical matches
2 same-category distractors
1 random cross-category
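
One way to assemble the 4/3/2/1 mix above is sketched below. It assumes each source has already produced a ranked list of candidate skill IDs. The `pools` layout, the source names, and `build_negatives` are hypothetical, not taken from the paper's code.

```python
from typing import Dict, List, Sequence, Tuple

# Quotas from the notes above: 4 semantic, 3 BM25, 2 same-category, 1 random.
NEGATIVE_MIX: Tuple[Tuple[str, int], ...] = (
    ("semantic", 4), ("bm25", 3), ("same_category", 2), ("random", 1),
)

def build_negatives(positive_id: str, pools: Dict[str, Sequence[str]],
                    mix: Tuple[Tuple[str, int], ...] = NEGATIVE_MIX) -> List[str]:
    """Fill each source's quota in order, skipping the positive and duplicates."""
    chosen: List[str] = []
    seen = {positive_id}
    for source, quota in mix:
        taken = 0
        for skill_id in pools.get(source, ()):
            if taken == quota:
                break
            if skill_id in seen:
                continue  # already used, or it is the positive itself
            chosen.append(skill_id)
            seen.add(skill_id)
            taken += 1
    return chosen

pools = {
    "semantic": ["pos", "s1", "s2", "s3", "s4", "s5"],
    "bm25": ["s1", "b1", "b2", "b3"],
    "same_category": ["b1", "c1", "c2"],
    "random": ["r1"],
}
negatives = build_negatives("pos", pools)
print(negatives)
```

Deduplicating across sources matters because a semantic neighbor is often also a BM25 match. Without the `seen` set the same skill could fill two slots.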

False Negative Filtering (critical: +4.0pp)

Three-layer filter removes ~10% of mined negatives:

Name dedup (24,879 pairs)
Body trigram Jaccard > 0.6 (13,860 pairs)
Embedding cosine > 0.92 (326 pairs)
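
The three checks above can be sketched as a single predicate. The thresholds come from the notes. Everything else is an assumption for illustration: word-level trigrams (the notes do not say whether trigrams are over words or characters), case-insensitive name matching, and the dictionary layout with `name`, `body`, and `emb` keys.

```python
import math
from typing import Dict, List, Set, Tuple

def word_trigrams(text: str) -> Set[Tuple[str, ...]]:
    """Set of consecutive 3-word windows (assumption: word-level trigrams)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def is_false_negative(pos: Dict, neg: Dict,
                      jaccard_max: float = 0.6, cosine_max: float = 0.92) -> bool:
    """True if the mined negative looks like a duplicate of the positive skill."""
    # Layer 1: same name (assumption: compared case-insensitively, trimmed).
    if pos["name"].strip().lower() == neg["name"].strip().lower():
        return True
    # Layer 2: near-identical body text.
    if jaccard(word_trigrams(pos["body"]), word_trigrams(neg["body"])) > jaccard_max:
        return True
    # Layer 3: near-identical embedding.
    return cosine(pos["emb"], neg["emb"]) > cosine_max

pos = {"name": "PDF Merge", "body": "merge several pdf files into one output file",
       "emb": [1.0, 0.0]}
clone = {"name": "pdf-combine", "body": pos["body"], "emb": [0.0, 1.0]}
distinct = {"name": "img-gray", "body": "convert images to grayscale quickly",
            "emb": [0.0, 1.0]}
print(is_false_negative(pos, clone), is_false_negative(pos, distinct))  # True False
```

Ordering the checks from cheapest to most expensive mirrors the counts above: most duplicates are caught by name, and the embedding check only mops up the remainder.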

Loss: Listwise CE >> Pointwise BCE

Pointwise: 43.3% Hit@1 (fails because homogeneous candidates get similar scores)
Listwise: 74.0% Hit@1 (compares candidates against each other)
This is THE key training choice for the reranker
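
The contrast between the two losses can be made concrete with a small sketch. These are the standard textbook forms of pointwise binary cross-entropy and listwise softmax cross-entropy, written in plain Python. They are not copied from the paper's training code, and the function names are illustrative.

```python
import math
from typing import List

def pointwise_bce(scores: List[float], positive_index: int) -> float:
    """Each candidate is judged alone: sigmoid(score) against a 0/1 label."""
    total = 0.0
    for i, s in enumerate(scores):
        p = 1.0 / (1.0 + math.exp(-s))
        label = 1.0 if i == positive_index else 0.0
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(scores)

def listwise_ce(scores: List[float], positive_index: int,
                temperature: float = 1.0) -> float:
    """Candidates compete: softmax over the whole list, CE on the positive."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[positive_index]

base = [2.0, 1.0, 0.0]
shifted = [s + 10.0 for s in base]  # same ordering and margins, higher level
print(listwise_ce(base, 0), listwise_ce(shifted, 0))      # equal up to float rounding
print(pointwise_bce(base, 0), pointwise_bce(shifted, 0))  # differ
```

The listwise loss depends only on score differences within the candidate list, so the gradient pushes the correct skill above its look-alikes. The pointwise loss rewards absolute calibration, which gives little signal for separating near-identical candidates.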

Hyperparams

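Encoder: InfoNCE loss with τ=0.05, learning rate 2e-5, batch size 8 with gradient accumulation 4 (effective batch 32), 1 epoch, max 2048 tokens
Reranker: Listwise CE with τ=1.0, learning rate 1e-5, 1 epoch, max 4096 tokens
Both: single GPU; base models are Qwen3-Emb-0.6B (encoder) and Qwen3-Rank-0.6B (reranker)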
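
To see why the encoder temperature is so much smaller than the reranker's, consider the sketch below. It is the generic InfoNCE form for one query over cosine similarities, not the paper's implementation, and the similarity values are made up for illustration.

```python
import math
from typing import List

def info_nce(pos_sim: float, neg_sims: List[float], temperature: float = 0.05) -> float:
    """InfoNCE for one query: softmax CE over similarities divided by temperature."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

# Cosine similarities live in [-1, 1], so hard negatives sit close to the positive.
pos_sim, neg_sims = 0.8, [0.7, 0.6]
sharp = info_nce(pos_sim, neg_sims, temperature=0.05)
flat = info_nce(pos_sim, neg_sims, temperature=1.0)
print(sharp < flat)  # True
```

With τ=0.05 a similarity gap of 0.1 becomes a logit gap of 2, so the softmax can distinguish close neighbors. At τ=1.0 the same gap barely moves the distribution, and the loss stays near log(3) however the three candidates are ordered.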

Input Templates

Encoder query: Instruct: ...\nQuery: <text> (1500 char cap)
Encoder skill: <name> | <desc:300> | <body:2500> (no instruction prefix)
Reranker: <Instruct>: ...\n<Query>: ...\n<Document>: <name> | <desc:500> | <body:2000>
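
The templates above translate into simple string builders. In this sketch the numeric caps are assumed to be character counts (the notes state "char" only for the query cap), and the instruction text is left as a parameter because it is elided in the notes. The function names are hypothetical.

```python
def clip(text: str, max_chars: int) -> str:
    """Truncate to max_chars characters (assumption: caps count characters)."""
    return text[:max_chars]

def encoder_query(instruction: str, query: str, max_chars: int = 1500) -> str:
    # The query side carries an instruction prefix.
    return f"Instruct: {instruction}\nQuery: {clip(query, max_chars)}"

def encoder_skill(name: str, desc: str, body: str) -> str:
    # The skill side has no instruction prefix.
    return f"{name} | {clip(desc, 300)} | {clip(body, 2500)}"

def reranker_input(instruction: str, query: str,
                   name: str, desc: str, body: str) -> str:
    document = f"{name} | {clip(desc, 500)} | {clip(body, 2000)}"
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"

skill_text = encoder_skill("pdf-merge", "d" * 400, "b" * 3000)
print(len(skill_text))  # 2815
```

Note that the two models split the budget differently: the encoder sees more of the body (2500 vs 2000), while the reranker sees more of the description (500 vs 300).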

Results

System	Params	Avg Hit@1	Speed
Qwen3-Emb-8B + Qwen3-Rank-8B	16B	68.0%	0.32 QPS
SR-Emb-0.6B + SR-Rank-0.6B	1.2B	74.0%	1.83 QPS
SR-Emb-8B + SR-Rank-8B	16B	76.0%	-

Relevance to Hermes

Hermes currently exposes ~100 skills via name+desc in the system prompt and loads the full SKILL.md on demand
At current scale this works; at 1000+ skills, a routing layer becomes necessary
False-negative filtering concept applies to Hermes skill deduplication
Listwise reranking matters when many skills look similar (e.g., multiple research skills)
