first commit
98
research/arxiv/references/paper-reading-methodology.md
Normal file
@@ -0,0 +1,98 @@
# Deep Paper Reading Methodology

Extracting full technical detail from arXiv papers when PDF parsing tools are unavailable.

## Extraction Fallback Chain

1. **pymupdf (fitz)**: best quality, but needs `uv pip install pymupdf`
2. **pdftotext**: `apt install poppler-utils`, then `pdftotext file.pdf -`
3. **Raw regex on PDF bytes**: `re.findall(rb'\(([^)]+)\)', data)` extracts parenthesized string literals from uncompressed streams; works for metadata/abstract but garbles body
4. **HTML version**: `curl https://arxiv.org/html/{id}v{N}`; cleanest structured extraction; **preferred method**
5. **Abstract page**: `curl https://arxiv.org/abs/{id}` + regex on `<blockquote class="abstract">`
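
The chain above can be sketched as a list of extractors tried in order. This is a minimal sketch, not a fixed implementation: the helper names (`first_successful`, `extract_with_pymupdf`, `extract_with_pdftotext`, `parse_abstract`) and the file name `paper.pdf` are made up for illustration, and the abstract regex tolerates extra class names on the `<blockquote>` as a precaution.

```python
import re
import shutil
import subprocess
from typing import Callable, Iterable, Optional


def extract_with_pymupdf(pdf_path: str) -> Optional[str]:
    """Step 1: PyMuPDF, if the package is installed."""
    try:
        import fitz  # module provided by the pymupdf package
    except ImportError:
        return None
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdftotext(pdf_path: str) -> Optional[str]:
    """Step 2: the pdftotext CLI from poppler-utils, if it is on PATH."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True
    )
    return result.stdout if result.returncode == 0 else None


def parse_abstract(abs_html: str) -> Optional[str]:
    """Step 5: pull the abstract out of an already-downloaded /abs/ page."""
    m = re.search(
        r'<blockquote[^>]*class="abstract[^"]*"[^>]*>(.*?)</blockquote>',
        abs_html,
        re.DOTALL,
    )
    if m is None:
        return None
    text = re.sub(r"<[^>]+>", " ", m.group(1))
    return re.sub(r"\s+", " ", text).strip()


def first_successful(extractors: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the first non-empty result; a failing extractor just falls through."""
    for extract in extractors:
        try:
            text = extract()
        except Exception:
            continue
        if text and text.strip():
            return text
    return None


# Usage sketch (the path "paper.pdf" is hypothetical):
# text = first_successful([
#     lambda: extract_with_pymupdf("paper.pdf"),
#     lambda: extract_with_pdftotext("paper.pdf"),
# ])
```

Each extractor returns `None` when its tool is missing, so the chain degrades without raising.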

## HTML Extraction Patterns (preferred)

The arXiv HTML version (`/html/{id}v{N}`) has structured `<section>` elements with IDs that follow the section numbering. The section titles below come from one example paper and vary from paper to paper:

```
S1 = Introduction
S2 = Problem definition
S3 = Key findings
S4 = Method
S5 = Experiments (S5.SS1, S5.SS2 = subsections)
S6 = Related work
S7 = Conclusion
Appendices: indexed by letter (A, B, C...) in ltx_tocentry
```
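
Because the mapping from ID to title differs per paper, it can help to list the top-level sections before picking one. A minimal sketch, assuming `html` is the downloaded page source and that each section's first heading tag holds its title; `list_top_level_sections` is a hypothetical helper name.

```python
import re
from typing import List, Tuple


def list_top_level_sections(html: str) -> List[Tuple[str, str]]:
    """Return (section_id, heading_text) pairs for ids of the form S1, S2, ..."""
    found = []
    # The closing quote right after the digits excludes subsection ids like S5.SS1.
    for m in re.finditer(r'<section[^>]*\bid="(S\d+)"[^>]*>', html):
        heading = re.search(r"<h\d[^>]*>(.*?)</h\d>", html[m.end():], re.DOTALL)
        title = re.sub(r"<[^>]+>", " ", heading.group(1)) if heading else ""
        found.append((m.group(1), re.sub(r"\s+", " ", title).strip()))
    return found
```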
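
### Section extraction pattern:

```python
import re

# `html` is assumed to hold the page source of /html/{id}v{N} as a str.
# Drop <script>/<style> blocks so their contents do not leak into the text.
html = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', html, flags=re.DOTALL)
# Capture everything from the start of S4 up to the start of S5 (or end of page).
m = re.search(r'<section[^>]*id="S4"[^>]*>(.*?)(?=<section[^>]*id="S5"|$)', html, re.DOTALL)
text = re.sub(r'<[^>]+>', ' ', m.group(1)).strip() if m else ''  # strip tags
text = re.sub(r'\s+', ' ', text)  # collapse whitespace
```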
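
The same idea can be wrapped so that the end of the section is found dynamically instead of hard-coding the next ID. A minimal sketch under the assumption that top-level sections use ids of the form `S<number>`; `extract_section` is a hypothetical helper, and for the last section it simply runs to the end of the page, footer included.

```python
import re
from typing import Optional


def extract_section(html: str, section_id: str) -> Optional[str]:
    """Plain text of one top-level section, or None if the id is absent."""
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.DOTALL)
    start = re.search(
        r'<section[^>]*\bid="%s"[^>]*>' % re.escape(section_id), html
    )
    if start is None:
        return None
    # The next top-level section (id="S<digits>") marks the end; nested
    # subsections such as S4.SS1 do not match and stay inside the slice.
    nxt = re.search(r'<section[^>]*\bid="S\d+"[^>]*>', html[start.end():])
    end = start.end() + nxt.start() if nxt else len(html)
    text = re.sub(r"<[^>]+>", " ", html[start.end():end])
    return re.sub(r"\s+", " ", text).strip()
```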

### Table extraction:

```python
# Flatten each table to one line of ' | '-separated cell text.
tables = re.findall(r'<table[^>]*>(.*?)</table>', html, re.DOTALL)
table_texts = []
for t in tables:
    text = re.sub(r'<[^>]+>', ' | ', t).strip()
    text = re.sub(r'\s+', ' ', text)
    table_texts.append(text)
```
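
Flattening every tag to ` | ` loses row boundaries. When rows matter (e.g. results tables), a row-aware variant is a small step further. A minimal sketch, assuming simple tables without nested `<table>` elements; `table_rows` is a hypothetical helper name.

```python
import re
from typing import List


def table_rows(table_html: str) -> List[List[str]]:
    """Split one table's inner HTML into rows of cleaned cell text."""
    rows = []
    for tr in re.findall(r"<tr\b[^>]*>(.*?)</tr>", table_html, re.DOTALL):
        # \b keeps <thead> from being mistaken for a <th> cell.
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", tr, re.DOTALL)
        rows.append(
            [re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", c)).strip() for c in cells]
        )
    return rows
```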

### Appendix content (avoid TOC duplicates):

Appendix headings appear twice: once in the TOC and once as actual content. Use `positions[-1]` (last occurrence) for the real content. Search by keyword rather than section ID for appendices.

### Targeted keyword search (when section IDs fail):

```python
searches = ['keyword1', 'keyword2']
for s in searches:
    positions = [m.start() for m in re.finditer(re.escape(s), html, re.IGNORECASE)]
    if positions:
        pos = positions[-1]  # last occurrence = actual content, not TOC
        chunk = html[max(0, pos-200):pos+500]
        text = re.sub(r'<[^>]+>', ' ', chunk)
        print(s, '->', re.sub(r'\s+', ' ', text).strip())
```

## Structured Methodology Extraction Template

When the user asks to "learn the method" or do a deep read, extract:

1. **Architecture**: pipeline stages, model sizes, data flow
2. **Training data**: how it's constructed, sources, sizes, prompts used
3. **Negative mining**: strategy for hard negatives, filtering
4. **Loss functions**: exact objective, temperature, why this choice
5. **Training hyperparams**: LR, batch size, epochs, hardware
6. **Inference flow**: online vs offline steps, latency, throughput
7. **Key ablations**: what matters and by how much
8. **Code/models released**: check GitHub repo structure
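
One way to keep a deep read honest is to track which of the eight items are still unfilled. A minimal sketch, with one field per template item; the class name `MethodNotes` and the field names are illustrative choices, not part of any tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MethodNotes:
    """One slot per item in the extraction template; None means not found yet."""
    architecture: Optional[str] = None
    training_data: Optional[str] = None
    negative_mining: Optional[str] = None
    loss_functions: Optional[str] = None
    training_hyperparams: Optional[str] = None
    inference_flow: Optional[str] = None
    key_ablations: Optional[str] = None
    code_and_models: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of template items that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

Calling `missing()` after a first pass gives the list of sections worth a targeted keyword search.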

## GitHub Repo Inspection Pattern

```bash
# Check if repo exists and get stats
curl -sL "https://api.github.com/repos/{owner}/{repo}"

# List top-level structure
curl -sL "https://api.github.com/repos/{owner}/{repo}/contents"

# Check subdirectories
for d in src scripts; do
  curl -sL "https://api.github.com/repos/{owner}/{repo}/contents/$d"
done

# Read README (assumes the default branch is `main`; some repos use `master`)
curl -sL "https://raw.githubusercontent.com/{owner}/{repo}/main/README.md"
```
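
The `/contents` endpoint returns a JSON array whose entries carry `name` and `type` fields, so the listing is easy to condense. A minimal sketch that works on an already-downloaded response body; `summarize_contents` is a hypothetical helper, and the sample payload in the usage comment is made up.

```python
import json
from typing import Dict, List


def summarize_contents(payload: str) -> Dict[str, List[str]]:
    """Group entry names from a /contents response by their "type" field."""
    entries = json.loads(payload)
    if not isinstance(entries, list):
        # A JSON object usually means an error body ({"message": ...})
        # or a single-file response rather than a directory listing.
        return {}
    summary: Dict[str, List[str]] = {}
    for entry in entries:
        summary.setdefault(entry["type"], []).append(entry["name"])
    return summary


# Usage sketch with a made-up payload:
# summarize_contents('[{"name": "src", "type": "dir"}]')
```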

## Semantic Scholar for Citation Context

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:{id}?fields=citationCount,influentialCitationCount"
```
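
The same request can be built and its response read in Python. A minimal sketch that only constructs the URL and parses a response body that has already been fetched; `s2_url` and `citation_summary` are hypothetical helper names, and it assumes the two requested fields appear as top-level keys in the returned JSON.

```python
import json
from typing import Optional, Sequence, Tuple
from urllib.parse import quote, urlencode


def s2_url(
    arxiv_id: str,
    fields: Sequence[str] = ("citationCount", "influentialCitationCount"),
) -> str:
    """URL for the paper-details endpoint used in the curl call above."""
    base = "https://api.semanticscholar.org/graph/v1/paper/"
    query = urlencode({"fields": ",".join(fields)})  # commas become %2C
    return base + quote(f"arXiv:{arxiv_id}", safe=":") + "?" + query


def citation_summary(payload: str) -> Tuple[Optional[int], Optional[int]]:
    """(citationCount, influentialCitationCount); None where a key is absent."""
    data = json.loads(payload)
    return data.get("citationCount"), data.get("influentialCitationCount")
```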
62
research/arxiv/references/skillrouter-methodology.md
Normal file
@@ -0,0 +1,62 @@
# SkillRouter: Key Takeaways for LLM Agent Skill Routing

Paper: https://arxiv.org/abs/2603.22455 (Apr 2026, Alibaba)
Code: https://github.com/zhengyanzhao1997/SkillRouter
Models: https://huggingface.co/pipizhao/SkillRouter-Embedding-0.6B, SkillRouter-Reranker-0.6B

## Core Finding

At ~80K-skill scale with heavy overlap, exposing only name+description causes a 31-44pp drop in Hit@1 versus full skill text. The full body, not the metadata, is THE critical routing signal.

## Architecture (1.2B total)

```
query → SR-Emb-0.6B (bi-encoder) → top-20 from 80K → SR-Rank-0.6B (cross-encoder) → final rank
```
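
The retrieve-then-rerank flow can be sketched with toy vectors. This is a minimal sketch, not the released code: `route` and `cosine` are hypothetical helpers, cosine similarity for the first stage is an assumption, and `rerank_score` stands in for the cross-encoder as any callable that scores a skill name for the current query.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(
    query_vec: Sequence[float],
    skill_vecs: Dict[str, Sequence[float]],
    rerank_score: Callable[[str], float],
    top_k: int = 20,
) -> List[str]:
    """Two-stage routing: cheap vector shortlist, then a reranker on top-k only."""
    # Stage 1 (bi-encoder stand-in): rank all skills by similarity to the query.
    shortlist = sorted(
        skill_vecs,
        key=lambda name: cosine(query_vec, skill_vecs[name]),
        reverse=True,
    )[:top_k]
    # Stage 2 (cross-encoder stand-in): reorder only the shortlist.
    return sorted(shortlist, key=rerank_score, reverse=True)
```

A skill that misses the shortlist can never be recovered by the reranker, which is why the first stage also needs the full skill text.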

## Training Recipe

### Data: 37,979 synthetic (query, skill) pairs
- Skills sampled with category stratification from ~80K pool
- Queries generated by GPT-4o-mini; prompt forbids revealing skill name
- Benchmark skills excluded from training

### Hard Negative Mining (10 per query)
- 4 semantic neighbors (embedding NN)
- 3 BM25 lexical matches
- 2 same-category distractors
- 1 random cross-category
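
The 4/3/2/1 mix can be sketched as quota-filling from four ranked candidate pools. A minimal sketch: `build_negatives` and the pool keys are hypothetical, and skipping candidates already chosen from an earlier source is an assumption about how overlap between sources is handled.

```python
from typing import Dict, List

# Quotas per source, matching the 4/3/2/1 split above (10 in total).
NEGATIVE_MIX = {"semantic": 4, "bm25": 3, "same_category": 2, "random": 1}


def build_negatives(positive: str, pools: Dict[str, List[str]]) -> List[str]:
    """Fill each source's quota in order, skipping the positive and duplicates."""
    chosen: List[str] = []
    for source, quota in NEGATIVE_MIX.items():
        taken = 0
        for skill in pools.get(source, []):
            if taken == quota:
                break
            if skill != positive and skill not in chosen:
                chosen.append(skill)
                taken += 1
    return chosen
```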

### False Negative Filtering (critical: +4.0pp)
Three-layer filter removes ~10% of mined negatives:
1. Name dedup (24,879 pairs)
2. Body trigram Jaccard > 0.6 (13,860 pairs)
3. Embedding cosine > 0.92 (326 pairs)
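
The three layers can be sketched as one predicate applied to each (positive, mined negative) pair. A minimal sketch using the thresholds above; word-level trigrams, case-insensitive name matching, and the dict keys `name`, `body`, `embedding` are all assumptions, since the notes do not say how these are defined.

```python
import math
from typing import Dict, Sequence, Set, Tuple


def trigrams(text: str) -> Set[Tuple[str, ...]]:
    """Word-level trigrams (an assumption; character trigrams are equally plausible)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_false_negative(pos: Dict, neg: Dict,
                      jaccard_max: float = 0.6, cosine_max: float = 0.92) -> bool:
    """True if the mined negative is too close to the positive to keep."""
    if pos["name"].strip().lower() == neg["name"].strip().lower():
        return True  # layer 1: name dedup
    if jaccard(trigrams(pos["body"]), trigrams(neg["body"])) > jaccard_max:
        return True  # layer 2: near-duplicate body
    return cosine(pos["embedding"], neg["embedding"]) > cosine_max  # layer 3
```

If each of the 37,979 training queries contributes 10 mined negatives (an assumption about what "pairs" counts), the three layers remove 39,065 of 379,790 candidates, about 10.3%, in line with the ~10% figure.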

### Loss: Listwise CE >> Pointwise BCE
- Pointwise: 43.3% Hit@1 (fails because homogeneous candidates get similar scores)
- Listwise: 74.0% Hit@1 (compares candidates against each other)
- This is THE key training choice for the reranker
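
The difference between the two objectives is easy to see in code. A minimal sketch of the textbook forms, not the paper's exact formulation: `listwise_ce` applies a softmax over one query's candidate scores, while `pointwise_bce` scores each candidate in isolation.

```python
import math
from typing import Sequence


def listwise_ce(scores: Sequence[float], positive_index: int,
                temperature: float = 1.0) -> float:
    """Cross-entropy of the positive under a softmax over all candidates."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_z - scaled[positive_index]


def pointwise_bce(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy, each candidate judged independently."""
    total = 0.0
    for s, y in zip(scores, labels):
        z = -s if y == 1 else s
        # softplus(z) = log(1 + exp(z)), written in a numerically stable form
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(scores)
```

With the listwise form, the loss only drops when the positive's score rises relative to the other candidates in the same list, which is the comparison the bullet above describes.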

### Hyperparams
- Encoder: InfoNCE τ=0.05, LR 2e-5, batch 8 with gradient accumulation 4 (effective batch 32), 1 epoch, max 2048 tokens
- Reranker: Listwise CE τ=1.0, LR 1e-5, 1 epoch, max 4096 tokens
- Both: single GPU, Qwen3-Emb/Rank-0.6B base

### Input Templates
- Encoder query: `Instruct: ...\nQuery: <text>` (1500 char cap)
- Encoder skill: `<name> | <desc:300> | <body:2500>` (no instruction prefix)
- Reranker: `<Instruct>: ...\n<Query>: ...\n<Document>: <name> | <desc:500> | <body:2000>`
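
The templates can be sketched as plain string builders. A minimal sketch with several assumptions: the function names are hypothetical, `<desc:300>`-style markers are read as character caps, the 1500 cap is applied to the query text only, and the instruction wording (elided as `...` above) is passed in by the caller rather than guessed.

```python
def encoder_query(instruction: str, query: str, cap: int = 1500) -> str:
    """Query side of the bi-encoder; only the query text is truncated."""
    return f"Instruct: {instruction}\nQuery: {query[:cap]}"


def encoder_skill(name: str, desc: str, body: str) -> str:
    """Skill side of the bi-encoder; no instruction prefix."""
    return f"{name} | {desc[:300]} | {body[:2500]}"


def reranker_input(instruction: str, query: str,
                   name: str, desc: str, body: str) -> str:
    """Single cross-encoder input combining query and skill document."""
    document = f"{name} | {desc[:500]} | {body[:2000]}"
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"
```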

## Results

| System | Params | Avg Hit@1 | Speed |
|--------|--------|-----------|-------|
| Qwen3-Emb-8B + Qwen3-Rank-8B | 16B | 68.0% | 0.32 QPS |
| SR-Emb-0.6B + SR-Rank-0.6B | 1.2B | 74.0% | 1.83 QPS |
| SR-Emb-8B + SR-Rank-8B | 16B | 76.0% | - |

## Relevance to Hermes
- Hermes currently exposes ~100 skills via name+desc in the system prompt, with the full SKILL.md loaded on demand
- At current scale this works; at 1000+ skills, a routing layer becomes necessary
- The false-negative filtering concept applies to Hermes skill deduplication
- Listwise reranking matters when many skills look similar (e.g., multiple research skills)