first commit

This commit is contained in:
Hermes Agent
2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions

sn-search-academic/SKILL.md Normal file

@@ -0,0 +1,287 @@
---
name: sn-search-academic
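description: "Multi-source academic search: ArXiv, Semantic Scholar (with citation counts), PubMed, and Wikipedia. Supports section-by-section reading of ArXiv HTML full text and PMC full text. Trigger words: academic papers, literature review, citation data, biomedical literature, encyclopedia lookup. A one-stop multi-source tool."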
---
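# sn-search-academic - Academic Search
Searches four academic platforms (ArXiv, Semantic Scholar, PubMed, and Wikipedia) and provides **section-level full-text reading** for ArXiv and PMC. Everything is free; some scripts accept an optional API key that raises the rate limit.
## Dependencies
Install this skill's Python dependencies before running the scripts: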
```bash
python3 -m pip install -r skills/sn-search-academic/requirements.txt
```
If the project uses a `uv` environment:
```bash
uv pip install -r skills/sn-search-academic/requirements.txt
```
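`arxiv_paper.py` needs `beautifulsoup4` to parse ArXiv HTML; the other scripts mainly rely on `httpx` to make requests.
## Available Scripts
| Script | Platform | Purpose | API key |
|------|------|------|---------|
| `arxiv_search.py` | ArXiv | Preprint search; supports author, title, and ID queries | Not required |
| `arxiv_paper.py` | ArXiv HTML | Reads an ArXiv paper's full text section by section | Not required |
| `semantic_scholar_search.py` | Semantic Scholar | Cross-discipline search with citation counts and TLDR | Not required (a key raises the limit) |
| `semantic_scholar_refs.py` | Semantic Scholar | Citation tracing: looks up a paper's references (backward) or the papers citing it (forward) | Not required (a key raises the limit) |
| `pubmed_search.py` | PubMed | Biomedical literature search with structured abstracts and PMC IDs | Not required (a key raises the limit) |
| `pmc_paper.py` | PMC | Reads a PMC open-access paper's full text section by section | Not required (a key raises the limit) |
| `wikipedia_search.py` | Wikipedia | Encyclopedia article search; multilingual | Not required |
## Parameters
### arxiv_search.py
```bash
python3 scripts/arxiv_search.py <query> [options]
```
| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (may be omitted when `--id-list` is used) | — |
| `--limit`, `-n` | Number of results to return | 10 |
| `--category`, `-c` | ArXiv category filter (see "ArXiv Category Quick Reference" below) | — |
| `--sort` | Sort order: `relevance`, `date` (last updated), `submitted` (submission date) | relevance |
| `--author`, `-a` | Filter by author; separate multiple authors with commas | — |
| `--title-only` | Search in titles only | — |
| `--id-list` | Fetch metadata directly by arXiv ID, comma-separated | — |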
```bash
python3 scripts/arxiv_search.py "transformer attention mechanism" --limit 5
python3 scripts/arxiv_search.py "diffusion model" --author "ho jonathan" --category cs.CV
python3 scripts/arxiv_search.py --id-list "2409.05591,2301.07041"
```
**Output fields**: `title`, `url`, `snippet` (abstract), `arxiv_id`, `authors`, `published`, `updated`, `pdf_url`, `html_url`, `categories`, `primary_category`, `comment`, `journal_ref`, `doi`
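To make the option semantics concrete, the sketch below mirrors how `--title-only`, `--author`, and `--category` are combined into a single arXiv query string, following `build_search_query` in `scripts/arxiv_search.py`. It is an illustration of that logic under the field prefixes the script documents (`all:`, `ti:`, `au:`, `cat:`), not a separate API.

```python
from __future__ import annotations


def build_search_query(
    query: str,
    category: str | None = None,
    author: str | None = None,
    title_only: bool = False,
) -> str:
    """Combine the CLI options into one arXiv search_query string."""
    # --title-only switches the main field prefix from "all" to "ti".
    field = "ti" if title_only else "all"
    parts = [f"{field}:{query}"]
    if author:
        # Comma-separated authors are OR-ed together inside parentheses.
        terms = [f"au:{a.strip()}" for a in author.split(",") if a.strip()]
        if terms:
            parts.append(f"({' OR '.join(terms)})")
    if category:
        parts.append(f"cat:{category}")
    # All groups are joined with AND (boolean operators must be uppercase).
    return " AND ".join(parts)


print(build_search_query("diffusion model", category="cs.CV", author="ho jonathan"))
# → all:diffusion model AND (au:ho jonathan) AND cat:cs.CV
```

Multiple authors widen the match (OR), while the category and the main query narrow it (AND).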
### arxiv_paper.py
Reads the body of an ArXiv paper section by section (the paper must have an HTML version; most papers after 2020 do).
```bash
python3 scripts/arxiv_paper.py <arxiv_id> [--section SECTION_NAME]
```
| Parameter | Description |
|------|------|
| `arxiv_id` | arXiv ID, e.g. `2409.05591` or `2409.05591v2` |
| `--section`, `-s` | Section name (case-insensitive, partial match supported). If omitted, all sections are listed. |
```bash
python3 scripts/arxiv_paper.py 2409.05591 # list sections
python3 scripts/arxiv_paper.py 2409.05591 --section introduction
python3 scripts/arxiv_paper.py 2409.05591 --section method
```
**Output fields when listing sections**: `arxiv_id`, `abs_url`, `html_url`, `pdf_url`, `section_count`, `sections[]` (name, level)
**Output fields when reading a section**: `arxiv_id`, `section`, `level`, `content`, `char_count`
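The "case-insensitive, partial match" rule for `--section` can be sketched as follows. This follows the matching order used by `_match_section` in `scripts/arxiv_paper.py` (exact match first, then substring match after stripping a leading section number); the section names in the example are made up for illustration.

```python
from __future__ import annotations

import re


def match_section(names: list[str], query: str) -> str | None:
    """Return the first section name matching the query, or None."""
    q = query.lower().strip()

    def clean(name: str) -> str:
        # Drop a numeric prefix such as "1 " or "1. " before comparing.
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Pass 1: exact match, with or without the numeric prefix.
    for name in names:
        if name.lower() == q or clean(name) == q:
            return name
    # Pass 2: the query appears anywhere in the cleaned name.
    for name in names:
        if q in clean(name):
            return name
    return None


# Hypothetical section list, as a list-sections call might return it.
sections = ["Abstract", "1 Introduction", "2 Related Work", "3 Methodology", "4 Experiments"]
print(match_section(sections, "method"))  # → 3 Methodology
```

Because the first match wins, a short query such as `method` selects the earliest section whose title contains it.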
### semantic_scholar_search.py
```bash
python3 scripts/semantic_scholar_search.py <query> [options]
```
| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (required) | — |
| `--limit`, `-n` | Number of results to return | 10 |
| `--api-key` | Semantic Scholar API key (can also be set via the `S2_API_KEY` environment variable) | — |
```bash
python3 scripts/semantic_scholar_search.py "transformer architecture" --limit 5
python3 scripts/semantic_scholar_search.py "RLHF language model" --limit 10
```
**Output fields**: `title`, `url`, `snippet` (abstract; falls back to tldr when missing), `tldr`, `authors`, `year`, `venue`, `publication_date`, `citation_count`, `influential_citation_count`, `reference_count`, `is_open_access`, `open_access_pdf`, `fields_of_study`, `publication_types`, `doi`, `arxiv_id`, `paper_id`
### semantic_scholar_refs.py
Citation tracing: given a paper, looks up its references (backward) or the papers that cite it (forward).
```bash
python3 scripts/semantic_scholar_refs.py <paper_id> <direction> [options]
```
| Parameter | Description | Default |
|------|------|--------|
| `paper_id` | Paper identifier: S2 ID, DOI (`10.xxxx/...`), ArXiv ID (`2301.07041`), or PMID (`PMID:12345678`) | — |
| `direction` | `references` = references (backward); `citations` = citing papers (forward) | — |
| `--limit`, `-n` | Number of results to return | 20 |
| `--min-citations` | Minimum citation count filter | 0 |
| `--year-min` | Earliest year filter | — |
| `--year-max` | Latest year filter | — |
| `--api-key` | Semantic Scholar API key (optional) | — |
```bash
# See which papers a given paper cites (backward, to find foundational work)
python3 scripts/semantic_scholar_refs.py 2301.07041 references --limit 10
# See who cites a given paper (forward, to find follow-up work)
python3 scripts/semantic_scholar_refs.py 2301.07041 citations --limit 10 --min-citations 50
# Look up citations by DOI, limited to 2023 onward
python3 scripts/semantic_scholar_refs.py "10.1038/s41586-024-07487-w" citations --year-min 2023
# Find highly cited references
python3 scripts/semantic_scholar_refs.py ARXIV:2005.14165 references --min-citations 100 --limit 5
```
**Output fields**: `title`, `url`, `snippet` (abstract/tldr), `authors`, `year`, `venue`, `citation_count`, `influential_citation_count`, `is_open_access`, `open_access_pdf`, `doi`, `arxiv_id`, `paper_id`, `citation_contexts` (sentences giving the citation context, at most 3), `citation_intents` (citation intent)
**Additional output fields**: `source_paper` (title/year/citation count of the queried paper), `total_available` (total number of papers in that direction), `returned` (number returned after filtering)
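The filter options can be pictured as a post-filter over the returned items. The sketch below is an assumption about their semantics (inclusive bounds, items with an unknown year dropped once a year filter is set), not the actual code of `semantic_scholar_refs.py`; `filter_items` and the sample items are hypothetical.

```python
from __future__ import annotations


def filter_items(
    items: list[dict],
    min_citations: int = 0,
    year_min: int | None = None,
    year_max: int | None = None,
) -> list[dict]:
    """Keep items that satisfy the citation and year bounds (inclusive)."""
    kept = []
    for item in items:
        cites = item.get("citation_count") or 0  # treat a missing count as 0
        year = item.get("year")
        if cites < min_citations:
            continue
        # Assumption: an unknown year fails any explicit year bound.
        if year_min is not None and (year is None or year < year_min):
            continue
        if year_max is not None and (year is None or year > year_max):
            continue
        kept.append(item)
    return kept


# Made-up items using the field names listed above.
sample = [
    {"title": "P1", "year": 2022, "citation_count": 120},
    {"title": "P2", "year": 2024, "citation_count": 15},
    {"title": "P3", "year": None, "citation_count": None},
    {"title": "P4", "year": 2025, "citation_count": 3},
]
print([it["title"] for it in filter_items(sample, min_citations=10, year_min=2023)])
# → ['P2']
```

Under this reading, `returned` would be the length of the filtered list, while `total_available` counts papers before filtering.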
### pubmed_search.py
Supports PubMed query syntax, such as field tags (`cancer[Title]`) and date ranges (`2024[pdat]`).
```bash
python3 scripts/pubmed_search.py <query> [options]
```
| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords; PubMed query syntax is supported | — |
| `--limit`, `-n` | Number of results to return | 10 |
| `--api-key` | NCBI API key (optional; raises the limit from 3 req/s to 10 req/s) | — |
```bash
python3 scripts/pubmed_search.py "CRISPR gene editing" --limit 5
python3 scripts/pubmed_search.py "Alzheimer[Title] AND treatment[Title]" --limit 5
```
**Output fields**: `title`, `url`, `snippet` (structured abstract), `authors`, `pmid`, `pmc_id` (when set, it can be passed to `pmc_paper.py`), `pmc_url`, `journal`, `pub_date`, `volume`, `issue`, `pages`, `keywords`, `pub_types`, `doi`
### pmc_paper.py
Reads PubMed Central open-access full text (about 7 million biomedical papers, roughly 35% of PubMed). Papers whose `pmc_id` is `null` in `pubmed_search.py` results cannot be used with this tool.
```bash
python3 scripts/pmc_paper.py <pmc_id> [--section SECTION_NAME]
python3 scripts/pmc_paper.py --pmid <pmid> [--section SECTION_NAME]
```
| Parameter | Description |
|------|------|
| `pmc_id` | PMC ID, e.g. `PMC11119143` or `11119143` |
| `--pmid` | PubMed ID, converted to a PMC ID automatically (use either this or `pmc_id`) |
| `--section`, `-s` | Section name (case-insensitive, partial match supported). If omitted, all sections are listed. |
| `--api-key` | NCBI API key (optional) |
```bash
python3 scripts/pmc_paper.py PMC11119143 # list sections
python3 scripts/pmc_paper.py PMC11119143 --section introduction
python3 scripts/pmc_paper.py --pmid 38786024 --section conclusion
```
**Output fields when listing sections**: `pmc_id`, `pmid`, `title`, `pmc_url`, `section_count`, `sections[]` (name, level; includes subsection levels)
**Output fields when reading a section**: `pmc_id`, `section`, `level`, `content` (includes subsection text), `char_count`
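Both `PMC11119143` and `11119143` are accepted because the script strips the prefix before calling NCBI and adds it back in the output. A minimal sketch of that normalization, following `normalize_pmc_id` in `scripts/pmc_paper.py` (the `display_pmc_id` helper is a hypothetical addition for illustration):

```python
import re


def normalize_pmc_id(raw: str) -> str:
    """Strip an optional, case-insensitive 'PMC' prefix and keep the digits."""
    return re.sub(r"^[Pp][Mm][Cc]", "", raw.strip())


def display_pmc_id(raw: str) -> str:
    """Hypothetical helper: the canonical 'PMC<digits>' form used in output."""
    return f"PMC{normalize_pmc_id(raw)}"


print(normalize_pmc_id("PMC11119143"))  # → 11119143
print(display_pmc_id(" pmc11119143 "))  # → PMC11119143
```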
### wikipedia_search.py
```bash
python3 scripts/wikipedia_search.py <query> [options]
```
| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (required) | — |
| `--limit`, `-n` | Number of results to return | 10 |
| `--lang`, `-l` | Language edition (`en`, `zh`, `ja`, `de`, `fr`, etc.) | en |
```bash
python3 scripts/wikipedia_search.py "machine learning" --limit 5
python3 scripts/wikipedia_search.py "深度学习" --lang zh --limit 5
```
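## Full-Text Reading Workflow
Search scripts return abstracts; reading scripts return body text. Used together, they let you read closely only where needed and save tokens.
**ArXiv papers**
1. Search with `arxiv_search.py` → get the `arxiv_id`
2. List sections with `arxiv_paper.py <id>` → run `arxiv_paper.py <id> --section introduction` to judge quickly whether to go deeper
3. Read `method` / `experiment` / `conclusion` as needed
**PMC biomedical papers**
1. Search with `pubmed_search.py` → take `pmc_id` from the results (full text exists only when it is not null)
2. List sections with `pmc_paper.py <pmc_id>` → read the key sections as needed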
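The PMC branch of the workflow can be sketched as a small planning step: take a parsed `pubmed_search.py` result, skip entries without a `pmc_id`, and build the follow-up commands. The function name `pmc_read_commands` and the sample result are hypothetical; only the script path and field names come from this document.

```python
from __future__ import annotations


def pmc_read_commands(result: dict, section: str | None = None) -> list[list[str]]:
    """Build pmc_paper.py commands for every item that has full text."""
    commands = []
    for item in result.get("items", []):
        pmc_id = item.get("pmc_id")
        if not pmc_id:
            continue  # null pmc_id: no open-access full text to read
        cmd = ["python3", "scripts/pmc_paper.py", pmc_id]
        if section:
            cmd += ["--section", section]
        commands.append(cmd)
    return commands


# Made-up search result with one readable and one unreadable paper.
result = {
    "success": True,
    "items": [
        {"title": "Paper A", "pmid": "1", "pmc_id": "PMC11119143"},
        {"title": "Paper B", "pmid": "2", "pmc_id": None},
    ],
}
for cmd in pmc_read_commands(result, section="introduction"):
    print(" ".join(cmd))
# → python3 scripts/pmc_paper.py PMC11119143 --section introduction
```

Each command list can be passed to `subprocess.run` as-is; omitting `section` yields the list-sections call used in step 2.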
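## Citation Tracing Workflow
Use a paper's citation relationships to discover related work that keyword search does not reach.
**Backward (find foundational work)**
1. Find a highly relevant paper by keyword search → take its `paper_id` or `arxiv_id`
2. `semantic_scholar_refs.py <id> references --min-citations 50` → find highly cited references
3. Filter for entries relevant to the research question → read them in depth with `arxiv_paper.py` or `pmc_paper.py`
**Forward (find follow-up work)**
1. Find a foundational or key paper in the field → take its ID
2. `semantic_scholar_refs.py <id> citations --year-min 2024 --min-citations 10` → find recent, well-cited follow-up work
3. Filter for entries relevant to the research question → read them in depth
**Citation Chain (trace how ideas evolved)**
1. Start from seed paper A → go backward to find A's key reference B
2. Start from B → go forward to find later work citing B (this may reveal a related paper C that A did not cite)
3. This yields the lineages B → A → ... and B → C → ...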
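The citation-chain idea reduces to a small set computation once the backward and forward lookups are in hand. The sketch below works on in-memory dictionaries standing in for `semantic_scholar_refs.py` results; `chain_candidates` and the toy graph are hypothetical.

```python
from __future__ import annotations


def chain_candidates(
    seed: str,
    references: dict[str, set[str]],
    citations: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Find papers C that cite a reference B of the seed but are not cited by the seed.

    references[p] is the set of papers p cites (backward);
    citations[p] is the set of papers citing p (forward).
    Returns a mapping C -> set of shared references B.
    """
    seed_refs = references.get(seed, set())
    found: dict[str, set[str]] = {}
    for b in seed_refs:
        for c in citations.get(b, set()):
            # Skip the seed itself and anything the seed already cites.
            if c == seed or c in seed_refs:
                continue
            found.setdefault(c, set()).add(b)
    return found


# Toy graph: A cites B and D; B is cited by A and C; D is cited by A and B.
references = {"A": {"B", "D"}}
citations = {"B": {"A", "C"}, "D": {"A", "B"}}
print(chain_candidates("A", references, citations))  # → {'C': {'B'}}
```

Candidates that share several references with the seed are usually the most closely related, so the size of each value set is a reasonable ranking signal.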
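## ArXiv Category Quick Reference
Top-level areas can be used directly (e.g. `--category cs`); subcategories are more precise (e.g. `--category cs.AI`).
| Area | Category code | Description |
|------|---------|------|
| **Computer Science** | `cs.AI` | Artificial intelligence |
| | `cs.LG` | Machine learning |
| | `cs.CL` | Computational linguistics / NLP |
| | `cs.CV` | Computer vision |
| | `cs.IR` | Information retrieval |
| | `cs.RO` | Robotics |
| | `cs.SE` | Software engineering |
| | `cs.DC` | Distributed / parallel computing |
| | `cs.NI` | Networking and Internet architecture |
| | `cs.CR` | Cryptography and security |
| | `cs.DB` | Databases |
| | `cs.HC` | Human-computer interaction |
| **Statistics** | `stat.ML` | Statistical machine learning |
| | `stat.AP` | Applied statistics |
| | `stat.ME` | Statistical methodology |
| **Mathematics** | `math.OC` | Optimization and control |
| | `math.ST` | Statistics theory |
| | `math.CO` | Combinatorics |
| **Physics** | `physics` | Physics (whole archive) |
| | `cond-mat` | Condensed matter physics |
| | `quant-ph` | Quantum physics |
| | `hep-th` | High energy physics (theory) |
| **Economics / Finance** | `econ.GN` | General economics |
| | `q-fin.CP` | Computational finance |
| | `q-fin.ST` | Statistical finance |
| **Biology / Medicine** | `q-bio.NC` | Neuroscience |
| | `q-bio.GN` | Genomics |
| | `q-bio.QM` | Quantitative methods |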
## Output Format
All scripts print standard JSON:
```json
{
  "success": true,
  "query": "...",
  "provider": "arxiv|semantic_scholar|pubmed|wikipedia",
  "items": [{"title": "...", "url": "...", "snippet": "...", ...}],
  "error": null
}
```
`arxiv_paper.py` and `pmc_paper.py` do not use the `items` format; they return a structured object directly (see the "Output fields" notes for each).
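A caller will usually want to check `success` before touching `items`. The sketch below shows one way to consume the envelope; `parse_search_output` is a hypothetical helper and the payloads are made up.

```python
from __future__ import annotations

import json


def parse_search_output(raw: str) -> list[dict]:
    """Parse a search script's JSON output and return its items.

    Raises RuntimeError with the script's own error message on failure.
    """
    data = json.loads(raw)
    if not data.get("success"):
        raise RuntimeError(data.get("error") or "unknown error")
    return data.get("items", [])


# Made-up payloads in the envelope format shown above.
ok = '{"success": true, "query": "q", "provider": "arxiv", "items": [{"title": "T"}], "error": null}'
bad = '{"success": false, "query": "q", "provider": "arxiv", "items": [], "error": "timeout"}'

print(parse_search_output(ok)[0]["title"])  # → T
try:
    parse_search_output(bad)
except RuntimeError as exc:
    print(f"search failed: {exc}")  # → search failed: timeout
```

For `arxiv_paper.py` and `pmc_paper.py`, the same `success` / `error` check applies, but the payload fields sit at the top level instead of under `items`.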


@@ -0,0 +1,2 @@
httpx>=0.25.0
beautifulsoup4>=4.12.0


@@ -0,0 +1,304 @@
#!/usr/bin/env python3
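"""
ArXiv paper section reader.

Parses the arXiv HTML version (converted by LaTeXML) and supports:
- listing the full section structure of a paper
- extracting body text by section name (case-insensitive, partial match supported)

Usage:
    python3 arxiv_paper.py 2409.05591                          # list sections
    python3 arxiv_paper.py 2409.05591 --section introduction   # read a given section
    python3 arxiv_paper.py 2409.05591 --section method
"""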
from __future__ import annotations
import argparse
import json
import re
import sys
from typing import Any
from search_utils import get_client, print_json
BeautifulSoup: Any = None
NavigableString: Any = None
Tag: Any = None
def ensure_bs4() -> None:
    """Load BeautifulSoup only when the script needs to parse paper HTML."""
    global BeautifulSoup, NavigableString, Tag
    if BeautifulSoup is not None:
        return
    try:
        from bs4 import BeautifulSoup as Bs4BeautifulSoup
        from bs4 import NavigableString as Bs4NavigableString
        from bs4 import Tag as Bs4Tag
    except ImportError:
        print_json({
            "success": False,
            "error": "Missing beautifulsoup4. Run: python3 -m pip install -r skills/sn-search-academic/requirements.txt",
        })
        sys.exit(1)
    BeautifulSoup = Bs4BeautifulSoup
    NavigableString = Bs4NavigableString
    Tag = Bs4Tag
HTML_BASE = "https://arxiv.org/html"
ABS_BASE = "https://arxiv.org/abs"
PDF_BASE = "https://arxiv.org/pdf"
# ── HTML fetching ─────────────────────────────────────────────────────────────
def fetch_html(arxiv_id: str) -> str:
    """Fetch the arXiv HTML version; raise a meaningful error if it does not exist."""
    url = f"{HTML_BASE}/{arxiv_id}"
    with get_client(timeout=45, headers={"Accept": "text/html,application/xhtml+xml"}) as client:
        resp = client.get(url)
    if resp.status_code == 404:
        raise ValueError(
            f"Paper {arxiv_id} has no HTML version yet. "
            "Possible reasons: the paper is old (before 2018), is not from a LaTeX source, or has not been converted yet. "
            f"Read the PDF directly: {PDF_BASE}/{arxiv_id}"
        )
    resp.raise_for_status()
    return resp.text
# ── Text cleanup ──────────────────────────────────────────────────────────────
def _elem_to_text(elem: Tag) -> str:
    """
    Convert an HTML element to readable text.

    - math elements: keep the LaTeX annotation (wrapped in $...$); other MathML text is skipped
    - figure and table captions: kept
    - noise nodes such as footnote marks and reference tags
      (ltx_note_mark, ltx_ref_tag, ltx_tag) are skipped
    """
    parts: list[str] = []
    for node in elem.descendants:
        if not isinstance(node, NavigableString):
            continue
        parent = node.parent
        if parent is None:
            continue
        tag = parent.name
        # Skip noise such as footnote numbers and superscript reference marks.
        parent_classes = parent.get("class") or []
        if any(c in parent_classes for c in ("ltx_note_mark", "ltx_ref_tag", "ltx_tag")):
            continue
        # math elements: take the LaTeX annotation.
        if tag == "annotation":
            encoding = parent.get("encoding", "")
            if "tex" in encoding.lower() or "latex" in encoding.lower():
                latex = node.strip()
                if latex:
                    parts.append(f"${latex}$")
            continue
        # Skip non-annotation text inside math (raw MathML text is messy).
        in_math = False
        for ancestor in parent.parents:
            if ancestor.name == "math":
                in_math = True
                break
        if in_math:
            continue
        text = str(node)
        if text.strip():
            parts.append(text)
    raw = "".join(parts)
    # Collapse runs of spaces and tabs; keep paragraph breaks.
    raw = re.sub(r"[ \t]+", " ", raw)
    raw = re.sub(r"\n{3,}", "\n\n", raw)
    return raw.strip()
# ── Section extraction ────────────────────────────────────────────────────────
def extract_sections(html: str) -> list[dict[str, Any]]:
    """
    Extract all sections (including the abstract) from arXiv HTML.

    Returns a list whose items contain:
        name  - section title (with numbering, e.g. "1 Introduction")
        level - depth (0=abstract, 1=h2, 2=h3, 3=h4)
        text  - body text
    """
    ensure_bs4()
    soup = BeautifulSoup(html, "html.parser")
    sections: list[dict[str, Any]] = []
    # ── Abstract ──
    abstract_elem = soup.find(class_=re.compile(r"\bltx_abstract\b"))
    if abstract_elem:
        # Drop the "Abstract" heading line.
        for h in abstract_elem.find_all(["h2", "h6"], class_=re.compile(r"ltx_title")):
            h.decompose()
        abstract_text = _elem_to_text(abstract_elem)
        if abstract_text:
            sections.append({"name": "Abstract", "level": 0, "text": abstract_text})
    # ── Body sections ──
    for sec in soup.find_all("section", class_=re.compile(r"\bltx_section\b|\bltx_appendix\b")):
        # Find this level's heading (not a child section's heading).
        heading: Tag | None = None
        for h_tag in ["h2", "h3", "h4"]:
            candidate = sec.find(h_tag, class_=re.compile(r"\bltx_title\b"), recursive=False)
            if candidate:
                heading = candidate
                break
        if heading is None:
            # Some section headings sit inside the first div.
            for h_tag in ["h2", "h3", "h4"]:
                candidate = sec.find(h_tag, class_=re.compile(r"\bltx_title\b"))
                if candidate:
                    heading = candidate
                    break
        if heading is None:
            continue
        # Clean the heading (strip the trailing ¶ permalink and extra whitespace).
        heading_text = heading.get_text(" ", strip=True).rstrip("¶").strip()
        heading_text = re.sub(r"\s+", " ", heading_text)
        level = {"h2": 1, "h3": 2, "h4": 3}.get(heading.name, 1)
        # Extract this section's text (excluding child sections to avoid duplication).
        sec_copy = BeautifulSoup(str(sec), "html.parser").find("section")
        # Remove child sections.
        for child_sec in sec_copy.find_all("section", recursive=False):
            child_sec.decompose()
        # Remove the heading itself.
        for h in sec_copy.find_all(["h2", "h3", "h4"], class_=re.compile(r"\bltx_title\b"), recursive=False):
            h.decompose()
        text = _elem_to_text(sec_copy)
        if not text.strip():
            continue
        sections.append({"name": heading_text, "level": level, "text": text})
    return sections
# ── Section name matching ─────────────────────────────────────────────────────
def _match_section(sections: list[dict], query: str) -> dict | None:
    """Fuzzy match: case-insensitive, ignoring numeric prefixes."""
    q = query.lower().strip()

    def clean(name: str) -> str:
        """Strip numeric prefixes such as '1 ' or '1. '."""
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Exact match
    for s in sections:
        if s["name"].lower() == q or clean(s["name"]) == q:
            return s
    # Prefix / substring match
    for s in sections:
        if clean(s["name"]).startswith(q) or q in clean(s["name"]):
            return s
    return None
# ── Public interface ──────────────────────────────────────────────────────────
def cmd_list_sections(arxiv_id: str) -> dict[str, Any]:
    """List all sections of a paper (without body text)."""
    html = fetch_html(arxiv_id)
    sections = extract_sections(html)
    return {
        "success": True,
        "arxiv_id": arxiv_id,
        "abs_url": f"{ABS_BASE}/{arxiv_id}",
        "html_url": f"{HTML_BASE}/{arxiv_id}",
        "pdf_url": f"{PDF_BASE}/{arxiv_id}",
        "section_count": len(sections),
        "sections": [{"name": s["name"], "level": s["level"]} for s in sections],
        "error": None,
    }


def cmd_read_section(arxiv_id: str, section_name: str) -> dict[str, Any]:
    """Read the body text of the given section."""
    html = fetch_html(arxiv_id)
    sections = extract_sections(html)
    matched = _match_section(sections, section_name)
    if matched is None:
        available = [s["name"] for s in sections]
        return {
            "success": False,
            "arxiv_id": arxiv_id,
            "section": section_name,
            "content": None,
            "error": f"Section '{section_name}' not found. Available sections: {available}",
        }
    return {
        "success": True,
        "arxiv_id": arxiv_id,
        "abs_url": f"{ABS_BASE}/{arxiv_id}",
        "section": matched["name"],
        "level": matched["level"],
        "content": matched["text"],
        "char_count": len(matched["text"]),
        "error": None,
    }
# ── CLI ───────────────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(
        description="ArXiv paper section reader",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python3 arxiv_paper.py 2409.05591                          list all sections
  python3 arxiv_paper.py 2409.05591 --section introduction   read Introduction
  python3 arxiv_paper.py 2409.05591 --section method         read Method/Methods
  python3 arxiv_paper.py 2409.05591 --section conclusion     read Conclusion
""",
    )
    parser.add_argument("arxiv_id", help="arXiv paper ID, e.g. 2409.05591 or 2409.05591v2")
    parser.add_argument(
        "--section", "-s",
        metavar="SECTION_NAME",
        help="Section to read (case-insensitive, partial match supported). If omitted, all sections are listed.",
    )
    args = parser.parse_args()
    try:
        if args.section:
            result = cmd_read_section(args.arxiv_id.strip(), args.section.strip())
        else:
            result = cmd_list_sections(args.arxiv_id.strip())
        print_json(result)
    except Exception as e:
        print_json({
            "success": False,
            "arxiv_id": args.arxiv_id,
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,239 @@
#!/usr/bin/env python3
"""
ArXiv paper search via the ArXiv API (which returns Atom XML).

Supports:
- all-field / title / abstract / author field search
- category filtering and sorting
- fetching paper metadata directly by ID list
- boolean query combinations (AND / OR / ANDNOT)

Examples:
    python3 arxiv_search.py "attention mechanism"
    python3 arxiv_search.py "transformer" --category cs.CL --sort date
    python3 arxiv_search.py "diffusion model" --author "ho jonathan"
    python3 arxiv_search.py "ViT" --title-only
    python3 arxiv_search.py --id-list 2409.05591,2301.00001
"""
from __future__ import annotations
import sys
import xml.etree.ElementTree as ET
from search_utils import build_parser, get_client, make_item, make_result, print_json
API_URL = "https://export.arxiv.org/api/query"
# Atom XML namespaces
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}
def build_search_query(
    query: str,
    category: str | None = None,
    author: str | None = None,
    title_only: bool = False,
) -> str:
    """
    Build an arXiv query string.

    Field prefixes:
        all:  all fields (default)
        ti:   title only
        au:   author (wildcards supported, e.g. au:smi*)
        abs:  abstract
        cat:  category
    Boolean operators must be uppercase: AND / OR / ANDNOT
    """
    # Main query field
    field = "ti" if title_only else "all"
    parts = [f"{field}:{query}"]
    if author:
        # Multiple authors are joined with OR; "lastname firstname" format is supported.
        author_terms = [f"au:{a.strip()}" for a in author.split(",") if a.strip()]
        if author_terms:
            parts.append(f"({' OR '.join(author_terms)})")
    if category:
        parts.append(f"cat:{category}")
    return " AND ".join(parts)
def fetch_by_ids(id_list: list[str], limit: int) -> list[dict]:
    """Fetch paper metadata directly by ID list (no text search)."""
    params = {
        "id_list": ",".join(id_list[:limit]),
        "max_results": min(len(id_list), limit, 100),
    }
    with get_client(timeout=30, headers={"Accept": "application/xml"}) as client:
        resp = client.get(API_URL, params=params)
    resp.raise_for_status()
    return _parse_entries(ET.fromstring(resp.text), limit)
def search(
    query: str,
    limit: int,
    category: str | None = None,
    sort_by: str = "relevance",
    author: str | None = None,
    title_only: bool = False,
) -> list[dict]:
    """Run an ArXiv keyword search."""
    search_query = build_search_query(query, category, author, title_only)
    sort_map = {
        "relevance": "relevance",
        "date": "lastUpdatedDate",
        "submitted": "submittedDate",
    }
    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": min(limit, 100),
        "sortBy": sort_map.get(sort_by, "relevance"),
        "sortOrder": "descending",
    }
    with get_client(timeout=30, headers={"Accept": "application/xml"}) as client:
        resp = client.get(API_URL, params=params)
    resp.raise_for_status()
    return _parse_entries(ET.fromstring(resp.text), limit)
def _parse_entries(root: ET.Element, limit: int) -> list[dict]:
    """Parse paper entries from Atom XML."""
    items = []
    for entry in root.findall("atom:entry", NS)[:limit]:
        title = _text(entry, "atom:title").replace("\n", " ").strip()
        summary = _text(entry, "atom:summary").replace("\n", " ").strip()
        published = _text(entry, "atom:published")
        updated = _text(entry, "atom:updated")
        # Get the paper links (prefer the abs page).
        url = ""
        pdf_url = ""
        for link in entry.findall("atom:link", NS):
            href = link.get("href", "")
            if link.get("title") == "pdf":
                pdf_url = href
            elif link.get("type") == "text/html" or "/abs/" in href:
                url = href
        if not url:
            url = _text(entry, "atom:id")
        # Extract arxiv_id from the abs URL or the id element.
        arxiv_id = ""
        raw_id = _text(entry, "atom:id")
        if "/abs/" in raw_id:
            arxiv_id = raw_id.split("/abs/")[-1]
        elif raw_id.startswith("http"):
            arxiv_id = raw_id.split("/")[-1]
        # Authors
        authors = [_text(a, "atom:name") for a in entry.findall("atom:author", NS)]
        # Categories
        categories = [c.get("term", "") for c in entry.findall("atom:category", NS)]
        comment = _text(entry, "arxiv:comment")
        journal_ref = _text(entry, "arxiv:journal_ref")
        doi = _text(entry, "arxiv:doi")
        primary_category = entry.find("arxiv:primary_category", NS)
        primary_cat = primary_category.get("term", "") if primary_category is not None else ""
        # HTML version link (available for newer papers)
        html_url = f"https://arxiv.org/html/{arxiv_id}" if arxiv_id else None
        items.append(make_item(
            title=title,
            url=url,
            snippet=summary,
            arxiv_id=arxiv_id if arxiv_id else None,
            authors=authors,
            published=published,
            updated=updated,
            pdf_url=pdf_url,
            html_url=html_url,
            categories=categories,
            primary_category=primary_cat if primary_cat else None,
            comment=comment if comment else None,
            journal_ref=journal_ref if journal_ref else None,
            doi=doi if doi else None,
        ))
    return items
def _text(elem: ET.Element, tag: str) -> str:
    """Safely get the text of a child element."""
    child = elem.find(tag, NS)
    return child.text.strip() if child is not None and child.text else ""
def main():
    parser = build_parser("Search ArXiv academic papers")
    parser.add_argument("--category", "-c", help="ArXiv category filter (e.g. cs.AI, cs.CL, math.CO)")
    parser.add_argument(
        "--sort", default="relevance",
        choices=["relevance", "date", "submitted"],
        help="Sort order (default: relevance)",
    )
    parser.add_argument(
        "--author", "-a",
        help="Filter by author (e.g. 'hinton'; separate multiple authors with commas)",
    )
    parser.add_argument(
        "--title-only", action="store_true",
        help="Search in titles only (default: search all fields)",
    )
    parser.add_argument(
        "--id-list",
        help="Fetch metadata directly by arXiv ID, comma-separated (e.g. 2409.05591,2301.00001). The query argument may be left empty when this is set.",
    )
    parser.prog = "arxiv_search.py"
    # Make the positional query optional so it can be omitted with --id-list.
    for action in parser._positionals._group_actions:
        if action.dest == "query":
            action.nargs = "?"
            action.default = ""
            break
    args = parser.parse_args()
    try:
        if args.id_list:
            id_list = [i.strip() for i in args.id_list.split(",") if i.strip()]
            items = fetch_by_ids(id_list, args.limit)
            query_str = f"id_list:{args.id_list}"
        else:
            if not args.query:
                parser.error("Provide search keywords, or use --id-list to query by ID")
            items = search(
                args.query,
                args.limit,
                category=args.category,
                sort_by=args.sort,
                author=args.author,
                title_only=args.title_only,
            )
            query_str = args.query
        print_json(make_result(True, query_str, "arxiv", items))
    except Exception as e:
        print_json(make_result(False, getattr(args, "query", "") or "", "arxiv", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()


@@ -0,0 +1,454 @@
#!/usr/bin/env python3
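"""
PMC paper full-text section reader.

Fetches PubMed Central full-text XML (JATS format) through NCBI E-utilities and supports:
- listing the full section structure of a paper (including subsection levels)
- extracting body text by section name (case-insensitive, partial match supported)
- resolving a PMID to a PMC ID automatically

Usage:
    python3 pmc_paper.py PMC11119143                          # list sections
    python3 pmc_paper.py 11119143                             # same (PMC prefix added automatically)
    python3 pmc_paper.py PMC11119143 --section introduction   # read a given section
    python3 pmc_paper.py --pmid 38786024 --section method     # start from a PMID
"""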
from __future__ import annotations
import argparse
import re
import sys
import xml.etree.ElementTree as ET
from typing import Any
from search_utils import get_client, print_json
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
# ── ID handling ───────────────────────────────────────────────────────────────
def normalize_pmc_id(raw: str) -> str:
    """Normalize a PMC ID: strip the 'PMC' prefix and keep only the numeric part."""
    return re.sub(r"^[Pp][Mm][Cc]", "", raw.strip())
def pmid_to_pmc(pmid: str, api_key: str | None = None) -> str | None:
    """Convert a PMID to a PMC ID (numeric form) via elink."""
    params: dict[str, Any] = {
        "dbfrom": "pubmed",
        "db": "pmc",
        "id": pmid,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    with get_client(timeout=20) as client:
        resp = client.get(ELINK_URL, params=params)
    resp.raise_for_status()
    data = resp.json()
    for linkset in data.get("linksets", []):
        for db in linkset.get("linksetdbs", []):
            if db.get("dbto") == "pmc" and db.get("linkname") == "pubmed_pmc":
                links = db.get("links", [])
                if links:
                    return str(links[0])
    return None
# ── XML fetching ──────────────────────────────────────────────────────────────
def fetch_pmc_xml(pmc_num: str, api_key: str | None = None) -> ET.Element:
    """Fetch the PMC full-text XML and return its root element."""
    params: dict[str, Any] = {
        "db": "pmc",
        "id": pmc_num,
        "rettype": "xml",
        "retmode": "xml",
    }
    if api_key:
        params["api_key"] = api_key
    with get_client(timeout=45) as client:
        resp = client.get(EFETCH_URL, params=params)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # Check that an article was found.
    article = root.find(".//article")
    if article is None:
        raise ValueError(
            f"No full text found for PMC{pmc_num}. "
            "Possible reasons: the paper is not in the PMC open-access collection, or the ID is wrong."
        )
    return root
# ── JATS XML text extraction ──────────────────────────────────────────────────
# Skip the entire content of these tags (noise nodes).
_SKIP_TAGS = {"ref", "ref-list", "fn", "fn-group", "permissions", "author-notes",
              "glossary", "ack"}  # ack = Acknowledgements; keep it if needed
# Tags replaced by a placeholder
_FORMULA_TAGS = {"disp-formula", "inline-formula", "mml:math", "tex-math"}
def _elem_to_text(elem: ET.Element, depth: int = 0) -> str:
    """
    Recursively convert a JATS XML element to readable text.

    Rules:
    - <p>: paragraph, followed by a blank line
    - <title>: skipped (section titles are handled by the caller)
    - <sec>: subsection, handled recursively (indentation marks the level)
    - <list>/<list-item>: converted to a bullet list
    - <disp-formula>/<inline-formula>: replaced with [FORMULA]
    - <fig>: image content skipped, caption kept
    - <table-wrap>: label and caption kept
    - <xref>/<ext-link>: text content taken as-is
    - <bold>/<italic>/<underline>: text content taken as-is
    """
    tag = elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag  # strip namespace
    if tag in _SKIP_TAGS:
        return ""
    if tag in _FORMULA_TAGS:
        return " [FORMULA] "
    if tag == "title":
        return ""  # handled by the caller
    if tag == "p":
        text = _collect_text(elem)
        return text.strip() + "\n\n" if text.strip() else ""
    if tag in ("bold", "italic", "underline", "named-content", "styled-content",
               "ext-link", "uri", "xref", "sup", "sub", "monospace"):
        return _collect_text(elem)
    if tag == "list":
        parts = []
        for li in elem.findall("list-item"):
            item_text = "".join(_elem_to_text(c) for c in li).strip()
            if item_text:
                parts.append(f"• {item_text}")
        return "\n".join(parts) + "\n\n" if parts else ""
    if tag == "disp-quote":
        text = "".join(_elem_to_text(c) for c in elem).strip()
        return f"> {text}\n\n" if text else ""
    if tag == "fig":
        # Keep only the caption.
        caption = elem.find(".//caption")
        if caption is not None:
            cap_text = "".join(_elem_to_text(c) for c in caption).strip()
            label = elem.findtext("label", "Figure")
            return f"[{label}: {cap_text}]\n\n" if cap_text else ""
        return ""
    if tag == "table-wrap":
        label = elem.findtext("label", "Table")
        caption = elem.find(".//caption")
        cap_text = ""
        if caption is not None:
            cap_text = "".join(_elem_to_text(c) for c in caption).strip()
        return f"[{label}: {cap_text}]\n\n" if cap_text else f"[{label}]\n\n"
    if tag == "sec":
        # Subsection: recurse, indenting the title by depth.
        sub_title_elem = elem.find("title")
        sub_title = ""
        if sub_title_elem is not None:
            sub_title = _collect_text(sub_title_elem).strip()
        parts = []
        if sub_title:
            indent = " " * depth
            parts.append(f"\n{indent}### {sub_title}\n\n")
        for child in elem:
            child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
            if child_tag == "title":
                continue
            parts.append(_elem_to_text(child, depth + 1))
        return "".join(parts)
    # Default: recurse into child nodes.
    return "".join(_elem_to_text(c, depth) for c in elem)
def _collect_text(elem: ET.Element) -> str:
    """Collect all text of an element (including children; formulas become placeholders)."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
        if child_tag in _FORMULA_TAGS:
            parts.append("[FORMULA]")
        elif child_tag in _SKIP_TAGS:
            pass
        else:
            parts.append(_collect_text(child))
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)
# ── Section extraction ────────────────────────────────────────────────────────
def _extract_sections_from(container: ET.Element, level: int = 1) -> list[dict[str, Any]]:
    """Recursively extract sec nodes; each section carries its nested subsections."""
    sections: list[dict[str, Any]] = []
    for sec in container.findall("sec"):
        title_elem = sec.find("title")
        title = _collect_text(title_elem).strip() if title_elem is not None else f"Section {len(sections)+1}"
        # Body: direct children of this sec (excluding sec and title).
        text_parts = []
        for child in sec:
            child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
            if child_tag in ("title", "sec"):
                continue
            text_parts.append(_elem_to_text(child))
        text = "".join(text_parts).strip()
        # Recurse into subsections.
        subsections = _extract_sections_from(sec, level + 1)
        sections.append({
            "name": title,
            "level": level,
            "text": text,
            "subsections": subsections,
        })
    return sections
def extract_all_sections(root: ET.Element) -> list[dict[str, Any]]:
    """
    Extract all sections from PMC JATS XML.

    Order: Abstract → body sections (including subsections)
    """
    sections: list[dict[str, Any]] = []
    article = root.find(".//article")
    if article is None:
        return sections
    # ── Abstract ──
    abstract = article.find(".//abstract")
    if abstract is not None:
        # Structured abstract (contains sec elements)
        if abstract.findall("sec"):
            abs_parts = []
            for sec in abstract.findall("sec"):
                sec_title = sec.findtext("title", "")
                sec_text_parts = []
                for child in sec:
                    if child.tag != "title":
                        sec_text_parts.append(_elem_to_text(child))
                part = "".join(sec_text_parts).strip()
                if sec_title:
                    abs_parts.append(f"{sec_title}: {part}")
                else:
                    abs_parts.append(part)
            abs_text = "\n\n".join(abs_parts)
        else:
            abs_text = "".join(_elem_to_text(c) for c in abstract).strip()
        if abs_text:
            sections.append({"name": "Abstract", "level": 0, "text": abs_text, "subsections": []})
    # ── Body ──
    body = article.find(".//body")
    if body is not None:
        sections.extend(_extract_sections_from(body, level=1))
    return sections
# ── Section matching ──────────────────────────────────────────────────────────
def _flatten_sections(sections: list[dict], result: list | None = None) -> list[dict]:
    """Flatten nested sections so they are easy to search."""
    if result is None:
        result = []
    for s in sections:
        result.append(s)
        _flatten_sections(s.get("subsections", []), result)
    return result


def match_section(sections: list[dict], query: str) -> dict | None:
    """Fuzzy match: case-insensitive, ignoring numeric prefixes (searches all levels)."""
    q = query.lower().strip()
    flat = _flatten_sections(sections)

    def clean(name: str) -> str:
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Exact match
    for s in flat:
        if s["name"].lower() == q or clean(s["name"]) == q:
            return s
    # Substring / prefix match
    for s in flat:
        c = clean(s["name"])
        if c.startswith(q) or q in c:
            return s
    return None
# ── Public interface ──────────────────────────────────────────────────────────
def _section_outline(sections: list[dict], depth: int = 0) -> list[dict]:
    """Build the section outline (name and level only, recursive)."""
    outline = []
    for s in sections:
        outline.append({"name": s["name"], "level": s["level"]})
        if s.get("subsections"):
            outline.extend(_section_outline(s["subsections"], depth + 1))
    return outline


def cmd_list_sections(pmc_num: str, api_key: str | None = None) -> dict[str, Any]:
    """List the full section outline of a PMC paper."""
    root = fetch_pmc_xml(pmc_num, api_key)
    sections = extract_all_sections(root)
    # Take the title and PMID from the XML.
    title = root.findtext(".//article-title", "")
    pmid = root.findtext(".//article-id[@pub-id-type='pmid']", "")
    return {
        "success": True,
        "pmc_id": f"PMC{pmc_num}",
        "pmid": pmid or None,
        "title": title,
        "pmc_url": f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_num}/",
        "section_count": len(_flatten_sections(sections)),
        "sections": _section_outline(sections),
        "error": None,
    }
def cmd_read_section(pmc_num: str, section_name: str, api_key: str | None = None) -> dict[str, Any]:
    """Read the body text of the given section (including subsection text)."""
    root = fetch_pmc_xml(pmc_num, api_key)
    sections = extract_all_sections(root)
    matched = match_section(sections, section_name)
    if matched is None:
        flat = _flatten_sections(sections)
        available = [s["name"] for s in flat]
        return {
            "success": False,
            "pmc_id": f"PMC{pmc_num}",
            "section": section_name,
            "content": None,
            "error": f"Section '{section_name}' not found. Available sections: {available}",
        }

    # Merge this section's text with its subsections' text.
    def collect_text(s: dict) -> str:
        parts = [s["text"]]
        for sub in s.get("subsections", []):
            sub_text = collect_text(sub)
            if sub_text.strip():
                parts.append(f"\n### {sub['name']}\n\n{sub_text}")
        return "\n\n".join(p for p in parts if p.strip())

    content = collect_text(matched)
    return {
        "success": True,
        "pmc_id": f"PMC{pmc_num}",
        "pmc_url": f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_num}/",
        "section": matched["name"],
        "level": matched["level"],
        "content": content,
        "char_count": len(content),
        "error": None,
    }
# ── CLI ───────────────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(
        description="Section-by-section reader for PMC full-text papers",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python3 pmc_paper.py PMC11119143                           List all sections
  python3 pmc_paper.py 11119143                              Same (prefix added automatically)
  python3 pmc_paper.py PMC11119143 --section introduction    Read the Introduction
  python3 pmc_paper.py PMC11119143 --section method          Read the Methods
  python3 pmc_paper.py --pmid 38786024                       List sections, starting from a PMID
  python3 pmc_paper.py --pmid 38786024 --section conclusion  Read a section, starting from a PMID
""",
    )
    parser.add_argument(
        "pmc_id", nargs="?",
        help="PMC ID, e.g. PMC11119143 or 11119143. Provide either this or --pmid.",
    )
    parser.add_argument(
        "--pmid",
        help="PubMed ID, converted to a PMC ID automatically (the paper must be in the PMC open-access collection)",
    )
    parser.add_argument(
        "--section", "-s",
        metavar="SECTION_NAME",
        help="Name of the section to read (case-insensitive, partial matches allowed). Lists all sections if omitted.",
    )
    parser.add_argument(
        "--api-key",
        help="NCBI API key (optional; raises the rate limit from 3 req/s to 10 req/s)",
    )
    args = parser.parse_args()
    api_key = args.api_key
    pmc_num = None  # defined before the try block so the error handler can report it
    try:
        # Resolve the numeric PMC ID
        if args.pmid:
            pmc_num = pmid_to_pmc(args.pmid, api_key)
            if not pmc_num:
                print_json({
                    "success": False,
                    "pmid": args.pmid,
                    "error": f"PMID {args.pmid} has no full text in PMC. The paper may not be open access.",
                })
                sys.exit(1)
        elif args.pmc_id:
            pmc_num = normalize_pmc_id(args.pmc_id)
        else:
            parser.error("Provide a PMC ID, or use --pmid to specify a PubMed ID")
        if args.section:
            result = cmd_read_section(pmc_num, args.section.strip(), api_key)
        else:
            result = cmd_list_sections(pmc_num, api_key)
        print_json(result)
    except Exception as e:
        print_json({
            "success": False,
            "pmc_id": f"PMC{pmc_num}" if pmc_num else None,
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()
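The CLI accepts a PMC ID with or without the `PMC` prefix. The helper that does this normalization is not shown in this part of the diff, so the following is only a sketch of one way it could work; `to_pmc_number` is a hypothetical name, and rejecting malformed input with `ValueError` is an assumption.

```python
import re


def to_pmc_number(raw: str) -> str:
    """Return the numeric part of a PMC ID such as 'PMC11119143' or '11119143'.

    Hypothetical helper for illustration; the real script may behave differently.
    """
    match = re.fullmatch(r"\s*(?:PMC)?(\d+)\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Not a PMC ID: {raw!r}")
    return match.group(1)


print(to_pmc_number("PMC11119143"))
print(to_pmc_number("11119143"))
```

Matching the whole string, rather than stripping characters, keeps inputs such as `PMCabc` from being silently accepted.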


@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""PubMed biomedical literature search via the NCBI E-utilities API."""
from __future__ import annotations
import sys
import xml.etree.ElementTree as ET
from search_utils import build_parser, get_client, make_item, make_result, print_json
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
def search(query: str, limit: int, api_key: str | None = None) -> list[dict]:
    """Run a PubMed search in two steps: esearch returns PMIDs, efetch returns full records with abstracts."""
    base_params: dict = {"api_key": api_key} if api_key else {}
    # Step 1: esearch returns the list of PMIDs
    with get_client(timeout=30) as client:
        resp = client.get(ESEARCH_URL, params={
            **base_params,
            "db": "pubmed",
            "term": query,
            "retmax": min(limit, 100),
            "retmode": "json",
            "sort": "relevance",
        })
        resp.raise_for_status()
        pmids = resp.json().get("esearchresult", {}).get("idlist", [])
    if not pmids:
        return []
    # Step 2: efetch returns the full XML records (including abstracts)
    with get_client(timeout=30) as client:
        resp = client.get(EFETCH_URL, params={
            **base_params,
            "db": "pubmed",
            "id": ",".join(pmids[:limit]),
            "rettype": "xml",
            "retmode": "xml",
        })
        resp.raise_for_status()
        root = ET.fromstring(resp.text)
    items = []
    for article in root.findall(".//PubmedArticle"):
        medline = article.find("MedlineCitation")
        if medline is None:
            continue
        pmid_elem = medline.find("PMID")
        pmid = pmid_elem.text if pmid_elem is not None else ""
        article_data = medline.find("Article")
        if article_data is None:
            continue
        # Title
        title_elem = article_data.find("ArticleTitle")
        title = "".join(title_elem.itertext()) if title_elem is not None else ""
        # Abstract (handles structured abstracts such as BACKGROUND/METHODS/RESULTS/CONCLUSIONS)
        abstract_parts = []
        abstract_elem = article_data.find("Abstract")
        if abstract_elem is not None:
            for ab in abstract_elem.findall("AbstractText"):
                label = ab.get("Label")
                text = "".join(ab.itertext()).strip()
                if label:
                    abstract_parts.append(f"{label}: {text}")
                else:
                    abstract_parts.append(text)
        abstract = " ".join(abstract_parts)
        # Authors
        authors = []
        author_list = article_data.find("AuthorList")
        if author_list is not None:
            for author in author_list.findall("Author"):
                last = author.findtext("LastName", "")
                fore = author.findtext("ForeName", "")
                name = f"{fore} {last}".strip() if fore else last
                if name:
                    authors.append(name)
        # Journal information
        journal = article_data.find("Journal")
        journal_name = ""
        pub_date = ""
        volume = ""
        issue = ""
        if journal is not None:
            journal_name = journal.findtext("Title", "") or journal.findtext("ISOAbbreviation", "")
            ji = journal.find("JournalIssue")
            if ji is not None:
                volume = ji.findtext("Volume", "")
                issue = ji.findtext("Issue", "")
                pd = ji.find("PubDate")
                if pd is not None:
                    year = pd.findtext("Year", "")
                    month = pd.findtext("Month", "")
                    day = pd.findtext("Day", "")
                    pub_date = " ".join(filter(None, [year, month, day]))
        # Page range
        pages = article_data.findtext(".//MedlinePgn", "")
        # DOI and PMC ID, taken from the article's own ArticleIdList.
        # The path is restricted to PubmedData/ArticleIdList so that IDs nested
        # under ReferenceList (the cited works) do not overwrite the article's IDs.
        doi = None
        pmc_id = None
        for id_elem in article.findall("PubmedData/ArticleIdList/ArticleId"):
            id_type = id_elem.get("IdType", "")
            if id_type == "doi":
                doi = id_elem.text
            elif id_type == "pmc" and id_elem.text:
                # Normalize: drop a leading "PMC" prefix and keep only the digits
                raw = id_elem.text.strip()
                pmc_id = (raw[3:] if raw.upper().startswith("PMC") else raw) or raw
        # Author-supplied keywords (Keyword elements); MeSH headings are not extracted here
        keywords = [kw.text for kw in medline.findall(".//Keyword") if kw.text]
        # Publication types
        pub_types = [pt.text for pt in article_data.findall(".//PublicationType") if pt.text]
        url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
        pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_id}/" if pmc_id else None
        items.append(make_item(
            title=title,
            url=url,
            snippet=abstract,
            authors=authors,
            pmid=pmid,
            pmc_id=f"PMC{pmc_id}" if pmc_id else None,
            pmc_url=pmc_url,
            journal=journal_name if journal_name else None,
            pub_date=pub_date if pub_date else None,
            volume=volume if volume else None,
            issue=issue if issue else None,
            pages=pages if pages else None,
            keywords=keywords if keywords else None,
            pub_types=pub_types if pub_types else None,
            doi=doi,
        ))
    return items
def main():
    parser = build_parser("Search PubMed biomedical literature")
    parser.add_argument("--api-key", help="NCBI API key (optional; raises the rate limit from 3 req/s to 10 req/s)")
    args = parser.parse_args()
    try:
        items = search(args.query, args.limit, args.api_key)
        print_json(make_result(True, args.query, "pubmed", items))
    except Exception as e:
        print_json(make_result(False, args.query, "pubmed", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
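The structured-abstract handling in `search` can be checked offline against a small XML fragment. This is a minimal sketch using only the standard library; the `SAMPLE` fragment is invented for illustration and `join_abstract` is a hypothetical name for the loop shown above.

```python
import xml.etree.ElementTree as ET

# Invented fragment shaped like a PubMed <Abstract> element (illustrative only)
SAMPLE = """<Abstract>
  <AbstractText Label="BACKGROUND">Sepsis is <i>common</i>.</AbstractText>
  <AbstractText Label="RESULTS">Mortality fell.</AbstractText>
  <AbstractText>Unlabelled text.</AbstractText>
</Abstract>"""


def join_abstract(abstract_elem: ET.Element) -> str:
    """Flatten AbstractText children into one string, keeping section labels."""
    parts = []
    for ab in abstract_elem.findall("AbstractText"):
        label = ab.get("Label")
        # itertext() also collects text inside inline markup such as <i>
        text = "".join(ab.itertext()).strip()
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts)


abstract = join_abstract(ET.fromstring(SAMPLE))
print(abstract)
```

Using `itertext()` rather than `.text` matters here: `.text` would stop at the first inline tag and truncate the sentence.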


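@@ -0,0 +1,150 @@
"""
Shared utility library for the search skills.
Provides standard JSON output, CLI scaffolding, an httpx helper, and config lookup.
All search scripts import this module via sys.path.
"""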
from __future__ import annotations
import argparse
import json
import os
import sys
from typing import Any
try:
    import httpx
except ImportError:
    json.dump(
        {
            "success": False,
            "error": "httpx is missing. Run: python3 -m pip install -r skills/sn-search-academic/requirements.txt",
        },
        sys.stdout,
        ensure_ascii=False,
    )
    sys.stdout.write("\n")
    sys.exit(1)
# ---------------------------------------------------------------------------
# Standard output
# ---------------------------------------------------------------------------
def make_result(
    success: bool,
    query: str,
    provider: str,
    items: list[dict[str, Any]],
    error: str | None = None,
) -> dict[str, Any]:
    """Build a standardized search result."""
    return {
        "success": success,
        "query": query,
        "provider": provider,
        "items": items,
        "error": error,
    }


def make_item(
    title: str,
    url: str,
    snippet: str = "",
    **extra: Any,
) -> dict[str, Any]:
    """Build a standardized result item, omitting extra fields that are empty."""
    item: dict[str, Any] = {"title": title, "url": url, "snippet": snippet}
    for k, v in extra.items():
        if v not in (None, "", [], {}):
            item[k] = v
    return item


def print_json(data: dict[str, Any]) -> None:
    """Write the result to stdout as JSON."""
    json.dump(data, sys.stdout, ensure_ascii=False, indent=2)
    sys.stdout.write("\n")
    sys.stdout.flush()
# ---------------------------------------------------------------------------
# CLI scaffolding
# ---------------------------------------------------------------------------
def build_parser(description: str) -> argparse.ArgumentParser:
    """Create an ArgumentParser with the arguments shared by all search scripts."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("query", help="Search keywords")
    parser.add_argument("--limit", "-n", type=int, default=10, help="Number of results to return (default: 10)")
    return parser
# ---------------------------------------------------------------------------
# httpx helper
# ---------------------------------------------------------------------------
_DEFAULT_TIMEOUT = 15
_DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)


def get_client(
    timeout: int = _DEFAULT_TIMEOUT,
    headers: dict[str, str] | None = None,
    **kwargs: Any,
) -> httpx.Client:
    """Return a preconfigured httpx.Client."""
    default_headers = {
        "User-Agent": _DEFAULT_UA,
        "Accept": "application/json",
    }
    if headers:
        default_headers.update(headers)
    return httpx.Client(
        timeout=timeout,
        headers=default_headers,
        follow_redirects=True,
        **kwargs,
    )
# ---------------------------------------------------------------------------
# Config lookup
# ---------------------------------------------------------------------------
def get_key(env_var: str, cli_arg: str | None = None) -> str | None:
    """Read an API key; the CLI argument takes precedence over the environment variable."""
    if cli_arg:
        return cli_arg
    return os.environ.get(env_var)


# ---------------------------------------------------------------------------
# Script entry-point helper
# ---------------------------------------------------------------------------
def run_search(
    provider: str,
    search_fn,  # Callable[[str, int, ...], list[dict]]
    parser: argparse.ArgumentParser | None = None,
    extra_kwargs_fn=None,  # Callable[[Namespace], dict]: extracts extra keyword arguments from args
) -> None:
    """Generic script entry point: parse arguments, run the search, print JSON."""
    if parser is None:
        parser = build_parser(f"Search {provider}")
    args = parser.parse_args()
    extra = {}
    if extra_kwargs_fn:
        extra = extra_kwargs_fn(args)
    try:
        items = search_fn(args.query, args.limit, **extra)
        print_json(make_result(True, args.query, provider, items))
    except Exception as e:
        print_json(make_result(False, args.query, provider, [], str(e)))
        sys.exit(1)
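One detail of `make_item` is easy to miss: it drops extra fields whose value is `None`, an empty string, an empty list, or an empty dict, but it keeps `0` and `False`. The sketch below re-declares the function so that it runs on its own; the sample values are invented.

```python
from typing import Any


def make_item(title: str, url: str, snippet: str = "", **extra: Any) -> dict[str, Any]:
    """Stand-alone copy of the helper, for illustration."""
    item: dict[str, Any] = {"title": title, "url": url, "snippet": snippet}
    for key, value in extra.items():
        # 0 and False compare unequal to every entry in the tuple, so they are kept
        if value not in (None, "", [], {}):
            item[key] = value
    return item


item = make_item(
    "A paper",
    "https://example.org/p",
    doi=None,              # dropped
    authors=[],            # dropped
    citation_count=0,      # kept
    is_open_access=False,  # kept
    year=2024,             # kept
)
print(item)
```

This is why a paper with zero citations still reports `citation_count` in the JSON output, while a missing DOI simply does not appear.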


@@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""Semantic Scholar citation tracing: look up a paper's references (backward) and citing papers (forward)."""
from __future__ import annotations
import argparse
import sys
from search_utils import get_client, make_item, print_json
API_BASE = "https://api.semanticscholar.org/graph/v1/paper"
# Paper-level fields (nested under citedPaper/citingPaper)
# Note: tldr tends to trigger rate limiting in nested requests, so it is not requested
PAPER_FIELDS = [
    "title", "abstract", "year", "venue", "publicationDate",
    "authors", "citationCount", "influentialCitationCount",
    "isOpenAccess", "openAccessPdf", "externalIds", "fieldsOfStudy",
]
# Edge-level fields (properties of the citation relationship itself)
EDGE_FIELDS = ["contexts", "intents"]
def resolve_paper_id(identifier: str) -> str:
    """Convert a paper identifier into a form the Semantic Scholar API accepts.

    Supported inputs:
    - Semantic Scholar paper ID (40-char hex)
    - DOI: 10.xxxx/... → DOI:10.xxxx/...
    - ArXiv ID: 2301.07041 → ARXIV:2301.07041
    - PubMed ID: PMID:12345678
    - URL: https://www.semanticscholar.org/paper/... → the ID is extracted
    """
    identifier = identifier.strip()
    # S2 URL
    if "semanticscholar.org/paper/" in identifier:
        # The 40-char hex ID is the last path segment
        parts = identifier.rstrip("/").split("/")
        return parts[-1]
    # DOI
    if identifier.startswith("10."):
        return f"DOI:{identifier}"
    if identifier.lower().startswith("doi:"):
        return f"DOI:{identifier[4:]}"
    # ArXiv
    if identifier.lower().startswith("arxiv:"):
        # Upper-case only the prefix so a version suffix such as "v2" keeps its case
        return f"ARXIV:{identifier[6:]}"
    # Bare IDs such as 2301.07041 or 2301.07041v2
    if "." in identifier and identifier.replace(".", "").replace("v", "").isdigit():
        return f"ARXIV:{identifier}"
    # PMID
    if identifier.lower().startswith("pmid:"):
        return f"PMID:{identifier[5:]}"
    # Otherwise assume it is an S2 paper ID
    return identifier
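The identifier rules can be exercised without any network access. The following is a simplified sketch, not the script's own function: `to_s2_id` is a hypothetical name, it skips URL handling, and it uses a stricter regular expression for bare arXiv IDs than the digit check above.

```python
import re


def to_s2_id(identifier: str) -> str:
    """Map a DOI, arXiv ID, or PMID to the prefixed form; pass other IDs through."""
    identifier = identifier.strip()
    if identifier.startswith("10."):
        return f"DOI:{identifier}"
    # New-style arXiv IDs: YYMM.NNNNN with an optional version suffix
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", identifier):
        return f"ARXIV:{identifier}"
    prefix, sep, rest = identifier.partition(":")
    if sep and prefix.lower() in {"doi", "arxiv", "pmid"}:
        # Upper-case the prefix only; the rest of the ID is left untouched
        return f"{prefix.upper()}:{rest}"
    return identifier


for raw in ["10.1000/xyz", "2301.07041v2", "arxiv:2301.07041v2", "pmid:12345678"]:
    print(raw, "->", to_s2_id(raw))
```

Splitting on the first colon keeps a version suffix such as `v2` in lower case, which whole-string upper-casing would not.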
def fetch_refs(
    paper_id: str,
    direction: str,
    limit: int,
    min_citations: int,
    year_min: int | None,
    year_max: int | None,
    api_key: str | None = None,
) -> dict:
    """Fetch a paper's references or citations."""
    resolved = resolve_paper_id(paper_id)
    endpoint = f"{API_BASE}/{resolved}/{direction}"
    headers: dict[str, str] = {}
    if api_key:
        headers["x-api-key"] = api_key
    # The S2 API returns at most 1000 entries per request; further pages use offset
    # On the references/citations endpoints, paper fields take a nested prefix and edge fields are listed directly
    # Format: fields=contexts,intents,citedPaper.title,citedPaper.year,...
    paper_key_prefix = "citedPaper" if direction == "references" else "citingPaper"
    prefixed_fields = [f"{paper_key_prefix}.{f}" for f in PAPER_FIELDS]
    all_fields = ",".join(EDGE_FIELDS + prefixed_fields)
    params = {
        "fields": all_fields,
        # The citations endpoint returns results in reverse chronological order,
        # so a large batch is needed to find highly cited papers.
        # References are usually few (a few dozen), so over-fetching is harmless.
        "limit": 1000,
    }
    with get_client(timeout=30, headers=headers) as client:
        resp = client.get(endpoint, params=params)
        resp.raise_for_status()
        data = resp.json()
    # Fetch the source paper itself (used as context in the output)
    paper_resp = None
    with get_client(timeout=15, headers=headers) as client:
        try:
            r = client.get(f"{API_BASE}/{resolved}", params={"fields": "title,year,citationCount"})
            r.raise_for_status()
            paper_resp = r.json()
        except Exception:
            # The source-paper summary is optional; ignore failures here
            pass
    # direction=references yields {"data": [{"citedPaper": {...}, "contexts": [...], "intents": [...]}]}
    # direction=citations yields {"data": [{"citingPaper": {...}, "contexts": [...], "intents": [...]}]}
    paper_key = "citedPaper" if direction == "references" else "citingPaper"
    items = []
    for entry in data.get("data", []):
        paper = entry.get(paper_key, {})
        if not paper or not paper.get("title"):
            continue
        year = paper.get("year")
        citation_count = paper.get("citationCount") or 0
        # Filters (papers with no recorded year pass the year filters)
        if citation_count < min_citations:
            continue
        if year_min and year and year < year_min:
            continue
        if year_max and year and year > year_max:
            continue
        authors = [a.get("name", "") for a in paper.get("authors", [])]
        external_ids = paper.get("externalIds") or {}
        doi = external_ids.get("DOI")
        arxiv_id = external_ids.get("ArXiv")
        s2_id = paper.get("paperId", "")
        url = f"https://www.semanticscholar.org/paper/{s2_id}" if s2_id else ""
        abstract = paper.get("abstract") or ""
        snippet = abstract
        open_access_pdf = None
        if paper.get("openAccessPdf"):
            open_access_pdf = paper["openAccessPdf"].get("url")
        # contexts: the sentences in which the paper is cited (only meaningful for the citations direction)
        contexts = entry.get("contexts") or []
        intents = entry.get("intents") or []
        item = make_item(
            title=paper.get("title", ""),
            url=url,
            snippet=snippet,
            authors=authors,
            year=year,
            venue=paper.get("venue") or None,
            publication_date=paper.get("publicationDate"),
            citation_count=citation_count,
            influential_citation_count=paper.get("influentialCitationCount"),
            is_open_access=paper.get("isOpenAccess"),
            open_access_pdf=open_access_pdf,
            fields_of_study=paper.get("fieldsOfStudy") or None,
            doi=doi,
            arxiv_id=arxiv_id,
            paper_id=s2_id,
            citation_contexts=contexts[:3] if contexts else None,  # at most 3 contexts
            citation_intents=intents if intents else None,
        )
        items.append(item)
    # Sort by citation count and keep the top N
    items.sort(key=lambda x: x.get("citation_count", 0), reverse=True)
    items = items[:limit]
    result = {
        "success": True,
        "paper_id": resolved,
        "direction": direction,
        "provider": "semantic_scholar",
        "items": items,
        "total_available": len(data.get("data", [])),
        "returned": len(items),
        "error": None,
    }
    if paper_resp:
        result["source_paper"] = {
            "title": paper_resp.get("title"),
            "year": paper_resp.get("year"),
            "citation_count": paper_resp.get("citationCount"),
        }
    return result
def main():
    parser = argparse.ArgumentParser(
        description="Look up a paper's references (backward) or citing papers (forward)"
    )
    parser.add_argument(
        "paper_id",
        help="Paper identifier: S2 ID, DOI (e.g. 10.1234/...), ArXiv ID (e.g. 2301.07041), or PMID (e.g. PMID:12345678)",
    )
    parser.add_argument(
        "direction",
        choices=["references", "citations"],
        help="references = works this paper cites (backward); citations = works that cite this paper (forward)",
    )
    parser.add_argument("--limit", "-n", type=int, default=20, help="Number of results to return (default: 20)")
    parser.add_argument("--min-citations", type=int, default=0, help="Minimum citation count (default: 0)")
    parser.add_argument("--year-min", type=int, default=None, help="Earliest publication year")
    parser.add_argument("--year-max", type=int, default=None, help="Latest publication year")
    parser.add_argument("--api-key", help="Semantic Scholar API key (optional)")
    args = parser.parse_args()
    try:
        result = fetch_refs(
            args.paper_id,
            args.direction,
            args.limit,
            args.min_citations,
            args.year_min,
            args.year_max,
            args.api_key,
        )
        print_json(result)
    except Exception as e:
        print_json({
            "success": False,
            "paper_id": args.paper_id,
            "direction": args.direction,
            "provider": "semantic_scholar",
            "items": [],
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()
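The filter-then-rank step in `fetch_refs` is worth seeing on its own, because papers with no recorded year pass the year filters. This is a minimal sketch over plain dicts; `top_cited` is a hypothetical name and the sample papers are invented.

```python
def top_cited(papers, limit, min_citations=0, year_min=None, year_max=None):
    """Filter by citation count and year, then return the `limit` most cited papers."""
    kept = []
    for paper in papers:
        year = paper.get("year")
        count = paper.get("citationCount") or 0  # treat a missing count as 0
        if count < min_citations:
            continue
        # A year filter only applies when the paper actually has a year
        if year_min and year and year < year_min:
            continue
        if year_max and year and year > year_max:
            continue
        kept.append({**paper, "citationCount": count})
    kept.sort(key=lambda p: p["citationCount"], reverse=True)
    return kept[:limit]


# Invented sample data
papers = [
    {"title": "A", "year": 2015, "citationCount": 900},
    {"title": "B", "year": 2021, "citationCount": None},
    {"title": "C", "year": 2022, "citationCount": 40},
    {"title": "D", "year": None, "citationCount": 75},
]
titles = [p["title"] for p in top_cited(papers, limit=2, min_citations=10, year_min=2020)]
print(titles)
```

Paper A is excluded by year and B by citation count, while D survives because it has no year to compare. Callers who need strict year bounds would have to drop such entries themselves.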


@@ -0,0 +1,104 @@
#!/usr/bin/env python3
"""Semantic Scholar paper search via the Semantic Scholar Graph API."""
from __future__ import annotations
import sys
from search_utils import build_parser, get_client, make_item, make_result, print_json
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = ",".join([
    "title", "abstract", "tldr", "year", "venue", "publicationVenue", "publicationDate",
    "authors", "citationCount", "influentialCitationCount",
    "referenceCount", "isOpenAccess", "openAccessPdf",
    "externalIds", "fieldsOfStudy", "publicationTypes", "journal",
])
def search(query: str, limit: int, api_key: str | None = None) -> list[dict]:
    """Run a Semantic Scholar search."""
    headers: dict[str, str] = {}
    if api_key:
        headers["x-api-key"] = api_key
    params = {
        "query": query,
        "limit": min(limit, 100),
        "fields": FIELDS,
    }
    with get_client(timeout=30, headers=headers) as client:
        resp = client.get(API_URL, params=params)
        resp.raise_for_status()
        data = resp.json()
    items = []
    for paper in data.get("data", [])[:limit]:
        authors = [a.get("name", "") for a in paper.get("authors", [])]
        open_access_pdf = None
        if paper.get("openAccessPdf"):
            open_access_pdf = paper["openAccessPdf"].get("url")
        external_ids = paper.get("externalIds") or {}
        doi = external_ids.get("DOI")
        arxiv_id = external_ids.get("ArXiv")
        paper_id = paper.get("paperId", "")
        url = f"https://www.semanticscholar.org/paper/{paper_id}"
        # Snippet: prefer the abstract, fall back to the TLDR when it is missing
        abstract = paper.get("abstract") or ""
        tldr = (paper.get("tldr") or {}).get("text")
        snippet = abstract or tldr or ""
        # Journal/conference: venue (free-form string) plus publicationVenue (structured)
        venue = paper.get("venue") or (paper.get("journal") or {}).get("name")
        pub_venue = paper.get("publicationVenue") or {}
        publication_venue = {
            k: pub_venue[k]
            for k in ("id", "name", "type", "url")
            if pub_venue.get(k)
        } or None
        items.append(make_item(
            title=paper.get("title") or "",
            url=url,
            snippet=snippet,
            tldr=tldr,
            authors=authors,
            year=paper.get("year"),
            venue=venue if venue else None,
            publication_venue=publication_venue,
            publication_date=paper.get("publicationDate"),
            citation_count=paper.get("citationCount"),
            influential_citation_count=paper.get("influentialCitationCount"),
            reference_count=paper.get("referenceCount"),
            is_open_access=paper.get("isOpenAccess"),
            open_access_pdf=open_access_pdf,
            fields_of_study=paper.get("fieldsOfStudy") or None,
            publication_types=paper.get("publicationTypes") or None,
            doi=doi,
            arxiv_id=arxiv_id,
            paper_id=paper_id,
        ))
    return items
def main():
    parser = build_parser("Search Semantic Scholar for academic papers")
    parser.add_argument("--api-key", help="Semantic Scholar API key (optional; raises the rate limit)")
    args = parser.parse_args()
    try:
        items = search(args.query, args.limit, args.api_key)
        print_json(make_result(True, args.query, "semantic_scholar", items))
    except Exception as e:
        print_json(make_result(False, args.query, "semantic_scholar", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
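Two small normalization steps in `search` handle the API returning `null` for optional fields: the snippet falls back from abstract to TLDR, and the structured venue keeps only non-empty keys. The sketch below isolates both; `pick_snippet` and `compact_venue` are hypothetical names and the sample payloads are invented.

```python
def pick_snippet(paper: dict) -> str:
    """Prefer the abstract; fall back to the TLDR text, then to an empty string."""
    abstract = paper.get("abstract") or ""
    tldr = (paper.get("tldr") or {}).get("text")
    return abstract or tldr or ""


def compact_venue(paper: dict):
    """Keep only the non-empty venue keys; return None when nothing remains."""
    pub_venue = paper.get("publicationVenue") or {}
    return {
        key: pub_venue[key]
        for key in ("id", "name", "type", "url")
        if pub_venue.get(key)
    } or None


# Invented payload with the null values that optional fields can carry
paper = {
    "abstract": None,
    "tldr": {"text": "Short summary."},
    "publicationVenue": {"id": "v1", "name": "NeurIPS", "type": None},
}
print(pick_snippet(paper))
print(compact_venue(paper))
```

The `or {}` guards matter because `dict.get` returns `None` both for a missing key and for an explicit `null`, and calling `.get` on `None` would raise.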


@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""Wikipedia search via the MediaWiki API."""
from __future__ import annotations
import sys
from search_utils import build_parser, get_client, make_item, make_result, print_json
def _api_url(lang: str) -> str:
    return f"https://{lang}.wikipedia.org/w/api.php"


def search(query: str, limit: int, lang: str = "en") -> list[dict]:
    """Run a Wikipedia search."""
    from urllib.parse import quote

    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": min(limit, 50),
        "srprop": "snippet|timestamp|wordcount|size|sectiontitle|sectionsnippet",
        "format": "json",
        "utf8": 1,
    }
    with get_client() as client:
        resp = client.get(_api_url(lang), params=params)
        resp.raise_for_status()
        data = resp.json()
    items = []
    for result in data.get("query", {}).get("search", [])[:limit]:
        title = result.get("title", "")
        # The snippet is an HTML fragment; strip the tags with a simple regex
        snippet = _strip_html(result.get("snippet", ""))
        page_id = result.get("pageid", "")
        # Percent-encode the title so characters such as "?" or "#" do not break the URL
        url = f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"
        section_title = result.get("sectiontitle", "")
        section_snippet = _strip_html(result.get("sectionsnippet", ""))
        items.append(make_item(
            title=title,
            url=url,
            snippet=snippet,
            word_count=result.get("wordcount"),
            size=result.get("size"),
            timestamp=result.get("timestamp"),
            page_id=page_id,
            section_title=section_title if section_title else None,
            section_snippet=section_snippet if section_snippet else None,
        ))
    return items
def _strip_html(html: str) -> str:
    import re
    text = re.sub(r"<[^>]+>", "", html)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def main():
    parser = build_parser("Search Wikipedia articles")
    parser.add_argument("--lang", "-l", default="en",
                        help="Language edition (default: en; others include zh, ja, de)")
    args = parser.parse_args()
    try:
        items = search(args.query, args.limit, args.lang)
        print_json(make_result(True, args.query, "wikipedia", items))
    except Exception as e:
        print_json(make_result(False, args.query, "wikipedia", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
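The snippet cleanup and article URL construction can both be checked offline. This is a small sketch under stated assumptions: `strip_html` mirrors the regex approach above, while `article_url` is a hypothetical helper whose percent-encoding via `urllib.parse.quote` is an illustrative choice.

```python
import re
from urllib.parse import quote


def strip_html(html: str) -> str:
    """Remove tags with a regex and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", "", html)
    return re.sub(r"\s+", " ", text).strip()


def article_url(title: str, lang: str = "en") -> str:
    """Build an article URL; spaces become underscores, other characters are percent-encoded."""
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


print(strip_html('The <span class="searchmatch">Transformer</span>\n model'))
print(article_url("Attention Is All You Need"))
print(article_url("C++"))
```

The regex only removes tags. HTML entities such as `&quot;` are left as they are, so a caller that needs plain text may want to decode them separately.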