first commit
287
sn-search-academic/SKILL.md
Normal file
@@ -0,0 +1,287 @@
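---
name: sn-search-academic
description: "Multi-source academic search: ArXiv, Semantic Scholar (with citation counts), PubMed, Wikipedia. Supports section-by-section reading of ArXiv HTML full text and PMC full text. Trigger phrases: academic papers, literature survey, citation data, biomedical literature, encyclopedia lookup. One-stop multi-source tool."
---

# sn-search-academic - Academic Search

Searches four academic platforms (ArXiv, Semantic Scholar, PubMed, Wikipedia) and provides **section-by-section full-text reading** for ArXiv and PMC. Everything is free to use; some scripts accept an optional API key that raises the rate limit.
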
## Dependencies

Install this skill's Python dependencies before running the scripts:

```bash
python3 -m pip install -r skills/sn-search-academic/requirements.txt
```

If the project uses a `uv` environment:

```bash
uv pip install -r skills/sn-search-academic/requirements.txt
```

`arxiv_paper.py` needs `beautifulsoup4` to parse ArXiv HTML; the other scripts mainly rely on `httpx` to send requests.

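## Available Scripts

| Script | Platform | Purpose | API key |
|------|------|------|---------|
| `arxiv_search.py` | ArXiv | Preprint search; supports author/title/ID queries | Not required |
| `arxiv_paper.py` | ArXiv HTML | Read the full text of an ArXiv paper section by section | Not required |
| `semantic_scholar_search.py` | Semantic Scholar | Cross-discipline search with citation counts and TLDR | Not required (a key raises the limit) |
| `semantic_scholar_refs.py` | Semantic Scholar | Citation tracing: a paper's references (backward) or the papers citing it (forward) | Not required (a key raises the limit) |
| `pubmed_search.py` | PubMed | Biomedical literature search with structured abstracts and PMC IDs | Not required (a key raises the limit) |
| `pmc_paper.py` | PMC | Read the full text of a PMC open-access paper section by section | Not required (a key raises the limit) |
| `wikipedia_search.py` | Wikipedia | Encyclopedia article search; multilingual | Not required |
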
## Parameter Reference

### arxiv_search.py

```bash
python3 scripts/arxiv_search.py <query> [options]
```

| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (may be omitted when `--id-list` is used) | - |
| `--limit`, `-n` | Number of results to return | 10 |
| `--category`, `-c` | ArXiv category filter (see "ArXiv Category Quick Reference" below) | - |
| `--sort` | Sort order: `relevance`, `date`, `submitted` | relevance |
| `--author`, `-a` | Filter by author; separate multiple authors with commas | - |
| `--title-only` | Search in titles only | - |
| `--id-list` | Fetch metadata directly by arXiv ID, comma-separated | - |

```bash
python3 scripts/arxiv_search.py "transformer attention mechanism" --limit 5
python3 scripts/arxiv_search.py "diffusion model" --author "ho jonathan" --category cs.CV
python3 scripts/arxiv_search.py --id-list "2409.05591,2301.07041"
```

**Output fields**: `title`, `url`, `snippet` (abstract), `arxiv_id`, `authors`, `published`, `updated`, `pdf_url`, `html_url`, `categories`, `primary_category`, `comment`, `journal_ref`, `doi`

### arxiv_paper.py

Reads the body of an ArXiv paper section by section (the paper must have an HTML version; most papers from 2020 onward do).

```bash
python3 scripts/arxiv_paper.py <arxiv_id> [--section SECTION_NAME]
```

| Parameter | Description |
|------|------|
| `arxiv_id` | arXiv ID (e.g. `2409.05591` or `2409.05591v2`) |
| `--section`, `-s` | Section name (case-insensitive, partial match supported). If omitted, all sections are listed. |

```bash
python3 scripts/arxiv_paper.py 2409.05591                        # list sections
python3 scripts/arxiv_paper.py 2409.05591 --section introduction
python3 scripts/arxiv_paper.py 2409.05591 --section method
```

**Output fields when listing sections**: `arxiv_id`, `abs_url`, `html_url`, `pdf_url`, `section_count`, `sections[]` (name, level)

**Output fields when reading a section**: `arxiv_id`, `section`, `level`, `content`, `char_count`

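The `--section` lookup (case-insensitive, partial match) can be pictured with the minimal sketch below. It follows the matching rules stated in this section; the function name `match_section` and the sample section names are assumptions made for illustration and are not part of the script's interface.

```python
from __future__ import annotations

import re


def match_section(names: list[str], query: str) -> str | None:
    """Return the first section name that matches `query`, or None."""
    q = query.lower().strip()

    def clean(name: str) -> str:
        # Drop a leading number such as "1 " or "2. " before comparing.
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Pass 1: exact match, with or without the numeric prefix.
    for name in names:
        if name.lower() == q or clean(name) == q:
            return name

    # Pass 2: substring match on the cleaned name.
    for name in names:
        if q in clean(name):
            return name

    return None


if __name__ == "__main__":
    sections = ["Abstract", "1 Introduction", "3 Methods"]  # hypothetical names
    print(match_section(sections, "method"))
```

Trying exact matches first keeps a short query such as `method` from being captured by an earlier section whose title merely contains that word when a section named exactly `Method` exists.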
### semantic_scholar_search.py

```bash
python3 scripts/semantic_scholar_search.py <query> [options]
```

| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (required) | - |
| `--limit`, `-n` | Number of results to return | 10 |
| `--api-key` | Semantic Scholar API key (can also be set through the `S2_API_KEY` environment variable) | - |

```bash
python3 scripts/semantic_scholar_search.py "transformer architecture" --limit 5
python3 scripts/semantic_scholar_search.py "RLHF language model" --limit 10
```

**Output fields**: `title`, `url`, `snippet` (abstract; falls back to the TLDR when the abstract is missing), `tldr`, `authors`, `year`, `venue`, `publication_date`, `citation_count`, `influential_citation_count`, `reference_count`, `is_open_access`, `open_access_pdf`, `fields_of_study`, `publication_types`, `doi`, `arxiv_id`, `paper_id`

### semantic_scholar_refs.py

Citation tracing: given a paper, look up its references (backward) or the papers that cite it (forward).

```bash
python3 scripts/semantic_scholar_refs.py <paper_id> <direction> [options]
```

| Parameter | Description | Default |
|------|------|--------|
| `paper_id` | Paper identifier: S2 ID, DOI (`10.xxxx/...`), ArXiv ID (`2301.07041`), PMID (`PMID:12345678`) | - |
| `direction` | `references` = the paper's references (backward), `citations` = papers citing it (forward) | - |
| `--limit`, `-n` | Number of results to return | 20 |
| `--min-citations` | Minimum citation count filter | 0 |
| `--year-min` | Earliest publication year | - |
| `--year-max` | Latest publication year | - |
| `--api-key` | Semantic Scholar API key (optional) | - |

```bash
# Which papers does this paper cite? (backward: find foundational work)
python3 scripts/semantic_scholar_refs.py 2301.07041 references --limit 10

# Who cites this paper? (forward: find follow-up work)
python3 scripts/semantic_scholar_refs.py 2301.07041 citations --limit 10 --min-citations 50

# Look up citations by DOI, restricted to 2023 and later
python3 scripts/semantic_scholar_refs.py "10.1038/s41586-024-07487-w" citations --year-min 2023

# Find highly cited references
python3 scripts/semantic_scholar_refs.py ARXIV:2005.14165 references --min-citations 100 --limit 5
```

**Output fields**: `title`, `url`, `snippet` (abstract/TLDR), `authors`, `year`, `venue`, `citation_count`, `influential_citation_count`, `is_open_access`, `open_access_pdf`, `doi`, `arxiv_id`, `paper_id`, `citation_contexts` (sentences surrounding the citation, at most 3), `citation_intents` (citation intent)

**Additional output fields**: `source_paper` (title/year/citation count of the queried paper), `total_available` (total number of papers in that direction), `returned` (number returned after filtering)

### pubmed_search.py

Supports PubMed query syntax, such as field tags (`cancer[Title]`) and date restrictions (`2024[pdat]`).

```bash
python3 scripts/pubmed_search.py <query> [options]
```

| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords; PubMed query syntax is supported | - |
| `--limit`, `-n` | Number of results to return | 10 |
| `--api-key` | NCBI API key (optional; raises the limit from 3 req/s to 10 req/s) | - |

```bash
python3 scripts/pubmed_search.py "CRISPR gene editing" --limit 5
python3 scripts/pubmed_search.py "Alzheimer[Title] AND treatment[Title]" --limit 5
```

**Output fields**: `title`, `url`, `snippet` (structured abstract), `authors`, `pmid`, `pmc_id` (when present, it can be passed to `pmc_paper.py`), `pmc_url`, `journal`, `pub_date`, `volume`, `issue`, `pages`, `keywords`, `pub_types`, `doi`

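Field-tagged queries such as `Alzheimer[Title] AND treatment[Title]` can be assembled programmatically before being passed to `pubmed_search.py`. The sketch below is one possible helper; the name `build_pubmed_query` and its parameters are assumptions made for illustration, and only the `[Title]` and `[pdat]` tags mentioned in this section are used.

```python
from __future__ import annotations


def build_pubmed_query(terms: list[tuple[str, str]], year: int | None = None) -> str:
    """Join (term, field) pairs with AND, optionally restricting the publication year."""
    parts = [f"{term}[{field}]" for term, field in terms]
    if year is not None:
        # [pdat] is the publication-date tag shown in the text above.
        parts.append(f"{year}[pdat]")
    return " AND ".join(parts)


if __name__ == "__main__":
    query = build_pubmed_query([("Alzheimer", "Title"), ("treatment", "Title")], year=2024)
    print(query)
```

The resulting string is passed as the single quoted `query` argument, so the shell does not split it on spaces.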
### pmc_paper.py

Reads open-access full text from PubMed Central (about 7 million biomedical papers, roughly 35% of PubMed). Papers whose `pmc_id` is `null` in the `pubmed_search.py` results cannot be read with this tool.

```bash
python3 scripts/pmc_paper.py <pmc_id> [--section SECTION_NAME]
python3 scripts/pmc_paper.py --pmid <pmid> [--section SECTION_NAME]
```

| Parameter | Description |
|------|------|
| `pmc_id` | PMC ID (e.g. `PMC11119143` or `11119143`) |
| `--pmid` | PubMed ID, converted to a PMC ID automatically (use either this or `pmc_id`) |
| `--section`, `-s` | Section name (case-insensitive, partial match supported). If omitted, all sections are listed. |
| `--api-key` | NCBI API key (optional) |

```bash
python3 scripts/pmc_paper.py PMC11119143                        # list sections
python3 scripts/pmc_paper.py PMC11119143 --section introduction
python3 scripts/pmc_paper.py --pmid 38786024 --section conclusion
```

**Output fields when listing sections**: `pmc_id`, `pmid`, `title`, `pmc_url`, `section_count`, `sections[]` (name, level; includes the subsection hierarchy)

**Output fields when reading a section**: `pmc_id`, `section`, `level`, `content` (includes subsection text), `char_count`

### wikipedia_search.py

```bash
python3 scripts/wikipedia_search.py <query> [options]
```

| Parameter | Description | Default |
|------|------|--------|
| `query` | Search keywords (required) | - |
| `--limit`, `-n` | Number of results to return | 10 |
| `--lang`, `-l` | Language edition (`en`, `zh`, `ja`, `de`, `fr`, etc.) | en |

```bash
python3 scripts/wikipedia_search.py "machine learning" --limit 5
python3 scripts/wikipedia_search.py "深度学习" --lang zh --limit 5
```

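## Full-Text Reading Workflow

The search scripts return abstracts; the reader scripts return body text. Used together, they let you read closely only where needed, which saves tokens.

**ArXiv papers**:
1. Search with `arxiv_search.py` → obtain the `arxiv_id`
2. List sections with `arxiv_paper.py <id>` → run `arxiv_paper.py <id> --section introduction` to decide quickly whether to go deeper
3. Read `method` / `experiment` / `conclusion` as needed

**PMC biomedical papers**:
1. Search with `pubmed_search.py` → take the `pmc_id` from the results (full text exists only when it is not null)
2. List sections with `pmc_paper.py <pmc_id>` → read the key sections as needed
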
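The PMC half of the workflow can be sketched in a few lines: keep only the search results that carry a `pmc_id`, then build the follow-up reader command. This is an illustration under stated assumptions, not part of the skill: the helper names `readable_pmc_ids` and `pmc_read_command` are hypothetical, and the input is assumed to be the JSON that `pubmed_search.py` prints to standard output.

```python
from __future__ import annotations

import json


def readable_pmc_ids(search_stdout: str) -> list[str]:
    """Return the pmc_id of every search result that has PMC full text."""
    result = json.loads(search_stdout)
    if not result.get("success"):
        return []
    # A null/missing pmc_id means the paper cannot be read with pmc_paper.py.
    return [item["pmc_id"] for item in result.get("items", []) if item.get("pmc_id")]


def pmc_read_command(pmc_id: str, section: str | None = None) -> list[str]:
    """Build the argv for listing sections (no section) or reading one section."""
    cmd = ["python3", "scripts/pmc_paper.py", pmc_id]
    if section:
        cmd += ["--section", section]
    return cmd


if __name__ == "__main__":
    sample = json.dumps({
        "success": True,
        "items": [
            {"title": "Paper A", "pmc_id": "PMC11119143"},
            {"title": "Paper B", "pmc_id": None},
        ],
        "error": None,
    })
    for pmc_id in readable_pmc_ids(sample):
        print(pmc_read_command(pmc_id, "introduction"))
```

The returned argument list can be handed to `subprocess.run` as is; building it as a list rather than a string avoids shell quoting problems with section names that contain spaces.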
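## Citation Tracing Workflow

Use a paper's citation relationships to discover related work that keyword search does not reach.

**Backward (find foundational work)**:
1. Find a highly relevant paper through keyword search → take its `paper_id` or `arxiv_id`
2. `semantic_scholar_refs.py <id> references --min-citations 50` → find its highly cited references
3. Filter for entries relevant to the research question → read them in depth with `arxiv_paper.py` or `pmc_paper.py`

**Forward (find follow-up work)**:
1. Find a foundational or key paper in the field → take its ID
2. `semantic_scholar_refs.py <id> citations --year-min 2024 --min-citations 10` → find recent, highly cited follow-up work
3. Filter for entries relevant to the research question → read them in depth

**Citation chain (trace how an idea evolved)**:
1. Start from seed paper A → go backward to find A's key reference B
2. Start from B → go forward to find later work citing B (this may surface a related paper C that A does not cite)
3. This yields the lineages B → A → ... and B → C → ...
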
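The citation-chain steps above can be sketched as a small function that takes the lookup as a parameter. This is a minimal illustration, not the skill's implementation: `citation_chain` and the `fetch(paper_id, direction)` callable are assumptions, with `fetch` standing in for a call to `semantic_scholar_refs.py` that returns the `items` list. Only the `paper_id` and `citation_count` output fields documented above are used.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical lookup: (paper_id, "references" | "citations") -> list of items.
Fetch = Callable[[str, str], list[dict]]


def citation_chain(seed_id: str, fetch: Fetch, min_citations: int = 50) -> dict[str, list[str]]:
    """Map each key reference B of the seed A to papers citing B that A does not cite."""
    # Step 1 (backward): keep only the seed's highly cited references.
    key_refs = [
        p for p in fetch(seed_id, "references")
        if (p.get("citation_count") or 0) >= min_citations
    ]
    known = {seed_id} | {p["paper_id"] for p in key_refs}

    # Step 2 (forward): for each key reference, collect citing papers not seen yet.
    discoveries: dict[str, list[str]] = {}
    for ref in key_refs:
        citing = fetch(ref["paper_id"], "citations")
        discoveries[ref["paper_id"]] = [
            c["paper_id"] for c in citing if c["paper_id"] not in known
        ]
    return discoveries


if __name__ == "__main__":
    # Toy graph: A cites B (highly cited) and B2 (rarely cited); B is cited by A and C.
    graph = {
        ("A", "references"): [
            {"paper_id": "B", "citation_count": 100},
            {"paper_id": "B2", "citation_count": 5},
        ],
        ("B", "citations"): [{"paper_id": "A"}, {"paper_id": "C"}],
    }
    print(citation_chain("A", lambda pid, direction: graph.get((pid, direction), [])))
```

Injecting `fetch` keeps the traversal logic testable without network access; in real use it would wrap a subprocess call to the script and return the parsed `items`.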
## ArXiv Category Quick Reference

Top-level fields can be used directly (e.g. `--category cs`); subcategories are more precise (e.g. `--category cs.AI`).

| Field | Category code | Description |
|------|---------|------|
| **Computer science** | `cs.AI` | Artificial intelligence |
| | `cs.LG` | Machine learning |
| | `cs.CL` | Computational linguistics / NLP |
| | `cs.CV` | Computer vision |
| | `cs.IR` | Information retrieval |
| | `cs.RO` | Robotics |
| | `cs.SE` | Software engineering |
| | `cs.DC` | Distributed/parallel computing |
| | `cs.NI` | Networking and the Internet |
| | `cs.CR` | Cryptography and security |
| | `cs.DB` | Databases |
| | `cs.HC` | Human-computer interaction |
| **Statistics** | `stat.ML` | Statistical machine learning |
| | `stat.AP` | Applied statistics |
| | `stat.ME` | Statistical methodology |
| **Mathematics** | `math.OC` | Optimization and control |
| | `math.ST` | Statistics theory |
| | `math.CO` | Combinatorics |
| **Physics** | `physics` | Physics (entire archive) |
| | `cond-mat` | Condensed matter physics |
| | `quant-ph` | Quantum physics |
| | `hep-th` | High-energy physics, theory |
| **Economics/finance** | `econ.GN` | General economics |
| | `q-fin.CP` | Computational finance |
| | `q-fin.ST` | Statistical finance |
| **Biology/medicine** | `q-bio.NC` | Neuroscience |
| | `q-bio.GN` | Genomics |
| | `q-bio.QM` | Quantitative methods |

## Output Format

All scripts emit standard JSON:

```json
{
  "success": true,
  "query": "...",
  "provider": "arxiv|semantic_scholar|pubmed|wikipedia",
  "items": [{"title": "...", "url": "...", "snippet": "...", ...}],
  "error": null
}
```

`arxiv_paper.py` and `pmc_paper.py` do not use the `items` format; they return a structured object directly (see their respective "Output fields" descriptions).
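
A caller will usually want to check `success` before touching `items`. The sketch below shows one way to do that for the search scripts; `SearchError` and `parse_items` are hypothetical names introduced for this example, and the input is assumed to be the JSON text a search script printed to standard output.

```python
from __future__ import annotations

import json


class SearchError(RuntimeError):
    """Raised when a search script reports success = false."""


def parse_items(stdout_text: str) -> list[dict]:
    """Return the items of a search result, or raise SearchError on failure."""
    result = json.loads(stdout_text)
    if not result.get("success"):
        raise SearchError(result.get("error") or "unknown error")
    return result.get("items", [])


if __name__ == "__main__":
    ok = '{"success": true, "query": "q", "provider": "arxiv", "items": [{"title": "T"}], "error": null}'
    print([item["title"] for item in parse_items(ok)])
```

This helper applies only to the `items` envelope; the objects returned by `arxiv_paper.py` and `pmc_paper.py` need their own handling.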
2
sn-search-academic/requirements.txt
Normal file
@@ -0,0 +1,2 @@
httpx>=0.25.0
beautifulsoup4>=4.12.0
Binary file not shown.
304
sn-search-academic/scripts/arxiv_paper.py
Normal file
@@ -0,0 +1,304 @@
#!/usr/bin/env python3
"""
ArXiv paper section reader.

Parses the arXiv HTML version of a paper (converted by LaTeXML) and supports:
- listing the paper's full section structure
- extracting the body of a section by name (case-insensitive, partial match supported)

Usage:
    python3 arxiv_paper.py 2409.05591                          # list sections
    python3 arxiv_paper.py 2409.05591 --section introduction   # read a given section
    python3 arxiv_paper.py 2409.05591 --section method
"""
from __future__ import annotations

import argparse
import json
import re
import sys
from typing import Any

from search_utils import get_client, print_json

BeautifulSoup: Any = None
NavigableString: Any = None
Tag: Any = None


def ensure_bs4() -> None:
    """Load BeautifulSoup only when the script needs to parse paper HTML."""
    global BeautifulSoup, NavigableString, Tag
    if BeautifulSoup is not None:
        return

    try:
        from bs4 import BeautifulSoup as Bs4BeautifulSoup
        from bs4 import NavigableString as Bs4NavigableString
        from bs4 import Tag as Bs4Tag
    except ImportError:
        print_json({
            "success": False,
            "error": "beautifulsoup4 is missing; run: python3 -m pip install -r skills/sn-search-academic/requirements.txt",
        })
        sys.exit(1)

    BeautifulSoup = Bs4BeautifulSoup
    NavigableString = Bs4NavigableString
    Tag = Bs4Tag

HTML_BASE = "https://arxiv.org/html"
ABS_BASE = "https://arxiv.org/abs"
PDF_BASE = "https://arxiv.org/pdf"

# ── HTML fetching ─────────────────────────────────────────────────────────────

def fetch_html(arxiv_id: str) -> str:
    """Fetch the arXiv HTML version; raise a meaningful error if it does not exist."""
    url = f"{HTML_BASE}/{arxiv_id}"
    with get_client(timeout=45, headers={"Accept": "text/html,application/xhtml+xml"}) as client:
        resp = client.get(url)

    if resp.status_code == 404:
        raise ValueError(
            f"Paper {arxiv_id} has no HTML version yet. "
            "Possible reasons: the paper is old (before 2018), was not written in LaTeX, or has not been converted yet. "
            f"Read the PDF directly instead: {PDF_BASE}/{arxiv_id}"
        )
    resp.raise_for_status()
    return resp.text


# ── Text cleanup ──────────────────────────────────────────────────────────────

def _elem_to_text(elem: Tag) -> str:
    """
    Convert an HTML element to readable text.
    - math elements: keep the LaTeX annotation as $...$; other MathML text is dropped
    - figure and table captions: kept
    - noise nodes (footnote marks, reference tags, numbering tags) are skipped
    """
    parts: list[str] = []

    for node in elem.descendants:
        if not isinstance(node, NavigableString):
            continue

        parent = node.parent
        if parent is None:
            continue

        tag = parent.name

        # Skip noise such as footnote marks and superscript reference tags
        parent_classes = parent.get("class") or []
        if any(c in parent_classes for c in ("ltx_note_mark", "ltx_ref_tag", "ltx_tag")):
            continue

        # math elements: take the LaTeX annotation
        if tag == "annotation":
            encoding = parent.get("encoding", "")
            if "tex" in encoding.lower() or "latex" in encoding.lower():
                latex = node.strip()
                if latex:
                    parts.append(f"${latex}$")
            continue

        # Skip non-annotation text inside math (MathML structural text is messy)
        in_math = False
        for ancestor in parent.parents:
            if ancestor.name == "math":
                in_math = True
                break
        if in_math:
            continue

        text = str(node)
        if text.strip():
            parts.append(text)

    raw = "".join(parts)
    # Collapse redundant whitespace while keeping paragraph breaks
    raw = re.sub(r"[ \t]+", " ", raw)
    raw = re.sub(r"\n{3,}", "\n\n", raw)
    return raw.strip()


# ── Section extraction ────────────────────────────────────────────────────────

def extract_sections(html: str) -> list[dict[str, Any]]:
    """
    Extract all sections (including the abstract) from arXiv HTML.

    Returns a list whose entries contain:
        name  - section title (with numbering, e.g. "1 Introduction")
        level - depth (0=abstract, 1=h2, 2=h3, 3=h4)
        text  - body text
    """
    ensure_bs4()
    soup = BeautifulSoup(html, "html.parser")
    sections: list[dict[str, Any]] = []

    # ── Abstract ──
    abstract_elem = soup.find(class_=re.compile(r"\bltx_abstract\b"))
    if abstract_elem:
        # Remove the "Abstract" title line
        for h in abstract_elem.find_all(["h2", "h6"], class_=re.compile(r"ltx_title")):
            h.decompose()
        abstract_text = _elem_to_text(abstract_elem)
        if abstract_text:
            sections.append({"name": "Abstract", "level": 0, "text": abstract_text})

    # ── Body sections ──
    # Subsections are matched too: their text is stripped from the parent below,
    # so without their own entries that text would be lost.
    section_class = re.compile(r"\bltx_(?:section|subsection|subsubsection|appendix)\b")
    for sec in soup.find_all("section", class_=section_class):
        # Find this level's heading (not a child section's heading)
        heading: Tag | None = None
        for h_tag in ["h2", "h3", "h4"]:
            candidate = sec.find(h_tag, class_=re.compile(r"\bltx_title\b"), recursive=False)
            if candidate:
                heading = candidate
                break

        if heading is None:
            # Some sections keep their title inside the first div
            for h_tag in ["h2", "h3", "h4"]:
                candidate = sec.find(h_tag, class_=re.compile(r"\bltx_title\b"))
                if candidate:
                    heading = candidate
                    break

        if heading is None:
            continue

        # Clean the title (trailing ¶ permalink, redundant whitespace)
        heading_text = heading.get_text(" ", strip=True).rstrip("¶").strip()
        heading_text = re.sub(r"\s+", " ", heading_text)
        level = {"h2": 1, "h3": 2, "h4": 3}.get(heading.name, 1)

        # Extract this section's own text (child sections excluded to avoid duplication)
        sec_copy = BeautifulSoup(str(sec), "html.parser").find("section")
        # Remove child sections
        for child_sec in sec_copy.find_all("section", recursive=False):
            child_sec.decompose()
        # Remove the heading itself
        for h in sec_copy.find_all(["h2", "h3", "h4"], class_=re.compile(r"\bltx_title\b"), recursive=False):
            h.decompose()

        text = _elem_to_text(sec_copy)

        if not text.strip():
            continue

        sections.append({"name": heading_text, "level": level, "text": text})

    return sections


# ── Section name matching ─────────────────────────────────────────────────────

def _match_section(sections: list[dict], query: str) -> dict | None:
    """Fuzzy match: case-insensitive, ignoring numeric prefixes."""
    q = query.lower().strip()

    def clean(name: str) -> str:
        """Strip numeric prefixes such as '1 ' or '1. '."""
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Exact match
    for s in sections:
        if s["name"].lower() == q or clean(s["name"]) == q:
            return s

    # Prefix / substring match
    for s in sections:
        if clean(s["name"]).startswith(q) or q in clean(s["name"]):
            return s

    return None


# ── Public interface ──────────────────────────────────────────────────────────

def cmd_list_sections(arxiv_id: str) -> dict[str, Any]:
    """List all sections of the paper (without body text)."""
    html = fetch_html(arxiv_id)
    sections = extract_sections(html)
    return {
        "success": True,
        "arxiv_id": arxiv_id,
        "abs_url": f"{ABS_BASE}/{arxiv_id}",
        "html_url": f"{HTML_BASE}/{arxiv_id}",
        "pdf_url": f"{PDF_BASE}/{arxiv_id}",
        "section_count": len(sections),
        "sections": [{"name": s["name"], "level": s["level"]} for s in sections],
        "error": None,
    }


def cmd_read_section(arxiv_id: str, section_name: str) -> dict[str, Any]:
    """Read the body text of the given section."""
    html = fetch_html(arxiv_id)
    sections = extract_sections(html)
    matched = _match_section(sections, section_name)

    if matched is None:
        available = [s["name"] for s in sections]
        return {
            "success": False,
            "arxiv_id": arxiv_id,
            "section": section_name,
            "content": None,
            "error": f"Section '{section_name}' not found. Available sections: {available}",
        }

    return {
        "success": True,
        "arxiv_id": arxiv_id,
        "abs_url": f"{ABS_BASE}/{arxiv_id}",
        "section": matched["name"],
        "level": matched["level"],
        "content": matched["text"],
        "char_count": len(matched["text"]),
        "error": None,
    }


# ── CLI ───────────────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(
        description="ArXiv paper section reader",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python3 arxiv_paper.py 2409.05591                          list all sections
  python3 arxiv_paper.py 2409.05591 --section introduction   read Introduction
  python3 arxiv_paper.py 2409.05591 --section method         read Method/Methods
  python3 arxiv_paper.py 2409.05591 --section conclusion     read Conclusion
""",
    )
    parser.add_argument("arxiv_id", help="arXiv paper ID (e.g. 2409.05591 or 2409.05591v2)")
    parser.add_argument(
        "--section", "-s",
        metavar="SECTION_NAME",
        help="Name of the section to read (case-insensitive, partial match supported). If omitted, all sections are listed.",
    )
    args = parser.parse_args()

    try:
        if args.section:
            result = cmd_read_section(args.arxiv_id.strip(), args.section.strip())
        else:
            result = cmd_list_sections(args.arxiv_id.strip())
        print_json(result)
    except Exception as e:
        print_json({
            "success": False,
            "arxiv_id": args.arxiv_id,
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()
239
sn-search-academic/scripts/arxiv_search.py
Normal file
@@ -0,0 +1,239 @@
#!/usr/bin/env python3
"""
ArXiv paper search, through the ArXiv API (which returns Atom XML).

Supports:
- searching all fields / title / abstract / author
- category filtering and sorting
- fetching paper metadata directly from a list of IDs
- boolean query combinations (AND / OR / ANDNOT)

Examples:
    python3 arxiv_search.py "attention mechanism"
    python3 arxiv_search.py "transformer" --category cs.CL --sort date
    python3 arxiv_search.py "diffusion model" --author "ho jonathan"
    python3 arxiv_search.py "ViT" --title-only
    python3 arxiv_search.py --id-list 2409.05591,2301.00001
"""
from __future__ import annotations

import sys
import xml.etree.ElementTree as ET

from search_utils import build_parser, get_client, make_item, make_result, print_json

API_URL = "https://export.arxiv.org/api/query"

# Atom XML namespaces
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def build_search_query(
    query: str,
    category: str | None = None,
    author: str | None = None,
    title_only: bool = False,
) -> str:
    """
    Build an arXiv query string.

    Field prefixes:
        all:  all fields (default)
        ti:   title only
        au:   author (wildcards supported, e.g. au:smi*)
        abs:  abstract
        cat:  category
    Boolean operators must be upper case: AND / OR / ANDNOT
    """
    # Main query field
    field = "ti" if title_only else "all"
    parts = [f"{field}:{query}"]

    if author:
        # Multiple authors are joined with OR; the "lastname firstname" form is supported
        author_terms = [f"au:{a.strip()}" for a in author.split(",") if a.strip()]
        if author_terms:
            parts.append(f"({' OR '.join(author_terms)})")

    if category:
        parts.append(f"cat:{category}")

    return " AND ".join(parts)


def fetch_by_ids(id_list: list[str], limit: int) -> list[dict]:
    """Fetch paper metadata directly from a list of IDs (no text search)."""
    params = {
        "id_list": ",".join(id_list[:limit]),
        "max_results": min(len(id_list), limit, 100),
    }
    with get_client(timeout=30, headers={"Accept": "application/xml"}) as client:
        resp = client.get(API_URL, params=params)
        resp.raise_for_status()
    return _parse_entries(ET.fromstring(resp.text), limit)


def search(
    query: str,
    limit: int,
    category: str | None = None,
    sort_by: str = "relevance",
    author: str | None = None,
    title_only: bool = False,
) -> list[dict]:
    """Run an ArXiv keyword search."""
    search_query = build_search_query(query, category, author, title_only)

    # Map the CLI sort names onto the API's sortBy values
    sort_map = {
        "relevance": "relevance",
        "date": "lastUpdatedDate",
        "submitted": "submittedDate",
    }

    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": min(limit, 100),
        "sortBy": sort_map.get(sort_by, "relevance"),
        "sortOrder": "descending",
    }

    with get_client(timeout=30, headers={"Accept": "application/xml"}) as client:
        resp = client.get(API_URL, params=params)
        resp.raise_for_status()

    return _parse_entries(ET.fromstring(resp.text), limit)


def _parse_entries(root: ET.Element, limit: int) -> list[dict]:
    """Parse paper entries from Atom XML."""
    items = []

    for entry in root.findall("atom:entry", NS)[:limit]:
        title = _text(entry, "atom:title").replace("\n", " ").strip()
        summary = _text(entry, "atom:summary").replace("\n", " ").strip()
        published = _text(entry, "atom:published")
        updated = _text(entry, "atom:updated")

        # Paper links (the abs page is preferred)
        url = ""
        pdf_url = ""
        for link in entry.findall("atom:link", NS):
            href = link.get("href", "")
            if link.get("title") == "pdf":
                pdf_url = href
            elif link.get("type") == "text/html" or "/abs/" in href:
                url = href
        if not url:
            url = _text(entry, "atom:id")

        # Extract the arxiv_id from the abs URL or the id
        arxiv_id = ""
        raw_id = _text(entry, "atom:id")
        if "/abs/" in raw_id:
            arxiv_id = raw_id.split("/abs/")[-1]
        elif raw_id.startswith("http"):
            arxiv_id = raw_id.split("/")[-1]

        # Authors
        authors = [_text(a, "atom:name") for a in entry.findall("atom:author", NS)]

        # Categories
        categories = [c.get("term", "") for c in entry.findall("atom:category", NS)]

        comment = _text(entry, "arxiv:comment")
        journal_ref = _text(entry, "arxiv:journal_ref")
        doi = _text(entry, "arxiv:doi")
        primary_category = entry.find("arxiv:primary_category", NS)
        primary_cat = primary_category.get("term", "") if primary_category is not None else ""

        # Link to the HTML version (available for newer papers)
        html_url = f"https://arxiv.org/html/{arxiv_id}" if arxiv_id else None

        items.append(make_item(
            title=title,
            url=url,
            snippet=summary,
            arxiv_id=arxiv_id if arxiv_id else None,
            authors=authors,
            published=published,
            updated=updated,
            pdf_url=pdf_url,
            html_url=html_url,
            categories=categories,
            primary_category=primary_cat if primary_cat else None,
            comment=comment if comment else None,
            journal_ref=journal_ref if journal_ref else None,
            doi=doi if doi else None,
        ))

    return items


def _text(elem: ET.Element, tag: str) -> str:
    """Safely get the text of a child element ("" when missing or empty)."""
    child = elem.find(tag, NS)
    return child.text.strip() if child is not None and child.text else ""


def main():
    parser = build_parser("Search ArXiv academic papers")
    parser.add_argument("--category", "-c", help="ArXiv category filter (e.g. cs.AI, cs.CL, math.CO)")
    parser.add_argument(
        "--sort", default="relevance",
        choices=["relevance", "date", "submitted"],
        help="Sort order (default: relevance)",
    )
    parser.add_argument(
        "--author", "-a",
        help="Filter by author (e.g. 'hinton'; separate multiple authors with commas)",
    )
    parser.add_argument(
        "--title-only", action="store_true",
        help="Search in titles only (all fields are searched by default)",
    )
    parser.add_argument(
        "--id-list",
        help="Fetch metadata directly by arXiv ID, comma-separated (e.g. 2409.05591,2301.00001). The query argument may be left empty when this is given.",
    )
    parser.prog = "arxiv_search.py"

    # Make the positional query optional so that it can be omitted with --id-list
    for action in parser._positionals._group_actions:
        if action.dest == "query":
            action.nargs = "?"
            action.default = ""
            break

    args = parser.parse_args()

    try:
        if args.id_list:
            id_list = [i.strip() for i in args.id_list.split(",") if i.strip()]
            items = fetch_by_ids(id_list, args.limit)
            query_str = f"id_list:{args.id_list}"
        else:
            if not args.query:
                parser.error("Provide search keywords, or use --id-list to query by ID")
            items = search(
                args.query,
                args.limit,
                category=args.category,
                sort_by=args.sort,
                author=args.author,
                title_only=args.title_only,
            )
            query_str = args.query

        print_json(make_result(True, query_str, "arxiv", items))
    except Exception as e:
        print_json(make_result(False, getattr(args, "query", "") or "", "arxiv", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
454
sn-search-academic/scripts/pmc_paper.py
Normal file
@@ -0,0 +1,454 @@
#!/usr/bin/env python3
"""
PMC paper full-text section reader.

Fetches PubMed Central full-text XML (JATS format) through NCBI E-utilities and supports:
- listing the paper's full section structure (including the subsection hierarchy)
- extracting the body of a section by name (case-insensitive, partial match supported)
- resolving a PMID to a PMC ID automatically

Usage:
    python3 pmc_paper.py PMC11119143                          # list sections
    python3 pmc_paper.py 11119143                             # same (the PMC prefix is added automatically)
    python3 pmc_paper.py PMC11119143 --section introduction   # read a given section
    python3 pmc_paper.py --pmid 38786024 --section method     # start from a PMID
"""
from __future__ import annotations

import argparse
import re
import sys
import xml.etree.ElementTree as ET
from typing import Any

from search_utils import get_client, print_json

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

# ── ID handling ───────────────────────────────────────────────────────────────

def normalize_pmc_id(raw: str) -> str:
    """Normalize a PMC ID: drop the 'PMC' prefix and keep only the numeric part."""
    return re.sub(r"^[Pp][Mm][Cc]", "", raw.strip())


def pmid_to_pmc(pmid: str, api_key: str | None = None) -> str | None:
    """Convert a PMID to a PMC ID (numeric form) through elink."""
    params: dict[str, Any] = {
        "dbfrom": "pubmed",
        "db": "pmc",
        "id": pmid,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key

    with get_client(timeout=20) as client:
        resp = client.get(ELINK_URL, params=params)
        resp.raise_for_status()

    data = resp.json()
    for linkset in data.get("linksets", []):
        for db in linkset.get("linksetdbs", []):
            if db.get("dbto") == "pmc" and db.get("linkname") == "pubmed_pmc":
                links = db.get("links", [])
                if links:
                    return str(links[0])
    return None


# ── XML fetching ──────────────────────────────────────────────────────────────

def fetch_pmc_xml(pmc_num: str, api_key: str | None = None) -> ET.Element:
    """Fetch the PMC full-text XML and return its root element."""
    params: dict[str, Any] = {
        "db": "pmc",
        "id": pmc_num,
        "rettype": "xml",
        "retmode": "xml",
    }
    if api_key:
        params["api_key"] = api_key

    with get_client(timeout=45) as client:
        resp = client.get(EFETCH_URL, params=params)
        resp.raise_for_status()

    root = ET.fromstring(resp.text)

    # Check that a paper was actually returned
    article = root.find(".//article")
    if article is None:
        raise ValueError(
            f"No full text found for PMC{pmc_num}. "
            "Possible reasons: the paper is not in the PMC open-access collection, or the ID is wrong."
        )
    return root


# ── JATS XML text extraction ──────────────────────────────────────────────────

# Skip the entire content of these tags (noise nodes)
_SKIP_TAGS = {"ref", "ref-list", "fn", "fn-group", "permissions", "author-notes",
              "glossary", "ack"}  # ack = Acknowledgements; keep it if needed

# Tags replaced by a placeholder. Tag names are compared after the namespace is
# stripped, so MathML arrives as "math" rather than "mml:math".
_FORMULA_TAGS = {"disp-formula", "inline-formula", "mml:math", "math", "tex-math"}


def _elem_to_text(elem: ET.Element, depth: int = 0) -> str:
    """
    Recursively convert a JATS XML element to readable text.

    Handling rules:
    - <p>: paragraph, followed by a blank line
    - <title>: skipped (section titles are handled by the caller)
    - <sec>: subsection, handled recursively (indentation marks the level)
    - <list>/<list-item>: converted to a bullet list
    - <disp-formula>/<inline-formula>: replaced with [FORMULA]
    - <fig>: image content skipped, caption kept
    - <table-wrap>: label + caption kept
    - <xref>/<ext-link>: text content taken as is
    - <bold>/<italic>/<underline>: text content taken as is
    """
    tag = elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag  # strip the namespace

    if tag in _SKIP_TAGS:
        return ""

    if tag in _FORMULA_TAGS:
        return " [FORMULA] "

    if tag == "title":
        return ""  # handled by the caller

    if tag == "p":
        text = _collect_text(elem)
        return text.strip() + "\n\n" if text.strip() else ""

    if tag in ("bold", "italic", "underline", "named-content", "styled-content",
               "ext-link", "uri", "xref", "sup", "sub", "monospace"):
        return _collect_text(elem)

    if tag == "list":
        parts = []
        for li in elem.findall("list-item"):
            item_text = "".join(_elem_to_text(c) for c in li).strip()
            if item_text:
                parts.append(f"• {item_text}")
        return "\n".join(parts) + "\n\n" if parts else ""

    if tag == "disp-quote":
        text = "".join(_elem_to_text(c) for c in elem).strip()
        return f"> {text}\n\n" if text else ""

    if tag == "fig":
        # Keep only the caption
        caption = elem.find(".//caption")
        if caption is not None:
            cap_text = "".join(_elem_to_text(c) for c in caption).strip()
            label = elem.findtext("label", "Figure")
            return f"[{label}: {cap_text}]\n\n" if cap_text else ""
        return ""

    if tag == "table-wrap":
        label = elem.findtext("label", "Table")
        caption = elem.find(".//caption")
        cap_text = ""
        if caption is not None:
            cap_text = "".join(_elem_to_text(c) for c in caption).strip()
        return f"[{label}: {cap_text}]\n\n" if cap_text else f"[{label}]\n\n"

    if tag == "sec":
        # Subsection: handled recursively, with the title indented
        sub_title_elem = elem.find("title")
        sub_title = ""
        if sub_title_elem is not None:
            sub_title = _collect_text(sub_title_elem).strip()

        parts = []
        if sub_title:
            indent = " " * depth
            parts.append(f"\n{indent}### {sub_title}\n\n")
        for child in elem:
            child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
|
||||
if child_tag == "title":
|
||||
continue
|
||||
parts.append(_elem_to_text(child, depth + 1))
|
||||
return "".join(parts)
|
||||
|
||||
# 默认:递归子节点
|
||||
return "".join(_elem_to_text(c, depth) for c in elem)
|
||||
|
||||
|
||||
def _collect_text(elem: ET.Element) -> str:
|
||||
"""收集元素的所有文本(含子节点,跳过公式)。"""
|
||||
parts = []
|
||||
if elem.text:
|
||||
parts.append(elem.text)
|
||||
for child in elem:
|
||||
child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
|
||||
if child_tag in _FORMULA_TAGS:
|
||||
parts.append("[FORMULA]")
|
||||
elif child_tag in _SKIP_TAGS:
|
||||
pass
|
||||
else:
|
||||
parts.append(_collect_text(child))
|
||||
if child.tail:
|
||||
parts.append(child.tail)
|
||||
return "".join(parts)
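The two helpers above rely on two ElementTree behaviours that are easy to miss: namespaced tags are reported as `{uri}local`, and text that follows a child element lives in that child's `tail`. The sketch below illustrates both on a hand-written fragment. The sample XML and the names `SAMPLE`, `local_name` and `collect_text` are made up for this illustration; it is not the script's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like fragment, written for this example only.
SAMPLE = (
    '<p xmlns:mml="http://www.w3.org/1998/Math/MathML">'
    "Energy is <inline-formula><mml:math><mml:mi>E</mml:mi></mml:math>"
    "</inline-formula> and it <italic>matters</italic> here.</p>"
)

FORMULA_TAGS = {"disp-formula", "inline-formula", "math", "tex-math"}


def local_name(tag: str) -> str:
    """Drop the '{namespace-uri}' part that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def collect_text(elem: ET.Element) -> str:
    """Gather text in document order, replacing formula subtrees with a placeholder."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) in FORMULA_TAGS:
            parts.append("[FORMULA]")
        else:
            parts.append(collect_text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
math_elem = root.find(".//{http://www.w3.org/1998/Math/MathML}math")
print(math_elem.tag)       # {http://www.w3.org/1998/Math/MathML}math
print(collect_text(root))  # Energy is [FORMULA] and it matters here.
```

Note that the prefix `mml:` never appears in `elem.tag`; only the expanded URI does, so tag sets are best written with local names.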


# ── Section extraction ────────────────────────────────────────────────────────

def _extract_sections_from(container: ET.Element, level: int = 1) -> list[dict[str, Any]]:
    """Recursively extract <sec> nodes; child sections are nested under the subsections key."""
    sections: list[dict[str, Any]] = []
    for sec in container.findall("sec"):
        title_elem = sec.find("title")
        title = _collect_text(title_elem).strip() if title_elem is not None else f"Section {len(sections)+1}"

        # Body text: direct children of this sec (excluding sec and title)
        text_parts = []
        for child in sec:
            child_tag = child.tag.split("}")[-1] if "}" in child.tag else child.tag
            if child_tag in ("title", "sec"):
                continue
            text_parts.append(_elem_to_text(child))

        text = "".join(text_parts).strip()

        # Recurse into subsections
        subsections = _extract_sections_from(sec, level + 1)

        sections.append({
            "name": title,
            "level": level,
            "text": text,
            "subsections": subsections,
        })
    return sections


def extract_all_sections(root: ET.Element) -> list[dict[str, Any]]:
    """
    Extract all sections from PMC JATS XML.
    Order: Abstract → Body sections (with subsections)
    """
    sections: list[dict[str, Any]] = []

    article = root.find(".//article")
    if article is None:
        return sections

    # ── Abstract ──
    abstract = article.find(".//abstract")
    if abstract is not None:
        # Structured abstract (contains sec elements)
        if abstract.findall("sec"):
            abs_parts = []
            for sec in abstract.findall("sec"):
                sec_title = sec.findtext("title", "")
                sec_text_parts = []
                for child in sec:
                    if child.tag != "title":
                        sec_text_parts.append(_elem_to_text(child))
                part = "".join(sec_text_parts).strip()
                if sec_title:
                    abs_parts.append(f"{sec_title}: {part}")
                else:
                    abs_parts.append(part)
            abs_text = "\n\n".join(abs_parts)
        else:
            abs_text = "".join(_elem_to_text(c) for c in abstract).strip()

        if abs_text:
            sections.append({"name": "Abstract", "level": 0, "text": abs_text, "subsections": []})

    # ── Body ──
    body = article.find(".//body")
    if body is not None:
        sections.extend(_extract_sections_from(body, level=1))

    return sections


# ── Section matching ──────────────────────────────────────────────────────────

def _flatten_sections(sections: list[dict], result: list | None = None) -> list[dict]:
    """Flatten nested sections (depth-first) so they are easy to search."""
    if result is None:
        result = []
    for s in sections:
        result.append(s)
        _flatten_sections(s.get("subsections", []), result)
    return result
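To make the flattening order concrete, here is a minimal sketch of the same depth-first walk on a hand-made section tree. The tree contents and the name `flatten` are assumptions for this example. The accumulator defaults to `None` rather than `[]` so that separate calls do not share one list.

```python
from __future__ import annotations


def flatten(sections: list[dict], out: list | None = None) -> list[dict]:
    """Depth-first, pre-order flattening: a parent comes before its subsections."""
    if out is None:
        out = []  # fresh list per top-level call (avoids a shared mutable default)
    for section in sections:
        out.append(section)
        flatten(section.get("subsections", []), out)
    return out


# Made-up outline, for illustration only.
tree = [
    {"name": "Methods", "subsections": [
        {"name": "Participants", "subsections": []},
        {"name": "Analysis", "subsections": [
            {"name": "Statistics", "subsections": []},
        ]},
    ]},
    {"name": "Results", "subsections": []},
]

names = [s["name"] for s in flatten(tree)]
print(names)  # ['Methods', 'Participants', 'Analysis', 'Statistics', 'Results']
```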


def match_section(sections: list[dict], query: str) -> dict | None:
    """Case-insensitive fuzzy match that ignores numeric prefixes (searches every level)."""
    q = query.lower().strip()
    flat = _flatten_sections(sections)

    def clean(name: str) -> str:
        return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()

    # Exact match
    for s in flat:
        if s["name"].lower() == q or clean(s["name"]) == q:
            return s

    # Substring / prefix match
    for s in flat:
        c = clean(s["name"])
        if c.startswith(q) or q in c:
            return s

    return None
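The two-pass lookup (exact match first, then substring) can be tried in isolation. The sketch below works on plain heading strings instead of section dicts; the headings and the function name `match` are made up for this example.

```python
from __future__ import annotations

import re


def clean(name: str) -> str:
    """Lower-case a heading and drop a leading number such as '2.' or '3 '."""
    return re.sub(r"^\d+[\.\s]+", "", name).lower().strip()


def match(names: list[str], query: str) -> str | None:
    q = query.lower().strip()
    # Pass 1: exact match on the raw or the cleaned heading.
    for name in names:
        if name.lower() == q or clean(name) == q:
            return name
    # Pass 2: substring match on the cleaned heading (this also covers prefixes).
    for name in names:
        if q in clean(name):
            return name
    return None


# Hypothetical headings, for illustration only.
headings = ["Abstract", "1. Introduction", "2. Materials and Methods", "3 Results"]

print(match(headings, "introduction"))  # 1. Introduction
print(match(headings, "method"))        # 2. Materials and Methods
print(match(headings, "discussion"))    # None
```

Running the exact pass first keeps a short query such as `results` from being captured by an earlier heading that merely contains the word.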


# ── Public interface ──────────────────────────────────────────────────────────

def _section_outline(sections: list[dict], depth: int = 0) -> list[dict]:
    """Build the section outline recursively (name and level only)."""
    outline = []
    for s in sections:
        outline.append({"name": s["name"], "level": s["level"]})
        if s.get("subsections"):
            outline.extend(_section_outline(s["subsections"], depth + 1))
    return outline


def cmd_list_sections(pmc_num: str, api_key: str | None = None) -> dict[str, Any]:
    """List the full section outline of a PMC paper."""
    root = fetch_pmc_xml(pmc_num, api_key)
    sections = extract_all_sections(root)

    # Title and PMID come from the XML
    title = root.findtext(".//article-title", "")
    pmid = root.findtext(".//article-id[@pub-id-type='pmid']", "")

    return {
        "success": True,
        "pmc_id": f"PMC{pmc_num}",
        "pmid": pmid or None,
        "title": title,
        "pmc_url": f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_num}/",
        "section_count": len(_flatten_sections(sections)),
        "sections": _section_outline(sections),
        "error": None,
    }


def cmd_read_section(pmc_num: str, section_name: str, api_key: str | None = None) -> dict[str, Any]:
    """Read the body text of the requested section (subsection text included)."""
    root = fetch_pmc_xml(pmc_num, api_key)
    sections = extract_all_sections(root)
    matched = match_section(sections, section_name)

    if matched is None:
        flat = _flatten_sections(sections)
        available = [s["name"] for s in flat]
        return {
            "success": False,
            "pmc_id": f"PMC{pmc_num}",
            "section": section_name,
            "content": None,
            "error": f"Section '{section_name}' not found. Available sections: {available}",
        }

    # Merge this section's text with the text of its subsections
    def collect_text(s: dict) -> str:
        parts = [s["text"]]
        for sub in s.get("subsections", []):
            sub_text = collect_text(sub)
            if sub_text.strip():
                parts.append(f"\n### {sub['name']}\n\n{sub_text}")
        return "\n\n".join(p for p in parts if p.strip())

    content = collect_text(matched)

    return {
        "success": True,
        "pmc_id": f"PMC{pmc_num}",
        "pmc_url": f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_num}/",
        "section": matched["name"],
        "level": matched["level"],
        "content": content,
        "char_count": len(content),
        "error": None,
    }


# ── CLI ───────────────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(
        description="PMC full-text section reader",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python3 pmc_paper.py PMC11119143                               list all sections
  python3 pmc_paper.py 11119143                                  same (prefix added automatically)
  python3 pmc_paper.py PMC11119143 --section introduction        read Introduction
  python3 pmc_paper.py PMC11119143 --section method              read Methods
  python3 pmc_paper.py --pmid 38786024                           list sections by PMID
  python3 pmc_paper.py --pmid 38786024 --section conclusion      read a section by PMID
""",
    )
    parser.add_argument(
        "pmc_id", nargs="?",
        help="PMC ID (e.g. PMC11119143 or 11119143). Use either this or --pmid.",
    )
    parser.add_argument(
        "--pmid",
        help="PubMed ID, converted to a PMC ID automatically (the paper must be in the PMC open-access collection)",
    )
    parser.add_argument(
        "--section", "-s",
        metavar="SECTION_NAME",
        help="Section to read (case-insensitive, partial match supported). Lists all sections if omitted.",
    )
    parser.add_argument(
        "--api-key",
        help="NCBI API key (optional, raises the rate limit from 3 req/s to 10 req/s)",
    )
    args = parser.parse_args()

    api_key = getattr(args, "api_key", None)

    # Defined before the try block so the error handler can always refer to it
    pmc_num = None
    try:
        # Resolve the numeric PMC ID
        if args.pmid:
            pmc_num = pmid_to_pmc(args.pmid, api_key)
            if not pmc_num:
                print_json({
                    "success": False,
                    "pmid": args.pmid,
                    "error": f"PMID {args.pmid} has no full text in PMC. The paper may not be open access.",
                })
                sys.exit(1)
        elif args.pmc_id:
            pmc_num = normalize_pmc_id(args.pmc_id)
        else:
            parser.error("Provide a PMC ID, or use --pmid to pass a PubMed ID")

        if args.section:
            result = cmd_read_section(pmc_num, args.section.strip(), api_key)
        else:
            result = cmd_list_sections(pmc_num, api_key)

        print_json(result)

    except Exception as e:
        print_json({
            "success": False,
            "pmc_id": f"PMC{pmc_num}" if pmc_num else None,
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()
165
sn-search-academic/scripts/pubmed_search.py
Normal file
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""PubMed biomedical literature search via the NCBI E-utilities API."""
from __future__ import annotations

import sys
import xml.etree.ElementTree as ET

from search_utils import build_parser, get_client, make_item, make_result, print_json

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def search(query: str, limit: int, api_key: str | None = None) -> list[dict]:
    """Run a PubMed search in two steps: esearch for PMIDs, efetch for full records with abstracts."""
    base_params: dict = {"api_key": api_key} if api_key else {}

    # Step 1: esearch returns the PMID list
    with get_client(timeout=30) as client:
        resp = client.get(ESEARCH_URL, params={
            **base_params,
            "db": "pubmed",
            "term": query,
            "retmax": min(limit, 100),
            "retmode": "json",
            "sort": "relevance",
        })
        resp.raise_for_status()
        pmids = resp.json().get("esearchresult", {}).get("idlist", [])

    if not pmids:
        return []

    # Step 2: efetch returns the full XML records (with abstracts)
    with get_client(timeout=30) as client:
        resp = client.get(EFETCH_URL, params={
            **base_params,
            "db": "pubmed",
            "id": ",".join(pmids[:limit]),
            "rettype": "xml",
            "retmode": "xml",
        })
        resp.raise_for_status()

    root = ET.fromstring(resp.text)
    items = []

    for article in root.findall(".//PubmedArticle"):
        medline = article.find("MedlineCitation")
        if medline is None:
            continue

        pmid_elem = medline.find("PMID")
        pmid = pmid_elem.text if pmid_elem is not None else ""

        article_data = medline.find("Article")
        if article_data is None:
            continue

        # Title
        title_elem = article_data.find("ArticleTitle")
        title = "".join(title_elem.itertext()) if title_elem is not None else ""

        # Abstract (structured abstracts such as BACKGROUND/METHODS/RESULTS/CONCLUSIONS are supported)
        abstract_parts = []
        abstract_elem = article_data.find("Abstract")
        if abstract_elem is not None:
            for ab in abstract_elem.findall("AbstractText"):
                label = ab.get("Label")
                text = "".join(ab.itertext()).strip()
                if label:
                    abstract_parts.append(f"{label}: {text}")
                else:
                    abstract_parts.append(text)
        abstract = " ".join(abstract_parts)

        # Authors
        authors = []
        author_list = article_data.find("AuthorList")
        if author_list is not None:
            for author in author_list.findall("Author"):
                last = author.findtext("LastName", "")
                fore = author.findtext("ForeName", "")
                name = f"{fore} {last}".strip() if fore else last
                if name:
                    authors.append(name)

        # Journal information
        journal = article_data.find("Journal")
        journal_name = ""
        pub_date = ""
        volume = ""
        issue = ""
        if journal is not None:
            journal_name = journal.findtext("Title", "") or journal.findtext("ISOAbbreviation", "")
            ji = journal.find("JournalIssue")
            if ji is not None:
                volume = ji.findtext("Volume", "")
                issue = ji.findtext("Issue", "")
                pd = ji.find("PubDate")
                if pd is not None:
                    year = pd.findtext("Year", "")
                    month = pd.findtext("Month", "")
                    day = pd.findtext("Day", "")
                    pub_date = " ".join(filter(None, [year, month, day]))

        # Pages
        pages = article_data.findtext(".//MedlinePgn", "")

        # DOI and PMC ID (taken from ArticleIdList)
        doi = None
        pmc_id = None
        for id_elem in article.findall(".//ArticleId"):
            id_type = id_elem.get("IdType", "")
            if id_type == "doi":
                doi = id_elem.text
            elif id_type == "pmc" and id_elem.text:
                # Normalize: drop the "PMC" prefix and keep only the digits
                pmc_id = id_elem.text.lstrip("PMCpmc").strip() or id_elem.text

        # Keywords (<Keyword> elements are author-supplied keywords, not MeSH headings)
        keywords = [kw.text for kw in medline.findall(".//Keyword") if kw.text]

        # Publication types
        pub_types = [pt.text for pt in article_data.findall(".//PublicationType") if pt.text]

        url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
        pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_id}/" if pmc_id else None

        items.append(make_item(
            title=title,
            url=url,
            snippet=abstract,
            authors=authors,
            pmid=pmid,
            pmc_id=f"PMC{pmc_id}" if pmc_id else None,
            pmc_url=pmc_url,
            journal=journal_name if journal_name else None,
            pub_date=pub_date if pub_date else None,
            volume=volume if volume else None,
            issue=issue if issue else None,
            pages=pages if pages else None,
            keywords=keywords if keywords else None,
            pub_types=pub_types if pub_types else None,
            doi=doi,
        ))

    return items


def main():
    parser = build_parser("Search PubMed biomedical literature")
    parser.add_argument("--api-key", help="NCBI API key (optional, raises the rate limit from 3 req/s to 10 req/s)")
    args = parser.parse_args()

    try:
        items = search(args.query, args.limit, getattr(args, "api_key", None))
        print_json(make_result(True, args.query, "pubmed", items))
    except Exception as e:
        print_json(make_result(False, args.query, "pubmed", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
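The structured-abstract handling in `search` can be checked offline. The sketch below runs the same label-joining logic on a small hand-written `<Abstract>` fragment; the XML content and the helper name `join_abstract` are assumptions for this example and do not come from a real PubMed record.

```python
import xml.etree.ElementTree as ET

# Hypothetical PubMed-style fragment, written for this example only.
SAMPLE = """
<Abstract>
  <AbstractText Label="BACKGROUND">Sleep affects <i>memory</i>.</AbstractText>
  <AbstractText Label="RESULTS">Recall improved.</AbstractText>
  <AbstractText>Unlabelled closing remark.</AbstractText>
</Abstract>
""".strip()


def join_abstract(abstract: ET.Element) -> str:
    """Join AbstractText nodes, prefixing each with its Label when one is present."""
    parts = []
    for node in abstract.findall("AbstractText"):
        # itertext() also picks up text inside inline markup such as <i>.
        text = "".join(node.itertext()).strip()
        label = node.get("Label")
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts)


print(join_abstract(ET.fromstring(SAMPLE)))
# BACKGROUND: Sleep affects memory. RESULTS: Recall improved. Unlabelled closing remark.
```

Using `itertext()` rather than `.text` matters here: `.text` would stop at the first inline child and return only `Sleep affects `.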
150
sn-search-academic/scripts/search_utils.py
Normal file
@@ -0,0 +1,150 @@
"""
Shared utilities for the search skill.

Provides standard JSON output, CLI scaffolding, an httpx helper and configuration lookup.
All search scripts import this module via sys.path.
"""
from __future__ import annotations

import argparse
import json
import os
import sys
from typing import Any

try:
    import httpx
except ImportError:
    json.dump(
        {
            "success": False,
            "error": "httpx is missing; run: python3 -m pip install -r skills/sn-search-academic/requirements.txt",
        },
        sys.stdout,
        ensure_ascii=False,
    )
    sys.stdout.write("\n")
    sys.exit(1)

# ---------------------------------------------------------------------------
# Standard output
# ---------------------------------------------------------------------------

def make_result(
    success: bool,
    query: str,
    provider: str,
    items: list[dict[str, Any]],
    error: str | None = None,
) -> dict[str, Any]:
    """Build a standardized search result."""
    return {
        "success": success,
        "query": query,
        "provider": provider,
        "items": items,
        "error": error,
    }


def make_item(
    title: str,
    url: str,
    snippet: str = "",
    **extra: Any,
) -> dict[str, Any]:
    """Build a standardized search result item; empty extra fields are dropped."""
    item: dict[str, Any] = {"title": title, "url": url, "snippet": snippet}
    for k, v in extra.items():
        if v not in (None, "", [], {}):
            item[k] = v
    return item


def print_json(data: dict[str, Any]) -> None:
    """Write the result to stdout as JSON."""
    json.dump(data, sys.stdout, ensure_ascii=False, indent=2)
    sys.stdout.write("\n")
    sys.stdout.flush()


# ---------------------------------------------------------------------------
# CLI scaffolding
# ---------------------------------------------------------------------------

def build_parser(description: str) -> argparse.ArgumentParser:
    """Create an ArgumentParser with the common arguments."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("query", help="Search keywords")
    parser.add_argument("--limit", "-n", type=int, default=10, help="Number of results to return (default 10)")
    return parser


# ---------------------------------------------------------------------------
# httpx helper
# ---------------------------------------------------------------------------

_DEFAULT_TIMEOUT = 15
_DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.36"
)


def get_client(
    timeout: int = _DEFAULT_TIMEOUT,
    headers: dict[str, str] | None = None,
    **kwargs: Any,
) -> httpx.Client:
    """Return a preconfigured httpx.Client."""
    default_headers = {
        "User-Agent": _DEFAULT_UA,
        "Accept": "application/json",
    }
    if headers:
        default_headers.update(headers)
    return httpx.Client(
        timeout=timeout,
        headers=default_headers,
        follow_redirects=True,
        **kwargs,
    )


# ---------------------------------------------------------------------------
# Configuration lookup
# ---------------------------------------------------------------------------

def get_key(env_var: str, cli_arg: str | None = None) -> str | None:
    """Read an API key: the CLI argument takes precedence over the environment variable."""
    if cli_arg:
        return cli_arg
    return os.environ.get(env_var)


# ---------------------------------------------------------------------------
# Script entry helper
# ---------------------------------------------------------------------------

def run_search(
    provider: str,
    search_fn,  # Callable[[str, int, ...], list[dict]]
    parser: argparse.ArgumentParser | None = None,
    extra_kwargs_fn=None,  # Callable[[Namespace], dict], extracts extra arguments from args
) -> None:
    """Generic script entry point: parse arguments, run the search, print JSON."""
    if parser is None:
        parser = build_parser(f"Search {provider}")
    args = parser.parse_args()

    extra = {}
    if extra_kwargs_fn:
        extra = extra_kwargs_fn(args)

    try:
        items = search_fn(args.query, args.limit, **extra)
        print_json(make_result(True, args.query, provider, items))
    except Exception as e:
        print_json(make_result(False, args.query, provider, [], str(e)))
        sys.exit(1)
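One detail of `make_item` is worth seeing in action: the membership test drops `None`, empty strings, empty lists and empty dicts, but keeps `0` and `False`, because neither compares equal to any of those values. The snippet below repeats the function body so it runs without `httpx` installed; the sample arguments are made up.

```python
from __future__ import annotations

from typing import Any


def make_item(title: str, url: str, snippet: str = "", **extra: Any) -> dict[str, Any]:
    """Same filtering rule as the helper above: skip extras that carry no information."""
    item: dict[str, Any] = {"title": title, "url": url, "snippet": snippet}
    for key, value in extra.items():
        if value not in (None, "", [], {}):
            item[key] = value
    return item


item = make_item(
    "Example title",
    "https://example.org/paper",  # placeholder URL
    citation_count=0,             # kept: 0 is a meaningful count
    is_open_access=False,         # kept: False is a meaningful flag
    doi=None,                     # dropped
    authors=[],                   # dropped
    venue="",                     # dropped
)
print(sorted(item))
# ['citation_count', 'is_open_access', 'snippet', 'title', 'url']
```

A plain truthiness check (`if value:`) would have discarded zero citation counts, so the explicit tuple is the safer choice for this kind of payload.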
238
sn-search-academic/scripts/semantic_scholar_refs.py
Normal file
@@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""Semantic Scholar citation tracing: look up a paper's references (backward) and citing papers (forward)."""
from __future__ import annotations

import argparse
import sys

from search_utils import get_client, make_item, print_json

API_BASE = "https://api.semanticscholar.org/graph/v1/paper"

# Paper-level fields (nested under citedPaper/citingPaper)
# Note: tldr tends to trigger the rate limit in nested requests, so it is not requested
PAPER_FIELDS = [
    "title", "abstract", "year", "venue", "publicationDate",
    "authors", "citationCount", "influentialCitationCount",
    "isOpenAccess", "openAccessPdf", "externalIds", "fieldsOfStudy",
]

# Edge-level fields (properties of the citation relationship itself)
EDGE_FIELDS = ["contexts", "intents"]


def resolve_paper_id(identifier: str) -> str:
    """Convert various paper identifiers into a form Semantic Scholar accepts.

    Supported:
    - Semantic Scholar paper ID (40-char hex)
    - DOI: 10.xxxx/... → DOI:10.xxxx/...
    - ArXiv ID: 2301.07041 → ARXIV:2301.07041
    - PubMed ID: PMID:12345678
    - URL: https://www.semanticscholar.org/paper/... → ID extracted from the URL
    """
    identifier = identifier.strip()

    # S2 URL
    if "semanticscholar.org/paper/" in identifier:
        # The 40-char hex ID is the last path segment
        parts = identifier.rstrip("/").split("/")
        return parts[-1]

    # DOI
    if identifier.startswith("10."):
        return f"DOI:{identifier}"
    if identifier.lower().startswith("doi:"):
        return identifier

    # ArXiv
    if identifier.lower().startswith("arxiv:"):
        return identifier.upper()
    # e.g. 2301.07041 or 2301.07041v2
    if "." in identifier and identifier.replace(".", "").replace("v", "").isdigit():
        return f"ARXIV:{identifier}"

    # PMID
    if identifier.lower().startswith("pmid:"):
        return identifier.upper()

    # Otherwise assume it is an S2 paper ID
    return identifier


def fetch_refs(
    paper_id: str,
    direction: str,
    limit: int,
    min_citations: int,
    year_min: int | None,
    year_max: int | None,
    api_key: str | None = None,
) -> dict:
    """Fetch a paper's references or citations."""
    resolved = resolve_paper_id(paper_id)
    endpoint = f"{API_BASE}/{resolved}/{direction}"

    headers: dict[str, str] = {}
    if api_key:
        headers["x-api-key"] = api_key

    # The S2 API returns at most 1000 entries per request; paging uses offset
    # S2 references/citations endpoints: paper fields take a nested prefix, edge fields are listed directly
    # Format: fields=contexts,intents,citedPaper.title,citedPaper.year,...
    paper_key_prefix = "citedPaper" if direction == "references" else "citingPaper"
    prefixed_fields = [f"{paper_key_prefix}.{f}" for f in PAPER_FIELDS]
    all_fields = ",".join(EDGE_FIELDS + prefixed_fields)

    params = {
        "fields": all_fields,
        # The citations endpoint returns newest first, so fetch many entries to reach highly cited papers
        # References are usually few (dozens), so fetching more does no harm
        "limit": 1000,
    }

    with get_client(timeout=30, headers=headers) as client:
        resp = client.get(endpoint, params=params)
        resp.raise_for_status()
        data = resp.json()

    # Fetch the source paper itself (used as context in the output)
    paper_resp = None
    with get_client(timeout=15, headers=headers) as client:
        try:
            r = client.get(f"{API_BASE}/{resolved}", params={"fields": "title,year,citationCount"})
            r.raise_for_status()
            paper_resp = r.json()
        except Exception:
            pass

    # direction=references yields {"data": [{"citedPaper": {...}, "contexts": [...], "intents": [...]}]}
    # direction=citations yields {"data": [{"citingPaper": {...}, "contexts": [...], "intents": [...]}]}
    paper_key = "citedPaper" if direction == "references" else "citingPaper"

    items = []
    for entry in data.get("data", []):
        paper = entry.get(paper_key, {})
        if not paper or not paper.get("title"):
            continue

        year = paper.get("year")
        citation_count = paper.get("citationCount") or 0

        # Filters (papers without a year pass the year filters)
        if citation_count < min_citations:
            continue
        if year_min and year and year < year_min:
            continue
        if year_max and year and year > year_max:
            continue

        authors = [a.get("name", "") for a in paper.get("authors", [])]
        external_ids = paper.get("externalIds") or {}
        doi = external_ids.get("DOI")
        arxiv_id = external_ids.get("ArXiv")
        s2_id = paper.get("paperId", "")

        url = f"https://www.semanticscholar.org/paper/{s2_id}" if s2_id else ""

        abstract = paper.get("abstract") or ""
        snippet = abstract

        open_access_pdf = None
        if paper.get("openAccessPdf"):
            open_access_pdf = paper["openAccessPdf"].get("url")

        # contexts: sentences in which the paper is cited (only meaningful for the citations direction)
        contexts = entry.get("contexts") or []
        intents = entry.get("intents") or []

        item = make_item(
            title=paper.get("title", ""),
            url=url,
            snippet=snippet,
            authors=authors,
            year=year,
            venue=paper.get("venue") or None,
            publication_date=paper.get("publicationDate"),
            citation_count=citation_count,
            influential_citation_count=paper.get("influentialCitationCount"),
            is_open_access=paper.get("isOpenAccess"),
            open_access_pdf=open_access_pdf,
            fields_of_study=paper.get("fieldsOfStudy") or None,
            doi=doi,
            arxiv_id=arxiv_id,
            paper_id=s2_id,
            citation_contexts=contexts[:3] if contexts else None,  # at most 3 contexts
            citation_intents=intents if intents else None,
        )
        items.append(item)

    # Sort by citation count and keep the top N
    items.sort(key=lambda x: x.get("citation_count", 0), reverse=True)
    items = items[:limit]

    result = {
        "success": True,
        "paper_id": resolved,
        "direction": direction,
        "provider": "semantic_scholar",
        "items": items,
        "total_available": len(data.get("data", [])),
        "returned": len(items),
        "error": None,
    }
    if paper_resp:
        result["source_paper"] = {
            "title": paper_resp.get("title"),
            "year": paper_resp.get("year"),
            "citation_count": paper_resp.get("citationCount"),
        }

    return result


def main():
    parser = argparse.ArgumentParser(
        description="Look up a paper's references (backward) or citing papers (forward)"
    )
    parser.add_argument(
        "paper_id",
        help="Paper identifier: S2 ID, DOI (e.g. 10.1234/...), ArXiv ID (e.g. 2301.07041), PMID (e.g. PMID:12345678)",
    )
    parser.add_argument(
        "direction",
        choices=["references", "citations"],
        help="references = cited works (backward), citations = citing papers (forward)",
    )
    parser.add_argument("--limit", "-n", type=int, default=20, help="Number of results to return (default 20)")
    parser.add_argument("--min-citations", type=int, default=0, help="Minimum citation count filter (default 0)")
    parser.add_argument("--year-min", type=int, default=None, help="Earliest year filter")
    parser.add_argument("--year-max", type=int, default=None, help="Latest year filter")
    parser.add_argument("--api-key", help="Semantic Scholar API key (optional)")
    args = parser.parse_args()

    try:
        result = fetch_refs(
            args.paper_id,
            args.direction,
            args.limit,
            args.min_citations,
            args.year_min,
            args.year_max,
            getattr(args, "api_key", None),
        )
        print_json(result)
    except Exception as e:
        print_json({
            "success": False,
            "paper_id": args.paper_id,
            "direction": args.direction,
            "provider": "semantic_scholar",
            "items": [],
            "error": str(e),
        })
        sys.exit(1)


if __name__ == "__main__":
    main()
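The post-processing in `fetch_refs` (filter by citation count and year, then rank) can be exercised without calling the API. The sketch below applies the same rules to made-up entries; the data and the function name `top_cited` are assumptions for this example. Note that an entry with no year passes the year filters, exactly as in the script.

```python
from __future__ import annotations


def top_cited(papers: list[dict], limit: int, min_citations: int = 0,
              year_min: int | None = None, year_max: int | None = None) -> list[dict]:
    """Filter by citations and year, then return the `limit` most cited papers."""
    kept = []
    for paper in papers:
        count = paper.get("citationCount") or 0  # a missing count is treated as 0
        year = paper.get("year")
        if count < min_citations:
            continue
        if year_min and year and year < year_min:
            continue
        if year_max and year and year > year_max:
            continue
        kept.append({**paper, "citationCount": count})
    kept.sort(key=lambda p: p["citationCount"], reverse=True)
    return kept[:limit]


# Made-up entries, for illustration only.
papers = [
    {"title": "A", "year": 2015, "citationCount": 40},
    {"title": "B", "year": 2021, "citationCount": 300},
    {"title": "C", "year": None, "citationCount": 90},
    {"title": "D", "year": 2023, "citationCount": None},
    {"title": "E", "year": 2022, "citationCount": 120},
]

result = top_cited(papers, limit=3, min_citations=50, year_min=2020)
print([p["title"] for p in result])  # ['B', 'E', 'C']
```

Because ranking happens after a single request of up to 1000 entries, the result is the top N among the fetched entries, not necessarily among all citing papers of a heavily cited work.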
104
sn-search-academic/scripts/semantic_scholar_search.py
Normal file
@@ -0,0 +1,104 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Semantic Scholar 论文搜索。通过 Semantic Scholar Graph API。"""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
|
||||
from search_utils import build_parser, get_client, make_item, make_result, print_json
|
||||
|
||||
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
|
||||
|
||||
FIELDS = ",".join([
|
||||
"title", "abstract", "tldr", "year", "venue", "publicationVenue", "publicationDate",
|
||||
"authors", "citationCount", "influentialCitationCount",
|
||||
"referenceCount", "isOpenAccess", "openAccessPdf",
|
||||
"externalIds", "fieldsOfStudy", "publicationTypes", "journal",
|
||||
])
|
||||
|
||||
|
def search(query: str, limit: int, api_key: str | None = None) -> list[dict]:
    """Run a Semantic Scholar search."""
    headers: dict[str, str] = {}
    if api_key:
        headers["x-api-key"] = api_key

    params = {
        "query": query,
        "limit": min(limit, 100),
        "fields": FIELDS,
    }

    with get_client(timeout=30, headers=headers) as client:
        resp = client.get(API_URL, params=params)
        resp.raise_for_status()
        data = resp.json()

    items = []
    for paper in (data.get("data") or [])[:limit]:
        # Guard against a null "authors" field as well as a missing one.
        authors = [a.get("name", "") for a in paper.get("authors") or []]

        open_access_pdf = None
        if paper.get("openAccessPdf"):
            open_access_pdf = paper["openAccessPdf"].get("url")

        external_ids = paper.get("externalIds") or {}
        doi = external_ids.get("DOI")
        arxiv_id = external_ids.get("ArXiv")

        paper_id = paper.get("paperId", "")
        url = f"https://www.semanticscholar.org/paper/{paper_id}"

        # Snippet: prefer the abstract; fall back to the TLDR when it is missing.
        abstract = paper.get("abstract") or ""
        tldr = (paper.get("tldr") or {}).get("text")
        snippet = abstract or tldr or ""

        # Journal/conference: venue (free-form string) plus publicationVenue (structured).
        venue = paper.get("venue") or (paper.get("journal") or {}).get("name")
        pub_venue = paper.get("publicationVenue") or {}
        # Keep only the populated keys; an empty dict collapses to None.
        publication_venue = {
            k: pub_venue[k]
            for k in ("id", "name", "type", "url")
            if pub_venue.get(k)
        } or None

        items.append(make_item(
            title=paper.get("title") or "",
            url=url,
            snippet=snippet,
            tldr=tldr,
            authors=authors,
            year=paper.get("year"),
            venue=venue or None,
            publication_venue=publication_venue,
            publication_date=paper.get("publicationDate"),
            citation_count=paper.get("citationCount"),
            influential_citation_count=paper.get("influentialCitationCount"),
            reference_count=paper.get("referenceCount"),
            is_open_access=paper.get("isOpenAccess"),
            open_access_pdf=open_access_pdf,
            fields_of_study=paper.get("fieldsOfStudy") or None,
            publication_types=paper.get("publicationTypes") or None,
            doi=doi,
            arxiv_id=arxiv_id,
            paper_id=paper_id,
        ))

    return items


def main():
    parser = build_parser("Search Semantic Scholar for academic papers")
    parser.add_argument("--api-key", help="Semantic Scholar API key (optional; raises the rate limit)")
    args = parser.parse_args()

    try:
        items = search(args.query, args.limit, args.api_key)
        print_json(make_result(True, args.query, "semantic_scholar", items))
    except Exception as e:
        # Report the failure in the same JSON envelope, then exit non-zero.
        print_json(make_result(False, args.query, "semantic_scholar", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()

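The per-paper normalization in `search()` above can be tried without any network access. The following is a minimal sketch, not the script itself: `normalize_paper` is a hypothetical helper name, the sample record is made up, and it returns a plain dict instead of calling `make_item` (whose behavior is not shown in this file).

```python
def normalize_paper(paper: dict) -> dict:
    """Hypothetical helper mirroring the fallback logic of search()."""
    abstract = paper.get("abstract") or ""
    tldr = (paper.get("tldr") or {}).get("text")

    # Keep only the populated keys; an empty dict collapses to None.
    pub_venue = paper.get("publicationVenue") or {}
    publication_venue = {
        k: pub_venue[k]
        for k in ("id", "name", "type", "url")
        if pub_venue.get(k)
    } or None

    external_ids = paper.get("externalIds") or {}
    return {
        "snippet": abstract or tldr or "",  # abstract first, TLDR as fallback
        "tldr": tldr,
        "publication_venue": publication_venue,
        "doi": external_ids.get("DOI"),
        "arxiv_id": external_ids.get("ArXiv"),
    }


# Made-up record: no abstract, a TLDR, and a partially filled venue.
sample = {
    "abstract": None,
    "tldr": {"text": "A short machine-generated summary."},
    "publicationVenue": {"id": "v1", "name": "Example Venue", "url": None},
    "externalIds": {"ArXiv": "0000.00000"},
}
result = normalize_paper(sample)
print(result["snippet"])
print(result["publication_venue"])
```

Every lookup uses the `or {}` / `or ""` pattern because a field may be present with a null value as well as absent, so a `.get()` default alone would not cover both cases.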
79
sn-search-academic/scripts/wikipedia_search.py
Normal file
@@ -0,0 +1,79 @@
#!/usr/bin/env python3
"""Wikipedia search via the MediaWiki API."""
from __future__ import annotations

import sys

from search_utils import build_parser, get_client, make_item, make_result, print_json


def _api_url(lang: str) -> str:
    """Return the MediaWiki API endpoint for the given language edition."""
    return f"https://{lang}.wikipedia.org/w/api.php"


def search(query: str, limit: int, lang: str = "en") -> list[dict]:
    """Run a Wikipedia search."""
    from urllib.parse import quote

    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": min(limit, 50),
        "srprop": "snippet|timestamp|wordcount|size|sectiontitle|sectionsnippet",
        "format": "json",
        "utf8": 1,
    }

    with get_client() as client:
        resp = client.get(_api_url(lang), params=params)
        resp.raise_for_status()
        data = resp.json()

    items = []
    for result in data.get("query", {}).get("search", [])[:limit]:
        title = result.get("title", "")
        # The snippet is an HTML fragment; strip the tags naively.
        snippet = _strip_html(result.get("snippet", ""))
        page_id = result.get("pageid", "")
        # Percent-encode the title so characters such as "?" or "#" do not break the URL.
        slug = quote(title.replace(" ", "_"), safe="/:")
        url = f"https://{lang}.wikipedia.org/wiki/{slug}"

        section_title = result.get("sectiontitle", "")
        section_snippet = _strip_html(result.get("sectionsnippet", ""))

        items.append(make_item(
            title=title,
            url=url,
            snippet=snippet,
            word_count=result.get("wordcount"),
            size=result.get("size"),
            timestamp=result.get("timestamp"),
            page_id=page_id,
            section_title=section_title or None,
            section_snippet=section_snippet or None,
        ))

    return items


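def _strip_html(fragment: str) -> str:
    """Remove tags, decode HTML entities, and collapse whitespace."""
    import html
    import re

    text = re.sub(r"<[^>]+>", "", fragment)
    # Decode entities such as &quot; or &amp; only after the tags are gone.
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

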
def main():
    parser = build_parser("Search Wikipedia articles")
    parser.add_argument("--lang", "-l", default="en",
                        help="Language edition (default: en; e.g. zh, ja, de)")
    args = parser.parse_args()

    try:
        items = search(args.query, args.limit, args.lang)
        print_json(make_result(True, args.query, "wikipedia", items))
    except Exception as e:
        # Report the failure in the same JSON envelope, then exit non-zero.
        print_json(make_result(False, args.query, "wikipedia", [], str(e)))
        sys.exit(1)


if __name__ == "__main__":
    main()
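The two text-handling steps in `wikipedia_search.py` (cleaning the HTML snippet and building the article URL) can be checked offline. This is a sketch under stated assumptions: `strip_html` and `article_url` are hypothetical standalone names, and the entity decoding and percent-encoding shown here are additions for illustration that the script itself may not perform.

```python
import html
import re
from urllib.parse import quote


def strip_html(fragment: str) -> str:
    """Drop tags, decode entities, and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", "", fragment)
    text = html.unescape(text)  # assumption: entities should be decoded
    return re.sub(r"\s+", " ", text).strip()


def article_url(title: str, lang: str = "en") -> str:
    """Build an article URL, percent-encoding unsafe characters in the title."""
    slug = quote(title.replace(" ", "_"), safe="/:")
    return f"https://{lang}.wikipedia.org/wiki/{slug}"


cleaned = strip_html('<span class="searchmatch">Python</span> &amp;  friends')
plain_url = article_url("Alan Turing")
tricky_url = article_url("What Is to Be Done?")
print(cleaned)
print(plain_url)
print(tricky_url)
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the text is not mistaken for markup and stripped.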