Deep Paper Reading Methodology
Extracting full technical detail from arXiv papers when PDF parsing tools are unavailable.
Extraction Fallback Chain
- pymupdf (fitz): best quality, but needs uv pip install pymupdf
- pdftotext: apt install poppler-utils, then pdftotext file.pdf -
- Raw regex on PDF bytes: re.findall(rb'\(([^)]+)\)', data) extracts text streams; works for metadata/abstract but garbles body
- HTML version: curl https://arxiv.org/html/{id}v{N}; cleanest structured extraction; preferred method
- Abstract page: curl https://arxiv.org/abs/{id} + regex on <blockquote class="abstract">
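The chain above can be sketched as a loop that tries each extractor in order and keeps the first non-empty result. This is a minimal sketch, not a definitive implementation: `first_successful`, `extract_pdf_literals`, and `extract_abstract` are hypothetical helper names, and the abstract pattern assumes the blockquote's class attribute starts with `abstract` (it may carry extra classes).

```python
import re
from typing import Callable, Optional

def extract_pdf_literals(data: bytes) -> str:
    """Raw-regex fallback: join parenthesized string literals found in PDF bytes."""
    parts = re.findall(rb'\(([^)]+)\)', data)
    return ' '.join(p.decode('latin-1') for p in parts)

def extract_abstract(abs_html: str) -> str:
    """Pull the abstract text out of an arXiv /abs/ page (assumed markup)."""
    m = re.search(r'<blockquote[^>]*class="abstract[^"]*"[^>]*>(.*?)</blockquote>',
                  abs_html, re.DOTALL)
    if not m:
        return ''
    text = re.sub(r'<[^>]+>', ' ', m.group(1))
    return re.sub(r'\s+', ' ', text).strip()

def first_successful(extractors: list[tuple[str, Callable[[], str]]]) -> Optional[tuple[str, str]]:
    """Run extractors in order; return (name, text) for the first non-empty result."""
    for name, fn in extractors:
        try:
            text = fn()
        except Exception:
            continue  # tool missing or parse failure: fall through to the next stage
        if text and text.strip():
            return name, text
    return None
```

Each stage is passed as a zero-argument callable, so a missing tool (for example an ImportError raised inside the pymupdf stage) simply moves the chain on to the next entry.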
HTML Extraction Patterns (preferred)
The arXiv HTML version (/html/{id}v{N}) has structured <section> elements with IDs numbered in document order. Section titles vary by paper; a typical layout:
S1 = Introduction
S2 = Problem definition
S3 = Key findings
S4 = Method
S5 = Experiments (S5.SS1, S5.SS2 = subsections)
S6 = Related work
S7 = Conclusion
Appendices: indexed by letter (A, B, C...) in ltx_tocentry
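Because the mapping from ID to title differs between papers, it can help to list the top-level sections before extracting one. The sketch below is hedged: `list_sections` is a hypothetical helper, and it assumes each top-level `<section id="S…">` tag is immediately followed by its heading element.

```python
import re

def list_sections(html: str) -> list[tuple[str, str]]:
    """Return (section_id, title) pairs for top-level sections S1, S2, ... in order."""
    pattern = re.compile(
        r'<section[^>]*\bid="(S\d+)"[^>]*>\s*<h\d[^>]*>(.*?)</h\d>',
        re.DOTALL,
    )
    out = []
    for m in pattern.finditer(html):
        # Strip inline tags (e.g. the numbering span) from the heading
        title = re.sub(r'<[^>]+>', ' ', m.group(2))
        out.append((m.group(1), re.sub(r'\s+', ' ', title).strip()))
    return out
```

Subsection IDs such as S5.SS1 are skipped because the pattern requires the closing quote right after the digits.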
Section extraction pattern:
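import re
# html holds the page source fetched from /html/{id}v{N}
# Drop script/style blocks so their contents do not leak into the text
html = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', html, flags=re.DOTALL)
# Capture S4 up to the start of S5 (or the end of the document)
m = re.search(r'<section[^>]*id="S4"[^>]*>(.*?)(?=<section[^>]*id="S5"|$)', html, re.DOTALL)
# Guard against a missing section instead of raising AttributeError
text = re.sub(r'<[^>]+>', ' ', m.group(1)).strip() if m else ''
text = re.sub(r'\s+', ' ', text)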
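The regex pattern needs to know the ID of the following section. As an alternative sketch, the standard-library `html.parser` can track `<section>` nesting so that only the target ID is required. `SectionText` and `section_text` are hypothetical names, and this is one possible approach rather than the method the pattern above uses.

```python
from html.parser import HTMLParser

class SectionText(HTMLParser):
    """Collect text inside the <section> whose id equals target_id, subsections included."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # > 0 while inside the target section
        self.skip = 0    # > 0 while inside <script> or <style>
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1
        elif tag == 'section':
            if self.depth:
                self.depth += 1          # nested subsection
            elif dict(attrs).get('id') == self.target_id:
                self.depth = 1           # entering the target section

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1
        elif tag == 'section' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.parts.append(data)

def section_text(html: str, target_id: str) -> str:
    """Return whitespace-normalized text of one section, or '' if the id is absent."""
    parser = SectionText(target_id)
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.parts).split())
```

For example, `section_text(html, 'S4')` returns the method section including any S4.SS1-style subsections, and an empty string when the paper has no S4.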
Table extraction:
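import re
tables = re.findall(r'<table[^>]*>(.*?)</table>', html, re.DOTALL)
table_texts = []  # one flattened string per table
for t in tables:
    # Replace every tag with a column separator, then collapse whitespace
    text = re.sub(r'<[^>]+>', ' | ', t).strip()
    table_texts.append(re.sub(r'\s+', ' ', text))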
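Flattening every tag into a separator loses the row boundaries. Where rows and columns matter (for example results tables), a sketch along these lines keeps them as a list of lists. `table_rows` is a hypothetical helper, and it assumes simple tables without nested `<table>` elements.

```python
import re

def table_rows(table_html: str) -> list[list[str]]:
    """Split one table's inner HTML into rows of cleaned cell strings."""
    rows = []
    for tr in re.findall(r'<tr\b[^>]*>(.*?)</tr>', table_html, re.DOTALL):
        # \b keeps <th> from matching <thead>
        cells = re.findall(r'<t[dh]\b[^>]*>(.*?)</t[dh]>', tr, re.DOTALL)
        clean = [re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', c)).strip() for c in cells]
        if clean:
            rows.append(clean)
    return rows
```

The first returned row is usually the header, so `rows[0]` can serve as column names when comparing numbers across ablations.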
Appendix content (avoid TOC duplicates):
Appendix headings appear twice: once in the TOC and once as actual content. Use positions[-1] (the last occurrence) to reach the real content, and search by keyword rather than by section ID for appendices.
Targeted keyword search (when section IDs fail):
searches = ['keyword1', 'keyword2']
for s in searches:
    positions = [m.start() for m in re.finditer(re.escape(s), html, re.IGNORECASE)]
    if positions:
        pos = positions[-1]  # last occurrence = actual content, not TOC
        chunk = html[max(0, pos-200):pos+500]
        text = re.sub(r'<[^>]+>', ' ', chunk)
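The same search can be wrapped in a function that also trims tags cut in half at the window edges, which the fixed-width slice tends to produce. This is a sketch under assumptions: `keyword_context` is a hypothetical name, and the 200/500 window sizes are simply carried over from the snippet above.

```python
import re
from typing import Optional

def keyword_context(html: str, keyword: str, before: int = 200, after: int = 500) -> Optional[str]:
    """Return cleaned text around the last occurrence of keyword, or None if absent."""
    positions = [m.start() for m in re.finditer(re.escape(keyword), html, re.IGNORECASE)]
    if not positions:
        return None
    pos = positions[-1]  # last occurrence = actual content, not TOC
    chunk = html[max(0, pos - before):pos + after]
    chunk = re.sub(r'^[^<]*>', ' ', chunk)   # drop a tag fragment at the start of the window
    chunk = re.sub(r'<[^>]*$', ' ', chunk)   # drop a tag fragment at the end of the window
    text = re.sub(r'<[^>]+>', ' ', chunk)
    return re.sub(r'\s+', ' ', text).strip()
```

Returning None for a missing keyword lets the caller distinguish "not found" from "found but empty".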
Structured Methodology Extraction Template
When the user asks to "learn the method" or do a deep read, extract:
- Architecture — pipeline stages, model sizes, data flow
- Training data — how it's constructed, sources, sizes, prompts used
- Negative mining — strategy for hard negatives, filtering
- Loss functions — exact objective, temperature, why this choice
- Training hyperparams — LR, batch size, epochs, hardware
- Inference flow — online vs offline steps, latency, throughput
- Key ablations — what matters and by how much
- Code/models released — check GitHub repo structure
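One way to keep track of which items have been filled in during a deep read is a small record type mirroring the checklist. This is only an illustrative sketch; `MethodNotes` and its field names are hypothetical and simply restate the bullets above.

```python
from dataclasses import dataclass, fields

@dataclass
class MethodNotes:
    """One free-text slot per checklist item."""
    architecture: str = ''
    training_data: str = ''
    negative_mining: str = ''
    loss_functions: str = ''
    training_hyperparams: str = ''
    inference_flow: str = ''
    key_ablations: str = ''
    code_released: str = ''

    def missing(self) -> list[str]:
        """Names of checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

Calling `missing()` after a first pass shows which sections of the paper still need a targeted search.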
GitHub Repo Inspection Pattern
# Check if repo exists and get stats
curl -sL "https://api.github.com/repos/{owner}/{repo}"
# List top-level structure
curl -sL "https://api.github.com/repos/{owner}/{repo}/contents"
# Check subdirectories
for d in src scripts; do
  curl -sL "https://api.github.com/repos/{owner}/{repo}/contents/$d"
done
# Read README (assumes the default branch is main; adjust the branch name if this returns 404)
curl -sL "https://raw.githubusercontent.com/{owner}/{repo}/main/README.md"
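The contents endpoint returns a JSON array of entries with `name` and `type` fields for a directory, and a JSON object with a `message` field on errors such as a missing repo. A hedged sketch for building the same URLs and summarizing a response in Python follows; `repo_urls` and `summarize_contents` are hypothetical helpers, and the default subdirectory names are copied from the loop above.

```python
import json

API = 'https://api.github.com/repos/{owner}/{repo}'

def repo_urls(owner: str, repo: str, subdirs=('src', 'scripts')) -> list[str]:
    """URLs for repo stats, top-level contents, and each subdirectory listing."""
    base = API.format(owner=owner, repo=repo)
    return [base, base + '/contents'] + [f'{base}/contents/{d}' for d in subdirs]

def summarize_contents(payload: str) -> dict[str, list[str]]:
    """Split a contents-API response body into directory and file names."""
    entries = json.loads(payload)
    if not isinstance(entries, list):
        # An object instead of an array means an error or a single-file response
        return {'dirs': [], 'files': []}
    return {
        'dirs': sorted(e['name'] for e in entries if e.get('type') == 'dir'),
        'files': sorted(e['name'] for e in entries if e.get('type') == 'file'),
    }
```

The response body from any of the curl calls above can be passed to `summarize_contents` as a string to get a quick view of the repo layout.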
Semantic Scholar for Citation Context
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:{id}?fields=citationCount,influentialCitationCount"
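The response is a JSON object keyed by the requested field names. A small sketch for building the request URL and reading the two counts is shown below; `s2_url` and `citation_summary` are hypothetical helpers, and any payload passed in is whatever the curl call above returned.

```python
import json
from typing import Optional
from urllib.parse import quote

def s2_url(arxiv_id: str, fields=('citationCount', 'influentialCitationCount')) -> str:
    """Build the Graph API URL for one arXiv paper."""
    return (
        'https://api.semanticscholar.org/graph/v1/paper/'
        f'arXiv:{quote(arxiv_id)}?fields={",".join(fields)}'
    )

def citation_summary(payload: str) -> Optional[tuple[int, int]]:
    """Return (citationCount, influentialCitationCount), or None if either is missing."""
    data = json.loads(payload)
    total = data.get('citationCount')
    influential = data.get('influentialCitationCount')
    if total is None or influential is None:
        return None  # error response or fields not returned
    return total, influential
```

Returning None rather than zeros keeps an API error from being mistaken for an uncited paper.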