# Deep Paper Reading Methodology

Extracting full technical detail from arXiv papers when PDF parsing tools are unavailable.

## Extraction Fallback Chain

1. **pymupdf (fitz)**: best quality, but needs `uv pip install pymupdf`
2. **pdftotext**: `apt install poppler-utils`, then `pdftotext file.pdf -`
3. **Raw regex on PDF bytes**: `re.findall(rb'\(([^)]+)\)', data)` pulls parenthesized string literals out of the raw file; works for metadata/abstract but garbles the body, since most content streams are compressed
4. **HTML version**: `curl https://arxiv.org/html/{id}v{N}` gives the cleanest structured extraction; **preferred method** when the paper has an HTML rendering
5. **Abstract page**: `curl https://arxiv.org/abs/{id}` + regex on `<blockquote class="abstract">`

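
The first two steps of the chain can be sketched as a list of extractor functions tried in order. This is a minimal sketch, not a definitive implementation: `first_success`, `extract_with_pymupdf`, and `extract_with_pdftotext` are hypothetical helper names, and the order of the list is up to the caller.

```python
import shutil
import subprocess


def extract_with_pymupdf(path):
    import fitz  # raises ImportError when pymupdf is not installed
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdftotext(path):
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def first_success(path, extractors):
    """Return (extractor_name, text) from the first extractor that yields text."""
    failures = []
    for extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # any failure just moves us down the chain
            failures.append(f"{extract.__name__}: {exc}")
            continue
        if text and text.strip():
            return extract.__name__, text
        failures.append(f"{extract.__name__}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(failures))
```

Usage would look like `first_success("paper.pdf", [extract_with_pymupdf, extract_with_pdftotext])`; the web-based steps can be appended to the same list as functions that take the arXiv ID instead of a path.
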
## HTML Extraction Patterns (preferred)

The arXiv HTML version (`/html/{id}v{N}`) wraps each section in a `<section>` element with a positional ID (`S1`, `S2`, ...). The titles behind those IDs vary by paper; a typical layout:

```
S1 = Introduction
S2 = Problem definition
S3 = Key findings
S4 = Method
S5 = Experiments (S5.SS1, S5.SS2 = subsections)
S6 = Related work
S7 = Conclusion
Appendices: indexed by letter (A, B, C...) in ltx_tocentry
```

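
Because the ID-to-title mapping differs per paper, it helps to list it before extracting anything. The sketch below assumes each top-level section opens with a heading tag (`<h1>` to `<h6>`); `list_sections` is a hypothetical helper name.

```python
import re


def list_sections(html):
    """Return (section_id, heading_text) pairs for top-level sections S1, S2, ..."""
    found = []
    # Only bare S<number> IDs match, so subsections such as S5.SS1 are skipped.
    for m in re.finditer(r'<section[^>]*\bid="(S\d+)"[^>]*>', html):
        # Take the first heading that follows the opening tag as the title.
        h = re.search(r'<h\d[^>]*>(.*?)</h\d>', html[m.end():], re.DOTALL)
        title = re.sub(r'<[^>]+>', ' ', h.group(1)) if h else ''
        found.append((m.group(1), re.sub(r'\s+', ' ', title).strip()))
    return found
```

Printing the result once per paper shows which `S<n>` holds the method section before any extraction pattern is hard-coded.
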
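### Section extraction pattern:

```python
import re

# `html` holds the page source fetched from /html/{id}v{N}
# Drop <script> and <style> blocks so their contents do not leak into the text
html = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', html, flags=re.DOTALL)
# Capture everything from the opening of S4 up to the start of S5 (or end of input)
m = re.search(r'<section[^>]*id="S4"[^>]*>(.*?)(?=<section[^>]*id="S5"|$)', html, re.DOTALL)
# Guard against a missing section: m is None when the ID is not found
text = re.sub(r'<[^>]+>', ' ', m.group(1)).strip() if m else ''
text = re.sub(r'\s+', ' ', text)
```
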
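
The regex above needs the ID of the following section as a stop marker. As an alternative (a swapped-in technique, not the pattern above), the standard-library `html.parser` can track `<section>` nesting so that only the target ID is required. This is a sketch under the assumption that sections are properly nested; `SectionText` and `section_text` are hypothetical names.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text inside one <section id=...>, including its subsections."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0   # > 0 while inside the target section
        self.skip = 0    # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        if tag == "section":
            if self.depth:
                self.depth += 1          # nested subsection
            elif dict(attrs).get("id") == self.section_id:
                self.depth = 1           # entering the target section

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        if tag == "section" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.parts.append(data)


def section_text(html, section_id):
    parser = SectionText(section_id)
    parser.feed(html)
    parser.close()
    # Join fragments and collapse whitespace, as the regex version does
    return " ".join(" ".join(parser.parts).split())
```

`section_text(html, "S4")` returns an empty string when the ID is absent, so the caller can fall through to keyword search.
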
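### Table extraction:

```python
# Non-greedy match per <table>; assumes tables are not nested
tables = re.findall(r'<table[^>]*>(.*?)</table>', html, re.DOTALL)
table_texts = []
for t in tables:
    # Every tag becomes a ' | ' separator, so cell boundaries stay visible
    text = re.sub(r'<[^>]+>', ' | ', t).strip()
    text = re.sub(r'\s+', ' ', text)
    table_texts.append(text)
```
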
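
Replacing every tag with a separator flattens a table into one line and repeats the separator wherever tags are nested. When row structure matters, a row-aware variant can be sketched as follows, assuming plain `<tr>`, `<th>`, and `<td>` markup without nested tables; `table_rows` is a hypothetical helper name.

```python
import re


def table_rows(table_html):
    """Split one table's inner HTML into rows of cleaned cell strings."""
    rows = []
    for tr in re.findall(r'<tr\b[^>]*>(.*?)</tr>', table_html, re.DOTALL):
        cells = re.findall(r'<t[dh]\b[^>]*>(.*?)</t[dh]>', tr, re.DOTALL)
        # Strip inner tags from each cell, then collapse whitespace
        rows.append(
            [re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', c)).strip() for c in cells]
        )
    return rows
```

Each element of `tables` from the pattern above can be passed to `table_rows`, and the rows joined with `' | '` if a Markdown-style rendering is wanted.
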
### Appendix content (avoid TOC duplicates):

Appendix headings appear twice: once in the TOC and once in the body. Take the last occurrence (`positions[-1]` in the keyword search below) to land on the real content. For appendices, search by keyword rather than by section ID.

### Targeted keyword search (when section IDs fail):

```python
import re

searches = ['keyword1', 'keyword2']
for s in searches:
    positions = [m.start() for m in re.finditer(re.escape(s), html, re.IGNORECASE)]
    if positions:
        pos = positions[-1]  # last occurrence = actual content, not TOC
        chunk = html[max(0, pos - 200):pos + 500]  # 200 chars before, 500 after
        text = re.sub(r'<[^>]+>', ' ', chunk)
        print(f'{s}: {text}')
```

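
A fixed character window can begin or end in the middle of a tag, which leaves attribute fragments in the output. The sketch below wraps the same last-occurrence idea in a function and trims such fragments; `last_hit_snippet` and its `before`/`after` parameters are hypothetical names chosen for illustration.

```python
import re


def last_hit_snippet(html, keyword, before=200, after=500):
    """Return cleaned text around the last occurrence of keyword, or None."""
    hits = [m.start() for m in re.finditer(re.escape(keyword), html, re.IGNORECASE)]
    if not hits:
        return None
    pos = hits[-1]  # last occurrence, to skip the TOC entry
    chunk = html[max(0, pos - before):pos + after]
    chunk = re.sub(r'^[^<>]*>', '', chunk)  # drop a tag cut off at the window start
    chunk = re.sub(r'<[^<>]*$', '', chunk)  # drop a tag cut off at the window end
    text = re.sub(r'<[^>]+>', ' ', chunk)
    return re.sub(r'\s+', ' ', text).strip()
```

Returning `None` for a missing keyword lets the caller distinguish "not found" from an empty snippet.
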
## Structured Methodology Extraction Template

When the user asks to "learn the method" or do a deep read, extract:

1. **Architecture**: pipeline stages, model sizes, data flow
2. **Training data**: how it's constructed, sources, sizes, prompts used
3. **Negative mining**: strategy for hard negatives, filtering
4. **Loss functions**: exact objective, temperature, why this choice
5. **Training hyperparameters**: LR, batch size, epochs, hardware
6. **Inference flow**: online vs offline steps, latency, throughput
7. **Key ablations**: what matters and by how much
8. **Code/models released**: check GitHub repo structure

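
One way to keep a deep read honest is to hold the eight items in a small record and check which are still empty. This is only an illustrative sketch; the `MethodNotes` class and its field names are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MethodNotes:
    architecture: Optional[str] = None
    training_data: Optional[str] = None
    negative_mining: Optional[str] = None
    loss_functions: Optional[str] = None
    training_hyperparams: Optional[str] = None
    inference_flow: Optional[str] = None
    key_ablations: Optional[str] = None
    code_release: Optional[str] = None

    def missing(self):
        """Names of template items that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

Calling `missing()` after a first pass shows which sections of the paper still need a targeted search.
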
## GitHub Repo Inspection Pattern

```bash
# Check if repo exists and get stats (stars, default branch, last push)
curl -sL "https://api.github.com/repos/{owner}/{repo}"

# List top-level structure
curl -sL "https://api.github.com/repos/{owner}/{repo}/contents"

# Check subdirectories (adjust the names to what the top-level listing shows)
for d in src scripts; do
  curl -sL "https://api.github.com/repos/{owner}/{repo}/contents/$d"
done

# Read README (assumes the default branch is main; some repos use master)
curl -sL "https://raw.githubusercontent.com/{owner}/{repo}/main/README.md"
```

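
The README URL above hard-codes `main`, while the repository metadata returned by the first call includes a `default_branch` field. A Python sketch that reads the branch from the metadata instead of guessing is shown below; `fetch_json`, `summarize_repo`, and `readme_url` are hypothetical helper names, and the `User-Agent` value is an arbitrary placeholder.

```python
import json
import urllib.request


def fetch_json(url):
    """GET a GitHub API URL and decode the JSON body (unauthenticated)."""
    req = urllib.request.Request(
        url,
        headers={"Accept": "application/vnd.github+json", "User-Agent": "paper-reader"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def summarize_repo(meta):
    """Pick the few fields of the repo metadata that matter for a paper read."""
    return {
        "stars": meta.get("stargazers_count"),
        "default_branch": meta.get("default_branch", "main"),
        "pushed_at": meta.get("pushed_at"),
    }


def readme_url(owner, repo, branch):
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/README.md"
```

A typical flow would be `meta = fetch_json("https://api.github.com/repos/{owner}/{repo}")` with the placeholders filled in, then `readme_url(owner, repo, summarize_repo(meta)["default_branch"])`.
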
## Semantic Scholar for Citation Context

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:{id}?fields=citationCount,influentialCitationCount"
```

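
The response is a JSON object carrying the requested fields. A small sketch for building the URL and reading the two counts follows; `s2_url` and `citation_summary` are hypothetical helper names, and the influential share is a derived figure computed here, not something the API returns.

```python
import json
from urllib.parse import quote


def s2_url(arxiv_id, fields=("citationCount", "influentialCitationCount")):
    """Build the Graph API URL for one arXiv paper."""
    base = "https://api.semanticscholar.org/graph/v1/paper/arXiv:"
    return base + quote(arxiv_id) + "?fields=" + ",".join(fields)


def citation_summary(body):
    """Return (total, influential, influential share) from a response body."""
    data = json.loads(body)
    total = data.get("citationCount") or 0
    influential = data.get("influentialCitationCount") or 0
    share = influential / total if total else 0.0
    return total, influential, share
```

The body string can come from the `curl` call above or from any HTTP client; keeping the parsing separate makes it easy to check against a saved response.
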