# Deep Paper Reading Methodology

Extracting full technical detail from arXiv papers when PDF parsing tools are unavailable.

## Extraction Fallback Chain

1. **pymupdf (fitz)**: best quality, but requires `uv pip install pymupdf`
2. **pdftotext**: `apt install poppler-utils`, then `pdftotext file.pdf -`
3. **Raw regex on PDF bytes**: `re.findall(rb'\(([^)]+)\)', data)` extracts parenthesized text strings; works for metadata/abstract but garbles the body
4. **HTML version**: `curl https://arxiv.org/html/{id}v{N}`; cleanest structured extraction and the **preferred method**
5. **Abstract page**: `curl https://arxiv.org/abs/{id}` + regex on the abstract element

## HTML Extraction Patterns (preferred)

The arXiv HTML version (`/html/{id}v{N}`) has structured `<section>` elements with IDs assigned in document order. A typical layout looks like this (section titles vary by paper):

```
S1 = Introduction
S2 = Problem definition
S3 = Key findings
S4 = Method
S5 = Experiments (S5.SS1, S5.SS2 = subsections)
S6 = Related work
S7 = Conclusion
Appendices: indexed by letter (A, B, C...) in ltx_tocentry
```

### Section extraction pattern:

```python
import re

# `html` holds the page source fetched from /html/{id}v{N}
# Drop <script> and <style> blocks before matching
html = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', html, flags=re.DOTALL)
# Capture everything from the opening S4 tag up to the start of S5 (or end of page)
m = re.search(r'<section[^>]*id="S4"[^>]*>(.*?)(?=<section[^>]*id="S5"|$)', html, re.DOTALL)
if m:  # the section may be absent or numbered differently
    text = re.sub(r'<[^>]+>', ' ', m.group(1)).strip()
    text = re.sub(r'\s+', ' ', text)
```

### Table extraction:

```python
tables = re.findall(r'<table[^>]*>(.*?)</table>', html, re.DOTALL)
for t in tables:
    # Replace each tag with a cell separator, then collapse whitespace
    text = re.sub(r'<[^>]+>', ' | ', t).strip()
    text = re.sub(r'\s+', ' ', text)
```

### Appendix content (avoid TOC duplicates):

Appendix headings appear twice: once in the TOC and once as actual content. Use `positions[-1]` (the last occurrence) to reach the real content, and search by keyword rather than by section ID for appendices.

### Targeted keyword search (when section IDs fail):

```python
searches = ['keyword1', 'keyword2']
for s in searches:
    positions = [m.start() for m in re.finditer(re.escape(s), html, re.IGNORECASE)]
    if positions:
        pos = positions[-1]  # last occurrence = actual content, not TOC
        chunk = html[max(0, pos - 200):pos + 500]
        text = re.sub(r'<[^>]+>', ' ', chunk)
```

## Structured Methodology Extraction Template

When the user asks to "learn the method" or do a deep read, extract:

1. **Architecture**: pipeline stages, model sizes, data flow
2. **Training data**: how it is constructed, sources, sizes, prompts used
3. **Negative mining**: strategy for hard negatives, filtering
4. **Loss functions**: exact objective, temperature, why this choice
5. **Training hyperparams**: LR, batch size, epochs, hardware
6. **Inference flow**: online vs offline steps, latency, throughput
7. **Key ablations**: what matters and by how much
8. **Code/models released**: check GitHub repo structure

## GitHub Repo Inspection Pattern

```bash
# Check if repo exists and get stats
curl -sL "https://api.github.com/repos/{owner}/{repo}"

# List top-level structure
curl -sL "https://api.github.com/repos/{owner}/{repo}/contents"

# Check subdirectories
for d in src scripts; do
  curl -sL "https://api.github.com/repos/{owner}/{repo}/contents/$d"
done

# Read README
curl -sL "https://raw.githubusercontent.com/{owner}/{repo}/main/README.md"
```

## Semantic Scholar for Citation Context

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:{id}?fields=citationCount,influentialCitationCount"
```
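
## Combined Extraction Helper (sketch)

The section-ID and keyword patterns from the HTML extraction section can be wrapped into small reusable functions. The following is a minimal sketch, not a definitive implementation: the function names (`strip_tags`, `extract_section`, `last_keyword_context`) and the `sample` string are hypothetical, and it assumes that top-level section IDs contain no dot (so `S4.SS1` is treated as part of `S4`).

```python
import re
from typing import Optional


def strip_tags(fragment: str) -> str:
    """Replace every tag with a space, then collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', fragment)
    return re.sub(r'\s+', ' ', text).strip()


def extract_section(html: str, section_id: str) -> Optional[str]:
    """Plain text of <section id=section_id>, up to the next top-level section."""
    html = re.sub(r'<(script|style)[^>]*>.*?</\1>', '', html, flags=re.DOTALL)
    start = re.search(r'<section[^>]*id="%s"[^>]*>' % re.escape(section_id), html)
    if start is None:
        return None
    # Assumption: subsection IDs contain a dot (S4.SS1), top-level IDs do not.
    nxt = re.search(r'<section[^>]*id="[^".]+"', html[start.end():])
    end = start.end() + nxt.start() if nxt else len(html)
    return strip_tags(html[start.end():end])


def last_keyword_context(html: str, keyword: str,
                         before: int = 200, after: int = 500) -> Optional[str]:
    """Text around the LAST occurrence of keyword (skips the TOC copy)."""
    positions = [m.start() for m in re.finditer(re.escape(keyword), html, re.IGNORECASE)]
    if not positions:
        return None
    pos = positions[-1]
    return strip_tags(html[max(0, pos - before):pos + after])


# Hypothetical miniature page, for illustration only
sample = (
    '<nav><a href="#A1">Appendix A Prompts</a></nav>'
    '<section id="S4"><h2>Method</h2><p>We train a   dual encoder.</p>'
    '<section id="S4.SS1"><p>Loss details.</p></section></section>'
    '<section id="S5"><h2>Experiments</h2></section>'
    '<section id="A1"><h2>Appendix A Prompts</h2><p>Prompt text here.</p></section>'
)
print(extract_section(sample, "S4"))
print(last_keyword_context(sample, "Appendix A", before=0))
```

Returning `None` instead of raising lets the caller fall through to the keyword search when a section ID is missing, which mirrors the fallback order described in this document.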