# WeChat Article Extraction Techniques

## Problem

WeChat articles (mp.weixin.qq.com) trigger CAPTCHA verification when accessed from server IPs. Both `curl` and headless Playwright hit this wall.

## What DOESN'T work

- `curl` directly → returns the verification page (even with a realistic User-Agent)
- Playwright headless with default settings → "环境异常" ("abnormal environment") CAPTCHA
- Playwright with mobile UA + `--disable-blink-features=AutomationControlled` → still CAPTCHA
- Accessing the `#comment` anchor → loads article content but NOT comments

## What DOES work

### 1. QQ Mirror (best option for content)

```
https://so.html5.qq.com/page/real/search_news?docid=
```

- Search for the article title on QQ search to find the docid
- Renders the full article text without verification
- **Does NOT include comments**

### 2. Playwright + `#comment` anchor (partial)

```
await page.goto("https://mp.weixin.qq.com/s/#comment")
```

- Sometimes loads the article body text (server-side rendered)
- Still no comments: those require the WeChat JS runtime + login

### 3. OG metadata extraction

Even on the verification page, meta tags are available:

```python
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
og_desc = await page.evaluate('document.querySelector(\'meta[property="og:description"]\')?.content')
```

Also available in HTML: `msg_title`, `msg_desc`

### 4. mmx search for indirect sources

```bash
mmx search query '"exact article title" site:csdn.net OR site:zhihu.com'
```

Many WeChat articles get cross-posted to CSDN, 知乎 (Zhihu), 今日头条 (Toutiao), etc.

## Comments

WeChat article comments are **never accessible without login**. They require:

- WeChat JS runtime (not available in a headless browser)
- Authenticated WeChat session
- Comments API calls with specific token/session parameters

**Workaround**: Search for user discussions on other platforms (GitHub issues, 知乎 (Zhihu), 小红书 (Xiaohongshu), 即刻 (Jike), B站 (Bilibili)) using `mmx search`.

## Example extraction flow

```
1. Try mmx search for article title → find QQ mirror or cross-post
2. If found: Playwright fetch from QQ mirror → get full text
3. If not found: Playwright + #comment → get article body (no comments)
4. For comments: mmx search for "article title 评价 OR 反馈 OR 体验" (reviews, feedback, experience)
5. For community data: GitHub API for related repos (stars, forks, issues)
```
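
The fallback order in the flow above, together with the OG metadata trick, could be sketched roughly as follows. This is a minimal, stdlib-only illustration rather than a tested scraper: the helper names (`extract_og`, `is_verification_page`, `first_usable`) are hypothetical, the fetchers are placeholder callables you would back with Playwright or a mirror request, and detecting the verification page by the "环境异常" text is an assumption based on the message noted earlier.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: the verification page contains this text ("abnormal environment").
CAPTCHA_MARKER = "环境异常"


class _OgCollector(HTMLParser):
    """Collects <meta property="og:*" content="..."> pairs from raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.og: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, content)


def extract_og(html: str) -> Dict[str, str]:
    """Return all og:* meta properties found in the HTML."""
    collector = _OgCollector()
    collector.feed(html)
    return collector.og


def is_verification_page(html: str) -> bool:
    """Heuristic check for the CAPTCHA page (marker text is an assumption)."""
    return CAPTCHA_MARKER in html


def first_usable(
    fetchers: List[Tuple[str, Callable[[], Optional[str]]]]
) -> Optional[Tuple[str, str]]:
    """Try each (name, fetcher) in order; return the first non-CAPTCHA result."""
    for name, fetch in fetchers:
        try:
            html = fetch()
        except Exception:
            continue  # treat a failed fetch as "try the next source"
        if html and not is_verification_page(html):
            return name, html
    return None
```

In use, the list passed to `first_usable` would follow the order of the flow: a mirror or cross-post fetcher first, then the Playwright `#comment` fetcher. Even when every source returns the verification page, `extract_og` can still be run on that page to recover the title and description.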