WeChat Article Extraction Techniques
Problem
WeChat articles (mp.weixin.qq.com) trigger CAPTCHA verification when accessed from server IPs. Both curl and headless Playwright hit this wall.
What DOESN'T work
- curl directly → returns the verification page (even with a realistic User-Agent)
- Playwright headless with default settings → "环境异常" ("abnormal environment") CAPTCHA
- Playwright with mobile UA + --disable-blink-features=AutomationControlled → still CAPTCHA
- Accessing the #comment anchor → loads article content but NOT comments
What DOES work
1. QQ Mirror (best option for content)
https://so.html5.qq.com/page/real/search_news?docid=<DOCID>
- Search for the article title on QQ search to find the docid
- Renders full article text without verification
- Does NOT include comments
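As a minimal sketch of the URL pattern above, the helper below fills in the docid. The function name `qq_mirror_url` and the constant name are illustrative, not part of any library, and the docid still has to be found manually through QQ search.

```python
from urllib.parse import urlencode

# Base URL taken from the pattern shown above.
QQ_MIRROR_BASE = "https://so.html5.qq.com/page/real/search_news"


def qq_mirror_url(docid: str) -> str:
    """Build the QQ mirror URL for a docid found via QQ search."""
    if not docid:
        raise ValueError("docid must be a non-empty string")
    # urlencode escapes any characters that are unsafe in a query string.
    return f"{QQ_MIRROR_BASE}?{urlencode({'docid': docid})}"
```

The resulting URL can then be passed to page.goto in Playwright like any other page.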
2. Playwright + #comment anchor (partial)
await page.goto("https://mp.weixin.qq.com/s/<HASH>#comment")
- Sometimes loads the article body text (server-side rendered)
- Still no comments — those require WeChat JS runtime + login
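A hedged sketch of this approach follows. It assumes the article body sits in an element with id `js_content` and that the verification page contains the "环境异常" text mentioned earlier; both are assumptions to check against a real page. The helper names are hypothetical.

```python
from typing import Optional

# Text shown on the verification page (assumed to appear in its HTML).
CAPTCHA_MARKER = "环境异常"


def looks_like_verification_page(html: str) -> bool:
    """Return True if the HTML appears to be the CAPTCHA page."""
    return CAPTCHA_MARKER in html


async def fetch_article_body(url: str) -> Optional[str]:
    """Load url + '#comment' and return the body text, or None on failure."""
    # Imported lazily so the pure helper above works without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url + "#comment", wait_until="domcontentloaded")
            html = await page.content()
            if looks_like_verification_page(html):
                return None  # hit the CAPTCHA wall
            # "#js_content" is an assumed selector for the article body.
            node = await page.query_selector("#js_content")
            return await node.inner_text() if node else None
        finally:
            await browser.close()
```

Run it with `asyncio.run(fetch_article_body("https://mp.weixin.qq.com/s/<HASH>"))`. Since this route only works sometimes, a `None` result should be treated as a signal to fall back to another method.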
3. OG metadata extraction
Even on the verification page, meta tags are available:
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
og_desc = await page.evaluate('document.querySelector(\'meta[property="og:description"]\')?.content')
Also available in HTML: msg_title, msg_desc
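If the raw HTML is already in hand (for example from `page.content()`), the same Open Graph tags can be read without a browser. This is a small sketch using only the standard library; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class OGMetaParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, content)


def extract_og(html: str) -> dict:
    """Return a mapping such as {'og:title': ..., 'og:description': ...}."""
    parser = OGMetaParser()
    parser.feed(html)
    return parser.og
```

Missing tags are simply absent from the returned dict, so callers can use `.get("og:title")` and handle `None`.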
4. mmx search for indirect sources
mmx search query '"exact article title" site:csdn.net OR site:zhihu.com'
Many WeChat articles get cross-posted to CSDN, Zhihu (知乎), Toutiao (今日头条), etc.
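The query above can be assembled programmatically. The sketch below only builds strings in the `mmx search query '<query>'` form shown here; it does not assume any other mmx flags, and the function names are hypothetical.

```python
import shlex


def cross_post_query(title: str, sites=("csdn.net", "zhihu.com")) -> str:
    """Build an exact-title query restricted to likely cross-post sites."""
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    return f'"{title}" {site_clause}'


def mmx_command(query: str) -> str:
    """Return a shell-safe command line for the query."""
    # shlex.join quotes the query so the shell passes it as one argument.
    return shlex.join(["mmx", "search", "query", query])
```

Titles that themselves contain double quotes would need extra handling before being wrapped in an exact-match phrase.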
Comments
WeChat article comments are never accessible without login. They require:
- WeChat JS runtime (not available in headless browser)
- Authenticated WeChat session
- Comments API calls with specific token/session parameters
Workaround: Search for user discussions on other platforms (GitHub issues, Zhihu, Xiaohongshu, Jike, Bilibili) using mmx search.
Example extraction flow
1. Try mmx search for article title → find QQ mirror or cross-post
2. If found: Playwright fetch from QQ mirror → get full text
3. If not found: Playwright + #comment → get article body (no comments)
4. For comments: mmx search for "article title 评价 OR 反馈 OR 体验" (the Chinese terms mean "review", "feedback", "experience")
5. For community data: GitHub API for related repos (stars, forks, issues)
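The content-fetching part of this flow (steps 1 to 3) can be sketched as a fallback chain. The search and fetch steps are passed in as callables, so nothing here assumes how mmx or Playwright is invoked; all names are illustrative.

```python
from typing import Callable, Optional

Finder = Callable[[str], Optional[str]]   # title -> mirror/cross-post URL
Fetcher = Callable[[str], Optional[str]]  # URL -> article text


def extract_article(
    title: str,
    article_url: str,
    find_mirror: Finder,
    fetch_mirror: Fetcher,
    fetch_direct: Fetcher,
) -> dict:
    """Try a mirror or cross-post first, then the #comment anchor."""
    # Steps 1-2: look for a mirror and fetch the full text from it.
    mirror_url = find_mirror(title)
    if mirror_url:
        text = fetch_mirror(mirror_url)
        if text:
            return {"source": "mirror", "url": mirror_url, "text": text}
    # Step 3: fall back to the original URL with the #comment anchor.
    text = fetch_direct(article_url + "#comment")
    if text:
        return {"source": "direct", "url": article_url, "text": text}
    return {"source": None, "url": article_url, "text": None}
```

Injecting the callables keeps the flow testable with stubs and lets each step be swapped out when one route stops working.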