# WeChat Article Extraction Techniques

## Problem
WeChat articles (`mp.weixin.qq.com`) trigger CAPTCHA verification when accessed from server IPs. Both `curl` and headless Playwright hit this wall.

## What DOESN'T work
- `curl` directly → returns the verification page (even with a realistic User-Agent)
- Playwright headless with default settings → "环境异常" ("abnormal environment") CAPTCHA
- Playwright with mobile UA + `--disable-blink-features=AutomationControlled` → still CAPTCHA
- Accessing the `#comment` anchor → loads article content but NOT comments
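
Since most of the failures above return a verification page rather than an error, it can help to check the response body before treating it as an article. The following is a minimal sketch: `looks_blocked` and `BLOCK_MARKERS` are hypothetical names, and the only marker used is the "环境异常" string quoted above, so other verification pages may need additional markers.

```python
from typing import Iterable

# Assumption: the CAPTCHA page contains this string, as reported above.
# Extend the tuple with any other text your blocked responses contain.
BLOCK_MARKERS = ("环境异常",)


def looks_blocked(html: str, markers: Iterable[str] = BLOCK_MARKERS) -> bool:
    """Return True if the HTML looks like a verification page, not an article."""
    return any(marker in html for marker in markers)
```

A caller would run this on every fetched page and fall through to the next technique when it returns `True`.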

## What DOES work

### 1. QQ Mirror (best option for content)
```
https://so.html5.qq.com/page/real/search_news?docid=<DOCID>
```
- Search for the article title on QQ search to find the `docid`
- Renders the full article text without verification
- **Does NOT include comments**
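
Once a `docid` is known, the mirror URL can be assembled programmatically. This sketch only builds the URL from the pattern shown above; `qq_mirror_url` is a hypothetical helper name, and finding the `docid` itself is still a manual or search-driven step.

```python
from urllib.parse import urlencode

# Base URL taken from the pattern above.
QQ_MIRROR_BASE = "https://so.html5.qq.com/page/real/search_news"


def qq_mirror_url(docid: str) -> str:
    """Build the QQ mirror URL for a docid found via QQ search."""
    if not docid:
        raise ValueError("docid must be non-empty")
    # urlencode percent-escapes the value in case the docid has special characters.
    return f"{QQ_MIRROR_BASE}?{urlencode({'docid': docid})}"
```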

### 2. Playwright + `#comment` anchor (partial)
```
await page.goto("https://mp.weixin.qq.com/s/<HASH>#comment")
```
- Sometimes loads the article body text (server-side rendered)
- Still no comments: those require the WeChat JS runtime + login
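
For context, the `page.goto` call above could sit inside a complete script along these lines. This is a sketch under assumptions, not a verified recipe: the helper names are hypothetical, and `#js_content` is an assumed selector for the article body container that should be checked against a real page.

```python
def comment_anchor_url(article_hash: str) -> str:
    """Return the article URL with the #comment anchor appended."""
    return f"https://mp.weixin.qq.com/s/{article_hash}#comment"


async def fetch_article_body(article_hash: str, timeout_ms: int = 15000):
    """Return the article body text, or None if the body container is absent."""
    # Imported lazily so the URL helper works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(comment_anchor_url(article_hash), timeout=timeout_ms)
            # Assumption: the article body lives in an element with id "js_content".
            body = await page.query_selector("#js_content")
            return (await body.inner_text()).strip() if body else None
        finally:
            await browser.close()
```

It would be run with `asyncio.run(fetch_article_body("<HASH>"))`. Because this route only sometimes returns the body, a `None` result should be treated as a signal to try another source.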

### 3. OG metadata extraction
Even on the verification page, meta tags are available:
```python
# `page` is a Playwright Page (async API); each value is None if the tag is missing.
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
og_desc = await page.evaluate('document.querySelector(\'meta[property="og:description"]\')?.content')
```
The raw HTML also exposes `msg_title` and `msg_desc`.
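
The same fields can be pulled from raw HTML without a browser, for example from a `curl` response. The sketch below uses only the standard library. It assumes `msg_title` and `msg_desc` appear as quoted assignments in an inline script (for example `var msg_title = '...'`), which is a guess about the page format, and `extract_metadata` is a hypothetical name.

```python
import re
from html.parser import HTMLParser


class _OGParser(HTMLParser):
    """Collect content values from <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, attr["content"])


# Assumed format: msg_title = '...' or msg_desc = "..." inside a script.
_MSG_VAR = re.compile(r"""\b(msg_title|msg_desc)\s*=\s*(['"])(.*?)\2""")


def extract_metadata(html: str) -> dict:
    """Return og:* meta values plus msg_title/msg_desc when present."""
    parser = _OGParser()
    parser.feed(html)
    meta = dict(parser.og)
    for name, _quote, value in _MSG_VAR.findall(html):
        meta.setdefault(name, value)
    return meta
```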

### 4. mmx search for indirect sources
```bash
mmx search query '"exact article title" site:csdn.net OR site:zhihu.com'
```
Many WeChat articles get cross-posted to CSDN, Zhihu (知乎), Toutiao (今日头条), etc.
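
When checking many articles, the query string can be generated rather than typed by hand. This sketch only builds the text passed to the search command in the shape shown above. `crosspost_query` is a hypothetical helper, and the default site list simply mirrors the example.

```python
def crosspost_query(title: str, sites=("csdn.net", "zhihu.com")) -> str:
    """Build an exact-title query restricted to the given sites."""
    # Double quotes inside the title would break the exact-match phrase.
    exact = '"' + title.replace('"', " ").strip() + '"'
    site_filter = " OR ".join(f"site:{site}" for site in sites)
    return f"{exact} {site_filter}".strip()
```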

## Comments
WeChat article comments are **never accessible without login**. They require:
- WeChat JS runtime (not available in a headless browser)
- Authenticated WeChat session
- Comments API calls with specific token/session parameters

**Workaround**: Search for user discussions on other platforms, such as GitHub issues, Zhihu (知乎), Xiaohongshu (小红书), Jike (即刻), and Bilibili (B站), using `mmx search`.

## Example extraction flow
```
1. Try mmx search for article title → find QQ mirror or cross-post
2. If found: Playwright fetch from QQ mirror → get full text
3. If not found: Playwright + #comment → get article body (no comments)
4. For comments: mmx search for "article title 评价 OR 反馈 OR 体验" (reviews OR feedback OR experience)
5. For community data: GitHub API for related repos (stars, forks, issues)
```
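
The content-fetching part of the flow above (steps 1 to 3) amounts to trying sources in order until one returns text. A minimal sketch of that fallback logic follows; `first_successful` is a hypothetical helper, and each strategy is assumed to be a zero-argument callable that returns text or `None`.

```python
from typing import Callable, Optional, Sequence, Tuple

Strategy = Tuple[str, Callable[[], Optional[str]]]


def first_successful(
    strategies: Sequence[Strategy],
) -> Tuple[Optional[str], Optional[str]]:
    """Run strategies in order; return (name, text) for the first non-empty result."""
    for name, fetch in strategies:
        try:
            text = fetch()
        except Exception:
            # A failing source (network error, CAPTCHA) means "try the next one".
            continue
        if text and text.strip():
            return name, text
    return None, None
```

Callers would pass something like `[("qq_mirror", fetch_mirror), ("comment_anchor", fetch_anchor)]`, where both fetchers are placeholders for the techniques described earlier.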