agent-skills/content-ops/content-ops-agent/references/wechat-article-extraction.md
Hermes Agent ccc63d1e70 first commit
2026-05-10 13:52:46 +08:00

WeChat Article Extraction Techniques

Problem

WeChat articles (mp.weixin.qq.com) trigger CAPTCHA verification when accessed from server IPs. Both curl and headless Playwright hit this wall.

What DOESN'T work

  • curl directly → returns the verification page (even with a realistic User-Agent)
  • Playwright headless with default settings → "环境异常" ("abnormal environment") CAPTCHA
  • Playwright with mobile UA + --disable-blink-features=AutomationControlled → still CAPTCHA
  • Accessing the #comment anchor → never loads comments (the article body sometimes loads; see option 2 under "What DOES work")
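
Before parsing a response, it helps to check whether the verification page came back instead of the article. The following is a minimal sketch, assuming the "环境异常" message quoted above appears in the returned HTML; the function name and marker tuple are illustrative, and other markers may be needed in practice.

```python
# Markers suggesting WeChat served its verification page instead of the article.
# Only "环境异常" comes from the notes above; add others as you observe them.
VERIFICATION_MARKERS = ("环境异常",)


def looks_like_verification_page(html: str) -> bool:
    """Return True if any known verification marker appears in the HTML."""
    return any(marker in html for marker in VERIFICATION_MARKERS)
```

A caller could use this to decide when to fall back to one of the approaches below.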

What DOES work

1. QQ Mirror (best option for content)

https://so.html5.qq.com/page/real/search_news?docid=<DOCID>
  • Search for the article title on QQ search to find the docid
  • Renders full article text without verification
  • Does NOT include comments
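
Once a docid is known, the mirror URL can be built mechanically. This sketch only fills in the URL template shown above; the helper name is hypothetical, and finding the docid itself still has to happen through QQ search.

```python
from urllib.parse import urlencode

# Base of the mirror URL given above; the docid is supplied as a query parameter.
QQ_MIRROR_BASE = "https://so.html5.qq.com/page/real/search_news"


def qq_mirror_url(docid: str) -> str:
    """Build the QQ mirror URL for a docid, URL-encoding the value."""
    return f"{QQ_MIRROR_BASE}?{urlencode({'docid': docid})}"
```

For example, `qq_mirror_url("12345")` (a made-up docid) yields the template URL with `docid=12345`.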

2. Playwright + #comment anchor (partial)

await page.goto("https://mp.weixin.qq.com/s/<HASH>#comment")
  • Sometimes loads the article body text (it is server-side rendered)
  • Still no comments; those require the WeChat JS runtime and a logged-in session
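
A fuller version of the call above might look like the following sketch. It assumes the article body sits in an element with id `js_content`; that selector is not stated in these notes and should be checked against a real page. The function names are illustrative.

```python
from typing import Optional


def comment_anchor_url(article_hash: str) -> str:
    """Return the article URL with the #comment anchor appended."""
    return f"https://mp.weixin.qq.com/s/{article_hash}#comment"


async def fetch_article_body(article_hash: str) -> Optional[str]:
    """Try to read the article body; return None if it did not render."""
    # Imported lazily so the URL helper works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(comment_anchor_url(article_hash),
                            wait_until="domcontentloaded")
            # Assumption: the body lives in #js_content.
            body = await page.query_selector("#js_content")
            return (await body.inner_text()).strip() if body else None
        finally:
            await browser.close()
```

Run it with `asyncio.run(fetch_article_body("<HASH>"))`. A `None` result means the body did not load and another approach is needed.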

3. OG metadata extraction

Even on the verification page, meta tags are available:

# Read the Open Graph tags via in-page JS; a missing tag yields None
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
og_desc = await page.evaluate('document.querySelector(\'meta[property="og:description"]\')?.content')

The page HTML also contains msg_title and msg_desc values.
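
If the HTML has already been fetched some other way, the same tags can be read without a browser. The sketch below uses only the standard library and assumes the og: tags are present in the raw markup rather than injected by script; the class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import Dict


class OGParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""

    def __init__(self) -> None:
        super().__init__()
        self.og: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        # Keep only Open Graph properties that carry a content value.
        if prop.startswith("og:") and content is not None:
            self.og[prop] = content


def extract_og(html: str) -> Dict[str, str]:
    """Return all og:* meta properties found in the HTML."""
    parser = OGParser()
    parser.feed(html)
    return parser.og
```

`extract_og(html).get("og:title")` then plays the role of `og_title` above.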

4. mmx search for indirect sources

mmx search query '"exact article title" site:csdn.net OR site:zhihu.com'

Many WeChat articles get cross-posted to CSDN, 知乎 (Zhihu), 今日头条 (Toutiao), etc.
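
The query string in the command above can be assembled from a title and a list of sites. This is a small sketch of the string-building step only; it does not invoke mmx, and the helper name is hypothetical.

```python
from typing import Sequence


def crosspost_query(title: str, sites: Sequence[str]) -> str:
    """Build an exact-title query restricted to the given sites."""
    # Drop embedded double quotes so the exact-match phrase stays well formed.
    phrase = title.replace('"', "")
    site_filter = " OR ".join(f"site:{site}" for site in sites)
    return f'"{phrase}" {site_filter}'
```

Calling `crosspost_query("exact article title", ["csdn.net", "zhihu.com"])` reproduces the query shown above.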

Comments

WeChat article comments are never accessible without login. They require:

  • WeChat JS runtime (not available in a headless browser)
  • Authenticated WeChat session
  • Comments API calls with specific token/session parameters

Workaround: Search for user discussions on other platforms (GitHub issues, 知乎 (Zhihu), 小红书 (Xiaohongshu), 即刻 (Jike), B站 (Bilibili)) using mmx search.

Example extraction flow

1. Run mmx search for the article title → look for a QQ mirror or cross-post
2. If found: fetch that page with Playwright → get full text
3. If not found: Playwright + #comment → get article body when it loads (no comments)
4. For reader reactions (the comments themselves are unavailable): mmx search for "article title 评价 OR 反馈 OR 体验" (review, feedback, experience)
5. For community data: GitHub API for related repos (stars, forks, issues)
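
Steps 1 to 3 of the flow can be sketched as a small function with the search and fetch steps passed in as callables. This is one possible arrangement under stated assumptions, not the agent's actual implementation: `find_mirror` and `fetch_url` are hypothetical stand-ins for the mmx search and Playwright fetch.

```python
from typing import Callable, Dict, Optional


def extract_article(
    title: str,
    article_hash: str,
    find_mirror: Callable[[str], Optional[str]],
    fetch_url: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Prefer a mirror or cross-post; otherwise fall back to the #comment URL."""
    mirror_url = find_mirror(title)  # step 1: search for a mirror or cross-post
    if mirror_url:
        text = fetch_url(mirror_url)  # step 2: fetch full text from it
        if text:
            return {"source": mirror_url, "text": text}
    # Step 3: fall back to the original article with the #comment anchor.
    fallback = f"https://mp.weixin.qq.com/s/{article_hash}#comment"
    return {"source": fallback, "text": fetch_url(fallback)}
```

Passing the two steps as parameters keeps the flow testable with stubs and leaves the choice of search tool and browser to the caller.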