first commit

2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions
--- a/content-ops/content-ops-agent/references/wechat-article-extraction.md
+++ b/content-ops/content-ops-agent/references/wechat-article-extraction.md
@@ -0,0 +1,58 @@
+# WeChat Article Extraction Techniques
+
+## Problem
+WeChat articles (`mp.weixin.qq.com`) trigger CAPTCHA verification when requested from server IPs. Both `curl` and headless Playwright are blocked by it.
+
+## What DOESN'T work
+- `curl` directly → returns the verification page (even with a realistic User-Agent)
+- Headless Playwright with default settings → "环境异常" ("abnormal environment") CAPTCHA
+- Playwright with a mobile UA + `--disable-blink-features=AutomationControlled` → still CAPTCHA
+- Accessing the `#comment` anchor to read comments → may load the article content, but NOT the comments
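+
+Because the failures above all end on the same verification page, it helps to detect that page before trying to parse a response. A minimal sketch, assuming the "环境异常" text is a reliable marker (the helper name and the marker tuple are hypothetical, and other wordings may exist):
+
+```python
+# Hypothetical helper: only the marker reported in the list above is included.
+VERIFICATION_MARKERS = ("环境异常",)
+
+
+def looks_like_verification_page(html: str) -> bool:
+    """Return True if the HTML contains a known verification-page marker."""
+    return any(marker in html for marker in VERIFICATION_MARKERS)
+```
+
+A caller can use this check to decide when to fall back to one of the approaches in the next section.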
+
+## What DOES work
+
+### 1. QQ Mirror (best option for content)
+```
+https://so.html5.qq.com/page/real/search_news?docid=<DOCID>
+```
+- Search for the article title on QQ search to find the docid
+- Renders full article text without verification
+- **Does NOT include comments**
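+
+Once a docid is known, the mirror URL can be built programmatically. A small sketch using only the URL pattern shown above; the function name is hypothetical, and finding the docid itself is still a manual search step:
+
+```python
+from urllib.parse import urlencode
+
+# Base URL taken from the pattern above.
+QQ_MIRROR_BASE = "https://so.html5.qq.com/page/real/search_news"
+
+
+def build_qq_mirror_url(docid: str) -> str:
+    """Build the QQ mirror URL for a docid found via QQ search."""
+    if not docid:
+        raise ValueError("docid must be non-empty")
+    # urlencode percent-escapes any characters that are unsafe in a query string.
+    return f"{QQ_MIRROR_BASE}?{urlencode({'docid': docid})}"
+```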
+
+### 2. Playwright + `#comment` anchor (partial)
+```
+await page.goto("https://mp.weixin.qq.com/s/<HASH>#comment")
+```
+- Sometimes loads the article body text (server-side rendered)
+- Still no comments; those require the WeChat JS runtime and a logged-in session
+
+### 3. OG metadata extraction
+Even the verification page includes Open Graph meta tags:
+```python
+# `page` is an existing Playwright async Page; each call evaluates to None when the tag is missing.
+og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
+og_desc = await page.evaluate('document.querySelector(\'meta[property="og:description"]\')?.content')
+```
+The page HTML also exposes `msg_title` and `msg_desc`.
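+
+The same tags can be read from raw HTML without a browser, for example from a saved response body. A minimal sketch using only the standard library; the class and function names are hypothetical, and it handles only `<meta property="og:...">` tags, not `msg_title` or `msg_desc`:
+
+```python
+from html.parser import HTMLParser
+
+
+class OGMetaParser(HTMLParser):
+    """Collect <meta property="og:*" content="..."> tags into a dict."""
+
+    def __init__(self):
+        super().__init__()
+        self.og = {}
+
+    def handle_starttag(self, tag, attrs):
+        if tag != "meta":
+            return
+        attr_map = dict(attrs)
+        prop = attr_map.get("property") or ""
+        content = attr_map.get("content")
+        if prop.startswith("og:") and content is not None:
+            # Keep the first occurrence if a property is repeated.
+            self.og.setdefault(prop, content)
+
+
+def extract_og_metadata(html: str) -> dict:
+    """Return all Open Graph properties found in the given HTML."""
+    parser = OGMetaParser()
+    parser.feed(html)
+    return parser.og
+```
+
+Looking up `"og:title"` and `"og:description"` in the returned dict with `.get()` gives the same two fields as the Playwright snippet.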
+
+### 4. mmx search for indirect sources
+```bash
+mmx search query '"exact article title" site:csdn.net OR site:zhihu.com'
+```
+Many WeChat articles get cross-posted to CSDN, Zhihu (知乎), Toutiao (今日头条), etc.
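+
+When many articles need checking, the query string can be generated instead of typed by hand. A sketch that only assembles the string in the form shown above; the function name and default site list are assumptions, and it does not invoke `mmx` itself:
+
+```python
+def build_cross_post_query(title: str, sites=("csdn.net", "zhihu.com")) -> str:
+    """Build an exact-title query restricted to likely cross-post sites."""
+    # Drop double quotes from the title so they cannot break the exact-match phrase.
+    clean_title = title.replace('"', "").strip()
+    site_filter = " OR ".join(f"site:{site}" for site in sites)
+    return f'"{clean_title}" {site_filter}'
+```
+
+The returned string is meant to be passed as the single argument to `mmx search query`.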
+
+## Comments
+WeChat article comments are **never accessible without login**. They require:
+- WeChat JS runtime (not available in a headless browser)
+- Authenticated WeChat session
+- Comments API calls with specific token/session parameters
+
+**Workaround**: Search for user discussions on other platforms (GitHub issues, Zhihu (知乎), Xiaohongshu (小红书), Jike (即刻), Bilibili (B站)) using `mmx search`.
+
+## Example extraction flow
+```
+1. Try mmx search for article title → find QQ mirror or cross-post
+2. If found: Playwright fetch from QQ mirror → get full text
+3. If not found: Playwright + #comment → get article body (no comments)
+4. For comments: mmx search for "article title 评价 OR 反馈 OR 体验" (keywords: reviews, feedback, experience)
+5. For community data: GitHub API for related repos (stars, forks, issues)
+```
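+
+Steps 1 to 3 of the flow can be sketched as a fallback chain. This is an illustration under assumptions: `find_mirror` and `fetch_text` are hypothetical callables supplied by the caller (for example wrappers around `mmx search` and Playwright), and steps 4 and 5 are left out:
+
+```python
+from typing import Callable, Optional
+
+
+def extract_article(
+    title: str,
+    article_url: str,
+    find_mirror: Callable[[str], Optional[str]],
+    fetch_text: Callable[[str], Optional[str]],
+) -> dict:
+    """Prefer a mirror or cross-post; fall back to the #comment URL."""
+    mirror_url = find_mirror(title)  # step 1: look for a mirror or cross-post
+    if mirror_url:
+        text = fetch_text(mirror_url)  # step 2: fetch the full text from it
+        if text:
+            return {"source": mirror_url, "text": text}
+    # Step 3: article body only; comments are never available this way.
+    fallback_url = f"{article_url}#comment"
+    return {"source": fallback_url, "text": fetch_text(fallback_url)}
+```
+
+Passing the two callables in keeps the control flow testable with stubs and independent of any particular search or browser tooling.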