---
name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wechat, scraping, extraction, chinese, content, spa]
---

# Chinese Platform Content Extraction

Strategies for extracting content from Chinese platforms that block automated access.

## Trigger Conditions

- User provides a link to a Chinese platform article
- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
- SPA pages that return empty HTML shells

## General Approach

1. **Try curl first**: fast, works for simple sites
2. **If SPA/empty content** → Playwright with `wait_until="networkidle"`
3. **If verification/CAPTCHA** → search for mirrors via `mmx search`
4. **Extract OG metadata** as fallback for title/description

## Platform: WeChat (mp.weixin.qq.com)

### Problem

Verification CAPTCHA blocks ALL automated access:

- curl with any User-Agent → verification page
- Playwright headless (even with stealth) → verification page
- Mobile UA + stealth → still verification page

**Do NOT loop on Playwright attempts. Switch to fallback immediately.**

### Solution: Search for Mirrors

```bash
# Extract title from OG meta (even verification pages serve this)
# Then search for mirrors, replacing the placeholder with keywords from the article title
mmx search query "<article title keywords>"
```

**Reliable mirror platforms:**

- **QQ Search (so.html5.qq.com)**: most reliable, often has full text. SPA, needs Playwright.
- **CSDN blogs**: authors cross-post frequently
- **Weibo**: full reposts common
- **Sohu/163/Sina**: news aggregation sites
- **优设网 (uisdc.com)**: design-related articles
- **知乎 (zhihu.com)**: knowledge/tech articles

### Workflow

1. Try direct curl → get title from `og:title` meta tag even on verification page
2. `mmx search query "title keywords"`
3. Fetch mirror page (QQ Search = SPA, needs Playwright)
4. Extract content from mirror

## Platform: 小黑盒 (xiaoheihe.cn)

### Problem

Nuxt.js SPA; curl returns an empty shell. API endpoints are not publicly documented.

### Solution: Playwright (profile pages only)

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_xiaoheihe_profile(url):
    async with async_playwright() as p:
        # The sandbox flags are needed in container/VM environments
        browser = await p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox'],
        )
        try:
            page = await browser.new_page()
            # Phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)

            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)

            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation times out
            await browser.close()
```

**Route patterns:**

- User profile: `/bbs/user_profile_share?user_id={id}&h_src=heyboxapp`
- Post detail: `/app/bbs/link/{link_id}`

**Critical limitations (learned 2026-05):**

1. **Post detail pages are NOT accessible in a headless browser.** They require the 小黑盒 APP or deep-link handling. In headless Playwright, they return empty content or time out on both `networkidle` and `domcontentloaded`. Do NOT attempt to visit `/app/bbs/link/{link_id}`; each attempt wastes 20-30s.
2. **Profile pages truncate post content server-side.** The profile share page (`/bbs/user_profile_share`) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server. Expanding CSS `max-height`/`overflow`/`-webkit-line-clamp` does NOT reveal more content, because the truncation happens in the API response, not in rendering.
3. **API interception is not viable.** Since detail pages time out, capturing XHR/fetch responses for post content doesn't work.
4. **Container/VM environments need `--no-sandbox`.** Always pass `args=['--no-sandbox', '--disable-setuid-sandbox']` to `chromium.launch()`.

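
The route patterns and limitations above can be turned into a small guard that decides, before any browser is launched, whether a 小黑盒 URL is worth fetching. This is a minimal sketch, not part of the original skill: the function names `classify_xiaoheihe_url` and `plan_fetch` are hypothetical, and accepting any `*.xiaoheihe.cn` subdomain is an assumption.

```python
from urllib.parse import urlparse, parse_qs

PROFILE_PATH = "/bbs/user_profile_share"
POST_DETAIL_PREFIX = "/app/bbs/link/"


def classify_xiaoheihe_url(url: str) -> str:
    """Return 'profile', 'post_detail', or 'unknown' for a URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Assumption: the bare domain and any subdomain (e.g. www) are valid hosts.
    if host != "xiaoheihe.cn" and not host.endswith(".xiaoheihe.cn"):
        return "unknown"

    path = parsed.path.rstrip("/")
    # Profile share pages carry the user id in the query string.
    if path == PROFILE_PATH and "user_id" in parse_qs(parsed.query):
        return "profile"
    # Post detail pages need a non-empty link id after the prefix.
    if path.startswith(POST_DETAIL_PREFIX) and len(path) > len(POST_DETAIL_PREFIX):
        return "post_detail"
    return "unknown"


def plan_fetch(url: str) -> str:
    """Map a URL to an action: 'playwright', 'skip', or 'unknown'."""
    kind = classify_xiaoheihe_url(url)
    if kind == "profile":
        return "playwright"  # render with headless Playwright
    if kind == "post_detail":
        return "skip"        # not reachable headlessly; use a workaround
    return "unknown"
```

Calling `plan_fetch` first keeps post detail links away from Playwright, so no time is spent waiting for a page that will not load.
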
### Workaround when full content is needed

- **Ask the user** to copy-paste the full content from the APP
- **Search for mirrors**: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms: `mmx search query "小黑盒 <title keywords>"`
- **Use the profile summary** as-is: the truncated preview often contains the core prompt/information and is missing only the tail end

### Typical use case: extracting prompts from profile

When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):

1. Fetch the profile page with Playwright → get all post titles + truncated previews
2. Present what's available to the user
3. Note which posts are truncated and suggest the user provide the full text from the APP
4. Do NOT waste time trying to visit individual post detail pages

## Platform Reference Files

- `references/clawemail-platform.md`: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes

## Pitfalls

1. **Don't loop on verification**: If WeChat direct fetch fails, immediately try the fallback. Don't retry with different Playwright configs.
2. **QQ Search pages are SPAs**: Use Playwright, not curl, to render them.
3. **Content completeness**: Mirror versions may be slightly outdated or missing images. Note this to the user.
4. **OG metadata extraction**: Watch the string escaping in Playwright `evaluate()` to avoid nested-quote issues:

   ```python
   # AVOID: backslash-escaped single quotes inside a single-quoted string (valid, but easy to break)
   og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')

   # PREFER: double quotes inside the JS, single quotes outside, plus a fallback value
   og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
   ```

5. **execute_code vs terminal**: If the `terminal` tool fails with `FileNotFoundError`, use `execute_code` as a workaround.
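
When the HTML has already been fetched with curl (for example a WeChat verification page, which still serves `og:title`), the OG metadata can be read without a browser and without any JavaScript quoting. The following is a minimal sketch using only the Python standard library; `OGMetaParser` and `extract_og` are hypothetical names, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class OGMetaParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence if a property is repeated.
            self.og.setdefault(prop, content)


def extract_og(html: str) -> dict:
    """Return a dict such as {'og:title': ..., 'og:description': ...}."""
    parser = OGMetaParser()
    parser.feed(html)
    return parser.og


# Invented sample input, standing in for the output of a curl fetch.
sample_html = (
    '<html><head>'
    '<meta property="og:title" content="Example &amp; Title" />'
    '<meta property="og:description" content="Short summary">'
    '</head><body></body></html>'
)
og = extract_og(sample_html)
title = og.get("og:title", "N/A")  # use as mirror-search keywords
```

The extracted title can then be passed to `mmx search query` as the mirror-search keywords, matching step 1 and step 2 of the WeChat workflow.
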