| name | description | version | author | license | metadata |
|---|---|---|---|---|---|
| chinese-platform-extraction | Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails. | 1.0.0 | Hermes Agent | MIT | |
Chinese Platform Content Extraction
Strategies for extracting content from Chinese platforms that block automated access.
Trigger Conditions
- User provides a link to a Chinese platform article
- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
- SPA pages that return empty HTML shells
General Approach
- Try curl first: it is fast and works for simple sites
- If the page is an SPA or returns empty content → Playwright with wait_until="networkidle"
- If a verification page or CAPTCHA appears → search for mirrors via mmx search
- Extract OG metadata as a fallback for title/description
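The decision order above can be sketched as a small classifier over the HTML that curl returns. This is a minimal sketch under stated assumptions: the marker strings, the 200-character cutoff, and the names `visible_text` and `next_step` are all hypothetical and should be tuned against real responses.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style>, or <noscript>."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumed markers for verification pages; not taken from the source.
VERIFICATION_MARKERS = ("captcha", "环境异常", "完成验证")
# Assumed cutoff below which a page counts as an empty shell.
MIN_VISIBLE_CHARS = 200


def visible_text(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def next_step(html: str) -> str:
    """Return 'use_curl_result', 'search_mirrors', or 'playwright'."""
    text = visible_text(html)
    if len(text) >= MIN_VISIBLE_CHARS:
        return "use_curl_result"
    if any(marker in text.lower() for marker in VERIFICATION_MARKERS):
        return "search_mirrors"
    return "playwright"
```

Checking the markers only on short pages keeps a long article that merely mentions CAPTCHAs from being routed to the mirror search.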
Platform: WeChat (mp.weixin.qq.com)
Problem
Verification CAPTCHA blocks ALL automated access:
- curl with any User-Agent → verification page
- Playwright headless (even with stealth) → verification page
- Mobile UA + stealth → still verification page
Do NOT loop on Playwright attempts. Switch to fallback immediately.
Solution: Search for Mirrors
# Extract the title from OG meta (even verification pages serve this),
# then search for mirrors using keywords from the title
mmx search query "文章标题关键词"  # placeholder: article title keywords
Reliable mirror platforms:
- QQ Search (so.html5.qq.com) — most reliable, often has full text. SPA, needs Playwright.
- CSDN blogs — authors cross-post frequently
- Weibo — full reposts common
- Sohu/163/Sina — news aggregation sites
- 优设网 (uisdc.com) — design-related articles
- 知乎 (zhihu.com) — knowledge/tech articles
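When a search returns several candidate mirrors, the list above can be used to decide which one to fetch first. The sketch below assumes the list order is a priority order (the source only states that QQ Search is the most reliable), and the domains for CSDN, Weibo, Sohu, 163, and Sina are assumptions, since the list names those platforms without hostnames. `mirror_rank`, `sort_mirrors`, and `needs_playwright` are hypothetical helper names.

```python
from urllib.parse import urlparse

# Assumed priority order; only QQ Search's top position comes from the list.
MIRROR_PRIORITY = [
    "so.html5.qq.com",  # QQ Search
    "csdn.net",         # assumed domain for CSDN blogs
    "weibo.com",        # assumed domain for Weibo
    "sohu.com",         # assumed domain
    "163.com",          # assumed domain
    "sina.com.cn",      # assumed domain
    "uisdc.com",        # 优设网
    "zhihu.com",        # 知乎
]

# Mirrors described as SPAs, which need Playwright instead of curl.
SPA_MIRRORS = {"so.html5.qq.com"}


def _host(url: str) -> str:
    return (urlparse(url).hostname or "").lower()


def mirror_rank(url: str) -> int:
    """Lower is better; unknown hosts sort after every known mirror."""
    host = _host(url)
    for rank, domain in enumerate(MIRROR_PRIORITY):
        if host == domain or host.endswith("." + domain):
            return rank
    return len(MIRROR_PRIORITY)


def sort_mirrors(urls):
    # sorted() is stable, so ties keep the search engine's original order.
    return sorted(urls, key=mirror_rank)


def needs_playwright(url: str) -> bool:
    return _host(url) in SPA_MIRRORS
```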
Workflow
- Try direct curl → get the title from the og:title meta tag, even on the verification page
- mmx search query "title keywords"
- Fetch the mirror page (QQ Search = SPA, needs Playwright)
- Extract content from mirror
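The first two workflow steps can be sketched with the standard library alone: parse the `og:` meta tags out of whatever HTML curl returned, then build the `mmx search query` command from the title. The names `extract_og` and `mirror_search_command` are hypothetical, and the command is only assembled here, not executed.

```python
from html.parser import HTMLParser


class _OGParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content:
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, content)


def extract_og(html: str) -> dict:
    parser = _OGParser()
    parser.feed(html)
    parser.close()
    return parser.og


def mirror_search_command(html: str):
    """Return the argv list for the mirror search, or None without a title."""
    title = extract_og(html).get("og:title")
    if not title:
        return None
    return ["mmx", "search", "query", title]
```

Returning an argv list rather than a shell string avoids quoting problems when the title contains quotes or spaces; it could be passed to `subprocess.run` as-is.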
Platform: 小黑盒 (xiaoheihe.cn)
Problem
The site is a Nuxt.js SPA, so curl returns an empty shell. Its API endpoints are not publicly documented.
Solution: Playwright (profile pages only)
import asyncio
from playwright.async_api import async_playwright

async def fetch_xiaoheihe_profile(url):
    async with async_playwright() as p:
        # The sandbox flags are needed in container/VM environments
        browser = await p.chromium.launch(headless=True, args=['--no-sandbox', '--disable-setuid-sandbox'])
        try:
            page = await browser.new_page()
            # Phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation fails or times out
            await browser.close()

# Usage: text = asyncio.run(fetch_xiaoheihe_profile(profile_url))
Route patterns:
- User profile: /bbs/user_profile_share?user_id={id}&h_src=heyboxapp
- Post detail: /app/bbs/link/{link_id}
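The two route patterns can be told apart before launching a browser, so that only profile pages are handed to Playwright. A minimal sketch, assuming the paths match the patterns above exactly; `classify_xiaoheihe_url` is a hypothetical helper and the IDs in the usage lines are made up.

```python
from urllib.parse import urlparse, parse_qs

POST_DETAIL_PREFIX = "/app/bbs/link/"


def classify_xiaoheihe_url(url: str):
    """Return (kind, identifier) where kind is 'profile', 'post_detail', or 'unknown'."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    if path == "/bbs/user_profile_share":
        user_id = parse_qs(parsed.query).get("user_id", [None])[0]
        return ("profile", user_id)
    if path.startswith(POST_DETAIL_PREFIX):
        return ("post_detail", path[len(POST_DETAIL_PREFIX):])
    return ("unknown", None)


# Hypothetical IDs, for illustration only.
kind, ident = classify_xiaoheihe_url(
    "https://xiaoheihe.cn/bbs/user_profile_share?user_id=12345&h_src=heyboxapp"
)
print(kind, ident)
```

A caller would fetch only URLs classified as `profile` and report `post_detail` links back to the user instead of visiting them.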
Critical limitations (learned 2026-05):
- Post detail pages are NOT accessible in a headless browser. They require the 小黑盒 app or deep-link handling. In headless Playwright, they return empty content or time out with both networkidle and domcontentloaded. Do NOT attempt to visit /app/bbs/link/{id}; it wastes 20-30s per page.
- Profile pages truncate post content server-side. The profile share page (/bbs/user_profile_share) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server: expanding CSS max-height/overflow/-webkit-line-clamp does NOT reveal more content. The truncation happens in the API response, not in rendering.
- API interception is not viable. Since detail pages time out, capturing XHR/fetch responses for post content doesn't work.
- Container/VM environments need --no-sandbox. Always pass args=['--no-sandbox', '--disable-setuid-sandbox'] to chromium.launch().
Workaround when full content is needed
- Ask the user to copy-paste the full content from the app
- Search for mirrors, since some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms:
  mmx search query "小黑盒 标题关键词"
- Use the profile summary as-is: the truncated preview often contains the core prompt/information and is only missing the tail end
Typical use case: extracting prompts from profile
When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):
- Fetch the profile page with Playwright → get all post titles + truncated previews
- Present what's available to the user
- Note which posts are truncated and suggest the user provide full text from APP
- Do NOT waste time trying to visit individual post detail pages
Platform Reference Files
- references/clawemail-platform.md: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes
Pitfalls
- Don't loop on verification: If WeChat direct fetch fails, immediately try fallback. Don't retry with different Playwright configs.
- QQ Search pages are SPAs: Use Playwright, not curl, to render them.
- Content completeness: Mirror versions may be slightly outdated or missing images. Note this to the user.
- OG metadata extraction: Use JavaScript string escaping in Playwright evaluate() to avoid nested quote issues:
  # ERROR-PRONE: escaped single quotes nested inside a single-quoted string
  og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
  # SAFER: double quotes inside, single quotes outside
  og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
- execute_code vs terminal: If the terminal tool fails with FileNotFoundError, use execute_code as a workaround.
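For the OG metadata pitfall, one way to avoid hand-written escapes altogether is to let `json.dumps` produce the JavaScript string literals. This is a sketch rather than the skill's own method; `og_query_js` is a hypothetical helper, and it assumes the property name contains no quotes or brackets.

```python
import json


def og_query_js(prop: str, default: str = "N/A") -> str:
    """Build a JS expression that reads one OG meta tag's content.

    json.dumps emits a double-quoted literal with inner quotes escaped,
    which is also a valid JavaScript string literal.
    """
    selector = f'meta[property="{prop}"]'
    return (
        f"document.querySelector({json.dumps(selector)})?.content"
        f" || {json.dumps(default)}"
    )


print(og_query_js("og:title"))
```

The resulting string can be passed straight to Playwright, for example `og_title = await page.evaluate(og_query_js("og:title"))`.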