Hermes Agent ccc63d1e70 first commit
2026-05-10 13:52:46 +08:00

name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wechat, scraping, extraction, chinese, content, spa]

Chinese Platform Content Extraction

Strategies for extracting content from Chinese platforms that block automated access.

Trigger Conditions

  • User provides a link to a Chinese platform article
  • Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
  • SPA pages that return empty HTML shells

General Approach

  1. Try curl first — fast, works for simple sites
  2. If SPA/empty content → Playwright with wait_until="networkidle"
  3. If verification/CAPTCHA → search for mirrors via mmx search
  4. Extract OG metadata as fallback for title/description
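
The first three steps can be sketched as a small decision function that inspects the HTML returned by curl and picks the next step. This is a minimal sketch, not a tested detector: the function names, the marker strings, and the 200-character threshold are all hypothetical placeholders and should be replaced with values observed on the target platform.

```python
import re

# Remove <script>/<style> blocks first, then any remaining tags.
_TAG_BLOCKS = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)
_TAGS = re.compile(r"<[^>]+>")


def visible_text(html):
    """Return a rough approximation of the text a reader would see."""
    text = _TAGS.sub(" ", _TAG_BLOCKS.sub(" ", html))
    return " ".join(text.split())


def next_step(html, markers=("captcha", "verification"), min_chars=200):
    """Pick the next extraction step from a curl response.

    markers and min_chars are illustrative assumptions, not values
    taken from any platform.
    """
    lowered = html.lower()
    if any(marker in lowered for marker in markers):
        return "search_mirrors"          # step 3: verification page
    if len(visible_text(html)) < min_chars:
        return "render_with_playwright"  # step 2: empty SPA shell
    return "use_curl_result"             # step 1 succeeded
```

Checking for verification markers before checking length matters, because a verification page is often short enough to be mistaken for an empty SPA shell.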

Platform: WeChat (mp.weixin.qq.com)

Problem

Verification CAPTCHA blocks ALL automated access:

  • curl with any User-Agent → verification page
  • Playwright headless (even with stealth) → verification page
  • Mobile UA + stealth → still verification page

Do NOT loop on Playwright attempts. Switch to fallback immediately.

Solution: Search for Mirrors

# Extract the title from the og:title meta tag (even verification pages serve it),
# then search for mirrors using keywords from the article title.
mmx search query "<article title keywords>"

Reliable mirror platforms:

  • QQ Search (so.html5.qq.com) — most reliable, often has full text. SPA, needs Playwright.
  • CSDN blogs — authors cross-post frequently
  • Weibo — full reposts common
  • Sohu/163/Sina — news aggregation sites
  • 优设网 (uisdc.com) — design-related articles
  • 知乎 (zhihu.com) — knowledge/tech articles
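
One way to apply this list is to rank the URLs returned by a search so that the most promising mirror is tried first. The sketch below is an assumption-laden illustration: the ordering simply follows the list above, the domains for CSDN, Weibo, Sohu, 163, and Sina are filled in from general knowledge rather than from this document, and only QQ Search is flagged as needing Playwright because it is the only one described as an SPA here.

```python
from urllib.parse import urlparse

# (domain, needs_playwright). Order follows the list above; this
# ranking is an assumption, not a measured reliability score.
MIRROR_PRIORITY = [
    ("so.html5.qq.com", True),   # QQ Search is an SPA
    ("csdn.net", False),
    ("weibo.com", False),
    ("sohu.com", False),
    ("163.com", False),
    ("sina.com.cn", False),
    ("uisdc.com", False),
    ("zhihu.com", False),
]


def rank_mirrors(urls):
    """Keep only known mirror URLs, best candidate first.

    Returns a list of (url, needs_playwright) tuples.
    """
    ranked = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        for rank, (domain, needs_playwright) in enumerate(MIRROR_PRIORITY):
            if host == domain or host.endswith("." + domain):
                ranked.append((rank, url, needs_playwright))
                break
    ranked.sort(key=lambda item: item[0])  # stable: ties keep input order
    return [(url, needs_playwright) for _, url, needs_playwright in ranked]
```

URLs on unknown domains are dropped rather than ranked last, which keeps the caller from spending time on pages that are unlikely to be mirrors.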

Workflow

  1. Try direct curl → get title from og:title meta tag even on verification page
  2. mmx search query "title keywords"
  3. Fetch mirror page (QQ Search = SPA, needs Playwright)
  4. Extract content from mirror
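
Step 1 of this workflow can be done without a browser, since the og:title tag sits in the raw HTML that curl returns. The following is a minimal sketch using only the Python standard library; the names OGParser and extract_og are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser


class OGParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence if a tag is repeated.
            self.og.setdefault(prop[3:], content)


def extract_og(html):
    """Return e.g. {"title": ..., "description": ...} from raw HTML."""
    parser = OGParser()
    parser.feed(html)
    return parser.og
```

The value of extract_og(html).get("title") can then be trimmed to a few keywords and passed to the mirror search in step 2.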

Platform: 小黑盒 (xiaoheihe.cn)

Problem

The site is a Nuxt.js SPA, so curl returns an empty shell. Its API endpoints are not publicly documented.

Solution: Playwright (profile pages only)

import asyncio
from playwright.async_api import async_playwright

async def fetch_xiaoheihe_profile(url):
    """Render a 小黑盒 profile page in headless Chromium and return its visible text."""
    async with async_playwright() as p:
        # --no-sandbox is needed in container/VM environments
        browser = await p.chromium.launch(headless=True, args=['--no-sandbox', '--disable-setuid-sandbox'])
        try:
            page = await browser.new_page()
            # Phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to the bottom repeatedly to trigger lazy-loaded content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation or extraction fails
            await browser.close()

# Usage: text = asyncio.run(fetch_xiaoheihe_profile(profile_url))

Route patterns:

  • User profile: /bbs/user_profile_share?user_id={id}&h_src=heyboxapp
  • Post detail: /app/bbs/link/{link_id}

Critical limitations (learned 2026-05):

  1. Post detail pages are NOT accessible in a headless browser. They require the 小黑盒 app or deep-link handling. In headless Playwright, they return empty content or time out with both wait_until="networkidle" and wait_until="domcontentloaded". Do NOT attempt to visit /app/bbs/link/{id}: each attempt wastes 20-30s.

  2. Profile pages truncate post content server-side. The profile share page (/bbs/user_profile_share) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server — expanding CSS max-height/overflow/-webkit-line-clamp does NOT reveal more content. The truncation happens in the API response, not in rendering.

  3. API interception is not viable. Because detail pages time out, capturing XHR/fetch responses for post content does not work.

  4. Container/VM environments need --no-sandbox. Always pass args=['--no-sandbox', '--disable-setuid-sandbox'] to chromium.launch().
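
Given these limitations, it can help to classify a 小黑盒 URL before launching a browser, so that post detail links are skipped instead of timing out. The sketch below matches only the two route patterns listed above; the function names are hypothetical, and the host is deliberately not checked because the exact hostname variants are not documented here.

```python
from urllib.parse import urlparse, parse_qs


def classify_xiaoheihe_url(url):
    """Return (kind, identifier) for a 小黑盒 URL.

    kind is "profile", "post_detail", or "unknown".
    """
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    if path == "/bbs/user_profile_share":
        user_id = parse_qs(parsed.query).get("user_id", [None])[0]
        return ("profile", user_id)
    if path.startswith("/app/bbs/link/"):
        return ("post_detail", path.rsplit("/", 1)[-1])
    return ("unknown", None)


def should_fetch(url):
    """Only profile pages are worth rendering in headless Playwright."""
    return classify_xiaoheihe_url(url)[0] == "profile"
```

A caller can check should_fetch(url) first and fall back to the workarounds for anything that is not a profile page.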

Workaround when full content is needed

  • Ask the user to copy-paste the full content from the app
  • Search for mirrors: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms: mmx search query "小黑盒 <title keywords>"
  • Use the profile summary as-is — the truncated preview often contains the core prompt/information, just missing the tail end

Typical use case: extracting prompts from profile

When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):

  1. Fetch the profile page with Playwright → get all post titles + truncated previews
  2. Present what's available to the user
  3. Note which posts are truncated and suggest the user provide the full text from the app
  4. Do NOT waste time trying to visit individual post detail pages
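
Steps 2 and 3 can be sketched as a small report builder that separates complete previews from truncated ones. This rests on two loud assumptions: that the posts have already been parsed into dicts with "title" and "preview" keys (a hypothetical structure, since the page's text layout is not documented here), and that a truncated preview ends with an ellipsis, which may not match what the server actually returns.

```python
def build_report(posts, ellipsis_marks=("…", "...")):
    """Split post titles into complete and truncated groups.

    posts: list of {"title": str, "preview": str} dicts (hypothetical shape).
    ellipsis_marks: trailing strings treated as a truncation signal
    (an assumption; adjust after inspecting real previews).
    """
    report = {"complete": [], "truncated": []}
    for post in posts:
        preview = post["preview"].rstrip()
        key = "truncated" if preview.endswith(ellipsis_marks) else "complete"
        report[key].append(post["title"])
    return report
```

The titles under "truncated" are the ones to list when asking the user for the full text.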

Platform Reference Files

  • references/clawemail-platform.md: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes

Pitfalls

  1. Don't loop on verification: If WeChat direct fetch fails, immediately try fallback. Don't retry with different Playwright configs.
  2. QQ Search pages are SPAs: Use Playwright, not curl, to render them.
  3. Content completeness: Mirror versions may be slightly outdated or missing images. Note this to the user.
  4. OG metadata extraction: Mind the quoting when passing JavaScript to Playwright's evaluate(). Backslash-escaped single quotes inside a single-quoted string are hard to read and prone to nested-quote conflicts:
    # FRAGILE: escaped single quotes nested inside a single-quoted string
    og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
    
    # SAFER: only double quotes inside the JavaScript, single quotes outside, plus a fallback value
    og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
    
  5. execute_code vs terminal: If the terminal tool fails with FileNotFoundError, use execute_code as a workaround.
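
For pitfall 4, another way to sidestep nested quoting is to pass the CSS selector as an argument to evaluate() rather than embedding it in the JavaScript source. This is a minimal sketch, assuming page is a Playwright Page from the async API; get_og_title and the two constants are hypothetical names introduced here.

```python
import asyncio

# The selector lives in a plain Python string, so no escaping is needed.
OG_TITLE_SELECTOR = 'meta[property="og:title"]'

# A JavaScript arrow function; Playwright calls it with the argument
# given as the second parameter of evaluate().
JS_READ_META_CONTENT = "sel => document.querySelector(sel)?.content || 'N/A'"


async def get_og_title(page):
    """Read og:title from an already-loaded page, or 'N/A' if absent."""
    return await page.evaluate(JS_READ_META_CONTENT, OG_TITLE_SELECTOR)
```

Because the selector never appears inside the JavaScript string, the same function body works for any meta selector without further quoting changes.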