ephron_ren/agent-skills

Hermes Agent ccc63d1e70 first commit

2026-05-10 13:52:46 +08:00

name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata: hermes

Chinese Platform Content Extraction

Strategies for extracting content from Chinese platforms that block automated access.

Trigger Conditions

User provides a link to a Chinese platform article
Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
SPA pages that return empty HTML shells

General Approach

1. Try curl first: it is fast and works for simple sites.
2. If the page is an SPA or returns empty content, use Playwright with wait_until="networkidle".
3. If a verification page or CAPTCHA appears, search for mirrors via mmx search.
4. Extract OG metadata as a fallback for the title and description.
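
The decision chain above can be sketched as a small classifier that inspects the HTML returned by curl and names the next step. This is a minimal sketch, not a definitive detector: the marker strings, the length threshold, and the function names are illustrative assumptions and would need tuning per platform.

```python
import re

# Hypothetical heuristics: these marker strings and the length threshold are
# illustrative assumptions, not values documented by any platform.
VERIFICATION_MARKERS = ("captcha", "验证", "verify")
MIN_VISIBLE_CHARS = 200


def visible_text(html: str) -> str:
    """Approximate the visible text by dropping scripts, styles, and tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def next_step(html: str) -> str:
    """Return 'search_mirrors', 'render_with_playwright', or 'use_content'."""
    lowered = html.lower()
    # A verification page means further direct attempts are pointless.
    if any(marker in lowered for marker in VERIFICATION_MARKERS):
        return "search_mirrors"
    # Very little visible text suggests an SPA shell that needs rendering.
    if len(visible_text(html)) < MIN_VISIBLE_CHARS:
        return "render_with_playwright"
    return "use_content"
```

Checking for verification markers before checking length keeps a short CAPTCHA page from being mistaken for an SPA shell and sent to Playwright.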

Platform: WeChat (mp.weixin.qq.com)

Problem

Verification CAPTCHA blocks ALL automated access:

curl with any User-Agent → verification page
Playwright headless (even with stealth) → verification page
Mobile UA + stealth → still verification page

Do NOT loop on Playwright attempts. Switch to fallback immediately.

Solution: Search for Mirrors

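# Extract the title from the og:title meta tag (even verification pages serve it),
# then search for mirrors using keywords from that title.
# The quoted string is a placeholder meaning "article title keywords".
mmx search query "文章标题关键词"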

Reliable mirror platforms:

QQ Search (so.html5.qq.com) — most reliable, often has full text. SPA, needs Playwright.
CSDN blogs — authors cross-post frequently
Weibo — full reposts common
Sohu/163/Sina — news aggregation sites
优设网 (uisdc.com) — design-related articles
知乎 (zhihu.com) — knowledge/tech articles

Workflow

1. Try a direct curl and read the title from the og:title meta tag, which is served even on the verification page.
2. Run mmx search query "title keywords".
3. Fetch the mirror page (QQ Search is an SPA and needs Playwright).
4. Extract the content from the mirror.
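
Steps 1 and 2 of this workflow can be sketched with the standard library alone. The snippet below pulls og:* tags out of HTML that has already been fetched and builds the argument list for the mmx search query command shown earlier. The helper names and the 30-character keyword cut-off are assumptions made for illustration.

```python
from html.parser import HTMLParser


class OGParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content"):
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, attr["content"])


def extract_og(html: str) -> dict:
    """Return a dict such as {'og:title': ..., 'og:description': ...}."""
    parser = OGParser()
    parser.feed(html)
    return parser.og


def mirror_search_command(title: str, max_chars: int = 30) -> list:
    """Build the mmx argument list; max_chars is an illustrative assumption."""
    keywords = title.strip()[:max_chars]
    return ["mmx", "search", "query", keywords]
```

Passing the returned list to subprocess.run without a shell sidesteps quoting problems with non-ASCII titles.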

Platform: 小黑盒 (xiaoheihe.cn)

Problem

The site is a Nuxt.js SPA, so curl returns an empty shell. The API endpoints are not publicly documented.

Solution: Playwright (profile pages only)

import asyncio
from playwright.async_api import async_playwright

async def fetch_xiaoheihe_profile(url):
    """Render a 小黑盒 profile page and return its visible body text."""
    async with async_playwright() as p:
        # --no-sandbox is needed in container/VM environments
        browser = await p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox'],
        )
        try:
            page = await browser.new_page()
            # Use a phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation or extraction fails
            await browser.close()

Route patterns:

User profile: /bbs/user_profile_share?user_id={id}&h_src=heyboxapp
Post detail: /app/bbs/link/{link_id}

Critical limitations (learned 2026-05):

Post detail pages are NOT accessible in a headless browser. They require the 小黑盒 app or deep-link handling. In headless Playwright they return empty content or time out under both networkidle and domcontentloaded. Do NOT attempt to visit /app/bbs/link/{id}; each attempt wastes 20-30s.
Profile pages truncate post content server-side. The profile share page (/bbs/user_profile_share) shows post previews with CSS truncation, but the underlying text is also cut short by the server, so overriding CSS max-height, overflow, or -webkit-line-clamp does NOT reveal more content. The truncation happens in the API response, not in rendering.
API interception is not viable. Because detail pages time out, capturing XHR/fetch responses for post content does not work.
Container/VM environments need --no-sandbox. Always pass args=['--no-sandbox', '--disable-setuid-sandbox'] to chromium.launch().
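
Given the route patterns and the first limitation above, one option is to classify a URL by its path before launching a browser, so that post detail links are rejected up front. This is a minimal sketch: the function name and return shape are hypothetical, and only the two documented paths are recognised.

```python
from urllib.parse import urlparse, parse_qs

PROFILE_PATH = "/bbs/user_profile_share"
POST_PREFIX = "/app/bbs/link/"


def classify_xiaoheihe_url(url: str):
    """Return (kind, identifier), where kind is 'profile', 'post_detail', or 'unknown'."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    if path == PROFILE_PATH:
        # user_id is carried in the query string of the profile share route.
        user_id = parse_qs(parsed.query).get("user_id", [None])[0]
        return ("profile", user_id)
    if path.startswith(POST_PREFIX) and path[len(POST_PREFIX):]:
        # Post detail pages are not worth fetching headlessly; report and skip.
        return ("post_detail", path[len(POST_PREFIX):])
    return ("unknown", None)
```

A caller would fetch only URLs classified as 'profile' and fall back to the workarounds for 'post_detail'.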

Workaround when full content is needed

Ask the user to copy and paste the full content from the app.
Search for mirrors: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms. Example: mmx search query "小黑盒标题关键词" (the placeholder means "小黑盒 title keywords"; replace it with keywords from the post).
Use the profile summary as-is: the truncated preview often contains the core prompt or information and is missing only the tail end.

Typical use case: extracting prompts from profile

When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):

1. Fetch the profile page with Playwright to get all post titles and truncated previews.
2. Present what is available to the user.
3. Note which posts are truncated and suggest that the user provide the full text from the app.
4. Do NOT waste time trying to visit individual post detail pages.

Platform Reference Files

references/clawemail-platform.md: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes

Pitfalls

Don't loop on verification: If WeChat direct fetch fails, immediately try fallback. Don't retry with different Playwright configs.
QQ Search pages are SPAs: Use Playwright, not curl, to render them.
Content completeness: Mirror versions may be slightly outdated or missing images. Note this to the user.

OG metadata extraction: the JavaScript passed to Playwright's evaluate() sits inside a Python string, so choose quoting that avoids nested-quote conflicts:

# WRONG — nested quote conflict
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')

# RIGHT — use double quotes inside, single outside
og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
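
Another way to avoid the quoting problem, sketched here as an alternative rather than the author's method, is to pass the selector as the second argument to page.evaluate(), so the JavaScript source contains no nested quotes at all. The helper name og_content and its default value are hypothetical. The page parameter is assumed to be a Playwright async Page.

```python
# JavaScript arrow function; Playwright calls it with the argument supplied below.
OG_SELECTOR_JS = "(sel) => document.querySelector(sel)?.content ?? null"


async def og_content(page, prop: str, default: str = "N/A") -> str:
    """Read <meta property="og:{prop}"> from a rendered page."""
    # The selector is plain Python data, so no JavaScript escaping is needed.
    selector = f'meta[property="og:{prop}"]'
    value = await page.evaluate(OG_SELECTOR_JS, selector)
    return value or default
```

Inside an async function that already holds a page, the call would look like og_title = await og_content(page, "title").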

execute_code vs terminal: If the terminal tool fails with FileNotFoundError, use execute_code as a workaround.
