---
name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wechat, scraping, extraction, chinese, content, spa]
---

# Chinese Platform Content Extraction

Strategies for extracting content from Chinese platforms that block automated access.

## Trigger Conditions

- User provides a link to a Chinese platform article
- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
- SPA pages that return empty HTML shells

## General Approach

1. **Try curl first**: fast, and works for simple sites
2. **If SPA/empty content** → Playwright with `wait_until="networkidle"`
3. **If verification/CAPTCHA** → search for mirrors via `mmx search query`
4. **Extract OG metadata** as a fallback for title/description

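The decision between steps 1 to 3 can be sketched as a small classifier over the body that curl returns. This is only an illustration: the marker strings, the 200-character threshold, and the names `visible_text` and `next_step` are assumptions, not part of any tool mentioned here, and should be tuned against real responses.

```python
import re

# Hypothetical markers of a verification page; adjust after inspecting real responses.
VERIFICATION_MARKERS = ("captcha", "环境异常", "verify")
# Assumed threshold: less visible text than this is treated as an empty SPA shell.
MIN_TEXT_CHARS = 200


def visible_text(html: str) -> str:
    """Approximate the visible text by dropping scripts, styles, and tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def next_step(html: str) -> str:
    """Map a curl response body to the next step of the general approach."""
    text = visible_text(html)
    if any(marker in text.lower() for marker in VERIFICATION_MARKERS):
        return "search_mirrors"   # step 3: verification/CAPTCHA page
    if len(text) < MIN_TEXT_CHARS:
        return "playwright"       # step 2: SPA shell with no real content
    return "use_curl_result"      # step 1 was enough
```

Checking the markers against visible text only (not scripts) reduces false positives, but an article that itself discusses CAPTCHAs would still be misclassified, so treat the result as a hint rather than a verdict.
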
## Platform: WeChat (mp.weixin.qq.com)

### Problem
Verification CAPTCHA blocks ALL automated access:
- curl with any User-Agent → verification page
- Playwright headless (even with stealth) → verification page
- Mobile UA + stealth → still verification page

**Do NOT loop on Playwright attempts. Switch to fallback immediately.**

### Solution: Search for Mirrors

```bash
# Extract title from OG meta (even verification pages serve this)
# Then search for mirrors using keywords from the article title
# ("文章标题关键词" is a placeholder meaning "article title keywords")
mmx search query "文章标题关键词"
```

**Reliable mirror platforms:**
- **QQ Search (so.html5.qq.com)**: most reliable, often has the full text. It is an SPA, so it needs Playwright.
- **CSDN blogs**: authors cross-post frequently
- **Weibo**: full reposts are common
- **Sohu/163/Sina**: news aggregation sites
- **优设网 (uisdc.com)**: design-related articles
- **知乎 (zhihu.com)**: knowledge/tech articles

### Workflow
1. Try direct curl → read the title from the `og:title` meta tag, which is present even on the verification page
2. `mmx search query "title keywords"`
3. Fetch the mirror page (QQ Search is an SPA and needs Playwright)
4. Extract content from the mirror

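Step 1 of the workflow only needs the Open Graph tags from the HTML that curl already returned, so it can be done without a browser. The following is a minimal sketch using Python's standard `html.parser`; the names `OGParser` and `extract_og` are illustrative, not part of any existing tool.

```python
from html.parser import HTMLParser


class OGParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content:
            # Keep the first occurrence if a property is repeated.
            self.og.setdefault(prop, content)


def extract_og(html: str) -> dict:
    """Return a dict such as {"og:title": "...", "og:description": "..."}."""
    parser = OGParser()
    parser.feed(html)
    return parser.og
```

The value of `extract_og(html).get("og:title")` can then be trimmed to a few keywords and passed to the mirror search in step 2.
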
## Platform: 小黑盒 (xiaoheihe.cn)

### Problem
The site is a Nuxt.js SPA, so curl returns an empty shell. Its API endpoints are not publicly documented.

### Solution: Playwright (profile pages only)

```python
import asyncio
from playwright.async_api import async_playwright

async def fetch_xiaoheihe_profile(url):
    """Render a 小黑盒 profile page and return the visible text of its body."""
    async with async_playwright() as p:
        # --no-sandbox is needed in container/VM environments
        browser = await p.chromium.launch(headless=True, args=['--no-sandbox', '--disable-setuid-sandbox'])
        try:
            page = await browser.new_page()
            # Phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation or extraction fails
            await browser.close()

# Usage: text = asyncio.run(fetch_xiaoheihe_profile(url))
```

**Route patterns:**
- User profile: `/bbs/user_profile_share?user_id={id}&h_src=heyboxapp`
- Post detail: `/app/bbs/link/{link_id}`

**Critical limitations (learned 2026-05):**

1. **Post detail pages are NOT accessible in a headless browser.** They require the 小黑盒 app or deep-link handling. In headless Playwright, they return empty content or time out with both `networkidle` and `domcontentloaded`. Do NOT attempt to visit `/app/bbs/link/{link_id}`: it wastes 20-30s per page.

2. **Profile pages truncate post content server-side.** The profile share page (`/bbs/user_profile_share`) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server. Expanding CSS `max-height`/`overflow`/`-webkit-line-clamp` does NOT reveal more content. The truncation happens in the API response, not in rendering.

3. **API interception is not viable.** Because detail pages time out, capturing XHR/fetch responses for post content does not work.

4. **Container/VM environments need `--no-sandbox`.** Always pass `args=['--no-sandbox', '--disable-setuid-sandbox']` to `chromium.launch()`.

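Limitation 1 can be enforced before any browser is launched by classifying the URL against the two route patterns listed earlier. This is a sketch under assumptions: the helper names `xiaoheihe_route` and `should_fetch` are made up for illustration, and only the two documented paths are recognized.

```python
from urllib.parse import urlparse


def xiaoheihe_route(url: str) -> str:
    """Classify a 小黑盒 URL as 'profile', 'post_detail', or 'unknown' by its path."""
    path = urlparse(url).path.rstrip("/")
    if path.startswith("/app/bbs/link/"):
        return "post_detail"
    if path == "/bbs/user_profile_share":
        return "profile"
    return "unknown"


def should_fetch(url: str) -> bool:
    """Only profile pages are worth rendering in headless Playwright."""
    return xiaoheihe_route(url) == "profile"
```

Calling `should_fetch(url)` first avoids spending 20-30s on a detail page that will not render; for a `post_detail` URL, go straight to the workarounds instead.
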
### Workaround when full content is needed

- **Ask the user** to copy-paste the full content from the app
- **Search for mirrors**: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms. Search with the platform name plus keywords from the title ("标题关键词" is a placeholder for those keywords): `mmx search query "小黑盒 标题关键词"`
- **Use the profile summary** as-is: the truncated preview often contains the core prompt/information and is only missing the tail end

### Typical use case: extracting prompts from profile

When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):
1. Fetch the profile page with Playwright → get all post titles + truncated previews
2. Present what's available to the user
3. Note which posts are truncated and suggest that the user provide the full text from the app
4. Do NOT waste time trying to visit individual post detail pages

## Platform Reference Files

- `references/clawemail-platform.md`: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes

## Pitfalls

1. **Don't loop on verification**: If WeChat direct fetch fails, immediately try the fallback. Don't retry with different Playwright configs.
2. **QQ Search pages are SPAs**: Use Playwright, not curl, to render them.
3. **Content completeness**: Mirror versions may be slightly outdated or missing images. Note this to the user.
4. **OG metadata extraction**: Quote the JavaScript string passed to Playwright `evaluate()` carefully to avoid nested quote issues:
   ```python
   # Error-prone: escaped single quotes nested inside a single-quoted string
   og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')

   # Preferred: double quotes inside, single quotes outside
   og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
   ```
5. **execute_code vs terminal**: If the `terminal` tool fails with `FileNotFoundError`, use `execute_code` as a workaround.

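For pitfall 4, one way to avoid hand-written escaping altogether is to let `json.dumps` produce the JavaScript string literals. This is a sketch, not an established helper: the function name `og_expression` and its `default` parameter are assumptions made for illustration.

```python
import json


def og_expression(prop: str, default: str = "N/A") -> str:
    """Build a JS expression that reads an Open Graph meta tag's content.

    json.dumps emits a double-quoted literal with inner quotes escaped,
    which is also a valid JavaScript string literal.
    """
    selector = f'meta[property="{prop}"]'
    return (
        f"document.querySelector({json.dumps(selector)})?.content"
        f" || {json.dumps(default)}"
    )
```

Usage inside an async Playwright session would look like `og_title = await page.evaluate(og_expression("og:title"))`. Playwright's `page.evaluate()` also accepts an argument after the expression, so passing the selector as data to a function expression such as `"sel => document.querySelector(sel)?.content"` is another option that sidesteps quoting.
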