first commit
research/chinese-platform-extraction/SKILL.md
Normal file
@@ -0,0 +1,133 @@
---
name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wechat, scraping, extraction, chinese, content, spa]
---

# Chinese Platform Content Extraction

Strategies for extracting content from Chinese platforms that block automated access.

## Trigger Conditions

- User provides a link to a Chinese platform article
- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
- SPA pages that return empty HTML shells

## General Approach

1. **Try curl first**: fast, and works for simple server-rendered sites
2. **If the page is an SPA or returns empty content** → Playwright with `wait_until="networkidle"`
3. **If a verification page or CAPTCHA appears** → search for mirrors via `mmx search`
4. **Extract OG metadata** as a fallback for title/description

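Step 2 depends on recognising an empty SPA shell in the curl output. The sketch below is one possible heuristic, not part of the original skill: it counts visible text outside `<script>`, `<style>`, and `<noscript>` and treats very short pages as shells. The function name `looks_like_empty_shell` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html, min_chars=200):
    """Return True when the page carries less visible text than min_chars.

    min_chars is an arbitrary illustrative threshold, not a measured value.
    """
    parser = _VisibleText()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars
```

If the function returns True for the curl response, moving on to Playwright is probably the right next step. The threshold likely needs tuning per site.
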
## Platform: WeChat (mp.weixin.qq.com)

### Problem

A verification CAPTCHA blocks ALL automated access:

- curl with any User-Agent → verification page
- Playwright headless (even with stealth) → verification page
- Mobile UA + stealth → still verification page

**Do NOT loop on Playwright attempts. Switch to the fallback immediately.**

### Solution: Search for Mirrors

```bash
# Extract the title from the OG meta tags (even verification pages serve them),
# then search for mirrors. The quoted string is a placeholder meaning
# "article title keywords"; replace it with keywords from the real title.
mmx search query "文章标题关键词"
```

**Reliable mirror platforms:**

- **QQ Search (so.html5.qq.com)**: most reliable, often has the full text. SPA, needs Playwright.
- **CSDN blogs**: authors cross-post frequently
- **Weibo**: full reposts are common
- **Sohu/163/Sina**: news aggregation sites
- **优设网 (uisdc.com)**: design-related articles
- **知乎 (zhihu.com)**: knowledge/tech articles

### Workflow

1. Try a direct curl → read the title from the `og:title` meta tag, which is served even on the verification page
2. `mmx search query "title keywords"`
3. Fetch the mirror page (QQ Search is an SPA and needs Playwright)
4. Extract the content from the mirror

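Step 1 of the workflow can be done without a browser once the raw HTML is in hand. The following is a minimal sketch using only the standard library. The names `OGMetaParser` and `extract_og_metadata` are illustrative, and the sketch assumes the page exposes Open Graph data as `<meta property="og:..." content="...">` tags.

```python
from html.parser import HTMLParser


class OGMetaParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first occurrence if a property is repeated.
            self.og.setdefault(prop, attr["content"])


def extract_og_metadata(html):
    """Return a dict such as {"og:title": "..."} parsed from raw HTML."""
    parser = OGMetaParser()
    parser.feed(html)
    return parser.og
```

The value under `"og:title"` can then be used as the search query for mirrors. An empty dict suggests the response carried no OG tags at all.
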
## Platform: 小黑盒 (xiaoheihe.cn)

### Problem

The site is a Nuxt.js SPA, so curl returns an empty shell. Its API endpoints are not publicly documented.

### Solution: Playwright (profile pages only)

```python
import asyncio  # callers run this with asyncio.run(fetch_xiaoheihe_profile(url))
from playwright.async_api import async_playwright


async def fetch_xiaoheihe_profile(url):
    """Render a 小黑盒 profile page and return its visible text."""
    async with async_playwright() as p:
        # --no-sandbox is needed in container/VM environments
        browser = await p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox'],
        )
        try:
            page = await browser.new_page()
            # Phone-sized viewport
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation times out
            await browser.close()
```

**Route patterns:**

- User profile: `/bbs/user_profile_share?user_id={id}&h_src=heyboxapp`
- Post detail: `/app/bbs/link/{link_id}`

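The two route patterns above can be told apart before any browser is launched, which makes it easy to skip URLs that are not worth fetching. This is a rough sketch: the function name `classify_xiaoheihe_url` and the returned labels are assumptions made for illustration, and only the two paths listed above are recognised.

```python
from urllib.parse import urlparse, parse_qs

PROFILE_PATH = "/bbs/user_profile_share"
POST_DETAIL_PREFIX = "/app/bbs/link/"


def classify_xiaoheihe_url(url):
    """Return "profile", "post_detail", or "unknown" for a 小黑盒 URL."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    # Profile share pages carry the user id in the query string.
    if path == PROFILE_PATH and "user_id" in parse_qs(parsed.query):
        return "profile"
    # Post detail pages carry the link id as the last path segment.
    if path.startswith(POST_DETAIL_PREFIX) and len(path) > len(POST_DETAIL_PREFIX):
        return "post_detail"
    return "unknown"
```

A caller might fetch only URLs classified as `"profile"` and report the others to the user instead of opening them.
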
**Critical limitations (learned 2026-05):**

1. **Post detail pages are NOT accessible in a headless browser.** They require the 小黑盒 APP or deep-link handling. In headless Playwright they return empty content or time out with both `networkidle` and `domcontentloaded`. Do NOT attempt to visit `/app/bbs/link/{id}`: it wastes 20-30s per page.

2. **Profile pages truncate post content server-side.** The profile share page (`/bbs/user_profile_share`) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server, so overriding CSS `max-height`/`overflow`/`-webkit-line-clamp` does NOT reveal more content. The truncation happens in the API response, not in rendering.

3. **API interception is not viable.** Because detail pages time out, capturing XHR/fetch responses for post content does not work.

4. **Container/VM environments need `--no-sandbox`.** Always pass `args=['--no-sandbox', '--disable-setuid-sandbox']` to `chromium.launch()`.

### Workaround when full content is needed

- **Ask the user** to copy-paste the full content from the APP
- **Search for mirrors**: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms. Try `mmx search query "小黑盒 标题关键词"`
- **Use the profile summary** as-is: the truncated preview often contains the core prompt/information and is missing only the tail end

### Typical use case: extracting prompts from profile

When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):

1. Fetch the profile page with Playwright → get all post titles + truncated previews
2. Present what's available to the user
3. Note which posts are truncated and suggest that the user provide the full text from the APP
4. Do NOT waste time trying to visit individual post detail pages

## Platform Reference Files

- `references/clawemail-platform.md`: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes

## Pitfalls

1. **Don't loop on verification**: If the WeChat direct fetch fails, switch to the fallback immediately. Don't retry with different Playwright configs.
2. **QQ Search pages are SPAs**: Use Playwright, not curl, to render them.
3. **Content completeness**: Mirror versions may be slightly outdated or missing images. Note this to the user.
4. **OG metadata extraction**: Mind JavaScript string escaping in Playwright `evaluate()` to avoid nested-quote issues:
   ```python
   # AVOID: backslash-escaped single quotes are easy to get wrong, and there is no default value
   og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')

   # PREFER: double quotes inside, single quotes outside, with a default value
   og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
   ```
5. **execute_code vs terminal**: If the `terminal` tool fails with `FileNotFoundError`, use `execute_code` as a workaround.

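One way to sidestep the hand-escaping in pitfall 4 is to let `json.dumps` produce the JavaScript string literals. The helper below is a sketch and not something the skill itself ships. The name `og_meta_expression` is hypothetical, and the result is simply a string that could be passed to `page.evaluate()`.

```python
import json


def og_meta_expression(prop, default="N/A"):
    """Build a JS expression that reads an OG meta tag, with a fallback value.

    json.dumps emits a double-quoted, backslash-escaped literal, which is
    also a valid JavaScript string literal.
    """
    selector = f'meta[property="{prop}"]'
    return (
        f"document.querySelector({json.dumps(selector)})?.content"
        f" || {json.dumps(default)}"
    )
```

For `og:title` this yields the same expression as the hand-escaped version in pitfall 4, so it could be used as `await page.evaluate(og_meta_expression("og:title"))`.
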
@@ -0,0 +1,30 @@
# ClawEmail (claw.163.com) Platform Notes

## Product Overview

- An email domain designed specifically for AI agents (`@claw.163.com`)
- Two main components: Email Channel (semantic understanding, consumes tokens) and mail-cli (data transfer, zero tokens)
- Supports one-command installation from the Skill library (`npx skills add <url>.git`)
- Depends on the OpenClaw framework

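## Registration Constraints

- **Character restrictions**: lowercase letters and digits only (no hyphens, no uppercase, no special characters)
- **Reserved prefixes**: short words and common personal names (e.g. `elaina`) may be reserved by the system, which responds with "该前缀暂不开放注册" ("this prefix is not currently open for registration")
- **Brand-name restrictions**: overly obvious brand names may be unavailable
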
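The character rule above can be checked locally before attempting registration. This is a small sketch with a hypothetical function name. It covers only the character set, because reserved prefixes and brand-name restrictions are decided by the service and cannot be predicted offline.

```python
import re

# Lowercase ASCII letters and digits only, at least one character.
_PREFIX_RE = re.compile(r"[a-z0-9]+")


def is_valid_prefix_charset(prefix):
    """Return True if prefix uses only lowercase letters and digits."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```

A prefix that passes this check may still be rejected by the service as reserved.
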
## Content Extraction

- The page is an SPA and needs Playwright to render
- The documentation is clearly structured: concept introduction → two modes → Skill list → quick start → FAQ

## Common CLI Commands

```bash
mail-cli mail list --fid 1 --unread --json   # list unread mail
mail-cli read body --id <mid>                # read a message body
mail-cli compose send                        # send mail
mail-cli clawemail create                    # create a sub-mailbox
mail-cli mail search --since "2025-01-01"    # search
```

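## Notes

- Allowlist mechanism: by default, mail is accepted only from authorized addresses
- Some Skill configurations may trigger an OpenClaw restart
- Mail between internal mailboxes automatically bypasses the loop-prevention check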