first commit

2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions
--- a/research/chinese-platform-extraction/SKILL.md
+++ b/research/chinese-platform-extraction/SKILL.md
@@ -0,0 +1,133 @@
+---
+name: chinese-platform-extraction
+description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
+version: 1.0.0
+author: Hermes Agent
+license: MIT
+metadata:
+  hermes:
+    tags: [wechat, scraping, extraction, chinese, content, spa]
+---
+
+# Chinese Platform Content Extraction
+
+Strategies for extracting content from Chinese platforms that block automated access.
+
+## Trigger Conditions
+
+- User provides a link to a Chinese platform article
+- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
+- SPA pages that return empty HTML shells
+
+## General Approach
+
+1. **Try curl first** — fast, works for simple sites
+2. **If SPA/empty content** → Playwright with `wait_until="networkidle"`
+3. **If verification/CAPTCHA** → search for mirrors via `mmx search query`
+4. **Extract OG metadata** as fallback for title/description
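+
+Step 2 depends on recognising an empty SPA shell. The following is a minimal sketch of one possible heuristic using only the standard library. The function name `looks_like_empty_shell` and the 200-character threshold are assumptions for illustration, not values taken from any platform.
+
+```python
+from html.parser import HTMLParser
+
+
+class _VisibleText(HTMLParser):
+    """Collect text that sits outside <script>, <style>, and <noscript>."""
+
+    SKIPPED = {"script", "style", "noscript"}
+
+    def __init__(self):
+        super().__init__()
+        self._skip_depth = 0
+        self.chunks = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag in self.SKIPPED:
+            self._skip_depth += 1
+
+    def handle_endtag(self, tag):
+        if tag in self.SKIPPED and self._skip_depth > 0:
+            self._skip_depth -= 1
+
+    def handle_data(self, data):
+        if self._skip_depth == 0 and data.strip():
+            self.chunks.append(data.strip())
+
+
+def looks_like_empty_shell(html, min_chars=200):
+    """Return True when the page has too little visible text to be an article."""
+    parser = _VisibleText()
+    parser.feed(html)
+    visible = " ".join(parser.chunks)
+    return len(visible) < min_chars
+```
+
+If the curl response looks like an empty shell, move on to Playwright. The threshold is a tunable guess and may need lowering for short posts.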
+
+## Platform: WeChat (mp.weixin.qq.com)
+
+### Problem
+Verification CAPTCHA blocks ALL automated access:
+- curl with any User-Agent → verification page
+- Playwright headless (even with stealth) → verification page
+- Mobile UA + stealth → still verification page
+
+**Do NOT loop on Playwright attempts. Switch to fallback immediately.**
+
+### Solution: Search for Mirrors
+
+```bash
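+# Extract the title from the OG meta tags (even verification pages serve them),
+# then search for mirrors. The quoted string is a placeholder meaning
+# "article title keywords"; substitute keywords from the real title.
+mmx search query "文章标题关键词"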
+```
+
+**Reliable mirror platforms:**
+- **QQ Search (so.html5.qq.com)** — most reliable, often has full text. SPA, needs Playwright.
+- **CSDN blogs** — authors cross-post frequently
+- **Weibo** — full reposts common
+- **Sohu/163/Sina** — news aggregation sites
+- **优设网 (uisdc.com)** — design-related articles
+- **知乎 (zhihu.com)** — knowledge/tech articles
+
+### Workflow
+1. Try direct curl → get title from `og:title` meta tag even on verification page
+2. `mmx search query "title keywords"`
+3. Fetch mirror page (QQ Search = SPA, needs Playwright)
+4. Extract content from mirror
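+
+Step 1 of the workflow can be done without a browser once the HTML is in hand. The following is a minimal sketch using the standard-library HTML parser. The helper name `extract_og_metadata` is hypothetical, and the sketch assumes the verification page exposes `og:` tags as `<meta property="..." content="...">`.
+
+```python
+from html.parser import HTMLParser
+
+
+class _OGParser(HTMLParser):
+    """Collect <meta property="og:*" content="..."> pairs."""
+
+    def __init__(self):
+        super().__init__()
+        self.og = {}
+
+    def handle_starttag(self, tag, attrs):
+        if tag != "meta":
+            return
+        attr = dict(attrs)
+        prop = attr.get("property") or ""
+        if prop.startswith("og:") and attr.get("content") is not None:
+            # Keep the first occurrence if a property is repeated
+            self.og.setdefault(prop, attr["content"])
+
+
+def extract_og_metadata(html):
+    """Return a dict of Open Graph properties found in the HTML."""
+    parser = _OGParser()
+    parser.feed(html)
+    return parser.og
+```
+
+The value of `extract_og_metadata(html).get("og:title")` can then be used as the search keywords in step 2.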
+
+## Platform: 小黑盒 (xiaoheihe.cn)
+
+### Problem
+The site is a Nuxt.js SPA, so curl returns an empty shell. The API endpoints are not publicly documented.
+
+### Solution: Playwright (profile pages only)
+
+```python
+import asyncio
+from playwright.async_api import async_playwright
+
+async def fetch_xiaoheihe_profile(url):
+    """Render a 小黑盒 profile share page and return its visible text."""
+    async with async_playwright() as p:
+        # --no-sandbox is needed in container/VM environments
+        browser = await p.chromium.launch(headless=True, args=['--no-sandbox', '--disable-setuid-sandbox'])
+        try:
+            page = await browser.new_page()
+            # Mobile-sized viewport
+            await page.set_viewport_size({"width": 375, "height": 812})
+            await page.goto(url, wait_until="networkidle", timeout=30000)
+            await page.wait_for_timeout(3000)
+            # Scroll to load lazy content
+            for _ in range(5):
+                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+                await page.wait_for_timeout(1500)
+            return await page.inner_text("body")
+        finally:
+            # Close the browser even if navigation times out
+            await browser.close()
+```
+
+**Route patterns:**
+- User profile: `/bbs/user_profile_share?user_id={id}&h_src=heyboxapp`
+- Post detail: `/app/bbs/link/{link_id}`
+
+**Critical limitations (learned 2026-05):**
+
+1. **Post detail pages are NOT accessible in a headless browser.** They require the 小黑盒 app or deep-link handling. In headless Playwright they return empty content or time out with both `networkidle` and `domcontentloaded`. Do NOT attempt to visit `/app/bbs/link/{id}`; each attempt wastes 20-30s.
+
+2. **Profile pages truncate post content server-side.** The profile share page (`/bbs/user_profile_share`) shows post previews with CSS truncation, but the underlying text is genuinely cut short by the server — expanding CSS `max-height`/`overflow`/`-webkit-line-clamp` does NOT reveal more content. The truncation happens in the API response, not in rendering.
+
+3. **API interception is not viable.** Because detail pages time out, capturing XHR/fetch responses for post content doesn't work.
+
+4. **Container/VM environments need `--no-sandbox`.** Always pass `args=['--no-sandbox', '--disable-setuid-sandbox']` to `chromium.launch()`.
+
+### Workaround when full content is needed
+
+- **Ask the user** to copy-paste the full content from the app
+- **Search for mirrors** — some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms: `mmx search query "小黑盒 标题关键词"`
+- **Use the profile summary** as-is — the truncated preview often contains the core prompt/information, just missing the tail end
+
+### Typical use case: extracting prompts from profile
+
+When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):
+1. Fetch the profile page with Playwright → get all post titles + truncated previews
+2. Present what's available to the user
+3. Note which posts are truncated and suggest the user provide the full text from the app
+4. Do NOT waste time trying to visit individual post detail pages
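+
+To enforce step 4, URLs can be classified before any browser is launched. The following is a minimal sketch based on the two route patterns listed earlier. The function name `classify_xiaoheihe_url` and its return shape are assumptions for illustration.
+
+```python
+from urllib.parse import urlparse, parse_qs
+
+
+def classify_xiaoheihe_url(url):
+    """Return (kind, id), where kind is 'profile', 'post_detail', or 'unknown'."""
+    parsed = urlparse(url)
+    path = parsed.path.rstrip("/")
+
+    # Profile share page: /bbs/user_profile_share?user_id={id}
+    if path == "/bbs/user_profile_share":
+        user_id = parse_qs(parsed.query).get("user_id", [None])[0]
+        return ("profile", user_id)
+
+    # Post detail page: /app/bbs/link/{link_id} (not reachable headlessly)
+    prefix = "/app/bbs/link/"
+    if path.startswith(prefix) and path[len(prefix):]:
+        return ("post_detail", path[len(prefix):])
+
+    return ("unknown", None)
+```
+
+Only URLs classified as `profile` are worth passing to `fetch_xiaoheihe_profile`. For `post_detail`, go straight to the workarounds.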
+
+## Platform Reference Files
+
+- `references/clawemail-platform.md`: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes
+
+## Pitfalls
+
+1. **Don't loop on verification**: If WeChat direct fetch fails, immediately try fallback. Don't retry with different Playwright configs.
+2. **QQ Search pages are SPAs**: Use Playwright, not curl, to render them.
+3. **Content completeness**: Mirror versions may be slightly outdated or missing images. Note this to the user.
+4. **OG metadata extraction**: When passing JavaScript to Playwright `evaluate()`, choose the quoting carefully to avoid nested-quote problems:
+   ```python
+   # AVOID: backslash-escaped single quotes are valid Python here, but can break if the snippet passes through another quoting layer
+   og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
+   
+   # RIGHT — use double quotes inside, single outside
+   og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
+   ```
+5. **execute_code vs terminal**: If the `terminal` tool fails with `FileNotFoundError`, use `execute_code` as a workaround.
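+
+For pitfall 4, another way to avoid nested quotes is to pass the selector to `evaluate()` as an argument instead of embedding it in the JavaScript string. The following is a minimal sketch. The helper name `read_og_title` is hypothetical, and `page` is assumed to be a Playwright async `Page`.
+
+```python
+OG_TITLE_SELECTOR = 'meta[property="og:title"]'
+
+
+async def read_og_title(page, default="N/A"):
+    """Read og:title from the page, passing the selector as an argument."""
+    # The JavaScript contains no quote characters, so nothing needs escaping
+    js = "(selector) => document.querySelector(selector)?.content ?? null"
+    value = await page.evaluate(js, OG_TITLE_SELECTOR)
+    return value or default
+```
+
+Call it after `page.goto(...)` with `og_title = await read_og_title(page)`.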
--- a/research/chinese-platform-extraction/references/clawemail-platform.md
+++ b/research/chinese-platform-extraction/references/clawemail-platform.md
@@ -0,0 +1,30 @@
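+# ClawEmail (claw.163.com) Platform Notes
+
+## Product Overview
+- An email domain designed for AI agents (`@claw.163.com`)
+- Two main components: Email Channel (semantic understanding, consumes tokens) + mail-cli (data transfer, zero tokens)
+- Supports one-step installation from the Skill library (`npx skills add <url>.git`)
+- Depends on the OpenClaw framework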
+
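+## Registration Constraints
+- **Character limits**: lowercase letters and digits only (no hyphens, uppercase letters, or special characters)
+- **Reserved prefixes**: short words and common names (e.g. `elaina`) may be reserved by the system, which then shows "该前缀暂不开放注册" ("this prefix is not currently open for registration")
+- **Brand-term restrictions**: overly obvious brand terms may be unavailable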
+
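+## Content Extraction
+- The page is an SPA and needs Playwright to render
+- The documentation is clearly structured: concept introduction → two modes → Skill list → quick start → FAQ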
+
+## Common CLI Commands
+```bash
+mail-cli mail list --fid 1 --unread --json    # list unread mail
+mail-cli read body --id <mid>                  # read a message body
+mail-cli compose send                          # send mail
+mail-cli clawemail create                      # create a sub-mailbox
+mail-cli mail search --since "2025-01-01"      # search
+```
+
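+## Notes
+- Allowlist mechanism: by default, only mail from authorized addresses is accepted
+- Some Skill configurations may trigger an OpenClaw restart
+- Mail between internal mailboxes is skipped automatically to prevent loops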