first commit

Hermes Agent
2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions

@@ -0,0 +1,133 @@
---
name: chinese-platform-extraction
description: Extract content from Chinese platforms (WeChat, 小黑盒, 知乎, CSDN) that block automated access. Fallback strategies when direct fetch fails.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wechat, scraping, extraction, chinese, content, spa]
---
# Chinese Platform Content Extraction
Strategies for extracting content from Chinese platforms that block automated access.
## Trigger Conditions
- User provides a link to a Chinese platform article
- Content extraction from WeChat, 小黑盒, 知乎, CSDN, etc.
- SPA pages that return empty HTML shells
## General Approach
1. **Try curl first** — fast, works for simple sites
2. **If SPA/empty content** → Playwright with `wait_until="networkidle"`
3. **If verification/CAPTCHA** → search for mirrors via `mmx search`
4. **Extract OG metadata** as fallback for title/description
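Step 2 depends on recognizing that a curl response is an empty SPA shell. A minimal sketch, assuming a shell can be detected by how little visible text remains once scripts and styles are ignored; the function name `looks_like_empty_shell` and the 200-character threshold are illustrative assumptions, not tuned values:

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript>."""

    _SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html, min_chars=200):
    """Return True when the page carries almost no visible text (hypothetical heuristic)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

If the check returns True, move on to Playwright instead of retrying curl.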
## Platform: WeChat (mp.weixin.qq.com)
### Problem
Verification CAPTCHA blocks ALL automated access:
- curl with any User-Agent → verification page
- Playwright headless (even with stealth) → verification page
- Mobile UA + stealth → still verification page
**Do NOT loop on Playwright attempts. Switch to fallback immediately.**
### Solution: Search for Mirrors
```bash
# Extract title from OG meta (even verification pages serve this)
# Then search for mirrors
mmx search query "<article title keywords>"
```
**Reliable mirror platforms:**
- **QQ Search (so.html5.qq.com)** — most reliable, often has full text. SPA, needs Playwright.
- **CSDN blogs** — authors cross-post frequently
- **Weibo** — full reposts common
- **Sohu/163/Sina** — news aggregation sites
- **优设网 (uisdc.com)** — design-related articles
- **知乎 (zhihu.com)** — knowledge/tech articles
### Workflow
1. Try direct curl → get title from `og:title` meta tag even on verification page
2. `mmx search query "title keywords"`
3. Fetch mirror page (QQ Search = SPA, needs Playwright)
4. Extract content from mirror
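Step 1 of the workflow can be sketched with the standard library alone. This assumes the verification page has already been fetched into a string (for example with curl); `extract_og` is a hypothetical helper name, not part of any existing tool:

```python
from html.parser import HTMLParser


class _OGParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content:
            # Keep the first occurrence of each property
            self.og.setdefault(prop, content)


def extract_og(html):
    """Return a dict such as {"og:title": "...", "og:description": "..."}."""
    parser = _OGParser()
    parser.feed(html)
    parser.close()
    return parser.og
```

The value under `og:title` can then be shortened to a few distinctive keywords and passed to `mmx search query`.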
## Platform: 小黑盒 (xiaoheihe.cn)
### Problem
The site is a Nuxt.js SPA, so curl returns an empty shell. Its API endpoints are not publicly documented.
### Solution: Playwright (profile pages only)
```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_xiaoheihe_profile(url):
    """Render a 小黑盒 profile page and return its visible body text."""
    async with async_playwright() as p:
        # --no-sandbox is needed in container/VM environments
        browser = await p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox'],
        )
        try:
            page = await browser.new_page()
            # Phone-sized viewport (375x812)
            await page.set_viewport_size({"width": 375, "height": 812})
            await page.goto(url, wait_until="networkidle", timeout=30000)
            await page.wait_for_timeout(3000)
            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                await page.wait_for_timeout(1500)
            return await page.inner_text("body")
        finally:
            # Close the browser even if navigation fails or times out
            await browser.close()

# Usage: text = asyncio.run(fetch_xiaoheihe_profile(url))
```
**Route patterns:**
- User profile: `/bbs/user_profile_share?user_id={id}&h_src=heyboxapp`
- Post detail: `/app/bbs/link/{link_id}`
**Critical limitations (learned 2026-05):**
1. **Post detail pages are NOT accessible in a headless browser.** They require the 小黑盒 app or deep-link handling. In headless Playwright they return empty content or time out under both `networkidle` and `domcontentloaded`. Do NOT attempt to visit `/app/bbs/link/{id}`; each attempt wastes 20-30 s.
2. **Profile pages truncate post content server-side.** The profile share page (`/bbs/user_profile_share`) shows post previews with CSS truncation, but the underlying text is also cut short by the server: overriding CSS `max-height`/`overflow`/`-webkit-line-clamp` does NOT reveal more content, because the truncation happens in the API response rather than in rendering.
3. **API interception is not viable.** Because detail pages time out, there are no XHR/fetch responses carrying post content to capture.
4. **Container/VM environments need `--no-sandbox`.** Always pass `args=['--no-sandbox', '--disable-setuid-sandbox']` to `chromium.launch()`.
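Given the route patterns and limitation 1, a small guard can keep an agent from spending time on post detail URLs. This is a sketch that assumes links follow exactly the two path patterns listed above; `classify_xiaoheihe_url` and its return labels are hypothetical names:

```python
from urllib.parse import urlparse


def classify_xiaoheihe_url(url):
    """Label a 小黑盒 URL by its path so post detail pages can be skipped."""
    path = urlparse(url).path
    if path.startswith("/app/bbs/link/"):
        return "post_detail"   # not renderable headlessly: do not fetch
    if path.startswith("/bbs/user_profile_share"):
        return "profile"       # safe to render with Playwright
    return "unknown"
```

Only URLs labeled `"profile"` should be handed to the Playwright fetcher; `"post_detail"` URLs go straight to the workarounds.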
### Workaround when full content is needed
- **Ask the user** to copy and paste the full content from the app
- **Search for mirrors**: some 小黑盒 content gets reposted to Weibo, Bilibili, or other platforms: `mmx search query "小黑盒 <title keywords>"`
- **Use the profile summary** as-is: the truncated preview often contains the core prompt or information and only lacks the tail end
### Typical use case: extracting prompts from profile
When a user shares a 小黑盒 profile link and asks to extract content (e.g., prompt collections):
1. Fetch the profile page with Playwright → get all post titles + truncated previews
2. Present what's available to the user
3. Note which posts are truncated and suggest that the user provide the full text from the app
4. Do NOT waste time trying to visit individual post detail pages
## Platform Reference Files
- `references/clawemail-platform.md`: ClawEmail (claw.163.com) registration constraints, CLI commands, and extraction notes
## Pitfalls
1. **Don't loop on verification**: If WeChat direct fetch fails, immediately try fallback. Don't retry with different Playwright configs.
2. **QQ Search pages are SPAs**: Use Playwright, not curl, to render them.
3. **Content completeness**: Mirror versions may be slightly outdated or missing images. Note this to the user.
4. **OG metadata extraction**: Mind the quoting in Playwright `evaluate()` strings. Backslash-escaped single quotes inside a single-quoted Python string are easy to mangle; use double quotes in the JavaScript and keep single quotes for the Python string:
```python
# FRAGILE: escaped single quotes nested in a single-quoted string
og_title = await page.evaluate('document.querySelector(\'meta[property="og:title"]\')?.content')
# BETTER: double quotes inside, single outside, with a fallback value
og_title = await page.evaluate('document.querySelector("meta[property=\\"og:title\\"]")?.content || "N/A"')
```
5. **execute_code vs terminal**: If the `terminal` tool fails with `FileNotFoundError`, use `execute_code` as a workaround.

@@ -0,0 +1,30 @@
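# ClawEmail (claw.163.com) Platform Notes
## Product Overview
- Email domain designed specifically for AI agents (`@claw.163.com`)
- Two main components: Email Channel (semantic understanding, consumes tokens) + mail-cli (data transport, zero tokens)
- Supports one-click installation from the Skill library (`npx skills add <url>.git`)
- Depends on the OpenClaw framework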
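## Registration Constraints
- **Character limits**: lowercase letters and digits only (no hyphens, uppercase letters, or special characters)
- **Reserved prefixes**: short words and common personal names (e.g. `elaina`) may be reserved by the system, which responds with "该前缀暂不开放注册" ("this prefix is not currently open for registration")
- **Brand-term limits**: overly obvious brand terms may be unavailable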
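The character rule can be checked locally before attempting registration. A minimal sketch, assuming the constraint is exactly "one or more ASCII lowercase letters or digits"; `is_valid_prefix_format` is a hypothetical helper, and reserved or brand prefixes can only be detected by the service itself:

```python
import re

# One or more ASCII lowercase letters or digits, nothing else
_PREFIX_RE = re.compile(r"[a-z0-9]+")


def is_valid_prefix_format(prefix):
    """Check only the character rule; reserved prefixes are not detectable here."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```

A prefix that passes this check may still be rejected as reserved, so the registration response remains the final word.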
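## Content Extraction
- The page is an SPA and needs Playwright to render
- The documentation is clearly structured: concept introduction → two modes → Skill list → quick start → FAQ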
## Common CLI Commands
```bash
mail-cli mail list --fid 1 --unread --json # List unread mail
mail-cli read body --id <mid> # Read the message body
mail-cli compose send # Send mail
mail-cli clawemail create # Create a sub-mailbox
mail-cli mail search --since "2025-01-01" # Search
```
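## Notes
- Whitelist mechanism: by default, only mail from authorized addresses is received
- Some Skill configurations may trigger an OpenClaw restart
- Mail between internal mailboxes is skipped automatically (loop prevention)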