first commit

2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions
--- a/wechat-article-reader/README.md
+++ b/wechat-article-reader/README.md
@@ -0,0 +1,145 @@
+# WeChat Official Account Article Export Skill
+
+> A skill that exports WeChat Official Account articles to Markdown, for Claude Code / OpenClaw
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+
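+## Features
+
+- One-step export of WeChat Official Account articles to Markdown
+- Automatic metadata extraction (title, author, publish time)
+- Output in a consistent format with YAML Front Matter
+- No API key required; works out of the box
+- Supports both Chinese and English
+
+## Installation
+
+### As a Claude Code / OpenClaw skill
+
+1. Clone this repository into your skills directory: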
+
+```bash
+# Claude Code
+git clone https://github.com/启明/WeChat-article-reader.git ~/.claude/skills/WeChat-article-reader
+
+# OpenClaw
+git clone https://github.com/启明/WeChat-article-reader.git ~/.openclaw/workspace/skills/WeChat-article-reader
+```
+
+2. Install the Python dependencies:
+
+```bash
+pip3 install -r requirements.txt
+```
+
+### Standalone command-line use
+
+```bash
+# Install dependencies
+pip3 install -r requirements.txt
+
+# Export an article
+python3 scripts/export.py "https://mp.weixin.qq.com/s/xxx" ./output
+```
+
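+## Usage
+
+### In Claude Code
+
+Provide a WeChat Official Account article link directly:
+
+```
+Download this article: https://mp.weixin.qq.com/s/xxx
+```
+
+The skill then automatically:
+1. Fetches the article page
+2. Extracts the metadata and body
+3. Saves the result as a Markdown file
+4. Reports the output location
+
+### Command line
+
+```bash
+python3 scripts/export.py <article-url> [output-dir]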
+```
+
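+## Output format
+
+The exported Markdown file starts with YAML Front Matter, followed by a header and the article body. The header labels are written in Chinese by the script:
+
+```yaml
+---
+title: Article title
+author: Author name
+publish_time: Publish time
+source_url: Original link
+exported_at: Export timestamp
+description: Article description
+---
+
+# Article title
+
+> 原文链接: URL
+
+**作者**: XXX
+**发布时间**: XXX
+
+-----
+
+Article body...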
+```
+
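+## File naming
+
+Generated files follow the pattern `YYYYMMDD_HHMMSS_<article title>.md`.
+
+Characters that are not allowed in file names are replaced automatically to keep the name file-system compatible.
+
+## Limitations
+
+- Some articles can only be viewed after logging in to WeChat
+- WeChat has anti-scraping measures; frequent requests may be throttled
+- Only text is exported; images are not downloaded
+- Complex layouts may not be fully reproduced
+
+## Implementation
+
+- **HTTP requests**: `requests` - fetches the article page
+- **HTML parsing**: `BeautifulSoup` + `lxml` - extracts the content
+- **Format conversion**: `markdownify` - converts HTML to Markdown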
+
+## Project structure
+
+```
+WeChat-article-reader/
+├── SKILL.md          # Skill definition (used by Claude Code)
+├── README.md         # Project overview
+├── LICENSE           # MIT license
+├── requirements.txt  # Python dependencies
+├── .gitignore        # Git ignore rules
+└── scripts/
+    └── export.py     # Export script
+```
+
+## Contributing
+
+Issues and pull requests are welcome!
+
+## Acknowledgements
+
+- [wechat-article-exporter](https://github.com/wechat-article/wechat-article-exporter) - inspiration for this project
+- [markdownify](https://github.com/matthewwithanm/python-markdownify) - HTML-to-Markdown converter
+
+## License
+
+[MIT License](LICENSE)
+
+## Author
+
+Created by [Leefee](https://github.com/启明)
+
+---
+
+If this project helps you, please give it a ⭐ Star!
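A minimal sketch of the file-naming rule described in the README above. `sanitize_filename` mirrors the method of the same name in `scripts/export.py`; the helper `build_filename` and the fixed timestamp in the usage line are illustrative assumptions, not part of the repository.

```python
import re
from datetime import datetime


def sanitize_filename(title: str) -> str:
    """Replace characters that are illegal on Windows/Linux, then whitespace runs, with '_'."""
    safe = re.sub(r'[<>:"/\\|?*]', "_", title)
    safe = re.sub(r"\s+", "_", safe)
    # Leading and trailing dots are dropped
    return safe.strip(".")


def build_filename(title: str, now: datetime) -> str:
    """Hypothetical helper: YYYYMMDD_HHMMSS_<sanitized title>.md"""
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{sanitize_filename(title)}.md"


print(build_filename("What is AI? A/B test", datetime(2026, 5, 10, 13, 52, 46)))
# → 20260510_135246_What_is_AI__A_B_test.md
```

Note that two different titles can collapse to the same sanitized name; the timestamp prefix is what keeps successive exports from overwriting each other.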
--- a/wechat-article-reader/SKILL.md
+++ b/wechat-article-reader/SKILL.md
@@ -0,0 +1,174 @@
+---
+name: WeChat-article-reader
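+description: "Exports WeChat Official Account articles to Markdown. Extracts metadata (title/author/publish time/description) and writes a Markdown file with YAML Front Matter. Triggers: 公众号文章 (Official Account article), 微信文章 (WeChat article), mp.weixin.qq.com links, requests to download/export/save WeChat content. Not for: WeChat Mini Programs, WeChat chat history, or links that are not Official Account articles."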
+---
+
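+# WeChat Official Account Article Export Skill (WeChat-Article-Reader)
+
+## Trigger conditions
+
+Trigger this skill when:
+
+- The user provides a WeChat Official Account article link (mp.weixin.qq.com)
+- The user asks to "download", "export", or "save" a WeChat article
+- The user asks to convert a WeChat article to Markdown
+- The user mentions "公众号文章" (Official Account article), "微信文章" (WeChat article), "下载微信" (download WeChat), or "导出公众号" (export Official Account)
+
+**Trigger examples:**
+- "Download this article https://mp.weixin.qq.com/s/xxx"
+- "Export this Official Account article as Markdown"
+- "Save the WeChat article locally"
+- "Help me save this WeChat article"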
+
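+## How it works
+
+This skill runs a Python script that:
+1. Fetches the HTML page of the WeChat article
+2. Extracts metadata (title, author, publish time) from the Open Graph meta tags
+3. Extracts the body from the `#js_content` div
+4. Converts the HTML to Markdown with markdownify
+5. Saves a Markdown file with YAML Front Matter
+
+## Script directory
+
+**Base directory**: `~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader`
+
+**Script location**: `scripts/export.py`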
+
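+## Setup
+
+### First-time installation
+
+1. **Check the Python dependencies**:
+```bash
+python3 -c "import requests, bs4, lxml, markdownify" 2>/dev/null || echo "Dependencies need to be installed"
+```
+
+2. **Install them if needed**:
+```bash
+pip3 install requests beautifulsoup4 lxml markdownify
+```
+
+### No configuration required
+
+This skill works out of the box, with no API key or extra configuration. It fetches WeChat articles over HTTP using browser-like request headers.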
+
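+## Execution steps
+
+When this skill is triggered, follow these steps:
+
+### Step 1: Extract the URL
+
+Identify the WeChat article URL in the user's request. Valid URLs start with:
+- `https://mp.weixin.qq.com/s/`
+- `https://mp.weixin.qq.com/...`
+
+### Step 2: Determine the output directory
+
+Default output directory: `~/.openclaw/workspace-qiming/source`
+
+The user may specify a custom output directory.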
+
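+### Step 3: Run the export script
+
+```bash
+# Create the output directory if needed
+mkdir -p "$OUTPUT_DIR"
+
+# Run the export script
+python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "$URL" "$OUTPUT_DIR"
+```
+
+### Step 4: Report the result
+
+Tell the user:
+- Whether the export succeeded or failed
+- The output file path
+- The article title and metadata
+- Any errors or warnings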
+
+## Command examples
+
+```bash
+# Basic export
+python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "https://mp.weixin.qq.com/s/xxx" ~/.openclaw/workspace-qiming/source
+
+# Custom output directory
+python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "$URL" "/path/to/output"
+```
+
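+## Output format
+
+The exported Markdown file contains the following. The header labels are written in Chinese by the script:
+
+```yaml
+---
+title: Article title
+author: Author name
+publish_time: Publish time
+source_url: Original link
+exported_at: Export timestamp
+description: Article description
+---
+
+# Article title
+
+> 原文链接: URL
+
+**作者**: XXX
+**发布时间**: XXX
+
+-----
+
+Article body...
+```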
+
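+## File naming
+
+Generated files follow the pattern `YYYYMMDD_HHMMSS_<article title>.md`.
+
+Special characters in the title are replaced to keep the name file-system compatible.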
+
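+## Common problems and limitations
+
+### Common problems
+
+| Problem | Cause | Solution |
+|------|------|----------|
+| "无法找到文章正文内容" (article body not found) | The article requires login or has been deleted | Open it in a browser, or use a browser tool |
+| Connection timeout | Network problems or rate limiting | Wait and retry; check the network connection |
+| Encoding problems | Special characters | The script handles UTF-8 automatically |
+
+### Known limitations
+
+- **Login-only articles**: some articles can only be viewed after logging in to WeChat
+- **Anti-scraping**: WeChat has anti-bot measures that may block frequent requests
+- **Images**: article images are not downloaded; only the Markdown text is saved
+- **Complex formatting**: not all formatting may be preserved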
+
+## Dependencies
+
+| Package | Version | Purpose |
+|------|------|------|
+| requests | >=2.31.0 | HTTP requests |
+| beautifulsoup4 | >=4.12.0 | HTML parsing |
+| lxml | >=4.9.0 | XML/HTML parser |
+| markdownify | >=0.11.6 | HTML-to-Markdown conversion |
+
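+## Error handling
+
+The script:
+- Prints clear error messages in Chinese
+- Exits with a non-zero status when the download fails, the URL is invalid, or a dependency is missing
+- Reports missing dependencies together with the install command
+- Validates the URL format before processing
+
+## Source
+
+Based on the wechat-article-exporter project:
+- GitHub: https://github.com/wechat-article/wechat-article-exporter
+- This skill was created by 启明
+
+## License
+
+MIT License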
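Step 2 of "How it works" in SKILL.md reads metadata from Open Graph `<meta>` tags. The sketch below illustrates that idea using only the standard library's `html.parser`, not BeautifulSoup, which is what `export.py` actually uses. The class name `OpenGraphParser` and the sample HTML are made up for illustration; the fallback strings match the defaults in the script.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:"):
            self.meta[prop] = attr.get("content") or ""


# Hypothetical page fragment, shaped like the tags the script looks for
sample_html = """
<html><head>
<meta property="og:title" content="Sample title" />
<meta property="og:article:author" content="Sample author" />
</head><body><div id="js_content"><p>Hello</p></div></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)
title = parser.meta.get("og:title", "未知标题")
author = parser.meta.get("og:article:author", "未知作者")
print(title, "/", author)
```

A missing tag simply falls back to the placeholder, which is why an export can still succeed with "unknown" metadata when WeChat serves a login or verification page.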
--- a/wechat-article-reader/_meta.json
+++ b/wechat-article-reader/_meta.json
@@ -0,0 +1,6 @@
+{
+  "ownerId": "kn7avmbr8ptcycbc5cqj5sww7n82dy20",
+  "slug": "wechat-article-reader",
+  "version": "1.0.0",
+  "publishedAt": 1772769723356
+}
--- a/wechat-article-reader/requirements.txt
+++ b/wechat-article-reader/requirements.txt
@@ -0,0 +1,4 @@
+requests>=2.31.0
+beautifulsoup4>=4.12.0
+lxml>=4.9.0
+markdownify>=0.11.6
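To check that the four packages above are importable before running the script, something like the following could be used. It is only a sketch: `missing_modules` is a hypothetical helper, it looks up import names (`bs4` is the module provided by `beautifulsoup4`), and it does not verify the minimum versions listed in `requirements.txt`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names provided by the packages in requirements.txt
required = ["requests", "bs4", "lxml", "markdownify"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable")
```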
--- a/wechat-article-reader/scripts/export.py
+++ b/wechat-article-reader/scripts/export.py
@@ -0,0 +1,234 @@
+#!/usr/bin/env python3
+"""
+WeChat Official Account article export tool (Python version)
+
+Install dependencies:
+  pip install requests beautifulsoup4 lxml markdownify
+
+Usage:
+  python export.py <article URL> [output directory]
+
+Example:
+  python export.py https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw ./output
+"""
+
+import sys
+import os
+import re
+from datetime import datetime
+from urllib.parse import urlparse, parse_qs
+import argparse
+import json
+
+try:
+    import requests
+    from bs4 import BeautifulSoup
+    from markdownify import markdownify as md
+except ImportError as e:
+    print(f"错误: 缺少必要的库: {e}")
+    print("请运行: pip install requests beautifulsoup4 lxml markdownify")
+    sys.exit(1)
+
+
+def get_default_output_dir():
+    """Return the `source` directory of the first workspace that has one."""
+    # Common workspace locations
+    workspace_candidates = [
+        os.path.expanduser("~/.openclaw/workspace-qiming"),
+        os.path.expanduser("~/.openclaw/workspace"),
+        os.path.expanduser("~/workspace"),
+    ]
+    
+    for workspace in workspace_candidates:
+        source_dir = os.path.join(workspace, "source")
+        if os.path.isdir(source_dir):
+            return source_dir
+    
+    # If none exists, fall back to `source` under the first candidate
+    return os.path.join(workspace_candidates[0], "source")
+
+
+class WechatArticleExporter:
+    """Exporter for WeChat Official Account articles."""
+
+    def __init__(self, url, output_dir=None):
+        self.url = url
+        self.output_dir = output_dir if output_dir else get_default_output_dir()
+        self.session = requests.Session()
+        self.session.headers.update({
+            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
+            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
+        })
+
+    def extract_meta(self, soup):
+        """Extract article metadata from the Open Graph meta tags."""
+        meta = {}
+
+        # Title
+        title_tag = soup.find('meta', property='og:title')
+        meta['title'] = title_tag.get('content', '未知标题') if title_tag else '未知标题'
+
+        # Author
+        author_tag = soup.find('meta', property='og:article:author')
+        meta['author'] = author_tag.get('content', '未知作者') if author_tag else '未知作者'
+
+        # Publish time
+        time_tag = soup.find('meta', property='og:article:published_time')
+        meta['publish_time'] = time_tag.get('content', '未知时间') if time_tag else '未知时间'
+
+        # Description
+        desc_tag = soup.find('meta', property='og:description')
+        meta['description'] = desc_tag.get('content', '') if desc_tag else ''
+
+        # Account name (read from the same tag as the author)
+        account_tag = soup.find('meta', property='og:article:author')
+        meta['account'] = account_tag.get('content', '') if account_tag else ''
+
+        return meta
+
+    def extract_content(self, soup):
+        """Extract the article body."""
+        # The body of a WeChat article is normally in the div with id="js_content"
+        content_div = soup.find('div', id='js_content')
+
+        if not content_div:
+            return None
+
+        return content_div
+
+    def convert_to_markdown(self, html_content):
+        """Convert HTML content to Markdown."""
+        if not html_content:
+            return ""
+
+        # Convert with markdownify
+        markdown_text = md(str(html_content))
+
+        return markdown_text
+
+    def sanitize_filename(self, filename):
+        """Replace characters that are illegal in file names."""
+        # Replace characters that are illegal in Windows/Linux file names with "_"
+        illegal_chars = r'[<>:"/\\|?*]'
+        safe_filename = re.sub(illegal_chars, '_', filename)
+        # Collapse whitespace runs into "_" and strip leading/trailing dots
+        safe_filename = re.sub(r'\s+', '_', safe_filename)
+        safe_filename = safe_filename.strip('.')
+        return safe_filename
+
+    def export(self):
+        """Download the article and write it to a Markdown file."""
+        print(f"正在下载文章: {self.url}")
+
+        try:
+            response = self.session.get(self.url, timeout=30)
+            response.raise_for_status()
+        except requests.RequestException as e:
+            print(f"错误: 无法下载文章 - {e}")
+            return False
+
+        # Parse the HTML
+        soup = BeautifulSoup(response.text, 'lxml')
+
+        # Extract the metadata
+        meta = self.extract_meta(soup)
+        print(f"标题: {meta['title']}")
+        print(f"作者: {meta['author']}")
+        print(f"发布时间: {meta['publish_time']}")
+
+        # Extract the body
+        content_div = self.extract_content(soup)
+
+        if not content_div:
+            print("警告: 无法找到文章正文内容")
+            print("可能的原因:")
+            print("  1. 文章需要登录才能查看")
+            print("  2. 文章已被删除或设为私密")
+            print("  3. 微信反爬虫机制")
+            markdown_content = ""
+        else:
+            # Convert to Markdown
+            markdown_content = self.convert_to_markdown(content_div)
+            print(f"正文长度: {len(markdown_content)} 字符")
+
+        # Build the output file name
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+        safe_title = self.sanitize_filename(meta['title'])
+        filename = f"{timestamp}_{safe_title}.md"
+
+        # Make sure the output directory exists
+        os.makedirs(self.output_dir, exist_ok=True)
+        output_path = os.path.join(self.output_dir, filename)
+
+        # Write the Markdown file
+        with open(output_path, 'w', encoding='utf-8') as f:
+            # YAML front matter
+            f.write("---\n")
+            f.write(f"title: {meta['title']}\n")
+            f.write(f"author: {meta['author']}\n")
+            f.write(f"publish_time: {meta['publish_time']}\n")
+            f.write(f"source_url: {self.url}\n")
+            f.write(f"exported_at: {datetime.now().isoformat()}\n")
+            if meta.get('description'):
+                f.write(f"description: {meta['description']}\n")
+            f.write("---\n\n")
+
+            # Title and header
+            f.write(f"# {meta['title']}\n\n")
+            f.write(f"> 原文链接: {self.url}\n\n")
+            f.write("**作者**: " + meta['author'] + "\n\n")
+            f.write("**发布时间**: " + meta['publish_time'] + "\n\n")
+            f.write("-----\n\n")
+
+            # Body, or a placeholder if it could not be extracted
+            if markdown_content:
+                f.write(markdown_content)
+            else:
+                f.write("**无法提取正文内容，请手动复制或查看原文**\n\n")
+
+        print(f"\n✓ 文章已导出到: {output_path}")
+        return True
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description='微信公众号文章导出工具',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+示例:
+  %(prog)s https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw
+  %(prog)s https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw ./output
+  %(prog)s https://mp.weixin.qq.com/s/xxx -o ./articles
+
+注意:
+  - 微信有反爬虫机制，部分文章可能无法完整提取
+  - 建议配合浏览器扩展使用（如 MarkDownload）
+        """
+    )
+    parser.add_argument('url', help='微信公众号文章URL')
+    parser.add_argument('output_dir', nargs='?', default=None,
+                       help='输出目录（默认: 自动识别工作空间 source 目录）')
+    parser.add_argument('-o', '--output', dest='output_dir_alt',
+                       help='输出目录（等同于位置参数）')
+
+    args = parser.parse_args()
+
+    # Prefer -o, then the positional argument, then the default workspace source directory
+    output_dir = args.output_dir_alt or args.output_dir or get_default_output_dir()
+
+    # Validate the URL
+    if not args.url.startswith('https://mp.weixin.qq.com/'):
+        print("错误: 不是有效的微信公众号文章URL")
+        print("URL应该以 https://mp.weixin.qq.com/ 开头")
+        sys.exit(1)
+
+    # Create the exporter and run the export
+    exporter = WechatArticleExporter(args.url, output_dir)
+    success = exporter.export()
+
+    sys.exit(0 if success else 1)
+
+
+if __name__ == '__main__':
+    main()
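`export()` writes front-matter values unquoted, so a title containing `: ` or starting with `#` may not parse as the intended YAML string. One possible mitigation, which is an assumption and not something the script does, is to emit each value as a JSON string; YAML parsers accept that form as a double-quoted scalar. `front_matter` below is a hypothetical helper.

```python
import json


def front_matter(fields: dict) -> str:
    """Render a YAML front-matter block with every value written as a quoted string."""
    lines = ["---"]
    for key, value in fields.items():
        # json.dumps escapes quotes, backslashes and newlines;
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(front_matter({"title": 'AI 周报: "第 1 期"', "author": "启明"}))
```

The trade-off is that every value becomes a quoted string, so the output no longer matches the unquoted layout shown in the README and SKILL.md examples.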