first commit
145
wechat-article-reader/README.md
Normal file
@@ -0,0 +1,145 @@
# WeChat Official Account Article Export Skill

> A skill that exports WeChat Official Account articles to Markdown, for Claude Code / OpenClaw.

[MIT License](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
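
## Features

- One-command export of WeChat Official Account articles to Markdown
- Automatic metadata extraction (title, author, publish time)
- Output in a consistent format with YAML front matter
- No API key required; works out of the box
- Supports both Chinese and English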

## Installation

### As a Claude Code / OpenClaw skill

1. Clone this repository into your skills directory:

```bash
# Claude Code
git clone https://github.com/启明/WeChat-article-reader.git ~/.claude/skills/WeChat-article-reader

# OpenClaw
git clone https://github.com/启明/WeChat-article-reader.git ~/.openclaw/workspace/skills/WeChat-article-reader
```

2. Install the Python dependencies (run this from the cloned repository directory):

```bash
pip3 install -r requirements.txt
```

### Standalone command-line use

```bash
# Install dependencies
pip3 install -r requirements.txt

# Export an article
python3 scripts/export.py "https://mp.weixin.qq.com/s/xxx" ./output
```

## Usage

### In Claude Code

Provide the WeChat Official Account article link directly:

```
Download this article: https://mp.weixin.qq.com/s/xxx
```

The skill then automatically:
1. Fetches the article page
2. Extracts the metadata and body
3. Saves the result as a Markdown file
4. Reports the output location

### Command line

```bash
python3 scripts/export.py <article URL> [output directory]
```

## Output format

The exported Markdown file starts with YAML front matter, followed by the article header and body:

```yaml
---
title: <article title>
author: <author name>
publish_time: <publish time>
source_url: <original article URL>
exported_at: <export timestamp>
description: <article description>
---

# <article title>

> Source: <URL>

**Author**: XXX
**Published**: XXX

-----

Article body...
```
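
To illustrate how this layout can be consumed, here is a minimal sketch that splits an exported file into its front matter and body. It assumes the simple shape shown above (one `key: value` pair per line between two `---` lines) and is not a general YAML parser; `read_front_matter` is a hypothetical helper name, not part of this project.

```python
from pathlib import Path


def read_front_matter(path):
    """Return (metadata dict, body text) for an exported Markdown file.

    Assumes one `key: value` pair per line between two `---` lines.
    """
    text = Path(path).read_text(encoding="utf-8")
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter found

    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing marker is the article body.
            return meta, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            # Split on the first colon only, so URLs keep their "https:" part.
            meta[key.strip()] = value.strip()
    return {}, text  # closing "---" missing; treat the file as plain text
```

Because values are split on the first colon only, `source_url` survives intact. A title that itself contains `: ` would still need a real YAML parser to be read reliably.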
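
## File naming

Generated files are named `YYYYMMDD_HHMMSS_<article title>.md`.

Characters that are invalid in file names (`< > : " / \ | ? *`) and runs of whitespace in the title are replaced with underscores to keep the name compatible with common file systems.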
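
The naming rule can be sketched as follows. The cleanup steps mirror `sanitize_filename` in `scripts/export.py`; the function name `build_filename` and the optional `now` parameter are illustrative additions so the example is reproducible.

```python
import re
from datetime import datetime


def build_filename(title, now=None):
    """Build `YYYYMMDD_HHMMSS_<title>.md` from an article title."""
    now = now or datetime.now()
    # Replace characters that Windows or Linux reject in file names.
    safe_title = re.sub(r'[<>:"/\\|?*]', "_", title)
    # Replace each run of whitespace with a single underscore.
    safe_title = re.sub(r"\s+", "_", safe_title)
    # Drop leading and trailing dots.
    safe_title = safe_title.strip(".")
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{safe_title}.md"


# prints 20250102_030405_Q&A__what_is_new_.md
print(build_filename("Q&A: what is new?", datetime(2025, 1, 2, 3, 4, 5)))
```

Note that the colon and the following space each become an underscore, which is why the example contains a double underscore.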
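
## Limitations

- Some articles can only be viewed after logging in to WeChat
- WeChat has anti-scraping measures; frequent requests may be rate-limited
- Only text is exported; images are not downloaded
- Complex layouts may not be reproduced exactly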

## Implementation

- **HTTP requests**: `requests` fetches the article page
- **HTML parsing**: `BeautifulSoup` + `lxml` extract the content
- **Format conversion**: `markdownify` converts HTML to Markdown

## Project structure

```
WeChat-article-reader/
├── SKILL.md              # Skill documentation (used by Claude Code)
├── README.md             # Project overview
├── LICENSE               # MIT license
├── requirements.txt      # Python dependencies
├── .gitignore            # Git ignore rules
└── scripts/
    └── export.py         # Export script
```

## Contributing

Issues and pull requests are welcome!

## Acknowledgements

- [wechat-article-exporter](https://github.com/wechat-article/wechat-article-exporter) - the inspiration for this project
- [markdownify](https://github.com/matthewwithanm/python-markdownify) - HTML-to-Markdown converter

## License

[MIT License](LICENSE)

## Author

Created by [Leefee](https://github.com/启明)

---

If this project helps you, please give it a ⭐ Star!
174
wechat-article-reader/SKILL.md
Normal file
@@ -0,0 +1,174 @@
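---
name: WeChat-article-reader
description: "Exports WeChat Official Account articles to Markdown. Supports several post types, including articles, videos, images, and audio. Extracts metadata (title / author / publish time / cover image) and writes a Markdown file with YAML front matter. Triggers: Official Account article (公众号文章), WeChat article (微信文章), mp.weixin.qq.com links, requests to download / export / save WeChat content. Not for: WeChat mini programs, WeChat chat logs, or links that are not Official Account articles."
---

# WeChat Official Account Article Export Skill (WeChat-Article-Reader)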
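
## Trigger conditions

Trigger this skill when:

- The user provides a WeChat Official Account article link (mp.weixin.qq.com)
- The user asks to "download", "export", or "save" a WeChat article
- The user asks to convert a WeChat article to Markdown
- The user mentions "Official Account article" (公众号文章), "WeChat article" (微信文章), "download WeChat" (下载微信), or "export Official Account" (导出公众号)

**Example triggers:**
- "Download this article https://mp.weixin.qq.com/s/xxx"
- "Export this Official Account article as markdown"
- "Save the WeChat article locally"
- "Help me save this WeChat article"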

## How it works

The skill runs a Python script that:
1. Fetches the HTML page of the WeChat article
2. Extracts metadata (title, author, publish time) from the Open Graph meta tags
3. Extracts the body from the `#js_content` div
4. Converts the HTML to Markdown with markdownify
5. Saves the result as a Markdown file with YAML front matter
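
The metadata step can be illustrated with a small, self-contained sketch. The script itself uses BeautifulSoup; this example swaps in the standard library's `html.parser` so it runs without third-party packages. `OpenGraphParser` and `sample_html` are hypothetical names and data made up for the illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect `<meta property="og:..." content="...">` pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and attributes.get("content") is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, attributes["content"])


# Made-up page fragment in the shape the skill expects.
sample_html = """
<html><head>
<meta property="og:title" content="Sample title">
<meta property="og:article:author" content="Sample author">
</head><body><div id="js_content"><p>Hello</p></div></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)
print(parser.properties.get("og:title", "Unknown title"))
print(parser.properties.get("og:article:author", "Unknown author"))
```

Falling back to a placeholder when a tag is missing follows the same pattern as `extract_meta` in the script.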

## Script location

**Base directory**: `~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader`

**Script path** (relative to the base directory): `scripts/export.py`

## Setup

### First-time installation

1. **Check the Python dependencies**:
```bash
python3 -c "import requests, bs4, lxml, markdownify" 2>/dev/null || echo "Dependencies need to be installed"
```

2. **Install them if needed**:
```bash
pip3 install requests beautifulsoup4 lxml markdownify
```
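
The same check can be done from Python, which also reports which modules are missing. This is a sketch rather than part of the skill; `REQUIRED_MODULES` and `missing_modules` are illustrative names. The list uses import names, so `beautifulsoup4` appears as `bs4`.

```python
from importlib.util import find_spec

# Import names (not pip package names): beautifulsoup4 installs as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "lxml", "markdownify"]


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are available")
```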

### No configuration required

The skill works out of the box; no API key or extra configuration is needed. It fetches WeChat articles with HTTP requests that send browser-like headers.

## Execution steps

When the skill is triggered, follow these steps:

### Step 1: Extract the URL

Identify the WeChat article URL in the user's request. Valid URLs start with:
- `https://mp.weixin.qq.com/s/`
- `https://mp.weixin.qq.com/...`
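
One possible way to pull such a URL out of a free-form request is a regular expression. This is only a sketch: the pattern and the name `find_article_url` are assumptions for illustration, and the character set is deliberately limited to ASCII so that Chinese punctuation directly after a link is not swallowed.

```python
import re

# Match links on the mp.weixin.qq.com host. With re.ASCII, \w covers only
# [A-Za-z0-9_], so full-width punctuation ends the match.
WECHAT_URL_PATTERN = re.compile(
    r"https://mp\.weixin\.qq\.com/[\w\-./?=&%#+]+", re.ASCII
)


def find_article_url(message):
    """Return the first WeChat article URL in `message`, or None."""
    match = WECHAT_URL_PATTERN.search(message)
    if match is None:
        return None
    # Trim punctuation that often trails a URL at the end of a sentence.
    return match.group(0).rstrip(".?")


# prints https://mp.weixin.qq.com/s/xxx
print(find_article_url("Download this article https://mp.weixin.qq.com/s/xxx please"))
```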

### Step 2: Determine the output directory

Default output directory: `~/.openclaw/workspace-qiming/source`

The user may specify a custom output directory instead.
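
The selection logic could look roughly like this. The candidate list follows `get_default_output_dir` in `scripts/export.py`; the function name `resolve_output_dir` and its parameters are illustrative, not the script's API.

```python
import os


def resolve_output_dir(user_dir=None, candidates=None):
    """Use the user's directory if given, else the first existing candidate."""
    if user_dir:
        return os.path.expanduser(user_dir)
    candidates = candidates or [
        "~/.openclaw/workspace-qiming/source",
        "~/.openclaw/workspace/source",
        "~/workspace/source",
    ]
    expanded = [os.path.expanduser(path) for path in candidates]
    for path in expanded:
        if os.path.isdir(path):
            return path
    # Nothing exists yet: fall back to the first candidate.
    return expanded[0]


print(resolve_output_dir("./output"))
```

A directory supplied by the user always wins; the workspace candidates are only consulted when none is given.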

### Step 3: Run the export script

```bash
# Create the output directory if needed
mkdir -p "$OUTPUT_DIR"

# Run the export script
python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "$URL" "$OUTPUT_DIR"
```

### Step 4: Report the result

Tell the user:
- Whether the export succeeded or failed
- The output file path
- The article title and metadata
- Any errors or warnings

## Command examples

```bash
# Basic export
python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "https://mp.weixin.qq.com/s/xxx" ~/.openclaw/workspace-qiming/source

# Custom output directory
python3 ~/.npm-global/lib/node_modules/openclaw/skills/WeChat-article-reader/scripts/export.py "$URL" "/path/to/output"
```

## Output format

The exported Markdown file contains:

```yaml
---
title: <article title>
author: <author name>
publish_time: <publish time>
source_url: <original article URL>
exported_at: <export timestamp>
description: <article description>
---

# <article title>

> Source: <URL>

**Author**: XXX
**Published**: XXX

-----

Article body...
```
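
The script writes front matter values as plain text, so a title containing `: ` may not parse as a single YAML string. As a possible variation (not what `export.py` does), values can be quoted with `json.dumps`, since JSON strings are also valid YAML double-quoted scalars. `format_front_matter` is a hypothetical helper name.

```python
import json


def format_front_matter(fields):
    """Render a dict as a YAML front matter block with quoted values."""
    lines = ["---"]
    for key, value in fields.items():
        if value:  # skip empty values, as the script does for `description`
            quoted = json.dumps(str(value), ensure_ascii=False)
            lines.append(f"{key}: {quoted}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(format_front_matter({
    "title": "Notes: part 1",
    "source_url": "https://mp.weixin.qq.com/s/xxx",
    "description": "",
}))
```

`ensure_ascii=False` keeps Chinese titles readable in the output instead of turning them into `\uXXXX` escapes.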

## File naming

Generated files are named `YYYYMMDD_HHMMSS_<article title>.md`.

Special characters in the title are replaced to keep the name compatible with common file systems.
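
## Troubleshooting and limitations

### Common problems

| Problem | Cause | Solution |
|---------|-------|----------|
| "Unable to find the article body" | The article requires login or has been deleted | Try opening it in a browser, or use a browser tool |
| Connection timeout | Network problem or rate limiting | Wait and retry; check the network connection |
| Encoding problems | Special characters | The script handles UTF-8 automatically |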
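
The "wait and retry" advice for timeouts could be automated along these lines. This is a generic sketch, not something the script does today; `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be replaced in tests.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` tries are used up.

    Waits base_delay, then twice that, then four times that (and so on)
    between tries, and re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With the script's session this might be called as `retry_with_backoff(lambda: session.get(url, timeout=30))`. In real use it would be better to catch only `requests.RequestException`, and to keep the delays generous given WeChat's rate limiting.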
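
### Known limitations

- **Login-only articles**: some articles can only be viewed after logging in to WeChat
- **Anti-scraping**: WeChat has anti-bot measures that may block frequent requests
- **Images**: article images are not downloaded; only the Markdown text is saved
- **Complex formatting**: some formatting may not be preserved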

## Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| requests | >=2.31.0 | HTTP requests |
| beautifulsoup4 | >=4.12.0 | HTML parsing |
| lxml | >=4.9.0 | XML/HTML parser |
| markdownify | >=0.11.6 | HTML-to-Markdown conversion |

## Error handling

The script:
- Prints clear error messages
- Exits with an appropriate status code (0 on success, 1 on failure)
- Handles missing dependencies gracefully
- Validates the URL format before processing

## Source

Based on the wechat-article-exporter project:
- GitHub: https://github.com/wechat-article/wechat-article-exporter
- This skill was created by 启明

## License

MIT License
6
wechat-article-reader/_meta.json
Normal file
@@ -0,0 +1,6 @@
{
  "ownerId": "kn7avmbr8ptcycbc5cqj5sww7n82dy20",
  "slug": "wechat-article-reader",
  "version": "1.0.0",
  "publishedAt": 1772769723356
}
4
wechat-article-reader/requirements.txt
Normal file
@@ -0,0 +1,4 @@
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
markdownify>=0.11.6
234
wechat-article-reader/scripts/export.py
Normal file
@@ -0,0 +1,234 @@
#!/usr/bin/env python3
"""
WeChat Official Account article exporter (Python version)

Install dependencies:
    pip install requests beautifulsoup4 lxml markdownify

Usage:
    python export.py <article URL> [output directory]

Example:
    python export.py https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw ./output
"""

import sys
import os
import re
from datetime import datetime
import argparse

try:
    import requests
    from bs4 import BeautifulSoup
    from markdownify import markdownify as md
except ImportError as e:
    print(f"Error: a required library is missing: {e}")
    print("Run: pip install requests beautifulsoup4 lxml markdownify")
    sys.exit(1)


def get_default_output_dir():
    """Return the `source` directory of the first workspace that has one."""
    # Common workspace locations
    workspace_candidates = [
        os.path.expanduser("~/.openclaw/workspace-qiming"),
        os.path.expanduser("~/.openclaw/workspace"),
        os.path.expanduser("~/workspace"),
    ]

    for workspace in workspace_candidates:
        source_dir = os.path.join(workspace, "source")
        if os.path.isdir(source_dir):
            return source_dir

    # None of them exist: fall back to the first candidate's source directory
    return os.path.join(workspace_candidates[0], "source")


class WechatArticleExporter:
    """Exporter for WeChat Official Account articles."""

    def __init__(self, url, output_dir=None):
        self.url = url
        self.output_dir = output_dir if output_dir else get_default_output_dir()
        self.session = requests.Session()
        # Send browser-like headers with every request
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        })

    def extract_meta(self, soup):
        """Extract the article metadata from Open Graph meta tags."""
        meta = {}

        # Title
        title_tag = soup.find('meta', property='og:title')
        meta['title'] = title_tag.get('content', 'Unknown title') if title_tag else 'Unknown title'

        # Author
        author_tag = soup.find('meta', property='og:article:author')
        meta['author'] = author_tag.get('content', 'Unknown author') if author_tag else 'Unknown author'

        # Publish time
        time_tag = soup.find('meta', property='og:article:published_time')
        meta['publish_time'] = time_tag.get('content', 'Unknown time') if time_tag else 'Unknown time'

        # Description
        desc_tag = soup.find('meta', property='og:description')
        meta['description'] = desc_tag.get('content', '') if desc_tag else ''

        # Official Account name (read from the same og:article:author tag as the author)
        account_tag = soup.find('meta', property='og:article:author')
        meta['account'] = account_tag.get('content', '') if account_tag else ''

        return meta

    def extract_content(self, soup):
        """Return the article body element, or None if it is missing."""
        # The body of a WeChat article is usually in the div with id="js_content"
        return soup.find('div', id='js_content')

    def convert_to_markdown(self, html_content):
        """Convert HTML content to Markdown."""
        if not html_content:
            return ""

        # Convert with markdownify
        markdown_text = md(str(html_content))

        return markdown_text

    def sanitize_filename(self, filename):
        """Replace characters that are not allowed in file names."""
        # Replace characters that are illegal in Windows/Linux file names
        illegal_chars = r'[<>:"/\\|?*]'
        safe_filename = re.sub(illegal_chars, '_', filename)
        # Replace whitespace runs with an underscore and trim leading/trailing dots
        safe_filename = re.sub(r'\s+', '_', safe_filename)
        safe_filename = safe_filename.strip('.')
        return safe_filename

    def export(self):
        """Download the article and write it out as a Markdown file."""
        print(f"Downloading article: {self.url}")

        try:
            response = self.session.get(self.url, timeout=30)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Error: could not download the article - {e}")
            return False

        # Parse the HTML
        soup = BeautifulSoup(response.text, 'lxml')

        # Extract the metadata
        meta = self.extract_meta(soup)
        print(f"Title: {meta['title']}")
        print(f"Author: {meta['author']}")
        print(f"Published: {meta['publish_time']}")

        # Extract the article body
        content_div = self.extract_content(soup)

        if content_div is None:
            print("Warning: unable to find the article body")
            print("Possible causes:")
            print("  1. The article requires login to view")
            print("  2. The article was deleted or made private")
            print("  3. WeChat's anti-scraping measures")
            markdown_content = ""
        else:
            # Convert to Markdown
            markdown_content = self.convert_to_markdown(content_div)
            print(f"Body length: {len(markdown_content)} characters")

        # Build the output file name
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        safe_title = self.sanitize_filename(meta['title'])
        filename = f"{timestamp}_{safe_title}.md"

        # Make sure the output directory exists
        os.makedirs(self.output_dir, exist_ok=True)
        output_path = os.path.join(self.output_dir, filename)

        # Write the Markdown file
        with open(output_path, 'w', encoding='utf-8') as f:
            # YAML front matter
            f.write("---\n")
            f.write(f"title: {meta['title']}\n")
            f.write(f"author: {meta['author']}\n")
            f.write(f"publish_time: {meta['publish_time']}\n")
            f.write(f"source_url: {self.url}\n")
            f.write(f"exported_at: {datetime.now().isoformat()}\n")
            if meta.get('description'):
                f.write(f"description: {meta['description']}\n")
            f.write("---\n\n")

            # Heading and article header
            f.write(f"# {meta['title']}\n\n")
            f.write(f"> Source: {self.url}\n\n")
            f.write("**Author**: " + meta['author'] + "\n\n")
            f.write("**Published**: " + meta['publish_time'] + "\n\n")
            f.write("-----\n\n")

            # Article body
            if markdown_content:
                f.write(markdown_content)
            else:
                f.write("**The article body could not be extracted; copy it manually or view the original**\n\n")

        print(f"\n✓ Article exported to: {output_path}")
        return True


def main():
    parser = argparse.ArgumentParser(
        description='WeChat Official Account article exporter',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw
  %(prog)s https://mp.weixin.qq.com/s/J05F7C_DGmsOoBIEZd-Fuw ./output
  %(prog)s https://mp.weixin.qq.com/s/xxx -o ./articles

Notes:
  - WeChat has anti-scraping measures; some articles may not be extracted completely
  - Consider pairing this tool with a browser extension such as MarkDownload
"""
    )
    parser.add_argument('url', help='WeChat Official Account article URL')
    parser.add_argument('output_dir', nargs='?', default=None,
                        help='Output directory (default: the auto-detected workspace source directory)')
    parser.add_argument('-o', '--output', dest='output_dir_alt',
                        help='Output directory (same as the positional argument)')

    args = parser.parse_args()

    # Prefer -o, then the positional argument, then the default workspace source directory
    output_dir = args.output_dir_alt or args.output_dir or get_default_output_dir()

    # Validate the URL
    if not args.url.startswith('https://mp.weixin.qq.com/'):
        print("Error: not a valid WeChat Official Account article URL")
        print("The URL should start with https://mp.weixin.qq.com/")
        sys.exit(1)

    # Create the exporter and run the export
    exporter = WechatArticleExporter(args.url, output_dir)
    success = exporter.export()

    sys.exit(0 if success else 1)


if __name__ == '__main__':
    main()