first commit

2026-05-10 13:52:46 +08:00
commit ccc63d1e70
4583 changed files with 584341 additions and 0 deletions
--- a/dogfood/SKILL.md
+++ b/dogfood/SKILL.md
@@ -0,0 +1,419 @@
+---
+name: dogfood
+description: "Exploratory QA of web apps: find bugs, evidence, reports."
+version: 1.0.0
+metadata:
+  hermes:
+    tags: [qa, testing, browser, web, dogfood]
+    related_skills: []
+---
+
+# Dogfood: Systematic Web Application QA Testing
+
+## Overview
+
+This skill guides you through systematic exploratory QA testing of web applications. It supports **two execution modes** depending on context:
+
+1. **Browser-first** (default): Use browser toolset to navigate, interact, and capture evidence live.
+2. **Source-code-first** (fallback): When browser automation is unavailable, slow, or times out, analyze the source code to enumerate all routes/endpoints/pages, then build a comprehensive test plan document. Execute browser tests selectively afterward.
+
+For **multi-service sites** (e.g., separate auth/blog/canvas services), prefer the source-code-first approach — it produces more complete coverage faster than crawling.
+
+## Prerequisites
+
+- Browser toolset: either the built-in tools (`browser_navigate`, `browser_snapshot`, etc.) **or** Playwright Python (see `references/playwright-qa.md`) — optional if using source-code-first mode
+- A target URL and testing scope from the user
+- Source code access (repo clone or codebase) — strongly recommended for multi-service sites
+
+## Inputs
+
+The user provides:
+1. **Target URL** — the entry point for testing
+2. **Scope** — what areas/features to focus on (or "full site" for comprehensive testing)
+3. **Output directory** (optional) — where to save screenshots and the report (default: `./dogfood-output`)
+
+## Workflow
+
+Follow this 5-phase systematic workflow:
+
+### Phase 1: Plan
+
+1. Create the output directory structure:
+   ```
+   {output_dir}/
+   ├── screenshots/       # Evidence screenshots
+   └── report.md          # Final report (generated in Phase 5)
+   ```
+2. Identify the testing scope based on user input.
+3. Build a rough sitemap by planning which pages and features to test:
+   - Landing/home page
+   - Navigation links (header, footer, sidebar)
+   - Key user flows (sign up, login, search, checkout, etc.)
+   - Forms and interactive elements
+   - Edge cases (empty states, error pages, 404s)
+
+### Phase 2: Explore
+
+For each page or feature in your plan:
+
+1. **Navigate** to the page:
+   ```
+   browser_navigate(url="https://example.com/page")
+   ```
+
+2. **Take a snapshot** to understand the DOM structure:
+   ```
+   browser_snapshot()
+   ```
+
+3. **Check the console** for JavaScript errors:
+   ```
+   browser_console(clear=true)
+   ```
+   Do this after every navigation and after every significant interaction. Silent JS errors are high-value findings.
+
+4. **Take an annotated screenshot** to visually assess the page and identify interactive elements:
+   ```
+   browser_vision(question="Describe the page layout, identify any visual issues, broken elements, or accessibility concerns", annotate=true)
+   ```
+   The `annotate=true` flag overlays numbered `[N]` labels on interactive elements. Each `[N]` maps to ref `@eN` for subsequent browser commands.
+
+5. **Test interactive elements** systematically:
+   - Click buttons and links: `browser_click(ref="@eN")`
+   - Fill forms: `browser_type(ref="@eN", text="test input")`
+   - Test keyboard navigation: `browser_press(key="Tab")`, `browser_press(key="Enter")`
+   - Scroll through content: `browser_scroll(direction="down")`
+   - Test form validation with invalid inputs
+   - Test empty submissions
+
+6. **After each interaction**, check for:
+   - Console errors: `browser_console()`
+   - Visual changes: `browser_vision(question="What changed after the interaction?")`
+   - Expected vs actual behavior
+
+### Phase 3: Collect Evidence
+
+For every issue found:
+
+1. **Take a screenshot** showing the issue:
+   ```
+   browser_vision(question="Capture and describe the issue visible on this page", annotate=false)
+   ```
+   Save the `screenshot_path` from the response — you will reference it in the report.
+
+2. **Record the details**:
+   - URL where the issue occurs
+   - Steps to reproduce
+   - Expected behavior
+   - Actual behavior
+   - Console errors (if any)
+   - Screenshot path
+
+3. **Classify the issue** using the issue taxonomy (see `references/issue-taxonomy.md`):
+   - Severity: Critical / High / Medium / Low
+   - Category: Functional / Visual / Accessibility / Console / UX / Content
+
+### Phase 4: Categorize
+
+1. Review all collected issues.
+2. De-duplicate — merge issues that are the same bug manifesting in different places.
+3. Assign final severity and category to each issue.
+4. Sort by severity (Critical first, then High, Medium, Low).
+5. Count issues by severity and category for the executive summary.
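+
+The de-duplicate, sort, and count steps above can be sketched in Python. This is a minimal illustration, not a prescribed data model: the `Issue` fields, the `categorize` helper, and the use of `(title, category)` as the duplicate key are all assumptions.
+
+```python
+from collections import Counter
+from dataclasses import dataclass
+
+# Severity order from the issue taxonomy: Critical first, Low last.
+SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}
+
+@dataclass
+class Issue:  # hypothetical record; field names are illustrative
+    title: str
+    severity: str
+    category: str
+    url: str
+
+def categorize(issues):
+    """De-duplicate, sort by severity, and count for the executive summary."""
+    merged = {}
+    for issue in issues:
+        key = (issue.title.strip().lower(), issue.category)  # assumed duplicate key
+        kept = merged.get(key)
+        # Keep the most severe instance when the same bug shows up on several pages.
+        if kept is None or SEVERITY_ORDER[issue.severity] < SEVERITY_ORDER[kept.severity]:
+            merged[key] = issue
+    ordered = sorted(merged.values(), key=lambda i: SEVERITY_ORDER[i.severity])
+    by_severity = Counter(i.severity for i in ordered)
+    by_category = Counter(i.category for i in ordered)
+    return ordered, by_severity, by_category
+```
+
+In practice the duplicate key usually needs human judgment; treat the automatic merge as a first pass and review the merged list by hand.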
+
+### Phase 5: Report
+
+Generate the final report using the template at `templates/dogfood-report-template.md`.
+
+The report must include:
+1. **Executive summary** with total issue count, breakdown by severity, and testing scope
+2. **Per-issue sections** with:
+   - Issue number and title
+   - Severity and category badges
+   - URL where observed
+   - Description of the issue
+   - Steps to reproduce
+   - Expected vs actual behavior
+   - Screenshot references (use `MEDIA:<screenshot_path>` for inline images)
+   - Console errors if relevant
+3. **Summary table** of all issues
+4. **Testing notes** — what was tested, what was not, any blockers
+
+Save the report to `{output_dir}/report.md`.
+
+## Alternative Workflow: Source-Code-First (for multi-service / slow-browser sites)
+
+When the target site has source code available, or browser automation is too slow/times out:
+
+### Step 1: Clone and Map the Codebase
+```bash
+git clone <repo_url> /tmp/qa-target
+cd /tmp/qa-target
+find . -name "routes*.py" -o -name "main.py" -o -name "pages.py" -o -name "admin.py" | sort
+```
+
+### Step 2: Enumerate All Routes
+Read each route file and extract:
+- HTTP method + path (e.g., `GET /posts/{slug}`)
+- Required auth/permissions
+- Rate limits
+- Form fields and validation rules
+
+Build a **complete URL inventory** — this is your test matrix.
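+
+As a starting point for the inventory, a rough Python sketch can pull method and path pairs out of route files. It assumes FastAPI-style decorators such as `@router.get("/posts/{slug}")`; the regex and the helper names `routes_in_source` and `build_inventory` are hypothetical, and routes registered dynamically will not be found this way.
+
+```python
+import re
+from pathlib import Path
+
+# Matches decorators such as @app.get("/posts/{slug}") or @router.post('/login').
+ROUTE_RE = re.compile(
+    r"@\w+\.(get|post|put|patch|delete)\(\s*[\"']([^\"']+)[\"']",
+    re.IGNORECASE,
+)
+
+def routes_in_source(source: str):
+    """Return (METHOD, path) pairs found in one file's source text."""
+    return [(m.group(1).upper(), m.group(2)) for m in ROUTE_RE.finditer(source)]
+
+def build_inventory(repo_root: str):
+    """Scan every .py file under repo_root and return a sorted URL inventory."""
+    inventory = set()
+    for path in Path(repo_root).rglob("*.py"):
+        text = path.read_text(encoding="utf-8", errors="replace")
+        inventory.update(routes_in_source(text))
+    return sorted(inventory)
+```
+
+The output only covers method and path. Auth requirements, rate limits, and form fields still have to be read from each handler.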
+
+### Step 3: Analyze Static Assets and Templates
+Check template files for:
+- CSS variable definitions (look for `:root` blocks)
+- JS includes (what scripts are loaded vs missing)
+- Encoding issues (BOM markers, leading newlines before DOCTYPE)
+- Accessibility: `alt` attributes, viewport `user-scalable=no` (zoom disabled), skip links
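+
+The encoding check in the list above can be automated with a few lines of Python. This is a sketch under the assumption that the file is a full HTML page rather than a template partial (partials legitimately lack a DOCTYPE); `encoding_problems` is a hypothetical helper name.
+
+```python
+import codecs
+
+def encoding_problems(raw: bytes):
+    """Return a list of problems found at the very start of an HTML file."""
+    problems = []
+    if raw.startswith(codecs.BOM_UTF8):  # bytes ef bb bf
+        problems.append("UTF-8 BOM before content")
+        raw = raw[len(codecs.BOM_UTF8):]
+    stripped = raw.lstrip()
+    if len(stripped) != len(raw):
+        problems.append("whitespace or newlines before DOCTYPE")
+    if not stripped[:9].lower().startswith(b"<!doctype"):
+        problems.append("document does not start with DOCTYPE")
+    return problems
+```
+
+Pass it the raw bytes of a template file or of a response body, for example `encoding_problems(open("index.html", "rb").read())`.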
+
+### Step 4: HTTP-Level Testing (no browser needed)
+Use `curl` to test:
+- Page loads (HTTP status codes)
+- Static asset availability
+- Response headers (security headers, CSP)
+- Redirect chains (login flows)
+- API endpoints (with/without auth cookies)
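+
+When many pages need the same header check, a small Python helper can complement `curl`. The sketch below uses only the standard library; `missing_security_headers` and `check_page` are hypothetical names, and the expected-header list is taken from the Security Headers dimension of this skill.
+
+```python
+import urllib.request
+
+# Headers listed in the Security Headers dimension of this skill.
+EXPECTED_HEADERS = (
+    "X-Content-Type-Options",
+    "X-Frame-Options",
+    "Referrer-Policy",
+    "Content-Security-Policy",
+)
+
+def missing_security_headers(header_names):
+    """Return expected headers absent from a response (case-insensitive)."""
+    present = {name.lower() for name in header_names}
+    return [h for h in EXPECTED_HEADERS if h.lower() not in present]
+
+def check_page(url: str, timeout: float = 10.0):
+    """Fetch url and return (status_code, missing_headers).
+
+    urlopen follows redirects and raises HTTPError for 4xx/5xx responses,
+    so use curl when the redirect chain itself is under test.
+    """
+    with urllib.request.urlopen(url, timeout=timeout) as resp:
+        return resp.status, missing_security_headers(resp.headers.keys())
+```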
+
+### Step 5: Generate Structured Test Plan
+Output a markdown document with:
+- Service architecture table
+- Test accounts and auth mechanism
+- Per-module test case tables: `ID | Test item | Test steps | Expected result | Priority`
+- Known issues found during source analysis
+- Cross-cutting concerns (consistency, accessibility, security)
+
+### Step 6: Selective Browser Execution
+Only use browser automation for:
+- Login/register interactive flows
+- Visual verification of known issues
+- Console error capture
+- Screenshot evidence
+
+## Alternative Workflow: Test Plan Gap Analysis (improving existing plans)
+
+When the user already has a test plan and wants to **improve/complete it** against source code:
+
+### Step 1: Load Both Inputs in Parallel
+```
+1. Read the existing test-plan.md
+2. Clone or pull the target repo
+3. Use delegate_task to analyze ALL route files + security mechanisms in one pass
+   - Extract: endpoints, form fields, rate limits, cookie params, CSRF mechanism,
+     validation rules, ownership models, public APIs
+   - Focus on features NOT covered by the existing plan
+```
+
+### Step 2: Systematic Comparison
+For each service, compare source code findings against existing test cases:
+- **Endpoints**: Are all HTTP methods + paths covered?
+- **Validation rules**: Password complexity, username blacklists, email uniqueness, slug format
+- **Rate limits**: Are all limiters documented with correct values?
+- **Security mechanisms**: CSRF token format/expiry, cookie attributes, redirect validation
+- **APIs**: Public JSON APIs, service-to-service APIs, ownership isolation
+- **Edge cases**: CRLF normalization, content size limits, cascading deletes
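+
+For the endpoint row of this comparison, a crude first pass can be scripted. The Python sketch below assumes the source inventory is a collection of `(method, path)` pairs and that a path counts as covered if it appears anywhere in the plan text; both the helper name `uncovered_endpoints` and that coverage rule are assumptions. A substring match can hide gaps (for example `/posts` matches inside `/posts/{slug}`), so treat the result as a list of candidates.
+
+```python
+def uncovered_endpoints(source_routes, plan_text: str):
+    """Return routes from the source inventory whose path never appears in the plan."""
+    return [
+        (method, path)
+        for method, path in sorted(source_routes)
+        if path not in plan_text
+    ]
+```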
+
+### Step 3: Patch, Don't Rewrite
+Use targeted `patch` edits to add missing test cases within existing sections:
+- Insert after the related existing case (e.g., `A-015a` after `A-015`)
+- Use sub-numbering convention: `X-NNNa`, `X-NNNb`, ... for insertions between `X-NNN` and the next existing case
+- Preserve existing case numbers — never renumber
+- Add new subsections only when the entire category is missing (e.g., Service API)
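+
+The sub-numbering rule can be expressed as a small Python helper. This is an illustrative sketch; `next_sub_id` is a hypothetical name, and it assumes case IDs have the form of an uppercase prefix, a hyphen, and digits.
+
+```python
+import re
+import string
+
+def next_sub_id(base_id: str, existing_ids) -> str:
+    """Return the first free sub-numbered ID after base_id, e.g. A-015 -> A-015a."""
+    if not re.fullmatch(r"[A-Z]+-\d+", base_id):
+        raise ValueError(f"not a base case ID: {base_id!r}")
+    taken = set(existing_ids)
+    for letter in string.ascii_lowercase:
+        candidate = f"{base_id}{letter}"
+        if candidate not in taken:
+            return candidate
+    raise ValueError(f"no free sub-number left after {base_id}")
+```
+
+Existing IDs are never changed; the helper only picks an unused suffix.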
+
+### Step 4: Update Statistics
+After all patches:
+```bash
+# Count per-module
+grep -c '^| H-[0-9]' test-plan.md  # Home
+grep -c '^| A-[0-9]' test-plan.md  # Auth
+# ... etc for each prefix
+
+# Count total across all prefixes (like the per-module patterns, this also counts sub-numbered IDs such as A-015a)
+grep -c '^| [A-Z][A-Z]*-[0-9]' test-plan.md
+```
+Update both the header stats table and the footer summary table.
+Bump the version number.
+
+### Step 5: Commit
+```bash
+git add test-plan.md && git commit -m "vN.M: complete test plan, add X test cases (old→new)" && git push
+```
+
+### Pitfalls for Gap Analysis
+- **Don't renumber existing cases**: Use `X-NNNa` sub-numbering to insert between existing cases. Renumbering breaks any existing references (issue tracker, test automation).
+- **Count carefully**: `grep -c '^| X-[0-9]'` is a prefix match, so it already counts sub-numbered entries like `A-015a` along with `A-015`. The real risk is over-counting: any other table row whose first cell looks like a case ID (for example in an issue table) is matched too. Check that the per-module counts add up to the total.
+- **Don't duplicate**: Check if a concept is already covered under a different name before adding. "Draft visible" and "Draft preview" might be the same test.
+- **delegate_task for source analysis**: Don't read 40+ route files manually. A single delegate_task with a well-structured prompt produces a complete analysis in one pass.
+
+## Alternative Workflow: Module-by-Module Testing with Incremental Commits
+
+When the user has an existing test plan (e.g., `test-plan.md` in a repo) and wants to execute it module by module, committing results after each:
+
+### Step 1: Initialize Results Document
+Create `test-results.md` with a summary table and placeholder sections for every module. Include: module name, status (⏳), execution time, and empty test result tables.
+
+### Step 2: Test Module → Update → Commit Loop
+For each module:
+1. Execute tests (curl for HTTP-level, Playwright for browser-level)
+2. Update the module's section in `test-results.md` with results
+3. Update the summary table (pass/fail/blocked counts)
+4. Add a "Module N summary" section with key findings
+5. Add a "💡 Module N optimization suggestions" section with prioritized recommendations (user explicitly wants these persisted in the document, not just in chat)
+6. `git add test-results.md && git commit -m "Module N: X passed / Y failed" && git push`
+7. Report progress to user before starting next module
+8. **⚠️ If any test modified content, restore it BEFORE committing**
+
+### Step 3: Parallel Delegation
+Use `delegate_task` with 3 parallel tasks for curl-based modules. Each task tests a group of modules and returns JSON results. Browser-based modules must run sequentially.
+
+### Pitfalls for Module-by-Module
+- **Don't wait until the end to commit**: Session may break, losing all work
+- **Restore content after destructive tests**: Save state before, verify after
+- **Rate limiting blocks repeated tests**: Test rate-limited endpoints last
+- **CSRF token sync**: Use same cookie jar for GET+POST (see `references/multi-service-qa.md`)
+- **Optimization suggestions go in the document, not just the chat**: User wants them persisted
+
+## Deliverables: Split Test Plan + Issue List
+
+**Always split QA output into two separate documents:**
+
+1. **`test-plan.md`** — Structured test cases with execution steps and expected results
+2. **`issue-list.md`** — Known issues found during analysis, with severity and fix suggestions
+
+Do NOT merge them into one report. Users need the test plan for execution assignment and the issue list for bug tracking. Each document should be self-contained.
+
+### Test Plan Structure
+- Service architecture table (services, URLs, ports)
+- Test accounts and auth mechanism explanation
+- Per-module test case tables: `ID | Test item | Test steps | Expected result | Account | Priority`
+- Cover both public pages AND admin/management pages
+- Include a cross-cutting section for security, consistency, accessibility
+
+### Issue List Structure
+- Summary table (count by severity level)
+- Per-issue entries: module, page, phenomenon, impact, root cause, source code location, fix suggestion, priority
+- Consistency matrix (which pages have which features/assets)
+- Fix priority recommendation (immediate / soon / backlog)
+
+## Comprehensive QA Dimensions Checklist
+
+When the user asks for "full" or "complete" testing, cover ALL of these dimensions. If you only covered page navigation and login flows, the plan is incomplete.
+
+### Core (always cover)
+1. **Functional** — Page loads, navigation, CRUD operations, form submissions
+2. **Auth & Permissions** — Login/logout, RBAC, cross-service cookie propagation, admin vs user access
+3. **Input Validation** — Form validation, empty submissions, boundary values, special characters
+
+### Security (cover for any site with user input)
+4. **Cookie Security** — HttpOnly, Secure, SameSite attributes; Max-Age; domain scope
+5. **CSRF Protection** — Token presence, double-submit pattern, token expiry, replay resistance
+6. **Redirect Safety** — Open redirect via `redirect` parameter; validate against allowed domains
+7. **Rate Limiting** — Per-endpoint limits; account lockout; IP-based limits
+8. **File Upload Safety** — Allowed extensions, size limits, filename sanitization, path traversal prevention
+9. **Input Injection** — XSS in user-generated content, SQL injection attempts, path traversal in slugs
+
+### Session & State
+10. **Token Lifecycle** — Expiration behavior, role changes mid-session (DB role vs token role), token format validation
+11. **Concurrent Access** — Race conditions on shared resources, optimistic locking
+
+### Content & Rendering
+12. **Edge Case Content** — Empty states, very long text, special characters (CJK, emoji), Markdown/LaTeX rendering
+13. **Encoding** — BOM markers, UTF-8 consistency, DOCTYPE prefix cleanliness
+
+### SEO & Metadata
+14. **Meta Tags** — `<title>`, `<meta name="description">`, canonical URLs
+15. **Open Graph** — `og:title`, `og:description`, `og:image`, `og:url` per page
+16. **Crawler & Feed Files** — robots.txt, sitemap.xml, RSS feed validity
+
+### Accessibility
+17. **WCAG Basics** — `alt` attributes, zoom not disabled via `user-scalable=no`, color contrast, skip-to-content links
+18. **Keyboard Navigation** — Tab order, focus management, ARIA labels
+
+### Performance & Compatibility
+19. **Page Load** — Static asset availability, CDN reliability (especially in China), resource count
+20. **Responsive Design** — Breakpoints, mobile layout, touch targets
+21. **Cross-Browser** — Chrome/Firefox/Safari/Edge rendering differences
+
+### Operations
+22. **Health Checks** — `/health` endpoint availability per service
+23. **Error Handling** — 404 pages, 500 error responses, graceful degradation
+24. **Logging & Audit** — Audit trail for admin actions, login attempts
+
+### Consistency (cross-service)
+25. **Asset Inclusion** — Which pages include mobile.css, loader.js, etc.
+26. **Navigation** — Which pages have site-wide nav bar
+27. **Security Headers** — X-Content-Type-Options, X-Frame-Options, Referrer-Policy, CSP
+
+## Pitfalls
+
+- **🔴 CRITICAL: Always backup content before write operations**: When testing CRUD endpoints (save, publish, create, update), the test payload (including XSS test strings, dummy data, empty fields) CAN overwrite real production content. Before any write test:
+  1. `curl -s -b cookies SITE/admin` → extract current content_json / initialContent → save to `/tmp/backup_<service>.json`
+  2. Perform test
+  3. Restore original content via Playwright (set form fields + `collectFormData()` + submit)
+  - **This is not optional.** A session that deletes user content without restoring it is a failed session.
+- **🔴 CRITICAL: Restore content IMMEDIATELY after destructive tests**: Don't wait until end of session. If a test modifies content, restore it in the same turn. Session interruptions, timeouts, or context limits can prevent later restoration.
+- **🔴 CRITICAL: XSS payloads in form fields persist**: When you fill a form field with `<script>alert(1)</script>` for XSS testing, that value is saved to the database if the form is submitted or auto-saved. Set values with Playwright's `page.evaluate()` directly on the form elements rather than `page.fill()`, which fires input events that may trigger auto-save.
+- **⚠️ Do NOT parallelize browser delegate_tasks for QA**: Each browser interaction is slow (navigate + snapshot + screenshot = 10-30s). 3 parallel browser tasks will all time out at 600s. Run browser tests sequentially or use source-code-first mode.
+- **⚠️ Curl-only delegate tasks also time out with large batches**: A delegate_task with 30+ curl test cases can hit the 600s limit (each curl call = 1-3s + overhead). Split large test batches into smaller tasks (~15-20 cases each) or use `execute_code` with `from hermes_tools import terminal` for direct in-process execution (faster, no delegation overhead).
+- **⚠️ Client-side-only validation is a security finding**: When CSP blocks inline JS (see `script-src-elem` pitfall), any validation that only exists in client JS (password strength, field format, confirmation matching) becomes bypassable. Always test registration/submission with curl to verify server-side validation exists independently.
+- **⚠️ API authentication order matters**: Some endpoints validate request body BEFORE checking authentication, returning 422 (validation error) instead of 401 (unauthenticated). Test: `curl -X POST /api/endpoint -d 'invalid'` without auth — should get 401, not 422. This is a security issue (leaks endpoint existence and field requirements).
+- **⚠️ Fulltext search can silently fail**: Search endpoints with `mode=fulltext` may return 0 results while `mode=simple` works fine. Always test both modes with the same query. Common causes: search index not built, tokenizer (jieba) not installed, BM25 ranking misconfigured.
+- **⚠️ Rate limiting blocks subsequent tests**: Registration endpoints with strict limits (e.g., 6/hour) will block all remaining registration-related tests with 429. Strategy: test non-registration endpoints first, registration tests last, and note which tests were blocked.
+- **⚠️ Present the test plan BEFORE executing**: Show the user the complete test plan first. If they say "is this really all of it?", the plan is missing dimensions. Refer to the Comprehensive QA Dimensions Checklist above.
+- **⚠️ "全部加上" ("add everything") means ALL dimensions**: When the user says to add everything, do not skip any dimension. Write all 25+ categories into the test plan even if some have only 1-2 test cases.
+- **Multi-service auth**: Sites with shared cookies (e.g., `.ephron.ren` domain) need login on ONE service first, then verify cookie propagation to others. Don't try to login on each service independently.
+- **Encoding bugs**: Always hex-dump HTML source to check for BOM markers (`ef bb bf`) or leading newlines before DOCTYPE. Use: `xxd file.html | head -5`. For Python source files, also check: `xxd file.py | head -1`.
+- **CSRF tokens**: Many form submissions require CSRF tokens. Extract from the page first, then include in POST requests. Don't forget the CSRF cookie (`ephron_csrf`). Note: CSRF cookies are HttpOnly=false (by design, so JS can read them).
+- **Rate limits**: Note rate limit values from source code (e.g., `@limiter.limit("5/minute")`). When testing auth failures, stay under the limit or you'll get 429s that mask the real bug.
+- **Template vs runtime issues**: Some issues (empty content, missing sections) may be data issues, not code bugs. Verify by checking if the data source (database/content files) actually has content.
+- **File delivery fallback**: When sending files via QQ/WeChat fails, push to a Gitea repo as a fallback delivery mechanism.
+- **Source code security analysis**: Always check these files when available: `cookie_utils.py` (cookie params), `csrf.py` (CSRF mechanism), `redirect.py` (open redirect validation), `security_headers.py` (CSP/headers), `auth.py` (token format, lockout), `validators.py` (slug/path validation), `limiter.py` (rate limit config).
+- **⚠️ CSP `script-src-elem` silently kills inline JS**: When a page has inline `<script>` but buttons call functions defined there (e.g., `onclick="saveDraft()"`), always verify the CSP header. The `script-src-elem` directive **overrides** `script-src` for script elements, so `script-src 'unsafe-inline'` combined with `script-src-elem 'self' https://cdn.example.com` blocks ALL inline scripts.
+  - **Symptoms**: functions report "not defined", buttons do nothing, no network requests on click.
+  - **Detection**: check `typeof fnName` in the browser console, or look for a CSP error in the console: `Executing inline script violates the following Content Security Policy directive 'script-src-elem'`.
+  - **Fix**: add `'unsafe-inline'` to `script-src-elem`, use a nonce/hash, or extract inline scripts to external `.js` files.
+- **⚠️ CSP `form-action 'self'` blocks cross-origin redirects after form submission**: When a form POSTs to a same-origin endpoint (allowed by `form-action 'self'`) but the server responds with a 303 redirect to a **different origin** (e.g., `auth.example.com` → `www.example.com`), the browser blocks the redirect. In browsers that enforce it this way (Chromium-based browsers do; other engines may differ), `form-action` applies to the **entire redirect chain** resulting from the form submission, not just the form's action URL. This breaks any auth flow where the login service is on a different subdomain than the target pages.
+  - **Symptoms**: form appears to submit (POST in network tab), cookie gets set server-side, but the page stays on the form URL with no navigation.
+  - **Console error**: `Sending form data to '...' violates the following Content Security Policy directive: "form-action 'self'"`.
+  - **Detection**: (1) test a same-origin redirect (should work) vs a cross-origin redirect (should fail); (2) `curl -sI` the 303 response. If it carries CSP with `form-action 'self'`, that's the blocker.
+  - **Fix options**: (a) skip the CSP header on 303 redirect responses (empty body, CSP adds no protection); (b) use a JS-based redirect instead of a server-side 303; (c) add the allowed origins to `form-action`.
+  - See `references/session-learnings-ephron-qa.md` for full reproduction steps.
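+
+The `script-src-elem` pitfall above can be checked from a response header without opening a browser. The Python sketch below is a simplified model of how a browser picks the governing directive for `<script>` elements; it is not a full CSP evaluator, and `parse_csp` and `inline_scripts_allowed` are hypothetical helper names.
+
+```python
+def parse_csp(header: str):
+    """Split a Content-Security-Policy header into {directive: [sources]}."""
+    policy = {}
+    for part in header.split(";"):
+        tokens = part.split()
+        if tokens:
+            # A repeated directive is ignored; the first occurrence wins.
+            policy.setdefault(tokens[0].lower(), [t.lower() for t in tokens[1:]])
+    return policy
+
+def inline_scripts_allowed(header: str) -> bool:
+    """Simplified check: would a plain inline <script> (no nonce, no hash) run?"""
+    policy = parse_csp(header)
+    # script-src-elem takes precedence, then script-src, then default-src.
+    for name in ("script-src-elem", "script-src", "default-src"):
+        if name in policy:
+            sources = policy[name]
+            break
+    else:
+        return True  # no directive restricts scripts
+    # A nonce or hash source makes browsers ignore 'unsafe-inline'.
+    has_nonce_or_hash = any(
+        s.startswith(("'nonce-", "'sha256-", "'sha384-", "'sha512-")) for s in sources
+    )
+    return "'unsafe-inline'" in sources and not has_nonce_or_hash
+```
+
+Feed it the header value from `curl -sI`. A `False` result for a page that relies on inline `<script>` blocks is a strong hint that the buttons on that page are dead.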
+
+## Scope Ambiguity Pitfall
+
+When the user asks to "inspect a server" or "巡检服务器" **without providing a URL**:
+- Clarify whether they mean the **local machine Hermes runs on** (system resources, running processes, disk/memory) or a **remote web service** (HTTP endpoints, app health).
+- **Default assumption**: If the user mentions a domain name (e.g., "inspect ephron.ren" or "check blog.ephron.ren"), they mean the remote web service. If they say "your server" or "the machine you're on", they mean the local machine.
+- When in doubt, ask: "Do you want me to inspect the local machine or the remote service?"
+
+## Tools Reference
+
+| Tool | Purpose |
+|------|---------|
+| `browser_navigate` | Go to a URL |
+| `browser_snapshot` | Get DOM text snapshot (accessibility tree) |
+| `browser_click` | Click an element by ref (`@eN`) or text |
+| `browser_type` | Type into an input field |
+| `browser_scroll` | Scroll up/down on the page |
+| `browser_back` | Go back in browser history |
+| `browser_press` | Press a keyboard key |
+| `browser_vision` | Screenshot + AI analysis; use `annotate=true` for element labels |
+| `browser_console` | Get JS console output and errors |
+| **Playwright Python** | Full browser automation via script — use when built-in tools unavailable or need programmatic control (see `references/playwright-qa.md`) |
+
+## Related References
+
+- `references/issue-taxonomy.md` — severity and category classification for issues
+- `references/server-inspection.md` — local server inspection checklist: system resources, listening ports, processes, Docker, security services; also covers scope ambiguity (local vs. remote), route file reading strategy, cross-service cookie auth testing, static analysis checks
+- `references/qa-dimensions-checklist.md` — comprehensive 25-dimension QA checklist for "full site" testing requests
+- `references/playwright-qa.md` — Playwright Python setup, patterns, event monitoring, CSP bug detection
+- `references/session-learnings-ephron-qa.md` — concrete findings from ephron.ren QA: CSP override, password validation gaps, fulltext search failure, delegate sizing
+
+## Templates
+
+- `templates/dogfood-report-template.md` — issue list template (the output with bugs found)
+- `templates/test-plan-template.md` — test plan template (structured test cases with steps)
+
+## Tips
+
+- **Always check `browser_console()` after navigating and after significant interactions.** Silent JS errors are among the most valuable findings.
+- **Use `annotate=true` with `browser_vision`** when you need to reason about interactive element positions or when the snapshot refs are unclear.
+- **Test with both valid and invalid inputs** — form validation bugs are common.
+- **Scroll through long pages** — content below the fold may have rendering issues.
+- **Test navigation flows** — click through multi-step processes end-to-end.
+- **Check responsive behavior** by noting any layout issues visible in screenshots.
+- **Don't forget edge cases**: empty states, very long text, special characters, rapid clicking.
+- When reporting screenshots to the user, include `MEDIA:<screenshot_path>` so they can see the evidence inline.