ephron_ren/agent-skills

Hermes Agent ccc63d1e70 first commit

2026-05-10 13:52:46 +08:00

name: dogfood
description: Exploratory QA of web apps: find bugs, evidence, reports.
version: 1.0.0
metadata: hermes

Dogfood: Systematic Web Application QA Testing

Overview

This skill guides you through systematic exploratory QA testing of web applications. It supports two execution modes depending on context:

Browser-first (default): Use browser toolset to navigate, interact, and capture evidence live.
Source-code-first (fallback): When browser automation is unavailable, slow, or times out, analyze the source code to enumerate all routes/endpoints/pages, then build a comprehensive test plan document. Execute browser tests selectively afterward.

For multi-service sites (e.g., separate auth/blog/canvas services), prefer the source-code-first approach — it produces more complete coverage faster than crawling.

Prerequisites

Browser toolset: either the built-in tools (browser_navigate, browser_snapshot, etc.) or Playwright Python (see references/playwright-qa.md) — optional if using source-code-first mode
A target URL and testing scope from the user
Source code access (repo clone or codebase) — strongly recommended for multi-service sites

Inputs

The user provides:

Target URL — the entry point for testing
Scope — what areas/features to focus on (or "full site" for comprehensive testing)
Output directory (optional) — where to save screenshots and the report (default: ./dogfood-output)

Workflow

Follow this 5-phase systematic workflow:

Phase 1: Plan

Create the output directory structure:

{output_dir}/
├── screenshots/       # Evidence screenshots
└── report.md          # Final report (generated in Phase 5)

Identify the testing scope based on user input.
Build a rough sitemap by planning which pages and features to test:
- Landing/home page
- Navigation links (header, footer, sidebar)
- Key user flows (sign up, login, search, checkout, etc.)
- Forms and interactive elements
- Edge cases (empty states, error pages, 404s)
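
The directory setup at the start of this phase can be sketched in Python as follows. This is a minimal sketch, not part of the skill's toolset: the helper name `init_output_dir` is an assumption, and the default path mirrors the `./dogfood-output` default from the Inputs section.

```python
from pathlib import Path


def init_output_dir(output_dir: str = "./dogfood-output") -> Path:
    """Create {output_dir}/screenshots/ and return the path reserved for report.md."""
    root = Path(output_dir)
    # parents=True creates output_dir itself when it is missing;
    # exist_ok=True makes the call safe to repeat across sessions.
    (root / "screenshots").mkdir(parents=True, exist_ok=True)
    # report.md is only written in Phase 5, so just return where it will go.
    return root / "report.md"
```

Calling `init_output_dir()` once at the start of a session is enough; repeating it does not touch existing screenshots.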

Phase 2: Explore

For each page or feature in your plan:

Navigate to the page:

browser_navigate(url="https://example.com/page")

Take a snapshot to understand the DOM structure:
```
browser_snapshot()
```
Check the console for JavaScript errors:
```
browser_console(clear=true)
```
Do this after every navigation and after every significant interaction. Silent JS errors are high-value findings.
Take an annotated screenshot to visually assess the page and identify interactive elements:
```
browser_vision(question="Describe the page layout, identify any visual issues, broken elements, or accessibility concerns", annotate=true)
```
The annotate=true flag overlays numbered [N] labels on interactive elements. Each [N] maps to ref @eN for subsequent browser commands.
Test interactive elements systematically:
- Click buttons and links: browser_click(ref="@eN")
- Fill forms: browser_type(ref="@eN", text="test input")
- Test keyboard navigation: browser_press(key="Tab"), browser_press(key="Enter")
- Scroll through content: browser_scroll(direction="down")
- Test form validation with invalid inputs
- Test empty submissions
After each interaction, check for:
- Console errors: browser_console()
- Visual changes: browser_vision(question="What changed after the interaction?")
- Expected vs actual behavior

Phase 3: Collect Evidence

For every issue found:

Take a screenshot showing the issue:
```
browser_vision(question="Capture and describe the issue visible on this page", annotate=false)
```
Save the screenshot_path from the response — you will reference it in the report.
Record the details:
- URL where the issue occurs
- Steps to reproduce
- Expected behavior
- Actual behavior
- Console errors (if any)
- Screenshot path
Classify the issue using the issue taxonomy (see references/issue-taxonomy.md):
- Severity: Critical / High / Medium / Low
- Category: Functional / Visual / Accessibility / Console / UX / Content
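
One way to keep the recorded details consistent is a small record type. The sketch below is illustrative only: the `Issue` class and its field names are assumptions, while the severity and category values are the ones listed above.

```python
from dataclasses import dataclass, field

SEVERITIES = ("Critical", "High", "Medium", "Low")
CATEGORIES = ("Functional", "Visual", "Accessibility", "Console", "UX", "Content")


@dataclass
class Issue:
    """One finding, holding the details listed under 'Record the details'."""

    title: str
    url: str
    steps: list[str]
    expected: str
    actual: str
    severity: str
    category: str
    screenshot_path: str = ""
    console_errors: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the taxonomy so typos surface immediately.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
```

Validating at construction time means a misspelled severity fails while the evidence is still on screen, not later during report generation.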

Phase 4: Categorize

Review all collected issues.
De-duplicate — merge issues that are the same bug manifesting in different places.
Assign final severity and category to each issue.
Sort by severity (Critical first, then High, Medium, Low).
Count issues by severity and category for the executive summary.
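
The five steps above can be sketched as a single function. This is a hedged example: it assumes each issue is a dict with `title`, `severity`, `category`, and `url` keys, and it treats two issues as duplicates when their titles match after trimming and lowercasing, which is a simplification of real de-duplication.

```python
from collections import Counter

SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


def categorize(issues):
    """De-duplicate by normalized title, sort by severity, and count."""
    merged = {}
    for issue in issues:
        key = issue["title"].strip().lower()
        if key not in merged:
            merged[key] = {
                "title": issue["title"],
                "severity": issue["severity"],
                "category": issue["category"],
                "urls": [issue["url"]],
            }
            continue
        kept = merged[key]
        # Same bug seen on another page: keep one entry, remember every URL.
        if issue["url"] not in kept["urls"]:
            kept["urls"].append(issue["url"])
        # Keep the more severe of the two ratings as the final severity.
        if SEVERITY_ORDER.index(issue["severity"]) < SEVERITY_ORDER.index(kept["severity"]):
            kept["severity"] = issue["severity"]
    ordered = sorted(merged.values(), key=lambda i: SEVERITY_ORDER.index(i["severity"]))
    by_severity = Counter(i["severity"] for i in ordered)
    by_category = Counter(i["category"] for i in ordered)
    return ordered, by_severity, by_category
```

The two counters feed the executive summary directly; the ordered list becomes the per-issue sections.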

Phase 5: Report

Generate the final report using the template at templates/dogfood-report-template.md.

The report must include:

Executive summary with total issue count, breakdown by severity, and testing scope
Per-issue sections with:
- Issue number and title
- Severity and category badges
- URL where observed
- Description of the issue
- Steps to reproduce
- Expected vs actual behavior
- Screenshot references (use MEDIA:<screenshot_path> for inline images)
- Console errors if relevant
Summary table of all issues
Testing notes — what was tested, what was not, any blockers

Save the report to {output_dir}/report.md.

Alternative Workflow: Source-Code-First (for multi-service / slow-browser sites)

When the target site has source code available, or browser automation is too slow/times out:

Step 1: Clone and Map the Codebase

git clone <repo_url> /tmp/qa-target
cd /tmp/qa-target
find . -name "routes*.py" -o -name "main.py" -o -name "pages.py" -o -name "admin.py" | sort

Step 2: Enumerate All Routes

Read each route file and extract:

HTTP method + path (e.g., GET /posts/{slug})
Required auth/permissions
Rate limits
Form fields and validation rules

Build a complete URL inventory — this is your test matrix.
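
As a rough illustration of route extraction, the sketch below scans one Python route file for decorator-style routes. It assumes FastAPI/Flask-like decorators such as `@router.get("/path")` and a `@limiter.limit("5/minute")` decorator stacked on the same function; projects that register routes differently need a different pattern. The function name `extract_routes` is hypothetical.

```python
import re

# Matches decorators such as @router.get("/posts/{slug}") or @app.post('/login').
ROUTE_RE = re.compile(r"""@\w+\.(get|post|put|patch|delete)\(\s*["']([^"']+)["']""")
# Matches rate-limit decorators such as @limiter.limit("5/minute").
LIMIT_RE = re.compile(r"""@limiter\.limit\(\s*["']([^"']+)["']""")


def extract_routes(source: str):
    """Return (METHOD, path, rate_limit_or_None) tuples found in one route file."""
    routes, stack_routes, stack_limit = [], [], None
    for line in source.splitlines():
        stripped = line.strip()
        route = ROUTE_RE.match(stripped)
        limit = LIMIT_RE.match(stripped)
        if route:
            stack_routes.append((route.group(1).upper(), route.group(2)))
        elif limit:
            stack_limit = limit.group(1)
        elif stripped.startswith(("def ", "async def ")):
            # The decorator stack ends at the function definition: emit and reset.
            routes.extend((method, path, stack_limit) for method, path in stack_routes)
            stack_routes, stack_limit = [], None
    return routes
```

Auth requirements and form fields still need a manual read of each handler; this only seeds the URL inventory.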

Step 3: Analyze Static Assets and Templates

Check template files for:

CSS variable definitions (look for :root blocks)
JS includes (what scripts are loaded vs missing)
Encoding issues (BOM markers, leading newlines before DOCTYPE)
Accessibility: alt attributes, user-scalable, skip links
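
The encoding check in this list can be automated on the raw bytes of a template or a fetched page. A minimal sketch, assuming the file is expected to start directly with `<!DOCTYPE ...>`; the function name `check_html_prefix` is illustrative.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark (ef bb bf in a hex dump)


def check_html_prefix(data: bytes) -> list[str]:
    """Report a BOM or stray whitespace in front of the DOCTYPE declaration."""
    problems = []
    if data.startswith(BOM):
        problems.append("UTF-8 BOM before content")
        data = data[len(BOM):]
    if not data.lstrip().lower().startswith(b"<!doctype"):
        problems.append("document does not start with DOCTYPE")
    elif data[:1].isspace():
        # DOCTYPE is present but something (newline, space) precedes it.
        problems.append("whitespace before DOCTYPE")
    return problems
```

Read the file in binary mode (`open(path, "rb").read()`) so the BOM is not silently consumed by a text decoder.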

Step 4: HTTP-Level Testing (no browser needed)

Use curl to test:

Page loads (HTTP status codes)
Static asset availability
Response headers (security headers, CSP)
Redirect chains (login flows)
API endpoints (with/without auth cookies)
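
For the status-code and response-header checks, the text printed by `curl -sI URL` can be post-processed in Python. The sketch below is an assumption-laden example: it expects a single response block (no `-L` redirect chain), and the helper names are hypothetical. The header names are the ones listed under Security Headers in the dimensions checklist.

```python
REQUIRED_HEADERS = (
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
    "Content-Security-Policy",
)


def parse_head_response(raw: str):
    """Parse the output of `curl -sI URL` into (status_code, headers)."""
    lines = [line for line in raw.splitlines() if line.strip()]
    # Status line looks like "HTTP/1.1 200 OK" or "HTTP/2 200".
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def missing_security_headers(headers: dict[str, str]) -> list[str]:
    """List the required security headers that the response does not carry."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]
```

Header names are compared case-insensitively because HTTP/2 responses print them in lowercase.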

Step 5: Generate Structured Test Plan

Output a markdown document with:

Service architecture table
Test accounts and auth mechanism
Per-module test case tables: ID | Test Item | Test Steps | Expected Result | Priority
Known issues found during source analysis
Cross-cutting concerns (consistency, accessibility, security)

Step 6: Selective Browser Execution

Only use browser automation for:

Login/register interactive flows
Visual verification of known issues
Console error capture
Screenshot evidence

Alternative Workflow: Test Plan Gap Analysis (improving existing plans)

When the user already has a test plan and wants to improve/complete it against source code:

Step 1: Load Both Inputs in Parallel

1. Read the existing test-plan.md
2. Clone or pull the target repo
3. Use delegate_task to analyze ALL route files + security mechanisms in one pass
   - Extract: endpoints, form fields, rate limits, cookie params, CSRF mechanism,
     validation rules, ownership models, public APIs
   - Focus on FEATURES not COVERED by the existing plan

Step 2: Systematic Comparison

For each service, compare source code findings against existing test cases:

Endpoints: Are all HTTP methods + paths covered?
Validation rules: Password complexity, username blacklists, email uniqueness, slug format
Rate limits: Are all limiters documented with correct values?
Security mechanisms: CSRF token format/expiry, cookie attributes, redirect validation
APIs: Public JSON APIs, service-to-service APIs, ownership isolation
Edge cases: CRLF normalization, content size limits, cascading deletes

Step 3: Patch, Don't Rewrite

Use targeted patch edits to add missing test cases within existing sections:

Insert after the related existing case (e.g., A-015a after A-015)
Use sub-numbering convention: X-NNNa for an insertion between X-NNN and the case that follows it (e.g., A-015a between A-015 and A-016)
Preserve existing case numbers — never renumber
Add new subsections only when the entire category is missing (e.g., Service API)
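
The sub-numbering rule can be made mechanical so that a second insertion after the same case does not collide with the first. A small sketch; the helper name `next_sub_id` is an assumption, and it only handles single-letter suffixes.

```python
import string


def next_sub_id(existing_ids, after_id):
    """Return the first free sub-numbered ID (A-015a, A-015b, ...) following after_id."""
    taken = set(existing_ids)
    for suffix in string.ascii_lowercase:
        candidate = f"{after_id}{suffix}"
        if candidate not in taken:
            return candidate
    # 26 insertions after one case suggests the section needs restructuring.
    raise ValueError(f"no free sub-number left after {after_id}")
```

Because existing IDs are never changed, references from issue trackers or test automation keep pointing at the same cases.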

Step 4: Update Statistics

After all patches:

# Count per-module
grep -c '^| H-[0-9]' test-plan.md  # Home
grep -c '^| A-[0-9]' test-plan.md  # Auth
# ... etc for each prefix

# Count total (catches sub-numbered too)
grep -c '^| [A-Z]*-[0-9]' test-plan.md

Update both the header stats table and the footer summary table. Bump the version number.
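
If the counts need to be reused when rewriting the stats tables, the same tally can be done in Python. This is a sketch under the assumption that every test case row starts with `| PREFIX-digits`; unlike the `[A-Z]*` grep pattern, the regex here requires at least one prefix letter. The name `count_cases` is illustrative.

```python
import re
from collections import Counter

# A test case row starts with "| ", an uppercase module prefix, "-", and a digit.
# Sub-numbered rows such as "| A-015a" match too, since only the first digit is anchored.
CASE_ROW = re.compile(r"^\| ([A-Z]+)-[0-9]", re.MULTILINE)


def count_cases(plan_text: str):
    """Return (per-module Counter, total) for the test case rows in a plan."""
    per_module = Counter(CASE_ROW.findall(plan_text))
    return per_module, sum(per_module.values())
```

Comparing the returned total with the sum shown in the header table is a quick way to spot a stale statistic.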

Step 5: Commit

git add test-plan.md && git commit -m "vN.M: improve test plan, add X test cases (old→new)" && git push

Pitfalls for Gap Analysis

Don't renumber existing cases: Use X-NNNa sub-numbering to insert between existing cases. Renumbering breaks any existing references (issue tracker, test automation).
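Count carefully: both grep patterns anchor only on the prefix and the first digit, so grep -c '^| A-[0-9]' already counts sub-numbered entries like A-015a. Use '^| [A-Z]*-[0-9]' for the total count and the prefix filter for per-module counts, then check that the per-module counts add up to the total.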
Don't duplicate: Check if a concept is already covered under a different name before adding. "Draft visible" (草稿可见) and "draft preview" (草稿预览) might be the same test.
delegate_task for source analysis: Don't read 40+ route files manually. A single delegate_task with a well-structured prompt produces a complete analysis in one pass.

Alternative Workflow: Module-by-Module Testing with Incremental Commits

When the user has an existing test plan (e.g., test-plan.md in a repo) and wants to execute it module by module, committing results after each:

Step 1: Initialize Results Document

Create test-results.md with a summary table and placeholder sections for every module. Include: module name, status (⏳), execution time, and empty test result tables.

Step 2: Test Module → Update → Commit Loop

For each module:

Execute tests (curl for HTTP-level, Playwright for browser-level)
Update the module's section in test-results.md with results
Update the summary table (pass/fail/blocked counts)
Add a "Module N Summary" section with key findings
Add a "💡 Module N Optimization Suggestions" section with prioritized recommendations (the user explicitly wants these persisted in the document, not just in chat)
git add test-results.md && git commit -m "Module N: X passed / Y failed" && git push
Report progress to user before starting next module
⚠️ If any test modified content, restore it BEFORE committing

Step 3: Parallel Delegation

Use delegate_task with 3 parallel tasks for curl-based modules. Each task tests a group of modules and returns JSON results. Browser-based modules must run sequentially.
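
Splitting the curl-based modules into three groups can be done with a simple round-robin. The sketch below only produces the groups; how each group is handed to delegate_task is not shown, and the function name and module names are hypothetical.

```python
def split_into_groups(modules, groups=3):
    """Distribute modules round-robin into at most `groups` non-empty lists."""
    buckets = [[] for _ in range(groups)]
    for index, module in enumerate(modules):
        buckets[index % groups].append(module)
    # Drop empty buckets when there are fewer modules than groups.
    return [bucket for bucket in buckets if bucket]
```

Round-robin keeps group sizes within one module of each other, which helps each task stay under the delegation time limit.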

Pitfalls for Module-by-Module

Don't wait until the end to commit: Session may break, losing all work
Restore content after destructive tests: Save state before, verify after
Rate limiting blocks repeated tests: Test rate-limited endpoints last
CSRF token sync: Use same cookie jar for GET+POST (see references/multi-service-qa.md)
Optimization suggestions go in the document, not just the chat: User wants them persisted

Deliverables: Split Test Plan + Issue List

Always split QA output into two separate documents:

test-plan.md — Structured test cases with execution steps and expected results
issue-list.md — Known issues found during analysis, with severity and fix suggestions

Do NOT merge them into one report. Users need the test plan for execution assignment and the issue list for bug tracking. Each document should be self-contained.

Test Plan Structure

Service architecture table (services, URLs, ports)
Test accounts and auth mechanism explanation
Per-module test case tables: ID | Test Item | Test Steps | Expected Result | Account | Priority
Cover both public pages AND admin/management pages
Include a cross-cutting section for security, consistency, accessibility

Issue List Structure

Summary table (count by severity level)
Per-issue entries: module, page, phenomenon, impact, root cause, source code location, fix suggestion, priority
Consistency matrix (which pages have which features/assets)
Fix priority recommendation (immediate / soon / backlog)

Comprehensive QA Dimensions Checklist

When the user asks for "full" or "complete" testing, cover ALL of these dimensions. If you only covered page navigation and login flows, the plan is incomplete.

Core (always cover)

Functional — Page loads, navigation, CRUD operations, form submissions
Auth & Permissions — Login/logout, RBAC, cross-service cookie propagation, admin vs user access
Input Validation — Form validation, empty submissions, boundary values, special characters

Security (cover for any site with user input)

Cookie Security — HttpOnly, Secure, SameSite attributes; Max-Age; domain scope
CSRF Protection — Token presence, double-submit pattern, token expiry, replay resistance
Redirect Safety — Open redirect via redirect parameter; validate against allowed domains
Rate Limiting — Per-endpoint limits; account lockout; IP-based limits
File Upload Safety — Allowed extensions, size limits, filename sanitization, path traversal prevention
Input Injection — XSS in user-generated content, SQL injection attempts, path traversal in slugs
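
The cookie attribute check in this list can be scripted against a captured `Set-Cookie` header value using the standard library. A minimal sketch, assuming one header value per call; the function name `audit_set_cookie` is illustrative, and a flagged attribute is not always a bug (a CSRF cookie that JavaScript must read is intentionally not HttpOnly).

```python
from http.cookies import SimpleCookie


def audit_set_cookie(header_value: str) -> dict[str, list[str]]:
    """Map each cookie name in a Set-Cookie value to its missing security attributes."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    findings = {}
    for name, morsel in cookie.items():
        problems = []
        # Unset attributes are stored as empty strings, which are falsy.
        if not morsel["httponly"]:
            problems.append("missing HttpOnly")
        if not morsel["secure"]:
            problems.append("missing Secure")
        if not morsel["samesite"]:
            problems.append("missing SameSite")
        findings[name] = problems
    return findings
```

An empty list for a cookie means all three attributes are present; `Max-Age` and `Domain` can be read from the same morsel with `morsel["max-age"]` and `morsel["domain"]`.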

Session & State

Token Lifecycle — Expiration behavior, role changes mid-session (DB role vs token role), token format validation
Concurrent Access — Race conditions on shared resources, optimistic locking

Content & Rendering

Edge Case Content — Empty states, very long text, special characters (CJK, emoji), Markdown/LaTeX rendering
Encoding — BOM markers, UTF-8 consistency, DOCTYPE prefix cleanliness

SEO & Metadata

Meta Tags — <title>, <meta description>, canonical URLs
Open Graph — og:title, og:description, og:image, og:url per page
Structured Data — robots.txt, sitemap.xml, RSS feed validity

Accessibility

WCAG Basics — alt attributes, user-scalable, color contrast, skip-to-content links
Keyboard Navigation — Tab order, focus management, ARIA labels

Performance & Compatibility

Page Load — Static asset availability, CDN reliability (especially in China), resource count
Responsive Design — Breakpoints, mobile layout, touch targets
Cross-Browser — Chrome/Firefox/Safari/Edge rendering differences

Operations

Health Checks — /health endpoint availability per service
Error Handling — 404 pages, 500 error responses, graceful degradation
Logging & Audit — Audit trail for admin actions, login attempts

Consistency (cross-service)

Asset Inclusion — Which pages include mobile.css, loader.js, etc.
Navigation — Which pages have site-wide nav bar
Security Headers — X-Content-Type-Options, X-Frame-Options, Referrer-Policy, CSP

Pitfalls

🔴 CRITICAL: Always backup content before write operations: When testing CRUD endpoints (save, publish, create, update), the test payload (including XSS test strings, dummy data, empty fields) CAN overwrite real production content. Before any write test:
1. curl -s -b cookies SITE/admin → extract current content_json / initialContent → save to /tmp/backup_<service>.json
2. Perform test
3. Restore original content via Playwright (set form fields + collectFormData() + submit)
- This is not optional. A session that deletes user content without restoring it is a failed session.
🔴 CRITICAL: Restore content IMMEDIATELY after destructive tests: Don't wait until end of session. If a test modifies content, restore it in the same turn. Session interruptions, timeouts, or context limits can prevent later restoration.
🔴 CRITICAL: XSS payloads in form fields persist: When you fill a form field with <script>alert(1)</script> for XSS testing, that value gets saved to the database if the form is submitted. Always use Playwright's page.evaluate() to set values directly on form elements, NOT page.fill() which triggers input events that may activate auto-save.
⚠️ Do NOT parallelize browser delegate_tasks for QA: Each browser interaction is slow (navigate + snapshot + screenshot = 10-30s). 3 parallel browser tasks will all time out at 600s. Run browser tests sequentially or use source-code-first mode.
⚠️ Curl-only delegate tasks also time out with large batches: A delegate_task with 30+ curl test cases can hit the 600s limit (each curl call = 1-3s + overhead). Split large test batches into smaller tasks (~15-20 cases each) or use execute_code with from hermes_tools import terminal for direct in-process execution (faster, no delegation overhead).
⚠️ Client-side-only validation is a security finding: When CSP blocks inline JS (see script-src-elem pitfall), any validation that only exists in client JS (password strength, field format, confirmation matching) becomes bypassable. Always test registration/submission with curl to verify server-side validation exists independently.
⚠️ API authentication order matters: Some endpoints validate the request body BEFORE checking authentication, returning 422 (validation error) instead of 401 (unauthenticated). Test: curl -X POST /api/endpoint -d 'invalid' without auth. The expected response is 401; a 422 is a security issue because it leaks endpoint existence and field requirements.
⚠️ Fulltext search can silently fail: Search endpoints with mode=fulltext may return 0 results while mode=simple works fine. Always test both modes with the same query. Common causes: search index not built, tokenizer (jieba) not installed, BM25 ranking misconfigured.
⚠️ Rate limiting blocks subsequent tests: Registration endpoints with strict limits (e.g., 6/hour) will block all remaining registration-related tests with 429. Strategy: test non-registration endpoints first, registration tests last, and note which tests were blocked.
⚠️ Present the test plan BEFORE executing: Show the user the complete test plan first. If they say "is this really all of it?", the plan is missing dimensions. Refer to the Comprehensive QA Dimensions Checklist above.
⚠️ "Add everything" (全部加上) means ALL dimensions: When the user says to add everything, do not skip any dimension. Write all 25+ categories into the test plan even if some have only 1-2 test cases.
Multi-service auth: Sites with shared cookies (e.g., .ephron.ren domain) need login on ONE service first, then verify cookie propagation to others. Don't try to login on each service independently.
Encoding bugs: Always hex-dump HTML source to check for BOM markers (ef bb bf) or leading newlines before DOCTYPE. Use: xxd file.html | head -5. For Python source files, also check: xxd file.py | head -1.
CSRF tokens: Many form submissions require CSRF tokens. Extract from the page first, then include in POST requests. Don't forget the CSRF cookie (ephron_csrf). Note: CSRF cookies are HttpOnly=false (by design, so JS can read them).
Rate limits: Note rate limit values from source code (e.g., @limiter.limit("5/minute")). When testing auth failures, stay under the limit or you'll get 429s that mask the real bug.
Template vs runtime issues: Some issues (empty content, missing sections) may be data issues, not code bugs. Verify by checking if the data source (database/content files) actually has content.
File delivery fallback: When sending files via QQ/WeChat fails, push to a Gitea repo as a fallback delivery mechanism.
Source code security analysis: Always check these files when available: cookie_utils.py (cookie params), csrf.py (CSRF mechanism), redirect.py (open redirect validation), security_headers.py (CSP/headers), auth.py (token format, lockout), validators.py (slug/path validation), limiter.py (rate limit config).
⚠️ CSP script-src-elem silently kills inline JS: When a page has inline <script> but buttons call functions defined there (e.g., onclick="saveDraft()"), always verify the CSP header. The script-src-elem directive overrides script-src for script elements — so script-src 'unsafe-inline' combined with script-src-elem 'self' https://cdn.example.com blocks ALL inline scripts. Symptoms: functions report "not defined", buttons do nothing, no network requests on click. Detection: check typeof fnName in browser console, or look for CSP error in console: Executing inline script violates the following Content Security Policy directive 'script-src-elem'. Fix: add 'unsafe-inline' to script-src-elem, use nonce/hash, or extract inline scripts to external .js files.
⚠️ CSP form-action 'self' blocks cross-origin redirects after form submission: When a form POSTs to a same-origin endpoint (allowed by form-action 'self'), but the server responds with 303 redirect to a different origin (e.g., auth.example.com → www.example.com), the browser blocks the redirect. CSP form-action applies to the entire redirect chain resulting from form submission, not just the form's action URL. Symptoms: form appears to submit (POST in network tab), cookie gets set server-side, but page stays on the form URL — no navigation. Console error: Sending form data to '...' violates the following Content Security Policy directive: "form-action 'self'". Detection: (1) test same-origin redirect (should work) vs cross-origin redirect (should fail); (2) curl -sI the 303 response — if it carries CSP with form-action 'self', that's the blocker. Fix options: (a) skip CSP header on 303 redirect responses (empty body, CSP adds no protection); (b) use JS-based redirect instead of server-side 303; (c) add allowed origins to form-action. Key insight: this breaks any auth flow where login service is on a different subdomain than target pages. See references/session-learnings-ephron-qa.md for full reproduction steps.
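
The script-src-elem pitfall above can also be detected offline from a captured Content-Security-Policy header value. The sketch below is a simplified reading of the CSP fallback rules, not a full parser: it answers only whether a plain inline `<script>` block without a nonce would be blocked. The function names are hypothetical.

```python
def parse_csp(header: str) -> dict[str, list[str]]:
    """Split a CSP header value into {directive: [source, ...]}."""
    directives = {}
    for part in header.split(";"):
        tokens = part.split()
        if tokens:
            # If a directive is repeated, only its first occurrence is used.
            directives.setdefault(tokens[0].lower(), tokens[1:])
    return directives


def inline_scripts_blocked(header: str) -> bool:
    """True if an inline <script> block without a nonce would be blocked."""
    directives = parse_csp(header)
    # script-src-elem takes precedence for <script> elements, then script-src,
    # then default-src.
    for name in ("script-src-elem", "script-src", "default-src"):
        if name in directives:
            sources = [source.lower() for source in directives[name]]
            break
    else:
        return False  # no directive governs scripts at all
    # A nonce or hash source makes browsers ignore 'unsafe-inline'.
    has_nonce_or_hash = any(
        source.startswith(("'nonce-", "'sha256-", "'sha384-", "'sha512-"))
        for source in sources
    )
    return "'unsafe-inline'" not in sources or has_nonce_or_hash
```

Feeding it the header from `curl -sI` flags the case described above, where script-src allows inline code but script-src-elem does not.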

Scope Ambiguity Pitfall

When the user asks to "inspect a server" or "巡检服务器" without providing a URL:

Clarify whether they mean the local machine Hermes runs on (system resources, running processes, disk/memory) or a remote web service (HTTP endpoints, app health).
Default assumption: If the user mentions a domain name (e.g., "巡检 ephron.ren" or "check blog.ephron.ren"), they mean the remote web service. If they say "your server" or "the machine you're on", they mean the local machine.
When in doubt, ask: "Should I inspect the local machine or the remote service?" (是巡检本机还是远程服务？)

Tools Reference

Tool	Purpose
`browser_navigate`	Go to a URL
`browser_snapshot`	Get DOM text snapshot (accessibility tree)
`browser_click`	Click an element by ref (`@eN`) or text
`browser_type`	Type into an input field
`browser_scroll`	Scroll up/down on the page
`browser_back`	Go back in browser history
`browser_press`	Press a keyboard key
`browser_vision`	Screenshot + AI analysis; use `annotate=true` for element labels
`browser_console`	Get JS console output and errors
Playwright Python	Full browser automation via script — use when built-in tools unavailable or need programmatic control (see `references/playwright-qa.md`)

references/issue-taxonomy.md — severity and category classification for issues
references/server-inspection.md — local server inspection checklist: system resources, listening ports, processes, Docker, security services; also covers scope ambiguity (local vs. remote), route file reading strategy, cross-service cookie auth testing, static analysis checks
references/qa-dimensions-checklist.md — comprehensive 25-dimension QA checklist for "full site" testing requests
references/playwright-qa.md — Playwright Python setup, patterns, event monitoring, CSP bug detection
references/session-learnings-ephron-qa.md — concrete findings from ephron.ren QA: CSP override, password validation gaps, fulltext search failure, delegate sizing

Templates

templates/dogfood-report-template.md — issue list template (the output with bugs found)
templates/test-plan-template.md — test plan template (structured test cases with steps)

Tips

Always check browser_console() after navigating and after significant interactions. Silent JS errors are among the most valuable findings.
Use annotate=true with browser_vision when you need to reason about interactive element positions or when the snapshot refs are unclear.
Test with both valid and invalid inputs — form validation bugs are common.
Scroll through long pages — content below the fold may have rendering issues.
Test navigation flows — click through multi-step processes end-to-end.
Check responsive behavior by noting any layout issues visible in screenshots.
Don't forget edge cases: empty states, very long text, special characters, rapid clicking.
When reporting screenshots to the user, include MEDIA:<screenshot_path> so they can see the evidence inline.
