Technical SEO

Title, meta description, canonical, OG, Twitter, BreadcrumbList, FAQPage, WebApplication schema.

A/B Test Plan Generator (Free)

This is an A/B testing experiment plan generator for product and growth experiments (e.g. pricing, checkout, onboarding), not academic lab experiments. Get a pre-registered plan with hypothesis, primary and guardrail metrics, sample size and duration, and decision rules.

  • Full test plan: hypothesis, variants, primary (one) and secondary/guardrail metrics, sample size and duration calculator, decision rules (win/loss/inconclusive/guardrail breach), segmentation (max 3), risks, QA checklist, results table template.
  • Strict lint panel: missing primary, guardrails for high-risk flows, peeking warning, MDE sanity, segments, QA acknowledgment. Copy, Markdown, CSV, JSON export; shareable URL; print/PDF.
  • Three loadable examples (B2C pricing CTA, B2B signup, mobile onboarding). No login.

No login. Autosave in browser. Shareable URL.

Load example: B2C pricing CTA, B2B signup, or mobile onboarding.

Plan checks

  • FAIL: Primary metric is required. → Fix: Add exactly one primary metric with name, event definition, unit, and window.
  • FAIL: Decision rules must define win, loss, inconclusive, and guardrail breach actions. → Fix: Fill in all four: when we ship, when we revert, what we do if inconclusive, and what we do if a guardrail is breached.
  • FAIL: QA checklist (randomization, exposure logging, event QA, SRM day 1) must be acknowledged. → Fix: Confirm that randomization, exposure logging, event firing QA, and SRM check after day 1 are in place or planned.

Basics

Hypothesis

Variants

Primary metric *

Exactly one; must have an event definition, unit, and window.

Secondary metrics (0–5)

Guardrail metrics (1–5)

Required for pricing/checkout/signup/onboarding. Each needs threshold + action.

Stats & duration

Decision rules

Segmentation (max 3, pre-registered)

Risks & mitigations

How it works

  1. Enter test name, surface (Web/Mobile/Backend), audience (B2C/B2B/Internal), and page or flow. Fill the hypothesis builder (observation, change, expected behavior, primary metric). Define control and variant; add exactly one primary metric and optional secondary and guardrail metrics.
  2. Set baseline rate, MDE (absolute or relative), alpha, power, and daily traffic. The tool computes sample size and planned duration. Define decision rules (win, loss, inconclusive, guardrail breach). Add up to 3 pre-registered segments and risks with mitigations. Acknowledge the QA checklist.
  3. Run the lint panel to fix a missing primary metric, missing guardrails for high-risk flows, or peeking warnings. Generate the plan to get all 10 sections. Copy to clipboard or export Markdown, CSV, or JSON; share the URL or print to PDF. Three loadable examples are available, and no login is required.
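
Under the hood, a generated plan is simply a structured object covering those sections. The TypeScript sketch below shows one plausible shape; the field names are illustrative assumptions, not the tool's actual export schema.

  // Illustrative plan shape (field names are assumptions, not the tool's real JSON schema).
  type MetricSpec = { name: string; eventDefinition: string; unit: string; window: string };

  interface AbTestPlan {
    name: string;
    surface: "Web" | "Mobile" | "Backend";
    audience: "B2C" | "B2B" | "Internal";
    hypothesis: { observation: string; change: string; expectedBehavior: string; primaryMetric: string };
    variants: { control: string; variant: string };
    primaryMetric: MetricSpec;                                                // exactly one
    secondaryMetrics: MetricSpec[];                                           // 0-5, monitored only
    guardrailMetrics: (MetricSpec & { threshold: string; action: string })[]; // 1-5 on high-risk flows
    stats: {
      baselineRate: number; mde: number; mdeType: "absolute" | "relative";
      alpha: number; power: number; dailyTraffic: number;
      requiredSample: number; plannedDays: number;
    };
    decisionRules: { win: string; loss: string; inconclusive: string; guardrailBreach: string };
    segments: string[];                                                       // max 3, pre-registered
    risks: { risk: string; mitigation: string }[];                            // at least 2 recommended
    qaChecklistAcknowledged: boolean; // randomization, exposure logging, event QA, SRM day 1
  }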

Primary vs secondary vs guardrail metrics

Primary: exactly one metric that determines the test decision; it must have a clear event definition, unit, and window. Secondary: 0–5 metrics you monitor but do not use as the main decision. Guardrail: 1–5 metrics that protect against harm (e.g. trial-to-paid must not drop more than 5%); each needs a threshold and a kill switch action (e.g. revert and investigate). For high-risk flows (pricing, checkout, signup, onboarding), at least one guardrail is required.
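
To make the three roles concrete, here is how the metrics for a checkout test might be filled in. This is a minimal TypeScript sketch; the event names, field names, and values are illustrative, not defaults or requirements of the tool.

  // Concrete examples of the three metric roles (values are illustrative).
  const primaryMetric = {
    name: "Checkout conversion",
    eventDefinition: "order_completed fired after exposure", // hypothetical event name
    unit: "% of exposed users",
    window: "7 days from first exposure",
  };

  const secondaryMetrics = [
    { name: "Add-to-cart rate", eventDefinition: "add_to_cart fired", unit: "% of exposed users", window: "7 days" },
  ];

  const guardrailMetrics = [
    {
      name: "Trial-to-paid conversion",
      threshold: "must not drop more than 5% relative vs. control",
      action: "revert and investigate", // kill switch
    },
  ];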

How to set MDE and duration without guessing

Use the built-in calculator for conversion metrics: enter baseline rate (e.g. 0.12 for 12%), MDE as absolute (e.g. 0.02 for 2 percentage points) or relative (e.g. 0.10 for 10% lift), alpha (default 0.05), power (default 0.8), and daily traffic. The tool outputs required total sample and planned days. For non-conversion numeric metrics, use an external sample size calculator and mark "needs external calc" in the plan. Avoid unrealistic MDEs (e.g. 200% lift) unless justified.
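
For conversion metrics, such a calculator typically rests on the standard two-proportion sample size formula. The TypeScript sketch below is a minimal version under that assumption (the tool's exact implementation may differ); the default z-values correspond to alpha 0.05 (two-sided) and power 0.8.

  // Two-proportion sample size per arm (standard formula; illustrative only).
  function sampleSizePerArm(
    baseline: number,       // e.g. 0.12 for 12%
    mde: number,            // e.g. 0.02 absolute, or 0.10 relative
    mdeIsRelative: boolean,
    zAlpha = 1.96,          // alpha = 0.05, two-sided
    zBeta = 0.8416          // power = 0.8
  ): number {
    const p1 = baseline;
    const p2 = mdeIsRelative ? p1 * (1 + mde) : p1 + mde;
    const variance = p1 * (1 - p1) + p2 * (1 - p2);
    return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
  }

  // Planned duration: both arms' total sample divided by eligible daily traffic.
  function plannedDays(nPerArm: number, dailyTraffic: number): number {
    return Math.ceil((2 * nPerArm) / dailyTraffic);
  }

  // Example: baseline 12% with a 10% relative MDE (12% -> 13.2%) needs roughly
  // 12,000 users per arm; with 5,000 eligible users per day that is about 5 planned days.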

Why peeking and early stopping break tests

Repeatedly checking results and stopping when you see significance inflates false positives. Use a fixed horizon: decide only at the planned end date. If you need to stop early, use a pre-registered sequential testing method instead of ad-hoc peeking. The lint panel flags "stop early when significant" without a fixed duration and recommends setting planned days from the sample size calculator.
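
The inflation is easy to reproduce. In the Monte Carlo sketch below (illustrative, not part of the tool), both arms share the same true conversion rate, so every "significant" result is a false positive; peeking daily typically triggers far more of them than the nominal 5%.

  // A/A simulation: peeking daily vs. deciding once at the fixed horizon.
  function zStat(convA: number, nA: number, convB: number, nB: number): number {
    const pooled = (convA + convB) / (nA + nB);
    const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
    return se === 0 ? 0 : (convB / nB - convA / nA) / se;
  }

  function peekingSimulation(runs = 2000, days = 14, usersPerArmPerDay = 500, rate = 0.1): void {
    let peekingWins = 0;
    let fixedWins = 0;
    for (let r = 0; r < runs; r++) {
      let convA = 0, convB = 0, n = 0, wonByPeeking = false;
      for (let d = 0; d < days; d++) {
        for (let i = 0; i < usersPerArmPerDay; i++) {
          if (Math.random() < rate) convA++;
          if (Math.random() < rate) convB++;
        }
        n += usersPerArmPerDay;
        if (Math.abs(zStat(convA, n, convB, n)) > 1.96) wonByPeeking = true; // daily peek
      }
      if (wonByPeeking) peekingWins++;
      if (Math.abs(zStat(convA, n, convB, n)) > 1.96) fixedWins++; // decide once, at the end
    }
    // Peeking inflates false positives well above the ~5% seen with a fixed horizon.
    console.log({ peeking: peekingWins / runs, fixedHorizon: fixedWins / runs });
  }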

Decision rules: ship, revert, iterate

Define before launch: Win—when you ship the variant (e.g. primary metric significant positive, no guardrail breach). Loss—when you revert (e.g. primary significant negative). Inconclusive—what you do if the test ends without significance (e.g. run full duration, then extend or ship based on guardrails). Guardrail breach—immediate action (e.g. revert and investigate). Writing these down avoids bias and disagreement after results.
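
Written as code, the four rules reduce to a small pre-registered decision table, as in the TypeScript sketch below; the outcome labels are assumptions, not the tool's wording.

  type Decision = "ship" | "revert" | "extend or ship per guardrails" | "revert and investigate";

  // Guardrails are checked first; the primary metric decides only if no guardrail is breached.
  function decide(
    primarySignificant: boolean,
    primaryLiftPositive: boolean,
    guardrailBreached: boolean
  ): Decision {
    if (guardrailBreached) return "revert and investigate";          // guardrail breach
    if (primarySignificant && primaryLiftPositive) return "ship";    // win
    if (primarySignificant && !primaryLiftPositive) return "revert"; // loss
    return "extend or ship per guardrails";                          // inconclusive
  }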

Pro tips

  • Define exactly one primary metric with a clear event definition, unit, and window before launch.
  • Use guardrail metrics for high-risk flows (pricing, checkout, signup): set a threshold and kill switch action for each.
  • Pre-register up to 3 segments (e.g. device, geo, new vs return); do not define segments after seeing results.
  • Set a fixed planned duration from sample size and daily traffic; avoid peeking and early stopping unless using sequential testing.
  • Write decision rules up front: what counts as win, loss, inconclusive, and guardrail breach.
  • Use MDE (minimum detectable effect) in a realistic range; 5–20% relative lift is common; flag very large MDEs.
  • Confirm QA checklist: randomization, exposure logging, event firing QA, and SRM check after day 1.
  • Include at least two risks and mitigations in the plan so stakeholders know what could go wrong.
  • Export the plan (MD/JSON/CSV) and share the URL so the team has one source of truth before launch.
  • Fill the results table and Decision & Learning field after the test to close the loop.

Common mistakes

Symptom: Unclear success criteria.

Cause: No single primary metric or decision rules defined.

Fix: Pick one primary metric; write win/loss/inconclusive/guardrail breach actions before launch.

Symptom: Peeking and early stopping.

Cause: Checking results before planned end and stopping when significant.

Fix: Use a fixed horizon (planned duration); or adopt a pre-registered sequential testing method. Do not stop early ad hoc.

Symptom: No guardrails on high-risk flows.

Cause: Pricing/checkout/signup tests without guardrail metrics.

Fix: Add at least one guardrail with threshold and action (e.g. trial-to-paid must not drop >5%).

Symptom: Metric definition ambiguity.

Cause: Primary metric has no event definition, unit, or window.

Fix: Define the exact event, unit (%, count, etc.), and window (per user, per session) so analytics can implement it.

Symptom: MDE too large or undefined.

Cause: Expecting 100%+ lift or leaving MDE blank.

Fix: Set a realistic MDE (e.g. 10% relative); use the duration calculator to get required sample and days.

Symptom: Segments defined after results.

Cause: Slicing by segment only after looking at data.

Fix: Pre-register max 3 segments before launch; report by segment only for those.

Symptom: SRM or instrumentation not checked.

Cause: No day-1 SRM check or event QA.

Fix: Add randomization and exposure logging; run SRM check after day 1; QA events in staging.

Symptom: No risks or mitigations.

Cause: Plan has no section on what could go wrong.

Fix: List at least 2 risks and mitigations so the team is prepared.

FAQ

Is this for product/growth A/B tests or academic experiments?

This is an A/B testing experiment plan generator for product and growth experiments (e.g. pricing, checkout, onboarding, feature flags). It is not for academic lab experiments. You get a pre-registered plan with hypothesis, primary metric, guardrails, sample size, duration, and decision rules.

What is the difference between primary, secondary, and guardrail metrics?

Primary: exactly one metric that determines success; it must have a clear definition, unit, and window. Secondary: 0–5 metrics you will monitor but not use as the main decision. Guardrail: 1–5 metrics that protect against harm (e.g. trial-to-paid must not drop); each needs a threshold and kill switch action.

How are sample size and duration calculated?

For conversion metrics we use a two-proportion sample size formula (baseline rate, MDE, alpha 0.05, power 0.8). You enter baseline, MDE (absolute or relative), and daily traffic; the tool outputs required total sample and planned days. For non-conversion numeric metrics, use an external sample size calculator and mark 'needs external calc'.
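
One common form of that formula, per arm, where p1 is the baseline rate and p2 = p1 + MDE (absolute) or p1 × (1 + MDE) (relative); the tool's exact implementation may differ slightly:

  n per arm ≈ (z(1−α/2) + z(1−β))² × (p1(1−p1) + p2(1−p2)) / (p2 − p1)²
  planned days ≈ ceil(2 × n per arm / daily traffic)

With the defaults (alpha 0.05 two-sided, power 0.8), z(1−α/2) ≈ 1.96 and z(1−β) ≈ 0.84.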

Why is peeking or early stopping a problem?

Repeatedly checking results and stopping when you see significance inflates false positives (you can 'win' by chance). Use a fixed planned duration and decide only at the end, or use a pre-registered sequential testing approach. The lint panel flags 'stop early when significant' without a fixed horizon.

What are decision rules?

Decision rules define what you will do when the test ends: ship (win), revert (loss), what to do if results are inconclusive, and what to do if a guardrail is breached (e.g. revert immediately). Defining these before launch avoids bias and disagreement after results.

When are guardrails required?

For high-risk flows (pricing, checkout, signup, onboarding) the tool requires at least one guardrail metric with a threshold and action. This prevents shipping a change that hurts revenue or quality. The lint panel will fail the plan until guardrails are added for those flows.

What is SRM and why check it?

Sample Ratio Mismatch means the observed control/variant split deviates from the planned split by more than chance would explain (e.g. 48/52 instead of 50/50 across thousands of users). It can indicate a bug in randomization or exposure logging. Check SRM after day 1; if it fails, investigate before continuing. The QA checklist includes SRM.
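
A day-1 SRM check is usually a chi-square goodness-of-fit test on exposure counts. The TypeScript sketch below assumes a planned 50/50 split and uses 3.84, the 1-degree-of-freedom critical value at p = 0.05; the tool's own check may differ.

  // Chi-square SRM check (illustrative).
  function srmSuspected(controlUsers: number, variantUsers: number, expectedControlShare = 0.5): boolean {
    const total = controlUsers + variantUsers;
    const expectedControl = total * expectedControlShare;
    const expectedVariant = total - expectedControl;
    const chiSquare =
      (controlUsers - expectedControl) ** 2 / expectedControl +
      (variantUsers - expectedVariant) ** 2 / expectedVariant;
    return chiSquare > 3.84; // true = split is unlikely by chance; investigate before continuing
  }

  // Example: 4,800 vs 5,200 exposures gives chi-square = 16, far above 3.84 -> investigate.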

Can I export and share the plan?

Yes. You can copy to clipboard, export Markdown (stakeholder doc), CSV (metrics and decision rules), or JSON (full inputs, plan, and lint results). The share URL reconstructs the plan in a new session; no login is required. Print or save as PDF via the browser's print dialog. Use the share link to align stakeholders before launch.
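
How the share URL carries the plan is not documented here, but a browser-only tool can do it with standard APIs alone. The TypeScript sketch below is purely an assumption about one way it could work, using only built-in functions (JSON.stringify, encodeURIComponent, btoa and their inverses).

  // Hypothetical share-URL encoding (not the tool's confirmed implementation).
  function planToShareUrl(plan: object, baseUrl: string): string {
    const encoded = btoa(encodeURIComponent(JSON.stringify(plan)));
    return `${baseUrl}#plan=${encoded}`;
  }

  function planFromShareUrl(hash: string): object | null {
    const match = hash.match(/plan=([^&]+)/);
    return match ? JSON.parse(decodeURIComponent(atob(match[1]))) : null;
  }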

How many segments can I pre-register?

Maximum 3 segments (e.g. device, geo, new vs returning). Segments must be defined before launch so you are not slicing the data after seeing results, which would inflate false positives. The lint panel flags more than 3 segments. Keep segment definitions in the plan for reproducibility and audit.

Is the A/B Test Plan Generator free?

Yes. The tool is free, runs in your browser, and requires no login. You get hypothesis builder, variants, primary/secondary/guardrail metrics, sample size and duration calculator, decision rules, segmentation, risks, QA checklist, and Copy/MD/CSV/JSON export. Autosave and shareable URL included. No sign-up or account required.

Learn more with CraftUp

Courses, blog, and glossary for product and experimentation.

From plan to results

Use CraftUp tools and courses to design experiments, set guardrails, and learn from results.

Freshness

Last updated: 2026-03-05

  • 2026-03-05: Launched A/B Test Plan Generator: hypothesis, primary/secondary/guardrail metrics, sample size and duration.
  • 2026-03-05: Strict lint panel: missing primary, decision rules, guardrails for high-risk flows, peeking warning, MDE sanity, segments, QA checklist.
  • 2026-03-05: Three loadable examples (B2C pricing, B2B signup, mobile onboarding). Copy, MD, CSV, JSON export; shareable URL.