SEO: title, meta description, canonical URL, Open Graph and Twitter tags, plus BreadcrumbList, FAQPage, and WebApplication structured data.
This is an A/B testing experiment plan generator for product and growth experiments (e.g. pricing, checkout, onboarding), not academic lab experiments. Get a pre-registered plan with hypothesis, primary and guardrail metrics, sample size and duration, and decision rules.
No login. Autosave in browser. Shareable URL.
Exactly one; it must have an event definition, unit, and window.
Required for pricing/checkout/signup/onboarding. Each needs threshold + action.
Primary: exactly one metric that determines the test decision; it must have a clear event definition, unit, and window. Secondary: 0–5 metrics you monitor but do not use as the main decision. Guardrail: 1–5 metrics that protect against harm (e.g. trial-to-paid must not drop more than 5%); each needs a threshold and a kill switch action (e.g. revert and investigate). For high-risk flows (pricing, checkout, signup, onboarding), at least one guardrail is required.
Use the built-in calculator for conversion metrics: enter baseline rate (e.g. 0.12 for 12%), MDE as absolute (e.g. 0.02 for 2 percentage points) or relative (e.g. 0.10 for 10% lift), alpha (default 0.05), power (default 0.8), and daily traffic. The tool outputs required total sample and planned days. For non-conversion numeric metrics, use an external sample size calculator and mark "needs external calc" in the plan. Avoid unrealistic MDEs (e.g. 200% lift) unless justified.
Repeatedly checking results and stopping when you see significance inflates false positives. Use a fixed horizon: decide only at the planned end date. If you need to stop early, use a pre-registered sequential testing method instead of ad-hoc peeking. The lint panel flags "stop early when significant" without a fixed duration and recommends setting planned days from the sample size calculator.
Define before launch: Win—when you ship the variant (e.g. primary metric significant positive, no guardrail breach). Loss—when you revert (e.g. primary significant negative). Inconclusive—what you do if the test ends without significance (e.g. run full duration, then extend or ship based on guardrails). Guardrail breach—immediate action (e.g. revert and investigate). Writing these down avoids bias and disagreement after results.
Symptom: Unclear success criteria.
Cause: No single primary metric or decision rules defined.
Fix: Pick one primary metric; write win/loss/inconclusive/guardrail breach actions before launch.
Symptom: Peeking and early stopping.
Cause: Checking results before planned end and stopping when significant.
Fix: Use a fixed horizon (planned duration); or adopt a pre-registered sequential testing method. Do not stop early ad hoc.
Symptom: No guardrails on high-risk flows.
Cause: Pricing/checkout/signup tests without guardrail metrics.
Fix: Add at least one guardrail with threshold and action (e.g. trial-to-paid must not drop >5%).
Symptom: Metric definition ambiguity.
Cause: Primary metric has no event definition, unit, or window.
Fix: Define the exact event, unit (%, count, etc.), and window (per user, per session) so analytics can implement it.
Symptom: MDE too large or undefined.
Cause: Expecting 100%+ lift or leaving MDE blank.
Fix: Set a realistic MDE (e.g. 10% relative); use the duration calculator to get required sample and days.
Symptom: Segments defined after results.
Cause: Slicing by segment only after looking at data.
Fix: Pre-register at most 3 segments before launch; report segment-level results only for those.
Symptom: SRM or instrumentation not checked.
Cause: No day-1 SRM check or event QA.
Fix: Add randomization and exposure logging; run SRM check after day 1; QA events in staging.
Symptom: No risks or mitigations.
Cause: Plan has no section on what could go wrong.
Fix: List at least 2 risks and mitigations so the team is prepared.
This is an A/B testing experiment plan generator for product and growth experiments (e.g. pricing, checkout, onboarding, feature flags). It is not for academic lab experiments. You get a pre-registered plan with hypothesis, primary metric, guardrails, sample size, duration, and decision rules.
Primary: exactly one metric that determines success; it must have a clear definition, unit, and window. Secondary: 0–5 metrics you will monitor but not use as the main decision. Guardrail: 1–5 metrics that protect against harm (e.g. trial-to-paid must not drop); each needs a threshold and kill switch action.
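As a sketch, a plan's metrics and a subset of the lint rules above could be modeled like this (field names and error messages are hypothetical, not the tool's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metric:
    # Illustrative shape only: name/role/event/unit/window mirror the plan fields.
    name: str
    role: str                        # "primary" | "secondary" | "guardrail"
    event: str                       # exact analytics event, e.g. "checkout_completed"
    unit: str                        # "%", "count", "USD", ...
    window: str                      # "per user, 7 days", "per session", ...
    threshold: Optional[str] = None  # guardrails only, e.g. "no drop > 5% relative"
    action: Optional[str] = None     # guardrails only, e.g. "revert and investigate"

def validate(metrics: list[Metric], high_risk: bool) -> list[str]:
    """Return lint errors for a subset of the rules described above."""
    errors = []
    primaries = [m for m in metrics if m.role == "primary"]
    if len(primaries) != 1:
        errors.append("exactly one primary metric required")
    guardrails = [m for m in metrics if m.role == "guardrail"]
    if high_risk and not guardrails:
        errors.append("high-risk flow: at least one guardrail required")
    for g in guardrails:
        if not (g.threshold and g.action):
            errors.append(f"guardrail '{g.name}' needs threshold and action")
    return errors
```

A plan with only a primary metric passes for a low-risk flow but fails lint for a high-risk one until a guardrail with both a threshold and an action is added.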
For conversion metrics we use a two-proportion sample size formula (baseline rate, MDE, alpha 0.05, power 0.8). You enter baseline, MDE (absolute or relative), and daily traffic; the tool outputs required total sample and planned days. For non-conversion numeric metrics, use an external sample size calculator and mark "needs external calc".
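The two-proportion formula behind the calculator can be sketched in a few lines of Python using the standard normal-approximation sample size formula (function names and defaults here are illustrative, not the tool's code):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(baseline: float, mde: float, relative: bool = False,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Required sample size per arm for a two-sided two-proportion z-test.

    baseline: baseline conversion rate (e.g. 0.12 for 12%)
    mde: absolute (e.g. 0.02) or relative (e.g. 0.10) minimum detectable effect
    """
    p1 = baseline
    p2 = p1 * (1 + mde) if relative else p1 + mde
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # power
    p_bar = (p1 + p2) / 2
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return ceil(n)

def planned_days(n_per_arm: int, daily_traffic: int, arms: int = 2) -> int:
    """Days needed if daily_traffic is split evenly across arms."""
    return ceil(n_per_arm * arms / daily_traffic)
```

For example, a 12% baseline with a 2 percentage point absolute MDE needs about 4,438 users per arm; at 1,000 users/day split across two arms that is 9 planned days.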
Repeatedly checking results and stopping when you see significance inflates false positives (you can "win" by chance). Use a fixed planned duration and decide only at the end, or use a pre-registered sequential testing approach. The lint panel flags "stop early when significant" without a fixed horizon.
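A quick way to see the inflation is an A/A simulation: both arms share the same true rate, so every "win" is a false positive. The toy sketch below (illustrative, not the tool's implementation) peeks after each batch and stops at the first significant result; with one look the false positive rate sits near alpha, while repeated peeking pushes it well above:

```python
import random
from statistics import NormalDist

def z_significant(c_conv, c_n, v_conv, v_n, alpha=0.05):
    """Two-sided two-proportion z-test with a pooled standard error."""
    pooled = (c_conv + v_conv) / (c_n + v_n)
    if pooled in (0, 1):
        return False
    se = (pooled * (1 - pooled) * (1 / c_n + 1 / v_n)) ** 0.5
    z = abs(c_conv / c_n - v_conv / v_n) / se
    return z > NormalDist().inv_cdf(1 - alpha / 2)

def false_positive_rate(peeks, runs=500, n_per_peek=200, p=0.10, seed=7):
    """A/A test: identical true rate in both arms, so any significant result
    is a false positive. Stop at the first 'significant' peek."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        c = v = cn = vn = 0
        for _ in range(peeks):
            c += sum(rng.random() < p for _ in range(n_per_peek))
            v += sum(rng.random() < p for _ in range(n_per_peek))
            cn += n_per_peek
            vn += n_per_peek
            if z_significant(c, cn, v, vn):
                hits += 1
                break
    return hits / runs
```

Comparing `false_positive_rate(1)` (fixed horizon, one decision) with `false_positive_rate(10)` (ten peeks) shows the inflation directly.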
Decision rules define what you will do when the test ends: ship (win), revert (loss), what to do if results are inconclusive, and what to do if a guardrail is breached (e.g. revert immediately). Defining these before launch avoids bias and disagreement after results.
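Written as a tiny decision function, the rules might look like the sketch below; the labels and actions are examples, not the tool's defaults:

```python
def decide(primary_significant: bool, primary_positive: bool,
           guardrail_breached: bool, at_planned_end: bool) -> str:
    """Pre-registered decision rules, written down before launch."""
    if guardrail_breached:
        return "revert and investigate"   # kill switch, applies at any time
    if not at_planned_end:
        return "keep running"             # fixed horizon: no early decisions
    if primary_significant and primary_positive:
        return "ship variant"             # win
    if primary_significant and not primary_positive:
        return "revert"                   # loss
    return "inconclusive: extend or ship per guardrails"
```

The point is that every branch is chosen before launch, so the team cannot argue the outcome into a different action after seeing the data.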
For high-risk flows (pricing, checkout, signup, onboarding) the tool requires at least one guardrail metric with a threshold and action. This prevents shipping a change that hurts revenue or quality. The lint panel will fail the plan until guardrails are added for those flows.
Sample Ratio Mismatch means the control/variant split is not what you expect (e.g. 48/52 instead of 50/50). It can indicate a bug in randomization or exposure logging. Check SRM after day 1; if it fails, investigate before continuing. The QA checklist includes SRM.
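A minimal SRM check is a one-degree-of-freedom chi-square goodness-of-fit test; for df=1 the p-value reduces to a normal-tail expression, so no stats library is needed. This sketch assumes a two-arm split (illustrative, not the tool's implementation); a p-value below a strict cutoff such as 0.001 signals SRM:

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(control_n: int, variant_n: int, expected_ratio: float = 0.5) -> float:
    """Chi-square (df=1) goodness-of-fit p-value for the observed split.

    For df=1, a chi-square variable is a squared standard normal, so
    P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
    """
    total = control_n + variant_n
    exp_c = total * expected_ratio
    exp_v = total - exp_c
    chi2 = (control_n - exp_c) ** 2 / exp_c + (variant_n - exp_v) ** 2 / exp_v
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))
```

For a planned 50/50 split, a 4,800/5,200 observed split yields a p-value well below 0.001 (likely SRM: investigate randomization and exposure logging), while 5,010/4,990 is unremarkable.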
Yes. You can copy to clipboard, export Markdown (stakeholder doc), CSV (metrics and decision rules), and JSON (full inputs + plan + lint). Share URL reconstructs the plan in a new session. No login required. Print/PDF via browser print. Use the share link to align stakeholders before launch.
Maximum 3 segments (e.g. device, geo, new vs returning). Segments must be defined before launch so you are not slicing the data after seeing results, which would inflate false positives. The lint panel flags more than 3 segments. Keep segment definitions in the plan for reproducibility and audit.
Yes. The tool is free, runs in your browser, and requires no login. You get hypothesis builder, variants, primary/secondary/guardrail metrics, sample size and duration calculator, decision rules, segmentation, risks, QA checklist, and Copy/MD/CSV/JSON export. Autosave and shareable URL included. No sign-up or account required.
Courses, blog, and glossary for product and experimentation.
Use CraftUp tools and courses to design experiments, set guardrails, and learn from results.
Last updated: 2026-03-05