Geo-experiment Simulation Study

By Robson Tigre

Recast Research · June 2026 · Code on GitHub

Open-source geo-experiment tools are not interchangeable.

We ran 32,000 simulated experiments across four common marketing scenarios to benchmark four leading open-source geo-experiment tools. Because we use synthetic data where the true campaign effect is known in advance, we can measure how well each tool recovers it.

These four tools are often treated as interchangeable, and they are not. Where they diverge — sharply, at times — is in how they handle uncertainty: how often their confidence intervals contain the true incremental effect, how often they declare winning results that aren't real (false positives), and how often they come back inconclusive when a real incremental effect exists (false negatives).

Every tool forces a tradeoff between those two mistakes, and the choice of tool becomes a choice about which one is more expensive for your business. That's the focus of this report.

Tools studied (×4): CausalPy, Google MM, Meta GeoLift, Causal Impact

Scenarios (×4): S1 Baseline, S2 Outlier market, S3 Small pool, S4 Short pre-period

Effect sizes (×2): real incremental effect (7.5%), null (no effect)

Iterations (×1,000): per tool, per scenario, per effect size

Total: 4 × 4 × 2 × 1,000 = 32,000

CausalPy vs. Google Matched Markets vs. Meta GeoLift vs. Google Causal Impact: A Head-to-Head Simulation Study

We compared four leading open-source geographic-based experimentation tools — CausalPy, Google Matched Markets, Google Causal Impact, and Meta GeoLift — by running each one on thousands of simulated datasets where the true incremental campaign effect is known beforehand. Throughout this report, "effect" means this incremental lift — the sales a campaign genuinely causes, beyond what would have happened anyway.

In this study, every tool sees the same data, the same treatment and control markets, and is scored against the same ground truth. The only thing that differs is the tool being used for the analysis.

Each simulated dataset is daily sales across many markets, ending with a 15-day test period. In that window, one market gets either a 7.5% incremental sales lift (to see whether each tool detects a real effect) or no lift at all (to see how often each tool declares a false win). Each market behaves like a real one: it has its own size, grows slowly over time, has good days and bad days of the week, and experiences random noise where a rough patch tends to bleed into the next day rather than resetting completely.

For the full methodology and technical details, see the Appendix.

Scenarios —Four operating conditions practitioners actually face.

We tested the tools across four scenarios. Each represents a real-world operating condition you'll meet in the course of running an experimentation program.

Baseline

This is clean, well-behaved data with one treated geo and a diverse pool of 20 control markets.

Outlier market

The treated market is 5× the size of the median city in the control pool. In other words, what happens when your CMO insists on testing in New York?

Small control pool

There are only 9 control markets instead of 20, a common issue for companies operating in smaller markets.

Short calibration window

This scenario has 30 days of pre-treatment data instead of 90. It is another common situation: business pressure forces a brand to launch an experiment before it has enough pre-experiment data.

Time series visualization of all four simulation scenarios — **Figure 1.** Simulated market data for each scenario, shown with no campaign effect applied. Gray lines represent individual control markets. Blue: control average. Coral: treated market. Dashed vertical line: campaign start.

Across these scenarios, no tool escaped the tradeoff between false positives and false negatives. The interesting question is which tradeoff each one made.

Key findings —What we learned across 32,000 model fits.

Finding 01

There's no free lunch — every tool forces a tradeoff between two kinds of mistakes.

Every tool in this study forces a choice between two errors: false positives (declaring a winning experiment that isn't real) and false negatives (declaring an experiment inconclusive when a real incremental effect exists). No tool escapes the tradeoff. Which mistake is more expensive for your business is the question that should drive which tool you pick.

Confidence intervals across tools and scenarios — **Figure 2.** Uncertainty intervals per tool × scenario, 7.5% effect condition. 50 iterations from the 1,000 simulation runs. Dashed vertical line marks the true ATT = 7.5%. Green = the CI contains 7.5%; red = it misses. The % in each subplot's top-right corner is empirical coverage on those 50 bars (green if ≥ 90%, orange 70–89%, red < 70%).

When a tool provides a confidence interval — for example, "the incremental lift is between 5.2% and 10.1% with 95% confidence" — that interval is supposed to contain the true answer 95% of the time. Keeping that promise comes at a cost. A tool can produce wide intervals that reliably contain the true lift. But if those intervals become so wide that they contain zero, the experiment can't declare a statistically significant effect, even when a real one exists.

On the other hand, a tool can produce tighter, more decisive intervals. But those intervals will miss the true incremental lift more often and generate more false alarms. No tool in this study escapes this tradeoff.
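The tradeoff can be illustrated with a toy Monte Carlo. This is a sketch, not the inference used by any of the four tools: it assumes an unbiased estimator with normally distributed error, and the 4 pp standard error and the width multipliers are arbitrary illustrative choices. The same set of estimates is given narrower or wider intervals, and coverage and detection are recorded for each width.

```python
import numpy as np

def interval_tradeoff(true_lift=7.5, se=4.0, multipliers=(0.5, 1.0, 2.0),
                      n_runs=10_000, seed=0):
    """Toy illustration: scale the half-width of a 95% interval and record
    coverage (interval contains the true lift) and detection (excludes zero)."""
    rng = np.random.default_rng(seed)
    estimates = rng.normal(true_lift, se, size=n_runs)  # unbiased, noisy estimates
    rows = []
    for k in multipliers:
        half_width = 1.96 * se * k  # k < 1: overconfident, k > 1: conservative
        lower, upper = estimates - half_width, estimates + half_width
        coverage = np.mean((lower <= true_lift) & (true_lift <= upper))
        detection = np.mean((lower > 0) | (upper < 0))
        rows.append((k, float(coverage), float(detection)))
    return rows

for k, coverage, detection in interval_tradeoff():
    print(f"width x{k}: coverage={coverage:.1%}, detection={detection:.1%}")
```

Because the intervals are nested around the same estimates, coverage can only rise and detection can only fall as the multiplier grows. Calling `interval_tradeoff(true_lift=0.0)` turns the detection column into a false positive rate, which shrinks as intervals widen.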

Meta GeoLift

Meta GeoLift is the strongest performer on three of the four metrics under consideration: its coverage is closest to the 95% target (92–95%), it has the lowest false positive rate (3–5%), and its point estimates are closest to the true incremental lift in the baseline and short pre-period scenarios. The tradeoff is in its ability to detect real incremental effects: Meta GeoLift's confidence intervals are wide enough that they frequently contain both the true effect and zero at the same time. This conservatism can become a limitation for practitioners making decisions with output from the tool.

Causal Impact

Causal Impact sits at the opposite end. Its intervals are narrow enough to exclude zero most of the time, which is why it declares a statistically significant lift more often than any other tool, flagging the real 7.5% effect in 52–66% of runs. But intervals that narrow also exclude zero when there is no effect to find. Causal Impact fires false alarms nearly 30% of the time, and its estimates carry a consistent positive bias that shifts the whole interval high, making campaigns look more effective than they are. Causal Impact is a decisive tool, but it is often confidently wrong.

Google MM and CausalPy

Google MM and CausalPy sit in between, with coverage of 76–86% and false positive rates of 14–25%. Their confidence intervals are wide enough to avoid constant false alarms, but not so wide that every experiment ends inconclusively. The cost is that they fall short on both fronts: they miss the 95% coverage target, produce more false positives than Meta GeoLift, and have less detection power than Causal Impact.

"The numbers above tell you what each tool's tradeoff actually is. With this context, picking among them becomes a business decision rather than just a statistical one."

Finding 02

When data is scarce, these tradeoffs get worse.

The 30-day pre-treatment scenario (S4) is highly practical: it reflects a situation many practitioners have faced, where leadership wants to move faster than the data allows.

With only 30 days of pre-treatment data, Meta GeoLift and Google MM hold up better than CausalPy and Causal Impact. Importantly, the pattern from Finding 1 doesn't ease under data scarcity — it sharpens. Conservative tools become even more conservative and yield more inconclusive experiments. Decisive tools produce more false reads that can mislead marketers.

Performance under data scarcity: S4 scenario results table

95.7%

Meta GeoLift continues to keep its 95% promise — its confidence intervals still contain the true effect 95% of the time in the S4 scenario. But those intervals are so wide that they also contain zero in 95.7% of runs, meaning most tests come back inconclusive.

14.5%

Google MM holds up well and achieves its lowest false positive rate of any scenario (14.5%), retaining more detection power than Meta GeoLift while staying reasonably well-calibrated.

>24%

CausalPy and Causal Impact show the steepest deterioration: both see false positive rates climb above 24%, meaning roughly one in four experiments with no real effect would still be declared a win under data scarcity.

The practical takeaway: 30 days of pre-treatment data is insufficient for reliable inference with any of these tools. But if you're in that situation, Meta GeoLift and Google MM give you a more defensible result.

Finding 03

All four tools get close to the right answer in the easy case.

Across non-outlier scenarios, all four tools largely recover the true 7.5% lift within a few percentage points.

ATT estimates across all conditions and tools — **Figure 3.** Effect estimates across 1,000 replications for each tool × scenario, after standardizing the data for CausalPy. Dots represent mean point estimates, whiskers span the 2.5th–97.5th percentile. The dashed line marks the true effect (7.5% left panel, 0% right panel). Note the x-axis scales differ between panels.

The exception is CausalPy in the outlier scenario (S2), where a data-scaling mismatch produces a near-6× overstatement out of the box (corrected to 6.81% after rescaling — see the Appendix for details). The outlier scenario inflates bias for every tool, but the magnitudes elsewhere stay within a few percentage points of the true lift.

Point estimates alone would tell you these tools are interchangeable. The uncertainty story tells you why they aren't.

Practitioner guidance —What this means if you're choosing a tool.

Meta GeoLift

Meta GeoLift produces the most honest confidence intervals of any tool tested: its stated ranges contain the true effect 92–95% of the time, and it almost never declares a winner that isn't real (false positive rate of 3–5%). The cost is that its intervals are very wide, which can make decision-making more difficult. They are wide enough to contain zero as well as the true 7.5% effect, so the test comes back inconclusive in 91% of baseline runs. This makes Meta GeoLift an appropriate choice when the false positive (scaling budget behind a channel that doesn't actually work) is more expensive for your business than missing a real, winning channel.

Google Matched Markets

Google Matched Markets sits in the middle. It gives up some calibration (81–86% coverage) in exchange for more decisive results, with a false positive rate of 14–19% and a tendency to overestimate lift by 1–3 percentage points. When you need experiments to produce actionable answers and can tolerate a higher rate of false alarms, Google MM is a practical default.

CausalPy & Causal Impact

CausalPy requires rescaling the input data before fitting¹ — without it, its confidence intervals are far too narrow and its estimates unreliable, particularly when the treated market is much larger than the control markets. With rescaling, it lands in a similar range to Google MM on most metrics. Causal Impact consistently overestimates lift by 2–4 percentage points and has the worst coverage of any tool tested (70–72%). If you are currently using either tool, confirm your results hold up against known ground truth before relying on them for budget decisions — and for CausalPy specifically, ensure your pipeline rescales the data first.

1 When run with default settings, CausalPy's confidence intervals are far too narrow and in the outlier scenario its estimate is off by ~6×. The root cause is that CausalPy's default prior on observation noise assumes residuals are approximately unit-scale, and most marketing KPIs aren't. We address this by rescaling the input data. Full detail and Figure 2 (before/after rescaling) are in the appendix.

"No tool fully compensates for a poorly designed experiment. When the treated market is dramatically larger than the control markets (the New York problem), confidence intervals across all four tools widen 4–5× and most tools overestimate lift by 2–4 percentage points. At that point, the choice of tool matters less than the choice of test market."

Reproduce it —The full study is open source.

All simulation code, panel generation infrastructure, and analysis scripts are publicly available. The repository includes scripts to generate panels for all four scenarios, wrapper code for each tool, orchestration for parallel execution with checkpointing, and analysis code for every metric, table, and figure in this study.

getrecast/geolift-simulation-study

Full code, panel data, analysis scripts

View on GitHub →

We built this study to be extended. Change the scenarios, adjust tool configurations, add an estimator, plug in your own data. If you find results that differ from ours, we want to hear about it.

Limitations —What this study doesn't tell you.

Every simulation study is only as good as the assumptions baked into it, and ours are no exception. Our synthetic data was designed to behave like real marketing data, but real geo-experiments add complications we didn't model — markets that move together due to national events, extreme outliers, spillover effects between neighboring cities, and so on.

It's also worth noting that some of GeoLift's strong performance in this study may reflect that our data was built in a way that happens to suit its statistical approach; under messier real-world conditions, the tool rankings could shift. Full details on these limitations are in the Appendix.

This is the first study in a series. Upcoming posts will include deep dives into each individual tool, simulations under more realistic and difficult data conditions, and eventually tests on real campaign data where the true answer isn't known in advance. If the rankings change under harder conditions, we'll report that too.

We recommend running the simulation yourself under different conditions and seeing if the findings hold.

Methodology & Full Results

Geo-experiment Simulation Study Appendix

Full methodology, fair-comparison protocol, scenario-by-scenario results, and configuration details for the 32,000-model-fit benchmark. For the narrative report, see the Executive Report.

Study Summary
How we enforced fair comparison between the tools
Methodology
A Note on CausalPy
Complete Study Results
Study Limitations
Package Versions & Parameters
Repository Structure

01 —Study summary

We benchmark four open-source geo-experiment estimation tools — CausalPy, Google Matched Markets (MM), Meta GeoLift, and Causal Impact — on synthetic panels where the true treatment effect is known. Each scenario represents an operating condition practitioners encounter: ideal conditions; an outlier treated market; a small donor pool; and a short pre-treatment window. Each tool was run under two effect conditions (7.5% lift and 0% lift).

The central finding is that no tool delivers nominal 95% coverage while reliably detecting real incremental effects. At the extremes, Causal Impact detects most real lifts (FNR 34–48%) but fires on noise nearly a third of the time (FPR 28–30%); GeoLift holds false positives near nominal (FPR 3–5%) but misses 89–96% of real effects. The two middle tools, Google MM and CausalPy, are under-calibrated and, relative to Causal Impact, under-powered.

CausalPy required data standardization to produce proper uncertainty intervals. Without it, coverage ranged from 0.3% to 14.1% and false positive rate reached at least 86% across all scenarios; standardization lifted coverage to 76–82% and cut FPR to 18–25%.
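The detection rates quoted in the report and the false negative rates quoted here are two views of the same numbers, since detection rate = 100% minus FNR. The short check below copies the FNR values from Tables 3–6 and derives each tool's detection range across scenarios; the variable names are illustrative only.

```python
# False negative rates (%) for the 7.5% effect condition, copied from Tables 3-6.
fnr = {
    "CausalPy":      {"S1": 66.2, "S2": 64.3, "S3": 65.2, "S4": 63.5},
    "Google MM":     {"S1": 57.1, "S2": 53.0, "S3": 47.3, "S4": 65.9},
    "Meta GeoLift":  {"S1": 91.3, "S2": 91.0, "S3": 89.3, "S4": 95.7},
    "Causal Impact": {"S1": 37.9, "S2": 37.2, "S3": 34.3, "S4": 47.8},
}

# Detection rate (power) is the complement of the false negative rate.
detection_range = {
    tool: (round(100 - max(by_scenario.values()), 1),
           round(100 - min(by_scenario.values()), 1))
    for tool, by_scenario in fnr.items()
}

for tool, (low, high) in detection_range.items():
    print(f"{tool}: detects the real effect in {low}-{high}% of runs")
```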

02 —How we enforced fair comparison between the tools

Comparing tools that differ in statistical model, inference mechanism, and output format requires deliberate equalization. We standardized every dimension that is not an intrinsic property of the estimator, so that observed differences in performance reflect genuine estimator behavior rather than configuration choices.

Dimension	Rule
Uncertainty level	95% across all tools
Treated / control assignment	Identical per scenario (same panel files)
Pre / post split	Identical per scenario
True ATT% (benchmark)	Computed from DGP counterfactual, same for all tools
Tool ATT% and CI	Each tool reports its own estimate and interval
Bias	ATT% minus true ATT%, where true ATT% is either 7.5% or 0%
Significance criterion	95% uncertainty interval excluding zero
False Positive Rate (FPR)	Share of null iterations where tool declares significance
False Negative Rate (FNR)	Share of effect iterations where tool fails to detect

Table 1. Fair comparison protocol applied across all tools to ensure differences in results are driven solely by the tool.

This protocol isolates three estimator-level differences:

Weight estimation: Dirichlet convex combination (CausalPy) vs Ridge ASCM (GeoLift) vs OLS aggregation (Google MM) vs spike-and-slab BSTS (Causal Impact)
Inference mechanism: Bayesian HDI (CausalPy) vs conformal block permutation (GeoLift) vs t-distribution CI (Google MM) vs Bayesian credible interval (Causal Impact)
Architectural constraints: convex hull restriction, Ridge augmentation, treated/control aggregation, structural time series decomposition
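The scoring rules in Table 1 can be sketched in a few lines of Python. This is a minimal illustration and not the repository's `analysis/compute_metrics.py`: the function name, argument names, and toy inputs are hypothetical, and it assumes each tool's output has already been reduced to a point estimate and a 95% interval in percent.

```python
import numpy as np

def score_tool(att_pct, ci_low, ci_high, true_att_pct):
    """Score one tool in one scenario/effect cell from per-iteration outputs.

    att_pct, ci_low, ci_high: point estimates and 95% interval bounds (in %).
    true_att_pct: 7.5 for the effect condition, 0.0 for the null condition.
    """
    att_pct, ci_low, ci_high = map(np.asarray, (att_pct, ci_low, ci_high))
    significant = (ci_low > 0) | (ci_high < 0)  # interval excludes zero
    covered = (ci_low <= true_att_pct) & (true_att_pct <= ci_high)
    metrics = {
        "bias_pp": float(att_pct.mean() - true_att_pct),
        "coverage": float(covered.mean()),
    }
    if true_att_pct == 0:
        metrics["fpr"] = float(significant.mean())     # null runs declared significant
    else:
        metrics["fnr"] = float((~significant).mean())  # effect runs not detected
    return metrics

# Three made-up effect-condition iterations, for illustration only.
print(score_tool([6.0, 9.0, 12.0], [-1.0, 2.0, 8.0], [13.0, 16.0, 16.0], 7.5))
```

FPR is only defined on null-condition iterations and FNR only on effect-condition iterations, which is why the function returns one or the other depending on the true effect.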

03 —Methodology

The simulation study follows a factorial design: four market scenarios × two effect conditions (null effect or 7.5% effect) × 1,000 synthetic panel iterations = 8,000 unique datasets, each evaluated by all four tools (32,000 total model fits). All tools received identical panel files; no tool saw the true treatment effect. Each tool's output was recorded and evaluated against the known DGP ground truth.

Data-generating process

Each synthetic panel is generated from a multiplicative model. For geo i at time t:

Y_cf(i,t) = baseline_i × trend_t × season_t × exp(noise_level × scale_i × ar_noise(i,t))

Where:

Y_cf(i,t) represents the counterfactual outcome level of city i at period t
baseline_i ~ LogNormal(mean=4000, sdlog=0.6), a persistent geo-level scale
trend_t = 1 + 0.001t — 0.1% daily growth shared across all geos
season_t = 1 + 0.10 × dow_profile — day-of-week seasonality (Mon–Sun pattern), shared across all geos
scale_i = √(baseline_i / mean_baseline) — square-root scaling implements a portfolio effect: larger geos have proportionally smaller noise. This prevents outliers from also amplifying noise amplitude.
ar_noise(i,t) = 0.30 × ar_noise(i,t−1) + ε_t, ε ~ N(0,1) — geo-specific AR(1) noise with autocorrelation ρ=0.30
noise_level = 0.20 — global noise scale applied to all geos

Treatment is multiplicative: Y_obs(i,t) = Y_cf(i,t) × (1 + τ) for post-period observations of the treated geo, where τ = 0 (null, 0% lift) or 0.075 (7.5% positive lift). The treated unit is the geo closest to the median baseline, ensuring it is representative of the control pool rather than an outlier — except in Scenario S2, where a 5× baseline inflation is applied to deliberately induce scale mismatch.

Key design choices: shared trend and seasonality ensure some cross-geo correlations. Multiplicative noise ensures Y is strictly positive and that noise variance scales with the geo's baseline — a common feature of sales and impression data.
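A minimal Python sketch of the process described above follows. It is not the repository's `src/R/generate_panels.R`, and it rests on several assumptions that the text does not settle: "mean=4000" is read as `exp(meanlog)`, `mean_baseline` is taken as the sample mean of the drawn baselines, day 0 is treated as a Monday, the AR(1) process starts from zero with no burn-in, and the S2 outlier inflation is left out. The function name and its arguments are illustrative.

```python
import numpy as np

# Mon-Sun profile, from the simulation parameter table.
DOW_PROFILE = np.array([-1.0, -0.5, 0.0, 0.2, 0.8, 1.0, 0.5])

def simulate_panel(n_geos=21, n_pre=90, n_post=15, tau=0.075, seed=42):
    """Return (y_obs, y_cf, treated_idx); arrays have shape (n_geos, n_days)."""
    rng = np.random.default_rng(seed)
    n_days = n_pre + n_post
    t = np.arange(n_days)

    # ASSUMPTION: "mean=4000" is interpreted as meanlog = log(4000).
    baseline = rng.lognormal(mean=np.log(4000), sigma=0.6, size=n_geos)
    trend = 1 + 0.001 * t
    season = 1 + 0.10 * DOW_PROFILE[t % 7]       # ASSUMPTION: day 0 is a Monday
    scale = np.sqrt(baseline / baseline.mean())  # coded exactly as the formula is printed

    # Geo-specific AR(1) noise with rho = 0.30 (started at zero, no burn-in).
    eps = rng.normal(size=(n_geos, n_days))
    ar = np.zeros((n_geos, n_days))
    ar[:, 0] = eps[:, 0]
    for day in range(1, n_days):
        ar[:, day] = 0.30 * ar[:, day - 1] + eps[:, day]

    y_cf = baseline[:, None] * trend * season * np.exp(0.20 * scale[:, None] * ar)

    # Treated geo: the one whose baseline is closest to the median baseline.
    treated_idx = int(np.argmin(np.abs(baseline - np.median(baseline))))
    y_obs = y_cf.copy()
    y_obs[treated_idx, n_pre:] *= 1 + tau  # multiplicative lift in the post-period
    return y_obs, y_cf, treated_idx
```

Because the lift is multiplicative, the realized ATT as a share of the counterfactual equals τ exactly for the treated geo, which is what makes the ground truth known in every iteration.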

04 —A note on CausalPy

CausalPy bakes in an assumption about how noisy your data should be before looking at it. Its default prior on observation noise, HalfNormal(sigma=1), works fine for small-scale KPIs like click-through rates. But daily revenue can swing by hundreds or thousands between cities and days, so the prior is far too tight for this application. As a result, practitioners would get distorted posteriors with confidence intervals that are narrow and overconfident in the wrong regions.

We identified two fixes. You may import pymc_extras and pass a custom sigma prior via PyMC's Prior object, but this requires navigating CausalPy's nested prior structure with little documented guidance, and it still fails in S2 where the outlier geo runs at roughly 5× baseline. Or you may rescale (e.g. standardize) the input data before fitting. Rescaling solves both problems: it makes S2 viable for standard synthetic control and brings all series to unit variance, which is exactly what the default prior assumes.

We adopted rescaling after discussion with our research team. For each scenario, we standardized each geo's series against its pre-period mean and standard deviation before fitting, then converted estimates back to the original scale afterward. We discuss results both ways below — unscaled and scaled — because the unscaled version is what most practitioners will encounter first. We've also opened a pull request to the CausalPy repository to surface this requirement in their documentation.

The standardization procedure

For each geo independently, using only the days before treatment starts, we compute the pre-period mean and standard deviation. Every day's value is then replaced with (value − mean) / std. CausalPy runs on the rescaled data. Afterward, we undo the rescaling: the estimated treatment effect on rescaled units is multiplied by the treated geo's pre-treatment standard deviation to convert back to the original scale.
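The procedure above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the code in `src/python/run_causalpy.py`: the helper names are hypothetical, the panel is assumed to be a geos × days array, and the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator was used. The CausalPy fit itself is omitted.

```python
import numpy as np

def standardize_panel(panel, n_pre):
    """Z-score each geo (row) using only its pre-period mean and standard deviation."""
    panel = np.asarray(panel, dtype=float)
    mu = panel[:, :n_pre].mean(axis=1, keepdims=True)
    sd = panel[:, :n_pre].std(axis=1, ddof=1, keepdims=True)  # ASSUMPTION: sample SD
    return (panel - mu) / sd, mu.ravel(), sd.ravel()

def effect_to_original_scale(effect_std, sd_treated):
    """Convert an effect (or interval bound) from standardized units to level units."""
    return effect_std * sd_treated

# Toy panel with two geos on very different scales, 90 pre-period days + 15 post.
rng = np.random.default_rng(0)
panel = rng.normal(loc=[[4000.0], [20000.0]], scale=[[300.0], [1500.0]], size=(2, 105))
z, mu, sd = standardize_panel(panel, n_pre=90)

# By construction, each geo's pre-period has mean 0 and standard deviation 1.
print(z[:, :90].mean(axis=1), z[:, :90].std(axis=1, ddof=1))
# A hypothetical effect of 0.5 standardized units on geo 0, back in level units.
print(effect_to_original_scale(0.5, sd[0]))
```

Only a scale factor is needed on the way back because a treatment effect is a difference between observed and counterfactual values, so the subtracted mean cancels.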

Impact of standardization

Standardization resolves the S2 outlier blowup and widens intervals roughly 8–13×, improving coverage from 0.3–14% to the 76–82% range observed in our final results. In the rest of this appendix, we contrast some results obtained without standardization to illustrate the magnitude of the problem.

Specific before/after numbers: in S1, coverage went from 13.4% to 80.3% and FPR from 86.6% to 19.8%; CI widths widened roughly 8× (from 95.42 to 749.10 daily-level units). In S2, ATT fell from 45.2% (bias +37.7 pp) to 6.81% (bias −0.69 pp), coverage rose from 0.3% to 82.1%, and FPR dropped from 99.7% to 18.2%. The pattern held across all scenarios, with CI widths widening 8–13×.

05 —Complete study results

Per-scenario results follow, with each scenario's table, key observations, and the impact of CausalPy standardization. Figures 2 and 3 at the end of this section visualize ATT estimates and per-iteration uncertainty intervals across all conditions.

Scenario S1 · the textbook case

Tool	Avg ATT (%)	Bias (pp)	Coverage	FNR	FPR	Avg CI Width
CausalPy (y_hat)	6.50	−1.00	80.30%	66.20%	19.80% (avg lift: −0.90%)	749.10
Google MM	9.51	+2.01	83.10%	57.10%	16.90% (avg lift: +2.01%)	833.48
Meta GeoLift	7.72	+0.22	92.20%	91.30%	4.60% (avg lift: +0.22%)	2,020.07
Causal Impact	11.32	+3.82	72.20%	37.90%	27.80% (avg lift: +3.82%)	694.26

Table 3. Scenario S1 results — 1,000 iterations each for effect (7.5% true lift used to calculate Bias, Coverage and FNR) and null (no effect, used to calculate FPR) conditions.¹

1 ATT (%) = Mean estimated treatment effect as a percentage of the counterfactual across 1,000 effect-condition iterations; true value is 7.5%. Bias = ATT minus 7.5, in percentage points. Coverage = Share of effect-condition iterations where the 95% CI contains the true ATT. FNR = Share of effect-condition iterations where the interval fails to exclude zero. FPR = Share of null-condition iterations where the interval excludes zero. Significance rule: 95% HDI excludes zero (CausalPy), 95% CI excludes zero (Google MM, GeoLift, Causal Impact). CI Width = Mean interval width in daily level units, from effect-condition iterations.

Even under ideal conditions, no tool hits nominal 95% coverage. GeoLift comes closest (92.20%) but at the cost of extremely wide intervals that make it nearly unable to detect a 7.5% effect (FNR = 91.30%). Causal Impact has the most detection power (FNR = 37.90%) but the worst false positive rate (FPR = 27.80%) and the largest bias (+3.82 pp).

Before standardization, CausalPy told a very different story in S1: 13.4% coverage, 8.0% FNR, 86.6% FPR, and CI width of just 95.42 units — a profile of extreme overconfidence. Standardization widened intervals roughly 8× (to 749.10 units), lifting coverage to 80.3% but shifting FNR from 8.0% to 66.2%. The trade-off is stark: proper calibration came at the cost of the ability to detect real effects.

Scenario S2 · the outlier market

Tool	Avg ATT (%)	Bias (pp)	Coverage	FNR	FPR	Avg CI Width
CausalPy (y_hat)	6.81	−0.69	82.10%	64.30%	18.20% (avg lift: −0.59%)	3,752.54
Google MM	9.85	+2.35	81.70%	53.00%	18.30% (avg lift: +2.35%)	4,170.03
Meta GeoLift	10.72	+3.22	93.60%	91.00%	4.20% (avg lift: +3.22%)	9,954.52
Causal Impact	11.46	+3.96	70.00%	37.20%	30.00% (avg lift: +3.96%)	3,489.26

Table 4. Scenario S2 results — 5× outlier treated market.

The 5× outlier has minimal impact on relative tool ordering. All tools absorb the scale change through proportionally wider CIs. GeoLift's bias increases to +3.22 pp (from +0.22 in S1), suggesting the augmented synthetic control method picks up some of the outlier's scale. CausalPy is the least biased tool in this scenario (−0.69 pp).

This was not the case before standardization. Without it, CausalPy catastrophically failed in S2: ATT of 45.2% (bias +37.7 pp vs the true 7.5%), coverage of 0.3%, and FPR of 99.7%. The outlier's scale overwhelmed the default prior, producing wildly inflated estimates and near-zero-width intervals. Standardization resolved this entirely — ATT fell to 6.81%, coverage rose to 82.1%, FPR dropped to 18.2%.
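The proportional widening can be checked directly from the post-standardization CI widths in Tables 3 and 4. The snippet below only divides the published S2 widths by the S1 widths; the variable names are illustrative.

```python
# Average 95% interval widths (daily level units) from Tables 3 (S1) and 4 (S2).
ci_width = {
    "CausalPy":      {"S1": 749.10,  "S2": 3752.54},
    "Google MM":     {"S1": 833.48,  "S2": 4170.03},
    "Meta GeoLift":  {"S1": 2020.07, "S2": 9954.52},
    "Causal Impact": {"S1": 694.26,  "S2": 3489.26},
}

widening = {tool: w["S2"] / w["S1"] for tool, w in ci_width.items()}
for tool, ratio in widening.items():
    print(f"{tool}: S2 intervals are {ratio:.2f}x as wide as in S1")
```

All four ratios land close to 5, in line with the 5× baseline inflation applied to the treated market.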

Scenario S3 · the small donor pool

Tool	Avg ATT (%)	Bias (pp)	Coverage	FNR	FPR	Avg CI Width
CausalPy (y_hat)	6.93	−0.57	82.40%	65.20%	18.00% (avg lift: −0.46%)	773.73
Google MM	10.80	+3.30	80.90%	47.30%	19.10% (avg lift: +3.30%)	828.80
Meta GeoLift	8.53	+1.03	92.50%	89.30%	4.90% (avg lift: +1.03%)	1,926.35
Causal Impact	11.71	+4.21	71.10%	34.30%	28.90% (avg lift: +4.21%)	686.57

Table 5. Scenario S3 results — 9 control markets instead of 20.

Cutting the donor pool from 20 to 9 barely changes the picture. This is expected: the DGP generates geos with shared trend and seasonality, so even a small pool provides adequate counterfactual donors. Google MM's bias worsens to +3.30 pp, the largest degradation of any tool in this scenario.

CausalPy's pre-standardization S3 profile mirrored S1: 14.1% coverage, 86.0% FPR, CI width of 93.91 units. After standardization, coverage improved to 82.4% and FPR fell to 18.0% — confirming that donor pool size was not the binding constraint; prior calibration was.

Scenario S4 · short pre-treatment window

Tool	Avg ATT (%)	Bias (pp)	Coverage	FNR	FPR	Avg CI Width
CausalPy (y_hat)	6.45	−1.05	75.90%	63.50%	24.80% (avg lift: −0.79%)	718.74
Google MM	8.53	+1.03	85.50%	65.90%	14.50% (avg lift: +1.03%)	915.14
Meta GeoLift	7.11	−0.39	95.10%	95.70%	3.30% (avg lift: −0.39%)	5,574.99
Causal Impact	9.37	+1.87	70.50%	47.80%	29.50% (avg lift: +1.87%)	694.78

Table 6. Scenario S4 results — 30 days of pre-treatment data instead of 90.

Cutting the pre-period from 90 to 30 days is the most disruptive scenario. CausalPy's coverage drops to 75.90% and its FPR rises to 24.80%, a meaningful degradation from S1. Google MM improves on bias (+1.03 pp, down from +2.01 in S1) and achieves its best FPR (14.50%) across all scenarios, suggesting that time-based regression (TBR), the model behind Google MM, benefits from a tighter calibration window. GeoLift hits nominal coverage (95.10%) but almost never detects a real effect: its FNR reaches 95.70%, meaning it detects the effect fewer than 5 times in 100.

Before standardization, CausalPy's S4 profile was equally dire: 9.3% coverage, 90.6% FPR, CI width of 82.42 units. Standardization brought coverage to 75.9% and FPR to 24.8% — the weakest post-standardization results across scenarios, confirming that shorter pre-periods compound the difficulty of prior calibration.

ATT estimates across all conditions

ATT estimates before standardization — **Figure 2.A (before standardizing CausalPy data).** Effect estimates across 1,000 replications for each tool × scenario. Dots represent mean point estimates, whiskers span the 2.5th–97.5th percentile. Dashed line marks the true effect (7.5% left panel, 0% right panel). Note the x-axis scales differ between panels.

ATT estimates after standardization — **Figure 2.B (after standardizing CausalPy data).** Same as Figure 2.A but with standardized data for CausalPy, allowing direct visual comparison across all four tools.

In the left panel of Figures 2.A and 2.B (effect condition), Causal Impact consistently overshoots, and GeoLift's simulation intervals are the widest, consistent with its conservative conformal inference. After standardization (Figure 2.B), CausalPy centers within about 1 pp of the true 7.5%, with a slight negative bias. In the right panel (null condition), a well-calibrated tool should center on zero with intervals that rarely exclude it. Causal Impact's mean drifts +1.87 to +4.21 pp above zero across all scenarios, the same positive bias that inflates its false positive rate.

Uncertainty intervals up close

Per-iteration confidence intervals — **Figure 3.** Uncertainty intervals per tool × scenario cell, 7.5% effect condition — 50 iterations from the 1,000 simulation runs. Dashed vertical line marks the true ATT = 7.5%. Green = CI contains 7.5%; red = it misses. The percentage in each subplot is empirical coverage (green ≥ 90%, orange 70–89%, red < 70%).

GeoLift's columns are almost entirely green (92–95%), confirming its conservative calibration. Causal Impact shows heavy red, especially in S2 and S4 — roughly 30% of its credible intervals miss the truth under the outlier and short pre-treatment scenarios. CausalPy and Google MM fall in between, with coverage in the 76–86% range.

06 —Study limitations

Every simulation study has blind spots. Here are ours:

Single DGP. All geos share identical trend and seasonality; only baselines and noise differ. This makes counterfactual construction easier than in real data, where geos may have idiosyncratic trends. Results may be optimistic about all tools' performance.
One effect size. We test at 7.5% lift. Detection rates will differ at smaller or larger effects. A 2–3% lift — common in practice — would likely produce higher FNR across all tools.
Short post-period. 15 days of post-treatment data is shorter than many real experiments. Longer post-periods would provide more information and likely improve all tools' detection rates.
No geo-specific trends. Donors move in lockstep (up to noise). This favors synthetic control methods that rely on parallel trends. Real-world violations of this assumption would disproportionately affect tools without augmentation (CausalPy, Causal Impact).
CausalPy required standardization. The comparison is "best effort per tool" — not "out of the box." CausalPy's results reflect a non-trivial preprocessing step that other tools did not require.

07 —Package versions & parameters

Package versions

Tool / Package	Version	Source
CausalPy	0.8.0	PyPI
PyMC	5.28.1	PyPI
ArviZ	0.23.4	PyPI
Google matched_markets	commit 5e3cd95	GitHub
Meta GeoLift	v2.7.5 / commit 4d2afd4	GitHub
augsynth	commit 65c5a6f	GitHub
Causal Impact	1.4.1	CRAN
Python	3.12.8	—

Model parameters

Parameter	CausalPy	Google MM	GeoLift	Causal Impact
Confidence	95% HDI	95% CI	alpha = 0.05	alpha = 0.05
MCMC	4 chains × 1,000 + 1,000 warmup	—	—	2,000 iter
Seasonality	—	—	—	nseasons=7, duration=1
Model	Dirichlet SC	OLS TBR	Ridge ASCM	BSTS local level
Inference	HDI	t-distribution	conformal (block)	Bayesian credible

Simulation parameters

Parameter	Value
Geos (default)	1 treated + 20 control
Total days (default)	105 (90 pre + 15 post)
Baseline mean	4,000
Baseline spread (sdlog)	0.6
Trend slope	0.001/day (0.1%)
Seasonality amplitude	0.10
DOW profile (Mon–Sun)	[−1.0, −0.5, 0.0, 0.2, 0.8, 1.0, 0.5]
AR(1) coefficient	0.30
Noise level (log-scale)	0.20
Outlier multiplier (S2)	5.0
Iterations per cell	1,000
Master seed	42

08 —Repository structure

Path	Description
config/tools.yaml	Tool configurations
src/R/generate_panels.R	Data-generating process
src/R/run_geolift.R	GeoLift wrapper
src/R/run_causalimpact.R	Causal Impact wrapper
src/python/run_causalpy.py	CausalPy wrapper
src/python/run_google_mm.py	Google MM wrapper
analysis/compute_metrics.py	Metric aggregation
analysis/generate_tables.py	Table generation
analysis/plot_forest.py	ATT forest plot
analysis/plot_ci_gallery.py	CI gallery figure
results/aggregated/	Final tables and metrics
results/raw/results.jsonl	Raw per-iteration results
panels/	Generated synthetic data (parquet)
figures/	All figures
