Bonferroni Correction: Why Testing 45 Combinations Requires Statistical Discipline
I tested 45 target/stop combinations across 89,774 signals.
That is a problem.
Not because testing many combinations is wrong. Because testing many combinations and then presenting the best one — without adjusting for the fact that you tested many — is one of the most common forms of statistical fraud in retail trading research.
It is called the multiple comparisons problem. And if you do not correct for it, your backtest is lying to you.
The multiple comparisons problem, explained simply
Here is the intuition.
Flip a fair coin 20 times. You expect roughly 10 heads, and the chance of getting 15 or more is only about 2%. But flip 20 separate coins 20 times each, and the chance that at least one of them lands heads 15 or more times jumps to roughly a third. Not because any individual coin is biased. Because you ran enough trials that an extreme result was likely to appear somewhere.
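The arithmetic is easy to check exactly with the standard library (my own illustration, not from the original post):

```python
from math import comb

# P(15 or more heads in 20 flips of one fair coin)
p_single = sum(comb(20, k) for k in range(15, 21)) / 2**20

# P(at least one of 20 independent coins shows 15+ heads)
p_any = 1 - (1 - p_single) ** 20

print(f"one coin:  {p_single:.4f}")   # about 0.02
print(f"any of 20: {p_any:.4f}")      # about a third
```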
The same thing happens in backtesting.
Test one parameter combination. If it shows a statistically significant edge, you have learned something. Test 45 combinations. If the best one shows a significant edge, you may have learned nothing. You may have just found the coin that happened to land heads 15 times.
This is what the trading industry calls data mining bias. Academics call it the multiple comparisons problem. The colloquial term is p-hacking. Whatever you call it, the mechanism is the same. Test enough variations and you will find significance by chance alone.
The Algorithmic Suite research framework tests 45 target/stop combinations, with targets drawn from 3 to 15 points and stops from 2 to 6 points. Every combination is evaluated across every signal in the full 18-year dataset.
89,774 signals multiplied by 45 combinations equals over 4 million individual trade evaluations.
I needed to know that the results were real. Not an artifact of testing many things and picking the winner.
What the Bonferroni correction does
The standard threshold for statistical significance is alpha equals 0.05. That means you accept a 5% chance of calling a result real when it is actually random.
But that 5% applies to a single test. When you run 45 tests, the probability that at least one produces a false positive is not 5%. It is much higher. Under independence, it approaches 90%.
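That figure follows directly from the family-wise error rate formula under independence:

```python
alpha, m = 0.05, 45

# Probability that at least one of m independent tests is a false positive
fwer = 1 - (1 - alpha) ** m
print(f"family-wise error rate: {fwer:.1%}")  # about 90%
```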
The Bonferroni correction is the strictest widely-used solution. It divides the significance threshold by the number of tests performed.
For 45 combinations: 0.05 divided by 45 equals 0.0011.
That is the corrected threshold. Instead of asking "is this result significant at 5%?", you ask "is this result significant at 0.11%?" Every combination must clear that bar individually, or it does not count.
This is deliberately conservative. The Bonferroni correction controls the family-wise error rate no matter how the tests relate to one another, so it guards against the worst case. In practice, adjacent parameter combinations (like a 7-point target versus an 8-point target) are highly correlated, which means the effective number of independent tests is smaller than 45 and the true required threshold is probably less strict. But I wanted the strictest standard. If the results survive Bonferroni, they survive anything.
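A minimal sketch of the mechanics, using a one-sided normal-approximation test of a win rate against a break-even rate. The sample size matches the post; the 40% break-even rate and the observed win count are illustrative assumptions, not figures from the dataset:

```python
from statistics import NormalDist

def win_rate_pvalue(wins: int, n: int, p0: float) -> float:
    """One-sided p-value for H0: true win rate == p0 (normal approximation)."""
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (wins / n - p0) / se
    return 1 - NormalDist().cdf(z)

n_tests = 45
alpha_corrected = 0.05 / n_tests          # roughly 0.0011

# Hypothetical combination: 89,774 signals, ~40.6% observed vs 40% break-even
p = win_rate_pvalue(wins=36_448, n=89_774, p0=0.40)
print(f"p = {p:.6f}, corrected threshold = {alpha_corrected:.4f}")
print("significant after Bonferroni" if p < alpha_corrected else "not significant")
```

Every one of the 45 combinations must clear `alpha_corrected` on its own; passing the uncorrected 0.05 bar counts for nothing.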
The results: 34 of 45 passed
Thirty-four of forty-five target/stop combinations are statistically significant at the Bonferroni-corrected threshold of 0.0011.
That is more than three quarters of the grid.
This is not a case where one lucky combination squeaked past the standard threshold and everything else was noise. The vast majority of the parameter space produces a genuine, statistically verified edge. The signal is not confined to a narrow band of parameters. It is broad.
The 11 combinations that did not pass at the Bonferroni level are concentrated in the extremes of the grid — very tight stops paired with very large targets, where the base rate of winning is structurally low. These combinations are not necessarily unprofitable. They simply did not clear the most conservative statistical threshold after correcting for 45 simultaneous tests.
That distinction matters. Bonferroni does not say those 11 combinations are random. It says you cannot be 99.89% certain they are not random. Given the strictness of the correction, that is an honest answer rather than a damning one.
But the 34 that passed? Those are clean.
Why this matters more than most traders realize
Here is what a typical retail backtest looks like.
Someone tests 50 or 100 parameter combinations. They find the one that produces the highest win rate or the smoothest equity curve. They present that single result as "the strategy." They might mention they tested other values, but the selection process — the fact that they chose the best from many — is treated as due diligence rather than what it actually is: a source of bias.
Without multiple testing correction, that process is statistically invalid. Even if every individual test was conducted honestly, the act of selecting the best from a large set inflates the apparent significance. You are not reporting a result. You are reporting a maximum. And maxima from large samples are biased upward by definition.
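The inflation is easy to demonstrate. A quick simulation (my own illustration, assuming each combination's test statistic is standard normal under the null of no edge) shows how often the best of 45 edgeless strategies looks "significant" before and after correction:

```python
import random
from statistics import NormalDist

random.seed(0)
nd = NormalDist()
m, trials = 45, 2000          # combinations per grid search, simulated searches

raw_hits = bonf_hits = 0
for _ in range(trials):
    # Under the null (no edge anywhere), each combination's z-stat is ~ N(0, 1)
    best_z = max(random.gauss(0, 1) for _ in range(m))
    p_best = 1 - nd.cdf(best_z)
    raw_hits += p_best < 0.05         # best combo passes the naive threshold
    bonf_hits += p_best < 0.05 / m    # best combo passes the corrected one

print(f"best-of-45 passes alpha = 0.05:   {raw_hits / trials:.0%}")
print(f"best-of-45 passes Bonferroni:     {bonf_hits / trials:.0%}")
```

With no real edge anywhere, the uncorrected best passes about nine times out of ten; after Bonferroni it falls back to roughly the nominal 5%.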
The Bonferroni correction eliminates this. If the best combination passes at the corrected threshold, the result is real regardless of how many other combinations you tested. If 34 of 45 pass, you are not looking at a single lucky draw. You are looking at a framework where the edge is embedded in the structure, not the parameters.
That is the difference between a curve-fit strategy and a genuine statistical framework.
The same-bar tie-break test
There is a subtler integrity check that most backtests never consider.
When evaluating a 1-minute bar series, it is possible for both the target and the stop to be hit on the same bar. The bar's high might reach the target while the bar's low reaches the stop. On a 1-minute resolution, you cannot know which happened first. The intra-bar path is unknown.
This creates an ambiguity. Do you count it as a win or a loss?
The conservative approach is to count same-bar events as losses. The question is whether this choice materially affects the results.
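In code, the conservative rule is a single branch in the trade-resolution loop. This is my own sketch of the idea (long side only; the field names are assumptions, not the framework's actual implementation):

```python
def resolve_long(bars, entry_idx, target, stop):
    """Walk forward from entry; if target and stop are both touched on the
    same bar, count it as a loss (intra-bar path unknown -> worst case)."""
    for bar in bars[entry_idx:]:
        hit_target = bar["high"] >= target
        hit_stop = bar["low"] <= stop
        if hit_target and hit_stop:
            return "loss"        # conservative same-bar tie-break
        if hit_stop:
            return "loss"
        if hit_target:
            return "win"
    return "open"                # neither level reached before data ends

# A single unusually wide bar that touches both levels resolves conservatively:
bars = [{"high": 105.0, "low": 95.0}]
print(resolve_long(bars, 0, target=104.0, stop=96.0))  # loss
```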
I measured it. The impact of same-bar tie-break treatment on win rate is negative 0.07 percentage points.
Less than one-tenth of a percentage point. Negligible.
Same-bar events occur in only 0.1% to 0.3% of all trades, depending on the combination. They are rare by nature — for both levels to be reached on a single 1-minute bar, the bar must be unusually wide. And when they do occur, the conservative treatment barely moves the needle.
This is the kind of test that does not produce an impressive number. It produces a reassuring one. The framework's results are not sensitive to intra-bar resolution assumptions.
No lookahead bias: entry on bar index plus one
One more integrity check that seems obvious but is violated constantly.
When a signal fires on bar N, the entry must be on bar N plus 1. Never on the signal bar itself. The signal bar's close is not known until the bar completes, so entering on the signal bar means trading on information you do not have yet.
This is lookahead bias. It is the most common error in retail backtesting, and it inflates results in a way that is invisible unless you specifically check for it.
The Algorithmic Suite framework enters on the bar after the signal. One hundred percent of trades. No exceptions. The signal bar is used only for signal detection, never for entry. The first price the framework trades is the open of the next bar — a price that is genuinely available in real time.
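The rule is mechanical: signal detection may look at bars up to and including bar N, but the fill uses bar N+1's open. A toy sketch (the bar structure and signal are hypothetical, not the framework's):

```python
def backtest_entries(bars, signal):
    """Collect entries with no lookahead: a signal on bar i is acted on
    at bar i+1's open. `signal` sees only completed bars through bar i."""
    entries = []
    for i in range(len(bars) - 1):          # the last bar has no next bar
        if signal(bars[: i + 1]):           # history up to and including bar i
            entries.append((i + 1, bars[i + 1]["open"]))
    return entries

# Toy signal: fire when a bar closes above the prior bar's close.
bars = [
    {"open": 100.0, "close": 100.5},
    {"open": 100.5, "close": 100.2},
    {"open": 100.2, "close": 101.0},   # signal fires here (bar index 2)
    {"open": 101.1, "close": 101.3},   # entry at this bar's open
]
up_close = lambda hist: len(hist) >= 2 and hist[-1]["close"] > hist[-2]["close"]
print(backtest_entries(bars, up_close))   # [(3, 101.1)]
```

Passing `bars[: i + 1]` rather than the full list makes the no-lookahead constraint structural: the signal function cannot see a bar that has not completed.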
This is not a feature. It is a minimum standard. But it is a standard that a surprising number of published backtests do not meet.
What honest research looks like
I want to be direct about why these corrections exist in the framework.
They are not marketing. They are not optional enhancements that make the numbers look more rigorous. They are the baseline requirements for any quantitative research that claims statistical validity.
If you test multiple parameter combinations, you must correct for multiple comparisons. If you select the best from a grid search, you must apply Bonferroni or an equivalent correction. If your backtest resolves trades at bar-level granularity, you must measure same-bar sensitivity. If your signal detection and entry happen on the same data, you must verify there is no lookahead.
These are not high standards. They are minimum standards. The problem is that in retail trading, minimum standards are rarely applied.
The Algorithmic Suite research framework applies them automatically. Every grid search runs a Bonferroni correction. Every backtest enters on bar plus one. Every result is published with the correction applied, not before.
That is not because I am trying to make the results look conservative. It is because results without these corrections are not results. They are noise with a confidence interval.
The full picture
This post covers one piece of the validation framework. Bonferroni correction handles the multiple comparisons problem — the risk that testing 45 combinations produces false positives.
But multiple testing correction is one of many layers. The framework also applies Monte Carlo simulation to test whether the equity curve could have arisen from randomized entries. Walk-forward testing to verify that in-sample optimization holds out of sample. Subsample stability to confirm the edge is distributed, not concentrated. And year-by-year performance analysis to verify consistency across 18 different market regimes.
Each test answers a different question. Together, they form a validation chain where every link must hold.
Thirty-four of forty-five parameter combinations passed the strictest multiple testing correction available. That is not a single lucky result. That is a framework where the edge is structural.
The Algorithmic Suite
Midnight Grid. Quantum Vision. Turning Points.
Three indicators. One framework. 4 million trade evaluations, Bonferroni-corrected to a threshold of 0.0011.
Algorithmic is charting software for decision support on TradingView. It is not financial advice. Trading involves risk. Outcomes depend on your rules, risk management, and execution. Past performance does not guarantee future results.

