How We Built a 16-Point Validation Suite Before Releasing a Single Indicator

A beautiful equity curve built on corrupt data is worthless.
I learned this the hard way. Not from a blown account or a failed launch, but from a bug that lived inside my own backtesting engine for weeks before I caught it. A measurement error that inflated one of the key risk metrics ninefold. The backtest still ran. The numbers still looked plausible. Nothing crashed. Nothing threw an error. The results were just wrong.
That experience changed how I build. It is the reason the Algorithmic Suite framework now passes through a 16-point validation suite before any result is considered publishable. Before any equity curve is drawn. Before any win rate is reported. Before any of the numbers I have shared in this series were committed to a page.
This post is about that validation suite. Every test, what it catches, and why it exists.
Why validation comes before results
Here is how most indicator vendors build and publish a backtest.
They write the code. They run the backtest. They look at the results. If the numbers look good, they publish them. If the numbers look bad, they adjust the parameters and run it again. The backtest is both the test and the proof. There is no independent verification layer between the code and the claim.
That is not research. That is confirmation bias with a compiler.
Real validation means building a separate process — independent of the backtest itself — that checks whether the data is complete, the logic is correct, the results are internally consistent, and the trades can be reconstructed from raw market data. It means designing tests that are specifically built to catch the kinds of errors that backtesting engines naturally produce.
I built 16 of them, organized into six phases. Each phase targets a different category of failure.
Phase 1 — Code Review
The first two tests check whether the backtesting engine itself is doing what it claims.
Test 1: Signal pagination check. The framework pulls 143,538 signals from the database. But database APIs paginate results. If the query silently truncates at 10,000 or 50,000 rows — and many do — you are running a backtest on a fraction of your data without knowing it. This test verifies that all 143,538 signals are loaded. No truncation. No silent API limits.
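A minimal sketch of what such a pagination guard can look like. The client shape is an assumption; `fetch_all` and `fake_backend` are illustrative names, not the framework's actual database code:

```python
# Hedged sketch of Test 1 (assumed client shape): page through a
# row-limited API until a short page arrives, so nothing is silently
# truncated at an API limit.

def fetch_all(fetch_page, page_size=1000):
    """fetch_page(offset, limit) -> list of rows. Loops until exhausted."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:  # a short page means no more data
            return rows
        offset += page_size

# Demo backend serving 2,500 fake rows at most 1,000 at a time.
data = list(range(2_500))

def fake_backend(offset, limit):
    return data[offset:offset + limit]
```

The final assertion in the real suite would compare the loaded count against the expected 143,538.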
This sounds trivial. It is the kind of thing you assume is working. It is also the kind of thing that, when it fails, corrupts every downstream result without leaving a trace.
Test 2: MAE measurement resolution. Maximum Adverse Excursion measures how far a trade moves against you before resolving. The correct measurement window is from entry to the bar where the trade hits its target or stop. Not the full window. Not the next 480 bars regardless of outcome.
This test exists because I found the bug. The original implementation measured MAE across the entire bar window, even after the trade had already resolved. The result: MAE values that were inflated by a factor of nine. The trades looked far riskier than they actually were. The median winner's MAE jumped from 0.25 points to numbers that made the framework look untradeable.
Nothing else caught it. The backtest ran fine. The win rates were correct. Only a dedicated test — one that specifically checks the measurement boundary — found it.
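To make the measurement boundary concrete, here is a minimal long-side sketch of the corrected logic next to the buggy full-window version. Prices and the bar layout are illustrative, not the engine's real schema:

```python
# Sketch of Test 2: for a long trade, MAE is measured from entry only
# until the bar where target or stop is hit, never across the full window.

def mae_long(bars, entry, target, stop):
    """bars: list of (high, low). Returns (mae_points, resolved)."""
    worst = entry
    for high, low in bars:
        worst = min(worst, low)          # adverse move for a long is downside
        if low <= stop or high >= target:
            return entry - worst, True   # stop measuring at resolution
    return entry - worst, False          # trade still open

# Buggy variant for contrast: scans ALL bars, inflating MAE after resolution.
def mae_long_buggy(bars, entry):
    return entry - min(low for _, low in bars)
```

With a trade that resolves on the second bar and a deep low afterward, the two versions diverge exactly the way the bug did: the correct MAE stays small while the buggy one balloons.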
Phase 2 — Data Integrity
Six tests verify that the underlying dataset is complete, clean, and sane.
Test 3: Row count consistency. The CSV output must contain exactly 143,538 signal rows. Not approximately. Exactly. If a single row is missing or duplicated, the count will not match and this test fails.
Test 4: Date range coverage. The dataset spans January 4, 2008 through March 23, 2026. This test verifies that every expected trading date is present. No gaps. No missing months. No silent holes where data should exist but does not.
Test 5: No weekend dates. Futures markets are closed on weekends. If the dataset contains a Saturday or Sunday date, something in the data pipeline is broken. This test is a simple sanity check that catches ingestion errors.
Test 6: Price range sanity. ES futures traded between roughly 500 and 7,500 over the 18-year sample. If any signal has a price outside that range, it is either a data error or a calculation bug. This test sets conservative bounds and flags anything outside them.
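Both sanity checks reduce to near one-liners. The bounds below follow the post (ES roughly 500 to 7,500); the function names are mine:

```python
# Sketch of Tests 5 and 6: weekend dates and out-of-range prices both
# signal a broken pipeline rather than a bad trade.
from datetime import date

def weekend_rows(dates):
    """Saturday=5, Sunday=6 in datetime.weekday()."""
    return [d for d in dates if d.weekday() >= 5]

def out_of_range_prices(prices, lo=500.0, hi=7500.0):
    return [p for p in prices if not lo <= p <= hi]
```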
Test 7: Duplicate signal detection. The same signal appearing twice in the dataset would double-count that trade in every metric. This test checks for exact duplicates across all 143,538 rows. The result: zero found.
Test 8: Outcome integrity. For every target/stop combination, wins plus losses plus open trades must equal the total trade count. If they do not add up, the outcome classification logic has a bug. This is basic arithmetic applied as a cross-check. It catches errors that summary statistics would hide.
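The arithmetic cross-check can be sketched like this. Column names and outcome labels are illustrative, not the real schema:

```python
# Sketch of Test 8: per target/stop combination, wins + losses + open
# must equal the group's total trade count.
from collections import Counter

def outcome_integrity(trades):
    """trades: list of dicts with 'target', 'stop', 'outcome' keys."""
    totals = Counter((t['target'], t['stop']) for t in trades)
    tallies = Counter((t['target'], t['stop'], t['outcome']) for t in trades)
    for (tgt, stp), n in totals.items():
        parts = sum(tallies[(tgt, stp, o)] for o in ('win', 'loss', 'open'))
        if parts != n:        # an outcome fell outside the three classes
            return False
    return True
```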
Phase 3 — Logic Verification
Five tests verify that the trading logic itself is correct. These are the tests that catch the subtle bugs — the ones where the code runs, the numbers look reasonable, but the logic is doing something it should not.
Test 9: Directional confluence check. When the framework identifies a long setup, the relevant price level must be above the current price. When it identifies a short setup, the level must be below. If a long trade references a level that is below price, the directional logic is inverted. This test checks every trade in the dataset for directional consistency.
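A hedged sketch of the per-trade check; the tuple layout is illustrative:

```python
# Sketch of Test 9: a long setup must reference a level above current
# price, a short setup a level below. Anything else is inverted logic.

def direction_consistent(side, price, level):
    return level > price if side == 'long' else level < price

def confluence_violations(trades):
    """trades: iterable of (side, price, level) tuples."""
    return [t for t in trades if not direction_consistent(*t)]
```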
Test 10: Zone blocking check. The framework uses zone-based filters to prevent redundant trades. This test verifies that the zone blocking logic is applied correctly — that trades which should be blocked are blocked, and trades that should pass through do pass through.
Test 11: No lookahead bias. This is the test that separates honest backtesting from fantasy.
Lookahead bias means using future information to make present-tense decisions. It is the most common and most dangerous error in backtesting. A framework that enters a trade based on information that was not yet available at the time of entry will always look better than it actually is.
The Algorithmic Suite framework enters trades on bar index plus one — the bar after the signal appears, not the signal bar itself. This test verifies that every single entry across 143,538 signals respects this constraint. No entry occurs on the signal bar. No future data leaks into the decision.
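The constraint itself is one comparison per trade. A minimal sketch, assuming the engine exposes signal and entry bar indices as pairs:

```python
# Sketch of Test 11: every entry must occur on the bar AFTER its signal
# bar (signal index + 1), so no future information reaches the decision.

def no_lookahead(signal_entry_pairs):
    """Pairs of (signal_bar_index, entry_bar_index)."""
    return all(entry == signal + 1 for signal, entry in signal_entry_pairs)
```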
Test 12: Interaction type forward limit. The framework classifies how price interacts with indicator levels — touches, breaks, rejections. But this classification must only look at bars within a bounded window. Plus or minus 10 bars from the signal. Not 100. Not 1,000. If the window is wrong, the interaction types are contaminated with information that was too far removed from the trade to be relevant.
Test 13: Volume MA20 warm-up. The framework uses a 20-bar moving average of volume as a filter. The first 19 bars of any session do not have enough history to compute this average. They must be marked as NaN — not available — rather than filled with a default value. If they are filled with zeros or approximations, the volume filter is corrupted for early-session trades.
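The warm-up rule can be sketched as a plain rolling mean that refuses to emit a value before it has a full window. This is my pure-Python illustration, not the engine's implementation:

```python
# Sketch of Test 13's requirement: the first window-1 bars of a session
# have no valid 20-bar average and must be NaN, never zero-filled.
import math

def volume_ma(volumes, window=20):
    out = []
    for i in range(len(volumes)):
        if i + 1 < window:
            out.append(math.nan)         # warm-up: not enough history
        else:
            out.append(sum(volumes[i + 1 - window:i + 1]) / window)
    return out
```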
Phase 4 — Model-Specific
Two tests that are specific to the framework's proprietary logic.
Test 14: Visit tracker zone-based reset. The framework tracks whether a price level has been visited before. A level that has already been tested behaves differently from one that is being approached for the first time. The visit tracker must reset correctly when price moves into a new zone. This test verifies that the reset logic triggers at the right boundaries.
Test 15: T-level structural filter. The framework carries forward certain structural levels from prior trading days. But not all levels qualify. Buy zones, sell zones, and the NY Midnight Open are excluded from carry-forward. Only structural levels are eligible. This test verifies that the filter correctly excludes the non-structural categories.
Phase 5 — Statistical Sanity
One test that checks whether the results are statistically plausible.
Test 16: No year exceeds 85% win rate. A framework with an 85% annual win rate in any single year is almost certainly overfitted to that year's conditions. Real edges in liquid futures markets do not produce those numbers consistently. This test caps the maximum allowable annual win rate and flags any year that exceeds it.
The same test checks bull/bear balance. The framework produces signals in both directions — 50.4% bullish, 49.6% bearish across the full sample. A significant imbalance would suggest directional bias in the signal generation logic. It also verifies that zone distribution spans all seven categories defined by the framework. If any category is empty, the classification logic is not working correctly.
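Both checks are simple threshold tests. The 85% cap comes from the post; the 5-point balance tolerance below is my assumption for illustration:

```python
# Sketch of Test 16: flag any year above the win-rate cap, and verify
# the bull/bear split stays near 50/50. Tolerance value is assumed.

def yearly_winrate_flags(per_year, cap=0.85):
    """per_year: {year: (wins, total)} -> list of suspicious years."""
    return [y for y, (w, n) in per_year.items() if n and w / n > cap]

def direction_balanced(bull, bear, tol=0.05):
    total = bull + bear
    return abs(bull / total - 0.5) <= tol
```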
Phase 6 — Spot Verification
The hardest tests. The ones that cannot be automated.
Test: Random trade reconstruction. I selected 10 trades at random from the 143,538 in the dataset. For each one, I went back to the raw 1-minute OHLC data — the actual market bars, not the processed output — and replayed the trade bar by bar. Entry price. Target level. Stop level. Maximum adverse excursion. Volume at entry.
All 10 matched.
That is the gold standard of backtesting validation. Not summary statistics. Not aggregate metrics. Individual trades, verified against raw data, one bar at a time. If a single trade does not reconstruct correctly, the framework has a bug that no amount of aggregate testing will find.
Test: Supabase cross-reference. The CSV output contains 143,538 signals. The database contains 161,042. The delta of 17,504 signals is explained by no-attribution signals — signals that exist in the database but do not meet the framework's minimum attribution criteria for inclusion in the backtest. This test verifies that the delta is accounted for and that no signals are silently dropped.
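The cross-reference reduces to one line of arithmetic, using the counts from the post:

```python
# Sketch of the Supabase cross-reference: the database/CSV delta must be
# fully explained by the excluded no-attribution signals.

def delta_accounted(db_total, csv_total, excluded):
    return db_total - csv_total == excluded
```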
Three implementations, one answer
There is one more layer that sits above the validation suite.
The Algorithmic Suite runs on three independent implementations. The original TradingView PineScript. A Python computation engine built from scratch. And the production database.
All three must agree.
Not approximately. Not within a tolerance band. They must produce the same values for the same inputs. If TradingView shows a level at 5,250.00, the Python engine must compute 5,250.00, and the database must store 5,250.00. If any of the three disagree, something is wrong and it must be found before any result is published.
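The agreement rule, sketched with illustrative level names rather than the real pipeline outputs, is exact equality with no tolerance band:

```python
# Sketch of the three-way check: PineScript, Python engine, and database
# must produce identical values for the same inputs.

def implementations_agree(pine, python_engine, database):
    return pine == python_engine == database
```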
The verification process is documented in detail in The Index Futures Research Behind the Algorithmic Suite. The Midnight Grid verification achieved 100% match across all three pipelines. Quantum Vision matched at 95.4%, with the gap explained entirely by Sunday session labeling differences. Turning Point price accuracy was 100% across all matched signals.
Maintaining three independent implementations is expensive. It triples the development effort. It requires keeping three separate codebases in sync. It is the kind of work that is easy to skip.
It is also the kind of work that catches bugs that nothing else will find.
The results
15 pass. 1 review. 0 fail.
The single review item was not a failure — it was a metric that warranted a closer look and was confirmed correct after manual inspection.
That is the validation suite that sits between the Algorithmic Suite backtesting engine and every number published in this series. Every win rate. Every cost calculation. Every regime test. Every subsample stability result. Every walk-forward window. Every machine learning comparison. Every time-of-day finding. Every equity curve analysis. Every signal quality decomposition. Every Bonferroni-corrected significance test.
The data was validated before the analysis began.
The work that is easy to skip
This is not glamorous work. Building a validation suite does not produce a better equity curve. It does not generate a more impressive win rate. It does not make for a compelling screenshot.
What it does is give you the ability to stand behind your numbers.
Most indicator vendors skip all of it. They trust their code, run the backtest, and publish the results. No validation. No integrity checks. No reconstruction. No independent verification.
I understand why. The validation suite took longer to build than the backtesting engine itself. It occasionally broke things I thought were working. It once showed me that a metric I had been reporting was inflated ninefold.
But that is exactly the point. The validation suite is not there to make the results look better. It is there to make sure the results are real.
That is the difference between research and guessing.
The Algorithmic Suite
Midnight Grid. Quantum Vision. Turning Points.
Three indicators. One framework. 16 validation tests.
Algorithmic is charting software for decision support on TradingView. It is not financial advice. Trading involves risk. Outcomes depend on your rules, risk management, and execution. Past performance does not guarantee future results.

