
Walk-Forward Testing: How We Validated 18 Years of Futures Data Out-of-Sample


In The Index Futures Research Behind the Algorithmic Suite, I wrote about the scale of the research behind the framework. The 6.3 million price bars. The three independent verification pipelines. The tens of billions of analytical permutations.

But scale alone does not prove anything.

A backtest that processes 18 years of data can still be worthless if the results were tuned on the same data they were measured against. That is not testing. That is fitting a curve to the past and hoping the future looks similar.

I needed to answer a harder question: does the edge survive when you test it on data it has never seen?

This post is about how I answered that question.

The problem with most backtests

Here is how backtesting typically works in the trading indicator space.

Someone builds an indicator. They run it across historical data. They adjust settings until the results look strong. Then they publish those results as evidence that their tool works.

The problem is obvious once you see it. The indicator was adjusted to fit the data it was tested on. Of course it looks good on that data. It was built to.

That is not validation. That is overfitting, one of the most well-documented failure modes in quantitative finance. A strategy that is overfit to historical data will almost always degrade when exposed to new, unseen market conditions. The tighter the fit to the past, the worse the failure in the future.

I look at this from a different angle in the post on Monte Carlo simulation and statistical significance, where we tested whether the results could be explained by chance alone.

Every serious quantitative researcher knows this. Every institutional trading desk knows this. And yet most retail indicator vendors either do not understand it or choose to ignore it.

I was not willing to do either.

What walk-forward testing actually is

Walk-forward analysis is the standard methodology used by quantitative funds to validate trading strategies. The concept is simple. The execution is not.

You split your data into two parts. The first part — the training set — is the data you use to develop and evaluate your framework. The second part — the test set — is data that was completely excluded from development. The framework never saw it. No decisions were made using it. No parameters were adjusted based on it.

Then you run your framework on the test set and see what happens.

If the results hold, the edge is likely real. If they degrade significantly, the framework was probably overfit to the training data. If they collapse entirely, you know exactly what you had: noise dressed up as signal.

That is the test. There is no way to game it without lying about the data splits. The test set either confirms the edge or it does not.
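The core mechanics can be sketched in a few lines of Python. The trade records and the cutoff year here are illustrative placeholders, not the actual study data:

```python
# Hypothetical trade records as (year, won) pairs. Real signal outcomes
# would come from the backtest engine; these are placeholders.
trades = [(2010, True), (2012, False), (2015, True), (2021, True), (2023, True)]

def split_by_year(trades, cutoff_year):
    """Chronological split: everything before the cutoff trains, the rest tests."""
    train = [t for t in trades if t[0] < cutoff_year]
    test = [t for t in trades if t[0] >= cutoff_year]
    return train, test

def win_rate(trades):
    return sum(1 for _, won in trades if won) / len(trades)

train, test = split_by_year(trades, 2020)
# An overfit framework shows a negative gap: out-of-sample below in-sample.
gap = win_rate(test) - win_rate(train)
```

The sign of that gap is the whole test: a framework fitted to the past degrades on unseen data, and the gap goes negative.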

The single-split walk-forward test

The first test I ran was the simplest and most aggressive version of walk-forward analysis.

I split the full 18-year ES futures dataset into two halves. The training period covered 2008 through 2019 — twelve years of data. The test period covered 2020 through early 2026 — six years of data that was completely held out.

The training set included the 2008 financial crisis, the European debt crisis, the QE-driven low-volatility years, the Brexit and election shocks of 2016, the trade war whipsaws of 2019. It was a thorough representation of modern market history.

The test set included everything that came after. COVID. The meme-stock era. The fastest rate-hike cycle in 40 years. The 2022 bear market. The AI-driven rally. The current regime.

Two completely different eras. Two completely different market structures.

Results on the training set (2008-2019): a strong baseline win rate.

Results on the test set (2020-2026): an even higher win rate.

Read that again.

The edge did not degrade out-of-sample. It improved by a meaningful margin. The framework performed better on data it had never seen than on the data it was developed against.

That is the opposite of what overfit strategies do. An overfit strategy shows strong results in-sample and weaker results out-of-sample. That gap is almost always negative.

A positive gap — an edge that gets stronger on unseen data — is the hallmark of a framework that has captured something structural about how the market behaves, not something accidental about how one particular historical period happened to unfold.

Why the improvement makes sense

This result might seem counterintuitive. How can a framework perform better on data it was never tested against?

The answer has to do with the nature of the market environments in each period.

The 2008-2019 training set includes several years of extremely low volatility — 2013 through 2015, and 2017 in particular. These are periods where the ES traded in compressed ranges with subdued volume. The kind of conditions where any level-based framework has fewer opportunities and tighter margins.

The 2020-2026 test set, by contrast, is dominated by high-volatility environments. COVID produced the fastest bear market in history. 2022 brought sustained directional pressure. Even the calmer years in this period — 2021, 2023 — had more intraday range and volume than the low-vol years of the prior decade.

The Algorithmic Suite framework identifies structural price levels and reversal signals. When volatility is higher, price reaches those levels more frequently and with more force. The signals are cleaner. The interactions are more decisive.

The improvement out-of-sample is not luck. It is a consequence of the framework being tested across a period with more of the market conditions it was designed to identify.

But the critical point is this: the framework was not adjusted to account for that. The same parameters, the same logic, the same decision rules that were evaluated on the training set were applied unchanged to the test set. Nothing was tuned. Nothing was optimized. The improvement happened on its own.

The rolling walk-forward test

A single split is informative but has a limitation. It only tells you about one particular division of the data. What if the edge is concentrated in certain years and absent in others?

To answer that, I ran a rolling walk-forward analysis.

The methodology: take a 5-year rolling training window, then test on the immediately following 1-year period. Slide the window forward by one year and repeat. This produces 14 separate train-test cycles, each with its own training set and its own completely independent out-of-sample test year.
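The window arithmetic can be sketched as follows. This is a minimal sketch, assuming early 2026 counts as a final (partial) test year:

```python
def rolling_windows(first_year, last_test_year, train_span=5):
    """Yield (train_years, test_year) pairs: a fixed-length training window
    followed by the single out-of-sample year, sliding forward one year at a time."""
    windows = []
    test_year = first_year + train_span
    while test_year <= last_test_year:
        train_years = list(range(test_year - train_span, test_year))
        windows.append((train_years, test_year))
        test_year += 1
    return windows

# With data beginning in 2008 and test years running 2013 through 2026,
# this yields the 14 train/test cycles described above.
windows = rolling_windows(2008, 2026)
```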

The results: 14 out of 14 rolling windows were profitable after costs.

Every single year, when used as the out-of-sample test period, produced a positive result. Not one year — not during COVID, not during the 2022 bear market, not during the low-volatility years — failed to show a profitable edge when the framework was trained on the preceding five years and tested blind.

On average across all 14 windows, the framework performed slightly better out-of-sample than in-sample — the delta was consistently positive.

That is 14 independent confirmations that the edge is not concentrated in one lucky period. It is distributed across the entire modern history of the ES futures contract.

10-fold time-series cross-validation

Walk-forward analysis tests the framework against future data. Cross-validation tests it from a different angle: does the edge depend on any particular section of the dataset?

Standard k-fold cross-validation randomly shuffles data into folds. That does not work for time-series data because it introduces look-ahead bias — training on future data and testing on past data.

Time-series cross-validation solves this. The data is divided into 10 sequential folds in chronological order, and each fold serves as a test set with training restricted to the folds that precede it in time. No future information leaks into the past.
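The fold construction can be sketched as follows. This is the standard expanding-window variant; note that under this scheme the first fold serves only as training data, so the exact fold accounting in any given implementation may differ:

```python
def time_series_folds(n_samples, n_folds=10):
    """Split indices 0..n_samples-1 into sequential folds. For each fold k >= 1,
    train on all earlier folds and test on fold k, so no future data leaks backward."""
    fold_size = n_samples // n_folds
    boundaries = [i * fold_size for i in range(n_folds)] + [n_samples]
    splits = []
    for k in range(1, n_folds):
        train_idx = list(range(0, boundaries[k]))
        test_idx = list(range(boundaries[k], boundaries[k + 1]))
        splits.append((train_idx, test_idx))
    return splits

splits = time_series_folds(100, n_folds=10)
```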

The results: 10 out of 10 folds produced positive expected value.

Every section of the 18-year dataset, when isolated and tested against the preceding sections, showed a profitable edge. Not 8 out of 10. Not 9 out of 10. All 10.

This eliminates the possibility that one exceptionally strong period is carrying the overall result. If the 2020-2022 period were the only source of the edge, the folds covering 2008-2015 would show flat or negative results. They did not.

What the optimal parameters look like across splits

One of the clearest signs of overfitting is when the optimal parameters change dramatically between training and test sets. If a framework needs a 4-point target in one period and a 12-point target in another, the edge is not stable. It is an artifact of parameter optimization.

Across all the walk-forward and cross-validation tests, the best-performing target and stop combination remained consistent: 4-point target, 6-point stop (tp4/sl6).

The same combination that ranked highest on the full 18-year dataset also ranked highest across the majority of individual walk-forward windows and cross-validation folds. The optimal parameters did not drift. They did not require re-optimization for each period.

That stability is what separates a framework from a fitted curve.
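A simple way to quantify that stability is to ask how often each window's top-ranked combination matches the modal one. The per-window rankings below are placeholders, not the actual study output:

```python
from collections import Counter

# Hypothetical per-window winners: for each walk-forward window, the
# (target, stop) combination that ranked highest on that window.
best_per_window = [("tp4", "sl6"), ("tp4", "sl6"), ("tp6", "sl6"), ("tp4", "sl6")]

def parameter_stability(best_per_window):
    """Return the modal combination and the fraction of windows that agree with it.
    Values near 1.0 indicate stable parameters; low values suggest a fitted curve."""
    counts = Counter(best_per_window)
    modal_combo, modal_count = counts.most_common(1)[0]
    return modal_combo, modal_count / len(best_per_window)

combo, stability = parameter_stability(best_per_window)
```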

The dataset behind all of this

Every test described in this post was run on the same foundation:

  • 6,373,158 one-minute OHLC bars from 77 individual ES futures contract files

  • 89,774 qualifying first-visit signal interactions across 4,721 trading sessions

  • 18 years of continuous data from January 2008 through early 2026

  • ES futures — the E-mini S&P 500, the most liquid index futures contract in the world

  • 45 target/stop combinations evaluated for every signal

  • All results include realistic friction — commission and slippage at institutional standard rates

No simulated data. No synthetic fills. No assumptions about execution that would not hold in a live market.
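As a sketch of how friction enters the expectancy math. The win rate, commission, and slippage figures below are illustrative placeholders, not the study's actual numbers:

```python
# Hypothetical friction in index points per round trip; the actual
# commission and slippage rates used in the study are not stated here.
COMMISSION_PTS = 0.10
SLIPPAGE_PTS = 0.25

def net_expectancy(win_rate, target_pts, stop_pts,
                   commission=COMMISSION_PTS, slippage=SLIPPAGE_PTS):
    """Expected points per trade after subtracting round-trip friction."""
    gross = win_rate * target_pts - (1 - win_rate) * stop_pts
    return gross - (commission + slippage)

# Example: an assumed 65% win rate on a 4-point target / 6-point stop.
edge = net_expectancy(0.65, 4.0, 6.0)
```

The point of the subtraction is that a gross edge which cannot absorb commission and slippage is not an edge at all.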

Why this matters

Most indicator vendors never test out-of-sample. Some have never heard the term.

They optimize on the same data they show results from. They adjust parameters until the equity curve looks smooth. They screenshot the best week and post it as proof.

That is not a framework. That is a marketing exercise.

Walk-forward analysis exists specifically to catch this. It is the standard of evidence in quantitative finance for a reason. It is the test that overfit strategies cannot pass.

The Algorithmic Suite passed it.

Not barely. Not in most years. In every year. In every fold. With an edge that got stronger — not weaker — when exposed to data it had never seen.

I am not asking you to trust a screenshot. I am not asking you to trust a single day's results. I am showing you what happens when you subject a framework to the same validation methodology that quantitative funds use before risking real capital.

If you are evaluating tools for your futures trading, ask the vendor one question: have you tested this out-of-sample? If the answer is no — or if they do not know what that means — you have your answer.

The Algorithmic Suite

Midnight Grid. Quantum Vision. Turning Points.

Three indicators. One framework. Validated out-of-sample across every year of modern market history.

Available on your TradingView charts today.

Start Your 7-Day Free Trial

Algorithmic is charting software for decision support on TradingView. It is not financial advice. Trading involves risk. Outcomes depend on your rules, risk management, and execution. Past performance does not guarantee future results.