Walk-forward validation

9 seasons tested. +1.3pp improvement. Here's everything we learned.

1 June 2026 · v26.0

Walk-forward validation · 2016–2024

Walk-forward validation is the gold standard for time-series model evaluation. The idea is simple but strict: train only on data that would have been available at the time, predict the future, measure the error. Then retrain with the new data added, and predict the next season. Never look ahead. Never cherry-pick which seasons to include.

We trained on 2001–2015, predicted 2016. Measured the error. Retrained incorporating 2016. Predicted 2017. Repeated through 2024. Nine completely independent test seasons. 2,345 player-season predictions. Every season included, in order, nothing removed. Here's what we found.
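The procedure above can be sketched as an expanding-window loop. This is a minimal illustration, not the project's actual pipeline: the function name `walk_forward_splits` and its parameters are hypothetical, and the model fitting and scoring steps are left as a comment.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    first_train_year: int, first_test_year: int, last_test_year: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, test_year) pairs for an expanding-window run.

    Each test season is predicted using only seasons that ended before it.
    """
    for test_year in range(first_test_year, last_test_year + 1):
        train_years = list(range(first_train_year, test_year))
        yield train_years, test_year


splits = list(walk_forward_splits(2001, 2016, 2024))

for train_years, test_year in splits:
    # In a real run: fit on train_years, predict test_year, record the error.
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Because the training window only ever grows forward, no test season can leak into the model that predicts it.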

9 · Independent test seasons · 2016 through 2024

+1.3pp · Overall improvement · 80.6% → 81.9%

2,345 · Player-season predictions · Zero excluded

Full accuracy table

Every season, every category.

Season	PTS	REB	AST	STL	BLK	3PM	FG%	FT%	TO	Overall
2015–16	82.3%	80.7%	76.3%	78.3%	68.3%	72.7%	93.8%	94.8%	78.3%	80.6%
2016–17	82.5%	81.1%	77.1%	76.8%	74.2%	73.2%	94.0%	94.9%	78.7%	81.4%
2017–18	84.7%	82.8%	78.0%	76.9%	72.4%	75.6%	94.1%	94.7%	80.8%	82.2%
2018–19 ⚑ bubble	82.4%	82.8%	76.8%	76.9%	71.4%	72.9%	94.1%	94.3%	80.2%	81.3%
2019–20	84.2%	83.5%	78.3%	78.3%	70.6%	76.4%	94.0%	94.3%	79.1%	82.1%
2020–21	81.7%	83.1%	75.9%	77.5%	71.5%	75.0%	94.0%	94.8%	78.7%	81.4%
2021–22	83.4%	81.9%	77.6%	77.8%	69.4%	76.0%	93.6%	94.8%	81.0%	81.7%
2022–23 ★ peak	84.0%	84.1%	78.6%	79.0%	70.7%	75.8%	94.2%	95.2%	79.5%	82.3%
2023–24	83.4%	83.4%	77.1%	75.3%	73.6%	76.9%	94.4%	94.3%	78.8%	81.9%
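As a quick arithmetic check on the table, the snippet below recomputes the season-over-season change in overall accuracy from the Overall column. The values are copied from the table; the variable names are illustrative only.

```python
# Overall accuracy (%) per test season, copied from the Overall column.
overall = {
    "2015-16": 80.6,
    "2016-17": 81.4,
    "2017-18": 82.2,
    "2018-19": 81.3,
    "2019-20": 82.1,
    "2020-21": 81.4,
    "2021-22": 81.7,
    "2022-23": 82.3,
    "2023-24": 81.9,
}

seasons = list(overall)

# Change versus the previous season, in percentage points.
deltas = {
    cur: round(overall[cur] - overall[prev], 1)
    for prev, cur in zip(seasons, seasons[1:])
}
total_change = round(overall[seasons[-1]] - overall[seasons[0]], 1)

for season, delta in deltas.items():
    print(f"{season}: {delta:+.1f}pp")
print(f"first to last: {total_change:+.1f}pp")
```

Five of the eight season-to-season changes are positive and three are negative, so the +1.3pp figure is a net gain rather than an unbroken climb.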

Season diary

What happened each year.

Below is the complete diary. Each season: what changed, what worked, what we missed, and what it told us about where the model needs to improve.

2015–16 · Baseline · first test · 271 players

Accuracy: 80.6%

PTS 82% · REB 81% · AST 76% · STL 78% · BLK 68% · 3PM 73% · FG% 94% · FT% 95% · TO 78%

What we found

Baseline. The model systematically under-predicted FG% for forwards across all usage tiers, a structural bias in the initial training data. BLK at 68% was our biggest gap, reflecting the high game-to-game variance that makes blocks genuinely hard to model. Retrained with 2016 data incorporated.

Biggest miss: Rashad Vaughn FT% — projected 60.6, actual 40.0 (Δ +20.6pp). Erratic young shooter.

2016–17 · Retrained through 2016 · 260 players

Accuracy: 81.4% (+0.8pp)

PTS 83% · REB 81% · AST 77% · STL 77% · BLK 74% · 3PM 73% · FG% 94% · FT% 95% · TO 79%

What we found

Blocks jumped +5.9pp after the retrain, the best single-category improvement of the full validation run. Incorporating 2016 data helped the model understand elite shot-blockers in the modern era. FG% and FT% were both still systematically under-predicted for F/role/bench players.

Biggest miss: Dragan Bender FT% — projected 56.5, actual 76.5 (Δ −20.0pp). Significant improvement we missed.

2017–18 · Retrained through 2017 · 261 players

Accuracy: 82.2% (+0.8pp)

PTS 85% · REB 83% · AST 78% · STL 77% · BLK 72% · 3PM 76% · FG% 94% · FT% 95% · TO 81%

What we found

Clean season. No systematic bias patterns detected. PTS +2.2pp, 3PM +2.4pp, TO +2.1pp — broad gains across multiple categories. The model continued to improve as it incorporated more modern-era data. This was the first year with zero systematic misses.

Biggest miss: Hassan Whiteside FT% — projected 66.5, actual 44.9 (Δ +21.6pp). Dramatic intra-career collapse.

2018–19 · COVID bubble season · 246 players

Accuracy: 81.3% (−0.9pp) ⚑ bubble

PTS 82% · REB 83% · AST 77% · STL 77% · BLK 71% · 3PM 73% · FG% 94% · FT% 94% · TO 80%

What we found

Bubble. No home court, no crowds, neutral site for all games: conditions our model had never seen in training data. PTS dropped 2.3pp and 3PM dropped 2.7pp. This is an expected and honest result, not a model failure. We flagged 2019-20 as a reduced-weight season in subsequent training and did not try to "fix" the model on unprecedented data.

Biggest miss: Thabo Sefolosha FT% — projected 71.0, actual 37.5 (Δ +33.5pp). Biggest miss of the full run.

2019–20 · Post-COVID bounce-back · 232 players

Accuracy: 82.1% (+0.8pp)

PTS 84% · REB 84% · AST 78% · STL 78% · BLK 71% · 3PM 76% · FG% 94% · FT% 94% · TO 79%

What we found

Bounce-back after COVID. The biggest gains came in 3PM (+3.5pp), PTS (+1.8pp), AST (+1.5pp) and STL (+1.4pp) as the model incorporated bubble-year data and the subsequent return to normal play, while BLK (−0.8pp) and TO (−1.1pp) slipped. FG% and FT% were both systematically under-predicted for F players and star usage tiers, a recurring pattern that motivates improved role-aware features.

Biggest miss: Rajon Rondo FT% — projected 64.1, actual 94.1 (Δ −30.0pp). Career-high FT% we completely missed.

2020–21 · Retrained through 2020 · 257 players

Accuracy: 81.4% (−0.7pp)

PTS 82% · REB 83% · AST 76% · STL 78% · BLK 72% · 3PM 75% · FG% 94% · FT% 95% · TO 79%

What we found

Post-COVID noise. The 2021 season was a compressed 72-game schedule with unusual fatigue patterns and roster instability. A systematic PTS under-prediction appeared for the first time: the model under-estimated scoring for F/role/bench players across all usage tiers. BLK improved +0.9pp.

Biggest miss: Elfrid Payton FT% — projected 63.8, actual 37.5 (Δ +26.3pp). Erratic FT% shooter.

2021–22 · Retrained through 2021 · 275 players

Accuracy: 81.7% (+0.3pp)

PTS 83% · REB 82% · AST 78% · STL 78% · BLK 69% · 3PM 76% · FG% 94% · FT% 95% · TO 81%

What we found

Steady recovery. The model settled back into a consistent accuracy band after two seasons of COVID disruption. The systematic FG% under-prediction persisted across all usage tiers. TO improved +2.3pp, the largest single-category gain this season. BLK dipped 2.1pp.

Biggest miss: Justin Holiday FT% — projected 79.6, actual 62.5 (Δ +17.1pp). FT% decline we didn't catch.

2022–23 · Peak season · 278 players

Accuracy: 82.3% (+0.6pp) ★ peak

PTS 84% · REB 84% · AST 79% · STL 79% · BLK 71% · 3PM 76% · FG% 94% · FT% 95% · TO 80%

What we found

Peak season. Broad gains across REB (+2.2pp), STL (+1.2pp), BLK (+1.3pp), AST (+1.0pp). The most balanced result across the entire walk-forward run — no systematic bias detected in any category. The archetype clustering (Layer 12) appears to have contributed significantly to the REB and BLK improvements.

Biggest miss: Reggie Bullock Jr. FT% — projected 77.4, actual 100.0 (Δ −22.6pp). Perfect FT% season we underestimated.

2023–24 · Retrained through 2023 · 265 players

Accuracy: 81.9% (−0.4pp)

PTS 83% · REB 83% · AST 77% · STL 75% · BLK 74% · 3PM 77% · FG% 94% · FT% 94% · TO 79%

What we found

Minor dip after peak. BLK +2.9pp — the category continued its positive trend from the archetype improvements. PTS systematic under-prediction re-appeared for F/role/bench players. FG% and FT% slightly positive. The model's final state reflects training through 2024, ready for 2025-26 projections.

Biggest miss: Tristan Thompson FT% — projected 55.3, actual 23.3 (Δ +32.0pp). Severe late-career FT% collapse.

Key takeaways

What 9 seasons of honest testing taught us.

+1.3pp improvement over 9 seasons

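Walk-forward retraining works. Overall accuracy rose from 80.6% to 81.9% as the model saw more seasons, although three of the eight season-to-season steps were declines (2018–19, 2020–21 and 2023–24). The recency-weighted training (0.95^(2025-year)) ensures recent patterns dominate over outdated ones.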
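A minimal sketch of the recency weighting quoted above, assuming the weight is applied per training season as a sample weight. The function name, the `reference_year` parameter, and the optional `season_multipliers` argument are hypothetical; the last is a stand-in for the reduced-weight flag on the bubble season, whose actual factor is not stated here.

```python
from typing import Dict, Optional


def recency_weight(
    year: int,
    decay: float = 0.95,
    reference_year: int = 2025,
    season_multipliers: Optional[Dict[int, float]] = None,
) -> float:
    """Return decay ** (reference_year - year), optionally scaled per season."""
    weight = decay ** (reference_year - year)
    if season_multipliers:
        # Hypothetical: extra down-weighting for flagged seasons.
        weight *= season_multipliers.get(year, 1.0)
    return weight


# Weights for the training seasons 2001 through 2024.
weights = {year: recency_weight(year) for year in range(2001, 2025)}
print(f"2024: {weights[2024]:.3f}, 2001: {weights[2001]:.3f}")
```

Under this formula a 2024 observation carries weight 0.95 while a 2001 observation carries roughly 0.29, so the oldest season still counts for a little under a third of the most recent one.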

BLK is the persistent hardest category

BLK accuracy ranges from 68% to 74% across all 9 seasons. High game-to-game variance makes it genuinely difficult to predict. This motivates the confidence interval work in Task 16: a wide CI for BLK means the model is being honest about its uncertainty.

FG%/FT% are consistently the strongest

Percentage categories are inherently more stable year-to-year. A player who shot 45% last season will likely shoot 43-47% next season. The model learned this quickly and maintains 93-95% accuracy across all seasons.

The bubble year was not a model failure

The 2019-20 accuracy dip was expected and honest. No home court, no crowds, neutral site — unprecedented conditions. We flagged it as reduced-weight in subsequent training rather than trying to overfit the model to once-in-a-century events.

FT% produces the biggest individual misses

The worst single-player errors are almost always FT%, which produced the biggest miss in every one of the nine test seasons. Whiteside, Rondo, Thompson: erratic shooters who experienced dramatic intra-career collapses or spikes. This motivates the ft_pct_sustainability feature and, eventually, a player-specific volatility prior.
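To make the sign convention in the diary explicit: each Δ is projected minus actual, so a positive value means the model over-projected. The snippet below recomputes the nine per-season biggest misses listed in the diary; the tuple layout and variable names are illustrative only.

```python
# (player, projected FT%, actual FT%) for each season's biggest miss.
misses = [
    ("Rashad Vaughn", 60.6, 40.0),
    ("Dragan Bender", 56.5, 76.5),
    ("Hassan Whiteside", 66.5, 44.9),
    ("Thabo Sefolosha", 71.0, 37.5),
    ("Rajon Rondo", 64.1, 94.1),
    ("Elfrid Payton", 63.8, 37.5),
    ("Justin Holiday", 79.6, 62.5),
    ("Reggie Bullock Jr.", 77.4, 100.0),
    ("Tristan Thompson", 55.3, 23.3),
]

# Delta = projected - actual, in percentage points.
deltas = {name: round(proj - act, 1) for name, proj, act in misses}

# Largest error by magnitude, regardless of direction.
worst = max(deltas, key=lambda name: abs(deltas[name]))
over_projected = [name for name, delta in deltas.items() if delta > 0]

print(worst, deltas[worst])
print(f"{len(over_projected)} of {len(misses)} were over-projections")
```

Six of the nine misses are over-projections (collapses the model did not anticipate) and three are under-projections (spikes it did not anticipate).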
