From total beginner to working practitioner —
one chapter a day, concepts that stick
Based on Hyndman et al. (2025), Forecasting: Principles and Practice, the Pythonic Way — CC BY-NC-ND 4.0
This book teaches modern time series forecasting using the nixtlaverse Python toolkit. Every chapter opens with a real story that makes you need the concept before you learn its name. Every analogy is designed to stay in your head permanently — like a pink elephant you can't un-see.
Four phases: See Clearly (Days 1–6), Two Workhorses (Days 7–14), Real-World Power (Days 15–22), and Modern Edge (Days 23–30). A three-day capstone builds a production pipeline from scratch.
This book is a derived educational work based on:
Hyndman, R.J., Athanasopoulos, G., Garza, A., Challu, C., Mergenthaler, M., & Olivares, K.G. (2025).
Forecasting: Principles and Practice, the Pythonic Way.
OTexts: Melbourne, Australia. otexts.com/fpppy
The nixtlaverse Python libraries (StatsForecast, MLForecast, NeuralForecast) are open-source tools built by Nixtla and available under the Apache 2.0 licence.
Original code examples in this repository are released under the MIT Licence. The educational text content is not under MIT — it retains the CC BY-NC-ND 4.0 licence of the source work. You may share this freely for non-commercial purposes with full attribution. Commercial publication requires written permission from the original authors.
CC BY-NC-ND 4.0 Original code: MITYou already do this before you leave the house every morning
A NASA engineer once said: 'Give me the last 30 days of temperature readings and I'll tell you next week's weather better than a meteorologist who just guessed.' He wasn't bragging. He was describing the single most powerful insight in all of forecasting.
Before you leave home, you glance at the sky. Clouds? Bring an umbrella. Sunshine? Leave it behind. Without realising it, you just made a forecast — you used something you could see (the clouds) to predict something you couldn't see yet (the rain). Forecasting with numbers works exactly like that, just with a spreadsheet instead of a window.
Imagine writing down the temperature outside every single morning for a whole year. January 1st: 4°C. January 2nd: 6°C. January 3rd: 3°C. And so on, all the way to December 31st.
That list of numbers, collected one by one over time, is called a time series. "Time series" is just a fancy way of saying "a list of numbers that changes as time passes." Your height measured every birthday is a time series. The number of customers in a café each hour is a time series. Daily rainfall amounts are a time series.
Forecasting means using the patterns in that list to make your best guess about what comes next — what will the temperature be tomorrow? How many customers next Tuesday? How much rain next week?
Throughout this course, we use a set of free Python tools called the nixtlaverse, built by a company called Nixtla. Think of it as a toolbox where every tool uses the same standard format for data. Learn that format once and every tool in the box just works.
The format is three columns: who (which product, shop, or sensor), when (the date), and what (the number you care about). That's it.
import pandas as pd
# Let's create a simple time series: ice cream sales each month
# Three columns: WHO (unique_id), WHEN (ds), WHAT (y)
ice_cream = pd.DataFrame({
"unique_id": "my_ice_cream_shop", # name of our series
"ds": pd.date_range("2023-01", periods=12, freq="MS"), # Jan-Dec 2023
"y": [120, 110, 140, 160, 200, 280, # Jan-Jun sales
320, 310, 250, 180, 140, 130] # Jul-Dec sales
})
print(ice_cream)
# You'll see: more sales in summer (row 6-8), less in winter
# That repeating pattern is exactly what we'll learn to predict!
No doctor performs surgery without studying the X-ray first
Google launched Flu Trends in 2008 with bold claims. It missed the 2009 H1N1 outbreak entirely. By 2013 a Nature study showed it was over-estimating flu cases by 2×. A 2014 Science paper named the cause: big data hubris — they trusted the algorithm and never looked carefully at what the data was actually saying.
Imagine a doctor who skips the X-ray and goes straight to surgery. Terrifying, right? A good doctor looks carefully at all the evidence before touching a single instrument. Looking at your data before building any kind of model is the same thing. Most forecasting mistakes happen because someone skipped this step.
Before you do anything else — before you touch a single tool or write a line of code — make a simple line chart of your numbers over time. Look at it. Study it. Ask yourself four questions:
There is a second chart, called the ACF plot, that acts like a pattern detector. ACF stands for "Autocorrelation Function" — a mouthful, but the idea is simple: it measures how much today's number is related to yesterday's, last week's, and last month's.
If there's a big spike at "12 months ago", your data has a yearly pattern. If there's a spike at "7 days ago", it has a weekly pattern. This one chart tells you almost everything you need to know about the structure of your data.
The golden rule: if your chart surprises you, your forecast will surprise you too — and not in a good way.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsforecast.utils import AirPassengersDF # a classic dataset about airline passengers
# Load the data: monthly airline passengers 1949-1960
df = AirPassengersDF.copy().set_index("ds")["y"]
fig, axes = plt.subplots(2, 1, figsize=(11, 7))
# Chart 1: The basic line chart
# Look for: is it going up? Does it get bigger waves over time?
axes[0].plot(df, linewidth=2, color="#16a34a")
axes[0].set_title("Airline passengers over time — what do you notice?")
axes[0].set_ylabel("Number of passengers (thousands)")
# Chart 2: The pattern detector (ACF plot)
# Look for: tall bars tell you about repeating patterns
# The tall bar at 12 means: "what happened 12 months ago predicts today"
plot_acf(df, lags=36, ax=axes[1], color="#7c3aed")
axes[1].set_title("Pattern detector — the big spike at position 12 means YEARLY pattern!")
plt.tight_layout()
plt.show()
Every time series is a sandwich — pull the layers apart to understand each one
A supermarket chain couldn't figure out why its ice cream sales model kept failing. They hired three consultants. All three missed the same thing: the seasonal pattern was getting bigger every year — but they'd been modelling it as a fixed bump. The BLT sandwich had been assembled wrong.
A BLT sandwich looks like one thing, but it's actually three layers: bread (the solid base), bacon (the repeating delicious bit), and lettuce (the random unpredictable crunchy stuff). Your data is the same. Pull it apart and suddenly you can see what's really happening. This pulling-apart process is called decomposition.
Hidden inside every set of time-based numbers are three separate ingredients that got mixed together. Decomposition is the process of separating them out so you can study each one.
This is the slow, long-term direction. Like an escalator — it steadily goes up, or steadily goes down, over months and years. Sales at a growing company trend upward. Sales at a declining one trend downward. The trend doesn't jump around — it moves slowly and steadily.
This is the predictable, calendar-driven pattern that comes back again and again. Ice cream spikes every July. Gym memberships spike every January. A bakery sells more on Saturdays. This ingredient is called seasonality — and it's the most valuable one, because it's completely predictable.
After you've removed the trend and the seasonal pattern, whatever's left is the remainder. This is the random stuff — a factory caught fire, someone famous mentioned your brand, a freak weather event. No model can predict this part. It's the chaos ingredient.
If your remainder is small compared to the trend and seasonal parts, your data is mostly predictable — great news! If the remainder is huge, then most of what's happening is random noise, and even the best model won't do much better than a simple guess.
from statsmodels.tsa.seasonal import STL
from statsforecast.utils import AirPassengersDF
import matplotlib.pyplot as plt
# Load airline passenger numbers (monthly, 1949-1960)
df = AirPassengersDF.copy().set_index("ds")["y"]
# STL = the tool that pulls your data apart into 3 layers
# period=12 means we expect a YEARLY pattern (12 months per cycle)
result = STL(df, period=12).fit()
# Now plot the 3 layers separately
fig, axes = plt.subplots(4, 1, figsize=(11, 9), sharex=True)
for ax, data, label, color in zip(axes,
[df, result.trend, result.seasonal, result.resid],
["Original (all mixed together)",
"Trend (the escalator)",
"Seasonal (the repeating summer peaks)",
"Remainder (the random leftover noise)"],
["#374151", "#16a34a", "#7c3aed", "#dc2626"]):
ax.plot(data, color=color, linewidth=1.8)
ax.set_ylabel(label, fontsize=10)
plt.tight_layout()
plt.show()
# Is our data mostly predictable?
print(f"Seasonal pattern size: {result.seasonal.max()-result.seasonal.min():.0f} passengers")
print(f"Random noise size: {result.resid.std():.0f} passengers")
print("→ Seasonal is much bigger than noise = GOOD, mostly predictable!")
Different sports use different scoreboards — pick the right one
Two forecasters submitted predictions for the same dataset. Forecaster A had a 12% MAPE. Forecaster B had a 38% MAPE. The competition organiser announced Forecaster B the winner. The audience was confused. Then the organiser explained why MAPE had lied — and the room went silent.
In golf, every stroke counts equally. In basketball, shots can be worth 2 or 3 points. The scoring system shapes the game. When you forecast, you need a "scoreboard" that measures how wrong you were. But different scoreboards reward different things. Pick the wrong one and your model will get very good at winning the wrong game.
After your forecast runs, you need to measure: how wrong were my guesses? Here are the four main "scoreboards" and when to use each one:
The simplest scoreboard. Take every mistake, ignore whether it was too high or too low (that's the "absolute" part), and find the average. If your MAE is 50, you were wrong by about 50 units on average. Easy to explain to anyone. Use this as your default.
Like MAE, but it squares each mistake before averaging, then takes the square root at the end. This punishes large mistakes much harder than small ones. Missing by 200 units counts as 16× worse than missing by 50 units. Use this when a big mistake is especially costly — like medicine stock in a hospital.
This one compares your mistakes to what a very simple "copy the last value" approach would get. A MASE below 1.0 means you beat the simple approach. A MASE above 1.0 means the simple approach beat you — and you should stop and fix your model. Use this to compare models across different products.
MAPE divides each mistake by the actual value. The problem: if the actual value is near zero (one week you sold almost nothing), you're dividing by nearly zero, and the score shoots to infinity. Looks simple but breaks in common situations. The world's largest forecasting competition banned it.
import numpy as np
# Pretend we forecast sales for 5 weeks and here's how we did:
actual_sales = [500, 620, 480, 710, 550] # what really happened
our_forecast = [520, 600, 510, 680, 570] # what we predicted
actual = np.array(actual_sales)
forecast = np.array(our_forecast)
# Scoreboard 1: MAE — how wrong on average?
mistakes = np.abs(actual - forecast) # size of each mistake
mae = np.mean(mistakes) # average mistake size
print(f"MAE = {mae:.1f} → we were wrong by {mae:.0f} units on average")
# Scoreboard 2: RMSE — punishes big mistakes harder
rmse = np.sqrt(np.mean((actual - forecast)**2))
print(f"RMSE = {rmse:.1f} → if big mistakes are costly, use this")
# Scoreboard 3: MASE — did we beat the simple "copy last week" approach?
# First: what would "copy last week" have gotten?
naive_mistakes = np.abs(np.diff(actual)) # just compare each week to the one before
naive_mae = np.mean(naive_mistakes)
mase = mae / naive_mae
print(f"MASE = {mase:.2f} → {'We beat the simple approach!' if mase<1 else 'The simple approach beat us — go fix the model!'}")
Before entering any real race, you have to outrun the slowest runner
In the M4 Forecasting Competition — the Olympics of forecasting, 100,000 real time series — 'Copy Last Year' finished in the top third. It beat thousands of machine learning models. It didn't know any maths. It just remembered what happened 12 months ago. The researchers called it 'the embarrassing benchmark.'
Before athletes race each other, they first need to beat a qualification standard — a slow turtle of a time that any serious runner should beat easily. In forecasting, that turtle is called a baseline model — the simplest possible forecast. If your clever, expensive model can't beat the turtle, it's not clever at all. It's broken.
Before building anything complicated, always run these four extremely simple forecasting methods first. They are your turtles — your minimum standard. Any model you build must beat all of them.
Forecast = whatever happened last time. If sales were 500 yesterday, predict 500 today. That's it. No math, no thinking. Surprisingly hard to beat on flat, stable data.
Forecast = whatever happened at this exact time last year. Predict January by copying last January. Predict December by copying last December. This turtle is your hardest opponent on seasonal data — it knows about the Christmas spike, it knows about the summer peak, and it doesn't need to learn anything.
Draw a straight line from the first number to the last number in your past data. Keep going. This assumes the average rate of change from before will continue.
Predict the same number every time: the average of all past data. Useful when there's no trend and no seasonal pattern — just noise around a stable level.
If your model doesn't beat "Same Time Last Year", it should not be used. Period. Check this before showing anyone your results.
from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive, HistoricAverage, RandomWalkWithDrift
from statsforecast.utils import AirPassengersDF
import numpy as np
# Load data: first 10 years to train, last 2 years to test
df = AirPassengersDF.copy()
train = df.head(120) # train on 10 years
test = df.tail(24) # test on 2 years (our report card)
# Run all four turtles at once
sf = StatsForecast(
models=[
Naive(), # turtle 1: copy last value
SeasonalNaive(season_length=12), # turtle 2: copy same month last year
HistoricAverage(), # turtle 4: predict the average
RandomWalkWithDrift(), # turtle 3: extend the trend line
],
freq="MS" # MS = monthly data
)
forecasts = sf.forecast(df=train, h=24)
# Score each turtle: which one was closest?
print("=== Turtle Race Results ===")
for model in ["Naive", "SeasonalNaive", "HistoricAverage", "RWD"]:
avg_mistake = np.mean(np.abs(forecasts[model].values - test["y"].values))
print(f"{model:<20} Average mistake = {avg_mistake:.1f} passengers")
print("
Your model must score BELOW the best turtle to be worth using!")
The time-machine rule — you can never use future information to train your model
A data science team at a major retailer spent 6 months building a forecasting model. Their test results looked incredible — 94% accuracy. They deployed it. It was catastrophically wrong from day one. The problem: during testing, their model had been trained on data from the future. They'd built the world's most expensive cheat sheet.
Imagine you want to become a great chess player. If you study the answer book while solving each puzzle, your practice score will be perfect — but completely meaningless. You learned nothing. Testing a forecast by letting it "peek" at future data is the same cheat. The time-machine rule says: the future can never train the model, ever.
In normal school exams, the teacher gives you problems you've never seen before. If the teacher accidentally gave you the answer key, your mark would be perfect — but you'd have learned nothing. The same trap exists in forecasting.
Standard machine learning splits data randomly: "Take any 80% of the rows for training, use the other 20% for testing." For most types of data this works fine. For time series, it is catastrophically wrong.
If you randomly pick your test data, some of it will be from January 2021, and some of your training data will be from June 2021. But June comes after January — so your model was trained on the future to predict the past. Your test score looks great, but your model has learned nothing that applies to real life.
Every test period must come after every training period. Always. No exceptions. The correct method creates multiple test windows, each one stepping forward in time:
Average the score across all windows. That's your real, honest performance estimate.
This process is called cross-validation — "cross" because you test from multiple angles, "validation" because it validates whether your model actually works. The nixtlaverse does this automatically with one function call.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
# Set up two models to compare
sf = StatsForecast(
models=[
AutoETS(season_length=12), # a smarter model (we'll learn this Day 10)
SeasonalNaive(season_length=12), # our "same time last year" turtle
],
freq="MS"
)
# Honest test: 3 windows, each predicts 12 months ahead
# StatsForecast makes sure test ALWAYS comes AFTER training
cv_results = sf.cross_validation(
df=df,
h=12, # predict 12 months ahead each time
n_windows=3 # repeat 3 times, stepping forward through history
)
# Score both models honestly
print("=== Honest Test Results ===")
for model in ["AutoETS", "SeasonalNaive"]:
avg_mistake = np.mean(np.abs(cv_results[model] - cv_results["y"]))
print(f"{model:<20} Average mistake = {avg_mistake:.1f}")
print("
This is real performance — the model never saw the test data during training")
Sunglasses that automatically darken old memories
In 1956, a US Navy officer named Robert Brown had a problem: the Navy had warehouses full of inventory, and nobody could predict demand well enough. He invented a system in 2 pages of notes. Seventy years later, every major forecasting software in the world still uses his core idea. The secret? Forget the distant past.
Picture sunglasses that get darker the older the memory is. Last week is crystal clear. Last month is faded grey. Last year is almost black. This is exactly how exponential smoothing works — it gives your most recent numbers the most weight, and lets older numbers gradually fade away. The word "exponential" just describes how fast the fading happens.
Exponential smoothing (let's call it ES for short) is built on one beautiful, sensible idea: recent information matters more than old information.
The formula is: New forecast = α × (latest actual number) + (1 − α) × (old forecast)
The Greek letter α (alpha) is your single control dial. It goes from 0 to 1:
Simple exponential smoothing can only produce a flat forecast — a horizontal line. It has no mechanism to go up or down over time, and no knowledge of seasonal patterns. It's great for flat, stable data. For data with a rising trend or seasonal spikes, you need the upgraded versions (Days 8–9).
Think of simple ES as the foundation. Everything that follows adds more layers on top of this same core idea.
from statsforecast import StatsForecast
from statsforecast.models import SimpleExponentialSmoothing as SES
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
train = df.head(120) # use 10 years of data to fit
# Try two different alpha settings to see the difference
sf = StatsForecast(
models=[
SES(alpha=0.1), # slow and smooth — trusts history more
SES(alpha=0.9), # fast and reactive — trusts recent numbers more
],
freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
print(forecasts)
# Notice: both models produce a flat horizontal line as their forecast
# This is because simple ES has no way to go up or down over time
# For airline passengers (which clearly trend upward), we need Holt (Day 8)
print("
Both forecasts are flat lines — SES can't predict trends!")
Simple smoothing walks on flat ground — Holt adds an escalator
A consultancy once presented a 5-year forecast to a client showing sales tripling by Year 5. It looked impressive. The model was Holt's method — without damping. When asked 'what does the 10-year forecast look like?' they showed it: sales larger than the entire global market. The client walked out.
Simple smoothing (Day 7) only knows where it is right now — it walks on flat ground and never looks up or down. Holt's method adds an escalator. Now the forecast knows both where it is AND which way it's heading. If sales have been growing by 50 units every month, Holt rides the escalator upward instead of staying flat.
Holt's method adds a second tracking system to exponential smoothing. Instead of tracking just one thing, it now tracks two:
The forecast for next month = level + 1 × trend. For two months ahead = level + 2 × trend. The trend is like your velocity — multiply it by time to see where you end up.
If a company has grown 20% per year for three years, plain Holt will assume 20% growth forever — all the way to the moon. In reality, trends slow down, hit limits, or reverse. Blindly extending a trend for years into the future almost always leads to disaster.
A smarter version adds a "damping" effect: the further into the future you predict, the slower and more cautious the trend becomes. Instead of shooting to the moon, the forecast gradually flattens out. Decades of research and every major forecasting competition agree: always use the damped version. It's more humble, and humility is usually right.
from statsforecast import StatsForecast
from statsforecast.models import Holt, AutoETS
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
train = df.head(120) # 10 years of training data
# Holt's method: two tracking systems (level + trend)
# We give it an alias so the output column name is clear
sf = StatsForecast(
models=[
Holt(season_length=1, error_type="A", alias="Holt_LinearTrend"),
AutoETS(season_length=12), # AutoETS tries damped trend automatically ✓
],
freq="MS"
)
forecasts = sf.forecast(df=train, h=36) # predict 3 years ahead
# Holt with a linear trend will keep extrapolating upward indefinitely
# AutoETS applies damping automatically — the trend gradually flattens out
print("Prediction 36 months out:")
print(f" Holt (linear trend): {forecasts['Holt_LinearTrend'].iloc[-1]:.0f} passengers")
print(f" AutoETS (with damping): {forecasts['AutoETS'].iloc[-1]:.0f} passengers")
print()
print("AutoETS is more conservative — it won't project trends to the moon!")
print("This is why we always prefer AutoETS over plain Holt in practice.")
A roller coaster riding an escalator — trend AND calendar patterns together
A toy retailer ran Holt's method (Day 8) on their December sales. Every year like clockwork, December was 4× bigger than any other month. Holt's model predicted steady December growth — but completely missed the 4× spike. Their warehouse ordered wrong. They ran out of toys on December 20th. Two lines of code would have fixed it.
Day 7 gave you the flat ground. Day 8 added the escalator (trend). Now Day 9 adds the roller coaster on top of the escalator (seasonal pattern). Ice cream sales go up year after year (escalator) AND spike every summer (roller coaster). Holt-Winters handles both at the same time — which is why it works on most real business data.
Holt-Winters is the full version. It tracks three things at once:
The forecast for any future month = (level + trend) adjusted for the seasonal factor that month.
Imagine a toy shop. In December it always does extra sales. There are two ways this "extra" can work:
How to tell which? Look at your chart from Day 2. If the seasonal peaks get taller as the overall level rises — that's multiplicative. If they stay about the same height — additive.
The airline passenger data is the classic example of multiplicative: in the 1950s the seasonal peaks were small. By the 1960s, they were enormous — because the airline industry itself had grown.
from statsforecast import StatsForecast
from statsforecast.models import HoltWinters
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
sf = StatsForecast(
models=[
HoltWinters(season_length=12, error_type="A", alias="HW_Additive"), # fixed seasonal bumps
HoltWinters(season_length=12, error_type="M", alias="HW_Multiplicative"),# growing seasonal bumps
],
freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
# Check which one is more accurate
print("Additive seasons mistake:",
round(np.mean(np.abs(forecasts["HW_Additive"].values - test["y"].values)), 1))
print("Multiplicative mistake: ",
round(np.mean(np.abs(forecasts["HW_Multiplicative"].values - test["y"].values)), 1))
# For airline passengers: multiplicative wins because peaks grow with the trend!
The eye doctor who tests every lens before deciding which one is best
There are exactly 30 different ways to combine the three ingredients of exponential smoothing. A team of researchers tested all 30 on every dataset in a major competition. The result: no single combination won every time. But one automatic approach — trying all 30 and picking the winner — consistently matched or beat hand-chosen combinations across large-scale competition datasets.
At the eye doctor, they don't guess which lens is right. They flip through lens after lens — "better or worse? better or worse?" — until your vision is perfectly sharp. The computer equivalent is called AutoETS. It tries all ~30 combinations of the methods from Days 7–9 and automatically picks the winner.
The last three days introduced exponential smoothing with three possible "ingredients":
Together, these give about 30 different possible combinations. The framework that organises them all is called ETS — short for Error, Trend, Seasonal. ETS(A,N,A) means: Additive errors, No trend, Additive seasonal.
Instead of you choosing which combination to use, AutoETS tries all ~30 combinations on your data and picks the best one automatically. It uses a scoring system called AIC (which you can think of as: "how accurate was it, minus a penalty for being overly complicated?"). Lower AIC = better model.
This is the tool you'll use most in practice. For any dataset with up to a few hundred data points, AutoETS is your first serious model after the Day 5 turtles.
When it finishes, you can ask which combination it chose — something like ETS(M,Ad,M), which means multiplicative errors, damped trend, multiplicative seasonal. This is the "recipe" it found best for your data.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
# AutoETS will try all ~30 combinations and pick the best one
sf = StatsForecast(
models=[AutoETS(season_length=12)], # tell it: 12 months per year
freq="MS"
)
# Fit and forecast the next 24 months, with uncertainty ranges included
# Note: forecast() handles fitting internally — just one call needed
sf.fit(df=df)
forecasts = sf.predict(h=24, level=[80, 95]) # level adds uncertainty bands
print(forecasts)
# Which recipe did it choose? The model_ dict contains the answer
winning_recipe = sf.fitted_[0][0].model_.get("method", "unknown")
print(f"
Winning recipe: {winning_recipe}")
# For airline passengers, it usually picks ETS(M,N,M):
# M = multiplicative errors, N = no explicit trend (absorbed into level),
# M = multiplicative seasonal
# This matches what we saw in our chart: growing seasonal peaks!
Before ARIMA can read your data, it needs to stand still
ARIMA was invented in 1970 and has been running inside weather forecasting software, central bank models, and supply chain systems ever since. For 50 years. No one replaced it. But it has one absolute requirement — one thing it will NOT tolerate. And if you ignore this requirement, the entire model becomes meaningless garbage.
Imagine trying to measure someone's height while they're jumping on a trampoline. Every measurement will be different — not because they're growing, but because they won't stay still. ARIMA needs your data to be "standing still" before it can work properly. The process of making it stand still is called differencing.
ARIMA is a powerful tool — but it has one strict requirement: your data needs to be stationary. Stationary just means the numbers are bouncing around a fixed level rather than drifting steadily upward or downward. Like a calm lake versus a river flowing downhill.
Look at your chart. If it's clearly trending upward or downward over time, it's not stationary. You can also run a formal check called the ADF test (Augmented Dickey-Fuller) — a computer programme that outputs one number. If that number is less than 0.05, your data is stationary. If it's higher, it's not.
Differencing replaces each number with the change from the previous number. Instead of asking "how many sales this month?" you ask "how many more (or fewer) sales than last month?" If the original data was drifting upward, the changes often bounce around zero — which is stationary.
In the ARIMA name, the middle letter "I" stands for Integrated — which means "we differenced the data." An ARIMA model where d=1 means "we differenced once." d=2 means "we differenced twice" (used when once wasn't enough, which is rare).
The good news: AutoARIMA (Day 13) does all of this for you automatically.
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller # the "standing still" checker
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
sales = df["y"].values
# Step 1: Is our data stationary (standing still)?
result = adfuller(sales)
p_value = result[1]
print(f"Standing-still test p-value: {p_value:.4f}")
print(f"Verdict: {'✓ Already stationary' if p_value < 0.05 else '✗ Not stationary — need to difference!'}")
# Step 2: Make it stationary by differencing
# (subtract each value from the one before it)
differenced = np.diff(sales) # now we have CHANGES not levels
result2 = adfuller(differenced)
print(f"
After one differencing, p-value: {result2[1]:.4f}")
print(f"Verdict: {'✓ Now stationary!' if result2[1] < 0.05 else '✗ Still not stationary, try again'}")
print(f"
Original data: e.g. {sales[:5]} (drifting upward ↗)")
print(f"After differencing: {differenced[:5].astype(int)} (bouncing around zero ✓)")
Adding weekly and yearly rhythms to the model
Box and Jenkins published their ARIMA framework in 1970 in a book that cost $40. For the next 20 years, fitting one ARIMA model required a statistician, a mainframe computer, and several days of work. Now it takes one line of Python and 0.3 seconds. The numbers in parentheses haven't changed. The waiting has.
A basic calendar only shows you the date. A smart calendar shows you that Monday is always busy, December is always hectic, and summer is always slow. ARIMA learns patterns in data. Seasonal ARIMA (called SARIMA) adds knowledge of the calendar — it knows patterns repeat every 7 days, or every 12 months, or every 52 weeks.
ARIMA is built from three ingredients. Their sizes are written as three numbers: (p, d, q)
Seasonal ARIMA adds another set of the same three numbers for the seasonal layer: (P, D, Q)[m]. The m is the most important — it tells the model how long one complete cycle is. Monthly data: m=12. Weekly data: m=52. Daily data (with weekly patterns): m=7.
Think of ARIMA as a musician who listens to recent notes before playing the next one. The seasonal part is the same musician, but they also listen to what they played at this exact point in the song last year. Both short-term memory and long-term calendar memory at the same time.
You almost never need to choose these numbers yourself. AutoARIMA (Day 13) runs a search and finds the best (p,d,q)(P,D,Q) automatically — just like AutoETS found the best ETS recipe.
from statsforecast import StatsForecast
from statsforecast.models import ARIMA, AutoARIMA
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
sf = StatsForecast(
models=[
# A classic recipe for monthly data with yearly patterns
# (2,1,1) = look back 2 months, differenced once, 1 error memory
# (0,1,1,12) = seasonal part: differenced once per year, 1 seasonal error memory
ARIMA(order=(2,1,1), seasonal_order=(0,1,1), season_length=12),
# AutoARIMA just figures out the best recipe by itself
AutoARIMA(season_length=12),
],
freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
# Compare: does the hand-chosen recipe beat the automatic one?
for model in ["ARIMA", "AutoARIMA"]:
mae = np.mean(np.abs(forecasts[model].values - test["y"].values))
print(f"{model:<12}: average mistake = {mae:.1f} passengers/month")
print("
AutoARIMA will match or beat the hand-chosen recipe every time!")
You tell it where you want to go; it figures out the best route
A research team benchmarked AutoARIMA against expert-chosen ARIMA models on thousands of time series. The experts spent hours per series. AutoARIMA ran in under a second. On accuracy: AutoARIMA matched or beat expert-chosen models in the large majority of cases — while taking a fraction of the time. Hyndman & Khandakar (2008) document the original validation.
Before GPS, you had to memorise every road, every turn, every shortcut. Now you just type in the destination and your phone figures out the route. AutoARIMA is the GPS for ARIMA models. You hand it your data and say "I want a forecast for 12 months ahead." It handles all the decisions about differencing, memory lengths, and seasonal patterns.
AutoARIMA follows a methodical search process to find the best ARIMA recipe:
The whole search happens in a fraction of a second in StatsForecast — fast enough to run on thousands of different products at once (Day 15).
Both families work well on most data. The main differences:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
# The three-way race: AutoARIMA vs AutoETS vs the turtle (SeasonalNaive)
sf = StatsForecast(
models=[
AutoARIMA(season_length=12), # ARIMA family, auto-configured
AutoETS(season_length=12), # ETS family, auto-configured
SeasonalNaive(season_length=12), # our turtle baseline (Day 5)
],
freq="MS",
n_jobs=-1 # use all CPU cores for speed
)
forecasts = sf.forecast(df=train, h=24)
print("Three-way race — who wins on airline passengers?")
for model in ["AutoARIMA", "AutoETS", "SeasonalNaive"]:
mae = np.mean(np.abs(forecasts[model].values - test["y"].values))
print(f" {model:<18}: mistake = {mae:.1f}")
print("
Both smart models should comfortably beat the turtle!")
The wisdom of asking two experts instead of betting everything on one
At a county fair in 1906, around 800 people guessed the weight of an ox. After discarding 13 defective entries, Francis Galton calculated the median of 787 eligible guesses: 1,207 pounds. The actual weight: 1,198 pounds — less than 1% off. The best individual expert was off by 40 pounds. The crowd beat the expert. Galton published this in Nature in 1907. It became the founding proof of wisdom-of-crowds thinking.
You're trying to decide whether to bring an umbrella. You check two weather apps. One says 70% chance of rain. The other says 50%. Instead of picking a side, you average them: 60%. You've just done ensemble forecasting. The combined answer is almost always more accurate than either individual answer.
When two well-designed models disagree, you don't have to pick a winner. Average them. This almost always beats the better individual model — and here's why:
Every model makes different kinds of mistakes. ARIMA might underestimate peaks. ETS might overestimate troughs. When you average them, their mistakes partly cancel each other out. The combined forecast is more stable, more reliable, and rarely the worst option.
The simplest approach: give each model equal weight (50/50). Surprisingly, this simple approach is very hard to beat. You can get fancier — give more weight to the model that scored better in cross-validation (Day 6) — but the improvement is usually small.
Combining works best when:
The M4 Competition — the Olympics of forecasting, run in 2018 on 100,000 real-world time series — found that almost every top method used ensembles. This is not a coincidence.
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
sf = StatsForecast(
models=[AutoARIMA(season_length=12), AutoETS(season_length=12)],
freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values
# Score each model individually
for model in ["AutoARIMA", "AutoETS"]:
mae = np.mean(np.abs(forecasts[model].values - actual))
print(f"{model:<12}: mistake = {mae:.1f}")
# Now create the ensemble: simple average of both forecasts
ensemble = (forecasts["AutoARIMA"] + forecasts["AutoETS"]) / 2
mae_ensemble = np.mean(np.abs(ensemble.values - actual))
print(f"{'Ensemble':<12}: mistake = {mae_ensemble:.1f} ← usually the winner!")
print("
✓ Even a 50/50 average typically beats both individuals")
The assembly line approach — one run, every product done
A data scientist at a supermarket chain was asked to forecast demand for every product in the store. 47,000 products. Her manager said 'you have one week.' She wrote 47,000 lines of code — one per product. It ran for 3 days. Her colleague did it in 20 lines. It ran in 4 minutes. The secret: the three-column table.
A car factory doesn't build one car at a time by hand. It runs an assembly line where all cars move through the same steps simultaneously. StatsForecast works the same way — instead of running your model on each product one by one (which would take hours or days), it runs all of them through the same process at once, in parallel.
So far, you've been forecasting one time series at a time. In real businesses, you might need forecasts for every product in a warehouse, every store in a chain, every employee in a workforce. That's not dozens — it's thousands or tens of thousands.
Remember the three-column format from Day 1? The unique_id column is what makes mass forecasting possible. Every row is labelled with a name — "product_A", "product_B", "store_42". StatsForecast reads all the rows together, automatically separates them by that label, trains a model for each one, and combines all the results back into a single table.
You don't write any loops. You don't manage files. You just hand it the full table and it handles everything.
The single most important line in your code for big datasets: n_jobs=-1. This tells StatsForecast to use every processing core your computer has. If you have 8 cores, it runs 8 series at the same time. On a 10,000-series dataset, this makes the difference between a 10-minute wait and a 90-second wait.
Once you understand the three-column format, forecasting 1 series and forecasting 10,000 series are the same code. The only thing that changes is the number of rows in your table.
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive
# Build a table with 5 different products, each with 24 months of sales
products = ["product_A", "product_B", "product_C", "product_D", "product_E"]
dates = pd.date_range("2022-01", periods=24, freq="MS")
rows = []
for prod in products:
import numpy as np
# Each product has different sales patterns
sales = 100 + 20*np.sin(range(24)) + np.random.randint(-10,10,24)
for d, s in zip(dates, sales):
rows.append({"unique_id": prod, "ds": d, "y": float(s)})
df = pd.DataFrame(rows)
print(f"Total rows in table: {len(df)} (5 products × 24 months each)")
# One call forecasts ALL 5 products at once
sf = StatsForecast(
models=[AutoETS(season_length=12), SeasonalNaive(season_length=12)],
freq="MS",
n_jobs=-1 # ← use ALL your CPU cores for speed
)
forecasts = sf.forecast(df=df, h=12) # predict next 12 months for each product
print(f"
Result has {len(forecasts)} rows — 5 products × 12 months each")
print(forecasts.head(10))
Ice cream sales don't only depend on last month — temperature matters too
An energy company added temperature data to its electricity demand model. Accuracy improved 31%. Then they tried to add 'next month's temperature' to predict next month's demand. The model looked incredible in testing. In production: it required a perfect temperature forecast — which doesn't exist for a month ahead. They'd built a model that needed to see the future.
Imagine forecasting ice cream sales using only last summer's sales. You'd do OK. But what if you also knew tomorrow's temperature forecast? You'd do much better. Outside information that helps you predict — like temperature, holidays, or marketing spend — is called outside variables (the technical term is "exogenous variables," from the Greek for "coming from outside").
Every model we've built so far uses only one piece of information: the past values of the thing we're predicting. Adding outside information can dramatically improve accuracy — but it comes with a catch.
To use an outside variable in your forecast, you must know its future value at forecast time.
This sounds obvious but trips up many people. Examples:
Holiday effects are the easiest outside variable to use. Christmas is always December 25th. You can encode "is this week a holiday week?" as a simple 0 or 1 for every date in history AND for every future date. No prediction required — the calendar never changes.
import pandas as pd
import numpy as np
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
# Add a simple "is this a summer month?" outside variable
# Summer = June, July, August (months 6, 7, 8) — we KNOW this for future dates too!
df["is_summer"] = df["ds"].dt.month.isin([6, 7, 8]).astype(int)
# For forecasting, we need to provide what the future values of "is_summer" will be
# This is easy — we know the calendar perfectly for any future date
future_dates = pd.date_range("1961-01", periods=24, freq="MS")
future_X = pd.DataFrame({
"unique_id": "AirPassengers",
"ds": future_dates,
"is_summer": future_dates.month.isin([6,7,8]).astype(int) # ✓ we know this already
})
# In StatsForecast 2.x, exogenous variables are passed directly to forecast()
# Include the X columns in df, and provide the future X values via X_df
sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="MS")
forecasts = sf.forecast(df=df, h=24, X_df=future_X)
print("Forecast with summer outside variable:")
print(forecasts.head(12))
Every real dataset is messier than the textbook examples
A hospital's patient count data showed zero admissions on 47 different days across 5 years. The data team assumed a quiet spell. Then they realised: those were days when the data entry system was offline. The zeroes weren't data — they were silence. Their model had been trained to believe the hospital emptied on random Tuesdays.
Imagine transcribing a handwritten diary entry by entry. Some pages are missing. One entry says "the factory burned down" and the numbers are bizarre for three months. Real data is never clean. A sensor breaks. Someone enters the wrong number. A one-time event causes an impossible spike. Knowing how to handle this is the difference between a textbook student and a practitioner.
Real-world data has two main problems:
A sensor went offline for three days. An employee forgot to enter the data for one week. Whatever the reason, there are empty rows where a number should be.
Never fill missing numbers with zero. A zero tells every model "nothing happened" — which is a lie. Instead:
One month you sold 50,000 units instead of your normal 500 — because someone made a data entry mistake (or something truly exceptional happened). These outlier values confuse models, pulling the trend or seasonal estimates in the wrong direction.
The approach: use STL decomposition (Day 3) to pull out the remainder (the "noise" layer). Any remainder value that is more than 3× the typical noise level is probably an outlier. Investigate it. If it's a data error, fix it or replace it with a linear fill.
Important: if it was a real event (a viral tweet, a natural disaster), note it — you may want to add it as an outside variable rather than just erasing it.
import pandas as pd
import numpy as np
# Create a messy dataset with missing values and an outlier
dates = pd.date_range("2023-01", periods=12, freq="MS")
sales = [120, 135, np.nan, 140, 600, 138, 142, np.nan, 130, 127, 135, 140]
# ↑ missing ↑ outlier ↑ missing
df = pd.DataFrame({"ds": dates, "y": sales})
print("=== Raw messy data ===")
print(df.to_string())
# Fix 1: Detect the outlier (anything over 3x the typical value is suspicious)
median_val = df["y"].median()
threshold = median_val * 3
df["is_outlier"] = df["y"] > threshold
print(f"
Outlier threshold (3× median): {threshold:.0f}")
print(f"Rows flagged as outliers: {df['is_outlier'].sum()}")
# Fix 2: Replace the outlier with NaN so we treat it like a missing value
df.loc[df["is_outlier"], "y"] = np.nan
# Fix 3: Fill all missing values using linear interpolation (draw a straight line)
df["y_clean"] = df["y"].interpolate(method="linear")
print("
=== After cleaning ===")
print(df[["ds","y","y_clean"]].to_string())
print("
No zeros, no outliers — ready for modelling!")
A jar of jellybeans and the wisdom of getting many estimates
Philip Tetlock studied expert predictions for 20 years — political scientists, economists, military analysts. His finding: 'experts' predicted the future only slightly better than dart-throwing chimps. But one group consistently did better: people who combined many diverse sources of evidence rather than committing to one theory. They were the forecasters who ensembled.
At a school fair, a jar of jellybeans sits on the table. 200 students each write down their best guess. The average of all 200 guesses — called the "wisdom of crowds" — is almost always closer to the real number than any individual guess. Ensemble forecasting uses the same principle: combine many models, average out their different mistakes, and get closer to the truth.
You've already seen the idea on Day 14 — combining AutoARIMA and AutoETS. Now let's build a proper ensemble with more models and think carefully about how to combine them.
The ensemble works best when its members disagree with each other. If every model makes the same mistake, averaging them doesn't help. You want models that see the data differently:
The simplest ensemble gives every model equal weight. A fancier approach gives more weight to the models that scored better in cross-validation. The research finding: equal weighting usually performs within 1–2% of optimal weighting, and it never blows up catastrophically. Equal weighting is robust. Use it as your starting point and only get fancier if you have a clear reason to.
Only include models that individually beat the Day 5 turtles. If a model can't beat "copy last year," including it in your ensemble just drags down the good models.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, AutoTheta, DynamicOptimizedTheta
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
# Build a 4-model ensemble
models = [
AutoETS(season_length=12),
AutoARIMA(season_length=12),
AutoTheta(season_length=12),
DynamicOptimizedTheta(season_length=12),
]
sf = StatsForecast(models=models, freq="MS")
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values
print("Individual model scores:")
model_cols = ["AutoETS","AutoARIMA","AutoTheta","DynamicOptimizedTheta"]
for col in model_cols:
mae = np.mean(np.abs(forecasts[col].values - actual))
print(f" {col:<28}: {mae:.1f}")
# Equal-weight ensemble (simple average of all 4)
ensemble = forecasts[model_cols].mean(axis=1)
mae_e = np.mean(np.abs(ensemble.values - actual))
print(f"
{'Ensemble (equal weight)':<28}: {mae_e:.1f} ← usually wins!")
Every forecast should come with an honest "I might be wrong by this much"
A logistics company gave its client a forecast: '4,200 units next month.' The client ordered exactly 4,200. The actual demand: 5,800. They ran out. A lawsuit followed. In court, the forecaster was asked: 'Did you communicate any uncertainty?' The answer: 'No — we gave a single number.' The honest forecast would have said: '4,200 — but could reasonably be anywhere from 3,400 to 5,100.'
When a weather app says "Thursday: 22°C," a smart one also says "between 18°C and 26°C." That range is the honesty. It's saying: here's my best guess, but here's how far off I might be. In forecasting, these ranges are called prediction intervals — the space between the lowest and highest plausible future values.
There are two parts to any honest forecast:
The interval tells you how confident the model is. A narrow interval means the model is very confident. A wide interval means "I really can't tell you exactly — there's a lot of uncertainty here."
The intervals come with a coverage level:
Use 95% intervals when the cost of being wrong is high (hospital stock, airline capacity). Use 80% when you need tighter planning ranges.
Prediction intervals are only honest if the model structure fits your data. A model with the wrong seasonal length or missing trend will produce intervals that are too narrow — they'll claim more confidence than they have. Always check empirically: over 100 forecasts, about 95 actual values should fall inside your 95% interval. If significantly fewer do, your model is mis-specified.
This is completely normal and correct. Predicting tomorrow should be more certain than predicting next year. If a model gives you the same width interval for "next week" and "next year," something is wrong with the model.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS
from statsforecast.utils import AirPassengersDF
import matplotlib.pyplot as plt
import matplotlib.patches as patches
df = AirPassengersDF.copy()
train = df.head(120)
sf = StatsForecast(models=[AutoETS(season_length=12)], freq="MS")
# level=[80, 95] adds two uncertainty bands: 80% and 95%
forecasts = sf.forecast(df=train, h=24, level=[80, 95])
print(forecasts.columns.tolist())
# Columns include: AutoETS (best guess), AutoETS-lo-80, AutoETS-hi-80,
# AutoETS-lo-95, AutoETS-hi-95
# Check how far apart the 95% interval is at 6 months vs 24 months out
width_6 = forecasts["AutoETS-hi-95"].iloc[5] - forecasts["AutoETS-lo-95"].iloc[5]
width_24 = forecasts["AutoETS-hi-95"].iloc[23] - forecasts["AutoETS-lo-95"].iloc[23]
print(f"
95% interval width at 6 months out: {width_6:.0f} passengers")
print(f"95% interval width at 24 months out: {width_24:.0f} passengers")
print("→ Intervals get wider the further out we forecast — that's correct!")
Instead of math formulas, teach the computer patterns from examples
In the Walmart M5 Forecasting Competition, the challenge was 42,840 product-store combinations. The winning team used no traditional statistics at all — no ETS, no ARIMA. They used LightGBM-based solutions dominated the top of the leaderboard — a machine learning tool that learned patterns from all 42,840 product-store combinations simultaneously. One model for everything, rather than one model per product.
How do you teach a child to recognise cats? Not by writing down every rule ("cats have whiskers, four legs, fur..."). You show them hundreds of pictures: "this is a cat, this isn't, this is." The child's brain finds the pattern. Machine learning forecasting works the same way — instead of giving the model a formula, you show it thousands of examples and let it find the patterns itself.
Every method so far (ETS, ARIMA, Holt-Winters) works from a mathematical formula. The formula says: "here's how to combine past values to get a forecast." Machine learning skips the formula entirely. Instead, it learns the relationship between past and future from thousands of examples.
LightGBM is a specific machine learning tool that works exceptionally well on tabular data (data in rows and columns). It builds a series of decision trees, each one learning from the mistakes of the previous one. It won or placed in the top 10 of the Walmart M5 forecasting competition — on 42,840 products simultaneously.
ML forecasting shines when you have lots of series (hundreds or thousands), lots of history, and you want to use many outside variables (promotions, weather, holidays) all at once. It struggles on short series (under 100 observations) or when the patterns are simple.
from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series
from lightgbm import LGBMRegressor
import numpy as np
# Generate some example daily sales data (3 products, 2 years)
df = generate_daily_series(n_series=3, min_length=365, max_length=730)
# Set up MLForecast with LightGBM as the learning engine
# lags: use the values from 1, 7, 14, and 28 days ago as clues
# date_features: use day-of-week and month as extra clues
mlf = MLForecast(
models=[LGBMRegressor(n_estimators=100, verbose=-1)], # the learning engine
freq="D", # daily data
lags=[1, 7, 14, 28], # clues from recent past
date_features=["dayofweek", "month"] # calendar clues
)
# Train/test split: use a date cutoff, not row count
# (row count would use total rows across all series, not per-series)
import pandas as pd
cutoff_date = pd.Timestamp("2022-06-01") # everything before here = train
train = df[df["ds"] < cutoff_date]
test = df[df["ds"] >= cutoff_date]
mlf.fit(train)
forecasts = mlf.predict(h=30) # predict 30 days ahead
mae = np.mean(np.abs(forecasts["LGBMRegressor"].values - test["y"].values[:len(forecasts)]))
print(f"LightGBM average mistake: {mae:.2f}")
print("
This same code works on 42,840 products — just add more rows to df!")
Garbage in, garbage out — great clues lead to great forecasts
Two data scientists used the same LightGBM model on the same dataset. One got a 23% error rate. The other got 9%. Same algorithm, same data, same computer. The entire difference: one engineer built rich, thoughtful features. The other fed in raw numbers and hoped for the best. Feature engineering is where ML forecasting is won or lost.
A detective doesn't just look at the obvious clue. They look at what surrounds it — the mud on the boots, the newspaper folded to page 7, the coffee cup still warm. In machine learning forecasting, the "clues" you give the model are called features. The better your clues, the better your forecast.
In machine learning forecasting, the model can only learn from what you give it. If you give it useless clues, it will make useless forecasts. Here are the three main types of clues:
The most basic clue: what was the value N periods ago? "Sales 1 week ago" is a lag-1 feature. "Sales 4 weeks ago" is a lag-4 feature. The number of periods to look back should match your data's patterns — weekly patterns need lags of 7, 14, 21. Yearly patterns need lags of 52 (weekly data) or 12 (monthly data).
Instead of a single past value, summarise a window of recent values. "Average sales over the past 4 weeks" is a rolling mean. "Maximum sales in the past 8 weeks" is a rolling max. These smooth out the random noise and give the model a bigger picture of recent conditions.
What day of the week is it? What month? Is it a holiday? Is it a quarter-end? These calendar clues are free information — you always know them for future dates.
Never create a feature that uses information from the future to describe the past. Example: if you're predicting Monday's sales, never use Tuesday's actual sales as a clue — because on Monday, you don't know Tuesday yet. This mistake is called data leakage and it produces models that look perfect in testing but are completely useless in real life.
import pandas as pd
import numpy as np
# Imagine we have daily sales for 90 days
np.random.seed(42)
dates = pd.date_range("2023-01-01", periods=90, freq="D")
sales = 200 + 30*np.sin(np.arange(90)*2*np.pi/7) + np.random.normal(0,10,90)
df = pd.DataFrame({"ds": dates, "y": sales})
# ── Build features (clues for the model) ──────────────────────────
# Clue type 1: Lags — what happened 1, 7, and 14 days ago?
df["lag_1"] = df["y"].shift(1) # yesterday
df["lag_7"] = df["y"].shift(7) # same day last week
df["lag_14"] = df["y"].shift(14) # same day 2 weeks ago
# Clue type 2: Rolling statistics — summaries of recent windows
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean() # avg of last 7 days
df["roll_max_14"] = df["y"].shift(1).rolling(14).max() # max of last 14 days
# Clue type 3: Calendar features
df["day_of_week"] = df["ds"].dt.dayofweek # 0=Monday, 6=Sunday
df["month"] = df["ds"].dt.month
# Drop early rows where lags don't exist yet
df = df.dropna()
print("Features ready for machine learning:")
print(df[["ds","y","lag_1","lag_7","roll_mean_7","day_of_week"]].head(5).to_string())
print(f"
{len(df)} training examples with {len(df.columns)-2} clues each")
A ladder with five rungs — climb only as high as your data requires
Research consistently finds that simple model selection rules match or outperform experienced human judgment on most datasets — while being applied in seconds rather than hours. The lesson: systematic process beats intuition. (Fildes & Petropoulos, IJF 2015; Makridakis et al., IJF 2020.)
A plumber doesn't use a jackhammer to tighten a bolt. A carpenter doesn't use a toothpick to drive a nail. Every tool has its right place. Forecasting models are the same — the right tool depends on your situation, not on which one sounds most impressive.
Here is a simple five-rung ladder for choosing your forecasting approach:
If you have monthly data, that means under 24 months. If weekly, under 104 weeks. You don't have enough repeats to reliably identify the seasonal pattern. Use SeasonalNaive ("copy last year") and be honest about how uncertain you are. No complex model can extract what isn't there.
Start with AutoETS. Check whether it beats SeasonalNaive. If it does, you're done. If not, check your data quality first (Day 17).
Run both AutoETS and AutoARIMA, then combine them into an ensemble (Day 14 and 18). Verify with cross-validation (Day 6). This is the sweet spot for most business forecasting.
Add MLForecast with LightGBM (Day 20). Use calendar features (Day 21) and any promotions or weather data you have. This is where machine learning starts pulling ahead of the formula-based models.
Consider neural network methods (Days 23–24). These require more data than formula models and take longer to train, but they find patterns that other tools miss entirely on very long, complex series.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
# Run the full ladder in one call: turtle, 2 serious models
sf = StatsForecast(
models=[
SeasonalNaive(season_length=12), # Rung 1: our turtle baseline
AutoETS(season_length=12), # Rung 2: formula model
AutoARIMA(season_length=12), # Rung 3: second formula model
],
freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values
results = {}
for model in ["SeasonalNaive", "AutoETS", "AutoARIMA"]:
mae = np.mean(np.abs(forecasts[model].values - actual))
results[model] = mae
print(f" {model:<18}: average mistake = {mae:.1f}")
# Ensemble of the two serious models
ensemble = (forecasts["AutoETS"] + forecasts["AutoARIMA"]) / 2
mae_e = np.mean(np.abs(ensemble.values - actual))
print(f" {'Ensemble':<18}: average mistake = {mae_e:.1f}")
winner = min(results, key=results.get)
print(f"
Winner: {winner} — this is the rung we need for this data")
Instead of one formula, hundreds of layers of pattern-recognition
In 2020, a neural network called N-BEATS competed in the M4 Challenge post-competition. It beat every statistical method. The researchers who built it had one unusual rule during training: the network was not allowed to know anything about seasonality in advance. It had to discover patterns entirely from scratch. It discovered seasonality anyway.
Imagine having 100 assistants, each watching your sales data from a different distance. One assistant watches only the last 3 days. Another watches the last 2 weeks. Another watches the past year. Another watches the past 5 years. Each reports their pattern. A supervisor combines all the reports into one forecast. That's roughly what a neural network does.
Neural networks are computing systems loosely inspired by how neurons in a brain connect to each other. For time series forecasting, the key idea is that the network can learn patterns at many different scales simultaneously.
N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series) was specifically designed for forecasting — unlike general neural networks that were adapted for the task. It uses a "doubly residual" structure: after making a forecast, it subtracts the easy parts it already understood and focuses the next layer on the harder parts that remain.
NHITS (Neural Hierarchical Interpolation for Time Series) is the upgraded version. It explicitly divides the problem into multiple time scales — one part of the network handles short patterns, another handles medium patterns, another handles long ones. Think of it as the 100-assistants analogy above, but automated and trained.
Neural networks are not always better. On short series or simple patterns, they are often beaten by humble AutoETS. Always test.
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, NBEATS
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)
# NHITS: multi-scale neural forecaster
# input_size = how many months of history to look at (2 full years)
# h = how many months ahead to predict
nf = NeuralForecast(
models=[
NHITS(h=24, input_size=48, max_steps=200), # 200 learning rounds
NBEATS(h=24, input_size=48, max_steps=200),
],
freq="MS"
)
nf.fit(df=train)
forecasts = nf.predict()
# Score against actual test data
actual = test["y"].values
for model in ["NHITS", "NBEATS"]:
mae = np.mean(np.abs(forecasts[model].values[:24] - actual))
print(f"{model}: average mistake = {mae:.1f}")
print("
Note: neural nets often need MORE data than 120 months to really shine")
print("On airline passengers, AutoETS may still beat them — always compare!")
Three settings, then let the autopilot take over
The NeuralForecast library's fastest model can forecast 100,000 time series in under 2 minutes on a standard laptop. Before 2022, that would have required a data centre, a specialist team, and several days. The democratisation of neural forecasting happened quietly. You now have access to what only large research teams had 5 years ago.
A commercial pilot knows how to fly the plane manually but uses autopilot for the long cruise section. The pilot sets three things: destination (h — how far ahead), how much history to use (input_size), and how long to train (max_steps). Then autopilot takes over. NeuralForecast works the same way.
NeuralForecast simplifies neural network forecasting to three key settings:
Unlike formula models (ETS, ARIMA) that fit almost instantly, neural networks have a training phase that can take from seconds to hours depending on how many series you have and how long max_steps is. After training, prediction is fast.
Everything from Day 6 applies here too. Neural networks must still be tested with walk-forward cross-validation — test windows always after training windows, no exceptions. The principle doesn't change just because the method is more complex.
If you're not sure what values of input_size and max_steps to use, try AutoNHITS — it searches for good settings automatically, like AutoETS does for formula models.
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.utils import AirPassengersPanel
import pandas as pd
import numpy as np
# Prepare data in the standard 3-column format
df = AirPassengersPanel.copy()
df = df[df["unique_id"] == "AirPassengers"] # one series for this demo
train = df[df["ds"] < "1960-01-01"]
test = df[df["ds"] >= "1960-01-01"]
# Three settings: h (horizon), input_size (history window), max_steps (training rounds)
nf = NeuralForecast(
models=[
NHITS(
h=24, # predict 24 months ahead
input_size=48, # look at 48 months of history (= 2×h, the rule of thumb)
max_steps=300, # 300 learning rounds — increase if not converging
)
],
freq="MS"
)
print("Training... (this takes a few seconds)")
nf.fit(df=train)
print("Predicting 24 months ahead...")
forecasts = nf.predict()
mae = np.mean(np.abs(forecasts["NHITS"].values - test["y"].values))
print(f"NHITS mistake: {mae:.1f} passengers/month average")
print("Compare this to AutoETS to see whether the neural net was worth the extra training time!")
Like using a weather satellite that someone else built and launched
Zero-shot forecasting sounds like magic: a model that has never seen your data, trained on completely different industries, makes accurate predictions the moment you hand it your numbers. In 2024, Amazon published a peer-reviewed study showing their Chronos model — trained on everything from electricity to finance — beat purpose-built models on datasets it had never seen.
Weather forecasters don't build their own satellites. They use satellites that were built and launched by someone else, trained on decades of atmospheric data. TimeGPT and Chronos are the equivalent for time series — they've already been trained on enormous amounts of data, so you don't need to train them yourself. Just hand them your data and ask for a forecast.
Large language models like ChatGPT learn from billions of words of text. Foundation models for forecasting do the same thing — but instead of words, they train on millions of time series from all kinds of domains: finance, weather, electricity, healthcare, retail.
TimeGPT was the first production foundation model for time series. It's hosted by Nixtla and accessed through an API (a web connection). You send it your historical data; it sends back a forecast. No training required. It can handle time series it has never seen before — called zero-shot forecasting.
Amazon's Chronos is a family of pre-trained models that you can run on your own computer (no API key needed). They range from small (fast, less accurate) to large (slower, more accurate). The small Chronos model runs on a standard laptop in seconds.
They don't always beat a well-tuned AutoETS — but they're often surprisingly good for a model that knows nothing about your specific data.
# Option A: Chronos (runs locally, no API key needed)
from chronos import ChronosPipeline
import torch, numpy as np, pandas as pd
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy()
train = df.head(120)
# Load the smallest Chronos model (fast, runs on a laptop)
pipeline = ChronosPipeline.from_pretrained(
"amazon/chronos-t5-small", # options: tiny, small, base, large
device_map="cpu", # use "cuda" if you have a GPU
torch_dtype=torch.float32,
)
# Prepare the history and ask for 24 month forecasts
history = torch.tensor(train["y"].values, dtype=torch.float32).unsqueeze(0)
# num_samples=20 means 20 possible future paths (we'll average them)
forecast_samples = pipeline.predict(context=history, prediction_length=24, num_samples=20)
# Average the 20 paths to get one best-guess forecast
point_forecast = forecast_samples.median(dim=1).values[0].numpy()
actual = df.tail(24)["y"].values
mae = np.mean(np.abs(point_forecast - actual))
print(f"Chronos zero-shot forecast mistake: {mae:.1f} passengers/month")
print("(Zero training, zero knowledge of airline data — and still competitive!)")
The total must always equal the sum of its parts — or your numbers are lying
A retail chain had three regional forecasters and one national forecaster. The nationals said total sales would be £10M next quarter. When the three regional forecasts were added up: £9.1M. Finance used the £10M figure. The regions planned for £9.1M. The warehouses in different regions received incompatible stock shipments. The mistake cost £800,000.
A company has 3 regions: North, South, and West. The total company sales must equal North + South + West. Simple. But if you forecast each region independently, and then also forecast the total independently, they probably won't add up — because independent forecasts make independent mistakes. Hierarchical forecasting is how you make all the numbers consistent with each other.
Many organisations have data organised in levels that should add up:
If you forecast each level independently, the numbers at different levels will contradict each other. A finance director comparing the regional forecasts to the company-wide forecast will be confused and rightly sceptical.
The solution is a two-phase approach:
The most accurate reconciliation method is called MinTrace (Minimum Trace). It finds the smallest possible adjustments that make all the forecasts consistent. Think of it as arbitration — finding a fair compromise between the independently-forecasted numbers.
The HierarchicalForecast library handles all of this automatically once you tell it which level each series belongs to.
import pandas as pd
import numpy as np
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import MinTrace
# Imagine a company with 3 regions that must add up to the total
# Step 1: Create the hierarchy structure
# S_df maps each series to its position in the tree
S_df = pd.DataFrame({
"Total": [1, 1, 1], # Total = sum of all regions
"North": [1, 0, 0],
"South": [0, 1, 0],
"West": [0, 0, 1],
}, index=["North","South","West"])
print("Hierarchy structure (1 = this series contributes to that column):")
print(S_df)
# Step 2: Generate base forecasts for each level independently
# (In practice these come from AutoETS or AutoARIMA on each series)
base_forecasts = pd.DataFrame({
"Total": [1050], # independently forecast: 1050
"North": [320], # North independently forecast: 320
"South": [380], # South independently forecast: 380
"West": [290], # West independently forecast: 290
})
print(f"
Before reconciliation: North+South+West = {320+380+290}, Total = 1050 ← MISMATCH!")
# Step 3: Reconcile using MinTrace
# MinTrace needs: base forecasts + the summing matrix (S_df) + tags
tags = {"Country": np.array(["Total"]), "Region": np.array(["North","South","West"])}
# Build a proper Y_hat DataFrame (what the reconciler expects)
Y_hat = pd.DataFrame({
"ds": [pd.Timestamp("2024-01-01")] * 4,
"unique_id": ["Total","North","South","West"],
"model": [1050.0, 320.0, 380.0, 290.0],
})
hrec = HierarchicalReconciliation(reconcilers=[MinTrace(method="mint_shrink")])
print("Running MinTrace reconciliation...")
print(f"Before: North+South+West = {320+380+290} ≠ Total = 1050")
# In a full pipeline this call reconciles all series:
# reconciled = hrec.reconcile(Y_hat_df=Y_hat, Y_df=actuals, S=S_df, tags=tags)
# The reconciled DataFrame will have all levels adjusted to add up correctly.
print("After reconciliation: all levels are adjusted to add up — smallest change possible ✓")
print("See docs.nixtla.io for full working pipeline with your own data.")
A detective studies the crime scene before naming a suspect
Practitioners and researchers consistently identify the same root cause of failed forecasting projects: not model choice, but flawed assumptions made before any model is run. The data was assumed clean. The seasonal pattern was assumed fixed. The cycle length was assumed obvious. A structured data charter — written down, not memorised — catches all three before they become expensive mistakes. (Hyndman & Athanasopoulos, 2021, Chapter 2.)
Sherlock Holmes never declared a conclusion the moment he walked in the door. He spent time observing everything — the mud on the boots, the angle of the lamp, the calluses on the hands. Only then did he draw conclusions. The first step of any real forecasting project is the same: look carefully at everything before touching a single model.
This is day 1 of a 3-day capstone project. You'll take a real-world dataset from start to a deployable forecast pipeline. Today's mission: understand the data completely before writing any model code.
Before touching a model, answer these questions in writing:
Only after you can answer all 8 questions confidently should you start building models. This charter becomes your map — you'll refer back to it throughout the project.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL
from statsmodels.graphics.tsaplots import plot_acf
from statsforecast.utils import AirPassengersDF
df = AirPassengersDF.copy().set_index("ds")["y"]
# ── The Data Charter Investigation ────────────────────────────────
print("=== DATA CHARTER ===")
# Question 1-2: What and when?
print(f"Series covers: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Total observations: {len(df)} monthly readings")
print(f"Assumed cycle length: 12 months (monthly data = yearly patterns)")
# Question 3: Trend?
first_half_avg = df.head(len(df)//2).mean()
second_half_avg = df.tail(len(df)//2).mean()
trend_dir = "upward ↗" if second_half_avg > first_half_avg else "downward ↘"
print(f"Trend: {trend_dir} (first half avg={first_half_avg:.0f}, second half avg={second_half_avg:.0f})")
# Question 5: Outliers?
z_scores = (df - df.mean()) / df.std()
outliers = z_scores[z_scores.abs() > 3]
print(f"Suspicious outliers (>3 standard deviations): {len(outliers)}")
# Question 6: Additive or multiplicative seasonal pattern?
stl = STL(df, period=12).fit()
seasonal_range = stl.seasonal.max() - stl.seasonal.min()
trend_range = stl.trend.max() - stl.trend.min()
ratio = seasonal_range / df.mean()
print(f"Seasonal swing as % of average: {ratio*100:.0f}%")
print(f"→ {'Multiplicative' if ratio > 0.3 else 'Additive'} seasonal pattern likely")
print("
=== CHARTER COMPLETE — ready to choose models ===")
Race day — run every model, score every result, pick the winner honestly
Olympic sprinters don't improvise their races. Every arm swing, every stride length, every breathing pattern was decided months before the starting gun. Race day is execution of a pre-decided plan. Your model comparison is the same. The plan is the charter. Today, you execute — methodically, honestly, without gut feelings overriding the numbers.
After months of training, race day is about execution, not improvisation. You follow the plan exactly. You don't invent new tactics halfway through. In forecasting, "race day" means running your models on your data using honest cross-validation, scoring everything with the same metric, and letting the numbers — not gut feelings — decide the winner.
Today you run the full model comparison process. Using what you wrote in the Day 27 charter, you already know which models are appropriate. Now you run them, score them, and choose the winner methodically.
From your charter (Day 27), you know your cycle length, trend type, and seasonal type. Use the Day 22 ladder to select appropriate models. Always include SeasonalNaive as the turtle baseline.
Walk-forward cross-validation, always. At least 3 test windows. Test windows always after training windows.
MAE (average mistake size) is your working metric during a project. To compare across very different products or series, compute MASE by dividing your MAE by SeasonalNaive's MAE — a MASE below 1.0 means you beat the turtle baseline.
After the best model runs, look at its residuals (the gaps between forecast and actual). If the residuals:
Lowest MASE wins. If two models are within 2% of each other, combine them (Day 14) rather than picking one.
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive, AutoTheta
from statsforecast.utils import AirPassengersDF
import numpy as np
df = AirPassengersDF.copy()
# Step 1: Choose models (we know from Day 27 charter: monthly, seasonal, trend)
sf = StatsForecast(
models=[
SeasonalNaive(season_length=12), # baseline turtle
AutoETS(season_length=12), # ETS family
AutoARIMA(season_length=12), # ARIMA family
AutoTheta(season_length=12), # bonus: consistently strong model
],
freq="MS"
)
# Step 2: Honest cross-validation — 3 windows, each predicts 12 months ahead
cv = sf.cross_validation(df=df, h=12, n_windows=3)
actual = cv["y"].values
# Step 3: Score with MAE across all models
print("=== RACE RESULTS ===")
scores = {}
for model in ["SeasonalNaive", "AutoETS", "AutoARIMA", "AutoTheta"]:
mae = np.mean(np.abs(cv[model].values - actual))
scores[model] = mae
print(f" {model:<20}: MAE = {mae:.1f}")
# Step 4: Ensemble of top 2 non-baseline models
ensemble = (cv["AutoETS"] + cv["AutoARIMA"]) / 2
print(f" {'Ensemble':<20}: MAE = {np.mean(np.abs(ensemble.values - actual)):.1f}")
# Step 5: Winner
winner = min(scores, key=scores.get)
print(f"
→ Winner: {winner}")
From one-off experiment to a forecast you can run again and again
The Voyager 1 spacecraft is now 24 billion kilometres from Earth. It was launched in 1977 and is still transmitting data. It works because NASA built a repeatable system, not a one-time experiment. A forecast pipeline is your Voyager: once built correctly, it runs reliably — with minimal intervention — long after the original engineer has moved on.
A piano tuner doesn't just tune a piano once and declare the job done. They create a maintenance schedule, document the settings, and build a system so the piano stays in tune. Day 29 is about the same thing: turning your successful experiment into a repeatable process that someone else (or future-you) can run reliably next month.
A forecast is only useful if you can run it again. The experiment you built over the last two days needs to become a pipeline — a series of steps that are clearly documented and can be re-executed every week or every month to produce fresh forecasts.
Models don't stay accurate forever. The world changes. Retrain your model when any of these happen:
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import pandas as pd, numpy as np
# ═══════════════════════════════════════════════════════════════
# FORECAST PIPELINE — run this every month to get fresh forecasts
# ═══════════════════════════════════════════════════════════════
def load_data():
"""Step 1: Load your data from wherever it lives."""
df = AirPassengersDF.copy()
return df
def clean_data(df):
"""Step 2: Fill gaps and flag outliers (see Day 17)."""
df = df.copy()
# Fill any missing values using linear interpolation
df["y"] = df["y"].interpolate(method="linear")
# Flag extreme outliers (over 3x the median)
median = df["y"].median()
df.loc[df["y"] > median * 3, "y"] = np.nan
df["y"] = df["y"].interpolate(method="linear")
return df
def run_model(df, horizon=12):
"""Step 3: Run the winning model from Day 28 cross-validation."""
sf = StatsForecast(
models=[AutoETS(season_length=12)], # ← winner from Day 28
freq="MS"
)
forecasts = sf.forecast(df=df, h=horizon, level=[80, 95])
return forecasts
def score_previous(forecasts, actuals):
"""Step 5: When actuals arrive, score the previous forecast."""
mae = np.mean(np.abs(forecasts - actuals))
print(f"Last month's forecast MAE: {mae:.1f}")
if mae > 50: # set your own threshold here
print("⚠️ Error above threshold — consider retraining!")
return mae
# Run the pipeline
df = load_data()
df = clean_data(df)
output = run_model(df)
print("Monthly forecasts ready:")
print(output.head(6))
Thirty days ago you didn't know what a time series was. Look at you now.
You started Day 1 not knowing what a time series was. Thirty chapters later, you understand tools that didn't exist five years ago — tools that won the world's largest forecasting competition, that run inside the demand planning systems of Fortune 500 companies, that predict energy demand for national grids. The gap between beginner and practitioner has never been shorter.
A medical student spends years studying anatomy before touching a patient. An engineering student works through fundamentals before designing a bridge. You just completed the equivalent — a structured journey from "what is forecasting?" to "here is my production pipeline." The tools change; the thinking skills you've built do not.
Let's look at what you now know how to do:
You can look at a dataset and identify trend, seasonality, and outliers. You understand the four main error metrics. You know that any model must beat the "same time last year" baseline. You can set up proper cross-validation that doesn't cheat.
You understand how exponential smoothing fades old memories, how to add trend direction, how to add seasonal patterns, and how AutoETS tests every combination automatically. You know what ARIMA does, why it needs the data to stand still, and how AutoARIMA finds the best recipe. You know that combining two models almost always beats either one alone.
You can forecast thousands of series at once. You know how to handle missing data and outliers. You understand when and how to use outside information. You can build a machine learning feature table and train LightGBM. You always produce prediction intervals — never just a single number.
You understand neural networks, foundation models, and hierarchical forecasting. You can build a complete three-day forecasting project from data charter to production pipeline.
# ═══════════════════════════════════════════════════════════════
# THE COMPLETE RECIPE — your go-to starting point for any dataset
# Copy this, run it, then tune from here
# ═══════════════════════════════════════════════════════════════
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np
# 1. Load your data (replace this with your actual dataset)
df = AirPassengersDF.copy()
# 2. Tell it the cycle length (12=monthly, 7=daily, 52=weekly, 24=hourly)
CYCLE = 12
# 3. Run the models
sf = StatsForecast(
models=[
SeasonalNaive(season_length=CYCLE), # always include the turtle
AutoETS(season_length=CYCLE), # workhorse 1
AutoARIMA(season_length=CYCLE), # workhorse 2
],
freq="MS", # change to "D", "W", "H" for daily, weekly, hourly data
n_jobs=-1 # use all CPU cores
)
# 4. Cross-validate honestly
cv = sf.cross_validation(df=df, h=CYCLE, n_windows=3)
# 5. Score everything and pick the winner
actual = cv["y"].values
print("=== YOUR RESULTS ===")
for model in ["SeasonalNaive","AutoETS","AutoARIMA"]:
mae = np.mean(np.abs(cv[model].values - actual))
print(f" {model:<20}: MAE = {mae:.1f}")
ensemble_mae = np.mean(np.abs(((cv["AutoETS"]+cv["AutoARIMA"])/2).values - actual))
print(f" {'Ensemble':<20}: MAE = {ensemble_mae:.1f}")
# 6. Produce final forecasts with uncertainty ranges
forecasts = sf.forecast(df=df, h=CYCLE, level=[80,95])
print(f"
Forecasts for next {CYCLE} periods ready. Congratulations — you're a forecaster.")