Phase 1 — SEE CLEARLY

Day 1 of 30

🔮 What Is Forecasting?

You already do this before you leave the house every morning

📖 Before we begin

A NASA engineer once said: 'Give me the last 30 days of temperature readings and I'll tell you next week's weather better than a meteorologist who just guessed.' He wasn't bragging. He was describing the single most powerful insight in all of forecasting.

The Analogy

Before you leave home, you glance at the sky. Clouds? Bring an umbrella. Sunshine? Leave it behind. Without realising it, you just made a forecast — you used something you could see (the clouds) to predict something you couldn't see yet (the rain). Forecasting with numbers works exactly like that, just with a spreadsheet instead of a window.

✓ Source: Forecasting: Principles and Practice by Rob Hyndman & George Athanasopoulos (2021), freely available at otexts.com/fpp3. This is the gold-standard textbook this course follows.

Imagine writing down the temperature outside every single morning for a whole year. January 1st: 4°C. January 2nd: 6°C. January 3rd: 3°C. And so on, all the way to December 31st.

That list of numbers, collected one by one over time, is called a time series. "Time series" is just a fancy way of saying "a list of numbers that changes as time passes." Your height measured every birthday is a time series. The number of customers in a café each hour is a time series. Daily rainfall amounts are a time series.

Forecasting means using the patterns in that list to make your best guess about what comes next — what will the temperature be tomorrow? How many customers next Tuesday? How much rain next week?

Three words to learn today

Horizon — how far ahead you want to predict. Predicting tomorrow is a very short horizon. Predicting next year is a long horizon. Like the horizon you see at the beach, except in time instead of distance.
Frequency — how often the numbers are collected. Every hour, every day, every week, every month. A doctor measures your height once a year. A weather station measures temperature every minute.
Pattern — something that happens again and again. Ice cream sales always go up in summer. Toy shops always get busier in December. These repeating patterns are what make forecasting possible.
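
To make the first two words concrete, here is a tiny sketch in plain Python. The dates and temperatures are made up for illustration, and `horizon` is just a variable name chosen here, not something from a library.

```python
from datetime import date, timedelta

# Frequency: one reading per day (made-up temperatures in °C)
start = date(2024, 1, 1)
temperatures = [4, 6, 3, 5, 7, 6, 4]
history = [(start + timedelta(days=i), temp) for i, temp in enumerate(temperatures)]

# Horizon: how many steps ahead we want to predict
horizon = 3
last_day = history[-1][0]
future_days = [last_day + timedelta(days=step) for step in range(1, horizon + 1)]

print("Last reading:", last_day.isoformat())
for day in future_days:
    print("Forecast needed for:", day.isoformat())
```

With daily data, a horizon of 3 means three days ahead. With monthly data, the same number would mean three months ahead.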

Your toolkit: the nixtlaverse

Throughout this course, we use a set of free Python tools called the nixtlaverse, built by a company called Nixtla. Think of it as a toolbox where every tool uses the same standard format for data. Learn that format once and every tool in the box just works.

The format is three columns: who (which product, shop, or sensor), when (the date), and what (the number you care about). That's it.

🐘 Your pink elephant: Picture a window. Clouds = past data. Your umbrella decision = your forecast. Every time you see a window from now on, you'll think: am I reading the clouds or just guessing?

Key Takeaways

A time series is just a list of numbers collected over time — your height each birthday, daily temperature, monthly sales
Forecasting means using patterns from the past to make educated guesses about the future
Every tool in this course uses three columns: unique_id (who), ds (when), y (what number)

Python

import pandas as pd

# Let's create a simple time series: ice cream sales each month
# Three columns: WHO (unique_id), WHEN (ds), WHAT (y)
ice_cream = pd.DataFrame({
    "unique_id": "my_ice_cream_shop",    # name of our series
    "ds": pd.date_range("2023-01", periods=12, freq="MS"),  # Jan-Dec 2023
    "y":  [120, 110, 140, 160, 200, 280,  # Jan-Jun sales
           320, 310, 250, 180, 140, 130]  # Jul-Dec sales
})

print(ice_cream)
# You'll see: more sales in summer (rows 5-7, June to August), fewer in winter
# That repeating pattern is exactly what we'll learn to predict!

→You now know what a time series is. But what separates forecasters who get it right from those who don't? It's a single habit — and it takes 30 seconds. Tomorrow: the one thing every expert does before touching a model.

Source: Hyndman, Athanasopoulos et al. (2025). Forecasting: Principles and Practice, the Pythonic Way. otexts.com/fpppy — CC BY-NC-ND 4.0 | Original code: MIT Licence

Phase 1 — SEE CLEARLY

Day 2 of 30

👁️ Always Look Before You Model

No doctor performs surgery without studying the X-ray first

📖 Before we begin

Google launched Flu Trends in 2008 with bold claims. It missed the 2009 H1N1 outbreak entirely. By 2013 a Nature study showed it was over-estimating flu cases by 2×. A 2014 Science paper named the cause: big data hubris — they trusted the algorithm and never looked carefully at what the data was actually saying.

The Analogy

Imagine a doctor who skips the X-ray and goes straight to surgery. Terrifying, right? A good doctor looks carefully at all the evidence before touching a single instrument. Looking at your data before building any kind of model is the same thing. Most forecasting mistakes happen because someone skipped this step.

✓ Source: Hyndman & Athanasopoulos (2021), Chapter 2. The phrase "always start with a time plot" appears verbatim in the textbook. The ACF (autocorrelation function) is a classical tool from Box-Jenkins methodology, first published 1970.

Before you do anything else — before you touch a single tool or write a line of code — make a simple line chart of your numbers over time. Look at it. Study it. Ask yourself four questions:

Is it going up or down overall? Like a rising tide — is the general direction upward, downward, or flat? This slow overall direction is called the trend.
Does it spike every year at the same time? Ice cream sales every summer. Christmas toys every December. A pattern that repeats on a regular schedule is called seasonality. (It doesn't have to be by season — it could repeat every week, every month, or every hour.)
Are there any sudden jumps or drops? A factory fire. A viral tweet. A global pandemic. These unexpected one-off events are called outliers. Your model needs to know they happened, otherwise it will be confused by them.
Did the behaviour suddenly change direction? A company launches a new product and growth jumps overnight. This is called a structural break: the patterns before and after it are different.

The pattern-detector chart

There is a second chart, called the ACF plot, that acts like a pattern detector. ACF stands for "Autocorrelation Function" — a mouthful, but the idea is simple: it measures how much today's number is related to yesterday's, last week's, and last month's.

If there's a big spike at "12 months ago", your data has a yearly pattern. If there's a spike at "7 days ago", it has a weekly pattern. This one chart tells you almost everything you need to know about the structure of your data.
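
The idea behind the ACF can be sketched by hand. The function below is a minimal version of the usual sample autocorrelation formula, written in plain Python so the arithmetic stays visible. The series is invented: the same 12-month shape repeated four times.

```python
def autocorr(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Made-up monthly series: one yearly shape repeated for 4 years
one_year = [10, 12, 15, 20, 28, 40, 45, 42, 30, 20, 14, 11]
series = one_year * 4

for lag in (1, 6, 12):
    print(f"lag {lag:>2}: {autocorr(series, lag):+.2f}")
```

The value at lag 12 comes out strongly positive (this month looks like the same month last year), while lag 6 comes out negative (summer is the opposite of winter). That is the kind of signal the tall bars in an ACF plot are showing.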

The golden rule: if your chart surprises you, your forecast will surprise you too — and not in a good way.

🐘 Your pink elephant: The detective's magnifying glass over a timeline. Every time someone shows you a number without a chart, you'll feel the itch — but what does it look like over time?

Key Takeaways

Always draw your data as a line chart first — look for trend (overall direction), seasonality (repeating pattern), outliers (sudden jumps), and structural breaks (the pattern changes permanently)
The ACF chart is your pattern detector — tall bars tell you exactly what repeating patterns are hiding in your data
If the chart surprises you, stop and investigate before building any model

Python

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsforecast.utils import AirPassengersDF  # a classic dataset about airline passengers

# Load the data: monthly airline passengers 1949-1960
df = AirPassengersDF.copy().set_index("ds")["y"]

fig, axes = plt.subplots(2, 1, figsize=(11, 7))

# Chart 1: The basic line chart
# Look for: is it going up? Does it get bigger waves over time?
axes[0].plot(df, linewidth=2, color="#16a34a")
axes[0].set_title("Airline passengers over time — what do you notice?")
axes[0].set_ylabel("Number of passengers (thousands)")

# Chart 2: The pattern detector (ACF plot)
# Look for: bars that stay tall for many lags (that's the trend)
# and bars that bump up again at 12, 24 and 36 (that's the yearly pattern)
# A bump at 12 means: "what happened 12 months ago helps predict today"
plot_acf(df, lags=36, ax=axes[1], color="#7c3aed")
axes[1].set_title("Pattern detector: the bumps at 12, 24 and 36 point to a YEARLY pattern!")

plt.tight_layout()
plt.show()

→You can now see patterns in your data. But what if the patterns are tangled together — trend mixed with seasons mixed with noise? How do you pull them apart? Tomorrow: the sandwich.

Phase 1 — SEE CLEARLY

Day 3 of 30

🥪 Peeling Apart Your Data

Every time series is a sandwich — pull the layers apart to understand each one

📖 Before we begin

A supermarket chain couldn't figure out why its ice cream sales model kept failing. They hired three consultants. All three missed the same thing: the seasonal pattern was getting bigger every year — but they'd been modelling it as a fixed bump. The BLT sandwich had been assembled wrong.

The Analogy

A BLT sandwich looks like one thing, but it's actually three layers: bread (the solid base), bacon (the repeating delicious bit), and lettuce (the random unpredictable crunchy stuff). Your data is the same. Pull it apart and suddenly you can see what's really happening. This pulling-apart process is called decomposition.

✓ Source: STL decomposition was introduced by Cleveland, Cleveland, McRae & Terpenning (1990), Journal of Official Statistics. STL stands for "Seasonal-Trend decomposition using LOESS." It remains one of the most widely used decomposition methods.

Hidden inside every set of time-based numbers are three separate ingredients that got mixed together. Decomposition is the process of separating them out so you can study each one.

Ingredient 1: The Trend

This is the slow, long-term direction. Like an escalator — it steadily goes up, or steadily goes down, over months and years. Sales at a growing company trend upward. Sales at a declining one trend downward. The trend doesn't jump around — it moves slowly and steadily.

Ingredient 2: The Season (Repeating Pattern)

This is the predictable, calendar-driven pattern that comes back again and again. Ice cream spikes every July. Gym memberships spike every January. A bakery sells more on Saturdays. This ingredient is called seasonality, and it's the most valuable one, because a pattern that repeats on schedule is highly predictable.

Ingredient 3: The Remainder (Random Noise)

After you've removed the trend and the seasonal pattern, whatever's left is the remainder. This is the random stuff — a factory caught fire, someone famous mentioned your brand, a freak weather event. No model can predict this part. It's the chaos ingredient.

Why does this matter?

If your remainder is small compared to the trend and seasonal parts, your data is mostly predictable — great news! If the remainder is huge, then most of what's happening is random noise, and even the best model won't do much better than a simple guess.
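
One way to put a number on "small compared to the seasonal part" is the strength-of-seasonality measure described in the Hyndman & Athanasopoulos textbook: 1 minus the variance of the remainder divided by the variance of seasonal plus remainder, floored at 0. The sketch below applies it to made-up layers. The function name and the numbers are illustrative assumptions, not output from a real decomposition.

```python
from statistics import pvariance

def seasonal_strength(seasonal, remainder):
    """Share of the detrended variation explained by the seasonal layer (0 to 1)."""
    detrended = [s + r for s, r in zip(seasonal, remainder)]
    return max(0.0, 1.0 - pvariance(remainder) / pvariance(detrended))

# Made-up layers for illustration (two years of monthly data)
seasonal = [-30, -20, -5, 10, 25, 40, 45, 35, 10, -15, -40, -55] * 2
small_noise = [2, -1, 3, -2, 1, 0, -3, 2, -1, 1, -2, 0] * 2
large_noise = [60, -80, 90, -70, 50, -40, -90, 80, -60, 70, -50, 40] * 2

print(f"small remainder: strength = {seasonal_strength(seasonal, small_noise):.2f}")
print(f"large remainder: strength = {seasonal_strength(seasonal, large_noise):.2f}")
```

A strength close to 1 means the repeating pattern dominates. A strength close to 0 means the remainder drowns it out.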

🐘 Your pink elephant: A BLT sandwich pulled into three separate layers on a plate. Bacon = seasonal. Bread = trend. Lettuce = noise. Every time you see a sandwich, you'll think: have I decomposed my data yet?

Key Takeaways

Decomposition splits your data into 3 layers: Trend (slow direction) + Seasonality (repeating pattern) + Remainder (random noise)
Small remainder = your data is mostly predictable. Huge remainder = mostly random noise, forecasting will be hard
Seasonality is the most valuable layer — because it repeats, you can predict it very reliably

Python

from statsmodels.tsa.seasonal import STL
from statsforecast.utils import AirPassengersDF
import matplotlib.pyplot as plt

# Load airline passenger numbers (monthly, 1949-1960)
df = AirPassengersDF.copy().set_index("ds")["y"]

# STL = the tool that pulls your data apart into 3 layers
# period=12 means we expect a YEARLY pattern (12 months per cycle)
result = STL(df, period=12).fit()

# Now plot the 3 layers separately
fig, axes = plt.subplots(4, 1, figsize=(11, 9), sharex=True)
for ax, data, label, color in zip(axes,
    [df, result.trend, result.seasonal, result.resid],
    ["Original (all mixed together)",
     "Trend (the escalator)",
     "Seasonal (the repeating summer peaks)",
     "Remainder (the random leftover noise)"],
    ["#374151", "#16a34a", "#7c3aed", "#dc2626"]):
    ax.plot(data, color=color, linewidth=1.8)
    ax.set_ylabel(label, fontsize=10)

plt.tight_layout()
plt.show()

# Is our data mostly predictable? Compare the typical size of each layer
# (the passenger counts in this dataset are in thousands)
print(f"Seasonal pattern size: {result.seasonal.std():.0f} thousand passengers")
print(f"Random noise size:     {result.resid.std():.0f} thousand passengers")
print("→ If seasonal is much bigger than noise = GOOD, mostly predictable!")

→You can now pull your data apart. But when your forecast is wrong — how do you know how wrong? And is there a scoreboard that won't lie to you? Tomorrow: the sports scoreboard problem.

Phase 1 — SEE CLEARLY

Day 4 of 30

🎯 How Wrong Are You? Measuring Your Mistakes

Different sports use different scoreboards — pick the right one

📖 Before we begin

Two forecasters submitted predictions for the same dataset. Forecaster A had a 12% MAPE. Forecaster B had a 38% MAPE. The competition organiser announced Forecaster B the winner. The audience was confused. Then the organiser explained why MAPE had lied — and the room went silent.

The Analogy

In golf, every stroke counts equally. In basketball, shots can be worth 2 or 3 points. The scoring system shapes the game. When you forecast, you need a "scoreboard" that measures how wrong you were. But different scoreboards reward different things. Pick the wrong one and your model will get very good at winning the wrong game.

✓ Source: MASE was introduced by Hyndman & Koehler (2006), International Journal of Forecasting. MAPE's breakdown near zero is well documented. The M5 Forecasting Competition (Walmart, 2020), whose sales data is full of zeros, scored entries with a scaled error measure instead.

After your forecast runs, you need to measure: how wrong were my guesses? Here are the four main "scoreboards" and when to use each one:

Scoreboard 1: MAE — Mean Absolute Error

The simplest scoreboard. Take every mistake, ignore whether it was too high or too low (that's the "absolute" part), and find the average. If your MAE is 50, you were wrong by about 50 units on average. Easy to explain to anyone. Use this as your default.

Scoreboard 2: RMSE — Root Mean Squared Error

Like MAE, but it squares each mistake before averaging, then takes the square root at the end. This punishes large mistakes much harder than small ones. Missing by 200 units counts as 16× worse than missing by 50 units. Use this when a big mistake is especially costly — like medicine stock in a hospital.
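
A quick sketch of the difference, using two made-up sets of mistakes. The helper names `mae` and `rmse` are defined here for illustration only.

```python
from math import sqrt

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

steady_errors = [50, 50, 50, 50]   # always off by 50
spiky_errors = [0, 0, 0, 200]      # perfect three times, then one big miss

for name, errors in [("steady", steady_errors), ("spiky", spiky_errors)]:
    print(f"{name:<7} MAE = {mae(errors):.0f}   RMSE = {rmse(errors):.0f}")
```

Both forecasters have the same MAE (50), but the spiky one's RMSE is twice as large (100). RMSE notices the single big miss. MAE does not.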

Scoreboard 3: MASE — Mean Absolute Scaled Error ⭐

This one compares your mistakes to what a very simple "copy the last value" approach would get. A MASE below 1.0 means you beat the simple approach. A MASE above 1.0 means the simple approach beat you — and you should stop and fix your model. Use this to compare models across different products.
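
Here is a minimal sketch of that comparison in plain Python, assuming the usual setup where the "copy the last value" mistakes are measured on the training history. The sales numbers and both forecasts are made up, and `mase` is a helper written for this example, not a library function.

```python
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mase(train, actual, predicted):
    """MAE on the test period divided by the naive method's MAE on the training period."""
    naive_errors = [abs(train[t] - train[t - 1]) for t in range(1, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale

# Made-up weekly sales: 8 weeks of history, then 3 weeks we forecast
train = [500, 520, 480, 510, 530, 490, 505, 525]
actual = [515, 540, 500]
good_forecast = [510, 530, 510]
bad_forecast = [450, 600, 430]

print(f"good forecast MASE = {mase(train, actual, good_forecast):.2f}")
print(f"bad forecast MASE  = {mase(train, actual, bad_forecast):.2f}")
```

Because the score is a ratio, it has no units. That is what lets you compare a product that sells 10 a week with one that sells 10,000.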

Scoreboard 4: MAPE — please avoid this one ❌

MAPE divides each mistake by the actual value. The problem: if the actual value is near zero (one week you sold almost nothing), you're dividing by nearly zero, and the score shoots towards infinity. It looks simple but breaks in common situations. The M5 competition, run on Walmart sales data full of zeros, used a scaled error instead.
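
The failure is easy to reproduce. In this sketch every forecast misses by exactly 10 units, but in the second case one actual value is close to zero. All numbers are invented, and `mape` and `mae` are small helpers defined here.

```python
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)

normal_actual, normal_forecast = [100, 120, 110], [110, 110, 120]
slow_actual, slow_forecast = [100, 1, 110], [110, 11, 120]   # one week sold just 1 unit

print(f"normal weeks: MAE = {mae(normal_actual, normal_forecast):.0f}, "
      f"MAPE = {mape(normal_actual, normal_forecast):.0f}%")
print(f"slow week:    MAE = {mae(slow_actual, slow_forecast):.0f}, "
      f"MAPE = {mape(slow_actual, slow_forecast):.0f}%")
```

MAE is 10 in both cases, which matches what actually happened. MAPE jumps from single digits to several hundred percent because of one quiet week.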

🐘 Your pink elephant: A broken thermometer that shows the right temperature 90% of the time but explodes to 10,000°C when it gets near zero. That's MAPE on near-zero data. You will never trust MAPE again.

Key Takeaways

MAE: average size of your mistakes — simple, honest, use it daily
MASE below 1.0 = you beat the naive (copy-last-value) baseline. Above 1.0 = fix your model
Never use MAPE when your numbers can be close to zero — it breaks completely

Python

import numpy as np

# Pretend we forecast sales for 5 weeks and here's how we did:
actual_sales   = [500, 620, 480, 710, 550]  # what really happened
our_forecast   = [520, 600, 510, 680, 570]  # what we predicted

actual   = np.array(actual_sales)
forecast = np.array(our_forecast)

# Scoreboard 1: MAE — how wrong on average?
mistakes = np.abs(actual - forecast)           # size of each mistake
mae = np.mean(mistakes)                        # average mistake size
print(f"MAE = {mae:.1f}  → we were wrong by {mae:.0f} units on average")

# Scoreboard 2: RMSE — punishes big mistakes harder
rmse = np.sqrt(np.mean((actual - forecast)**2))
print(f"RMSE = {rmse:.1f}  → if big mistakes are costly, use this")

# Scoreboard 3: MASE — did we beat the simple "copy last week" approach?
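# First: what would "copy last week" have gotten?
# (Simplification: a proper MASE measures the naive mistakes on the training
#  history. We only have these 5 weeks, so we reuse them here.)
naive_mistakes = np.abs(np.diff(actual))       # just compare each week to the one before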
naive_mae = np.mean(naive_mistakes)
mase = mae / naive_mae
print(f"MASE = {mase:.2f}  → {'We beat the simple approach!' if mase<1 else 'The simple approach beat us — go fix the model!'}")

→You can measure your mistakes. But what's the standard you're measuring against? Here's the uncomfortable truth: there's a brain-dead simple method you must beat before your clever model means anything. Tomorrow: meet the turtles.

Phase 1 — SEE CLEARLY

Day 5 of 30

🐢 The Turtle You Must Beat First

Before entering any real race, you have to outrun the slowest runner

📖 Before we begin

In the M4 Forecasting Competition (the Olympics of forecasting, with 100,000 real time series), 'Copy Last Year' finished in the top third. It beat many far more sophisticated models. It didn't know any maths. It just remembered what happened 12 months ago. The researchers called it 'the embarrassing benchmark.'

The Analogy

Before athletes race each other, they first need to beat a qualification standard — a slow turtle of a time that any serious runner should beat easily. In forecasting, that turtle is called a baseline model — the simplest possible forecast. If your clever, expensive model can't beat the turtle, it's not clever at all. It's broken.

✓ Source: Hyndman & Athanasopoulos (2021), Chapter 5, "Some simple forecasting methods." Research consistently shows that on short or noisy series, these simple methods outperform complex models surprisingly often.

Before building anything complicated, always run these four extremely simple forecasting methods first. They are your turtles — your minimum standard. Any model you build must beat all of them.

Turtle 1: The Copycat (Naive)

Forecast = whatever happened last time. If sales were 500 yesterday, predict 500 today. That's it. No math, no thinking. Surprisingly hard to beat on flat, stable data.

Turtle 2: The "Same Time Last Year" (Seasonal Naive)

Forecast = whatever happened at this exact time last year. Predict January by copying last January. Predict December by copying last December. This turtle is your hardest opponent on seasonal data — it knows about the Christmas spike, it knows about the summer peak, and it doesn't need to learn anything.

Turtle 3: The Trend Extrapolator (Drift)

Draw a straight line from the first number to the last number in your past data. Keep going. This assumes the average rate of change from before will continue.

Turtle 4: The Boring Average

Predict the same number every time: the average of all past data. Useful when there's no trend and no seasonal pattern — just noise around a stable level.
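
All four turtles fit in a few lines of plain Python. This is a hand-rolled sketch to show the logic, not how any library implements them. The function names and the quarterly sales figures are made up for the example.

```python
def naive(history, h):
    """Turtle 1: repeat the last value."""
    return [history[-1]] * h

def seasonal_naive(history, h, season_length):
    """Turtle 2: repeat the last full season."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(h)]

def drift(history, h):
    """Turtle 3: extend the straight line from the first value to the last."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * step for step in range(1, h + 1)]

def historic_average(history, h):
    """Turtle 4: predict the average of everything seen so far."""
    return [sum(history) / len(history)] * h

# Made-up quarterly sales for two years (4 quarters per season)
history = [100, 140, 180, 120, 110, 150, 190, 130]

print("Naive:          ", naive(history, 4))
print("Seasonal naive: ", seasonal_naive(history, 4, season_length=4))
print("Drift:          ", [round(v, 1) for v in drift(history, 4)])
print("Average:        ", historic_average(history, 4))
```

Notice that only the seasonal naive forecast keeps the up-and-down shape of the year. The other three produce a flat or straight line.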

The rule that matters

If your model doesn't beat "Same Time Last Year", it should not be used. Period. Check this before showing anyone your results.

🐘 Your pink elephant: A literal turtle with '1st Place' ribbon around its neck. Every time you build a fancy model, you'll think: wait, did I beat the turtle yet? If not, stop. Go back.

Key Takeaways

"Same Time Last Year" (Seasonal Naive) is your hardest competitor — it beats complex models more often than you'd expect
A model that can't beat these simple baselines is broken and must not be used
Always run baselines first — they take seconds and save you from shipping broken forecasts

Python

from statsforecast import StatsForecast
from statsforecast.models import Naive, SeasonalNaive, HistoricAverage, RandomWalkWithDrift
from statsforecast.utils import AirPassengersDF
import numpy as np

# Load data: first 10 years to train, last 2 years to test
df = AirPassengersDF.copy()
train = df.head(120)   # train on 10 years
test  = df.tail(24)    # test on 2 years (our report card)

# Run all four turtles at once
sf = StatsForecast(
    models=[
        Naive(),                           # turtle 1: copy last value
        SeasonalNaive(season_length=12),   # turtle 2: copy same month last year
        HistoricAverage(),                 # turtle 4: predict the average
        RandomWalkWithDrift(),              # turtle 3: extend the trend line
    ],
    freq="MS"  # MS = monthly data
)
forecasts = sf.forecast(df=train, h=24)

# Score each turtle: which one was closest?
print("=== Turtle Race Results ===")
for model in ["Naive", "SeasonalNaive", "HistoricAverage", "RWD"]:
    avg_mistake = np.mean(np.abs(forecasts[model].values - test["y"].values))
    print(f"{model:<20} Average mistake = {avg_mistake:.1f} passengers")
print("\nYour model must score BELOW the best turtle to be worth using!")

→You've met the turtles. Now you know what you must beat. But how do you test fairly — without accidentally cheating by letting your model peek at the future? Tomorrow: the exam cheat who learned nothing.

Phase 1 — SEE CLEARLY

Day 6 of 30

⏳ Testing Your Forecast Honestly

The time-machine rule — you can never use future information to train your model

📖 Before we begin

A data science team at a major retailer spent 6 months building a forecasting model. Their test results looked incredible — 94% accuracy. They deployed it. It was catastrophically wrong from day one. The problem: during testing, their model had been trained on data from the future. They'd built the world's most expensive cheat sheet.

The Analogy

Imagine you want to become a great chess player. If you study the answer book while solving each puzzle, your practice score will be perfect — but completely meaningless. You learned nothing. Testing a forecast by letting it "peek" at future data is the same cheat. The time-machine rule says: the future can never train the model, ever.

✓ Source: Bergmeir & Benítez (2012), Information Sciences. Standard k-fold cross-validation assumes observations are independent, which time series observations usually are not, so shuffled test scores can look far better than real-world performance.

In normal school exams, the teacher gives you problems you've never seen before. If the teacher accidentally gave you the answer key, your mark would be perfect — but you'd have learned nothing. The same trap exists in forecasting.

Standard machine learning splits data randomly: "Take any 80% of the rows for training, use the other 20% for testing." For most types of data this works fine. For time series, it is catastrophically wrong.

Why random splits are cheating in time series

If you randomly pick your test data, some of it will be from January 2021, and some of your training data will be from June 2021. But June comes after January — so your model was trained on the future to predict the past. Your test score looks great, but your model has learned nothing that applies to real life.

The correct approach: walking forward through time

Every test period must come after every training period. Always. No exceptions. The correct method creates multiple test windows, each one stepping forward in time:

Window 1: Train on Jan–Jun → test on Jul–Sep (did it predict those months correctly?)
Window 2: Train on Jan–Sep → test on Oct–Dec
Window 3: Train on Jan–Dec → test on Jan–Mar (next year)

Average the score across all windows. That's your real, honest performance estimate.
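
The three windows above can be generated with a few lines of plain Python. This is a sketch of the idea only. `walk_forward_splits` is a helper written for this example, and the month labels are placeholders for real dates.

```python
def walk_forward_splits(n_obs, h, n_windows):
    """Yield (train, test) index ranges; each test block starts right after its training block."""
    for window in range(n_windows):
        test_end = n_obs - (n_windows - 1 - window) * h
        test_start = test_end - h
        yield range(0, test_start), range(test_start, test_end)

# 15 months of data: one full year plus Jan-Mar of the next
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug",
          "Sep", "Oct", "Nov", "Dec", "Jan+1", "Feb+1", "Mar+1"]

splits = list(walk_forward_splits(n_obs=len(months), h=3, n_windows=3))
for number, (train, test) in enumerate(splits, start=1):
    print(f"Window {number}: train {months[train[0]]}..{months[train[-1]]}"
          f" -> test {months[test[0]]}..{months[test[-1]]}")
```

In every window the last training index is smaller than the first test index, so the model never gets to see the months it is being tested on.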

This process is called cross-validation — "cross" because you test from multiple angles, "validation" because it validates whether your model actually works. The nixtlaverse does this automatically with one function call.

🐘 Your pink elephant: A student copying tomorrow's exam answers to practice for today's test. Gets 100% on practice. Learns nothing. Fails the real thing. This is random-split validation on time series data.

Key Takeaways

Never randomly shuffle time series data — it lets the model "see" the future during training, which is cheating
In cross-validation for time series, test data must always come after training data in time
Cross-validation gives you your honest, real-world performance estimate — use it before trusting any model

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()

# Set up two models to compare
sf = StatsForecast(
    models=[
        AutoETS(season_length=12),         # a smarter model (we'll learn this Day 10)
        SeasonalNaive(season_length=12),   # our "same time last year" turtle
    ],
    freq="MS"
)

# Honest test: 3 windows, each predicts 12 months ahead
# StatsForecast makes sure test ALWAYS comes AFTER training
cv_results = sf.cross_validation(
    df=df,
    h=12,          # predict 12 months ahead each time
    step_size=12,  # move forward a full year between windows (the default of 1 would overlap them)
    n_windows=3    # repeat 3 times, stepping forward through history
)

# Score both models honestly
print("=== Honest Test Results ===")
for model in ["AutoETS", "SeasonalNaive"]:
    avg_mistake = np.mean(np.abs(cv_results[model] - cv_results["y"]))
    print(f"{model:<20} Average mistake = {avg_mistake:.1f}")
print("\nThis is real performance: the model never saw the test data during training")

→You can now test honestly. But you've only been using maths formulas so far. What if there's a smarter way — one that fades old memories like sunglasses that dim the past? Tomorrow: the fading memory machine.

Phase 2 — TWO WORKHORSES

Day 7 of 30

🌅 Remembering the Recent Past More

Sunglasses that automatically darken old memories

📖 Before we begin

In 1956, a US Navy officer named Robert Brown had a problem: the Navy had warehouses full of inventory, and nobody could predict demand well enough. He invented a system in 2 pages of notes. Seventy years later, every major forecasting software in the world still uses his core idea. The secret? Forget the distant past.

The Analogy

Picture sunglasses that get darker the older the memory is. Last week is crystal clear. Last month is faded grey. Last year is almost black. This is exactly how exponential smoothing works — it gives your most recent numbers the most weight, and lets older numbers gradually fade away. The word "exponential" just describes how fast the fading happens.

✓ Source: Exponential smoothing was independently developed by Robert Brown (1956) for US Navy inventory management and Charles Holt (1957). It has been validated in every major forecasting competition since the M-Competition (1982).

Exponential smoothing (let's call it ES for short) is built on one beautiful, sensible idea: recent information matters more than old information.

The formula is: New forecast = α × (latest actual number) + (1 − α) × (old forecast)

The Greek letter α (alpha) is your single control dial. It goes from 0 to 1:

α close to 1 (like 0.9): "Trust the most recent number almost completely." The forecast reacts quickly to changes, but it's jumpy.
α close to 0 (like 0.1): "Average out a lot of history." The forecast is very smooth and slow to react.
α is chosen automatically by the computer — it tries many values and picks the one that would have been most accurate on past data.
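
The formula is short enough to run by hand. Below is a minimal sketch in plain Python. Starting the forecast at the first observation is an assumption made here for simplicity, and the sales numbers are made up.

```python
def ses_forecast(values, alpha):
    """Apply: new forecast = alpha * latest actual + (1 - alpha) * old forecast."""
    forecast = values[0]            # assumption: start from the first observation
    for actual in values[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

# Made-up sales: flat at 100, then a sudden jump to 200
sales = [100, 100, 100, 100, 200]

for alpha in (0.1, 0.9):
    print(f"alpha = {alpha}: next forecast = {ses_forecast(sales, alpha):.1f}")
```

With α = 0.9 the forecast leaps almost all the way to the new value (190). With α = 0.1 it barely moves (110), because it is still averaging in all those earlier 100s.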

Important limitation

Simple exponential smoothing can only produce a flat forecast — a horizontal line. It has no mechanism to go up or down over time, and no knowledge of seasonal patterns. It's great for flat, stable data. For data with a rising trend or seasonal spikes, you need the upgraded versions (Days 8–9).

Think of simple ES as the foundation. Everything that follows adds more layers on top of this same core idea.

🐘 Your pink elephant: Sunglasses where the lenses automatically darken the further back the memory. Yesterday: clear as glass. Last year: pitch black. Every time you see sunglasses, you'll think: alpha.

Key Takeaways

Exponential smoothing gives more weight to recent numbers and less weight to older numbers — like sunglasses that dim the past
Alpha (α) controls the speed: close to 1 = reacts fast and jittery; close to 0 = smooth and slow
Simple ES only makes flat forecasts — it cannot handle trends or seasonal patterns (Days 8-9 fix this)

Python

from statsforecast import StatsForecast
from statsforecast.models import SimpleExponentialSmoothing as SES
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()
train = df.head(120)  # use 10 years of data to fit

# Try two different alpha settings to see the difference
sf = StatsForecast(
    models=[
        SES(alpha=0.1, alias="SES_slow"),   # slow and smooth: trusts history more
        SES(alpha=0.9, alias="SES_fast"),   # fast and reactive: trusts recent numbers more
    ],
    freq="MS"
)

forecasts = sf.forecast(df=train, h=24)
print(forecasts)

# Notice: both models produce a flat horizontal line as their forecast
# This is because simple ES has no way to go up or down over time
# For airline passengers (which clearly trend upward), we need Holt (Day 8)
print("\nBoth forecasts are flat lines: SES can't predict trends!")

→Exponential smoothing handles flat data beautifully. But what if your sales aren't flat — they're climbing year after year? A flat forecast for growing data is like predicting the escalator will stay on the ground floor. Tomorrow: the escalator.

Phase 2 — TWO WORKHORSES

Day 8 of 30

🚀 Adding Direction to Your Forecast

Simple smoothing walks on flat ground — Holt adds an escalator

📖 Before we begin

A consultancy once presented a 5-year forecast to a client showing sales tripling by Year 5. It looked impressive. The model was Holt's method — without damping. When asked 'what does the 10-year forecast look like?' they showed it: sales larger than the entire global market. The client walked out.

The Analogy

Simple smoothing (Day 7) only knows where it is right now — it walks on flat ground and never looks up or down. Holt's method adds an escalator. Now the forecast knows both where it is AND which way it's heading. If sales have been growing by 50 units every month, Holt rides the escalator upward instead of staying flat.

✓ Source: Holt's linear exponential smoothing was published by Charles C. Holt in 1957 (US Office of Naval Research). The "damped trend" improvement was published by Gardner & McKenzie (1985) in Management Science. Damped trend is widely recommended over plain Holt in practice.

Holt's method adds a second tracking system to exponential smoothing. Instead of tracking just one thing, it now tracks two:

Level — where is the series right now? (Same as before, controlled by α)
Trend (direction) — how fast is it changing each period? Is it going up by 50 units per month? Down by 20? This is controlled by a second dial called β (beta).

The forecast for next month = level + 1 × trend. For two months ahead = level + 2 × trend. The trend is like your velocity — multiply it by time to see where you end up.
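The arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration, not how StatsForecast implements Holt internally: the function name holt_forecast, the starting guesses, and the α and β values are all assumptions chosen for the example.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Run Holt's level + trend updates, then project `horizon` steps ahead."""
    # Simple starting guesses (an assumption): first value and first change
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        # Level: blend the new observation with where we expected to be
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the old trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Forecast h steps ahead = level + h × trend
    return [level + h * trend for h in range(1, horizon + 1)]

# Made-up sales that grow by exactly 50 units per month
sales = [100, 150, 200, 250, 300]
forecast = holt_forecast(sales, alpha=0.5, beta=0.5, horizon=3)
print(forecast)  # [350.0, 400.0, 450.0]
```

Because the made-up sales rise by exactly 50 each month, the escalator keeps climbing by 50 per step. With noisier data, α and β decide how quickly the level and trend react to each new number.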

The problem with straight-line thinking

If a company's sales have climbed steadily for three years, plain Holt assumes they will keep climbing by the same amount every period, forever, all the way to the moon. In reality, trends slow down, hit limits, or reverse. Blindly extending a trend for years into the future almost always leads to disaster.

The fix: the damped trend

A smarter version adds a "damping" effect: the further into the future you predict, the slower and more cautious the trend becomes. Instead of shooting to the moon, the forecast gradually flattens out. Decades of research and the major forecasting competitions point the same way: for long horizons, the damped version is usually the safer default. It's more humble, and humility is usually right.
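To see the damping effect in numbers, here is a small sketch. It assumes a damping parameter called phi (between 0 and 1) and the usual damped-trend form, where the forecast h steps ahead is level + (phi + phi² + … + phi^h) × trend. The level, trend, and phi values are made up for illustration.

```python
def damped_forecast(level, trend, phi, horizon):
    """Forecast h steps ahead with a trend that shrinks by a factor phi each step."""
    forecasts = []
    damped_sum = 0.0
    for h in range(1, horizon + 1):
        damped_sum += phi ** h  # phi + phi^2 + ... + phi^h
        forecasts.append(level + damped_sum * trend)
    return forecasts

level, trend, phi = 1000.0, 50.0, 0.9  # hypothetical values

plain = [level + h * trend for h in range(1, 61)]  # plain Holt, 60 months
damped = damped_forecast(level, trend, phi, horizon=60)

print(f"5 years out, plain Holt: {plain[-1]:.0f}")
print(f"5 years out, damped:     {damped[-1]:.0f}")

# The damped forecast can never pass this ceiling, however far ahead you look
ceiling = level + trend * phi / (1 - phi)
print(f"Damped ceiling:          {ceiling:.0f}")
```

Setting phi to 1 turns the damped forecast back into plain Holt, which is one way to think of the STOP button: the further phi sits below 1, the sooner the escalator levels off.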

🐘 Your pink elephant: An escalator that shoots through the ceiling if you don't press the STOP button. The damping parameter is the STOP button. Always press STOP. Always use damped trend.

Key Takeaways

Holt adds direction tracking: Level (where you are) + Trend (which way and how fast)
Without damping, the trend extends forever and predictions become absurd years out
Always use the damped version — it slows down the trend over time and is almost always more accurate

Python

from statsforecast import StatsForecast
from statsforecast.models import Holt, AutoETS
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()
train = df.head(120)  # 10 years of training data

# Holt's method: two tracking systems (level + trend)
# We give it an alias so the output column name is clear
sf = StatsForecast(
    models=[
        Holt(season_length=1, error_type="A", alias="Holt_LinearTrend"),
        AutoETS(season_length=12),  # AutoETS considers damped-trend recipes among its candidates
    ],
    freq="MS"
)
forecasts = sf.forecast(df=train, h=36)  # predict 3 years ahead

# Holt with a linear trend will keep extrapolating upward indefinitely
# AutoETS may select a damped trend, in which case the trend gradually flattens out
# (AutoETS is also seasonal here, so compare whole forecast paths, not just one month)
print("Prediction 36 months out:")
print(f"  Holt (linear trend): {forecasts['Holt_LinearTrend'].iloc[-1]:.0f} passengers")
print(f"  AutoETS (auto-selected): {forecasts['AutoETS'].iloc[-1]:.0f} passengers")
print()
print("If AutoETS picked a damped trend, its long-range path flattens out instead of climbing forever.")
print("This is why a damped candidate is usually preferred over plain Holt at long horizons.")

→Your forecast can now handle flat data and trending data. But what about the summer spike every year? The Christmas rush? The model needs a calendar. Tomorrow: the roller coaster on the escalator.


Phase 2 — TWO WORKHORSES

Day 9 of 30

🎢Adding Seasons to Your Forecast

A roller coaster riding an escalator — trend AND calendar patterns together

📖 Before we begin

A toy retailer ran Holt's method (Day 8) on their December sales. Every year like clockwork, December was 4× bigger than any other month. Holt's model predicted steady December growth — but completely missed the 4× spike. Their warehouse ordered wrong. They ran out of toys on December 20th. Two lines of code would have fixed it.

The Analogy

Day 7 gave you the flat ground. Day 8 added the escalator (trend). Now Day 9 adds the roller coaster on top of the escalator (seasonal pattern). Ice cream sales go up year after year (escalator) AND spike every summer (roller coaster). Holt-Winters handles both at the same time — which is why it works on most real business data.

✓ Source: Developed by Peter Winters (1960), Management Science, building on Holt's 1957 work. Remains one of the most widely used forecasting methods in commercial software worldwide. The additive vs multiplicative distinction was rigorously validated in Gardner (1985).

Holt-Winters is the full version. It tracks three things at once:

Level (α) — where is the series right now?
Trend (β) — is it going up or down, and how fast?
Season (γ, "gamma") — what is the repeating up/down pattern by month/week?

The forecast for any future month = (level + number of months ahead × trend), adjusted for the seasonal factor of that month.

One important choice: additive vs multiplicative

Imagine a toy shop. In December it always does extra sales. There are two ways this "extra" can work:

Additive: December is always 200 sales more than average — the extra amount is fixed no matter how big the shop gets. Use this when the seasonal bump is a constant size.
Multiplicative: December is always 2× (double) the average — the extra amount grows as the shop grows. Use this when the seasonal bump gets bigger as the business grows.
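The two kinds of "extra" can be written out as a tiny sketch. The function name and the toy-shop numbers are hypothetical; the point is only that the additive bump is added on top, while the multiplicative bump scales with the size of the shop.

```python
def seasonal_forecast(level, trend, steps_ahead, seasonal, kind):
    """Combine level, trend and a seasonal term into one forecast."""
    base = level + steps_ahead * trend
    if kind == "additive":
        return base + seasonal  # seasonal is a fixed number of units
    if kind == "multiplicative":
        return base * seasonal  # seasonal is a ratio, e.g. 2.0 = double
    raise ValueError("kind must be 'additive' or 'multiplicative'")

# Hypothetical toy shop: December is +200 sales (additive) or 2x (multiplicative)
for level in (500, 5000):
    add = seasonal_forecast(level, trend=0, steps_ahead=1, seasonal=200, kind="additive")
    mul = seasonal_forecast(level, trend=0, steps_ahead=1, seasonal=2.0, kind="multiplicative")
    print(f"average {level}: additive December = {add}, multiplicative December = {mul}")
```

When the shop grows from 500 to 5,000 average sales, the additive December only rises from 700 to 5,200, while the multiplicative December rises from 1,000 to 10,000.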

How to tell which? Look at your chart from Day 2. If the seasonal peaks get taller as the overall level rises — that's multiplicative. If they stay about the same height — additive.
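The same check can be done with numbers instead of eyeballing the chart: compare each year's peak-to-trough swing as the level rises. This is a rough sketch on made-up data. The function names and the 1.5 cutoff are arbitrary assumptions for illustration, not a standard statistical test.

```python
def seasonal_swing_by_year(series, season_length=12):
    """Return (yearly average, yearly peak-to-trough swing) for each full year."""
    result = []
    for start in range(0, len(series) - season_length + 1, season_length):
        year = series[start:start + season_length]
        result.append((sum(year) / season_length, max(year) - min(year)))
    return result

def guess_seasonality(series, season_length=12, cutoff=1.5):
    """Guess 'multiplicative' if the swing grew noticeably from first to last year."""
    swings = [swing for _, swing in seasonal_swing_by_year(series, season_length)]
    return "multiplicative" if swings[-1] > cutoff * swings[0] else "additive"

# Made-up monthly shape: a small summer bump and a December spike
shape = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 3]
levels = [100, 200, 400]  # average level in years 1, 2 and 3

fixed_bumps = [lvl + 10 * s for lvl in levels for s in shape]          # same-size peaks
growing_bumps = [lvl * (1 + 0.1 * s) for lvl in levels for s in shape]  # peaks scale with level

print(guess_seasonality(fixed_bumps))    # additive
print(guess_seasonality(growing_bumps))  # multiplicative
```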

The airline passenger data is the classic example of multiplicative: at the start of the record in 1949 the seasonal peaks were small. By the end of the record in 1960 they were far larger, because the airline industry itself had grown.

🐘 Your pink elephant: A roller coaster (seasonal ups and downs) sitting on top of a moving escalator (the upward trend). The coaster goes up and down, but the whole track is also rising. That's Holt-Winters. You will never forget this image.

Key Takeaways

Holt-Winters tracks three things: Level (where), Trend (direction), Season (repeating pattern)
Multiplicative seasons: peaks grow as the series grows. Additive: peaks stay the same size.
Visual check: are the peaks getting taller as the trend rises? Yes → multiplicative. No → additive.

Python

from statsforecast import StatsForecast
from statsforecast.models import HoltWinters
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

sf = StatsForecast(
    models=[
        HoltWinters(season_length=12, error_type="A", alias="HW_Additive"),     # fixed seasonal bumps
        HoltWinters(season_length=12, error_type="M", alias="HW_Multiplicative"),# growing seasonal bumps
    ],
    freq="MS"
)
forecasts = sf.forecast(df=train, h=24)

# Check which one is more accurate
print("Additive seasons mistake:",
      round(np.mean(np.abs(forecasts["HW_Additive"].values - test["y"].values)), 1))
print("Multiplicative mistake:  ",
      round(np.mean(np.abs(forecasts["HW_Multiplicative"].values - test["y"].values)), 1))
# For airline passengers, multiplicative is expected to win because the peaks grow with the trend

→Holt-Winters handles trend AND seasons. But you had to choose all the settings by hand. What if a computer could test every possible setting and pick the winner automatically — like an eye doctor flipping through every lens? Tomorrow: the optometrist model.


Phase 2 — TWO WORKHORSES

Day 10 of 30

🔭Trying Every Combination Automatically

The eye doctor who tests every lens before deciding which one is best

📖 Before we begin

There are exactly 30 different ways to combine the three ingredients of exponential smoothing. A team of researchers tested all 30 on every dataset in a major competition. The result: no single combination won every time. But one automatic approach — trying all 30 and picking the winner — consistently matched or beat hand-chosen combinations across large-scale competition datasets.

The Analogy

At the eye doctor, they don't guess which lens is right. They flip through lens after lens — "better or worse? better or worse?" — until your vision is perfectly sharp. The computer equivalent is called AutoETS. It tries all ~30 combinations of the methods from Days 7–9 and automatically picks the winner.

✓ Source: The ETS unified framework was published by Hyndman et al. (2002). The name ETS stands for Error, Trend, Seasonal. AutoETS selects the best combination using an information criterion from the AIC (Akaike Information Criterion) family, a standard way to balance accuracy against model complexity.

The last three days introduced exponential smoothing with three possible "ingredients":

How errors behave: additive or multiplicative
What kind of trend: none, additive (a straight line, up or down), additive damped, multiplicative, or multiplicative damped
What kind of season: none, additive, or multiplicative

Together, these give about 30 different possible combinations. The framework that organises them all is called ETS — short for Error, Trend, Seasonal. ETS(A,N,A) means: Additive errors, No trend, Additive seasonal.
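Where does "about 30" come from? A quick sketch that lists every recipe, assuming two error types, five trend types (counting both damped variants), and three seasonal types. The short codes follow the ETS(A,N,A) notation used above.

```python
from itertools import product

errors = ["A", "M"]                   # additive, multiplicative
trends = ["N", "A", "Ad", "M", "Md"]  # none, additive, additive damped, multiplicative, multiplicative damped
seasons = ["N", "A", "M"]             # none, additive, multiplicative

recipes = [f"ETS({e},{t},{s})" for e, t, s in product(errors, trends, seasons)]

print(len(recipes))  # 2 × 5 × 3 = 30
print(recipes[:4])
```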

AutoETS: the computer tries them all

Instead of you choosing which combination to use, AutoETS fits the candidate combinations to your data and picks the best one automatically (in practice it skips a few combinations that are numerically unstable). It uses a scoring system called AIC (which you can think of as: "how accurate was it, minus a penalty for being overly complicated?"). Lower AIC = better model.
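The scoring idea fits in one line: AIC = 2 × (number of parameters) − 2 × (log-likelihood). The sketch below applies it to four candidates. The log-likelihoods and parameter counts are invented for illustration and do not come from a real fit.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (log-likelihood, number of parameters)
candidates = {
    "ETS(A,N,N)": (-520.0, 2),
    "ETS(A,A,N)": (-510.0, 4),
    "ETS(M,A,M)": (-470.5, 16),
    "ETS(M,Ad,M)": (-470.0, 17),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
winner = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:<12} AIC = {score:.1f}")
print("Winner:", winner)
```

In this made-up example ETS(M,Ad,M) fits very slightly better than ETS(M,A,M), but its extra parameter costs more than the improvement is worth, so the simpler recipe wins. That is the "penalty for being overly complicated" at work.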

This is the tool you'll use most in practice. For any dataset with up to a few hundred data points, AutoETS is your first serious model after the Day 5 turtles.

When it finishes, you can ask which combination it chose — something like ETS(M,Ad,M), which means multiplicative errors, damped trend, multiplicative seasonal. This is the "recipe" it found best for your data.

🐘 Your pink elephant: An eye doctor's phoropter — the big goggle machine with hundreds of lenses. Click, click, click. 'Better or worse?' AutoETS is the phoropter for your data. Whenever you think 'which model should I use?' — run AutoETS first.

Key Takeaways

ETS is the umbrella name for all exponential smoothing methods — ETS(M,Ad,M) tells you the exact recipe
AutoETS tries all ~30 combinations and picks the winner automatically — you don't need to choose
AutoETS is your go-to first serious model for any dataset with up to a few hundred monthly or weekly data points

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()

# AutoETS will try all ~30 combinations and pick the best one
sf = StatsForecast(
    models=[AutoETS(season_length=12)],  # tell it: 12 months per year
    freq="MS"
)

# Fit, then forecast the next 24 months with uncertainty ranges included
# Note: we call fit() and predict() separately (rather than forecast())
# so the fitted model stays available for inspection afterwards
sf.fit(df=df)
forecasts = sf.predict(h=24, level=[80, 95])  # level adds uncertainty bands
print(forecasts)

# Which recipe did it choose? The model_ dict contains the answer
winning_recipe = sf.fitted_[0][0].model_.get("method", "unknown")
print(f"\nWinning recipe: {winning_recipe}")
# For airline passengers, expect a recipe ending in M (multiplicative seasonal),
# for example ETS(M,Ad,M); the exact trend letter can vary by library version.
# This matches what we saw in our chart: growing seasonal peaks!

→AutoETS is your first serious workhorse. Now meet its sibling — a completely different kind of model that needs your data to stop moving before it can read it. Like trying to measure someone's height while they jump on a trampoline. Tomorrow: the trampoline problem.


Phase 2 — TWO WORKHORSES

Day 11 of 30

📐Making Wiggly Data Go Flat

Before ARIMA can read your data, it needs to stand still

📖 Before we begin

ARIMA was invented in 1970 and has been running inside weather forecasting software, central bank models, and supply chain systems ever since. For 50 years. No one replaced it. But it has one absolute requirement — one thing it will NOT tolerate. And if you ignore this requirement, the entire model becomes meaningless garbage.

The Analogy

Imagine trying to measure someone's height while they're jumping on a trampoline. Every measurement will be different — not because they're growing, but because they won't stay still. ARIMA needs your data to be "standing still" before it can work properly. The process of making it stand still is called differencing.

✓ Source: The concept of stationarity is foundational in Box-Jenkins (1970), "Time Series Analysis: Forecasting and Control." The ADF (Augmented Dickey-Fuller) test for stationarity was published by Said & Dickey (1984), Biometrika.

ARIMA is a powerful tool — but it has one strict requirement: your data needs to be stationary. Stationary just means the numbers are bouncing around a fixed level rather than drifting steadily upward or downward. Like a calm lake versus a river flowing downhill.

How to check

Look at your chart. If it's clearly trending upward or downward over time, it's not stationary. You can also run a formal check called the ADF test (Augmented Dickey-Fuller), a computer programme that outputs one key number called a p-value. If the p-value is less than 0.05, you can treat your data as stationary. If it's higher, you can't.

How to fix it: differencing

Differencing replaces each number with the change from the previous number. Instead of asking "how many sales this month?" you ask "how many more (or fewer) sales than last month?" If the original data was drifting upward, the changes often bounce around zero — which is stationary.

In the ARIMA name, the middle letter "I" stands for Integrated — which means "we differenced the data." An ARIMA model where d=1 means "we differenced once." d=2 means "we differenced twice" (used when once wasn't enough, which is rare).
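Here is what d=1 and d=2 look like on two tiny made-up lists. The helper name difference is just for this sketch.

```python
def difference(series):
    """Replace each value with its change from the previous value."""
    return [current - previous for previous, current in zip(series, series[1:])]

steady_growth = [10, 15, 20, 25, 30]  # climbs by 5 each step
accelerating = [1, 4, 9, 16, 25, 36]  # climbs faster and faster

print(difference(steady_growth))             # [5, 5, 5, 5]      d=1 is enough
print(difference(accelerating))              # [3, 5, 7, 9, 11]  still drifting upward
print(difference(difference(accelerating)))  # [2, 2, 2, 2]      d=2 flattens it
```

Each round of differencing shortens the list by one value, which is one reason to difference only as many times as the data needs.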

The good news: AutoARIMA (Day 13) does all of this for you automatically.

🐘 Your pink elephant: Someone jumping on a trampoline while you try to measure their height. You can't. They won't stay still. ARIMA is the measurer. Differencing is telling them to step off the trampoline. Stationary = standing still. You will never forget this.

Key Takeaways

Stationary means the data bounces around a fixed level — not drifting up or down over time
Differencing fixes non-stationary data: replace each number with how much it changed from the last one
The "I" in ARIMA means the data was differenced — AutoARIMA handles this automatically

Python

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller  # the "standing still" checker
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()
sales = df["y"].values

# Step 1: Is our data stationary (standing still)?
result = adfuller(sales)
p_value = result[1]
print(f"Standing-still test p-value: {p_value:.4f}")
print(f"Verdict: {'✓ Already stationary' if p_value < 0.05 else '✗ Not stationary — need to difference!'}")

# Step 2: Make it stationary by differencing
# (subtract the previous value from each value)
differenced = np.diff(sales)   # now we have CHANGES not levels

result2 = adfuller(differenced)
print(f"\nAfter one differencing, p-value: {result2[1]:.4f}")
print(f"Verdict: {'✓ Now stationary!' if result2[1] < 0.05 else '✗ Still not stationary, try again'}")

print(f"\nOriginal data:    e.g. {sales[:5]}   (drifting upward ↗)")
print(f"After differencing: {differenced[:5].astype(int)}  (bouncing around zero ✓)")

→Your data is now standing still and ARIMA can read it. But what about weekly patterns? Yearly cycles? ARIMA has a seasonal upgrade — and it comes with a calendar. Tomorrow: ARIMA learns the calendar.


Phase 2 — TWO WORKHORSES

Day 12 of 30

📅ARIMA That Knows the Calendar

Adding weekly and yearly rhythms to the model

📖 Before we begin

Box and Jenkins published their ARIMA framework in 1970 in a book that cost $40. For the next 20 years, fitting one ARIMA model required a statistician, a mainframe computer, and several days of work. Now it takes one line of Python and 0.3 seconds. The numbers in parentheses haven't changed. The waiting has.

The Analogy

A basic calendar only shows you the date. A smart calendar shows you that Monday is always busy, December is always hectic, and summer is always slow. ARIMA learns patterns in data. Seasonal ARIMA (called SARIMA) adds knowledge of the calendar — it knows patterns repeat every 7 days, or every 12 months, or every 52 weeks.

✓ Source: Seasonal ARIMA (SARIMA) was developed by Box & Jenkins (1970). The standard notation SARIMA(p,d,q)(P,D,Q)[m] was formalised in the third edition of their textbook (1994). The season_length m must match your data frequency exactly.

ARIMA is built from three ingredients. Their sizes are written as three numbers: (p, d, q)

p — how many past values does the model look back at? (The "memory length")
d — how many times did we difference the data? (From Day 11)
q — how many past prediction errors does the model remember? (A self-correction dial)

Seasonal ARIMA adds another set of the same three numbers for the seasonal layer: (P, D, Q)[m]. The m is the most important — it tells the model how long one complete cycle is. Monthly data: m=12. Weekly data: m=52. Daily data (with weekly patterns): m=7.
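The seasonal D works like the d from Day 11, except that each value is compared with the value one full cycle (m steps) earlier instead of the value just before it. A minimal sketch on made-up daily sales with a weekly rhythm (m=7); the function name is hypothetical.

```python
def seasonal_difference(series, m):
    """Subtract the value from one full cycle (m steps) earlier."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

# Made-up daily sales: weekends are busy, and every week is 7 units
# higher than the week before
week = [20, 22, 21, 23, 30, 45, 40]
daily_sales = [value + 7 * w for w in range(4) for value in week]

print(daily_sales[:14])
print(seasonal_difference(daily_sales, m=7))
```

After the seasonal difference the weekend spikes disappear and only the week-over-week change is left (a constant 7 here). Using the wrong m, such as m=12 on this daily data, would compare Mondays with Saturdays and leave the weekly pattern in place.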

A mental picture

Think of ARIMA as a musician who listens to recent notes before playing the next one. The seasonal part is the same musician, but they also listen to what they played at this exact point in the song last year. Both short-term memory and long-term calendar memory at the same time.

The good news

You almost never need to choose these numbers yourself. AutoARIMA (Day 13) runs a search and finds the best (p,d,q)(P,D,Q) automatically — just like AutoETS found the best ETS recipe.

🐘 Your pink elephant: A musician who listens to the last 3 notes they played (the p part) AND what they played at this exact point in the song last year (the seasonal part). Short-term ear + long-term calendar memory. SARIMA = two kinds of memory at once.

Key Takeaways

ARIMA(p,d,q): p = memory length, d = times differenced, q = error correction memory
Seasonal ARIMA adds (P,D,Q)[m] — m is the cycle length (12 for monthly, 7 for daily with weekly patterns)
AutoARIMA finds the best numbers automatically — you rarely need to set them yourself

Python

from statsforecast import StatsForecast
from statsforecast.models import ARIMA, AutoARIMA
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

sf = StatsForecast(
    models=[
        # A classic recipe for monthly data with yearly patterns
        # (2,1,1) = look back 2 months, differenced once, 1 error memory
        # (0,1,1,12) = seasonal part: differenced once per year, 1 seasonal error memory
        ARIMA(order=(2,1,1), seasonal_order=(0,1,1), season_length=12),

        # AutoARIMA just figures out the best recipe by itself
        AutoARIMA(season_length=12),
    ],
    freq="MS"
)
forecasts = sf.forecast(df=train, h=24)

# Compare: does the hand-chosen recipe beat the automatic one?
for model in ["ARIMA", "AutoARIMA"]:
    mae = np.mean(np.abs(forecasts[model].values - test["y"].values))
    print(f"{model:<12}: average mistake = {mae:.1f} passengers/month")
print("\nAutoARIMA usually matches or comes close to a good hand-chosen recipe, but it is not guaranteed to win on every test set.")

→You understand ARIMA. But choosing p, d, q, P, D, Q manually is exhausting — and error-prone. What if the computer could drive itself? Tomorrow: AutoARIMA, the GPS that finds the best route for you.


Phase 2 — TWO WORKHORSES

Day 13 of 30

🗺️AutoARIMA — Your GPS for Models

You tell it where you want to go; it figures out the best route

📖 Before we begin

A research team benchmarked AutoARIMA against expert-chosen ARIMA models on thousands of time series. The experts spent hours per series. AutoARIMA ran in under a second. On accuracy: AutoARIMA matched or beat expert-chosen models in the large majority of cases — while taking a fraction of the time. Hyndman & Khandakar (2008) document the original validation.

The Analogy

Before GPS, you had to memorise every road, every turn, every shortcut. Now you just type in the destination and your phone figures out the route. AutoARIMA is the GPS for ARIMA models. You hand it your data and say "I want a forecast for 12 months ahead." It handles all the decisions about differencing, memory lengths, and seasonal patterns.

✓ Source: The stepwise AutoARIMA algorithm was published by Hyndman & Khandakar (2008), Journal of Statistical Software. The StatsForecast implementation by Nixtla is 50–100× faster than the original R version, validated on the M4 competition dataset.

AutoARIMA follows a methodical search process to find the best ARIMA recipe:

Check stationarity — does the data need differencing? (Day 11)
Run a first-pass model — start with simple values of p and q
Try nearby combinations — what if we change p by 1? Or q? Does the AIC score improve?
Keep the winner — whichever combination has the lowest AIC score wins
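The "try nearby combinations, keep the winner" loop can be sketched as a simple neighbour search. This is a simplified illustration, not the actual Hyndman & Khandakar algorithm: it only varies p and q, and toy_aic is an invented stand-in for "fit ARIMA(p,d,q) and return its AIC".

```python
def stepwise_search(score, start=(2, 2), max_order=5):
    """Move to the best neighbouring (p, q) until no neighbour scores lower."""
    best = start
    best_score = score(*best)
    while True:
        p, q = best
        # Neighbours: change p or q by one, staying inside the allowed range
        neighbours = [
            (p + dp, q + dq)
            for dp, dq in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= p + dp <= max_order and 0 <= q + dq <= max_order
        ]
        challenger = min(neighbours, key=lambda pq: score(*pq))
        if score(*challenger) >= best_score:
            return best, best_score  # nothing nearby improves: keep the winner
        best, best_score = challenger, score(*challenger)

def toy_aic(p, q):
    """Hypothetical AIC surface with its lowest point at p=1, q=3."""
    return 1000 + 8 * (p - 1) ** 2 + 5 * (q - 3) ** 2 + 2 * (p + q)

best_order, best_aic = stepwise_search(toy_aic)
print(f"Best (p, q) = {best_order}, AIC = {best_aic}")
```

The search only scores a handful of combinations instead of every possible one, which is what keeps the real algorithm fast.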

The whole search happens in a fraction of a second in StatsForecast — fast enough to run on thousands of different products at once (Day 15).

When to use ARIMA vs ETS

Both families work well on most data. The main differences:

ETS is often slightly stronger when the trend and seasonal structure dominate. Simple, interpretable, and fast to fit.
ARIMA is often slightly stronger when short-term value-to-value patterns (autocorrelation) matter more than trend/season shape.
In practice: both work well on most data. The honest approach — always recommended — is to run both, compare with cross-validation, and let the scores decide.

🐘 Your pink elephant: The moment you typed your destination into Google Maps and it found a route you never would have thought of — through a back street, saving 12 minutes. AutoARIMA does the same thing with ARIMA settings. Give it the destination. Let it drive.

Key Takeaways

AutoARIMA finds the best ARIMA recipe automatically — you only need to tell it the cycle length (season_length)
ETS and ARIMA are complementary: ETS excels at trend + season patterns; ARIMA excels at value-to-value relationships
Run both and compare — whichever scores lower on your test data is your winner

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

# The three-way race: AutoARIMA vs AutoETS vs the turtle (SeasonalNaive)
sf = StatsForecast(
    models=[
        AutoARIMA(season_length=12),       # ARIMA family, auto-configured
        AutoETS(season_length=12),         # ETS family, auto-configured
        SeasonalNaive(season_length=12),   # our turtle baseline (Day 5)
    ],
    freq="MS",
    n_jobs=-1  # use all CPU cores for speed
)
forecasts = sf.forecast(df=train, h=24)

print("Three-way race — who wins on airline passengers?")
for model in ["AutoARIMA", "AutoETS", "SeasonalNaive"]:
    mae = np.mean(np.abs(forecasts[model].values - test["y"].values))
    print(f"  {model:<18}: mistake = {mae:.1f}")
print("\nBoth smart models should comfortably beat the turtle!")

→AutoARIMA is fast and accurate. AutoETS is its rival from a different family. Which one wins? And what happens when you stop picking sides and combine them? Tomorrow: the jellybean jar and the wisdom of the crowd.


Phase 2 — TWO WORKHORSES

Day 14 of 30

🤝Asking Both Models and Averaging

The wisdom of asking two experts instead of betting everything on one

📖 Before we begin

At a county fair in 1906, around 800 people guessed the weight of an ox. After discarding 13 defective entries, Francis Galton calculated the median of 787 eligible guesses: 1,207 pounds. The actual weight: 1,198 pounds — less than 1% off. The best individual expert was off by 40 pounds. The crowd beat the expert. Galton published this in Nature in 1907. It became the founding proof of wisdom-of-crowds thinking.

The Analogy

You're trying to decide whether to bring an umbrella. You check two weather apps. One says 70% chance of rain. The other says 50%. Instead of picking a side, you average them: 60%. You've just done ensemble forecasting. The combined answer is often more accurate than either individual answer.

✓ Source: The M4 Forecasting Competition (2018, 100,000 time series) found that 12 of the top 17 methods used ensembles. The winning method combined exponential smoothing and LSTM neural networks. Bates & Granger (1969) first proved mathematically that combining forecasts reduces error.

When two well-designed models disagree, you don't have to pick a winner. Average them. This often beats even the better individual model, and here's why:

Every model makes different kinds of mistakes. ARIMA might underestimate peaks. ETS might overestimate troughs. When you average them, their mistakes partly cancel each other out. The combined forecast is more stable, more reliable, and rarely the worst option.

How much to trust each model

The simplest approach: give each model equal weight (50/50). Surprisingly, this simple approach is very hard to beat. You can get fancier — give more weight to the model that scored better in cross-validation (Day 6) — but the improvement is usually small.
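Both weighting schemes can be tried on a handful of made-up numbers. In this sketch the function names and data are hypothetical, and the "fancier" weights are computed on the same data only to keep the example short. In practice they would come from cross-validation on earlier data.

```python
def combine(forecast_a, forecast_b, weight_a=0.5):
    """Weighted average of two forecasts; weight_a=0.5 is the equal-weight case."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(forecast_a, forecast_b)]

def mae(forecast, actual):
    """Mean absolute error: the average size of the mistakes."""
    return sum(abs(f - y) for f, y in zip(forecast, actual)) / len(actual)

# Made-up numbers: model A runs a bit high, model B runs a bit low
actual = [100, 110, 120, 130]
model_a = [104, 116, 124, 136]
model_b = [98, 106, 114, 128]

equal = combine(model_a, model_b)  # 50/50 average

# Fancier: weight each model by the inverse of its error
inv_a, inv_b = 1 / mae(model_a, actual), 1 / mae(model_b, actual)
weighted = combine(model_a, model_b, weight_a=inv_a / (inv_a + inv_b))

for name, f in [("Model A", model_a), ("Model B", model_b),
                ("Equal", equal), ("Weighted", weighted)]:
    print(f"{name:<9} mistake = {mae(f, actual):.2f}")
```

Because one model errs high and the other errs low, their mistakes partly cancel and the 50/50 average scores better than either model. If both models erred in the same direction, the average would land between their two scores instead.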

The rule of combination

Combining works best when:

The models are genuinely different (not two versions of the same thing)
Both models beat the turtles from Day 5
The combination error in cross-validation is lower than either individual model

The M4 Competition, the Olympics of forecasting, run in 2018 on 100,000 real-world time series, found that most of the top methods used ensembles (12 of the top 17). This is not a coincidence.

🐘 Your pink elephant: A jar of jellybeans. The crowd's average guess beats any individual. Your model ensemble is the crowd. Every time you see a jar of sweets, think: have I combined my models yet?

Key Takeaways

Combining two good models often beats either one alone, because their mistakes partly cancel each other out
Equal weighting (50/50 average) is simple and surprisingly hard to beat
The M4 Competition confirmed: ensembles win. 12 of the top 17 methods used them.

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

sf = StatsForecast(
    models=[AutoARIMA(season_length=12), AutoETS(season_length=12)],
    freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values

# Score each model individually
for model in ["AutoARIMA", "AutoETS"]:
    mae = np.mean(np.abs(forecasts[model].values - actual))
    print(f"{model:<12}: mistake = {mae:.1f}")

# Now create the ensemble: simple average of both forecasts
ensemble = (forecasts["AutoARIMA"] + forecasts["AutoETS"]) / 2
mae_ensemble = np.mean(np.abs(ensemble.values - actual))
print(f"{'Ensemble':<12}: mistake = {mae_ensemble:.1f}  ← often the winner!")

print("\n✓ A 50/50 average never scores worse than the average of the two individual scores, and it often beats both")

→You can now combine two models into something smarter than either. But what happens when you have not 2 products — but 10,000? Next up: the assembly line. Tomorrow: forecasting a whole warehouse at once.


Phase 3 — REAL-WORLD POWER

Day 15 of 30

🏭Forecasting 10,000 Products at Once

The assembly line approach — one run, every product done

📖 Before we begin

A data scientist at a supermarket chain was asked to forecast demand for every product in the store. 47,000 products. Her manager said 'you have one week.' She wrote 47,000 lines of code — one per product. It ran for 3 days. Her colleague did it in 20 lines. It ran in 4 minutes. The secret: the three-column table.

The Analogy

A car factory doesn't build one car at a time by hand. It runs an assembly line where all cars move through the same steps simultaneously. StatsForecast works the same way — instead of running your model on each product one by one (which would take hours or days), it runs all of them through the same process at once, in parallel.

✓ Source: Makridakis, Spiliotis & Assimakopoulos (2020), International Journal of Forecasting. The M5 competition used 42,840 Walmart time series. Nixtla's StatsForecast won benchmark tests on M4 (100K series) in minutes on a laptop.

So far, you've been forecasting one time series at a time. In real businesses, you might need forecasts for every product in a warehouse, every store in a chain, every employee in a workforce. That's not dozens — it's thousands or tens of thousands.

The magic column: unique_id

Remember the three-column format from Day 1? The unique_id column is what makes mass forecasting possible. Every row is labelled with a name — "product_A", "product_B", "store_42". StatsForecast reads all the rows together, automatically separates them by that label, trains a model for each one, and combines all the results back into a single table.

You don't write any loops. You don't manage files. You just hand it the full table and it handles everything.
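Conceptually, "separate by label, model each one, combine the results" looks like the sketch below. It is plain Python for illustration only, not StatsForecast's internals, and naive_forecast is a stand-in model invented for this example.

```python
from collections import defaultdict

# A tiny long-format table: one row per (unique_id, ds, y), in any order
rows = [
    {"unique_id": "product_B", "ds": "2024-02-01", "y": 52.0},
    {"unique_id": "product_A", "ds": "2024-01-01", "y": 10.0},
    {"unique_id": "product_A", "ds": "2024-03-01", "y": 11.0},
    {"unique_id": "product_B", "ds": "2024-01-01", "y": 50.0},
    {"unique_id": "product_A", "ds": "2024-02-01", "y": 12.0},
    {"unique_id": "product_B", "ds": "2024-03-01", "y": 55.0},
]

def naive_forecast(history, h):
    """Stand-in model (an assumption for this sketch): repeat the last value."""
    return [history[-1]] * h

# Step 1: separate the rows by label, keeping each series in date order
series_by_id = defaultdict(list)
for row in sorted(rows, key=lambda r: (r["unique_id"], r["ds"])):
    series_by_id[row["unique_id"]].append(row["y"])

# Steps 2 and 3: run the model on each series, then stack the results in one table
forecast_rows = [
    {"unique_id": uid, "step": step, "forecast": value}
    for uid, history in series_by_id.items()
    for step, value in enumerate(naive_forecast(history, h=2), start=1)
]

for row in forecast_rows:
    print(row)
```

Adding a third product means adding rows to the table. The grouping and forecasting code does not change.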

Speed: n_jobs=-1

The single most important line in your code for big datasets: n_jobs=-1. This tells StatsForecast to use every processing core your computer has. If you have 8 cores, it runs 8 series at the same time. On a 10,000-series dataset, this makes the difference between a 10-minute wait and a 90-second wait.

The key insight

Once you understand the three-column format, forecasting 1 series and forecasting 10,000 series are the same code. The only thing that changes is the number of rows in your table.

🐘 Your pink elephant: An assembly line at a factory where every car moves through the same steps at the same time. StatsForecast's unique_id column is the assembly line. You don't write loops. You add rows.

Key Takeaways

The unique_id column is what lets StatsForecast handle thousands of products automatically — one run, all done
n_jobs=-1 uses every CPU core you have — always include this on large datasets
Forecasting 10,000 series is the same code as forecasting 1 — only the table size changes

Python

import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive

# Build a table with 5 different products, each with 24 months of sales
products = ["product_A", "product_B", "product_C", "product_D", "product_E"]
dates = pd.date_range("2022-01", periods=24, freq="MS")
rng = np.random.default_rng(42)  # fixed seed so the example is repeatable
months = np.arange(24)

rows = []
for i, prod in enumerate(products):
    # Each product gets its own base level, a yearly wave, and random noise
    sales = 100 + 10*i + 20*np.sin(2*np.pi*months/12) + rng.integers(-10, 10, 24)
    for d, s in zip(dates, sales):
        rows.append({"unique_id": prod, "ds": d, "y": float(s)})

df = pd.DataFrame(rows)
print(f"Total rows in table: {len(df)} (5 products × 24 months each)")

# One call forecasts ALL 5 products at once
sf = StatsForecast(
    models=[AutoETS(season_length=12), SeasonalNaive(season_length=12)],
    freq="MS",
    n_jobs=-1  # ← use ALL your CPU cores for speed
)
forecasts = sf.forecast(df=df, h=12)  # predict next 12 months for each product

print(f"\nResult has {len(forecasts)} rows: 5 products × 12 months each")
print(forecasts.head(10))

→You can now forecast thousands of series at once. But your model only uses its own past history. What if you knew tomorrow's temperature? Or that a big promotion is planned? Could you use that? Tomorrow: outside information and the golden rule.

Source: Hyndman, Athanasopoulos et al. (2025). Forecasting: Principles and Practice, the Pythonic Way. otexts.com/fpppy — CC BY-NC-ND 4.0 | Original code: MIT Licence

Phase 3 — REAL-WORLD POWER

Day 16 of 30

🌤️Using Outside Information

Ice cream sales don't only depend on last month — temperature matters too

📖 Before we begin

An energy company added temperature data to its electricity demand model. Accuracy improved 31%. Then they tried to add 'next month's temperature' to predict next month's demand. The model looked incredible in testing. In production: it required a perfect temperature forecast — which doesn't exist for a month ahead. They'd built a model that needed to see the future.

The Analogy

Imagine forecasting ice cream sales using only last summer's sales. You'd do OK. But what if you also knew tomorrow's temperature forecast? You'd do much better. Outside information that helps you predict — like temperature, holidays, or marketing spend — is called outside variables (the technical term is "exogenous variables," from the Greek for "coming from outside").

✓ Source: The use of exogenous variables in time series models is called ARIMAX or SARIMAX. Box et al. (1994) cover this extensively. The critical constraint — you must know future values of outside variables at forecast time — is emphasised in Hyndman & Athanasopoulos (2021), Section 10.1.

Every model we've built so far uses only one piece of information: the past values of the thing we're predicting. Adding outside information can dramatically improve accuracy — but it comes with a catch.

The golden rule of outside variables

To use an outside variable in your forecast, you must know its future value at forecast time.

This sounds obvious but trips up many people. Examples:

✅ Planned holidays — you know next year's holidays today. Safe to use.
✅ Pre-announced promotions — the marketing team told you March will have a sale. Safe to use.
✅ Temperature forecasts — weather forecasts exist for 7-10 days ahead. Safe for short horizons.
❌ A related product's sales for the same month: to forecast next month you would need next month's related sales, which you won't know until it happens. Dangerous: you'd be predicting the future using the future.

Holidays as the simplest example

Holiday effects are the easiest outside variable to use. Christmas is always December 25th. You can encode "is this week a holiday week?" as a simple 0 or 1 for every date in history AND for every future date. No prediction required — the calendar never changes.
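
Here is a minimal sketch of that 0-or-1 encoding, assuming weekly data that starts on Mondays. The helper name holiday_week_flag and the column name is_holiday_week are made up for this illustration; they are not part of any library. The point is that the same function labels the past and the future.

```python
import pandas as pd

def holiday_week_flag(week_starts: pd.DatetimeIndex) -> pd.Series:
    """Return 1 for every week that contains December 25th, else 0."""
    flags = []
    for start in week_starts:
        end = start + pd.Timedelta(days=6)  # last day of this week
        christmas = pd.Timestamp(year=start.year, month=12, day=25)
        flags.append(int(start <= christmas <= end))
    return pd.Series(flags, index=week_starts, name="is_holiday_week")

# Weekly dates (Mondays) for the history AND for the weeks we want to forecast
history = pd.date_range("2023-01-02", periods=104, freq="W-MON")
future = pd.date_range(history[-1] + pd.Timedelta(weeks=1), periods=52, freq="W-MON")

hist_flag = holiday_week_flag(history)      # goes into the training table
future_flag = holiday_week_flag(future)     # goes into the future table

print(hist_flag[hist_flag == 1])
print(future_flag[future_flag == 1])
```

No forecasting was needed to fill in the future column. That is what makes calendar variables safe.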

🐘 Your pink elephant: A chef who needs tomorrow's ingredients list to cook today's meal. If you can get the list in advance — great! If not, the meal can't be cooked. Outside variables are those ingredients. Only use them if you can get the list before cooking.

Key Takeaways

Outside variables (exogenous variables) can dramatically improve accuracy — but you must know their future values at forecast time
Holidays are the safest outside variable to start with — the calendar never changes
Never use a variable whose future values you'll have to guess — that creates circular prediction

Python

import pandas as pd
import numpy as np
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()

# Add a simple "is this a summer month?" outside variable
# Summer = June, July, August (months 6, 7, 8) — we KNOW this for future dates too!
df["is_summer"] = df["ds"].dt.month.isin([6, 7, 8]).astype(int)

# For forecasting, we need to provide what the future values of "is_summer" will be
# This is easy — we know the calendar perfectly for any future date
future_dates = pd.date_range("1961-01", periods=24, freq="MS")
future_X = pd.DataFrame({
    "unique_id": df["unique_id"].iloc[0],  # must match the id used in df exactly
    "ds": future_dates,
    "is_summer": future_dates.month.isin([6,7,8]).astype(int)  # ✓ we know this already
})

# In StatsForecast 2.x, exogenous variables are passed directly to forecast()
# Include the X columns in df, and provide the future X values via X_df
sf = StatsForecast(models=[AutoARIMA(season_length=12)], freq="MS")
forecasts = sf.forecast(df=df, h=24, X_df=future_X)
print("Forecast with summer outside variable:")
print(forecasts.head(12))

→You can now use outside information wisely. But what happens when the data itself is broken — full of holes and bizarre spikes? Tomorrow: the damaged logbook and how to fix it.


Phase 3 — REAL-WORLD POWER

Day 17 of 30

🔧When Data Has Holes and Jumps

Every real dataset is messier than the textbook examples

📖 Before we begin

A hospital's patient count data showed zero admissions on 47 different days across 5 years. The data team assumed a quiet spell. Then they realised: those were days when the data entry system was offline. The zeroes weren't data — they were silence. Their model had been trained to believe the hospital emptied on random Tuesdays.

The Analogy

Imagine transcribing a handwritten diary entry by entry. Some pages are missing. One entry says "the factory burned down" and the numbers are bizarre for three months. Real data is never clean. A sensor breaks. Someone enters the wrong number. A one-time event causes an impossible spike. Knowing how to handle this is the difference between a textbook student and a practitioner.

✓ Source: Little & Rubin (2002), "Statistical Analysis with Missing Data." The recommendation never to fill missing values with zero is universal in the field — zeros are treated as real data by all models, completely distorting the results.

Real-world data has two main problems:

Problem 1: Missing numbers (gaps in the record)

A sensor went offline for three days. An employee forgot to enter the data for one week. Whatever the reason, there are empty rows where a number should be.

Never fill missing numbers with zero. A zero tells every model "nothing happened" — which is a lie. Instead:

Linear fill: draw a straight line between the last known value and the next known value, and use that as your estimate. Works well when the series is fairly stable.
Seasonal fill: copy the value from the same period last year. Works well when your data has a strong seasonal pattern.
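
The two fills can be compared side by side. This is a small sketch with made-up monthly numbers, assuming a gap that falls in the summer peak, where the difference between the two methods is easiest to see.

```python
import numpy as np
import pandas as pd

# Two years of monthly values with a strong yearly pattern and one gap
dates = pd.date_range("2022-01", periods=24, freq="MS")
pattern = [10, 12, 15, 20, 30, 45, 50, 48, 35, 22, 14, 11]
y = pd.Series(pattern + pattern, index=dates, dtype=float)
y.iloc[[17, 18]] = np.nan  # June and July of year 2 are missing

# Linear fill: straight line between the neighbours on either side of the gap
linear = y.interpolate(method="linear")

# Seasonal fill: copy the value from the same month one year (12 rows) earlier
seasonal = y.fillna(y.shift(12))

print(pd.DataFrame({"linear": linear, "seasonal": seasonal}).iloc[16:20])
```

The linear fill cuts straight across the summer peak, while the seasonal fill brings back last year's June and July. Seasonal fill only works if the value from one year earlier exists, so gaps in the first year of data stay empty.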

Problem 2: Outliers (numbers that are clearly wrong)

One month you sold 50,000 units instead of your normal 500 — because someone made a data entry mistake (or something truly exceptional happened). These outlier values confuse models, pulling the trend or seasonal estimates in the wrong direction.

The approach: use STL decomposition (Day 3) to pull out the remainder (the "noise" layer). Any remainder value that is more than 3× the typical noise level is probably an outlier. Investigate it. If it's a data error, fix it or replace it with a linear fill.

Important: if it was a real event (a viral tweet, a natural disaster), note it — you may want to add it as an outside variable rather than just erasing it.
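
The remainder idea can be sketched without STL. The example below uses a much simpler stand-in: a centred rolling median plays the role of the smooth part, so that it runs with pandas alone. It is not STL, and the window of 5 and the made-up numbers are assumptions for illustration. The "3× the typical noise level" check is the same.

```python
import pandas as pd

# Twelve months of made-up sales with one suspicious spike
y = pd.Series([132, 135, 138, 140, 600, 138, 142, 136, 130, 130, 135, 140], dtype=float)

# Smooth part: centred 5-month rolling median (a simple stand-in for STL's trend + seasonal)
smooth = y.rolling(window=5, center=True, min_periods=1).median()

# Remainder: what is left after removing the smooth part
remainder = y - smooth

# Typical noise level: the median absolute remainder
noise_level = remainder.abs().median()

# Flag anything more than 3x the typical noise level for investigation
suspicious = remainder.abs() > 3 * noise_level
print(y[suspicious])
```

A flag means "investigate", not "delete". Whether the value is an error or a real event is a decision for a person.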

🐘 Your pink elephant: A ship's logbook with pages torn out. A sensible captain fills in the missing pages with best estimates — not zeros. 'Nothing happened that day' is never what a missing number means.

Key Takeaways

Never fill missing values with zero — zeros are treated as real data and confuse every model
For gaps: use linear fill (draw a line between known values) or seasonal fill (copy same period last year)
Outliers over 3× the normal noise level are suspicious — investigate before deciding to keep or replace them

Python

import pandas as pd
import numpy as np

# Create a messy dataset with missing values and an outlier
dates = pd.date_range("2023-01", periods=12, freq="MS")
sales = [120, 135, np.nan, 140, 600, 138, 142, np.nan, 130, 127, 135, 140]
#              ↑ missing           ↑ outlier          ↑ missing

df = pd.DataFrame({"ds": dates, "y": sales})
print("=== Raw messy data ===")
print(df.to_string())

# Fix 1: Detect the outlier (anything over 3x the typical value is suspicious)
median_val = df["y"].median()
threshold  = median_val * 3
df["is_outlier"] = df["y"] > threshold
print(f"\nOutlier threshold (3× median): {threshold:.0f}")
print(f"Rows flagged as outliers: {df['is_outlier'].sum()}")

# Fix 2: Replace the outlier with NaN so we treat it like a missing value
df.loc[df["is_outlier"], "y"] = np.nan

# Fix 3: Fill all missing values using linear interpolation (draw a straight line)
df["y_clean"] = df["y"].interpolate(method="linear")

print("\n=== After cleaning ===")
print(df[["ds","y","y_clean"]].to_string())
print("\nNo zeros, no outliers: ready for modelling!")

→Your data is now clean. But you're still asking one model to give you one number. What if instead, you asked a whole crowd of models and averaged their answers? Tomorrow: the jellybean jar, applied to a team of models.


Phase 3 — REAL-WORLD POWER

Day 18 of 30

🫙Asking a Crowd of Models

A jar of jellybeans and the wisdom of getting many estimates

📖 Before we begin

Philip Tetlock studied expert predictions for 20 years — political scientists, economists, military analysts. His finding: 'experts' predicted the future only slightly better than dart-throwing chimps. But one group consistently did better: people who combined many diverse sources of evidence rather than committing to one theory. They were the forecasters who ensembled.

The Analogy

At a school fair, a jar of jellybeans sits on the table. 200 students each write down their best guess. The average of all 200 guesses — called the "wisdom of crowds" — is almost always closer to the real number than any individual guess. Ensemble forecasting uses the same principle: combine many models, average out their different mistakes, and get closer to the truth.

✓ Source: Galton (1907), Nature: the original crowd-estimate experiment, in which about 800 people guessed the weight of an ox. In forecasting, the M4 Competition (2018) found that combinations of diverse models typically outperform single models.

You've already seen the idea on Day 14 — combining AutoARIMA and AutoETS. Now let's build a proper ensemble with more models and think carefully about how to combine them.

Choosing models for an ensemble

The ensemble works best when its members disagree with each other. If every model makes the same mistake, averaging them doesn't help. You want models that see the data differently:

AutoETS — strong on trend + seasonal structure
AutoARIMA — strong on value-to-value patterns
Theta method — a simple method that consistently punches above its weight in competitions
DynamicOptimizedTheta — a modern enhanced version of Theta

Weighted vs equal averaging

The simplest ensemble gives every model equal weight. A fancier approach gives more weight to the models that scored better in cross-validation. The research finding: equal weighting usually performs within 1–2% of optimal weighting, and it never blows up catastrophically. Equal weighting is robust. Use it as your starting point and only get fancier if you have a clear reason to.
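
Both kinds of averaging fit in a few lines. This sketch uses invented forecasts and invented cross-validation errors. Weighting each model by 1 / error is one possible scheme chosen for illustration, not the only way to do it.

```python
import numpy as np

# Hypothetical forecasts from three models for the same 4 future periods
forecasts = np.array([
    [100.0, 110.0, 120.0, 130.0],  # model A
    [ 90.0, 105.0, 125.0, 140.0],  # model B
    [110.0, 115.0, 118.0, 126.0],  # model C
])
# Hypothetical cross-validation errors (MAE) for each model; lower is better
cv_mae = np.array([10.0, 20.0, 40.0])

# Equal weighting: a plain average across models
equal = forecasts.mean(axis=0)

# One possible weighted scheme: weights proportional to 1 / error, summing to 1
weights = (1.0 / cv_mae) / (1.0 / cv_mae).sum()
weighted = weights @ forecasts

print("Equal weights:   ", equal)
print("Weights used:    ", weights.round(3))
print("Weighted average:", weighted.round(1))
```

The weighted version leans on model A because it had the smallest error. If those error estimates are noisy, the weights are noisy too, which is one reason the simple average is hard to beat.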

One warning

Only include models that individually beat the Day 5 turtles. If a model can't beat "copy last year," including it in your ensemble just drags down the good models.

🐘 Your pink elephant: The M4 Competition podium — 100,000 time series, best forecasters in the world. First place: an ensemble. Second: an ensemble. Third: an ensemble. Whenever you're about to pick one model, remember the podium.

Key Takeaways

Ensembles work best when models disagree — diverse models that make different mistakes cancel each other out
Equal weighting (simple average) is robust and within 1–2% of optimal weighting most of the time
Never include a model that can't beat SeasonalNaive — it only pulls the ensemble down

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, AutoTheta, DynamicOptimizedTheta
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

# Build a 4-model ensemble
models = [
    AutoETS(season_length=12),
    AutoARIMA(season_length=12),
    AutoTheta(season_length=12),
    DynamicOptimizedTheta(season_length=12),
]

sf = StatsForecast(models=models, freq="MS")
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values

print("Individual model scores:")
model_cols = ["AutoETS","AutoARIMA","AutoTheta","DynamicOptimizedTheta"]
for col in model_cols:
    mae = np.mean(np.abs(forecasts[col].values - actual))
    print(f"  {col:<28}: {mae:.1f}")

# Equal-weight ensemble (simple average of all 4)
ensemble = forecasts[model_cols].mean(axis=1)
mae_e = np.mean(np.abs(ensemble.values - actual))
print(f"\n  {'Ensemble (equal weight)':<28}: {mae_e:.1f}  ← usually wins!")

→You can combine models beautifully. But you're only giving people one number. A weather forecast that says '22°C' without a range is hiding something. Tomorrow: the honest forecast and why ranges matter more than point predictions.


Phase 3 — REAL-WORLD POWER

Day 19 of 30

🎯Showing How Confident You Are

Every forecast should come with an honest "I might be wrong by this much"

📖 Before we begin

A logistics company gave its client a forecast: '4,200 units next month.' The client ordered exactly 4,200. The actual demand: 5,800. They ran out. A lawsuit followed. In court, the forecaster was asked: 'Did you communicate any uncertainty?' The answer: 'No — we gave a single number.' The honest forecast would have said: '4,200 — but could reasonably be anywhere from 3,400 to 5,100.'

The Analogy

When a weather app says "Thursday: 22°C," a smart one also says "between 18°C and 26°C." That range is the honesty. It's saying: here's my best guess, but here's how far off I might be. In forecasting, these ranges are called prediction intervals — the space between the lowest and highest plausible future values.

✓ Source: Prediction intervals for exponential smoothing models are derived from Hyndman et al. (2008), "Forecasting with Exponential Smoothing." The 80% and 95% interval convention is standard in the field. Calibration research: Kolassa (2016), International Journal of Forecasting.

There are two parts to any honest forecast:

The point forecast: "My best guess is 500 units."
The prediction interval: "But it could reasonably be anywhere between 430 and 580 units."

The interval tells you how confident the model is. A narrow interval means the model is very confident. A wide interval means "I really can't tell you exactly — there's a lot of uncertainty here."

What does 80% or 95% mean?

The intervals come with a coverage level:

80% interval: if you make 100 forecasts with this interval, about 80 of the actual values will fall inside it
95% interval: about 95 of the actual values will fall inside it (wider, more cautious)

Use 95% intervals when the cost of being wrong is high (hospital stock, airline capacity). Use 80% when you need tighter planning ranges.

One important caveat: calibration

Prediction intervals are only honest if the model structure fits your data. A model with the wrong seasonal length or missing trend will produce intervals that are too narrow — they'll claim more confidence than they have. Always check empirically: over 100 forecasts, about 95 actual values should fall inside your 95% interval. If significantly fewer do, your model is mis-specified.
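
That empirical check is just counting. Below is a minimal sketch with a made-up helper called empirical_coverage and ten invented forecasts. In practice the bounds would come from the lo and hi columns of your cross-validation results.

```python
import numpy as np

def empirical_coverage(actual, lower, upper):
    """Share of actual values that landed inside [lower, upper]."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    inside = (actual >= lower) & (actual <= upper)
    return inside.mean()

# Toy example: 10 outcomes with the 95% interval bounds that were forecast for them
actual = [102, 98, 110, 95, 130, 101, 99, 104, 97, 100]
lower = [90] * 10
upper = [110] * 10

coverage = empirical_coverage(actual, lower, upper)
print(f"Empirical coverage: {coverage:.0%} (target: 95%)")
```

Ten forecasts are far too few to judge calibration. The same function becomes meaningful once you feed it hundreds of forecast-and-outcome pairs.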

Intervals get wider the further you forecast

This is completely normal and correct. Predicting tomorrow should be more certain than predicting next year. If a model gives you the same width interval for "next week" and "next year," something is wrong with the model.

🐘 Your pink elephant: A weather map with a cone of uncertainty — the hurricane could go anywhere in that cone. The cone gets wider the further out you look. Your forecast interval is the cone. Never pretend you have a laser when you have a cone.

Key Takeaways

Prediction intervals show your honest uncertainty — always include them, never just give a single number
95% interval: 95 out of 100 future values should land inside it. Use for high-stakes decisions.
Wider intervals further into the future is correct — uncertainty grows the longer you look ahead

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()
train = df.head(120)

sf = StatsForecast(models=[AutoETS(season_length=12)], freq="MS")

# level=[80, 95] adds two uncertainty bands: 80% and 95%
forecasts = sf.forecast(df=train, h=24, level=[80, 95])
print(forecasts.columns.tolist())
# Columns include: AutoETS (best guess), AutoETS-lo-80, AutoETS-hi-80,
#                  AutoETS-lo-95, AutoETS-hi-95

# Check how far apart the 95% interval is at 6 months vs 24 months out
width_6  = forecasts["AutoETS-hi-95"].iloc[5]  - forecasts["AutoETS-lo-95"].iloc[5]
width_24 = forecasts["AutoETS-hi-95"].iloc[23] - forecasts["AutoETS-lo-95"].iloc[23]
print(f"\n95% interval width at 6 months out:  {width_6:.0f} passengers")
print(f"95% interval width at 24 months out: {width_24:.0f} passengers")
print("→ Intervals get wider the further out we forecast — that's correct!")

→Your forecasts are now honest about uncertainty. But everything so far has used mathematical formulas. What if instead, you taught the computer to find patterns the same way a child learns to recognise cats? Tomorrow: machine learning for time series.


Phase 3 — REAL-WORLD POWER

Day 20 of 30

🤖Teaching a Computer to Forecast

Instead of math formulas, teach the computer patterns from examples

📖 Before we begin

In the Walmart M5 Forecasting Competition, the challenge was to forecast 42,840 time series (product-store combinations plus their aggregates). The winning team used no traditional statistics at all: no ETS, no ARIMA. LightGBM-based solutions dominated the top of the leaderboard. LightGBM is a machine learning tool, and it learned patterns from all of those series simultaneously. One model for everything, rather than one model per product.

The Analogy

How do you teach a child to recognise cats? Not by writing down every rule ("cats have whiskers, four legs, fur..."). You show them hundreds of pictures: "this is a cat, this isn't, this is." The child's brain finds the pattern. Machine learning forecasting works the same way — instead of giving the model a formula, you show it thousands of examples and let it find the patterns itself.

✓ Source: MLForecast library by Nixtla (2022). LightGBM was published by Ke et al. at Microsoft Research (2017), NeurIPS. In the M5 Competition (Walmart, 2020), LightGBM-based solutions dominated the top 10 spots on the 42,840 series dataset.

Every method so far (ETS, ARIMA, Holt-Winters) works from a mathematical formula. The formula says: "here's how to combine past values to get a forecast." Machine learning skips the formula entirely. Instead, it learns the relationship between past and future from thousands of examples.

How it works in three steps

Create features: For each point in history, create a row of clues — what were the values 1, 2, 3 weeks ago? What month is it? Is it a holiday? These clues are called features.
Create labels: The answer column — what was the actual value this week?
Train: Show the model every (features → answer) pair in history. It learns which combinations of features predict the answer.
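
The three steps can be sketched on a tiny example. The numbers are made up, and a plain least-squares fit from NumPy stands in for LightGBM so that the example runs without extra libraries. A real model would learn far richer patterns, but the table of features and labels looks the same.

```python
import numpy as np
import pandas as pd

# A short weekly series with a steady pattern (made-up numbers)
y = pd.Series([10.0, 12, 14, 16, 18, 20, 22, 24, 26, 28], name="y")

# Step 1: features = the clues available before each week
# Step 2: label = the actual value this week (the "y" column)
table = pd.DataFrame({
    "lag_1": y.shift(1),  # value 1 week ago
    "lag_2": y.shift(2),  # value 2 weeks ago
    "y": y,
}).dropna()               # the first rows have no lags yet

# Step 3: train. Least squares is a stand-in learner here, not LightGBM.
X = table[["lag_1", "lag_2"]].to_numpy()
coef, *_ = np.linalg.lstsq(X, table["y"].to_numpy(), rcond=None)

# Forecast next week from the two most recent values
next_features = np.array([y.iloc[-1], y.iloc[-2]])
next_value = float(next_features @ coef)

print(table.head(3))
print(f"Learned weights: {coef.round(2)}, next week: {next_value:.1f}")
```

Nobody told the model that the series rises by 2 each week. It worked that out from the (features → answer) pairs.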

LightGBM — the most useful ML tool for forecasting

LightGBM is a specific machine learning tool that works exceptionally well on tabular data (data in rows and columns). It builds a series of decision trees, each one learning from the mistakes of the previous one. LightGBM-based solutions won the Walmart M5 forecasting competition and dominated the top 10, on a dataset of 42,840 series.

When to use ML over ETS/ARIMA

ML forecasting shines when you have lots of series (hundreds or thousands), lots of history, and you want to use many outside variables (promotions, weather, holidays) all at once. It struggles on short series (under 100 observations) or when the patterns are simple.

🐘 Your pink elephant: A child learning to recognise cats from 1,000 photos instead of reading a description. That's machine learning. Formula models read the description. ML models look at the photos.

Key Takeaways

Machine learning forecasting learns patterns from examples rather than using a fixed formula
LightGBM-based solutions won the Walmart M5 Competition (42,840 series), which makes LightGBM a go-to ML tool for forecasting
ML wins when you have lots of series, lots of history, and many outside variables; ETS/ARIMA wins on short or simple series

Python

from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series
from lightgbm import LGBMRegressor
import numpy as np

# Generate some example daily sales data (3 products, 2 years)
df = generate_daily_series(n_series=3, min_length=365, max_length=730)

# Set up MLForecast with LightGBM as the learning engine
# lags: use the values from 1, 7, 14, and 28 days ago as clues
# date_features: use day-of-week and month as extra clues
mlf = MLForecast(
    models=[LGBMRegressor(n_estimators=100, verbose=-1)],  # the learning engine
    freq="D",           # daily data
    lags=[1, 7, 14, 28],  # clues from recent past
    date_features=["dayofweek", "month"]  # calendar clues
)

# Train/test split: hold out the last 30 days of EACH series
# (the series can end on different dates, so one row count or one fixed date won't line up)
h = 30
test  = df.groupby("unique_id", observed=True).tail(h)
train = df.drop(test.index)

mlf.fit(train)
forecasts = mlf.predict(h=h)  # predict 30 days ahead for each series

# Match forecasts to actuals by series and date before scoring
scored = forecasts.merge(test, on=["unique_id", "ds"])
mae = np.mean(np.abs(scored["LGBMRegressor"] - scored["y"]))
print(f"LightGBM average mistake: {mae:.2f}")
print("\nThis same code works on 42,840 series: just add more rows to df!")

→Machine learning is powerful — but what you feed it matters enormously. Garbage clues, garbage forecast. Tomorrow, you become a feature engineer — a detective who builds the best clue set possible. Tomorrow: the clue architect.


Phase 3 — REAL-WORLD POWER

Day 21 of 30

🧪Building Better Clues for Your Model

Garbage in, garbage out — great clues lead to great forecasts

📖 Before we begin

Two data scientists used the same LightGBM model on the same dataset. One got a 23% error rate. The other got 9%. Same algorithm, same data, same computer. The entire difference: one engineer built rich, thoughtful features. The other fed in raw numbers and hoped for the best. Feature engineering is where ML forecasting is won or lost.

The Analogy

A detective doesn't just look at the obvious clue. They look at what surrounds it — the mud on the boots, the newspaper folded to page 7, the coffee cup still warm. In machine learning forecasting, the "clues" you give the model are called features. The better your clues, the better your forecast.

✓ Source: MLForecast documentation (Nixtla, 2022). The danger of data leakage in feature engineering is documented in Kaufman et al. (2012), KDD. The rule — never use future information to create past features — is absolute.

In machine learning forecasting, the model can only learn from what you give it. If you give it useless clues, it will make useless forecasts. Here are the three main types of clues:

Clue type 1: Lags (recent memories)

The most basic clue: what was the value N periods ago? With weekly data, "sales 1 week ago" is a lag-1 feature and "sales 4 weeks ago" is a lag-4 feature. The number of periods to look back should match your data's patterns: a weekly pattern in daily data needs lags of 7, 14, 21, and a yearly pattern needs lags of 52 (weekly data) or 12 (monthly data).

Clue type 2: Rolling statistics

Instead of a single past value, summarise a window of recent values. "Average sales over the past 4 weeks" is a rolling mean. "Maximum sales in the past 8 weeks" is a rolling max. These smooth out the random noise and give the model a bigger picture of recent conditions.

Clue type 3: Calendar features

What day of the week is it? What month? Is it a holiday? Is it a quarter-end? These calendar clues are free information — you always know them for future dates.

The one rule you must never break: no data leakage

Never create a feature that uses information from the future to describe the past. Example: if you're predicting Monday's sales, never use Tuesday's actual sales as a clue — because on Monday, you don't know Tuesday yet. This mistake is called data leakage and it produces models that look perfect in testing but are completely useless in real life.
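
Leakage often sneaks in through rolling statistics. This small sketch, with made-up numbers, shows a rolling mean built the leaky way and the safe way. The only difference is one shift(1).

```python
import pandas as pd

y = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="y")

# LEAKY: the window ending at row t includes y[t], the value we want to predict
leaky_mean = y.rolling(3).mean()

# SAFE: shift by one first, so the window ends at y[t-1]
safe_mean = y.shift(1).rolling(3).mean()

print(pd.DataFrame({"y": y, "leaky_mean": leaky_mean, "safe_mean": safe_mean}))
```

On the last row, the leaky feature has already "seen" the 50 it is supposed to help predict. The safe feature only uses 20, 30 and 40.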

🐘 Your pink elephant: Sherlock Holmes vs. a uniformed officer at the same crime scene. Both see the same room. Holmes sees clues. The officer sees furniture. Your features are your Holmes-vision. The model is the deduction engine.

Key Takeaways

Three types of clues: lags (recent past values), rolling statistics (summaries of recent windows), calendar features (day/month/holiday)
Never use future information to describe the past — this is data leakage and makes your model worthless in real life
Calendar clues are free — you always know the future date, so you always know the day of week and month

Python

import pandas as pd
import numpy as np

# Imagine we have daily sales for 90 days
np.random.seed(42)
dates = pd.date_range("2023-01-01", periods=90, freq="D")
sales = 200 + 30*np.sin(np.arange(90)*2*np.pi/7) + np.random.normal(0,10,90)
df = pd.DataFrame({"ds": dates, "y": sales})

# ── Build features (clues for the model) ──────────────────────────

# Clue type 1: Lags — what happened 1, 7, and 14 days ago?
df["lag_1"]  = df["y"].shift(1)   # yesterday
df["lag_7"]  = df["y"].shift(7)   # same day last week
df["lag_14"] = df["y"].shift(14)  # same day 2 weeks ago

# Clue type 2: Rolling statistics — summaries of recent windows
df["roll_mean_7"]  = df["y"].shift(1).rolling(7).mean()   # avg of last 7 days
df["roll_max_14"]  = df["y"].shift(1).rolling(14).max()   # max of last 14 days

# Clue type 3: Calendar features
df["day_of_week"]  = df["ds"].dt.dayofweek  # 0=Monday, 6=Sunday
df["month"]        = df["ds"].dt.month

# Drop early rows where lags don't exist yet
df = df.dropna()

print("Features ready for machine learning:")
print(df[["ds","y","lag_1","lag_7","roll_mean_7","day_of_week"]].head(5).to_string())
print(f"\n{len(df)} training examples with {len(df.columns)-2} clues each")

→You now build great clues. But how do you decide which model to use in the first place? There are so many choices. Tomorrow: the five-rung ladder — climb only as high as you need to.


Phase 3 — REAL-WORLD POWER

Day 22 of 30

🗂️Choosing the Right Tool for the Job

A ladder with five rungs — climb only as high as your data requires

📖 Before we begin

Research consistently finds that simple model selection rules match or outperform experienced human judgment on most datasets — while being applied in seconds rather than hours. The lesson: systematic process beats intuition. (Fildes & Petropoulos, IJF 2015; Makridakis et al., IJF 2020.)

The Analogy

A plumber doesn't use a jackhammer to tighten a bolt. A carpenter doesn't use a toothpick to drive a nail. Every tool has its right place. Forecasting models are the same — the right tool depends on your situation, not on which one sounds most impressive.

✓ Source: Model selection framework synthesised from Hyndman & Athanasopoulos (2021), Chapter 12, and Makridakis et al. (2020). The hierarchy from simple to complex is universally recommended to avoid overfitting.

Here is a simple five-rung ladder for choosing your forecasting approach:

Rung 1: You have fewer than 2 complete seasonal cycles

If you have monthly data, that means under 24 months. If weekly, under 104 weeks. You don't have enough repeats to reliably identify the seasonal pattern. Use SeasonalNaive ("copy last year") and be honest about how uncertain you are. No complex model can extract what isn't there.

Rung 2: You have 20–100 observations, simple pattern

Start with AutoETS. Check whether it beats SeasonalNaive. If it does, you're done. If not, check your data quality first (Day 17).

Rung 3: You have 100+ observations, clear seasonal pattern

Run both AutoETS and AutoARIMA, then combine them into an ensemble (Day 14 and 18). Verify with cross-validation (Day 6). This is the sweet spot for most business forecasting.

Rung 4: You have many series with outside variables

Add MLForecast with LightGBM (Day 20). Use calendar features (Day 21) and any promotions or weather data you have. This is where machine learning starts pulling ahead of the formula-based models.

Rung 5: You have very long series OR very large scale

Consider neural network methods (Days 23–24). These require more data than formula models and take longer to train, but they find patterns that other tools miss entirely on very long, complex series.
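
The ladder can be written down as a small function. This is a sketch, not an official rule set: choose_rung is a made-up helper, and the cut-offs for "many series", "very long" and "very large scale" are assumptions picked for illustration, since the ladder does not give exact numbers for them. It returns the highest rung whose condition is met, except that too little history always means Rung 1.

```python
def choose_rung(n_obs, season_length, n_series=1, has_outside_vars=False,
                many_series=100, very_long=10_000, huge_scale=100_000):
    """Suggest a ladder rung (1-5) for a dataset.

    many_series, very_long and huge_scale are illustrative cut-offs,
    not values taken from the course.
    """
    if n_obs < 2 * season_length:
        return 1  # fewer than 2 seasonal cycles: SeasonalNaive
    if n_obs >= very_long or n_series >= huge_scale:
        return 5  # very long series or very large scale: neural methods
    if n_series >= many_series and has_outside_vars:
        return 4  # many series plus outside variables: MLForecast + LightGBM
    if n_obs >= 100:
        return 3  # AutoETS + AutoARIMA ensemble
    return 2      # AutoETS, checked against SeasonalNaive

# 18 months of monthly data: not enough cycles yet
print(choose_rung(n_obs=18, season_length=12))
# 12 years of monthly data for one product
print(choose_rung(n_obs=144, season_length=12))
# 2 years of daily data for 500 products, with promotions data
print(choose_rung(n_obs=730, season_length=7, n_series=500, has_outside_vars=True))
```

Treat the answer as a starting rung. The model on that rung still has to beat the rungs below it before you keep it.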

🐘 Your pink elephant: A tradesperson's toolbox. The right tool is the one that fits the job — not the shiniest one, not the most expensive one. Complex = not always better. The 5-rung ladder tells you exactly when to climb higher.

Key Takeaways

Start simple: always try SeasonalNaive and AutoETS before anything else
More complex only if it clearly beats simpler — never choose complexity for its own sake
ML models (LightGBM) win when you have hundreds of series and useful outside variables

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

# Run the full ladder in one call: turtle, 2 serious models
sf = StatsForecast(
    models=[
        SeasonalNaive(season_length=12),  # Rung 1: our turtle baseline
        AutoETS(season_length=12),        # Rung 2: formula model
        AutoARIMA(season_length=12),      # Rung 3: second formula model
    ],
    freq="MS"
)
forecasts = sf.forecast(df=train, h=24)
actual = test["y"].values

results = {}
for model in ["SeasonalNaive", "AutoETS", "AutoARIMA"]:
    mae = np.mean(np.abs(forecasts[model].values - actual))
    results[model] = mae
    print(f"  {model:<18}: average mistake = {mae:.1f}")

# Ensemble of the two serious models
ensemble = (forecasts["AutoETS"] + forecasts["AutoARIMA"]) / 2
mae_e = np.mean(np.abs(ensemble.values - actual))
results["Ensemble"] = mae_e  # include the ensemble in the comparison
print(f"  {'Ensemble':<18}: average mistake = {mae_e:.1f}")

winner = min(results, key=results.get)
print(f"\nWinner: {winner}: this is the rung we need for this data")

→You can now choose the right tool. But all these tools — ETS, ARIMA, LightGBM — work from formulas or examples. What if a network of artificial neurons could find patterns that no formula could even describe? Tomorrow: how neural networks see time.

Source: Hyndman, Athanasopoulos et al. (2025). Forecasting: Principles and Practice, the Pythonic Way. otexts.com/fpppy — CC BY-NC-ND 4.0 | Original code: MIT Licence

Phase 4 — MODERN EDGE

Day 23 of 30

🧠How Neural Networks See Time

Instead of one formula, hundreds of layers of pattern-recognition

📖 Before we begin

In 2020, a neural network called N-BEATS was benchmarked on the M4 competition data after the contest had closed. It outperformed the competition's winning entry. The researchers who built it had one unusual rule: the generic version of the network was given no built-in knowledge of trend or seasonality. It had to discover patterns entirely from scratch. It discovered seasonality anyway.

The Analogy

Imagine having 100 assistants, each watching your sales data from a different distance. One assistant watches only the last 3 days. Another watches the last 2 weeks. Another watches the past year. Another watches the past 5 years. Each reports their pattern. A supervisor combines all the reports into one forecast. That's roughly what a neural network does.

✓ Source: N-BEATS was published by Oreshkin et al. (2020), ICLR; evaluated on the M4 data after the contest, it outperformed the M4 Competition winner. NHITS was published by Challu et al. (2023), AAAI Conference. Both are implemented in the NeuralForecast library by Nixtla.

Neural networks are computing systems loosely inspired by how neurons in a brain connect to each other. For time series forecasting, the key idea is that the network can learn patterns at many different scales simultaneously.

N-BEATS: the dedicated forecasting network

N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series) was specifically designed for forecasting, unlike general neural networks that were adapted for the task. It uses a "doubly residual" structure: each block produces a partial forecast plus a "backcast" of the part of the input it has already explained. The backcast is subtracted from the input so that the next block focuses on the harder parts that remain, and the partial forecasts are added together.
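
The doubly residual idea can be sketched in a few lines of NumPy. This is a toy illustration only, not the N-BEATS implementation: the two "blocks" below are hypothetical least-squares fits (a flat level, then a straight line) standing in for trained neural layers. Each block returns a backcast (the part of the input window it explains) and a partial forecast; the backcast is subtracted from what remains, and the partial forecasts are summed.

```python
import numpy as np

def poly_block(residual, horizon, degree):
    """Toy block: fit a polynomial to the residual window.

    Returns (backcast, forecast). In N-BEATS these would come from
    trained layers; a least-squares fit is an assumption made here
    to keep the sketch runnable without any training.
    """
    t_in = np.arange(len(residual))
    t_out = np.arange(len(residual), len(residual) + horizon)
    coeffs = np.polyfit(t_in, residual, degree)
    return np.polyval(coeffs, t_in), np.polyval(coeffs, t_out)

def doubly_residual_forecast(window, horizon, degrees=(0, 1)):
    residual = np.asarray(window, dtype=float)
    forecast = np.zeros(horizon)
    for degree in degrees:
        backcast, partial = poly_block(residual, horizon, degree)
        residual = residual - backcast   # remove what this block explained
        forecast = forecast + partial    # add this block's share of the forecast
    return forecast, residual

window = 2.0 * np.arange(10) + 5.0       # a simple upward trend: 5, 7, ..., 23
forecast, leftover = doubly_residual_forecast(window, horizon=3)
print(forecast)                          # continues the trend
print(np.abs(leftover).max())            # close to zero: nothing left to explain
```

The first block captures only the average level; the second block sees what is left (the slope) and explains that. Neither block alone gets the forecast right, but their sum does.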

NHITS: multi-scale time windows

NHITS (Neural Hierarchical Interpolation for Time Series) is the upgraded version. It explicitly divides the problem into multiple time scales — one part of the network handles short patterns, another handles medium patterns, another handles long ones. Think of it as the 100-assistants analogy above, but automated and trained.
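
A rough sketch of the two multi-scale ingredients, under simplifying assumptions: the input window is summarised by max-pooling at several rates (a coarse view for long patterns, a fine view for short ones), and a coarse forecast made of only a few points is stretched to the full horizon by interpolation. The trained layers that sit between those two steps are left out, and the helper names and pooling sizes are hypothetical.

```python
import numpy as np

def max_pool(window, kernel):
    """Summarise a window by taking the max of each chunk of `kernel` points."""
    usable = len(window) - len(window) % kernel   # drop any ragged tail
    return window[:usable].reshape(-1, kernel).max(axis=1)

def interpolate_to_horizon(knots, horizon):
    """Stretch a handful of coarse forecast points to `horizon` steps."""
    knot_pos = np.linspace(0, horizon - 1, num=len(knots))
    return np.interp(np.arange(horizon), knot_pos, knots)

window = np.arange(48, dtype=float)               # 48 months of history
for kernel in (1, 4, 12):                         # fine, medium, coarse views
    print(kernel, max_pool(window, kernel).shape)

coarse = np.array([100.0, 120.0, 110.0])          # 3 knots for a 24-step horizon
smooth = interpolate_to_horizon(coarse, horizon=24)
print(smooth.shape)
```

The coarse view needs only a few numbers to describe a slow pattern, which is why the part of the network that handles long patterns can be small.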

When should you reach for neural networks?

You have a long history (at least a few hundred observations per series)
The pattern is complex — multiple overlapping seasonal cycles
You have many series and want to share learning across them
ETS and ARIMA ensembles are not performing well enough

Neural networks are not always better. On short series or simple patterns, they are often beaten by humble AutoETS. Always test.

🐘 Your pink elephant: 100 assistants, each watching your sales data from a different distance. One watches yesterday. One watches last quarter. One watches the last 5 years. They all report back. A supervisor combines the reports. That's NHITS. Each layer watches a different window.

Key Takeaways

Neural networks learn patterns at many different time scales simultaneously — short, medium, and long
N-BEATS and NHITS are purpose-built for forecasting (not general-purpose networks adapted to the task)
Neural networks need hundreds of observations and complex patterns to shine — on simple short series they often lose to AutoETS

Python

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, NBEATS
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()
train, test = df.head(120), df.tail(24)

# NHITS: multi-scale neural forecaster
# input_size = how many months of history to look at (48 = 4 full years)
# h = how many months ahead to predict
nf = NeuralForecast(
    models=[
        NHITS(h=24, input_size=48, max_steps=200),   # 200 learning rounds
        NBEATS(h=24, input_size=48, max_steps=200),
    ],
    freq="MS"
)

nf.fit(df=train)
forecasts = nf.predict()

# Score against actual test data
actual = test["y"].values
for model in ["NHITS", "NBEATS"]:
    mae = np.mean(np.abs(forecasts[model].values[:24] - actual))
    print(f"{model}: average mistake = {mae:.1f}")

print("\nNote: neural nets often need MORE data than 120 months to really shine")
print("On airline passengers, AutoETS may still beat them — always compare!")

→You understand what neural networks do. Now: how do you actually run one in 10 lines of code? Tomorrow: three settings and autopilot takes over.


Phase 4 — MODERN EDGE

Day 24 of 30

✈️Running NeuralForecast in Practice

Three settings, then let the autopilot take over

📖 Before we begin

The NeuralForecast library's fastest model can forecast 100,000 time series in under 2 minutes on a standard laptop. Before 2022, that would have required a data centre, a specialist team, and several days. The democratisation of neural forecasting happened quietly. You now have access to what only large research teams had 5 years ago.

The Analogy

A commercial pilot knows how to fly the plane manually but uses autopilot for the long cruise section. The pilot sets three things: destination (h — how far ahead), how much history to use (input_size), and how long to train (max_steps). Then autopilot takes over. NeuralForecast works the same way.

✓ Source: NeuralForecast library by Nixtla (2022). The input_size=2×h rule of thumb is from the original NHITS paper (Challu et al., 2023). AutoNHITS with automatic hyperparameter search is available as of NeuralForecast 1.5.

NeuralForecast simplifies neural network forecasting to three key settings:

h — the horizon: how many periods ahead you want to predict. 12 for monthly 1-year forecasts. 28 for daily 4-week forecasts.
input_size — how much history the network looks at each time it makes a prediction. A common starting rule: use 2× the horizon. If h=12, try input_size=24.
max_steps — how many learning rounds to run. More steps = more learning, but also more time. Start with 100–200 and increase if accuracy keeps improving.
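
To make h and input_size concrete, here is a small NumPy sketch of how one series could be cut into training examples: each example pairs input_size past values with the h values that follow. The windowing shown illustrates the idea and is not NeuralForecast's internal code; the function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, input_size, h):
    """Cut a series into (history, future) pairs of fixed length."""
    series = np.asarray(series, dtype=float)
    windows = sliding_window_view(series, input_size + h)
    return windows[:, :input_size], windows[:, input_size:]

series = np.arange(120)                  # stand-in for 120 monthly values
X, Y = make_windows(series, input_size=24, h=12)
print(X.shape, Y.shape)                  # one row per training example
print(X[0][-1], Y[0][0])                 # last input value, first target value
```

The larger input_size and h are, the fewer examples a series of fixed length can supply, which is one reason short series struggle with large windows.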

Training vs. predicting

Unlike formula models (ETS, ARIMA) that fit almost instantly, neural networks have a training phase that can take from seconds to hours depending on how many series you have and how long max_steps is. After training, prediction is fast.

Cross-validation still applies

Everything from Day 6 applies here too. Neural networks must still be tested with walk-forward cross-validation — test windows always after training windows, no exceptions. The principle doesn't change just because the method is more complex.
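
As a reminder of what "test windows always after training windows" means in index terms, the sketch below generates walk-forward splits for a series of 144 observations. The helper is hypothetical and written in plain Python; in practice the NeuralForecast object's own cross_validation method plays this role.

```python
def walk_forward_splits(n_obs, h, n_windows, step=None):
    """Yield (train_end, test_start, test_end) index triples, oldest first.

    Training data is always rows [0, train_end); the test window is
    rows [test_start, test_end), which sit strictly after it.
    """
    step = h if step is None else step
    for i in range(n_windows):
        test_end = n_obs - (n_windows - 1 - i) * step
        test_start = test_end - h
        yield test_start, test_start, test_end

splits = list(walk_forward_splits(n_obs=144, h=12, n_windows=3))
for train_end, test_start, test_end in splits:
    print(f"train rows 0..{train_end - 1}, test rows {test_start}..{test_end - 1}")
```

Each later window trains on everything the earlier windows saw plus their test data, so no forecast is ever scored on data the model was trained on.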

AutoNHITS

If you're not sure what values of input_size and max_steps to use, try AutoNHITS — it searches for good settings automatically, like AutoETS does for formula models.

🐘 Your pink elephant: Cockpit controls simplified to three dials: destination (h), window size (input_size), training rounds (max_steps). h = where. input_size = how much history. max_steps = how long to learn. Three dials, then autopilot.

Key Takeaways

h (horizon) is how far ahead to predict, input_size is how much history to use — start with input_size = 2×h
max_steps controls training time: start with 100–200, increase only if accuracy keeps improving
Neural networks train slowly but predict fast — you can re-use the trained model for new forecasts

Python

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.utils import AirPassengersPanel
import pandas as pd
import numpy as np

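# Prepare data in the standard 3-column format (unique_id, ds, y)
df = AirPassengersPanel.copy()
# The panel holds two series, "Airline1" and "Airline2"; keep one for this demo
df = df[df["unique_id"] == "Airline1"][["unique_id", "ds", "y"]]
# Hold out the last 24 months so the test set matches h=24
train = df[df["ds"] < "1959-01-01"]
test  = df[df["ds"] >= "1959-01-01"]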

# Three settings: h (horizon), input_size (history window), max_steps (training rounds)
nf = NeuralForecast(
    models=[
        NHITS(
            h=24,            # predict 24 months ahead
            input_size=48,   # look at 48 months of history (= 2×h, the rule of thumb)
            max_steps=300,   # 300 learning rounds — increase if not converging
        )
    ],
    freq="MS"
)

print("Training... (this takes a few seconds)")
nf.fit(df=train)

print("Predicting 24 months ahead...")
forecasts = nf.predict()

mae = np.mean(np.abs(forecasts["NHITS"].values - test["y"].values))
print(f"NHITS mistake: {mae:.1f} passengers/month average")
print("Compare this to AutoETS to see whether the neural net was worth the extra training time!")

→You can train neural networks. But what if you didn't have to train at all? What if a model pre-trained on millions of time series could make good forecasts on your data — instantly, with zero training? Tomorrow: the weather satellite you didn't build.


Phase 4 — MODERN EDGE

Day 25 of 30

🌐Pre-Trained Models — Forecasting Without Training

Like using a weather satellite that someone else built and launched

📖 Before we begin

Zero-shot forecasting sounds like magic: a model that has never seen your data, trained on completely different industries, makes accurate predictions the moment you hand it your numbers. In 2024, Amazon published a peer-reviewed study showing their Chronos model — trained on everything from electricity to finance — beat purpose-built models on datasets it had never seen.

The Analogy

Weather forecasters don't build their own satellites. They use satellites that were built and launched by someone else, trained on decades of atmospheric data. TimeGPT and Chronos are the equivalent for time series — they've already been trained on enormous amounts of data, so you don't need to train them yourself. Just hand them your data and ask for a forecast.

✓ Source: TimeGPT-1 was described by Garza et al. in an arXiv preprint (first posted 2023) and is available via the Nixtla API. Amazon Chronos was published by Ansari et al. (2024), Transactions on Machine Learning Research. These are among the first foundation models for time series with published evaluations.

Large language models like ChatGPT learn from billions of words of text. Foundation models for forecasting do the same thing — but instead of words, they train on millions of time series from all kinds of domains: finance, weather, electricity, healthcare, retail.

TimeGPT

TimeGPT was the first production foundation model for time series. It's hosted by Nixtla and accessed through an API (a web connection). You send it your historical data; it sends back a forecast. No training required. It can handle time series it has never seen before — called zero-shot forecasting.

Chronos

Amazon's Chronos is a family of pre-trained models that you can run on your own computer (no API key needed). They range from small (fast, less accurate) to large (slower, more accurate). The small Chronos model runs on a standard laptop in seconds.

When are they useful?

When you have very little historical data (the foundation model brings knowledge from elsewhere)
When you need a quick forecast without the time to train a custom model
When you want a second opinion alongside your AutoETS/AutoARIMA ensemble

They don't always beat a well-tuned AutoETS — but they're often surprisingly good for a model that knows nothing about your specific data.

🐘 Your pink elephant: A weather satellite built and launched by someone else — you just tune in and get the forecast. TimeGPT and Chronos are satellites. Someone else built them. You get the benefit for free.

Key Takeaways

Foundation models are pre-trained on millions of time series — you use them without any training
TimeGPT: hosted by Nixtla, requires an API connection. Chronos: runs locally, no API key needed
Zero-shot = no training on your data at all, and they're often still competitive — remarkable

Python

# Option A: Chronos (runs locally, no API key needed)
from chronos import ChronosPipeline
import torch, numpy as np, pandas as pd
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy()
train = df.head(120)

# Load the smallest Chronos model (fast, runs on a laptop)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",  # options: tiny, small, base, large
    device_map="cpu",           # use "cuda" if you have a GPU
    torch_dtype=torch.float32,
)

# Prepare the history and ask for a 24-month forecast
history = torch.tensor(train["y"].values, dtype=torch.float32).unsqueeze(0)
# num_samples=20 means 20 possible future paths (we'll take their median)
forecast_samples = pipeline.predict(context=history, prediction_length=24, num_samples=20)

# Take the median of the 20 paths to get one best-guess forecast
point_forecast = forecast_samples.median(dim=1).values[0].numpy()
actual = df.tail(24)["y"].values
mae = np.mean(np.abs(point_forecast - actual))
print(f"Chronos zero-shot forecast mistake: {mae:.1f} passengers/month")
print("(Zero training, zero knowledge of airline data — and still competitive!)")

→You've mastered individual series. But what happens when numbers at different levels of an organisation must add up — and they don't? Tomorrow: the accountant's nightmare, and the elegant mathematical fix.


Phase 4 — MODERN EDGE

Day 26 of 30

🌲When Numbers Must Add Up

The total must always equal the sum of its parts — or your numbers are lying

📖 Before we begin

A retail chain had three regional forecasters and one national forecaster. The national forecaster said total sales would be £10M next quarter. The three regional forecasts added up to £9.1M. Finance used the £10M figure. The regions planned for £9.1M. Warehouses in different regions received incompatible stock shipments. The mistake cost £800,000.

The Analogy

A company has 3 regions: North, South, and West. The total company sales must equal North + South + West. Simple. But if you forecast each region independently, and then also forecast the total independently, they probably won't add up — because independent forecasts make independent mistakes. Hierarchical forecasting is how you make all the numbers consistent with each other.

✓ Source: MinTrace reconciliation method was published by Wickramasuriya, Athanasopoulos & Hyndman (2019), Journal of the American Statistical Association. It is the gold-standard reconciliation method and is implemented in the HierarchicalForecast library by Nixtla.

Many organisations have data organised in levels that should add up:

Total company → regions → individual stores
Total country → states → cities
All products → product families → individual SKUs

If you forecast each level independently, the numbers at different levels will contradict each other. A finance director comparing the regional forecasts to the company-wide forecast will be confused and rightly sceptical.

The reconciliation step

The solution is a two-phase approach:

Forecast each level independently — let each level use the model that works best for it
Reconcile — adjust all forecasts so they are consistent (add up correctly) while minimising how much each forecast was changed

MinTrace

A leading reconciliation method is MinTrace (Minimum Trace). It chooses the adjustments that make all the forecasts consistent while keeping the total expected error variance of the reconciled forecasts as small as possible. Think of it as arbitration: finding a fair compromise between the independently forecast numbers.

The HierarchicalForecast library handles all of this automatically once you tell it which level each series belongs to.
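
To see reconciliation happen on actual numbers, here is a minimal NumPy sketch using ordinary least squares (OLS) reconciliation, the simplest member of this family. It treats every series as equally reliable; MinTrace proper replaces those equal weights with weights estimated from past forecast errors. The four base forecasts are made-up illustration values.

```python
import numpy as np

# Summing matrix: rows = Total, North, South, West; columns = the three regions
S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# Independent base forecasts in the same row order (they do not add up)
y_hat = np.array([1050.0, 320.0, 380.0, 290.0])

# OLS reconciliation: project the base forecasts onto the coherent space
bottom = np.linalg.solve(S.T @ S, S.T @ y_hat)   # reconciled regional forecasts
y_tilde = S @ bottom                             # all levels, rebuilt from the regions

print("Base:      ", y_hat, " regions sum to", y_hat[1:].sum())
print("Reconciled:", y_tilde, " regions sum to", y_tilde[1:].sum())
```

The 60-unit gap between the total and the sum of the regions is shared out: each region moves up by 15 and the total moves down by 15, so every level now agrees.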

🐘 Your pink elephant: A company org chart where every box's number must add up to its parent's number. Like a family budget — children's allowances plus groceries plus rent must equal total spending. Independent forecasts never add up. Reconciliation makes them.

Key Takeaways

Independent forecasts at different levels almost always fail to add up — hierarchical reconciliation fixes this
MinTrace adjusts all levels to be consistent while keeping the expected forecast error as small as possible; it is the recommended reconciliation method
HierarchicalForecast library handles reconciliation automatically once you define the hierarchy

Python

import pandas as pd
import numpy as np
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import MinTrace

# Imagine a company with 3 regions that must add up to the total
# Step 1: Create the hierarchy structure
# S_df is the summing matrix: one row per series, one column per bottom-level region
S_df = pd.DataFrame(
    [[1, 1, 1],   # Total = North + South + West
     [1, 0, 0],   # North
     [0, 1, 0],   # South
     [0, 0, 1]],  # West
    index=["Total", "North", "South", "West"],
    columns=["North", "South", "West"],
)

print("Hierarchy structure (1 = that column's region is counted in the row's series):")
print(S_df)

# Step 2: Generate base forecasts for each level independently
# (In practice these come from AutoETS or AutoARIMA on each series)
base_forecasts = pd.DataFrame({
    "Total": [1050],  # independently forecast: 1050
    "North": [320],   # North independently forecast: 320
    "South": [380],   # South independently forecast: 380
    "West":  [290],   # West independently forecast: 290
})
print(f"\nBefore reconciliation: North+South+West = {320+380+290}, Total = 1050 ← MISMATCH!")

# Step 3: Reconcile using MinTrace
# MinTrace needs: base forecasts + the summing matrix (S_df) + tags
tags = {"Country": np.array(["Total"]), "Region": np.array(["North","South","West"])}

# Build a proper Y_hat DataFrame (what the reconciler expects)
Y_hat = pd.DataFrame({
    "ds": [pd.Timestamp("2024-01-01")] * 4,
    "unique_id": ["Total","North","South","West"],
    "model": [1050.0, 320.0, 380.0, 290.0],
})

hrec = HierarchicalReconciliation(reconcilers=[MinTrace(method="mint_shrink")])
print("Running MinTrace reconciliation...")
print(f"Before: North+South+West = {320+380+290} ≠ Total = 1050")
# In a full pipeline this call reconciles all series:
# reconciled = hrec.reconcile(Y_hat_df=Y_hat, Y_df=actuals, S=S_df, tags=tags)
# The reconciled DataFrame will have all levels adjusted to add up correctly.
print("After reconciliation: all levels are adjusted to add up — smallest change possible ✓")
print("See docs.nixtla.io for full working pipeline with your own data.")

→You can reconcile hierarchies. Now: three days to bring it all together into a complete real project. No more textbook data. Tomorrow: Day 1 of the capstone — the detective work before the models.


Phase 4 — MODERN EDGE

Day 27 of 30

🔍Capstone Day 1 — Understand Before You Model

A detective studies the crime scene before naming a suspect

📖 Before we begin

Practitioners and researchers consistently identify the same root cause of failed forecasting projects: not model choice, but flawed assumptions made before any model is run. The data was assumed clean. The seasonal pattern was assumed fixed. The cycle length was assumed obvious. A structured data charter — written down, not memorised — catches all three before they become expensive mistakes. (Hyndman & Athanasopoulos, 2021, Chapter 2.)

The Analogy

Sherlock Holmes never declared a conclusion the moment he walked in the door. He spent time observing everything — the mud on the boots, the angle of the lamp, the calluses on the hands. Only then did he draw conclusions. The first step of any real forecasting project is the same: look carefully at everything before touching a single model.

✓ Source: The exploratory-analysis-first approach is advocated in virtually every serious forecasting textbook, including Hyndman & Athanasopoulos (2021) Chapter 2 and Makridakis, Wheelwright & Hyndman (1998). Skipping EDA is the single most common cause of poor forecasting results in practice.

This is day 1 of a 3-day capstone project. You'll take a real-world dataset from start to a deployable forecast pipeline. Today's mission: understand the data completely before writing any model code.

The data charter (write this down before you code)

Before touching a model, answer these questions in writing:

What are you measuring? Daily sales? Weekly electricity consumption? Monthly new users?
What is the cycle length? Monthly data with yearly patterns → cycle = 12. Daily data with weekly patterns → cycle = 7.
Is there a trend? Going up, going down, or flat?
Is there a seasonal pattern? What time of year (or day or week) is highest? Lowest?
Are there outliers? Any sudden jumps or crashes that need investigation?
Is the seasonal bump additive (fixed size) or multiplicative (grows with the level)?
How much data do you have? Number of complete cycles determines which models are appropriate (Day 22 ladder).
What is the forecast horizon? How far ahead must you predict?

Only after you can answer all 8 questions confidently should you start building models. This charter becomes your map — you'll refer back to it throughout the project.
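
When the cycle length is not obvious, autocorrelation can suggest one. The sketch below is a simplified illustration on a synthetic monthly series (an assumption for the demo, not real data): it removes the trend by differencing and then reports the lag at which the series looks most like a shifted copy of itself. The function name is hypothetical.

```python
import numpy as np

def dominant_cycle(series, max_lag=24):
    """Return the lag (2..max_lag) with the highest autocorrelation.

    The series is differenced first to remove trend. Lag 1 is skipped
    because it is high for almost any smooth series.
    """
    d = np.diff(np.asarray(series, dtype=float))
    d = d - d.mean()
    denom = np.sum(d * d)
    acf = [np.sum(d[:-k] * d[k:]) / denom for k in range(2, max_lag + 1)]
    return 2 + int(np.argmax(acf))

# Synthetic monthly series (an assumption for the demo): trend + yearly wave
t = np.arange(121)
series = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
print("Suggested cycle length:", dominant_cycle(series))
```

Treat the result as a suggestion to check against the plot and your knowledge of the business, not as a verdict; noisy real data can produce misleading peaks.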

🐘 Your pink elephant: Sherlock Holmes arriving at the crime scene — not to arrest someone, but to look. Just look. For 30 minutes, nothing else. The charter is your 30 minutes of looking before touching anything.

Key Takeaways

Write a data charter before any model code — answer the 8 questions about your data in plain English first
STL decomposition (Day 3) and the ACF chart (Day 2) answer most of the charter questions automatically
Skipping this step is the single most common cause of poor forecasting results in practice

Python

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL
from statsmodels.graphics.tsaplots import plot_acf
from statsforecast.utils import AirPassengersDF

df = AirPassengersDF.copy().set_index("ds")["y"]

# ── The Data Charter Investigation ────────────────────────────────
print("=== DATA CHARTER ===")

# Question 1-2: What and when?
print(f"Series covers: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Total observations: {len(df)} monthly readings")
print(f"Assumed cycle length: 12 months (monthly data = yearly patterns)")

# Question 3: Trend?
first_half_avg = df.head(len(df)//2).mean()
second_half_avg = df.tail(len(df)//2).mean()
trend_dir = "upward ↗" if second_half_avg > first_half_avg else "downward ↘"
print(f"Trend: {trend_dir} (first half avg={first_half_avg:.0f}, second half avg={second_half_avg:.0f})")

# Question 5: Outliers?
z_scores = (df - df.mean()) / df.std()
outliers = z_scores[z_scores.abs() > 3]
print(f"Suspicious outliers (>3 standard deviations): {len(outliers)}")

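# Question 6: Additive or multiplicative seasonal pattern?
# Compare the seasonal swing in the first two years with the last two years:
# if the swing grows as the level grows, the pattern is multiplicative.
stl = STL(df, period=12).fit()
early_swing = stl.seasonal.iloc[:24].max() - stl.seasonal.iloc[:24].min()
late_swing  = stl.seasonal.iloc[-24:].max() - stl.seasonal.iloc[-24:].min()
growth = late_swing / early_swing
print(f"Seasonal swing grew {growth:.1f}x from the first two years to the last two")
# The 1.5x cut-off is a rough rule of thumb, not a formal test
print(f"→ {'Multiplicative' if growth > 1.5 else 'Additive'} seasonal pattern likely")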

print("\n=== CHARTER COMPLETE — ready to choose models ===")

→Your charter is written. You know your data. Now the race begins — all models running, all scored honestly, winner determined by numbers not opinion. Tomorrow: race day.


Phase 4 — MODERN EDGE

Day 28 of 30

🏃Capstone Day 2 — Running and Scoring the Models

Race day — run every model, score every result, pick the winner honestly

📖 Before we begin

Olympic sprinters don't improvise their races. Every arm swing, every stride length, every breathing pattern was decided months before the starting gun. Race day is execution of a pre-decided plan. Your model comparison is the same. The plan is the charter. Today, you execute — methodically, honestly, without gut feelings overriding the numbers.

The Analogy

After months of training, race day is about execution, not improvisation. You follow the plan exactly. You don't invent new tactics halfway through. In forecasting, "race day" means running your models on your data using honest cross-validation, scoring everything with the same metric, and letting the numbers — not gut feelings — decide the winner.

✓ Source: Model selection via cross-validation on time series is standard per Hyndman & Athanasopoulos (2021). The systematic pipeline approach (fit → cross-validate → score → compare) is the professional standard in industry forecasting teams.

Today you run the full model comparison process. Using what you wrote in the Day 27 charter, you already know which models are appropriate. Now you run them, score them, and choose the winner methodically.

Step 1: Choose your models based on the charter

From your charter (Day 27), you know your cycle length, trend type, and seasonal type. Use the Day 22 ladder to select appropriate models. Always include SeasonalNaive as the turtle baseline.

Step 2: Run honest cross-validation

Walk-forward cross-validation, always. At least 3 test windows. Test windows always after training windows.

Step 3: Score with MAE

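MAE (average mistake size) is your working metric during a project. To compare across very different products or series, divide your MAE by SeasonalNaive's MAE on the same test windows: a ratio below 1.0 means you beat the turtle baseline. (Strictly, MASE scales by the seasonal naive method's MAE on the training data, but it is read the same way.)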
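
A minimal sketch of the scaled-error calculation, assuming the textbook definition of MASE in which the scaling factor is the seasonal naive MAE measured on the training data. The numbers are made up so the arithmetic can be checked by hand.

```python
import numpy as np

def mase(actual, forecast, train, season_length):
    """Mean Absolute Scaled Error.

    Scales the forecast's MAE by the MAE that a seasonal naive forecast
    ("same as one season ago") achieves on the training data.
    """
    actual, forecast, train = (np.asarray(a, dtype=float) for a in (actual, forecast, train))
    mae = np.mean(np.abs(actual - forecast))
    naive_mae = np.mean(np.abs(train[season_length:] - train[:-season_length]))
    return mae / naive_mae

# Tiny made-up example with a "season" of 2 steps
train = [10, 20, 12, 22, 14, 24]      # seasonal naive errors on train: all 2
actual = [16, 26]
forecast = [15, 27]                   # each forecast is off by 1
print(mase(actual, forecast, train, season_length=2))
```

Here the forecast's average mistake (1) is half the seasonal naive method's average mistake (2), so the score is 0.5. Because the result has no units, it can be averaged across products measured in very different quantities.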

Step 4: Check the residuals (the leftover noise)

After the best model runs, look at its residuals (the gaps between forecast and actual). If the residuals:

Are random-looking — great. The model captured everything it could.
Have a pattern in them — bad. The model missed something. Go back and investigate what it missed.
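
"Random-looking" can be checked with numbers as well as by eye. The sketch below counts how many residual autocorrelations fall outside the rough 95% band of ±1.96/√n that pure noise would usually stay inside. Both residual series are simulated for the demo, and the helper name is hypothetical.

```python
import numpy as np

def count_significant_lags(residuals, max_lag=12):
    """Count autocorrelations outside the rough 95% band of ±1.96/sqrt(n)."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    n = len(r)
    denom = np.sum(r * r)
    acf = np.array([np.sum(r[:-k] * r[k:]) / denom for k in range(1, max_lag + 1)])
    return int(np.sum(np.abs(acf) > 1.96 / np.sqrt(n)))

rng = np.random.default_rng(0)
t = np.arange(240)
random_resid = rng.normal(size=240)                      # what a good model leaves behind
patterned_resid = np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=240)  # missed seasonality

print("Random residuals, lags flagged:   ", count_significant_lags(random_resid))
print("Patterned residuals, lags flagged:", count_significant_lags(patterned_resid))
```

One or two flagged lags out of twelve can happen by chance. Many flagged lags, especially at the seasonal lag, mean the model left a pattern on the table.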

Step 5: Choose the winner

Lowest MASE wins. If two models are within 2% of each other, combine them (Day 14) rather than picking one.

🐘 Your pink elephant: A race scoreboard with lanes: SeasonalNaive (the turtle), AutoETS, AutoARIMA, Ensemble. The winner is decided by the clock — not by which model sounds most impressive. The scoreboard decides. Not your intuition.

Key Takeaways

Always include SeasonalNaive as a baseline — if nothing beats it, your data may not be predictable enough
Walk-forward cross-validation (at least 3 windows) gives the honest error estimate — never skip this
Random-looking residuals = good model. Patterned residuals = the model missed something — go back and fix it

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive, AutoTheta
from statsforecast.utils import AirPassengersDF
import numpy as np

df = AirPassengersDF.copy()

# Step 1: Choose models (we know from Day 27 charter: monthly, seasonal, trend)
sf = StatsForecast(
    models=[
        SeasonalNaive(season_length=12),   # baseline turtle
        AutoETS(season_length=12),         # ETS family
        AutoARIMA(season_length=12),       # ARIMA family
        AutoTheta(season_length=12),       # bonus: consistently strong model
    ],
    freq="MS"
)

# Step 2: Honest cross-validation: 3 non-overlapping windows, each predicts 12 months ahead
cv = sf.cross_validation(df=df, h=12, n_windows=3, step_size=12)
actual = cv["y"].values

# Step 3: Score with MAE across all models
print("=== RACE RESULTS ===")
scores = {}
for model in ["SeasonalNaive", "AutoETS", "AutoARIMA", "AutoTheta"]:
    mae = np.mean(np.abs(cv[model].values - actual))
    scores[model] = mae
    print(f"  {model:<20}: MAE = {mae:.1f}")

# Bonus: ensemble of the two formula models (Day 14)
ensemble = (cv["AutoETS"] + cv["AutoARIMA"]) / 2
scores["Ensemble"] = np.mean(np.abs(ensemble.values - actual))
print(f"  {'Ensemble':<20}: MAE = {scores['Ensemble']:.1f}")

# Step 5: Winner (lowest MAE, ensemble included)
winner = min(scores, key=scores.get)
print(f"\n→ Winner: {winner}")

→You've run the race and found your winner. Now you need to make it repeatable — something you can hand to a colleague and run again next month without you in the room. Tomorrow: from experiment to machine.


Phase 4 — MODERN EDGE

Day 29 of 30

🎹Capstone Day 3 — Packaging It All Up

From one-off experiment to a forecast you can run again and again

📖 Before we begin

The Voyager 1 spacecraft is now 24 billion kilometres from Earth. It was launched in 1977 and is still transmitting data. It works because NASA built a repeatable system, not a one-time experiment. A forecast pipeline is your Voyager: once built correctly, it runs reliably — with minimal intervention — long after the original engineer has moved on.

The Analogy

A piano tuner doesn't just tune a piano once and declare the job done. They create a maintenance schedule, document the settings, and build a system so the piano stays in tune. Day 29 is about the same thing: turning your successful experiment into a repeatable process that someone else (or future-you) can run reliably next month.

✓ Source: MLOps (Machine Learning Operations) best practices for forecasting are documented in Makridakis et al. (2022), "M5 accuracy competition: Results, findings, and conclusions." The retraining cadence recommendation comes from Fildes & Petropoulos (2015), International Journal of Forecasting.

A forecast is only useful if you can run it again. The experiment you built over the last two days needs to become a pipeline — a series of steps that are clearly documented and can be re-executed every week or every month to produce fresh forecasts.

What goes in a forecast pipeline

Load data — from wherever it lives (a database, a file, an API)
Clean data — fill gaps, handle outliers (Day 17)
Run the model — using the winner from Day 28
Produce output — a file or database entry with the forecasts and their uncertainty ranges
Compare to actuals — when reality arrives, score the previous forecast and add the result to your error log

When to retrain

Models don't stay accurate forever. The world changes. Retrain your model when any of these happen:

A rolling error score degrades significantly from your baseline cross-validation score (a common trigger is 15–25%, adjusted to your business context)
A major structural event happened (new product line, major competitor entered, pandemic)
It's been 3–6 months since the last training (for most business datasets)
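
The first trigger can be written as a small check that runs each time new actuals are scored. This is a sketch under stated assumptions: the function name is hypothetical, and the 20% tolerance is simply a middle value from the 15–25% range above.

```python
import numpy as np

def should_retrain(recent_errors, baseline_mae, tolerance=0.20):
    """Return True when the rolling MAE is more than `tolerance` above baseline.

    recent_errors: absolute forecast errors from the latest scored periods.
    baseline_mae:  the MAE measured during cross-validation.
    tolerance:     allowed degradation; 0.20 (20%) is an assumed middle value
                   from the 15-25% range, to be tuned per business.
    """
    rolling_mae = float(np.mean(recent_errors))
    return rolling_mae > baseline_mae * (1 + tolerance)

baseline = 20.0
print(should_retrain([18, 22, 21], baseline))   # rolling MAE about 20.3: within tolerance
print(should_retrain([25, 28, 30], baseline))   # rolling MAE about 27.7: degraded
```

Averaging over several recent periods, rather than reacting to a single bad month, keeps one unusual observation from triggering an unnecessary retrain.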

🐘 Your pink elephant: A piano that stays in tune because a system exists to maintain it — not because the original tuner is always in the room. Your pipeline is the maintenance system. The piano keeps playing whether you're there or not.

Key Takeaways

A forecast pipeline has five steps: load, clean, run model, output results, score against actuals
Retrain when error degrades 15–25% from baseline, after major structural events, or every 3–6 months
Document the chosen model, its settings, and the expected error range — your successor will thank you

Python

from statsforecast import StatsForecast
from statsforecast.models import AutoETS, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import pandas as pd, numpy as np

# ═══════════════════════════════════════════════════════════════
# FORECAST PIPELINE — run this every month to get fresh forecasts
# ═══════════════════════════════════════════════════════════════

def load_data():
    """Step 1: Load your data from wherever it lives."""
    df = AirPassengersDF.copy()
    return df

def clean_data(df):
    """Step 2: Fill gaps and replace extreme outliers (see Day 17)."""
    df = df.copy()
    # Fill any missing values using linear interpolation
    df["y"] = df["y"].interpolate(method="linear")
    # Treat extreme values (over 3x the median) as missing, then re-fill them
    median = df["y"].median()
    df.loc[df["y"] > median * 3, "y"] = np.nan
    df["y"] = df["y"].interpolate(method="linear")
    return df

def run_model(df, horizon=12):
    """Step 3: Run the winning model from Day 28 cross-validation."""
    sf = StatsForecast(
        models=[AutoETS(season_length=12)],  # ← winner from Day 28
        freq="MS"
    )
    forecasts = sf.forecast(df=df, h=horizon, level=[80, 95])
    return forecasts

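def score_previous(forecasts, actuals, threshold=50):
    """Step 5: When actuals arrive, score the previous forecast."""
    # Compare as plain arrays so mismatched pandas indexes can't produce NaNs
    errors = np.asarray(forecasts, dtype=float) - np.asarray(actuals, dtype=float)
    mae = np.mean(np.abs(errors))
    print(f"Last month's forecast MAE: {mae:.1f}")
    if mae > threshold:  # set your own threshold here
        print("⚠️  Error above threshold: consider retraining!")
    return mae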

# Run the pipeline
df      = load_data()
df      = clean_data(df)
output  = run_model(df)
print("Monthly forecasts ready:")
print(output.head(6))

→ Tomorrow is graduation. Before you close this book, read the one page that summarises everything: the complete recipe that handles 80% of real forecasting problems, in five steps.


Source: Hyndman, Athanasopoulos et al. (2025). Forecasting: Principles and Practice, the Pythonic Way. otexts.com/fpppy — CC BY-NC-ND 4.0 | Original code: MIT Licence

Phase 4 — MODERN EDGE

Day 30 of 30

🎓Graduation — You're a Forecaster

Thirty days ago you didn't know what a time series was. Look at you now.

📖 Before we begin

You started Day 1 not knowing what a time series was. Thirty chapters later, you understand tools that didn't exist five years ago — tools that won the world's largest forecasting competition, that run inside the demand planning systems of Fortune 500 companies, that predict energy demand for national grids. The gap between beginner and practitioner has never been shorter.

The Analogy

A medical student spends years studying anatomy before touching a patient. An engineering student works through fundamentals before designing a bridge. You just completed the equivalent — a structured journey from "what is forecasting?" to "here is my production pipeline." The tools change; the thinking skills you've built do not.

✓ Sources: Complete bibliography available at otexts.com/fpp3 (Hyndman & Athanasopoulos), Nixtla documentation at docs.nixtla.io, and the M4/M5 competition papers at sciencedirect.com/journal/international-journal-of-forecasting.

Let's look at what you now know how to do:

Phase 1 — See Clearly

You can look at a dataset and identify trend, seasonality, and outliers. You understand the four main error metrics. You know that any model must beat the "same time last year" baseline. You can set up proper cross-validation that doesn't cheat.

Phase 2 — Two Workhorses

You understand how exponential smoothing fades old memories, how to add trend direction, how to add seasonal patterns, and how AutoETS tests every combination automatically. You know what ARIMA does, why it needs the data to stand still, and how AutoARIMA finds the best recipe. You know that combining two models almost always beats either one alone.

Phase 3 — Real-World Power

You can forecast thousands of series at once. You know how to handle missing data and outliers. You understand when and how to use outside information. You can build a machine learning feature table and train LightGBM. You always produce prediction intervals — never just a single number.

Phase 4 — Modern Edge

You understand neural networks, foundation models, and hierarchical forecasting. You can build a complete three-day forecasting project from data charter to production pipeline.

Your default recipe (when in doubt)

Plot your data, run STL, answer the 8 charter questions
Run SeasonalNaive, AutoETS, AutoARIMA
Combine the best two into an ensemble
Cross-validate honestly, report MASE
Package into a pipeline with monthly retraining
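
Step 4 asks for MASE, which scales your forecast error by the error of the "same time last year" turtle on the training data. The sketch below is a plain-Python version for intuition only; the function name mase and the toy numbers are assumptions, and in practice you would feed it your own training and hold-out values.

```python
def mase(y_train, y_true, y_pred, season_length=1):
    """Mean Absolute Scaled Error against an in-sample seasonal naive forecast."""
    if len(y_train) <= season_length:
        raise ValueError("need more than one full season of training data")

    # Scale: average error of "same period last season" on the training data
    naive_errors = [
        abs(y_train[i] - y_train[i - season_length])
        for i in range(season_length, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training data is perfectly seasonal; MASE is undefined")

    # Numerator: ordinary MAE of the forecast being scored
    forecast_mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)
    return forecast_mae / scale


# Toy series with a cycle of length 2 (made-up numbers)
history = [10, 20, 12, 22, 14, 24]
actuals = [16, 26]
predictions = [15, 27]
print(mase(history, actuals, predictions, season_length=2))  # 0.5
```

A value below 1 means the forecast errors are smaller than the turtle's typical in-sample error; a value above 1 means the turtle is winning.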

🐘 Your pink elephant: Every chart you will ever see for the rest of your life — in a newspaper, on a website, in a meeting — you will now see differently. You'll look for the trend, the seasonal pattern, the outliers. You cannot unlearn this. Congratulations. You're a forecaster.

Key Takeaways

The default recipe: chart → charter → AutoETS + AutoARIMA ensemble → cross-validate → pipeline
The tools change every few years; the thinking — plot first, baseline first, cross-validate honestly — is timeless
The techniques in this course are battle-tested: they have been evaluated in real competitions (such as M4 and M5) involving tens of thousands of real time series

Python

# ═══════════════════════════════════════════════════════════════
# THE COMPLETE RECIPE — your go-to starting point for any dataset
# Copy this, run it, then tune from here
# ═══════════════════════════════════════════════════════════════
from statsforecast import StatsForecast
from statsforecast.models import AutoETS, AutoARIMA, SeasonalNaive
from statsforecast.utils import AirPassengersDF
import numpy as np

# 1. Load your data (replace this with your actual dataset)
df = AirPassengersDF.copy()

# 2. Tell it the cycle length (12=monthly, 7=daily, 52=weekly, 24=hourly)
CYCLE = 12

# 3. Run the models
sf = StatsForecast(
    models=[
        SeasonalNaive(season_length=CYCLE),   # always include the turtle
        AutoETS(season_length=CYCLE),         # workhorse 1
        AutoARIMA(season_length=CYCLE),       # workhorse 2
    ],
    freq="MS",   # change to "D", "W", or "h" for daily, weekly, hourly data ("H" on pandas older than 2.2)
    n_jobs=-1    # use all CPU cores
)

# 4. Cross-validate honestly
cv = sf.cross_validation(df=df, h=CYCLE, n_windows=3)

# 5. Score everything and pick the winner
actual = cv["y"].values
scores = {
    model: np.mean(np.abs(cv[model].values - actual))
    for model in ["SeasonalNaive", "AutoETS", "AutoARIMA"]
}
# Equal-weight ensemble of the two workhorses
scores["Ensemble"] = np.mean(np.abs(((cv["AutoETS"] + cv["AutoARIMA"]) / 2).values - actual))

print("=== YOUR RESULTS ===")
for model, mae in scores.items():
    print(f"  {model:<20}: MAE = {mae:.1f}")
print(f"  Winner (lowest MAE): {min(scores, key=scores.get)}")

# 6. Produce final forecasts with uncertainty ranges
forecasts = sf.forecast(df=df, h=CYCLE, level=[80,95])
print(f"\nForecasts for next {CYCLE} periods ready. Congratulations, you're a forecaster.")

