Transforming Time Series Data into Supervised Learning Datasets with Pandas shift() and series_to_supervised()
This tutorial explains how to convert single‑variable and multi‑variable time‑series data into a supervised‑learning format using Pandas' shift() function and a custom series_to_supervised() helper, covering one‑step, multi‑step, and multivariate forecasting examples with complete Python code.
Time Series vs. Supervised Learning
Before starting, we clarify the data formats of time series and supervised learning. A time series is an ordered list of numeric values indexed by time, e.g.:
0
1
2
3
4
5
6
7
8
9
A supervised‑learning problem consists of input (X) and output (y) pairs that an algorithm learns to map, for example:
X, y
1, 2
2, 3
3, 4
4, 5
5, 6
6, 7
7, 8
8, 9
Pandas shift() Function
The shift() method is essential for converting a time‑series DataFrame into a supervised‑learning structure. It creates lagged (past) or forward (future) copies of a column: values move down or up by the given number of rows, and the positions left empty at the start or end are filled with NaN.
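Both effects can be checked directly. The sketch below is illustrative only (the variable names are chosen here, not taken from the tutorial): it shifts a short Series in each direction and confirms that a single NaN appears at the start or end, which also forces the integer values to be stored as floats.

```python
from pandas import Series

s = Series(range(5))       # 0, 1, 2, 3, 4 stored as integers

lagged = s.shift(1)        # past values: NaN appears in the first row
lead = s.shift(-1)         # future values: NaN appears in the last row

print(lagged.tolist())     # first element is NaN
print(lead.tolist())       # last element is NaN

# NaN cannot be stored in an integer column, so pandas converts to float.
print(s.dtype.kind, lagged.dtype.kind, lead.dtype.kind)
```

This is why the shifted columns in the outputs below print as 0.0, 1.0, and so on, while the original column stays integer.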
Example of creating a simple series:
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)
Result:
   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
Shifting the column forward by one step creates a lag observation column:
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)
print(df)
Result (first row contains NaN and is later dropped):
   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0
Shifting by a negative integer pulls future values up and leaves NaN at the end, which is useful for creating forecast columns:
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t+1'] = df['t'].shift(-1)
print(df)
Result (last row contains NaN):
   t  t+1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  5.0
5  5  6.0
6  6  7.0
7  7  8.0
8  8  9.0
9  9  NaN
The series_to_supervised() Function
To automate the creation of supervised‑learning datasets from time series, the tutorial defines a reusable series_to_supervised() function. It accepts four parameters:
data: list or 2‑D NumPy array of observations (required)
n_in: number of lag observations to use as inputs (default 1)
n_out: number of observations to use as outputs, starting at the current time t (default 1)
dropnan: whether to drop rows containing NaN values (default True)
The function builds lagged input columns and forward output columns, concatenates them, optionally removes NaNs, and returns the resulting DataFrame.
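The steps in that description can be traced by hand for a single lag before reading the general function. The following is a minimal sketch, not the tutorial's own code; the sample values and the names lagged, current, and agg are chosen here for illustration.

```python
from pandas import DataFrame, concat

# Illustrative values (assumed for this sketch, not from the tutorial).
df = DataFrame({'var1': [10, 20, 30, 40]})

# Step 1: lagged input column, so the value at t-1 sits on row t.
lagged = df.shift(1)
# Step 2: output column, the value at t itself (a shift of 0 changes nothing).
current = df.shift(0)

# Step 3: place the columns side by side and label them.
agg = concat([lagged, current], axis=1)
agg.columns = ['var1(t-1)', 'var1(t)']

# Step 4: drop the first row, whose lag value is NaN.
agg = agg.dropna()
print(agg)
```

The function generalizes these four steps to any number of lags, output steps, and variables.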
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Transform a time series into a supervised learning dataset.

    Parameters:
        data: sequence of observations (list or NumPy array).
        n_in: number of lag observations (X).
        n_out: number of forecast observations (y).
        dropnan: whether to drop rows with NaN values.
    Returns:
        Pandas DataFrame ready for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, …, t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, …, t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg
One‑Step Univariate Forecast
Using the default parameters (n_in=1, n_out=1), the function creates a DataFrame where column var1(t-1) is the lag input and var1(t) is the target:
   var1(t-1)  var1(t)
1        0.0        1
2        1.0        2
3        2.0        3
4        3.0        4
5        4.0        5
6        5.0        6
7        6.0        7
8        7.0        8
9        8.0        9
Multi‑Step (Sequence) Forecast
Setting n_in=2, n_out=2 creates two lag inputs, var1(t-2) and var1(t-1), and two outputs, var1(t) and var1(t+1), which is useful for sequence‑to‑sequence prediction:
   var1(t-2)  var1(t-1)  var1(t)  var1(t+1)
2        0.0        1.0        2        3.0
3        1.0        2.0        3        4.0
4        2.0        3.0        4        5.0
5        3.0        4.0        5        6.0
6        4.0        5.0        6        7.0
7        5.0        6.0        7        8.0
8        6.0        7.0        8        9.0
Multivariate Forecast
When the original data contains multiple columns (e.g., ob1 and ob2), the same function produces lagged and forecast columns for each variable, enabling models that predict one or several series simultaneously:
   var1(t-1)  var2(t-1)  var1(t)  var2(t)
1        0.0       50.0        1       51
2        1.0       51.0        2       52
3        2.0       52.0        3       53
4        3.0       53.0        4       54
5        4.0       54.0        5       55
6        5.0       55.0        6       56
7        6.0       56.0        7       57
8        7.0       57.0        8       58
9        8.0       58.0        9       59
By adjusting n_in and n_out, you can experiment with various input‑output window sizes to find the configuration that yields the best forecasting performance on your own dataset.
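One way to run that experiment is sketched below. It repeats the series_to_supervised() definition from above so the block runs on its own, then applies it to the three cases covered in this tutorial. The input values are assumptions reconstructed from the printed tables (0 to 9 for the single series, 50 to 59 for the second variable), and the final split into X and y is an illustrative choice rather than part of the original function.

```python
from pandas import DataFrame, concat


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Same logic as the tutorial's function, repeated so this block is self-contained."""
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = [], []
    # lagged inputs: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    # outputs: t, t+1, ..., t+n_out-1
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += ['var%d(t)' % (j + 1) for j in range(n_vars)]
        else:
            names += ['var%d(t+%d)' % (j + 1, i) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg


# Assumed inputs, inferred from the tables in this tutorial.
values = [x for x in range(10)]

one_step = series_to_supervised(values)                     # defaults: n_in=1, n_out=1
multi_step = series_to_supervised(values, n_in=2, n_out=2)

# Two variables: pass the DataFrame's 2-D values array.
raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
multivariate = series_to_supervised(raw.values, n_in=1, n_out=1)

for name, frame in [('one-step', one_step),
                    ('multi-step', multi_step),
                    ('multivariate', multivariate)]:
    print(name, frame.shape, list(frame.columns))

# Illustrative X/y split for the one-step case: lag columns are the inputs.
X = one_step[['var1(t-1)']].values
y = one_step['var1(t)'].values
```

Each extra lag or output step removes one more row once NaN rows are dropped, so larger windows leave fewer training examples on a short series.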
Python Programming Learning Circle
