Time Series Data Preprocessing: Missing Value Imputation, Denoising, and Outlier Detection
This article explains essential time series preprocessing techniques—including data sorting, handling missing values with interpolation methods, applying rolling averages, Fourier transform denoising, and detecting anomalies using rolling statistics, isolation forests, and K‑means clustering—illustrated with Python code on the AirPassengers and Google stock datasets.
Time series data appear everywhere, and proper preprocessing is crucial for accurate modeling.
We first define a time series as a uniformly spaced sequence of observations, e.g., monthly gold prices, and emphasize the importance of sorting and converting timestamps to datetime objects.
Using the Kaggle AirPassengers dataset, we demonstrate data loading and sorting:
import pandas as pd
passenger = pd.read_csv('AirPassengers.csv')
passenger['Date'] = pd.to_datetime(passenger['Date'])
passenger.sort_values(by=['Date'], inplace=True, ascending=True)
Missing values in time series require special interpolation methods because order matters. We apply three techniques: time‑based interpolation, spline (order 3), and linear interpolation, and visualize the results.
import matplotlib.pyplot as plt
# Linear and spline interpolation run on the default integer index (spline requires SciPy).
passenger['Linear'] = passenger['Passengers'].interpolate(method='linear')
passenger['Spline order 3'] = passenger['Passengers'].interpolate(method='spline', order=3)
# method='time' only works with a DatetimeIndex, so interpolate on a Date-indexed copy.
by_date = passenger.set_index('Date')['Passengers']
passenger['Time'] = by_date.interpolate(method='time').to_numpy()
methods = ['Linear', 'Spline order 3', 'Time']
for method in methods:
    plt.figure(figsize=(12, 4), dpi=80, linewidth=10)
    plt.plot(passenger["Date"], passenger[method])
    plt.title('Air Passengers Imputation using: ' + method)
    plt.xlabel('Years', fontsize=14)
    plt.ylabel('Number of Passengers', fontsize=14)
    plt.show()
All three methods fill short gaps well but struggle with long runs of consecutive missing values.
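The difference between short and long gaps can be seen in a small experiment. The following is a minimal sketch on a synthetic monthly series with a yearly cycle, not the AirPassengers data; the series, the gap positions, and the variable names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle (illustrative data only).
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
truth = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)

gappy = truth.copy()
gappy.iloc[5] = np.nan        # short gap: a single month
gappy.iloc[20:32] = np.nan    # long gap: twelve consecutive months

# Time-based interpolation needs a DatetimeIndex, which this series has.
filled = gappy.interpolate(method="time")

# Compare the filled values against the known truth in each gap.
short_err = abs(filled.iloc[5] - truth.iloc[5])
long_err = (filled.iloc[20:32] - truth.iloc[20:32]).abs().max()
print(f"short-gap error: {short_err:.2f}, worst long-gap error: {long_err:.2f}")
```

Across the long gap the interpolation draws a nearly straight line between the two surviving neighbours, so the seasonal peak inside the gap is lost entirely, while the single missing month is recovered closely.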
For denoising, we discuss rolling averages and the Fourier transform. The rolling mean replaces each value with the average of the most recent observations in a fixed window (20 here), illustrated on Google stock prices:
rolling_google = google_stock_price['Open'].rolling(20).mean()
plt.plot(google_stock_price['Date'], google_stock_price['Open'])
plt.plot(google_stock_price['Date'], rolling_google)
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend(['Open','Rolling Mean'])
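plt.show()
Fourier‑transform denoising instead moves the series into the frequency domain, suppresses weak frequency components, and transforms the result back. The snippet below calls fft_denoiser and uses the variables value and time, none of which are defined in this text:
denoised_google_stock_price = fft_denoiser(value, 0.001, True)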
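The fft_denoiser helper called above is not defined in this text. A minimal sketch of what such a helper might look like follows, assuming it zeroes out frequency components whose power falls below a threshold and that its three arguments are the signal, the threshold, and a flag for returning real values. The body, the parameter names, and the synthetic test signal are assumptions, not the article's own implementation.

```python
import numpy as np

def fft_denoiser(x, threshold, to_real=True):
    """Hypothetical helper: drop frequency components with power <= threshold."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Power of each frequency component, scaled by the series length.
    power = (spectrum * np.conj(spectrum)).real / n
    # Keep only the strong components and zero out the rest.
    spectrum = np.where(power > threshold, spectrum, 0)
    cleaned = np.fft.ifft(spectrum)
    return cleaned.real if to_real else cleaned

# Synthetic check: a 5-cycle sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
denoised = fft_denoiser(noisy, threshold=10.0)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before: {mse_noisy:.4f}, after: {mse_denoised:.4f}")
```

The threshold depends on the scale of the data, so a value such as 0.001 only makes sense for a series on a matching scale; in practice it is usually chosen by inspecting the power spectrum.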
plt.plot(time, google_stock_price['Open'][0:300])
plt.plot(time, denoised_google_stock_price)
plt.xlabel('Date', fontsize=13)
plt.ylabel('Stock Price', fontsize=13)
plt.legend(['Open','Denoised: 0.001'])
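plt.show()
Outlier detection methods include rolling statistics, isolation forest, and K‑means clustering. Rolling statistics define dynamic upper and lower bounds from a moving window (for example, the rolling mean plus or minus a multiple of the rolling standard deviation) and flag points that fall outside them. An isolation forest isolates anomalies with random decision‑tree partitions, relying on the fact that anomalous points need fewer splits to separate, while K‑means clusters the points and flags those far from their nearest centroid.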
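The rolling-statistics approach can be sketched as follows. This is a minimal illustration on synthetic data; the function name rolling_outliers, the 20-point window, and the multiplier k=3 are illustrative assumptions rather than values taken from the article.

```python
import numpy as np
import pandas as pd

def rolling_outliers(series, window=20, k=3.0):
    """Flag points outside rolling mean +/- k * rolling std of the preceding window."""
    # shift(1) keeps the current point out of its own statistics, so a large
    # spike cannot inflate the bounds it is being compared against.
    mean = series.rolling(window).mean().shift(1)
    std = series.rolling(window).std().shift(1)
    upper = mean + k * std
    lower = mean - k * std
    # Comparisons against NaN bounds (the first `window` points) are False.
    return (series > upper) | (series < lower)

# Synthetic series: Gaussian noise around 100 with one injected spike.
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=100.0, scale=1.0, size=200))
values.iloc[150] += 15.0

flags = rolling_outliers(values)
print(values[flags])
```

Because the bounds move with the data, this handles trending series better than a single global threshold, although a small number of ordinary points may still be flagged when the window's standard deviation happens to be low.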
Finally, we list possible interview questions related to time series preprocessing, such as methods for handling missing values, meaning of a time‑series window, explanation of isolation forest, purpose of Fourier transform, and various imputation techniques.
The article concludes that applying these preprocessing steps—sorting, interpolation, denoising, and outlier detection—ensures high‑quality data ready for building complex models.
Python Programming Learning Circle