
Time Series

Extract Dates from Text with Datefinder

Motivation

Extracting dates from unstructured text can be a frustrating and error-prone task when dates appear in varying formats. For example:

# Example Input
string_with_dates = """
We have one meeting on May 17th, 2021 at 9:00am
and another meeting on 5/18/2021 at 10:00.
I hope you can attend one of the meetings.
"""

# Traditional string processing to find dates
import re

# Basic regex for dates
pattern = r"(\b\d{1,2}/\d{1,2}/\d{4}\b)|(\b\w+\s\d{1,2}(st|nd|rd|th)?,\s\d{4}\b)"
matches = re.findall(pattern, string_with_dates)
print(f"Matches: {[match[0] or match[1] for match in matches]}") # Limited and inflexible

Output:

Matches: ['May 17th, 2021', '5/18/2021']

Using basic regular expressions, the extracted dates are limited and often incomplete, especially when handling a variety of date formats. This makes it difficult to consistently extract and process dates from large, diverse datasets.

Introduction to Datefinder

Datefinder is a Python library designed to simplify the extraction of dates from text. It intelligently detects date-like strings and converts them into Python datetime objects, handling a wide range of formats automatically.

To install Datefinder, simply use the following command:

pip install datefinder

In this post, we will explore how Datefinder can be used to efficiently extract dates from unstructured text.

Extracting Dates from Text

Datefinder makes the process of identifying and extracting dates straightforward, even when the formats vary within the text. Below is an example demonstrating its use.

import datefinder

# Example Input
string_with_dates = """
We have one meeting on May 17th, 2021 at 9:00am
and another meeting on 5/18/2021 at 10:00.
I hope you can attend one of the meetings.
"""

# Extract dates from text
matches = datefinder.find_dates(string_with_dates)

# Print each match
for match in matches:
    print(match)

In the above code:

datefinder.find_dates() scans the input text for potential date strings.

The string_with_dates variable contains examples of multiple date formats.

The matches iterator yields each identified date as a Python datetime object.

When you run the above code, Datefinder will identify and extract both dates:

2021-05-17 09:00:00
2021-05-18 10:00:00

Datefinder not only detects the dates but also converts them into a standard, machine-readable format (datetime objects), which can then be used for further processing or analysis.
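
Because the matches are plain datetime objects, they support ordinary datetime arithmetic. Here is a minimal sketch reusing the example text from above:

# The extracted values are standard datetime objects, so
# arithmetic such as computing the gap between meetings just works
dates = list(datefinder.find_dates(string_with_dates))
gap = dates[1] - dates[0]
print(f"Time between meetings: {gap}")  # 1 day, 1:00:00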

Conclusion

Datefinder is a powerful tool for extracting dates from unstructured text. It simplifies the process by handling various date formats and converting them into datetime objects. Whether you’re working on NLP tasks, data preprocessing, or automating workflows that involve date extraction, Datefinder saves time and effort.

Link to Datefinder.

MLForecast: Automate External Feature Handling

Motivation

Time series forecasting often requires incorporating external factors that can influence the target variable. However, handling these external factors (exogenous features) can be complex, especially when some features remain constant while others change over time.

# Example without proper handling of exogenous features
import pandas as pd

# Sales data with product info and prices
data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'product_id': [1, 1, 1],
    'category': ['electronics', 'electronics', 'electronics'],
    'price': [99.99, 89.99, 94.99],
    'sales': [150, 200, 175]
})

# Difficult to handle static (category) vs dynamic (price) features
# Risk of data leakage or incorrect feature engineering

This code shows the challenge of handling both static features (product category) and dynamic features (price) in time series forecasting. Without proper handling, you might incorrectly use future information or miss important patterns in the data.
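
As a minimal sketch of the pitfall (using the toy data frame above, with hypothetical column names): computing a feature over the whole series leaks future information into earlier rows, while a lagged feature uses only the past.

# Leaky: the mean is computed over the full series, so early rows
# "see" future prices
data["price_mean_leaky"] = data["price"].mean()

# Safe: a lagged feature only uses values observed before each row
data["price_lag1"] = data["price"].shift(1)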

Understanding Features in Time Series

Before diving into MLForecast, let’s understand two important concepts:

Static features: These are features that don’t change over time (like product category or location)

Dynamic features (exogenous): These are features that change over time (like price or weather)

Introduction to MLForecast

MLForecast is a Python library that simplifies time series forecasting with machine learning models while properly handling both static and dynamic features. It can be installed using:

pip install mlforecast

As covered in a past article about MLForecast’s workflow, it provides an integrated approach to time series forecasting. In this post, we will focus on its handling of exogenous features.

Working with Exogenous Features

MLForecast makes it easy to handle both static and dynamic features in your forecasting models. Here’s how:

First, let’s prepare our data with both types of features:

import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series, generate_prices_for_series

# Generate sample data
series = generate_daily_series(100, equal_ends=True, n_static_features=2)
series = series.rename(columns={"static_0": "store_id", "static_1": "product_id"})

# Generate price catalog (dynamic feature)
prices_catalog = generate_prices_for_series(series)

# Merge static and dynamic features
series_with_prices = series.merge(prices_catalog, how='left')
print(series_with_prices.head(10))

Output:

unique_id ds y store_id product_id price
0 id_00 2000-10-05 39.811983 79 45 0.548814
1 id_00 2000-10-06 103.274013 79 45 0.715189
2 id_00 2000-10-07 176.574744 79 45 0.602763
3 id_00 2000-10-08 258.987900 79 45 0.544883
4 id_00 2000-10-09 344.940404 79 45 0.423655
5 id_00 2000-10-10 413.520305 79 45 0.645894
6 id_00 2000-10-11 506.990093 79 45 0.437587
7 id_00 2000-10-12 12.688070 79 45 0.891773
8 id_00 2000-10-13 111.133819 79 45 0.963663
9 id_00 2000-10-14 197.982842 79 45 0.383442

Now, let’s create and train our model:

# Create MLForecast model
fcst = MLForecast(
    models=lgb.LGBMRegressor(random_state=0),
    freq="D",
    lags=[7],  # use a 7-day lag
    date_features=["dayofweek"],  # add day of week as a feature
)

# Fit model specifying which features are static
fcst.fit(
    series_with_prices,
    static_features=["store_id", "product_id"],  # specify static features
)

# Check which features are used for training
print("\nFeatures used for training:")
print(fcst.ts.features_order_)

Output:

Features used for training:
['store_id', 'product_id', 'price', 'lag7', 'dayofweek']

Generate predictions:

# Make predictions using future prices
predictions = fcst.predict(
    h=7,  # forecast 7 days ahead
    X_df=prices_catalog,  # provide future prices
)
predictions.head(10)

Output:

unique_id ds LGBMRegressor
0 id_00 2001-05-15 421.301684
1 id_00 2001-05-16 497.335181
2 id_00 2001-05-17 20.108545
3 id_00 2001-05-18 101.930145
4 id_00 2001-05-19 184.264253
5 id_00 2001-05-20 260.803990
6 id_00 2001-05-21 343.501305
7 id_01 2001-05-15 118.299009
8 id_01 2001-05-16 148.793503
9 id_01 2001-05-17 184.066779

The output shows forecasted values that take into account both static features (product information) and dynamic features (prices).

MLForecast vs Traditional Approaches

Traditional approaches often require separate handling of static and dynamic features, leading to complex preprocessing pipelines. MLForecast simplifies this by:

Automatically managing feature types

Preventing data leakage

Providing an integrated workflow

Conclusion

MLForecast’s handling of exogenous features significantly simplifies time series forecasting by providing a clean interface for both static and dynamic features. This makes it easier to incorporate external information into your forecasts while maintaining proper time series practices.

Link to MLForecast.

Enhancing Predictive Models with Workalendar’s Holiday Handling

Incorporating holiday and working day information into predictive models can enhance accuracy, but it’s challenging due to regional variations in holiday schedules.

Workalendar simplifies this process by handling working days, holidays, and business calendars for various countries and regions.

Here are some examples:

Get US holidays in 2024:

from datetime import date
from workalendar.usa import UnitedStates

US_cal = UnitedStates()
US_cal.holidays(2024)

[(datetime.date(2024, 1, 1), 'New year'),
(datetime.date(2024, 1, 15), 'Birthday of Martin Luther King, Jr.'),
(datetime.date(2024, 2, 19), "Washington's Birthday"),
(datetime.date(2024, 5, 27), 'Memorial Day'),
(datetime.date(2024, 7, 4), 'Independence Day'),
(datetime.date(2024, 9, 2), 'Labor Day'),
(datetime.date(2024, 10, 14), 'Columbus Day'),
(datetime.date(2024, 11, 11), 'Veterans Day'),
(datetime.date(2024, 11, 28), 'Thanksgiving Day'),
(datetime.date(2024, 12, 25), 'Christmas Day')]

Check if a date is a working day:

US_cal.is_working_day(date(2024, 9, 15)) # Sunday

False

US_cal.is_working_day(date(2024, 9, 2)) # Labor Day

False

Calculate the number of working days between two dates, excluding weekends and holidays:

# Calculate working days between 2024/1/19 and 2024/5/15
US_cal.get_working_days_delta(date(2024, 1, 19), date(2024, 5, 15))

82
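
Workalendar also provides add_working_days to shift a date forward by a number of working days:

# Add 2 working days to Dec 24, 2024; Christmas Day is skipped
US_cal.add_working_days(date(2024, 12, 24), 2)

datetime.date(2024, 12, 27)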

Get Japan holidays in 2024:

from workalendar.asia import Japan

# Get holidays in Japan
JA_cal = Japan()
JA_cal.holidays(2024)

[(datetime.date(2024, 1, 1), 'New year'),
(datetime.date(2024, 1, 8), 'Coming of Age Day'),
(datetime.date(2024, 2, 11), 'Foundation Day'),
(datetime.date(2024, 2, 23), "The Emperor's Birthday"),
(datetime.date(2024, 3, 20), 'Vernal Equinox Day'),
(datetime.date(2024, 4, 29), 'Showa Day'),
(datetime.date(2024, 5, 3), 'Constitution Memorial Day'),
(datetime.date(2024, 5, 4), 'Greenery Day'),
(datetime.date(2024, 5, 5), "Children's Day"),
(datetime.date(2024, 7, 15), 'Marine Day'),
(datetime.date(2024, 8, 11), 'Mountain Day'),
(datetime.date(2024, 9, 16), 'Respect-for-the-Aged Day'),
(datetime.date(2024, 9, 22), 'Autumnal Equinox Day'),
(datetime.date(2024, 10, 14), 'Sports Day'),
(datetime.date(2024, 11, 3), 'Culture Day'),
(datetime.date(2024, 11, 23), 'Labour Thanksgiving Day')]

Link to Workalendar.

Next Step

If you liked this blog post, you might also like:

Pendulum: Python Datetimes Made Easy – For intuitive datetime handling and timezone management.

Maya: Convert the string to datetime automatically – For effortless string-to-datetime conversion.

Datefinder: Automatically Find Dates and Time in a Python String – For extracting dates from text.


Avoiding Data Leakage in Time Series Analysis with TimeSeriesSplit

Time series data is unique because it has a temporal order. This means that data from the future shouldn’t influence predictions about the past. However, standard cross-validation techniques like K-Fold randomly shuffle the data, potentially using future information to predict past events.

scikit-learn’s TimeSeriesSplit is a specialized cross-validator for time series data. It respects the temporal order of our data, ensuring that we always train on past data and test on future data.

Let’s explore how to use TimeSeriesSplit with a simple example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)

for i, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[0 1 2]
  Test:  index=[3]
Fold 1:
  Train: index=[0 1 2 3]
  Test:  index=[4]
Fold 2:
  Train: index=[0 1 2 3 4]
  Test:  index=[5]

From the outputs, we can see that:

Temporal Integrity: The split always respects the original order of the data.

Growing Training Set: With each fold, the training set expands to include more historical data.

Forward-Moving Test Set: The test set is always a single future sample, progressing with each fold.

No Data Leakage: Future information is never used to predict past events.

This approach mimics real-world forecasting scenarios, where models use historical data to predict future outcomes.
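
Because TimeSeriesSplit is a standard scikit-learn cross-validator, it plugs directly into the usual evaluation utilities. A minimal sketch with cross_val_score (the Ridge model is just a stand-in):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# One score per fold, each computed on a strictly future test sample
scores = cross_val_score(Ridge(), X, y, cv=tscv, scoring="neg_mean_absolute_error")
print(scores)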

Pendulum: Python Datetimes Made Easy

While Python’s built-in datetime library is sufficient for basic use cases, it can become cumbersome when dealing with more complex scenarios.

Pendulum offers a more intuitive and user-friendly API, serving as a convenient drop-in replacement for the standard datetime class.

Here’s a comparison of syntax and functionality between the standard datetime library and Pendulum:

Creating a datetime

Datetime:

from datetime import datetime
now = datetime.now()

Pendulum:

import pendulum
now = pendulum.now()

Date arithmetic

Datetime:

from datetime import timedelta
future = now + timedelta(days=7)

Pendulum:

future = now.add(days=7)

Timezone handling

Datetime (with pytz):

import pytz
utc_now = datetime.now(pytz.UTC)
tokyo_tz = pytz.timezone('Asia/Tokyo')
tokyo_time = utc_now.astimezone(tokyo_tz)

Pendulum:

tokyo_time = now.in_timezone("Asia/Tokyo")

Parsing dates

Datetime:

parsed = datetime.strptime("2023-05-15 14:30:00", "%Y-%m-%d %H:%M:%S")
parsed = pytz.UTC.localize(parsed)

Pendulum:

parsed = pendulum.parse("2023-05-15 14:30:00")

Time differences

Datetime:

diff = parsed - utc_now
print(f"Difference: {diff}")

Pendulum:

diff = parsed - now
print(f"Difference: {diff.in_words()}")

Key Advantages of Pendulum

More intuitive API for date arithmetic and timezone handling

Automatic timezone awareness (UTC by default)

Flexible parsing without specifying exact formats

Human-readable time differences
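
A short sketch of these advantages in action (diff_for_humans is Pendulum’s human-readable difference helper):

import pendulum

now = pendulum.now("UTC")            # timezone-aware by default
next_week = now.add(days=7)          # fluent date arithmetic
print(next_week.diff_for_humans())   # e.g., "in 1 week"
print(pendulum.parse("2023-05-15"))  # no format string needed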

Link to Pendulum.

Hierarchical Forecasting in Python

In complex datasets, forecasts at detailed levels (e.g., regions, products) should align with higher-level forecasts (e.g., countries, categories). Inconsistent forecasts can lead to poor decisions.

Hierarchical forecasting addresses this by reconciling forecasts across levels, so that lower-level forecasts aggregate consistently into higher-level ones.

HierarchicalForecast from Nixtla is an open-source library that provides tools and methods for creating and reconciling hierarchical forecasts.

For illustrative purposes, consider a sales dataset with the following columns:

Country: The country where the sales occurred.

Region: The region within the country.

State: The state within the region.

Purpose: The purpose of the sale (e.g., Business, Leisure).

ds: The date of the sale.

y: The sales amount.

import numpy as np
import pandas as pd

Y_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/tourism.csv')
Y_df = Y_df.rename({'Trips': 'y', 'Quarter': 'ds'}, axis=1)
Y_df.insert(0, 'Country', 'Australia')
Y_df = Y_df[['Country', 'State', 'Region', 'Purpose', 'ds', 'y']]
Y_df['ds'] = Y_df['ds'].str.replace(r'(\d+) (Q\d)', r'\1-\2', regex=True)
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
Y_df.head()

   Country    State            Region    Purpose   ds          y
0  Australia  South Australia  Adelaide  Business  1998-01-01  135.077690
1  Australia  South Australia  Adelaide  Business  1998-04-01  109.987316
2  Australia  South Australia  Adelaide  Business  1998-07-01  166.034687
3  Australia  South Australia  Adelaide  Business  1998-10-01  127.160464
4  Australia  South Australia  Adelaide  Business  1999-01-01  137.448533

The dataset can be grouped in the following non-strictly hierarchical structure:

Country

Country, State

Country, Purpose

Country, State, Region

Country, State, Purpose

Country, State, Region, Purpose

spec = [
    ['Country'],
    ['Country', 'State'],
    ['Country', 'Purpose'],
    ['Country', 'State', 'Region'],
    ['Country', 'State', 'Purpose'],
    ['Country', 'State', 'Region', 'Purpose'],
]

Using the aggregate function from HierarchicalForecast, we can get the full set of time series.

from hierarchicalforecast.utils import aggregate

Y_df, S_df, tags = aggregate(Y_df, spec)
Y_df = Y_df.reset_index()
Y_df.sample(10)

       unique_id                                              ds          y
12251  Australia/New South Wales/Outback NSW/Business         2000-10-01  …
33131  Australia/Western Australia/Australia’s North          2000-10-01  …
22034  Australia/South Australia/Fleurieu Peninsula/Other     2006-07-01  …
31119  Australia/Victoria/Phillip Island/Visiting             2017-10-01  …
7671   Australia/New South Wales/Other                        2015-10-01  …
18339  Australia/Queensland/Mackay/Business                   2002-10-01  …
23043  Australia/South Australia/Limestone Coast/Visiting     1998-10-01  …
22129  Australia/South Australia/Fleurieu Peninsula/Visiting  2010-04-01  …
11349  Australia/New South Wales/Hunter/Business              2015-04-01  …
16599  Australia/Queensland/Brisbane/Other                    2007-10-01  …

Get all the distinct ‘Country/Purpose’ combinations present in the dataset:

tags['Country/Purpose']

array(['Australia/Business', 'Australia/Holiday', 'Australia/Other',
'Australia/Visiting'], dtype=object)

We use the final two years (8 quarters) as the test set.

Y_test_df = Y_df.groupby('unique_id').tail(8)
Y_train_df = Y_df.drop(Y_test_df.index)

Y_test_df = Y_test_df.set_index('unique_id')
Y_train_df = Y_train_df.set_index('unique_id')

Y_train_df.groupby('unique_id').size()

unique_id
Australia                                                72
Australia/ACT                                            72
Australia/ACT/Business                                   72
Australia/ACT/Canberra                                   72
Australia/ACT/Canberra/Business                          72
                                                         ..
Australia/Western Australia/Experience Perth/Other       72
Australia/Western Australia/Experience Perth/Visiting    72
Australia/Western Australia/Holiday                      72
Australia/Western Australia/Other                        72
Australia/Western Australia/Visiting                     72

The following code generates base forecasts for each time series in Y_df using the ETS model. The forecasts and fitted values are stored in Y_hat_df and Y_fitted_df, respectively.

%%capture
from statsforecast.models import ETS
from statsforecast.core import StatsForecast

fcst = StatsForecast(
    df=Y_train_df,
    models=[ETS(season_length=4, model='ZZA')],
    freq='QS',
    n_jobs=-1,
)
Y_hat_df = fcst.forecast(h=8, fitted=True)
Y_fitted_df = fcst.forecast_fitted_values()

Since Y_hat_df contains forecasts that are not coherent—meaning forecasts at detailed levels (e.g., by State, Region, Purpose) may not align with those at higher levels (e.g., by Country, State, Purpose)—we will use the HierarchicalReconciliation class with the BottomUp approach to ensure coherence.

from hierarchicalforecast.methods import BottomUp
from hierarchicalforecast.core import HierarchicalReconciliation

reconcilers = [BottomUp()]
hrec = HierarchicalReconciliation(reconcilers=reconcilers)
Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df, Y_df=Y_fitted_df, S=S_df, tags=tags)

The dataframe Y_rec_df contains the reconciled forecasts.

Y_rec_df.head()

unique_id  ds          ETS           ETS/BottomUp
Australia  2016-01-01  25990.068359  24380.257812
Australia  2016-04-01  24458.490234  22902.765625
Australia  2016-07-01  23974.056641  22412.982422
Australia  2016-10-01  24563.455078  23127.439453
Australia  2017-01-01  25990.068359  24516.759766

Link to Hierarchical Forecast.

What is the Bottom-Up Approach?

The bottom-up approach is a method where forecasts are initially created at the most granular level of a hierarchy and then aggregated up to higher levels. This approach ensures that detailed trends at lower levels are captured and accurately reflected in higher-level forecasts. It contrasts with top-down methods, which start with aggregate forecasts and distribute them downwards.

Steps in the Bottom-Up Approach

Forecast at the Lowest Level

First, forecasts are created at the most detailed level: Country, State, Region, Purpose. For example, the forecast for the next date might look like this:

Country  State  Region  Purpose   ds          y_forecast
USA      NY     East    Business  2023-01-02  105
USA      NY     East    Leisure   2023-01-02  85
USA      NJ     East    Business  2023-01-02  95
USA      NJ     East    Leisure   2023-01-02  75
USA      CA     West    Business  2023-01-02  125
USA      CA     West    Leisure   2023-01-02  115
USA      NV     West    Business  2023-01-02  65
USA      NV     West    Leisure   2023-01-02  55

Country, State, Purpose

Sum the forecasts for each Country, State, Purpose combination.

Country  State  Purpose   ds          y_forecast
USA      NY     Business  2023-01-02  105
USA      NY     Leisure   2023-01-02  85
USA      NJ     Business  2023-01-02  95
USA      NJ     Leisure   2023-01-02  75
USA      CA     Business  2023-01-02  125
USA      CA     Leisure   2023-01-02  115
USA      NV     Business  2023-01-02  65
USA      NV     Leisure   2023-01-02  55

Country, State, Region

Sum the forecasts for each Country, State, Region combination.

Country  State  Region  ds          y_forecast
USA      NY     East    2023-01-02  190
USA      NJ     East    2023-01-02  170
USA      CA     West    2023-01-02  240
USA      NV     West    2023-01-02  120

Country, Purpose

Sum the forecasts for each Country, Purpose combination.

Country  Purpose   ds          y_forecast
USA      Business  2023-01-02  390
USA      Leisure   2023-01-02  330

Country

Sum the forecasts for the entire Country.

Country  ds          y_forecast
USA      2023-01-02  720
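
To make the aggregation concrete, here is a minimal pandas sketch of the bottom-up roll-up; df_bottom is a hypothetical frame mirroring the Country, State, Purpose table above:

import pandas as pd

# Most granular forecasts (Country, State, Purpose)
df_bottom = pd.DataFrame({
    "Country": ["USA"] * 8,
    "State": ["NY", "NY", "NJ", "NJ", "CA", "CA", "NV", "NV"],
    "Purpose": ["Business", "Leisure"] * 4,
    "y_forecast": [105, 85, 95, 75, 125, 115, 65, 55],
})

# Aggregate upward with groupby sums
by_purpose = df_bottom.groupby(["Country", "Purpose"], as_index=False)["y_forecast"].sum()
by_country = df_bottom.groupby("Country", as_index=False)["y_forecast"].sum()
print(by_country)  # USA: 720, matching the Country-level table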

Sliding Window Approach to Time Series Cross-Validation

Time series cross-validation evaluates a model’s predictive performance by training on past data and testing on subsequent time periods using a sliding window approach.

MLForecast offers an efficient and easy-to-use implementation of this technique.

To see how to implement time series cross-validation with MLForecast, let’s start by reading a subset of the M4 Competition hourly dataset.

import pandas as pd
from utilsforecast.plotting import plot_series

Y_df = pd.read_csv("https://datasets-nixtla.s3.amazonaws.com/m4-hourly.csv").query(
    "unique_id == 'H1'"
)
Y_df

unique_id ds y
0 H1 1 605.0
1 H1 2 586.0
2 H1 3 586.0
3 H1 4 559.0
4 H1 5 511.0
.. … … …
743 H1 744 785.0
744 H1 745 756.0
745 H1 746 719.0
746 H1 747 703.0
747 H1 748 659.0

[748 rows x 3 columns]

Plot the time series:

fig = plot_series(Y_df, plot_random=False, max_insample_length=24 * 14)
fig

Instantiate a new MLForecast object:

from mlforecast import MLForecast
from mlforecast.target_transforms import Differences
from sklearn.linear_model import LinearRegression

mlf = MLForecast(
    models=[LinearRegression()],
    freq=1,
    target_transforms=[Differences([24])],
    lags=range(1, 25),
)

Once the MLForecast object has been instantiated, we can use the cross_validation method.

For this particular example, we’ll use 3 windows of 24 hours.

# use 3 windows of 24 hours
cross_validation_df = mlf.cross_validation(
    df=Y_df,
    h=24,
    n_windows=3,
)
cross_validation_df.head()

unique_id ds cutoff y LinearRegression
0 H1 677 676 691.0 676.726797
1 H1 678 676 618.0 559.559522
2 H1 679 676 563.0 549.167938
3 H1 680 676 529.0 505.930997
4 H1 681 676 504.0 481.981893

We’ll now plot the forecast for each cutoff period.

import matplotlib.pyplot as plt

def plot_cv(df, df_cv, last_n=24 * 14):
    cutoffs = df_cv["cutoff"].unique()
    fig, ax = plt.subplots(
        nrows=len(cutoffs), ncols=1, figsize=(14, 6), gridspec_kw=dict(hspace=0.8)
    )
    for cutoff, axi in zip(cutoffs, ax.flat):
        df.tail(last_n).set_index("ds").plot(ax=axi, y="y")
        df_cv.query("cutoff == @cutoff").set_index("ds").plot(
            ax=axi,
            y="LinearRegression",
            title=f"{cutoff=}",
        )

plot_cv(Y_df, cross_validation_df)

Notice that in each cutoff period, we generated a forecast for the next 24 hours using only the observations before that period.
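
From here, each window can be scored separately, for example by computing the RMSE per cutoff with plain pandas (a sketch using the columns shown above):

# RMSE of the LinearRegression forecasts within each cutoff window
rmse_per_cutoff = (
    cross_validation_df.assign(sq_err=lambda d: (d["y"] - d["LinearRegression"]) ** 2)
    .groupby("cutoff")["sq_err"]
    .mean()
    ** 0.5
)
print(rmse_per_cutoff)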

Link to MLForecast.

Run in Google Colab.

Backtesting: Assess Trading Strategy Performance Effortlessly in Python

Evaluating trading strategies’ effectiveness is crucial for financial decision-making, but it’s challenging due to the complexities of historical data analysis and strategy testing.

Backtesting allows users to simulate trades based on historical data and visualize the outcomes through interactive plots in three lines of code.

To see how Backtesting works, let’s create our first strategy, a simple moving average (MA) crossover strategy, and backtest it on Google stock data.

from backtesting.test import GOOG

GOOG.tail()

Open High Low Close Volume
2013-02-25 802.3 808.41 790.49 790.77 2303900
2013-02-26 795.0 795.95 784.40 790.13 2202500
2013-02-27 794.8 804.75 791.11 799.78 2026100
2013-02-28 801.1 806.99 801.03 801.20 2265800
2013-03-01 797.8 807.14 796.15 806.19 2175400

import pandas as pd

def SMA(values, n):
    """
    Return the simple moving average of `values`, at
    each step taking into account `n` previous values.
    """
    return pd.Series(values).rolling(n).mean()

from backtesting import Strategy
from backtesting.lib import crossover

class SmaCross(Strategy):
    # Define the two MA lags as *class variables*
    # for later optimization
    n1 = 10
    n2 = 20

    def init(self):
        # Precompute the two moving averages
        self.sma1 = self.I(SMA, self.data.Close, self.n1)
        self.sma2 = self.I(SMA, self.data.Close, self.n2)

    def next(self):
        # If sma1 crosses above sma2, close any existing
        # short trades, and buy the asset
        if crossover(self.sma1, self.sma2):
            self.position.close()
            self.buy()

        # Else, if sma1 crosses below sma2, close any existing
        # long trades, and sell the asset
        elif crossover(self.sma2, self.sma1):
            self.position.close()
            self.sell()

To assess the performance of our investment strategy, we will instantiate a Backtest object, using Google stock data as our asset of interest and incorporating the SmaCross strategy class. We’ll start with an initial cash balance of 10,000 units and set the broker’s commission to a realistic rate of 0.2%.

from backtesting import Backtest

bt = Backtest(GOOG, SmaCross, cash=10_000, commission=.002)
stats = bt.run()
stats

Start 2004-08-19 00:00:00
End 2013-03-01 00:00:00
Duration 3116 days 00:00:00
Exposure Time [%] 97.067039
Equity Final [$] 68221.96986
Equity Peak [$] 68991.21986
Return [%] 582.219699
Buy & Hold Return [%] 703.458242
Return (Ann.) [%] 25.266427
Volatility (Ann.) [%] 38.383008
Sharpe Ratio 0.658271
Sortino Ratio 1.288779
Calmar Ratio 0.763748
Max. Drawdown [%] -33.082172
Avg. Drawdown [%] -5.581506
Max. Drawdown Duration 688 days 00:00:00
Avg. Drawdown Duration 41 days 00:00:00
# Trades 94
Win Rate [%] 54.255319
Best Trade [%] 57.11931
Worst Trade [%] -16.629898
Avg. Trade [%] 2.074326
Max. Trade Duration 121 days 00:00:00
Avg. Trade Duration 33 days 00:00:00
Profit Factor 2.190805
Expectancy [%] 2.606294
SQN 1.990216
_strategy SmaCross
_equity_curve …
_trades Size EntryB…
dtype: object

Plot the outcomes:

bt.plot()
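
Because n1 and n2 were declared as class variables, they can also be tuned with Backtesting’s built-in optimizer:

# Search over MA windows, keeping the fast MA shorter than the slow MA
stats = bt.optimize(
    n1=range(5, 30, 5),
    n2=range(10, 70, 5),
    maximize="Equity Final [$]",
    constraint=lambda p: p.n1 < p.n2,
)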

Link to Backtesting.

Run in Google Colab.

tsmoothie: Fast and Flexible Tool for Exponential Smoothing

Smoothing is useful for capturing the underlying pattern in time series data, especially for data with a strong trend or seasonal component.

The tsmoothie library is a fast and efficient Python tool for performing time-series smoothing operations.

To see how tsmoothie works, let’s generate a single random walk time series of length 200 using the sim_randomwalk() function.

import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.utils_func import sim_randomwalk
from tsmoothie.smoother import LowessSmoother

# generate a random walk of length 200
np.random.seed(123)
data = sim_randomwalk(n_series=1, timesteps=200, process_noise=10, measure_noise=30)

Next, create a LowessSmoother object with a smooth_fraction of 0.1 (i.e., 10% of the data points are used for local regression) and 1 iteration. We then apply the smoothing operation to the data using the smooth() method.

# operate smoothing
smoother = LowessSmoother(smooth_fraction=0.1, iterations=1)
smoother.smooth(data)

After smoothing the data, we use the get_intervals() method of the LowessSmoother object to calculate the lower and upper bounds of the prediction interval for the smoothed time series.

# generate intervals
low, up = smoother.get_intervals("prediction_interval")

Finally, we plot the smoothed time series (as a blue line), and the prediction interval (as a shaded region) using matplotlib.

# plot the smoothed time series with intervals
plt.figure(figsize=(10, 5))

plt.plot(smoother.smooth_data[0], linewidth=3, color="blue")
plt.plot(smoother.data[0], ".k")
plt.title("timeseries")
plt.xlabel("time")

plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)

This graph effectively highlights the trend and seasonal components present in the time series data through the use of a smoothed representation.
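
Although this example uses LOWESS, tsmoothie also ships an ExponentialSmoother that follows the same smooth/get_intervals workflow. A minimal sketch (the parameter values here are illustrative):

from tsmoothie.smoother import ExponentialSmoother

# window_len points initialize the smoother; alpha controls how
# quickly older observations are discounted
exp_smoother = ExponentialSmoother(window_len=20, alpha=0.3)
exp_smoother.smooth(data)
low_exp, up_exp = exp_smoother.get_intervals("sigma_interval")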

Link to tsmoothie.

Run in Google Colab.

Beyond Point Estimates: Leverage Prediction Intervals for Robust Forecasting

Generating a forecast typically produces a single-point estimate, which does not reflect the uncertainty associated with the prediction.

To quantify this uncertainty, we need prediction intervals – a range of values the forecast can take with a given probability. MLForecast allows you to train sklearn models to generate both point forecasts and prediction intervals.

To demonstrate this, let’s consider the following example:

import pandas as pd
from utilsforecast.plotting import plot_series

train = pd.read_csv("https://auto-arima-results.s3.amazonaws.com/M4-Hourly.csv")
test = pd.read_csv("https://auto-arima-results.s3.amazonaws.com/M4-Hourly-test.csv")
train.head()
"""
unique_id ds y
0 H1 1 605.0
1 H1 2 586.0
2 H1 3 586.0
3 H1 4 559.0
4 H1 5 511.0
"""

We’ll only use the first series of the dataset.

n_series = 1
uids = train["unique_id"].unique()[:n_series]
train = train.query("unique_id in @uids")
test = test.query("unique_id in @uids")

Plot these series using the plot_series function from the utilsforecast library:

fig = plot_series(
    df=train,
    forecasts_df=test.rename(columns={"y": "y_test"}),
    models=["y_test"],
    palette="tab10",
)

fig.set_size_inches(8, 3)
fig

Train multiple models that follow the sklearn syntax:

from mlforecast import MLForecast
from mlforecast.target_transforms import Differences
from mlforecast.utils import PredictionIntervals
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

mlf = MLForecast(
    models=[
        LinearRegression(),
        KNeighborsRegressor(),
    ],
    freq=1,
    target_transforms=[Differences([1])],
    lags=[24 * (i + 1) for i in range(7)],
)

Apply the feature engineering and train the models:

mlf.fit(
    data=train,
    prediction_intervals=PredictionIntervals(n_windows=10, h=48),
)

Generate forecasts with prediction intervals:

# A list of floats with the confidence levels of the prediction intervals
levels = [50, 80, 95]

# Predict the next 48 hours
horizon = 48

# Generate forecasts with prediction intervals
forecasts = mlf.predict(h=horizon, level=levels)

Merge the test data with forecasts:

test_with_forecasts = test.merge(forecasts, how="left", on=["unique_id", "ds"])

Plot the point and the prediction intervals:

levels = [50, 80, 95]
fig = plot_series(
    train,
    test_with_forecasts,
    plot_random=False,
    models=["KNeighborsRegressor"],
    level=levels,
    max_insample_length=48,
    palette='tab10',
)
fig.set_size_inches(8, 4)
fig

Link to MLForecast.

View in Google Colab.
