Time Series Processing with Pandas

From NovaOrdis Knowledge Base

Overview

This article provides hints on how time series can be processed with Pandas.

A time series is a sequence of data points indexed in time order. The time index is a DatetimeIndex object that contains the timestamps corresponding to each data point. This time index enables operations such as resampling, rolling windows and filtering.
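
As a minimal sketch of these three operations, the snippet below builds a small daily series and applies a date-range filter, a rolling mean and a resample. The dates, values and variable names are illustrative assumptions, not data from this article.

```python
import pandas as pd

# Hypothetical daily series; the values are made up for illustration.
idx = pd.date_range("2023-10-01", periods=6, freq="D")
s = pd.Series([10, 12, 11, 15, 14, 18], index=idx)

# Filtering: label-based slicing on a DatetimeIndex includes both endpoints.
filtered = s.loc["2023-10-02":"2023-10-04"]

# Rolling: mean over a sliding window of 3 observations.
# The first two results are NaN because the window is not yet full.
rolling_mean = s.rolling(window=3).mean()

# Resampling: aggregate the daily values into 2-day bins.
resampled = s.resample("2D").sum()

print(filtered)
print(rolling_mean)
print(resampled)
```

All three calls rely on the index being a DatetimeIndex; resample() in particular raises an error on a default integer index.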

Import Pandas

import pandas as pd

Synthetic Time Series

df = pd.DataFrame({
    'date': [
        pd.to_datetime('2023-10-01'),
        pd.to_datetime('2023-10-05'),
        pd.to_datetime('2023-10-15')],
    'value': [10, 15, 22]
})

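Although to_datetime() returns a Timestamp instance for each scalar argument, the resulting df['date'] Series has the datetime64[ns] dtype. This is because Pandas stores a column of timestamps as a datetime64 array rather than as individual Timestamp objects; Timestamp is the scalar type returned when a single element is accessed.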
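
The difference between the column dtype and the element type can be checked directly. This is a small sketch using a shortened version of the DataFrame above; the exact datetime64 unit printed may differ between Pandas versions, so the check below only tests for "some datetime64 dtype".

```python
import pandas as pd

df = pd.DataFrame({
    "date": [pd.to_datetime("2023-10-01"), pd.to_datetime("2023-10-05")],
    "value": [10, 15],
})

# The column as a whole reports a datetime64 dtype...
print(df["date"].dtype)
is_datetime_column = pd.api.types.is_datetime64_any_dtype(df["date"])

# ...while an individual element comes back as a pandas Timestamp.
first = df["date"].iloc[0]
print(type(first))
```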

Load a Time Series

Assume the data comes from a CSV file whose first column, labeled "date", contains timestamp-formatted strings, and whose second column contains the values corresponding to those timestamps. The steps below load the data and turn it into a Pandas Series.

The content of the CSV file should be similar to:

date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
2023-10-04, 123
2023-10-05, 122
2023-10-06, 119
2023-10-07, 117
2023-10-08, 130
2023-10-09, 132

Create a DataFrame by reading the CSV file with the read_csv() function.

Parse Timestamps while Loading

While loading the file, we tell read_csv() to parse the "date" column as datetime by listing it in the parse_dates parameter:

df = pd.read_csv("./timeseries.csv", parse_dates=["date"])

The date is expected in the YYYY-MM-DD format, for example "2023-12-31". To handle custom date or time formats, see:

read_csv() Custom Date Format

To verify that the "date" column was parsed correctly, display df['date']: its dtype should be datetime64[ns].
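
The load-and-verify step can be tried without a file on disk. The sketch below reads the first rows of the sample from an in-memory string (the io.StringIO stand-in and the variable names are assumptions for illustration). It also passes skipinitialspace=True, which is not used in the article's call: the sample has a space after each comma, and without this option the second column would be labeled " value", with a leading space.

```python
import io
import pandas as pd

# In-memory stand-in for ./timeseries.csv, using the first rows of the sample.
csv_text = """date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
"""

# skipinitialspace=True drops the space after each comma, so the second
# column is labeled "value" rather than " value".
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], skipinitialspace=True)

# Inspect the dtypes; the "date" column should report a datetime64 dtype.
print(df.dtypes)
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])
```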

Parse Timestamps after Loading

Alternatively, the column carrying timestamps can be converted to datetime after loading, with to_datetime():

df['date'] = pd.to_datetime(df['date'])
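
A small sketch of this two-step approach, again reading from an in-memory string rather than a real file (the sample text and variable names are illustrative assumptions):

```python
import io
import pandas as pd

csv_text = "date,value\n2023-10-01,133\n2023-10-02,135\n"

# Without parse_dates, the "date" column is loaded as plain strings.
df = pd.read_csv(io.StringIO(csv_text))
was_datetime_before = pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert the column in a separate step.
df["date"] = pd.to_datetime(df["date"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])

print(was_datetime_before, is_datetime_after)
```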

Reset the Index

The DataFrame has a default integer index. We replace it with the content of the "date" column, turning it into a time index:

df = df.set_index('date')

Note that setting the index to the "date" column changes the shape of the DataFrame: the number of columns decreases by one, because the "date" column becomes the index. Also, set_index() does not modify the original DataFrame; it returns a new DataFrame carrying the new index. To replace the index in place, pass inplace=True. The drop argument, which defaults to True, is what removes the "date" column from the columns once it becomes the index. For more details, see

DataFrame | set_index()
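
The effect on the shape, and the fact that the original DataFrame is left untouched, can be seen in a short sketch. The three-row DataFrame is built inline here for illustration instead of being loaded from the CSV file.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"]),
    "value": [133, 135, 139],
})

# set_index() returns a new DataFrame; the original keeps its integer index.
indexed = df.set_index("date")

print(df.shape)       # (3, 2): "date" is still a regular column
print(indexed.shape)  # (3, 1): "date" has moved into the index
print(type(indexed.index).__name__)  # DatetimeIndex
```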

Then we extract the "value" column as a time series, since the DataFrame index is already a time index. The "value" column is the only column in the DataFrame after we replaced the index:

s = df.iloc[:,0]
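
As a side note, the same column can also be selected by label. The sketch below, on an inline DataFrame assumed for illustration, shows that both selections give the same Series. Selecting by position has the advantage of not depending on the exact column label, which matters if the CSV header carries stray whitespace.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [133, 135, 139]},
    index=pd.DatetimeIndex(["2023-10-01", "2023-10-02", "2023-10-03"], name="date"),
)

by_position = df.iloc[:, 0]  # first column, whatever its label
by_label = df["value"]       # same column, selected by name

# Both selections carry the same values and the same time index.
same = by_position.equals(by_label)
print(same)
```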

Get the Interesting Series

The result is a time series:

date
2023-10-01    133
2023-10-02    135
2023-10-03    139
2023-10-04    123
2023-10-05    122
2023-10-06    119
2023-10-07    117
2023-10-08    130
2023-10-09    132
Name:  value, dtype: int64

Transform the Elements of the Series

If the elements of the series need transformation, use the methods described here:

Series Transformation

Filter the Series

Series Filtering

Resample a Time Series with Another Frequency

TODO: https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html#resample-a-time-series-to-another-frequency