Time Series Processing with Pandas

Overview

This article provides hints on how time series can be processed with Pandas.

Load a Time Series

Assume the data comes from a CSV file whose first column, labeled "date", contains date strings, and whose second column, labeled "value", contains the values recorded for those dates. The steps below load the file and turn it into a Pandas Series.

The content of the CSV file should be similar to:

date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
2023-10-04, 123
2023-10-05, 122
2023-10-06, 119
2023-10-07, 117
2023-10-08, 130
2023-10-09, 132

Create a DataFrame by reading the CSV with the read_csv() function. To have the "date" column parsed as a datetime type while loading, pass its name to the parse_dates parameter:

import pandas as pd

df = pd.read_csv("./timeseries.csv", parse_dates=["date"])
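
As a minimal, self-contained sketch of the loading step, the snippet below reads the first rows of the sample from an in-memory string instead of ./timeseries.csv (the csv_text name is introduced here only for illustration). Because the sample has a space after each comma, read_csv() would by default keep that space and name the second column " value"; passing skipinitialspace=True is one way to avoid this, and is an addition to the call shown above, not part of it.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for ./timeseries.csv (first rows of the sample).
csv_text = """date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
"""

# parse_dates converts the "date" column to a datetime type while loading.
# skipinitialspace drops the blank that follows each comma in the sample,
# so the second column is named "value" rather than " value".
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], skipinitialspace=True)

print(df.dtypes)
print(df.columns.tolist())
```

Checking df.dtypes right after loading is a quick way to confirm that "date" was parsed as a datetime type and not left as plain strings.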

The DataFrame has a default integer index. We replace it with the content of the "date" column, turning it into a time index:

df = df.set_index(['date'])

Note that setting the index to the "date" column changes the shape of the DataFrame: it goes from a (9, 2) DataFrame to a (9, 1) DataFrame, with a single column, because "date" moves from the columns into the index.
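
The shape change can be checked with a small sketch. It assumes a hypothetical three-row frame built in memory in place of the nine-row CSV, so the shapes are (3, 2) and (3, 1) rather than (9, 2) and (9, 1).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the loaded CSV.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"]),
        "value": [133, 135, 139],
    }
)
shape_before = df.shape  # "date" and "value" are both columns

df = df.set_index("date")
shape_after = df.shape  # "date" is now the index, only "value" remains a column

print(shape_before, shape_after)
print(type(df.index).__name__)
```

Since the "date" column already holds datetime values, the new index is a DatetimeIndex, which is what makes the frame usable as a time series.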

Then we extract the "value" column as a time series, since the DataFrame index is already a time index. The "value" column is the only column in the DataFrame after we replaced the index:

s = df.iloc[:,0]

The result is a time series:

date
2023-10-01    133
2023-10-02    135
2023-10-03    139
2023-10-04    123
2023-10-05    122
2023-10-06    119
2023-10-07    117
2023-10-08    130
2023-10-09    132
Name:  value, dtype: int64
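
Selecting the column by position is not the only option. The sketch below, which assumes a hypothetical three-row frame already indexed by date, compares positional selection with two alternatives: selecting by label and squeeze(). Neither alternative appears in the steps above; they are shown only for comparison.

```python
import pandas as pd

# Hypothetical frame, already indexed by date as described above.
df = pd.DataFrame(
    {"value": [133, 135, 139]},
    index=pd.DatetimeIndex(["2023-10-01", "2023-10-02", "2023-10-03"], name="date"),
)

by_position = df.iloc[:, 0]       # first (and only) column, selected by position
by_label = df["value"]            # same column, selected by name
squeezed = df.squeeze("columns")  # collapses a one-column DataFrame into a Series

same = by_position.equals(by_label) and by_label.equals(squeezed)
print(same)

# The Series keeps the time index, so values can be looked up by timestamp.
print(by_position.loc[pd.Timestamp("2023-10-02")])
```

Positional selection has the advantage of not depending on the exact column name, which matters if the header was read with stray whitespace.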