Time Series Processing with Pandas

From NovaOrdis Knowledge Base

Revision as of 23:39, 8 October 2023

Internal

TODO

TO PROCESS:

Overview

This article provides hints on how time series can be processed with Pandas.

A time series is a sequence of data points indexed in time order. The time index is a DatetimeIndex object that contains the timestamp corresponding to each data point. This time index enables operations such as resampling, rolling-window computations and filtering by date.
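The three operations named above can be sketched on a small, hypothetical daily series. The variable names, the 3-day bin and the window size are illustrative assumptions, not part of the original article:

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex (nine consecutive days).
idx = pd.date_range("2023-10-01", periods=9, freq="D")
ts = pd.Series([133, 135, 139, 123, 122, 119, 117, 130, 132], index=idx, name="value")

# Resampling: group the daily points into 3-day bins and average each bin.
three_day_mean = ts.resample("3D").mean()

# Rolling: mean over a sliding window of 3 consecutive observations.
# The first two entries are NaN because their windows are incomplete.
rolling_mean = ts.rolling(window=3).mean()

# Filtering: label-based slice on the time index (both endpoints are included).
first_week = ts.loc["2023-10-01":"2023-10-07"]

print(three_day_mean)
print(rolling_mean)
print(first_week)
```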

Import Pandas

import pandas as pd

Load a Time Series

Assuming the data comes from a CSV file whose first column, labeled "date", contains timestamp-formatted strings, and the second column contains values corresponding to those timestamps, this is how the data is loaded and turned into a Pandas Series.

The content of the CSV file should be similar to:

date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
2023-10-04, 123
2023-10-05, 122
2023-10-06, 119
2023-10-07, 117
2023-10-08, 130
2023-10-09, 132

Create a DataFrame by reading the CSV with the read_csv() function. To load the "date" column as a datetime type, name it in the parse_dates parameter:

df = pd.read_csv("./timeseries.csv", parse_dates=["date"])

This syntax assumes that the "date" column uses a format that Pandas recognizes by default, such as ISO 8601 ('YYYY-MM-DD'). If that is not the case, the format can be specified with the date_format parameter, available in Pandas 2.0 and later, as shown below:

df = pd.read_csv("./timeseries.csv", parse_dates=["date"], date_format='%m/%Y/%d')
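To see the effect of date_format without a file on disk, here is a minimal sketch that reads a hypothetical in-memory CSV whose dates are written as month/year/day. The sample text and the use of io.StringIO are assumptions made for illustration, and the date_format parameter requires Pandas 2.0 or later:

```python
import io
import pandas as pd

# Hypothetical CSV content with dates written as month/year/day.
csv_text = "date,value\n10/2023/01,133\n10/2023/02,135\n"

# parse_dates selects the column to convert, date_format tells Pandas how to read it.
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    date_format="%m/%Y/%d",
)

# The "date" column now holds datetime values rather than strings.
print(df.dtypes)
print(df)
```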

A common timestamp format is '%Y-%m-%d %H:%M:%S'. For more details on the format codes, see the "strftime() and strptime() Format Codes" section of the Python datetime documentation.

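For more complicated formats, the column can be loaded as text and converted afterwards with a parsing function, provided as a named function or a lambda:

def parse_timestamp(s: str) -> pd.Timestamp:
    # Illustrative body, assuming month/year/day strings such as '10/2023/01'.
    return pd.to_datetime(s.strip(), format='%m/%Y/%d')

df = pd.read_csv("./timeseries.csv")
df["date"] = pd.to_datetime(df["date"].map(parse_timestamp))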

For more details on timestamp parsing see:

Time, Date, Timestamp in Python

The DataFrame has a default integer index (a RangeIndex). We replace it with the content of the "date" column, turning it into a time index.

df = df.set_index(['date'])

Note that setting the index to the "date" column changes the shape of the DataFrame: it goes from a (9, 2) DataFrame to a (9, 1) DataFrame with a single column, because "date" becomes the index and is no longer a column.
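The shape change can be checked directly. The sketch below uses a hypothetical two-row DataFrame in place of the loaded CSV, so the shapes are (2, 2) and (2, 1) rather than the nine-row shapes of the sample file:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the loaded CSV.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-10-01", "2023-10-02"]),
        "value": [133, 135],
    }
)
shape_before = df.shape  # "date" and "value" are both columns

# Move "date" out of the columns and into the index.
indexed = df.set_index("date")
shape_after = indexed.shape  # only "value" remains as a column

print(shape_before, shape_after)
print(type(indexed.index).__name__)
```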

Then we extract the "value" column as a time series, since the DataFrame index is already a time index. The "value" column is the only column in the DataFrame after we replaced the index:

s = df.iloc[:,0]

The result is a time series:

date
2023-10-01    133
2023-10-02    135
2023-10-03    139
2023-10-04    123
2023-10-05    122
2023-10-06    119
2023-10-07    117
2023-10-08    130
2023-10-09    132
Name:  value, dtype: int64
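The loading steps can be combined into one short sketch. It reads a shortened copy of the sample data from memory instead of from ./timeseries.csv. The skipinitialspace=True argument is an addition not used in the article: the sample file has a space after each comma, and without this argument the column would be named " value", with a leading space. Selecting the column by name is an alternative to the positional iloc[:, 0] used above.

```python
import io
import pandas as pd

# Shortened copy of the sample file, including the space after each comma.
csv_text = """date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
"""

# skipinitialspace=True drops the blank that follows each delimiter,
# so the second column is named "value" instead of " value".
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], skipinitialspace=True)

# Turn the "date" column into the index, then select the remaining column by name.
s = df.set_index("date")["value"]

print(s)
```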

Filter the Series

loc[]

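Slicing with loc[] and date strings creates a new, filtered time series. Both endpoints are included, and on a sorted time index a bound may fall outside the range of the data, as the start date does here: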

s = s.loc['2023-09-17':'2023-10-05']
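A minimal sketch of this filter, rebuilt on a hypothetical copy of the sample series so it runs on its own. The boolean-mask variant is an alternative that is not in the original and is shown only for comparison:

```python
import pandas as pd

# Hypothetical copy of the sample series: nine daily values starting 2023-10-01.
idx = pd.date_range("2023-10-01", periods=9, freq="D")
s = pd.Series([133, 135, 139, 123, 122, 119, 117, 130, 132], index=idx, name="value")

# Label slice: the start bound precedes the data, the end bound is included.
by_slice = s.loc["2023-09-17":"2023-10-05"]

# The same selection written as an explicit boolean mask on the index.
mask = (s.index >= pd.Timestamp("2023-09-17")) & (s.index <= pd.Timestamp("2023-10-05"))
by_mask = s[mask]

print(by_slice)
```

The slice form is shorter. The mask form extends more easily to conditions that are not a single contiguous date range.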