Time Series Processing with Pandas
Revision as of 21:42, 20 October 2023
Internal
TODO
TO PROCESS:
- https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html
- https://saturncloud.io/blog/how-to-filter-pandas-dataframe-by-time-index/
- https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp
- https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-overview
- https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-datetimeindex
Overview
This article provides hints on how time series can be processed with Pandas.
A time series is a sequence of data points indexed in time order. In Pandas, the time index is a DatetimeIndex object that contains the timestamp corresponding to each data point. This time index enables operations such as resampling, rolling-window computations and filtering by time.
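The three operations mentioned above can be sketched as follows. This is a minimal illustration, not a complete treatment: the daily values and the names three_day_sum, rolling_mean and first_week are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical daily data, used only for illustration.
index = pd.date_range("2023-10-01", periods=9, freq="D")
s = pd.Series([133, 135, 139, 123, 122, 119, 117, 130, 132], index=index)

# Resampling: aggregate the daily values into 3-day bins.
three_day_sum = s.resample("3D").sum()

# Rolling: mean over a sliding window of 3 observations.
# The first two entries are NaN because the window is not yet full.
rolling_mean = s.rolling(window=3).mean()

# Filtering: select a date range by label (both endpoints are included).
first_week = s.loc["2023-10-01":"2023-10-07"]
```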
Import Pandas
import pandas as pd
Synthetic Time Series
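One possible way to build a synthetic time series is to generate a DatetimeIndex with pd.date_range() and attach generated values to it. The sketch below assumes a daily frequency, a linear trend and Gaussian noise; the seed, the start date and the name synthetic are illustrative choices, not part of the original article.

```python
import numpy as np
import pandas as pd

# Assumed parameters, chosen only for illustration.
rng = np.random.default_rng(seed=42)
index = pd.date_range(start="2023-10-01", periods=30, freq="D")

# A linear trend plus Gaussian noise.
values = 100 + 0.5 * np.arange(len(index)) + rng.normal(0, 2, size=len(index))
synthetic = pd.Series(values, index=index, name="value")
```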
Load a Time Series
Assuming the data comes from a CSV file whose first column, labeled "date", contains timestamp-formatted strings, and the second column contains values corresponding to those timestamps, this is how the data is loaded and turned into a Pandas Series.
The content of the CSV file should be similar to:
date,value
2023-10-01,133
2023-10-02,135
2023-10-03,139
2023-10-04,123
2023-10-05,122
2023-10-06,119
2023-10-07,117
2023-10-08,130
2023-10-09,132
Create a DataFrame by reading the CSV file with the read_csv() function. While loading, parse the "date" column as a datetime type by passing its name to the parse_dates parameter:
df = pd.read_csv("./timeseries.csv", parse_dates=["date"])
The date is expected in YYYY-MM-DD format, for example "2023-12-31". To handle custom date or time formats, see:
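As a hedged sketch of one way to handle a custom format, the column can be read as plain strings and then converted with pd.to_datetime() and an explicit format string. The day/month/year layout and the inline CSV content below are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content with day/month/year dates.
csv_text = "date,value\n01/10/2023,133\n02/10/2023,135\n"

# Read the column as plain strings first, then convert with an explicit format.
df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
```

Passing an explicit format avoids ambiguity between day-first and month-first dates.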
To verify that the date column was correctly parsed, display df['date']; it should have a datetime64[ns] type.
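The check can also be done programmatically. The sketch below loads a shortened, inline copy of the CSV (an assumption, so that the example is self-contained) and tests the column type with pd.api.types.is_datetime64_any_dtype(), which does not depend on the exact time resolution.

```python
import io

import pandas as pd

# Shortened inline copy of the CSV file, for illustration only.
csv_text = "date,value\n2023-10-01,133\n2023-10-02,135\n2023-10-03,139\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Inspect the dtype of the parsed column.
print(df["date"].dtype)

# Programmatic check that does not depend on the exact resolution.
is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])
```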
The DataFrame has a default integer index, and we replace it with the content of the "date" column, turning it into a time index:
df = df.set_index('date')
For more details on set_index(), see:
Note that setting the index to the "date" column changes the shape of the DataFrame: the number of columns decreases by one, because the "date" column becomes the index. Also, set_index() does not modify the original DataFrame; it returns a new DataFrame with the index replaced. To replace the index in place, pass inplace=True. The drop=True argument, which is the default, removes the "date" column from the columns once it becomes the index.
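The difference between the two variants can be sketched as follows, using a small DataFrame built in memory. The three rows and the names indexed and df_copy are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"]),
    "value": [133, 135, 139],
})

# set_index() returns a new DataFrame; the original is unchanged.
indexed = df.set_index("date")
print(df.shape)       # two columns: "date" and "value"
print(indexed.shape)  # one column: "value"

# In-place variant: modifies df_copy directly and returns None.
df_copy = df.copy()
df_copy.set_index("date", inplace=True)
```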
Then we extract the "value" column as a time series, since the DataFrame index is already a time index. The "value" column is the only column in the DataFrame after we replaced the index:
s = df.iloc[:,0]
The result is a time series:
date
2023-10-01    133
2023-10-02    135
2023-10-03    139
2023-10-04    123
2023-10-05    122
2023-10-06    119
2023-10-07    117
2023-10-08    130
2023-10-09    132
Name: value, dtype: int64
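As an alternative sketch, the column can be selected by name instead of by position, which may be more robust if the DataFrame later gains other columns. The example assumes the column is labeled exactly "value" and uses a shortened in-memory DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [133, 135, 139]},
    index=pd.DatetimeIndex(["2023-10-01", "2023-10-02", "2023-10-03"], name="date"),
)

# Selecting by label returns the same Series as df.iloc[:, 0] here.
s = df["value"]
```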
Transform the Elements of the Series
If the elements of the series need transformation, use the methods described here:
Filter the Series
loc[]
Creates a new, filtered time series:
s = s.loc['2023-09-17':'2023-10-05']
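A self-contained sketch of this filter, applied to the sample data from this article, is shown below. The series is rebuilt in memory so that the example runs on its own, and the name filtered is an illustrative choice. Label-based slicing on a sorted DatetimeIndex includes both endpoints, and a start date that falls before the first timestamp simply selects from the beginning of the series.

```python
import pandas as pd

index = pd.date_range("2023-10-01", periods=9, freq="D")
s = pd.Series([133, 135, 139, 123, 122, 119, 117, 130, 132], index=index, name="value")

# Keep the points between the two dates, endpoints included.
# The start date precedes the data, so the slice begins at 2023-10-01.
filtered = s.loc["2023-09-17":"2023-10-05"]
```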