Time Series Processing with Pandas

From NovaOrdis Knowledge Base

Latest revision as of 01:28, 21 October 2023

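Internal

* Pandas Concepts
* DataFrame
* Series
* Time, Date, Timestamp in Python
* Time Series

TODO

TO PROCESS:

* https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html
* https://saturncloud.io/blog/how-to-filter-pandas-dataframe-by-time-index/
* https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp
* https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-overview
* https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-datetimeindex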

Overview

This article provides hints on how time series can be processed with Pandas.

A time series is a sequence of data points indexed in time order. The time index is a DatetimeIndex object that contains the timestamps corresponding to the data points. This time index enables operations such as resampling, rolling-window computations and filtering.
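
As a minimal sketch of the definition above, the following builds a small Series with a time index and applies a rolling-window computation to it. The values and the date range are made up for illustration.

```python
import pandas as pd

# Hypothetical data: five daily observations starting on 2023-10-01.
index = pd.date_range("2023-10-01", periods=5, freq="D")
s = pd.Series([10, 20, 30, 40, 50], index=index)

# The index is a DatetimeIndex, which is what makes the Series a time series.
is_time_index = isinstance(s.index, pd.DatetimeIndex)

# Rolling mean over a window of two consecutive observations.
# The first element has no predecessor, so its rolling mean is NaN.
rolling_mean = s.rolling(window=2).mean()
```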

Import Pandas

import pandas as pd

Synthetic Time Series

df = pd.DataFrame({
    'date': [
        pd.to_datetime('2023-10-01'),
        pd.to_datetime('2023-10-05'),
        pd.to_datetime('2023-10-15')],
    'value': [10, 15, 22]
})

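Although to_datetime() returns Timestamp instances, the df['date'] Series has the datetime64[ns] dtype. Pandas stores a column of timestamps in a compact datetime64 array rather than as individual Python objects, and converts an element back into a Timestamp when it is accessed individually.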
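
The relationship between the column dtype and its individual elements can be checked with a short sketch. It uses the generic is_datetime64_any_dtype() check instead of comparing against datetime64[ns] literally, on the assumption that the exact resolution may differ between Pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [
        pd.to_datetime("2023-10-01"),
        pd.to_datetime("2023-10-05"),
        pd.to_datetime("2023-10-15")],
    "value": [10, 15, 22],
})

# The column as a whole is stored with a datetime64 dtype ...
column_is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])

# ... while an individual element is handed back as a Timestamp.
first = df["date"].iloc[0]
element_is_timestamp = isinstance(first, pd.Timestamp)
```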

Load a Time Series

Assuming the data comes from a CSV file whose first column, labeled "date", contains timestamp-formatted strings, and the second column contains values corresponding to those timestamps, this is how the data is loaded and turned into a Pandas Series.

The content of the CSV file should be similar to:

date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
2023-10-04, 123
2023-10-05, 122
2023-10-06, 119
2023-10-07, 117
2023-10-08, 130
2023-10-09, 132

Create a DataFrame by reading the CSV with the read_csv() function.

Parse Timestamps while Loading

While loading the file, we treat the "date" column as a datetime type by passing its name to the parse_dates parameter:

df = pd.read_csv("./timeseries.csv", parse_dates=["date"])

The date is expected in the YYYY-MM-DD format, for example "2023-12-31". To handle custom date or time formats, see:

read_csv() Custom Date Format

To verify that the date column was correctly parsed, display df['date']. It should have a datetime64[ns] type.
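
The loading step can be tried without a file on disk. The sketch below reads a shortened copy of the CSV from an in-memory buffer, which is an assumption made to keep the example self-contained. It also passes skipinitialspace=True, which is not part of the original call: the sample file has a space after each comma, and without this argument the second column would be named " value", with a leading space.

```python
import io

import pandas as pd

# A shortened, in-memory stand-in for ./timeseries.csv.
csv_text = """date, value
2023-10-01, 133
2023-10-02, 135
2023-10-03, 139
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],      # parse the "date" column as datetime
    skipinitialspace=True,     # assumption: strip the space after each comma
)

# Programmatic version of "display df['date'] and check its type".
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])
```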

Parse Timestamps after Loading

Not tested yet.

Alternatively, the column carrying timestamps can be converted to datetime after loading:

df['date'] = pd.to_datetime(df['date'])
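
A minimal sketch of this alternative, under the assumption that the file is read without parse_dates so that the "date" column initially holds plain strings. The explicit format argument is optional for YYYY-MM-DD dates and is included only to show where a format string would go.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file, loaded without date parsing.
csv_text = "date,value\n2023-10-01,133\n2023-10-02,135\n"
df = pd.read_csv(io.StringIO(csv_text))

# Right after loading, the column holds strings, not datetimes.
was_datetime_before = pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert the column in a separate step.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])
```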

Reset the Index

The DataFrame has a default integral index, and we replace it with the content of the "date" column, turning it into a time index:

df = df.set_index('date')

Note that setting the index to the "date" column changes the DataFrame dimensionality: it reduces the number of columns by one, as the "date" column is now used as the index. Also, set_index() does not modify the original DataFrame. It returns a newly created DataFrame that carries the new index. To perform an in-place replacement, pass inplace=True as an argument. The drop argument, which removes the column once it becomes the index, is True by default. For more details, see

DataFrame | set_index()
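
The effect of set_index() on the shape of the DataFrame, and the difference between the returned copy and the in-place variant, can be seen in the following sketch. The three-row DataFrame is a shortened stand-in for the loaded data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"]),
    "value": [133, 135, 139],
})

# set_index() returns a new DataFrame; df itself keeps both columns.
indexed = df.set_index("date")

# In-place variant, applied to a copy so the original stays available.
df_in_place = df.copy()
result = df_in_place.set_index("date", inplace=True)  # returns None
```

After this runs, df still has two columns, while indexed and df_in_place each have a single "value" column and a DatetimeIndex.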

Then we extract the "value" column as a time series, since the DataFrame index is already a time index. The "value" column is the only column in the DataFrame after we replaced the index:

s = df.iloc[:,0]

Get the Interesting Series

The result is a time series:

date
2023-10-01    133
2023-10-02    135
2023-10-03    139
2023-10-04    123
2023-10-05    122
2023-10-06    119
2023-10-07    117
2023-10-08    130
2023-10-09    132
Name:  value, dtype: int64
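
The extraction above selects the column by position. As a sketch of an alternative, the column can also be selected by label, which does not depend on column order. The example assumes the column is named exactly "value", without the leading space that the sample CSV would otherwise produce.

```python
import pandas as pd

# Shortened stand-in for the DataFrame after the index was replaced.
df = pd.DataFrame(
    {"value": [133, 135, 139]},
    index=pd.DatetimeIndex(
        ["2023-10-01", "2023-10-02", "2023-10-03"], name="date"),
)

by_position = df.iloc[:, 0]   # first column, whatever its name
by_label = df["value"]        # column selected by name

# Both are Series that share the DataFrame's time index.
same_series = by_position.equals(by_label)
```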

Transform the Elements of the Series

If the elements of the series need transformation, use the methods described here:

Series Transformation
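
As an illustration only, two common ways of transforming the elements are vectorized arithmetic and map(). The transformations themselves (dividing by 100, subtracting 100) are arbitrary examples, not something the data requires.

```python
import pandas as pd

s = pd.Series(
    [133, 135, 139],
    index=pd.date_range("2023-10-01", periods=3, freq="D"),
    name="value",
)

# Vectorized arithmetic applies to every element at once.
scaled = s / 100

# map() applies an arbitrary function element by element.
shifted = s.map(lambda v: v - 100)
```

Both results keep the time index of the original series.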

Filter the Series

Series Filtering
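
A minimal sketch of two filtering styles on a time series: slicing by date with loc[], and filtering by value with a boolean mask. The series reproduces the sample data used in this article. Note that a date slice on a time index includes both endpoints.

```python
import pandas as pd

s = pd.Series(
    [133, 135, 139, 123, 122, 119, 117, 130, 132],
    index=pd.date_range("2023-10-01", periods=9, freq="D"),
    name="value",
)

# Filter by time: both the start and the end date are included.
window = s.loc["2023-10-03":"2023-10-05"]

# Filter by value: keep only the observations strictly above 130.
above_130 = s[s > 130]
```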

Resample a Time Series with Another Frequency

TODO: https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html#resample-a-time-series-to-another-frequency
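
Until this section is written, the following sketch shows one possible use of resample(): the daily sample data is downsampled into three-day bins and each bin is reduced to its mean. The choice of a three-day frequency and of the mean as the aggregation is arbitrary.

```python
import pandas as pd

s = pd.Series(
    [133, 135, 139, 123, 122, 119, 117, 130, 132],
    index=pd.date_range("2023-10-01", periods=9, freq="D"),
    name="value",
)

# Group the nine daily observations into three-day bins
# and compute the mean of each bin.
three_day_mean = s.resample("3D").mean()
```

The result has one element per bin, labeled with the first day of the bin.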