Pandas DataFrame

From NovaOrdis Knowledge Base
Latest revision as of 23:26, 14 May 2024

Overview

A DataFrame is a two-dimensional data structure made of rows and columns, where the columns can have different types. A useful mental model for a DataFrame is a dict-like container for Series objects, where each column is a Series. Sometimes, documentation refers to columns as "features" or "variables". The values stored in rows are referred to as "records". A DataFrame provides both row and column labels.

Axes

The DataFrame has two axes: "axis 0", which runs downwards along the DataFrame's rows and represents rows, and "axis 1", which runs from left to right along the column headers and represents columns:

[Figure: Panda DataFrame Axes.png]

A DataFrame shares the same direction for "axis 0" with its Series' axes.

The DataFrame axes property gives access to a list containing two Index instances, the first for the rows and the second for the columns:

assert len(df.axes) == 2
print(df.axes)

[RangeIndex(start=0, stop=6, step=1), Index(['distance', 'strength'], dtype='object')]
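
As a minimal sketch of how the two axes are used in practice, the following rebuilds the example DataFrame from this page and aggregates along each axis. The variable names (row_index, column_totals and so on) are illustrative assumptions, not part of the original text.

```python
import pandas as pd

# The example DataFrame used throughout this page
df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# axes is a two-element list: [row index, column index]
row_index, column_index = df.axes

# axis=0 aggregates down the rows, giving one value per column
column_totals = df.sum(axis=0)

# axis=1 aggregates across the columns, giving one value per row
row_totals = df.sum(axis=1)
```

Aggregating along axis 0 collapses the rows, so the result is labeled with the column names; aggregating along axis 1 collapses the columns, so the result is labeled with the row index.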

Shape

The DataFrame shape property is a tuple that gives the dimensionality of the DataFrame: (rows, columns).
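
A short sketch of reading shape, using the example DataFrame from this page (the names n_rows and n_cols are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# shape is a plain tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
assert (n_rows, n_cols) == (6, 2)
```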

Index

By default, the DataFrame gets a RangeIndex.

The index of the DataFrame can be accessed with the index property:

df = pd.DataFrame(...)
df.index

The default index of the DataFrame can be replaced with set_index().

A DataFrame also has a horizontal (column) index, which is the second element of axes and is available through the columns property. It holds the column labels used to select columns by name:

Index(['distance', 'strength'], dtype='object')
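
The column index can be exercised as follows. This is a minimal sketch on the example DataFrame; the new column name signal_strength is a made-up label used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# The column index is the same Index that appears as the second element of axes
columns = df.columns
assert columns.equals(df.axes[1])

# Membership tests and iteration work on the column labels
has_distance = 'distance' in df.columns
labels = list(df.columns)

# rename() returns a new DataFrame with updated column labels
renamed = df.rename(columns={'strength': 'signal_strength'})
```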

Also see:

Pandas Concepts | Index

set_index()

The set_index(<column-name>) function replaces the DataFrame's index with the column whose name is specified as the argument:

df2 = df.set_index('date')

Setting the index to a column (the "date" column in this case) changes the DataFrame's shape: the number of columns drops by one, because the "date" column becomes the index. The replacement is made on a newly created DataFrame, returned as the result of the function; the original DataFrame is left unchanged. To replace the index in place, pass inplace=True. The drop=True argument, which is the default, removes the column from the data once it becomes the index. For more details, see

DataFrame | set_index()
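
The behavior described above can be sketched as follows. The DataFrame contents (a "date" column and a "value" column) are invented for illustration and are not from the original page.

```python
import pandas as pd

# Hypothetical data with a 'date' column to promote to the index
df = pd.DataFrame({
    'date': ['2023-05-30', '2023-06-01', '2023-06-02'],
    'value': [10, 20, 30],
})

# Default behavior: a new DataFrame is returned and 'date' leaves the columns
df2 = df.set_index('date')

# drop=False keeps 'date' as a regular column in addition to using it as the index
df3 = df.set_index('date', drop=False)

# inplace=True modifies the DataFrame itself and returns None
df4 = df.copy()
result = df4.set_index('date', inplace=True)
```

After these calls, df still has two columns, df2 and df4 have one column each, and df3 keeps both columns while also being indexed by date.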

Create a DataFrame

Create a DataFrame Programmatically

import pandas as pd
df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

Shows up as:

[Figure: Panda DataFrame.png]

Create a DataFrame from a CSV File

import pandas as pd
df = pd.read_csv("./analysis.csv")

If the CSV file contains columns that need to be handled as time series, see:

Load a Time Series
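
For orientation only, here is a hedged sketch of reading CSV data with a date column. It reads from an in-memory string instead of a file so that it is self-contained; the CSV content and column names are assumptions made for illustration, and the linked page remains the reference for time series loading.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a file path instead
csv_text = "date,value\n2023-05-30,10\n2023-06-01,20\n"

# parse_dates converts the named column to datetimes;
# index_col makes that column the DataFrame index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')
```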

Accessing Elements of a DataFrame

Accessing a Column

An individual column can be accessed with the [] operator, by specifying the column name. The result is a Series:

df = ...
df['Date']
type(df['Date']) # displays pandas.core.series.Series
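
A runnable variant of the same idea, using the example DataFrame from this page. It also shows, as an aside, that passing a list of names returns a DataFrame and not a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# A single column name returns a Series
distance = df['distance']
assert isinstance(distance, pd.Series)

# A list of column names returns a DataFrame, even for a single name
distance_frame = df[['distance']]
assert isinstance(distance_frame, pd.DataFrame)
```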

iloc[]

A property that allows integer-based access (indexing). The location is specified as a 0-based index position. The property accepts a wide variety of arguments.

iloc[0_based_row_index, 0_based_column_index]

Returns the single value found at the given row and column positions. For the example above, df.iloc[0,0] returns 1, df.iloc[1,0] returns 2 and df.iloc[0,1] returns 0.98.
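
These lookups can be checked directly against the example DataFrame. The negative position in the last line is an addition for illustration: iloc accepts negative positions that count from the end, as Python lists do.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# Positions are (row, column), both 0-based
first_distance = df.iloc[0, 0]
second_distance = df.iloc[1, 0]
first_strength = df.iloc[0, 1]

# Negative positions count from the end
last_distance = df.iloc[-1, 0]
```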

iloc[:, 0_based_column_index]

Extract the entire series corresponding to the provided 0-based column index:

df = ...
# extract a series corresponding to DataFrame column 0
s = df.iloc[:,0]
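
As a sketch on the example DataFrame, the Series extracted by position is the same one obtained by column name:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# The ':' selects every row; 0 selects the first column
s = df.iloc[:, 0]

# The resulting Series keeps the column label as its name
assert s.name == 'distance'
assert s.equals(df['distance'])
```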

iloc[idx0:idx1, 0_based_column_index]

Use a slice expression to extract a subset of the column at the provided 0-based column index, also as a Series. As with Python slices, idx0 is included and idx1 is excluded:

df = ...
s = df.iloc[1:2,0]
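
A small sketch of the slice semantics on the example DataFrame; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02],
})

# 1:2 selects only position 1, because the end position is excluded
one_row = df.iloc[1:2, 0]

# 1:4 selects positions 1, 2 and 3
three_rows = df.iloc[1:4, 0]
```

Even when the slice selects a single row, the result is still a Series, not a scalar, and it keeps the original row labels.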

loc[]

A property that allows label-based access (indexing).
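
A minimal sketch of label-based access follows. The string row labels 'a' to 'c' are an assumption introduced to make the difference from iloc[] visible; they are not part of the original example.

```python
import pandas as pd

# Hypothetical row labels, chosen for illustration
df = pd.DataFrame(
    {'distance': [1, 2, 5], 'strength': [0.98, 0.97, 0.88]},
    index=['a', 'b', 'c'],
)

# Single value by row label and column label
value = df.loc['b', 'strength']

# Label slices include both endpoints, unlike positional slices
first_two = df.loc['a':'b', 'distance']

# A single row label returns that row as a Series
row_c = df.loc['c']
```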

squeeze()

[]

Operations on DataFrames

Filtering

To filter DataFrame rows based on the values of the index, use loc[<index_condition>, <column_specification>].

The index condition is a boolean expression evaluated over the index, similar to:

df2 = df.loc[df.index >= '2023-06-01', :]

This example assumes a time series whose index holds dates: it keeps the rows dated 2023-06-01 or later and returns all columns. To return a specific column, as a Series:

s = df.loc[df.index >= '2023-06-01', 'some_column']
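
The two filters above can be tried on a small, self-contained time series. The data, the other_column name and the date-range mask at the end are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical time series indexed by date
df = pd.DataFrame(
    {'some_column': [1, 2, 3, 4], 'other_column': [10, 20, 30, 40]},
    index=pd.to_datetime(['2023-05-30', '2023-05-31', '2023-06-01', '2023-06-02']),
)

# Rows on or after 2023-06-01, all columns
df2 = df.loc[df.index >= '2023-06-01', :]

# Same rows, one column, returned as a Series
s = df.loc[df.index >= '2023-06-01', 'some_column']

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
mask = (df.index >= '2023-05-31') & (df.index < '2023-06-02')
in_range = df.loc[mask, :]
```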