Pandas DataFrame

From NovaOrdis Knowledge Base

External

Internal

Overview

A DataFrame is a two-dimensional data structure with rows and columns, where the columns can have different types. A useful mental model for a DataFrame is a dict-like container for Series objects, where each column is a Series. Sometimes, documentation refers to columns as "features" or "variables". The values stored in rows are referred to as "records". A DataFrame provides both row and column labels.
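The dict-like mental model can be checked with a short sketch. It reuses the 'distance'/'strength' sample data that appears later on this page; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# Iterating over a DataFrame yields its column labels, like iterating over a dict
column_labels = list(df)

# Looking up a label returns the column, which is a Series
column_types = {label: type(df[label]).__name__ for label in df}

# Each column carries its own dtype, so columns can have different types
dtypes = df.dtypes.astype(str).to_dict()

print(column_labels)
print(column_types)
print(dtypes)
```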

Axes

The DataFrame has two axes: "axis 0", which runs downwards along the rows, and "axis 1", which runs from left to right along the column headers:

Panda DataFrame Axes.png

Axis 0 of a DataFrame has the same direction as the single axis of each of its column Series.

The DataFrame axes property gives access to a list containing two Index instances, the first for the rows, the second for the columns:

assert len(df.axes) == 2
print(df.axes)

[RangeIndex(start=0, stop=6, step=1), Index(['distance', 'strength'], dtype='object')]
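One place where the axis direction matters is the axis argument accepted by many DataFrame methods. The sketch below uses sum() as an example and assumes the 'distance'/'strength' sample data from this page:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# axis=0 runs down the rows: summing along it collapses the rows
# and leaves one value per column
column_totals = df.sum(axis=0)

# axis=1 runs across the columns: summing along it collapses the columns
# and leaves one value per row
row_totals = df.sum(axis=1)

print(column_totals)
print(row_totals)
```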

Shape

The DataFrame shape property is a tuple that gives the dimensions of the DataFrame: (rows, columns).
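A minimal sketch, assuming the six-row, two-column sample DataFrame used elsewhere on this page:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# shape is a plain tuple, so it can be unpacked directly
rows, columns = df.shape
print(rows, columns)

# len() of a DataFrame is the number of rows
row_count = len(df)
```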

Index

By default, the DataFrame gets a RangeIndex.

The index of the DataFrame can be accessed with the index property:

df = pd.DataFrame(...)
df.index

The default index of the DataFrame can be replaced with set_index().

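A DataFrame also has a horizontal index, the column index, which holds the column labels and is accessible with the columns property. It is the second element of axes and is used to select columns by label: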

Index(['distance', 'strength'], dtype='object')
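The column index can be used like any other Index. The following is a sketch, assuming the sample DataFrame from this page:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# The column index is the same Index that axes exposes for axis 1
same_as_axes = df.columns.equals(df.axes[1])

# Membership test and label-to-position lookup
has_strength = 'strength' in df.columns
strength_position = df.columns.get_loc('strength')

# The labels as a plain Python list
labels = list(df.columns)
print(labels, has_strength, strength_position)
```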

Also see:

Pandas Concepts | Index

set_index()

set_index(<column-name>) function replaces the DataFrame's index with the column whose name is specified as argument:

df2 = df.set_index('date')

Setting the index to a column (the "date" column in this case) changes the DataFrame shape: it reduces the number of columns by one, as the "date" column is used as the index. Also, the original DataFrame is not modified: the index replacement takes place on a newly created DataFrame, returned as the result of the function. To perform an in-place replacement, use inplace=True. The drop argument, which defaults to True, controls whether the column is removed from the DataFrame's columns once it becomes the index. For more details, see:

DataFrame | set_index()
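The behavior described above can be sketched as follows. The 'date' and 'price' columns and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-05-30', '2023-06-01', '2023-06-02']),
    'price': [10.0, 11.5, 11.0],
})

# set_index() returns a new DataFrame; the original is left untouched
df2 = df.set_index('date')
print(df.shape, df2.shape)

# drop=False keeps 'date' as a regular column in addition to using it as the index
df3 = df.set_index('date', drop=False)

# inplace=True modifies the DataFrame it is called on and returns None
df_inplace = df.copy()
result = df_inplace.set_index('date', inplace=True)
```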

Create a DataFrame

Create a DataFrame Programmatically

import pandas as pd
df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

Shows up as:

Panda DataFrame.png

Create a DataFrame from a CSV File

import pandas as pd
df = pd.read_csv("./analysis.csv")

If the CSV file contains columns that need to be handled as time series, see:

Load a Time Series
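As a rough sketch of the time series case, read_csv() can parse a date column and use it as the index through its parse_dates and index_col arguments. The CSV content below is invented and read from an in-memory buffer instead of a file, so the example is self-contained:

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path such as "./analysis.csv"
csv_text = "date,price\n2023-05-30,10.0\n2023-06-01,11.5\n"

# parse_dates converts the 'date' column to timestamps,
# index_col makes it the index of the resulting DataFrame
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')
print(df.index)
```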

Accessing Elements of a DataFrame

Accessing a Column

An individual column can be accessed with the [] operator, by specifying the column name. The result is a Series:

df = ...
df['Date']
type(df['Date']) # displays pandas.core.series.Series
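A runnable variant of the snippet above, assuming the 'distance'/'strength' sample DataFrame from this page. It also shows that passing a list of names to [] returns a DataFrame instead of a Series:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# A single column name returns a Series named after the column
s = df['distance']
print(type(s).__name__, s.name)

# A list of column names returns a DataFrame, even for a single name
sub = df[['distance']]
print(type(sub).__name__, sub.shape)
```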

iloc[]

A property that allows integer-based access (indexing). The location is specified as a 0-based index position. The property accepts a wide variety of arguments.

iloc[0_based_row_index, 0_based_column_index]

Returns the single value found at the given row and column positions. For the DataFrame created programmatically above, df.iloc[0, 0] would return 1, df.iloc[1, 0] would return 2 and df.iloc[0, 1] would return 0.98.
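The values quoted above can be verified with a short sketch that rebuilds the sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# iloc[row_position, column_position], both 0-based
first_distance = df.iloc[0, 0]
second_distance = df.iloc[1, 0]
first_strength = df.iloc[0, 1]

# Negative positions count from the end, as with Python lists
last_strength = df.iloc[-1, 1]

print(first_distance, second_distance, first_strength, last_strength)
```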

iloc[:, 0_based_column_index]

Extract the entire series corresponding to the provided 0-based column index:

df = ...
# extract a series corresponding to DataFrame column 0
s = df.iloc[:,0]

iloc[idx0:idx1, 0_based_column_index]

Use a slice expression to extract a subset of the column at the provided 0-based column index, also as a Series. As with Python slices, the end position is exclusive:

df = ...
s = df.iloc[1:2,0]
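Both column forms of iloc[] can be compared side by side. This is a sketch that assumes the sample DataFrame from this page:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# Entire column 0, as a Series
whole = df.iloc[:, 0]

# Row positions 1 up to, but not including, 2: a one-element Series
one = df.iloc[1:2, 0]

# Row positions 1, 2 and 3 of column 0
several = df.iloc[1:4, 0]

print(whole.tolist(), one.tolist(), several.tolist())
```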

loc[]

A property that allows label-based access (indexing).
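A minimal sketch of label-based access. The row labels 'a', 'b' and 'c' are invented for illustration, so that labels are clearly distinct from integer positions:

```python
import pandas as pd

# Hypothetical row labels
df = pd.DataFrame(
    {'distance': [1, 2, 5], 'strength': [0.98, 0.97, 0.88]},
    index=['a', 'b', 'c']
)

# loc[row_label, column_label] returns a single value
value = df.loc['b', 'strength']

# Unlike positional slices, label slices include both endpoints
distances = df.loc['a':'b', 'distance']

# A single row label returns the row as a Series indexed by column name
row = df.loc['c']

print(value, distances.tolist(), list(row.index))
```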

squeeze()

[]

Operations on DataFrames

Filtering

To filter DataFrame rows based on the values of the index, use loc[<index_condition>, <column_specification>].

The index condition is similar to:

df2 = df.loc[df.index >= '2023-06-01', :]

This applies to a time series whose index contains dates: it filters the rows by date and returns all columns. To return a specific column, as a Series:

s = df.loc[df.index >= '2023-06-01', 'some_column']
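The two filtering forms can be tried on a small time series. The data and the 'other_column' name below are invented for illustration, and the index is assumed to be a DatetimeIndex:

```python
import pandas as pd

# Hypothetical time series with a date index
df = pd.DataFrame(
    {'some_column': [1.0, 2.0, 3.0, 4.0], 'other_column': [10, 20, 30, 40]},
    index=pd.to_datetime(['2023-05-30', '2023-05-31', '2023-06-01', '2023-06-02'])
)

# The index condition evaluates to a boolean mask, one entry per row
mask = df.index >= '2023-06-01'

# Keep the rows where the mask is True, with all columns
df2 = df.loc[mask, :]

# Keep the same rows, but only one column, returned as a Series
s = df.loc[mask, 'some_column']

print(df2)
print(s)
```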