Pandas DataFrame

From NovaOrdis Knowledge Base

Revision as of 01:44, 16 October 2023

External

Internal

Overview

A DataFrame is a two-dimensional data structure made of rows and columns, where the columns can have different types. A useful mental model for a DataFrame is a dict-like container for Series objects, where each column is a Series. Sometimes, documentation refers to columns as "features" or "variables". The values stored in rows are referred to as "records".
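The dict-of-Series mental model can be sketched as follows. This is a minimal illustration, not the only way to build a DataFrame; the column names and values are sample data chosen for the example.

```python
import pandas as pd

# Two Series with different value types (integers and floats)
distance = pd.Series([1, 2, 5], name='distance')
strength = pd.Series([0.98, 0.97, 0.88], name='strength')

# The DataFrame behaves like a dict that maps column names to Series
df = pd.DataFrame({'distance': distance, 'strength': strength})

column_names = list(df.keys())   # dict-style access to the column names
column = df['distance']          # dict-style lookup returns a Series
```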

Axes

The DataFrame has two axes. "Axis 0" runs along the DataFrame's rows, pointing downwards, and represents the rows. "Axis 1" runs along the column headers, pointing from left to right, and represents the columns:

Panda DataFrame Axes.png

A Series has a single axis, "axis 0", which points in the same direction as "axis 0" of the DataFrame.
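One place where the axis direction matters is the axis argument accepted by many DataFrame methods. The sketch below uses sum() as an example, on sample data chosen for illustration: an operation along axis 0 moves down the rows and yields one result per column, while an operation along axis 1 moves across the columns and yields one result per row.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# axis=0: walk down the rows, one result per column
per_column = df.sum(axis=0)

# axis=1: walk across the columns, one result per row
per_row = df.sum(axis=1)
```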

The DataFrame axes property gives access to a list containing two Index instances, the first for the rows, the second one for the columns:

assert len(df.axes) == 2
print(df.axes)

[RangeIndex(start=0, stop=6, step=1), Index(['distance', 'strength'], dtype='object')]

Shape

The DataFrame shape property is a tuple that gives the dimensions of the DataFrame: (rows, columns).
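A short sketch of reading the shape, using sample data chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# shape is a plain tuple, so it can be unpacked directly
rows, columns = df.shape
```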

Index

By default, the DataFrame gets a RangeIndex.

However, the index of the DataFrame can be replaced with set_index().
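The following sketch shows the default RangeIndex and its replacement with set_index(). The choice of 'distance' as the new index is only an example. Note that set_index() returns a new DataFrame by default and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# Without an explicit index, the rows are labeled by a RangeIndex
default_index = df.index

# Use the 'distance' column as the row index; df itself is not modified
indexed = df.set_index('distance')
```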

The column labels are held in a generic Index, accessible through the columns property:

Index(['distance', 'strength'], dtype='object')

This kind of index is associated with the column axis of a DataFrame.
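As a minimal sketch on sample data, the column index can be read from the columns property, and it is the same object content as the second element of axes:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

column_index = df.columns            # Index holding the column labels
labels = list(column_index)          # plain Python list of the labels
same_as_axis_1 = column_index.equals(df.axes[1])
```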

Also see:

Pandas Concepts | Index

Create a DataFrame

Create a DataFrame Programmatically

import pandas as pd
df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

Shows up as:

Panda DataFrame.png

Create a DataFrame from a CSV File

import pandas as pd
df = pd.read_csv("./analysis.csv")
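To try read_csv() without a file on disk, the CSV text can be supplied through an in-memory buffer. This is a sketch with made-up sample content; it assumes the first line of the CSV is a header row, which is the read_csv() default.

```python
import io
import pandas as pd

# Sample CSV content; the first line provides the column names
csv_text = "distance,strength\n1,0.98\n2,0.97\n5,0.88\n"

# read_csv() accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))
```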

If the CSV file contains columns that need to be handled as time series, see:

Load a Time Series

Accessing Elements of a DataFrame

Accessing a Column

An individual column can be accessed with the [] operator, by specifying the column name. The result is a Series:

df = ...
df['Date']
type(df['Date']) # displays pandas.core.series.Series
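A self-contained variant of the same access, using sample data in place of the elided DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

s = df['distance']        # select one column by name
series_name = s.name      # the Series is named after the column
```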

iloc[]

A property that allows integer-based access (indexing). The location is specified as a 0-based index position. The property accepts a wide variety of arguments.

Used in the following situations:

Extract a Series from the DataFrame

iloc[] can be used to extract a Series from the DataFrame. The first argument is a slice that selects the rows, with : selecting all rows, and the second argument is the 0-based position of the column in the DataFrame. The resulting Series keeps the row index of the DataFrame, which is a RangeIndex by default:

df = ...
# extract a series corresponding to DataFrame column 0
s = df.iloc[:,0]
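The sketch below, on sample data, runs the same extraction end to end and also shows a partial row slice. The Series index matches the DataFrame's row index: a RangeIndex in the default case, or the custom index if one was set.

```python
import pandas as pd

df = pd.DataFrame({
    'distance': [1, 2, 5, 8, 10, 25],
    'strength': [0.98, 0.97, 0.88, 0.45, 0.20, 0.02]
})

# All rows of the column at position 0
s = df.iloc[:, 0]

# Rows at positions 0, 1 and 2 of the column at position 1
first_three = df.iloc[0:3, 1]

# With a custom row index, the extracted Series carries that index
relabeled = df.set_index('distance').iloc[:, 0]
```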

loc[]

A property that allows label-based access (indexing).
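A minimal sketch of label-based access. The row labels 'a', 'b' and 'c' are hypothetical, chosen so that labels are clearly distinct from integer positions. Unlike positional slices, a label slice includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame(
    {'distance': [1, 2, 5], 'strength': [0.98, 0.97, 0.88]},
    index=['a', 'b', 'c'],   # hypothetical row labels
)

value = df.loc['b', 'strength']          # single cell by row and column label
row = df.loc['a']                        # one row, returned as a Series
subset = df.loc['a':'b', ['distance']]   # label slice, both ends included
```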

squeeze()

[]

Operations on DataFrames