Pandas Concepts: Difference between revisions

From NovaOrdis Knowledge Base
 
(56 intermediate revisions by the same user not shown)
Line 1: Line 1:
=External=
* https://pandas.pydata.org/docs/user_guide/index.html#user-guide
=Internal=
* [[Pandas#Subjects|Pandas]]


=Overview=
<code>pandas</code> is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
<code>pandas</code> is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Because it was initially built to solve finance and business analytics problems, Pandas features especially deep time series functionality and tools well suited for working with time-indexed data. Pandas is built on top of [[Numpy Concepts#Overview|NumPy]]. Pandas provides two main data structures, [[Pandas_DataFrame|DataFrame]] and [[Pandas_Series|Series]], which share concepts like [[#Axis|axis]] and [[#Index|index]]. Pandas blends the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases, such as SQL. The Pandas name is derived from "panel data", an econometrics term for multidimensional structured datasets, and a play on the phrase "Python data analysis".
=Import Convention=
<syntaxhighlight lang='py'>
import pandas as pd
</syntaxhighlight>


Pandas is built on top of [[Numpy Concepts#Overview|NumPy]].
=Axis=
Both [[Pandas_DataFrame|DataFrames]] and [[Pandas_Series|Series]] use the concept of axis. An axis is a direction in a two-dimensional data matrix or a vector along which an operation, such as computing a mean, is applied. A DataFrame has [[Pandas_DataFrame#Axes|a row axis and a column axis]], while a Series has [[Pandas_Series#Axis|only one axis]]. The term comes from NumPy, whose [[Numpy_Concepts#ndarray|ndarray]] is used to implement the pandas [[Pandas_Series#Overview|Series]]. The [[#Index|indexes]] corresponding to the axes of a DataFrame or a Series are returned by the <code>axes</code> property.
 
=Index=
{{External|https://pandas.pydata.org/docs/reference/indexing.html}}
An index is an immutable sequence used to address data stored in a [[Pandas_DataFrame#Index|DataFrame]] or a [[Pandas_Series#Index|Series]]. It can be thought of as the labels of the individual elements for a Series, or as the row labels for a DataFrame. By default, it consists of 0-based monotonically increasing integers <span id='RangeIndex'></span>(<code>[[RangeIndex]]</code>), but it can also consist of string labels or, in the case of time series, of datetime instances <span id='DatetimeIndex'></span>(<code>[[DatetimeIndex]]</code>). Other indexes: <code>CategoricalIndex</code>, <code>MultiIndex</code>, <code>IntervalIndex</code>, <code>TimedeltaIndex</code>, <code>PeriodIndex</code>.
 
DataFrame and Series data elements can be addressed both by zero-based integer position, using the <code>iloc[]</code> syntax, and by index label, using the <code>loc[]</code> syntax.
 
For details related to DataFrame and Series indexes, see:
* [[Pandas_DataFrame#Index|DataFrame Index]]
* [[Pandas_Series#Index|Series Index]]


=<span id='Data_Frame'></span>DataFrame=
Line 12: Line 29:
=Series=
{{Internal|Pandas_Series|Series}}
=Axis=
==Time Series Processing with Pandas==
 
{{Internal|Time Series Processing with Pandas#Overview|Time Series Processing with Pandas}}
Both [[Pandas_DataFrame|DataFrames]] and [[Pandas_Series|Series]] use the concept of axis. <font color=darkkhaki>Formally define. By default, an axis consists of monotonically increasing integers with step 1, from 0 to length - 1</font>


=Index=
{{External|https://pandas.pydata.org/docs/reference/indexing.html}}
==RangeIndex==
<code>RangeIndex(start=0, stop=3, step=1)</code>
=Data Types=
==String==
==Datetime==
==<tt>Datetime</tt>==
<font color=darkkhaki>TO PROCESS: https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp
Date format when represented as a string: <code>YYYY-MM-DD</code>, for example "2023-10-01".
==<tt>Timestamp</tt>==
Pandas provides <code>pandas.Timestamp</code> as a replacement for the Python <code>[[Time,_Date,_Timestamp_in_Python#The_datetime.datetime_Type|datetime.datetime]]</code> object: {{External|https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp}}


Reported as <code> datetime64[ns]</code>. What is this?
To create a <code>Timestamp</code> object:
<syntaxhighlight lang='py'>
import pandas as pd


ts = pd.to_datetime('2023-10-01')
</syntaxhighlight>
Also see {{Internal|Pandas_Series#Create_a_Time_Series_from_CSV|Create a Time Series from CSV}}


</font>
=Visualization=
 
Both the [[Pandas_DataFrame|DataFrame]] and [[Pandas_Series|Series]] have a <code>plot()</code> method, <font color=darkkhaki>which delegates to [[Matplotlib|matplotlib]].</font>
 
=Data Loading=
==Excel==
{{Internal|Pandas Excel|Pandas Excel}}
==CSV==
{{Internal|Pandas CSV|Pandas CSV}}
 
=Datareader=
 
<syntaxhighlight lang='bash'>
pip3 install pandas_datareader
</syntaxhighlight>

Latest revision as of 00:02, 15 May 2024

External

Internal

Overview

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Because it was initially built to solve finance and business analytics problems, Pandas features especially deep time series functionality and tools well suited for working with time-indexed data. Pandas is built on top of NumPy. Pandas provides two main data structures, DataFrame and Series, which share concepts like axis and index. Pandas blends the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases, such as SQL. The Pandas name is derived from "panel data", an econometrics term for multidimensional structured datasets, and a play on the phrase "Python data analysis".
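A minimal sketch of the two data structures mentioned above may help. The column names, labels, and values are made up for illustration only.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.5, 11.0, 9.8], index=["a", "b", "c"], name="price")

# A DataFrame is a two-dimensional labeled table; each column is a Series.
frame = pd.DataFrame(
    {"price": [10.5, 11.0, 9.8], "volume": [100, 150, 90]},
    index=["a", "b", "c"],
)

# Selecting one column of a DataFrame yields a Series with the same index.
price_column = frame["price"]
print(type(price_column).__name__)
print(price_column.index.equals(prices.index))
```

Both structures carry an index, which is why a DataFrame column and a standalone Series built with the same labels line up.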

Import Convention

import pandas as pd

Axis

Both DataFrames and Series use the concept of axis. An axis is a direction in a two-dimensional data matrix or a vector along which an operation, such as computing a mean, is applied. A DataFrame has a row axis and a column axis, while a Series has only one axis. The term comes from NumPy, whose ndarray is used to implement the pandas Series. The indexes corresponding to the axes of a DataFrame or a Series are returned by the axes property.
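As an illustration of the axis concept, the sketch below computes means along each axis of a small DataFrame. The data is invented for the example.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# axis=0 runs down the rows, so the mean is computed per column.
column_means = frame.mean(axis=0)

# axis=1 runs across the columns, so the mean is computed per row.
row_means = frame.mean(axis=1)

# A DataFrame has two axes (row index, column index); a Series has one.
print(len(frame.axes), len(frame["x"].axes))
```

Here column_means holds one value per column and row_means holds one value per row.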

Index

https://pandas.pydata.org/docs/reference/indexing.html

An index is an immutable sequence used to address data stored in a DataFrame or a Series. It can be thought of as the labels of the individual elements for a Series, or as the row labels for a DataFrame. By default, it consists of 0-based monotonically increasing integers (RangeIndex), but it can also consist of string labels or, in the case of time series, of datetime instances (DatetimeIndex). Other indexes: CategoricalIndex, MultiIndex, IntervalIndex, TimedeltaIndex, PeriodIndex.
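The index kinds described above can be sketched as follows. The values and labels are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

# Without an explicit index, pandas assigns a RangeIndex starting at 0.
default_series = pd.Series([10, 20, 30])
print(default_series.index)

# String labels can be used as the index instead.
labeled_series = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Datetime values give a DatetimeIndex, the usual choice for time series.
dated_series = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-03"]),
)

# Index objects are immutable: assigning to an element raises TypeError.
try:
    labeled_series.index[0] = "z"
except TypeError as error:
    print("immutable:", type(error).__name__)
```

To change labels, a new index is assigned as a whole (for example through the index attribute) rather than by mutating an element in place.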

DataFrame and Series data elements can be addressed both by zero-based integer position, using the iloc[] syntax, and by index label, using the loc[] syntax.
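A short sketch of the two addressing styles, using invented labels and values:

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=["a", "b", "c"])

# iloc[] addresses by zero-based integer position.
by_position = series.iloc[1]

# loc[] addresses by index label.
by_label = series.loc["b"]

frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

# For a DataFrame, the first selector picks rows and the second picks columns.
cell_by_position = frame.iloc[2, 0]
cell_by_label = frame.loc["c", "x"]

print(by_position, by_label, cell_by_position, cell_by_label)
```

In this example both styles reach the same elements, because position 1 carries the label "b" and position 2 carries the label "c".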

For details related to DataFrame and Series indexes, see the DataFrame Index and Series Index pages.

DataFrame

DataFrame

Series

Series

Time Series Processing with Pandas

Time Series Processing with Pandas

Data Types

String

Datetime

Date format when represented as a string: YYYY-MM-DD, for example "2023-10-01".

Timestamp

Pandas provides pandas.Timestamp as a replacement for the Python datetime.datetime object:

https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html#pandas.Timestamp

To create a Timestamp object:

import pandas as pd

ts = pd.to_datetime('2023-10-01')
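The sketch below expands on the snippet above to show how a Timestamp relates to datetime.datetime. The exact datetime64 resolution reported for the dtype may differ between pandas versions, so the code only prints it.

```python
import datetime
import pandas as pd

ts = pd.to_datetime("2023-10-01")

# pandas.Timestamp subclasses datetime.datetime, so it works where a datetime is expected.
print(isinstance(ts, pd.Timestamp), isinstance(ts, datetime.datetime))

# The usual datetime attributes are available.
print(ts.year, ts.month, ts.day)

# A Timestamp can also be constructed directly.
same_ts = pd.Timestamp(year=2023, month=10, day=1)
print(ts == same_ts)

# Converting a list of strings yields a DatetimeIndex with a datetime64 dtype.
dates = pd.to_datetime(["2023-10-01", "2023-10-02"])
print(dates.dtype)
```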

Also see

Create a Time Series from CSV

Visualization

Both the DataFrame and Series have a plot() method, which delegates to matplotlib.

Data Loading

Excel

Pandas Excel

CSV

Pandas CSV

Datareader

pip3 install pandas_datareader