
Pandas Essentials

Models need an interface to read and understand data. pandas is a Python package for data manipulation and exploratory data analysis, typically used to prepare data before it is submitted for model training. Its ability to read from and write to an extensive list of formats makes it a versatile tool for data science practitioners.

Goal

  • Set up and load data using pandas.

  • Select and update data.

  • Handle missing data.

  • Save processed data to a file.

  • Import data from a file.

Setup and load data using pandas

Importing the package

import pandas as pd
import numpy as np

Create a pandas Series and DataFrame

Pandas provides two types of classes for handling data:

  • Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

  • DataFrame: a two-dimensional labeled data structure that holds data like a two-dimensional array or a table with rows and columns.

## Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
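By default a Series gets a 0..n-1 integer index, as above; it can also carry an explicit label index. A small illustrative sketch (the names here are made up for the example):

```python
import pandas as pd

# Sketch: a Series with explicit labels instead of the default integer
# index, accessed by label and aggregated with a built-in method.
scores = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")
print(scores["b"])    # access by label
print(scores.mean())  # aggregation methods are built in
```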
## DataFrame
dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
                   A         B         C         D
2023-01-01 -0.869058 -0.447785  0.280211 -0.045918
2023-01-02 -0.092181  1.527722  1.080601  0.557053
2023-01-03 -0.601719  1.098997  1.869991  0.571191
2023-01-04  0.401115 -0.731362 -0.626753  2.199811
2023-01-05  0.699923 -2.314103  2.045064 -0.076148
2023-01-06  0.620450  0.954470 -1.577435 -0.138895
## Creating a DataFrame by passing a dictionary of objects where the keys are the 
## column labels and the values are the column values.
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20230102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
## DataFrame Column Types
print(df2.dtypes)
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
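A single column's dtype can also be inspected on its own and converted with astype. A minimal sketch (the frame is rebuilt here so the snippet stands alone):

```python
import pandas as pd

# Sketch: inspect one column's dtype, then convert it with astype.
df = pd.DataFrame({"D": [3, 3, 3, 3]})
print(df["D"].dtype)                    # integer dtype
print(df["D"].astype("float64").dtype)  # converted to float64
```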

Viewing

Basic functions

  • df.head(n), shows the first n rows (default 5).

  • df.tail(n), shows the last n rows (default 5).

  • df.index, shows the current index of the data frame.

  • df.columns, shows the current columns of the data frame.

  • df.describe(), shows a quick statistical summary of the data.

  • df.T, transposes the data frame.

  • df.sort_index(), sorts the data frame by index.

  • df.sort_values(), sorts the data frame by values.

## DataFrame First and Last Rows
print('Head ====================')
print(df2.head(2))
print('Tail ====================')
print(df2.tail(3))
print('Index ====================')
print(df2.index)
print('Columns ====================')
print(df2.columns)
print('Numpy ====================')
print(df2.to_numpy())
print('Describe ====================')
print(df2.describe())
print('Transpose ====================')
print(df2.T)
print('Sort by Axis ====================')
print(df2.sort_index(axis=1, ascending=False))
print('Sort by Values ====================')
print(df2.sort_values(by="B"))
Head ====================
     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
Tail ====================
     A          B    C  D      E    F
1  1.0 2023-01-02  1.0  3  train  foo
2  1.0 2023-01-02  1.0  3   test  foo
3  1.0 2023-01-02  1.0  3  train  foo
Index ====================
Index([0, 1, 2, 3], dtype='int64')
Columns ====================
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Numpy ====================
[[1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'train' 'foo']]
Describe ====================
         A                    B    C    D
count  4.0                    4  4.0  4.0
mean   1.0  2023-01-02 00:00:00  1.0  3.0
min    1.0  2023-01-02 00:00:00  1.0  3.0
25%    1.0  2023-01-02 00:00:00  1.0  3.0
50%    1.0  2023-01-02 00:00:00  1.0  3.0
75%    1.0  2023-01-02 00:00:00  1.0  3.0
max    1.0  2023-01-02 00:00:00  1.0  3.0
std    0.0                  NaN  0.0  0.0
Transpose ====================
                     0                    1                    2  \
A                  1.0                  1.0                  1.0   
B  2023-01-02 00:00:00  2023-01-02 00:00:00  2023-01-02 00:00:00   
C                  1.0                  1.0                  1.0   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                  1.0  
B  2023-01-02 00:00:00  
C                  1.0  
D                    3  
E                train  
F                  foo  
Sort by Axis ====================
     F      E  D    C          B    A
0  foo   test  3  1.0 2023-01-02  1.0
1  foo  train  3  1.0 2023-01-02  1.0
2  foo   test  3  1.0 2023-01-02  1.0
3  foo  train  3  1.0 2023-01-02  1.0
Sort by Values ====================
     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
2  1.0 2023-01-02  1.0  3   test  foo
3  1.0 2023-01-02  1.0  3  train  foo
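Because every row of df2 is identical, the sorts above look like no-ops. A small frame with distinct values (an illustrative sketch) makes the effect visible:

```python
import pandas as pd

# Sketch: distinct values make sort_index / sort_values observable.
tiny = pd.DataFrame({"x": [3, 1, 2]}, index=["c", "a", "b"])
print(tiny.sort_index())         # rows reordered by index: a, b, c
print(tiny.sort_values(by="x"))  # rows reordered by x: 1, 2, 3
```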

Data Selection

Selecting data by using []

## Selection by rows with range query
print(df2[0:2])
## Selection by label
print('1. =========================')
print(df2.index)
print('2. =========================')
## Selecting all rows (:) with selected column labels:
print(df2.loc[:, ["A", "B"]])
print('3. =========================')
## For label slicing, both endpoints are included:
print(df2.loc[0:2, ["A", "B"]])
print('4. =========================')
## Selection by position
print(df2.iloc[1:3, 0:2])
print('5. =========================')
## Boolean indexing
print(df2[df2.A > 0])
     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
1. =========================
Index([0, 1, 2, 3], dtype='int64')
2. =========================
     A          B
0  1.0 2023-01-02
1  1.0 2023-01-02
2  1.0 2023-01-02
3  1.0 2023-01-02
3. =========================
     A          B
0  1.0 2023-01-02
1  1.0 2023-01-02
2  1.0 2023-01-02
4. =========================
     A          B
1  1.0 2023-01-02
2  1.0 2023-01-02
5. =========================
     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
2  1.0 2023-01-02  1.0  3   test  foo
3  1.0 2023-01-02  1.0  3  train  foo
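The boolean mask form df2[df2.A > 0] also has a string-expression equivalent, DataFrame.query. A quick sketch on a throwaway frame:

```python
import pandas as pd

# Sketch: query() takes a string expression equivalent to a boolean mask.
df = pd.DataFrame({"A": [1.0, -0.5, 2.0], "B": [0, 1, 2]})
print(df[df.A > 0])       # boolean indexing
print(df.query("A > 0"))  # same rows via a query string
```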

Selecting data using loc and iloc

  • In pandas, the loc and iloc indexers are used to select data from a DataFrame, with the key difference being that loc uses labels (row and column names), while iloc uses integer positions (0-based indices).

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['NY', 'LA', 'Chicago']}
addr_df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

print('DataFrame:')
print(addr_df)

# Select row with label 'row2' and column with label 'Name'
print('Data in data[row2][name] using loc')
print(addr_df.loc['row2', 'Name'])
# Output: Bob

# Select all rows from label 'row1' to 'row3' (inclusive) and specific columns by label
print('Using the loc')
print(addr_df.loc['row1':'row3', ['Name', 'City']])

# Assuming the same DataFrame 'addr_df' as above
# Select the second row (position 1) and the first column (position 0)
print('Data in data[row2][name] using iloc')
print(addr_df.iloc[1, 0])
# Output: Bob

# Select the first three rows (positions 0, 1, 2) and the first two columns (positions 0, 1)
print('Using the iloc')
print(addr_df.iloc[0:3, 0:2])
DataFrame:
         Name  Age     City
row1    Alice   25       NY
row2      Bob   30       LA
row3  Charlie   35  Chicago
Data in data[row2][name] using loc
Bob
Using the loc
         Name     City
row1    Alice       NY
row2      Bob       LA
row3  Charlie  Chicago
Data in data[row2][name] using iloc
Bob
Using the iloc
         Name  Age
row1    Alice   25
row2      Bob   30
row3  Charlie   35
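One detail worth calling out: loc slices include both endpoints, while iloc follows Python's usual half-open convention. A sketch on a frame of the same shape:

```python
import pandas as pd

addr = pd.DataFrame({"Age": [25, 30, 35]}, index=["row1", "row2", "row3"])
# loc slices are inclusive of both endpoints...
print(len(addr.loc["row1":"row3"]))  # all 3 rows
# ...while iloc slices exclude the stop position.
print(len(addr.iloc[0:2]))           # 2 rows: positions 0 and 1
```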

Selection by position

## Select a row
print('==========================')
print(addr_df.iloc[0])
## Select multiple rows and columns
print('==========================')
print(addr_df.iloc[1:3, 0:2])
## Lists of integer position locations:
print('==========================')
print(addr_df.iloc[[0, 2], [1, 2]])
## Slicing rows and select all the columns:
print('==========================')
print(addr_df.iloc[1:3, :])
## slicing columns explicitly
print('==========================')
print(addr_df.iloc[:, [0]])
==========================
Name    Alice
Age        25
City       NY
Name: row1, dtype: object
==========================
         Name  Age
row2      Bob   30
row3  Charlie   35
==========================
      Age     City
row1   25       NY
row3   35  Chicago
==========================
         Name  Age     City
row2      Bob   30       LA
row3  Charlie   35  Chicago
==========================
         Name
row1    Alice
row2      Bob
row3  Charlie

Selection by Boolean indexing

## Select rows where Age is greater than 25.
print(addr_df[addr_df.Age > 25])
print('==========================')
## Selecting values from a DataFrame where a boolean condition is met:
print(addr_df[addr_df['City'] == 'NY'])
print('==========================')
## Using isin() method for filtering
print(addr_df[addr_df['City'].isin(['NY', 'Chicago'])])
         Name  Age     City
row2      Bob   30       LA
row3  Charlie   35  Chicago
==========================
       Name  Age City
row1  Alice   25   NY
==========================
         Name  Age     City
row1    Alice   25       NY
row3  Charlie   35  Chicago
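Conditions can be combined with & (and) and | (or); each condition needs its own parentheses because these operators bind tighter than comparisons. A sketch on the same data:

```python
import pandas as pd

addr = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"],
     "Age": [25, 30, 35],
     "City": ["NY", "LA", "Chicago"]},
    index=["row1", "row2", "row3"],
)
# Age over 25 AND not in LA: only Charlie matches.
print(addr[(addr["Age"] > 25) & (addr["City"] != "LA")])
```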

Data Update

print(addr_df)
print('==========================')
## Setting a new column automatically aligns the data by the indexes:
addr_df['Country'] = ['USA', 'USA', 'USA']
print(addr_df)
print('==========================')
## updating values in a DataFrame
addr_df.loc[addr_df['Name'] == 'Alice', 'Age'] = 26
print(addr_df)
print('==========================')
         Name  Age     City
row1    Alice   25       NY
row2      Bob   30       LA
row3  Charlie   35  Chicago
==========================
         Name  Age     City Country
row1    Alice   25       NY     USA
row2      Bob   30       LA     USA
row3  Charlie   35  Chicago     USA
==========================
         Name  Age     City Country
row1    Alice   26       NY     USA
row2      Bob   30       LA     USA
row3  Charlie   35  Chicago     USA
==========================
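The comment above notes that setting a new column aligns by index. A sketch of what that alignment does when the assigned Series has shuffled or incomplete labels:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35]}, index=["row1", "row2", "row3"])
# The Series is matched by label, not position; labels absent from the
# Series ('row2') become NaN.
df["Bonus"] = pd.Series([100, 200], index=["row3", "row1"])
print(df)
```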

Adding and Dropping Columns

### Add a new column
addr_df['Profession'] = ['Engineer', 'Doctor', 'Artist']
print(addr_df)
print('==========================')
### Add column - salary
addr_df['Salary'] = [70000, 120000, 50000]
print(addr_df)
print('==========================')
### Add column - Experience, Tax
addr_df['Experience'] = [3, 8, 5]
addr_df['Tax'] = addr_df['Salary'] * 0.2
print(addr_df)
print('==========================')
### Drop a column Tax
addr_df = addr_df.drop(columns=['Tax'])
print(addr_df)
print('==========================')
## Increase Salary by 10% for experience of 5 or more years
## first change dtype of Salary to float
addr_df['Salary'] = addr_df['Salary'].astype(float)

addr_df.loc[addr_df['Experience'] >= 5, 'Salary'] *= 1.1
print(addr_df)
print('==========================')
## using at to update a single value
addr_df.at['row1', 'City'] = 'San Francisco'
print(addr_df)
print('==========================')

         Name  Age     City Country Profession
row1    Alice   26       NY     USA   Engineer
row2      Bob   30       LA     USA     Doctor
row3  Charlie   35  Chicago     USA     Artist
==========================
         Name  Age     City Country Profession  Salary
row1    Alice   26       NY     USA   Engineer   70000
row2      Bob   30       LA     USA     Doctor  120000
row3  Charlie   35  Chicago     USA     Artist   50000
==========================
         Name  Age     City Country Profession  Salary  Experience      Tax
row1    Alice   26       NY     USA   Engineer   70000           3  14000.0
row2      Bob   30       LA     USA     Doctor  120000           8  24000.0
row3  Charlie   35  Chicago     USA     Artist   50000           5  10000.0
==========================
         Name  Age     City Country Profession  Salary  Experience
row1    Alice   26       NY     USA   Engineer   70000           3
row2      Bob   30       LA     USA     Doctor  120000           8
row3  Charlie   35  Chicago     USA     Artist   50000           5
==========================
         Name  Age     City Country Profession    Salary  Experience
row1    Alice   26       NY     USA   Engineer   70000.0           3
row2      Bob   30       LA     USA     Doctor  132000.0           8
row3  Charlie   35  Chicago     USA     Artist   55000.0           5
==========================
         Name  Age           City Country Profession    Salary  Experience
row1    Alice   26  San Francisco     USA   Engineer   70000.0           3
row2      Bob   30             LA     USA     Doctor  132000.0           8
row3  Charlie   35        Chicago     USA     Artist   55000.0           5
==========================
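Columns can also be renamed instead of dropped and re-added; rename returns a new frame unless inplace=True is passed. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [70000.0, 120000.0]})
# rename maps old column labels to new ones and returns a new DataFrame.
df = df.rename(columns={"Salary": "AnnualSalary"})
print(df.columns.tolist())
```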

Missing data

  • Real-world data is expected to contain null or NaN values. In NumPy, np.nan represents missing data.

  • We can either delete or replace missing values in a DataFrame to prepare it for a model, depending on the specific context, data type, and expected outcome. This process is a crucial part of data cleaning and preprocessing.

## DataFrame with missing values
data_with_nan = {
    'A': [1, 2, np.nan],
    'B': [np.nan, 2, 3],
    'C': [1, np.nan, np.nan]
}
df_with_nan = pd.DataFrame(data_with_nan)
print('DataFrame with NaN values:')
print(df_with_nan)
## Sum of NaN values in each column
print('Sum of NaN values in each column:')
print(df_with_nan.isnull().sum())
DataFrame with NaN values:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  NaN
Sum of NaN values in each column:
A    1
B    1
C    2
dtype: int64

Handling Missing Data - By Delete

# Handling missing data
print('Drop rows with any NaN values:')
# print(df_with_nan.dropna())
## Drop columns containing NaN values, in place
df_with_nan.dropna(axis=1, inplace=True)
# print('Fill NaN values with 0:')
# print(df_with_nan.fillna(0))
print('Sum of NaN values in each column:')
print(df_with_nan.isna().sum())
Drop rows with any NaN values:
Sum of NaN values in each column:
Series([], dtype: float64)
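dropna offers finer-grained control than the all-or-nothing defaults shown above; a sketch of its how, subset, and thresh parameters on the same data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [np.nan, 2, 3],
                   "C": [1, np.nan, np.nan]})
print(df.dropna(how="all"))     # drop only rows that are entirely NaN
print(df.dropna(subset=["A"]))  # drop rows where column A is NaN
print(df.dropna(thresh=2))      # keep rows with at least 2 non-NaN values
```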

Handling Missing Data - By Replace

## Handling the missing data - replace by mean
data_with_nan = {
    'A': [1, 2, np.nan],
    'B': [np.nan, 2, 3],
    'C': [1, np.nan, np.nan]
}
df_with_nan = pd.DataFrame(data_with_nan)
print('DataFrame with NaN values:')
print(df_with_nan)
# Fill NaN values with the mean of each column
df_filled = df_with_nan.fillna(df_with_nan.mean())
print('DataFrame after filling NaN values with column means:')
print(df_filled)
## Verify no NaN values remain
print('Sum of NaN values in each column after filling:')
print(df_filled.isna().sum())
DataFrame with NaN values:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  NaN
DataFrame after filling NaN values with column means:
     A    B    C
0  1.0  2.5  1.0
1  2.0  2.0  1.0
2  1.5  3.0  1.0
Sum of NaN values in each column after filling:
A    0
B    0
C    0
dtype: int64
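Besides a column mean, gaps can be filled from neighbouring values; a sketch with ffill (forward-fill) and bfill (backward-fill):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill())  # forward-fill: 1.0, 1.0, 1.0, 4.0
print(s.bfill())  # backward-fill: 1.0, 4.0, 4.0, 4.0
```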

Importing and exporting data

## Export data to a CSV file.
df_filled.to_csv("foo.csv")
## Importing the exported CSV file
imported_df = pd.read_csv("foo.csv", index_col=0)
print('Imported DataFrame from CSV:')
print(imported_df)
Imported DataFrame from CSV:
     A    B    C
0  1.0  2.5  1.0
1  2.0  2.0  1.0
2  1.5  3.0  1.0
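CSV is only one of the many formats pandas can round-trip; a sketch of the same idea with JSON, using an in-memory buffer so no file is written:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"A": [1.0, 2.0], "B": [2.5, 2.0]})
buf = StringIO()
df.to_json(buf)                                    # serialize to the buffer
restored = pd.read_json(StringIO(buf.getvalue()))  # read it back
print(restored)
```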