Models need an interface to read and understand data. Pandas is a Python package used for data manipulation and exploratory data analysis before the data is submitted for model training. Its ability to read from and write to an extensive list of formats makes it a versatile tool for data science practitioners.
Goals¶
Set up and load data using pandas.
Select and update data.
Handle missing data.
Save processed data to a file.
Import data from a file.
import pandas as pd
import numpy as np
Create a Pandas Series and DataFrame¶
Pandas provides two types of classes for handling data:
Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
## Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
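Beyond the default integer index, a Series can carry custom labels, which pandas later uses to align data across operations. A minimal sketch (the labels and values here are illustrative):

```python
import pandas as pd

# Series with a custom string index
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s_labeled["b"])     # access by label -> 20
print(s_labeled.iloc[0])  # access by position -> 10
```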
## DataFrame
dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print(df)
A B C D
2023-01-01 -0.869058 -0.447785 0.280211 -0.045918
2023-01-02 -0.092181 1.527722 1.080601 0.557053
2023-01-03 -0.601719 1.098997 1.869991 0.571191
2023-01-04 0.401115 -0.731362 -0.626753 2.199811
2023-01-05 0.699923 -2.314103 2.045064 -0.076148
2023-01-06 0.620450 0.954470 -1.577435 -0.138895
## Creating a DataFrame by passing a dictionary of objects where the keys are the
## column labels and the values are the column values.
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20230102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
## DataFrame Column Types
print(df2.dtypes)
A float64
B datetime64[s]
C float32
D int32
E category
F object
dtype: object
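Column dtypes are not fixed; they can be converted with astype. A small sketch reusing two of the column definitions from df2 (rebuilt here so the snippet runs standalone):

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({"A": 1.0, "D": np.array([3] * 4, dtype="int32")})
# Cast the int32 column to float64
df2["D"] = df2["D"].astype("float64")
print(df2.dtypes)
```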
Viewing¶
Basic functions¶
df.head(n), shows the first n records.
df.tail(n), shows the last n records.
df.index, shows the current index of the data frame.
df.columns, shows the current columns of the data frame.
df.describe(), shows a quick statistical summary of the data.
df.T, transposes the data frame.
df.sort_index(), sorts the data frame by index.
df.sort_values(), sorts the data frame by values.
## DataFrame First and Last Rows
print('Head ====================')
print(df2.head(2))
print('Tail ====================')
print(df2.tail(3))
print('Index ====================')
print(df2.index)
print('Columns ====================')
print(df2.columns)
print('Numpy ====================')
print(df2.to_numpy())
print('Describe ====================')
print(df2.describe())
print('Transpose ====================')
print(df2.T)
print('Sort by Axis ====================')
print(df2.sort_index(axis=1, ascending=False))
print('Sort by Values ====================')
print(df2.sort_values(by="B"))
Head ====================
A B C D E F
0 1.0 2023-01-02 1.0 3 test foo
1 1.0 2023-01-02 1.0 3 train foo
Tail ====================
A B C D E F
1 1.0 2023-01-02 1.0 3 train foo
2 1.0 2023-01-02 1.0 3 test foo
3 1.0 2023-01-02 1.0 3 train foo
Index ====================
Index([0, 1, 2, 3], dtype='int64')
Columns ====================
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Numpy ====================
[[1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2023-01-02 00:00:00') 1.0 3 'train' 'foo']]
Describe ====================
A B C D
count 4.0 4 4.0 4.0
mean 1.0 2023-01-02 00:00:00 1.0 3.0
min 1.0 2023-01-02 00:00:00 1.0 3.0
25% 1.0 2023-01-02 00:00:00 1.0 3.0
50% 1.0 2023-01-02 00:00:00 1.0 3.0
75% 1.0 2023-01-02 00:00:00 1.0 3.0
max 1.0 2023-01-02 00:00:00 1.0 3.0
std 0.0 NaN 0.0 0.0
Transpose ====================
0 1 2 \
A 1.0 1.0 1.0
B 2023-01-02 00:00:00 2023-01-02 00:00:00 2023-01-02 00:00:00
C 1.0 1.0 1.0
D 3 3 3
E test train test
F foo foo foo
3
A 1.0
B 2023-01-02 00:00:00
C 1.0
D 3
E train
F foo
Sort by Axis ====================
F E D C B A
0 foo test 3 1.0 2023-01-02 1.0
1 foo train 3 1.0 2023-01-02 1.0
2 foo test 3 1.0 2023-01-02 1.0
3 foo train 3 1.0 2023-01-02 1.0
Sort by Values ====================
A B C D E F
0 1.0 2023-01-02 1.0 3 test foo
1 1.0 2023-01-02 1.0 3 train foo
2 1.0 2023-01-02 1.0 3 test foo
3 1.0 2023-01-02 1.0 3 train foo
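sort_values also accepts a list of columns, with a per-column sort direction. A sketch with made-up data (the team/score columns are illustrative):

```python
import pandas as pd

scores = pd.DataFrame({"team": ["x", "y", "x", "y"],
                       "score": [3, 1, 2, 4]})
# Sort by team ascending, then by score descending within each team
out = scores.sort_values(by=["team", "score"], ascending=[True, False])
print(out)
```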
Data Selection¶
Selecting data by using []¶
## Selection by rows with range query
print(df2[0:2])
## Selection by label
print('1. =========================')
print(df2.index)
print('2. =========================')
## Selecting all rows (:) with a subset of column labels:
print(df2.loc[:, ["A", "B"]])
print('3. =========================')
## For label slicing, both endpoints are included:
print(df2.loc[0:2, ["A", "B"]])
print('4. =========================')
## Selection by position
print(df2.iloc[1:3, 0:2])
print('5. =========================')
## Boolean indexing
print(df2[df2.A > 0])
A B C D E F
0 1.0 2023-01-02 1.0 3 test foo
1 1.0 2023-01-02 1.0 3 train foo
1. =========================
Index([0, 1, 2, 3], dtype='int64')
2. =========================
A B
0 1.0 2023-01-02
1 1.0 2023-01-02
2 1.0 2023-01-02
3 1.0 2023-01-02
3. =========================
A B
0 1.0 2023-01-02
1 1.0 2023-01-02
2 1.0 2023-01-02
4. =========================
A B
1 1.0 2023-01-02
2 1.0 2023-01-02
5. =========================
A B C D E F
0 1.0 2023-01-02 1.0 3 test foo
1 1.0 2023-01-02 1.0 3 train foo
2 1.0 2023-01-02 1.0 3 test foo
3 1.0 2023-01-02 1.0 3 train foo
Selecting data using loc,iloc¶
In pandas, the loc and iloc indexers are used to select data from a DataFrame, with the key difference being that loc uses labels (row and column names), while iloc uses integer positions (0-based indices).
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['NY', 'LA', 'Chicago']}
addr_df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
print('DataFrame:')
print(addr_df)
# Select row with label 'row2' and column with label 'Name'
print('Data in data[row2][name] using loc')
print(addr_df.loc['row2', 'Name'])
# Output: Bob
# Select all rows from label 'row1' to 'row3' (inclusive) and specific columns by label
print('Using the loc')
print(addr_df.loc['row1':'row3', ['Name', 'City']])
# Assuming the same DataFrame 'addr_df' as above
# Select the second row (position 1) and the first column (position 0)
print('Data in data[row2][name] using iloc')
print(addr_df.iloc[1, 0])
# Output: Bob
# Select the first three rows (positions 0, 1, 2) and the first two columns (positions 0, 1)
print('Using the iloc')
print(addr_df.iloc[0:3, 0:2])
DataFrame:
Name Age City
row1 Alice 25 NY
row2 Bob 30 LA
row3 Charlie 35 Chicago
Data in data[row2][name] using loc
Bob
Using the loc
Name City
row1 Alice NY
row2 Bob LA
row3 Charlie Chicago
Data in data[row2][name] using iloc
Bob
Using the iloc
Name Age
row1 Alice 25
row2 Bob 30
row3 Charlie 35
Selection by position¶
## Select a row
print('==========================')
print(addr_df.iloc[0])
## Select multiple rows and columns
print('==========================')
print(addr_df.iloc[1:3, 0:2])
## Lists of integer position locations:
print('==========================')
print(addr_df.iloc[[0, 2], [1, 2]])
## Slicing rows and select all the columns:
print('==========================')
print(addr_df.iloc[1:3, :])
## slicing columns explicitly
print('==========================')
print(addr_df.iloc[:, [0]])
==========================
Name Alice
Age 25
City NY
Name: row1, dtype: object
==========================
Name Age
row2 Bob 30
row3 Charlie 35
==========================
Age City
row1 25 NY
row3 35 Chicago
==========================
Name Age City
row2 Bob 30 LA
row3 Charlie 35 Chicago
==========================
Name
row1 Alice
row2 Bob
row3 Charlie
Selection by Boolean indexing¶
## Select rows where Age is greater than 25.
print(addr_df[addr_df.Age > 25])
print('==========================')
## Selecting values from a DataFrame where a boolean condition is met:
print(addr_df[addr_df['City'] == 'NY'])
print('==========================')
## Using isin() method for filtering
print(addr_df[addr_df['City'].isin(['NY', 'Chicago'])])
Name Age City
row2 Bob 30 LA
row3 Charlie 35 Chicago
==========================
Name Age City
row1 Alice 25 NY
==========================
Name Age City
row1 Alice 25 NY
row3 Charlie 35 Chicago
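Boolean conditions can also be combined with & (and) and | (or); each condition must be wrapped in parentheses because of operator precedence. A sketch using the same addr_df columns (rebuilt here so the snippet runs standalone):

```python
import pandas as pd

addr_df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"],
     "Age": [25, 30, 35],
     "City": ["NY", "LA", "Chicago"]},
    index=["row1", "row2", "row3"],
)
# Rows where Age > 25 AND City is not 'LA'
mask = (addr_df["Age"] > 25) & (addr_df["City"] != "LA")
print(addr_df[mask])
```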
Data Update¶
print(addr_df)
print('==========================')
## Setting a new column automatically aligns the data by the indexes:
addr_df['Country'] = ['USA', 'USA', 'USA']
print(addr_df)
print('==========================')
## updating values in a DataFrame
addr_df.loc[addr_df['Name'] == 'Alice', 'Age'] = 26
print(addr_df)
print('==========================')
Name Age City
row1 Alice 25 NY
row2 Bob 30 LA
row3 Charlie 35 Chicago
==========================
Name Age City Country
row1 Alice 25 NY USA
row2 Bob 30 LA USA
row3 Charlie 35 Chicago USA
==========================
Name Age City Country
row1 Alice 26 NY USA
row2 Bob 30 LA USA
row3 Charlie 35 Chicago USA
==========================
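Besides conditional assignment with loc, values can be updated in bulk with replace, which maps old values to new ones. A sketch (the city abbreviations and expansions are illustrative):

```python
import pandas as pd

addr_df = pd.DataFrame({"City": ["NY", "LA", "Chicago"]},
                       index=["row1", "row2", "row3"])
# Replace abbreviations with full names in one pass
addr_df["City"] = addr_df["City"].replace({"NY": "New York", "LA": "Los Angeles"})
print(addr_df)
```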
Adding and Dropping Columns¶
### Add a new column
addr_df['Profession'] = ['Engineer', 'Doctor', 'Artist']
print(addr_df)
print('==========================')
### Add column - salary
addr_df['Salary'] = [70000, 120000, 50000]
print(addr_df)
print('==========================')
### Add column - Experience, Tax
addr_df['Experience'] = [3, 8, 5]
addr_df['Tax'] = addr_df['Salary'] * 0.2
print(addr_df)
print('==========================')
### Drop a column Tax
addr_df = addr_df.drop(columns=['Tax'])
print(addr_df)
print('==========================')
## Increase Salary by 10% for experience of 5 years or more
## change dtype of salary to float
addr_df['Salary'] = addr_df['Salary'].astype(float)
addr_df.loc[addr_df['Experience'] >= 5, 'Salary'] *= 1.1
print(addr_df)
print('==========================')
## using at to update a single value
addr_df.at['row1', 'City'] = 'San Francisco'
print(addr_df)
print('==========================')
Name Age City Country Profession
row1 Alice 26 NY USA Engineer
row2 Bob 30 LA USA Doctor
row3 Charlie 35 Chicago USA Artist
==========================
Name Age City Country Profession Salary
row1 Alice 26 NY USA Engineer 70000
row2 Bob 30 LA USA Doctor 120000
row3 Charlie 35 Chicago USA Artist 50000
==========================
Name Age City Country Profession Salary Experience Tax
row1 Alice 26 NY USA Engineer 70000 3 14000.0
row2 Bob 30 LA USA Doctor 120000 8 24000.0
row3 Charlie 35 Chicago USA Artist 50000 5 10000.0
==========================
Name Age City Country Profession Salary Experience
row1 Alice 26 NY USA Engineer 70000 3
row2 Bob 30 LA USA Doctor 120000 8
row3 Charlie 35 Chicago USA Artist 50000 5
==========================
Name Age City Country Profession Salary Experience
row1 Alice 26 NY USA Engineer 70000.0 3
row2 Bob 30 LA USA Doctor 132000.0 8
row3 Charlie 35 Chicago USA Artist 55000.0 5
==========================
Name Age City Country Profession Salary Experience
row1 Alice 26 San Francisco USA Engineer 70000.0 3
row2 Bob 30 LA USA Doctor 132000.0 8
row3 Charlie 35 Chicago USA Artist 55000.0 5
==========================
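drop removes rows as well as columns when given index labels. A minimal sketch with a made-up single-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["row1", "row2", "row3"])
# Drop a row by its index label (returns a new DataFrame)
smaller = df.drop(index=["row2"])
print(smaller)
```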
Missing data¶
Real-world data is expected to contain null or NaN values. Among the NumPy data types, np.nan represents missing data.
We can either delete or replace missing values within a DataFrame to prepare it for a model, depending on the specific context, data type, and expected outcome. This process is a crucial part of data cleaning and preprocessing.
## DataFrame with missing values
data_with_nan = {
'A': [1, 2, np.nan],
'B': [np.nan, 2, 3],
'C': [1, np.nan, np.nan]
}
df_with_nan = pd.DataFrame(data_with_nan)
print('DataFrame with NaN values:')
print(df_with_nan)
## Sum of NaN values in each column
print('Sum of NaN values in each column:')
print(df_with_nan.isnull().sum())
DataFrame with NaN values:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 NaN
2 NaN 3.0 NaN
Sum of NaN values in each column:
A 1
B 1
C 2
dtype: int64
Handling Missing Data - By Delete¶
# Handling missing data by deletion
print('Drop rows with any NaN values:')
# print(df_with_nan.dropna())
## Drop columns that contain null values (in place)
df_with_nan.dropna(axis=1, inplace=True)
# print('Fill NaN values with 0:')
# print(df_with_nan.fillna(0))
print('Sum of NaN values in each column:')
print(df_with_nan.isna().sum())
Drop rows with any NaN values:
Sum of NaN values in each column:
Series([], dtype: float64)
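dropna also offers finer control: thresh keeps rows that have at least a given number of non-null values, and subset restricts the null check to specific columns. A sketch on the same small frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [np.nan, 2, 3],
                   "C": [1, np.nan, np.nan]})
# Keep rows with at least 2 non-null values
print(df.dropna(thresh=2))
# Drop rows only when column 'A' is null
print(df.dropna(subset=["A"]))
```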
Handling the missing data - By Replace¶
## Handling the missing data - replace by mean
data_with_nan = {
'A': [1, 2, np.nan],
'B': [np.nan, 2, 3],
'C': [1, np.nan, np.nan]
}
df_with_nan = pd.DataFrame(data_with_nan)
print('DataFrame with NaN values:')
print(df_with_nan)
# Fill NaN values with the mean of each column
df_filled = df_with_nan.fillna(df_with_nan.mean())
print('DataFrame after filling NaN values with column means:')
print(df_filled)
## Verify no NaN values remain
print('Sum of NaN values in each column after filling:')
print(df_filled.isna().sum())
DataFrame with NaN values:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 NaN
2 NaN 3.0 NaN
DataFrame after filling NaN values with column means:
A B C
0 1.0 2.5 1.0
1 2.0 2.0 1.0
2 1.5 3.0 1.0
Sum of NaN values in each column after filling:
A 0
B 0
C 0
dtype: int64
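Besides a fixed value or the column mean, gaps can be filled from neighboring rows with ffill (carry the last valid value forward) or bfill (pull the next valid value backward). A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill())  # propagate the last valid value forward
print(s.bfill())  # propagate the next valid value backward
```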
Importing and exporting data¶
## Exporting data to a CSV file.
df_filled.to_csv("foo.csv")
## Importing the exported CSV file
imported_df = pd.read_csv("foo.csv", index_col=0)
print('Imported DataFrame from CSV:')
print(imported_df)
Imported DataFrame from CSV:
A B C
0 1.0 2.5 1.0
1 2.0 2.0 1.0
2 1.5 3.0 1.0
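The same read/write pattern extends to many other formats, such as JSON, Excel, and Parquet (the latter two need optional dependencies like openpyxl or pyarrow). A sketch round-tripping through a JSON string instead of a file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"A": [1.0, 2.0], "B": [2.5, 2.0]})
json_str = df.to_json()                      # serialize to a JSON string
restored = pd.read_json(StringIO(json_str))  # read back (a file path works the same way)
print(restored)
```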