Lesson 22. Working With Data In Pandas

Pandas are built on top of Numpy. It is one of the most used and preferred data analysis library. Data manipulation with Pandas become a lot easier and intuitve. If you are familiar with Numpy arrays, then moving onto pandas would be a lot easier. You can convert Numpy array into pandas dataframe by passing array object to Pandas' DataFrame function.

The cool thing about pandas is that it can take data from multiple sources (like numpy arrays, Ecxcel sheets in csv formant, or SQL databases) and creates table like grid having rows and columns that is very similar to format we see in relational databases. If you are familiar with R language, you will see similarities too.

Pandas also allows us to easily access a portion of data using indexing and perform operations on that portion of data. Perfoming operations on a portion of data, especially when there are multiple lists, becomes cumbersome using builtin python lists.

You can do away with a lot of overhead, when you load tabular data into a pandas DataFrame, as you can see most commonly used statistical information like mean,average,max,std,count with just one describe() function.

Examples

Example #1: Importing Data From CSV File

You can import any excel sheet into pandas DataFrame, but for sake of this tutorial, I am going to load this open source titanic dataset.

In [7]:
import pandas as pd
import shutil
import glob
import os

if not 'script_dir' in globals():
    script_dir = os.getcwd()
data_directory = 'data\\'
example_directory = 'PandasExample\\'
source_file_name = 'titanic.csv'
target_file_name = 'female_dataset.csv'

source_path = os.path.join(script_dir, data_directory, example_directory, file_name)
target_path = os.path.join(script_dir, data_directory, example_directory, target_file_name)

#Import and show top five rows.
dataset = pd.read_csv(source_path)
dataset.head(5)
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Example #2: Exploring Your Data

Below is some basic exploritory data analysis.

In [4]:
dataset.describe()
Out[4]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [5]:
#Let's only select femal passenger's data
female_dataset = dataset[dataset.Sex == "female"]
female_dataset.head(5)
Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Example #3: Writing DataFrames To Disk

Let's save the new dataset into csv file Piple delimited. The default delimiter is comma (You can use any delimiter). Run the code and check the example folder for the new file.

In [8]:
female_dataset.to_csv(target_path, sep='|')

Copyright © 2020, Mass Street Analytics, LLC. All Rights Reserved.