Lesson 22. Working With Data In Pandas

Pandas are built on top of Numpy. It is one of the most used and preferred data analysis library. Data manipulation with Pandas become a lot easier and intuitve. If you are familiar with Numpy arrays, then moving onto pandas would be a lot easier. You can convert Numpy array into pandas dataframe by passing array object to Pandas' DataFrame function.

The cool thing about pandas is that it can take data from multiple sources (like numpy arrays, Ecxcel sheets in csv formant, or SQL databases) and creates table like grid having rows and columns that is very similar to format we see in relational databases. If you are familiar with R language, you will see similarities too.

Pandas also allows us to easily access a portion of data using indexing and perform operations on that portion of data. Perfoming operations on a portion of data, especially when there are multiple lists, becomes cumbersome using builtin python lists.

You can do away with a lot of overhead, when you load tabular data into a pandas DataFrame, as you can see most commonly used statistical information like mean,average,max,std,count with just one describe() function.

Examples

Example #1: Importing Data From CSV File

You can import any excel sheet into pandas DataFrame, but for sake of this tutorial, I am going to load this open source titanic dataset.

import pandas as pd
import shutil
import glob
import os

if not 'script_dir' in globals():
    script_dir = os.getcwd()
data_directory = 'data\\'
example_directory = 'PandasExample\\'
source_file_name = 'titanic.csv'
target_file_name = 'female_dataset.csv'

source_path = os.path.join(script_dir, data_directory, example_directory, file_name)
target_path = os.path.join(script_dir, data_directory, example_directory, target_file_name)

#Import and show top five rows.
dataset = pd.read_csv(source_path)
dataset.head(5)

Example #2: Exploring Your Data

Below is some basic exploritory data analysis.

dataset.describe()

#Let's only select femal passenger's data
female_dataset = dataset[dataset.Sex == "female"]
female_dataset.head(5)

Example #3: Writing DataFrames To Disk

Let's save the new dataset into csv file Piple delimited. The default delimiter is comma (You can use any delimiter). Run the code and check the example folder for the new file.

female_dataset.to_csv(target_path, sep='|')

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C