Table of Contents
Welcome! This notebook will teach you about using Pandas
in the Python Programming Language. By the end of this lab, you’ll know how to use Pandas
package to view and access data.
About the Dataset
The table has one row for each album and several columns
- artist: Name of the artist
- album: Name of the album
- released_year: Year the album was released
- length_min_sec: Length of the album (hours,minutes,seconds)
- genre: Genre of the album
- music_recording_sales_millions: Music recording sales (millions in USD) on [SONG://DATABASE]
- claimed_sales_millions: Album’s claimed sales (millions in USD) on [SONG://DATABASE]
- date_released: Date on which the album was released
- soundtrack: Indicates if the album is the movie soundtrack (Y) or (N)
- rating_of_friends: Indicates the rating from your friends from 1 to 10
You can see the dataset here:
Artist | Album | Released | Length | Genre | Music recording sales (millions) | Claimed sales (millions) | Released | Soundtrack | Rating (friends) |
Michael Jackson | Thriller | 1982 | 00:42:19 | Pop, rock, R&B | 46 | 65 | 30-Nov-82 | 10 | |
AC/DC | Back in Black | 1980 | 00:42:11 | Hard rock | 26.1 | 50 | 25-Jul-80 | 8.5 | |
Pink Floyd | The Dark Side of the Moon | 1973 | 00:42:49 | Progressive rock | 24.2 | 45 | 01-Mar-73 | 9.5 | |
Whitney Houston | The Bodyguard | 1992 | 00:57:44 | Soundtrack/R&B, soul, pop | 26.1 | 50 | 25-Jul-80 | Y | 7 |
Meat Loaf | Bat Out of Hell | 1977 | 00:46:33 | Hard rock, progressive rock | 20.6 | 43 | 21-Oct-77 | 7 | |
Eagles | Their Greatest Hits (1971-1975) | 1976 | 00:43:08 | Rock, soft rock, folk rock | 32.2 | 42 | 17-Feb-76 | 9.5 | |
Bee Gees | Saturday Night Fever | 1977 | 01:15:54 | Disco | 20.6 | 40 | 15-Nov-77 | Y | 9 |
Fleetwood Mac | Rumours | 1977 | 00:40:01 | Soft rock | 27.9 | 40 | 04-Feb-77 | 9.5 |
Introduction of Pandas
# Dependency needed to install file
!pip install xlrd
Collecting xlrd Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB) Installing collected packages: xlrd Successfully installed xlrd-2.0.1
In [3]:
# Import required library
import pandas as pd
After the import command, we now have access to a large number of pre-built classes and functions. This assumes the library is installed; in our lab environment all the necessary libraries are installed. One way pandas allows you to work with data is a dataframe. Let’s go through the process to go from a comma separated values (.csv) file to a dataframe. This variable csv_path
stores the path of the .csv, that is used as an argument to the read_csv
function. The result is stored in the object df
, this is a common short form used for a variable referring to a Pandas dataframe.
# Read data from CSV file
csv_path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Datasets/TopSellingAlbums.csv'
df = pd.read_csv(csv_path)
head() method
We can use the method head()
to examine the first five rows of a dataframe:
# Print first five rows of the dataframe
df.head()
Out[5]:
Artist | Album | Released | Length | Genre | Music Recording Sales (millions) | Claimed Sales (millions) | Released.1 | Soundtrack | Rating | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | 0:42:19 | pop, rock, R&B | 46.0 | 65 | 30-Nov-82 | NaN | 10.0 |
1 | AC/DC | Back in Black | 1980 | 0:42:11 | hard rock | 26.1 | 50 | 25-Jul-80 | NaN | 9.5 |
2 | Pink Floyd | The Dark Side of the Moon | 1973 | 0:42:49 | progressive rock | 24.2 | 45 | 01-Mar-73 | NaN | 9.0 |
3 | Whitney Houston | The Bodyguard | 1992 | 0:57:44 | R&B, soul, pop | 27.4 | 44 | 17-Nov-92 | Y | 8.5 |
4 | Meat Loaf | Bat Out of Hell | 1977 | 0:46:33 | hard rock, progressive rock | 20.6 | 43 | 21-Oct-77 | NaN | 8.0 |
We use the path of the excel file and the function read_excel
. The result is a data frame as before:
# Read data from Excel File and print the first five rows
xlsx_path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Datasets/TopSellingAlbums.xlsx'
df = pd.read_excel(xlsx_path)
df.head()
Artist | Album | Released | Length | Genre | Music Recording Sales (millions) | Claimed Sales (millions) | Released.1 | Soundtrack | Rating | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Michael Jackson | Thriller | 1982 | 00:42:19 | pop, rock, R&B | 46.0 | 65 | 1982-11-30 | NaN | 10.0 |
1 | AC/DC | Back in Black | 1980 | 00:42:11 | hard rock | 26.1 | 50 | 1980-07-25 | NaN | 9.5 |
2 | Pink Floyd | The Dark Side of the Moon | 1973 | 00:42:49 | progressive rock | 24.2 | 45 | 1973-03-01 | NaN | 9.0 |
3 | Whitney Houston | The Bodyguard | 1992 | 00:57:44 | R&B, soul, pop | 27.4 | 44 | 1992-11-17 | Y | 8.5 |
4 | Meat Loaf | Bat Out of Hell | 1977 | 00:46:33 | hard rock, progressive rock | 20.6 | 43 | 1977-10-21 | NaN | 8.0 |
Column length
We can access the column Length and assign it a new dataframe x:
# Access to the column Length
x = df[['Length']]
x
The process is shown in the figure:
Viewing Data and Accessing Data
You can also get a column as a series. You can think of a Pandas series as a 1-D dataframe. Just use one bracket:
# Get the column as a series
x = df['Length']
x
0 0:42:19 1 0:42:11 2 0:42:49 3 0:57:44 4 0:46:33 5 0:43:08 6 1:15:54 7 0:40:01 Name: Length, dtype: object
You can also get a column as a dataframe. For example, we can assign the column Artist:
# Get the column as a dataframe
x = type(df[['Artist']])
x
pandas.core.frame.DataFrame
Access to multiple columns
You can do the same thing for multiple columns; we just put the dataframe name, in this case, df
, and the name of the multiple column headers enclosed in double brackets. The result is a new dataframe comprised of the specified columns:
# Access to multiple columns
y = df[['Artist','Length','Genre']]
y
Artist | Length | Genre | |
---|---|---|---|
0 | Michael Jackson | 0:42:19 | pop, rock, R&B |
1 | AC/DC | 0:42:11 | hard rock |
2 | Pink Floyd | 0:42:49 | progressive rock |
3 | Whitney Houston | 0:57:44 | R&B, soul, pop |
4 | Meat Loaf | 0:46:33 | hard rock, progressive rock |
5 | Eagles | 0:43:08 | rock, soft rock, folk rock |
6 | Bee Gees | 1:15:54 | disco |
7 | Fleetwood Mac | 0:40:01 | soft rock |
The process is shown in the figure:
One way to access unique elements is the iloc
method, where you can access the 1st row and the 1st column as follows:
# Access the value on the first row and the first column
df.iloc[0, 0]
'Michael Jackson'
You can access the 2nd row and the 1st column as follows:
# Access the value on the second row and the first column
df.iloc[1,0]
'AC/DC'
You can access the 1st row and the 3rd column as follows:
# Access the value on the first row and the third column
df.iloc[0,2]
1982
You can access the column using the name as well, the following are the same as above:
# Access the column using the name
df.loc[0, 'Artist']
'Michael Jackson'
# Access the column using the name
df.loc[1, 'Artist']
'AC/DC'
# Access the column using the name
df.loc[0, 'Released']
1982
# Access the column using the name
df.loc[1, 'Released']
1980
Slicing
You can perform slicing using both the index and the name of the column:
# Slicing the dataframe
df.iloc[0:2, 0:3]
Artist | Album | Released | |
---|---|---|---|
0 | Michael Jackson | Thriller | 1982 |
1 | AC/DC | Back in Black | 1980 |
# Slicing the dataframe using name
df.loc[0:2, 'Artist':'Released']
Artist | Album | Released | |
---|---|---|---|
0 | Michael Jackson | Thriller | 1982 |
1 | AC/DC | Back in Black | 1980 |
2 | Pink Floyd | The Dark Side of the Moon | 1973 |
Quiz on DataFrame
Use a variable q
to store the column Rating as a dataframe
Your answer is below:
q = df[[‘Rating’]]
q
Assign the variable q
to the dataframe that is made up of the column Released and Artist:
Your answer is below:
q = df[[‘Released’, ‘Artist’]]
q
Access the 2nd row and the 3rd column of df
:
Your answer is below:
df.iloc[1, 2]