We’ve made it to assignment number 4! This was my last assignment in the first unit of my data science course. We’ll cover making visuals to represent your findings from the data, which will be crucial to your data science career! Let’s get to it!
Our first task is much like all of our other assignments so far: read in the url and create a DataFrame named Sleep
, then view the first five rows.
# Task 1
# Access sleep.csv with this url
sleep_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Sleep/Sleep.csv'
# Read in the sleep.csv file included above as a DataFrame named Sleep and print the first 5 rows.
import pandas as pd
import numpy as np
Sleep = pd.read_csv(sleep_url)
# View the DataFrame
Sleep.head()
In task 2, we will plot a histogram using matplotlib
! To start, you have to code fig, ax = plt.sublots()
. Once you’ve done this first step, you can create the histogram by using ax.hist()
, then entering in the column you want graphed. We will also update our x-axis, y-axis, and title. To have the graph visual print out, you’ll use plt.show()
for your last line. Voila!
# Task 2
# Import matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# Plot a histogram of Parasleep from the Sleep DataFrame
ax.hist(Sleep['Parasleep'])
# Specify the axis labels and plot title
ax.set_xlabel('Total Hours of Dreaming Sleep') ax.set_ylabel('Frequency')
ax.set_title('Daily dreaming sleep in mammal species') plt.show()
For task 3, we will prepare some data for plotting. We are challenged to create subsets of the Sleep[‘Danger’]
category, one if the value is equal to 1, one if the value is equal to 5. It’s always good to print out the newly created DataFrames, so you can double check that your code worked as you wanted it to. We can see that the first DF only has 1 for the Danger
value, and the second only has 5.
# Task 3
# Create danger category 1 and danger category 5
Danger_1 = Sleep[Sleep.Danger == 1]
Danger_5 = Sleep[Sleep.Danger == 5]
# View your DataFrames
print(Danger_1.head())
print(Danger_5.head())
Now, for task 4, we will practice with box plots. We’ll make a comparison of dreaming sleep time for mammals in low danger categories to high danger categories. We’ll still have to start with fig, ax = plt.subplots()
, but our next step will be ax.boxplot()
, instead of ax.hist()
. We can input our Danger_1
and Danger_5
DataFrames in to compare, and use the labels=
to distinguish between them. Again, we’ll add labels to our axis and a title. Use plt.show()
to view!
# Task 4 - Plotting
fig, ax = plt.subplots()
# Plot the side-by-side boxplots
fig, ax = plt.subplots()
ax.boxplot([Danger_1['Parasleep'] , Danger_5['Parasleep']], labels=['Least danger','Most danger'], vert=False)
# Label the figure
ax.set_xlabel('Hours of dreaming sleep')
ax.set_ylabel('Danger category')
ax.set_title('Dreaming sleep for mammals that experience low and high levels of danger')
plt.show()
In task five, we’ll sort our Sleep
DataFrame by Gest, and create a new DataFrame from this named Sleep_sorted
.
# Task 5
# Sort Sleep by Gest
Sleep_sorted = Sleep.sort_values('Gest')
# View the results
Sleep_sorted.head()
For task 6 we’re going to introduce a line plot! We’ll use the value of Gest
on the x-axis and Parasleep
on the y-axis for each mammal. In the plot
statement, we’ll specify ‘o’
for the marker, ‘dashdot’
for the style, and ‘b’
for the color. We’ll again set our labels and title, feel free to experiment with the different styles for the markers and colors on the plot!
# Task 6 - Plotting
fig, ax = plt.subplots()
ax.plot([Sleep_sorted['Parasleep'], Sleep_sorted['Gest']], marker='o', linestyle='dashdot', color='b')
ax.set_xlabel('Gestation Period')
ax.set_ylabel('Hours of Dreaming Sleep')
ax.set_title('Hours of Dreaming Sleep by Gestation Period')
plt.show()
In task 7, we’ll create a normalized DataFrame to show the proportion of mammals that are in each danger category. To do so, we’ll use .value_counts
, and set normalize=True
, and add .to_frame()
at the end of this statement to create a DataFrame. To view the numbers as a percent, we’ll just multiply the statement by 100.
# Task 7
Danger_prop = Sleep['Danger'].value_counts(normalize=True).to_frame()
Danger_pct = Sleep['Danger'].value_counts(normalize=True).to_frame()*100
# View the DataFrame
print(Danger_prop)print(Danger_pct)
For task 8 we’ll create a pie chart. We’ll use the Danger
variable from the Danger_pct
DataFrame, this time using ax.pie
. We’ll set the labels of the plot using Danger_pct.index
, and also assign some formatting information.
# Task 8 - Plotting
# Create the pie chart
fig, ax = plt.subplots()
ax.pie(Danger_pct['Danger'], labels=Danger_pct.index, autopct='%1.1f%%', startangle=90)
ax.set_title('Percent of mammals in each danger category')
plt.show()
For task 9 we’ll create a new feature in our Sleep
DataFrame called ‘Short life
’ to tell if the mammal lives less than 30 years, or 30 or more years. We’ll use a .loc[]
statement to do this. Then, we’ll turn this information into a DataFrame.
# Task 9
Sleep.loc[Sleep['Life']<30, 'Short life']= 'Less than 30 years'
Sleep.loc[Sleep['Life']>=30, 'Short life']= '30 or more years'
Life_counts = Sleep['Short life'].value_counts().to_frame()
# View the results
Life_counts.head()
For our last task, we’ll plot a bar chart to show the proportion of mammals that had a short life or long. We’ll set it up like our bar chart above, except since we only have two categories, which we want as our x-axis, we can use Life_counts.index
, instead of entering the columns separately for our x-axis like above. Since our labels on the x-axis for each bar are long, we can set the labels to be angled at 45 degrees using ax.set_xticklabels(rotation=45)
.
# Task 10 - Plotting
fig, ax = plt.subplots()
ax.bar(Life_counts.index, Life_counts['Short life'])
ax.set_xlabel('Mammal lifespan')
ax.set_xticklabels(Life_counts.index, rotation=45)
ax.set_ylabel('Frequency')
ax.set_title('Number of Mammals with Long and Short Life Expectancies')
plt.show()
There you have it! An introductory assignment on making different graphs to explain your data. Visualizations are the best way to report your findings to non-data science background colleagues. I recommend practicing with these and making some changes to see the different options. We’ll also be using seaborn
, another graphing package, in future assignments! We’ll be getting into hypothesis testing next week! See you then!