Exercise 4#

Due date: Please complete this exercise by the end of day on Wednesday 28.2.

Exercise 4 - Start your assignment

You can start working on your personal copy of the exercise via GitHub Classroom.

Note that if you are using GitHub Classroom for the first time, it might ask you for permission to verify your GitHub identity. In that case, choose “Authorize GitHub Classroom”.

You can also take a look at the open course copy of Exercise 4 in the course GitHub repository (does not require logging in). Note that you should not try to make changes to this copy of the exercise, but rather only to the copy available via GitHub Classroom.

Cloud computing environment#

After you have your personal exercise in GitHub, start doing the programming using CSC Notebooks:

[Launch badge: CSC notebook]

Using Git#

Note

We will use git and GitHub when working with the exercises. You can find instructions for using git and the JupyterLab git plugin here.

Hints#

How to easily create an interactive visualization from pandas or geopandas?#

It is easy to create an interactive visualization from any data stored in a pandas DataFrame or a geopandas GeoDataFrame, using either the built-in geopandas method .explore() or the hvplot library. In this hint, we show both ways. To use the .hvplot() functionality, we need to import the pandas extension (hvplot.pandas), which adds plotting capabilities to our DataFrames and GeoDataFrames. You can then, for example, do the following:

import hvplot.pandas
import geopandas as gpd

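# Fetch sample data
# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0,
# so this call requires an older geopandas version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))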
world.head(2)
      pop_est continent      name iso_a3  gdp_md_est                                           geometry
0    889953.0   Oceania      Fiji    FJI        5496  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1  58005463.0    Africa  Tanzania    TZA       63177  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
# Plot all countries as an interactive map using .explore()
world.explore()
# Plot African countries on top of the CartoDark basemap tiles with hvplot
africa = world.loc[world["continent"]=="Africa"].copy()

# Plot using hvplot
africa.hvplot(geo=True, width=500, height=500, tiles="CartoDark", color=None, line_color="green", alpha=0.4)

How to count values in pandas and visualize them interactively?#

import pandas as pd
import seaborn as sns

# Load some sample data
data = sns.load_dataset('flights')
data.head(3)
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
# Let's take a random sample from the data for demonstration purposes
data = data.sample(n=70)
data.shape
(70, 3)
# Add day info as the first day of the month
data["day"] = 1

# Convert month names to integers
data["month"] = pd.to_datetime(data["month"], format="%b").dt.month

# Generate datetime index from year, month and day
data["time"] = pd.to_datetime(data[["year", "month", "day"]])

# Convert the time to timestamp string with specific format (Year-Month-Day Hour:Minute:Second)
data["timestamp"] = data["time"].dt.strftime("%Y-%m-%d %H:%M:%S")

# Set the time as index
data = data.set_index("time")
data.head(2)
            year  month  passengers  day            timestamp
time
1951-03-01  1951      3         178    1  1951-03-01 00:00:00
1958-01-01  1958      1         340    1  1958-01-01 00:00:00
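The date-handling steps above can be checked on a tiny hand-typed table. The following is a minimal sketch, assuming two rows that mirror the output shown above rather than loading the flights dataset; the variable name df is only for illustration.

```python
import pandas as pd

# Two hand-typed rows mirroring the output above (not loaded from seaborn)
df = pd.DataFrame(
    {"year": [1951, 1958], "month": ["Mar", "Jan"], "passengers": [178, 340]}
)

# Convert abbreviated month names ("Jan", "Feb", ...) to integers 1-12
df["month"] = pd.to_datetime(df["month"], format="%b").dt.month

# pd.to_datetime can assemble dates from columns named year, month and day
df["day"] = 1
df["time"] = pd.to_datetime(df[["year", "month", "day"]])

# Format the datetimes as text (Year-Month-Day Hour:Minute:Second)
df["timestamp"] = df["time"].dt.strftime("%Y-%m-%d %H:%M:%S")

# Use the datetime column as the index
df = df.set_index("time")
print(df)
```

Working through a small table like this makes it easy to see what each step adds before applying the same calls to the full sample.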
# Count how many values there are in the "month" column per year (frequency alias: "A")
# For other common resampling frequencies (e.g. per minute), see: https://stackoverflow.com/a/19821311
# Note: pandas 2.2 and later deprecate the "A" alias in favour of "YE"
data["month"].resample("A").count()
time
1949-12-31    4
1950-12-31    9
1951-12-31    7
1952-12-31    6
1953-12-31    8
1954-12-31    6
1955-12-31    5
1956-12-31    6
1957-12-31    5
1958-12-31    7
1959-12-31    4
1960-12-31    3
Freq: A-DEC, Name: month, dtype: int64

As we can see, each year now has a different number of months because we randomly picked 70 months from our data.
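As a cross-check on the yearly counts, the same kind of result can be obtained by grouping on the year of the datetime index, which does not depend on a frequency alias. This is a minimal sketch with hypothetical timestamps (not the flights sample); the names times, sample and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical month-start timestamps, for illustration only
times = pd.to_datetime(
    ["1949-01-01", "1949-05-01", "1950-02-01",
     "1950-03-01", "1950-11-01", "1952-07-01"]
)
sample = pd.DataFrame({"month": times.month.to_numpy()}, index=times)
sample.index.name = "time"

# Count the rows that fall in each calendar year
counts = sample["month"].groupby(sample.index.year).count()
print(counts)
```

One difference worth knowing: resampling reports a year with no rows as a count of 0, whereas grouping by year simply omits it (1951 in this sketch).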

We can plot the counts as an interactive graph (hvplot draws a line plot by default) by:

data["timestamp"].resample("A").count().hvplot()

How to plot an interactive histogram?#

Plotting an interactive histogram can be done in a similar manner as we did above with the counts.

# Load some sample data
data = sns.load_dataset('penguins')
data.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
# Plot a histogram showing the counts of "flipper_length_mm" attribute
data.hvplot.hist("flipper_length_mm")
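To see the kind of numbers a histogram displays, the bin counts can also be computed by hand. The sketch below uses NumPy rather than hvplot, with made-up flipper lengths and an arbitrary choice of 5 bins; it is an illustration of binning, not a description of how hvplot computes its bins internally.

```python
import numpy as np
import pandas as pd

# Made-up flipper lengths in millimetres, including one missing value
flippers = pd.Series([181.0, 186.0, 195.0, np.nan, 193.0, 190.0, 181.0, 210.0])

# Drop missing values, then count observations in 5 equal-width bins
values = flippers.dropna()
counts, edges = np.histogram(values, bins=5)

# Print each bin's range together with its count
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} - {right:6.1f}: {n}")
```

The counts always sum to the number of non-missing values, which is a quick sanity check when comparing against a plotted histogram.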