Exercise 3¶

Due date: Please complete this exercise by the end of day on Thursday the 10th of February.

Exercise 3 - Start your assignment

You can start working on your personal copy of the exercise by accepting the GitHub Classroom assignment.

Note that if you are using GitHub Classroom for the first time, it might ask for permission to verify your GitHub identity. In that case, choose “Authorize GitHub Classroom”.

You can also take a look at the open course copy of Exercise 3 in the course GitHub repository (does not require logging in). Note that you should not try to make changes to this copy of the exercise, but rather only to the copy available via GitHub Classroom.

Cloud computing environment¶

After you have your personal exercise in GitHub, start programming using CSC Notebooks.

Using Git¶

Note

We will use git and GitHub when working with the exercises. You can find instructions for using git and the Jupyter Lab git plugin here.

Hints¶

How to easily create an interactive visualization from pandas or geopandas?¶

It is very easy to create an interactive visualization from any data stored in a pandas DataFrame or a geopandas GeoDataFrame using the hvplot library. To do this, we just need to import the pandas extension (hvplot.pandas), which adds interactive plotting capabilities to our DataFrames and GeoDataFrames. GeoDataFrames additionally have a built-in .explore() method for interactive maps, which the map examples here use. You can then, for example, do the following:

import hvplot.pandas
import geopandas as gpd

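# Fetch sample data
# (note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this call needs an older geopandas version)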
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(2)

	pop_est	continent	name	iso_a3	gdp_md_est	geometry
0	920938	Oceania	Fiji	FJI	8374.0	MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1	53950935	Africa	Tanzania	TZA	150600.0	POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...

# Plot all countries as an interactive map
world.explore()

# Plot African countries on top of CartoDB Positron baselayer style
africa = world.loc[world["continent"]=="Africa"].copy()

# See all parameters from: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html
africa.explore(
    width=500,
    height=500,
    tiles="CartoDB positron",
    color=None,
    style_kwds=dict(fillOpacity=0.4, color="green"),
)

How to count values in pandas and visualize them interactively?¶

import pandas as pd
import seaborn as sns

# Load some sample data
data = sns.load_dataset('flights')

data.head(3)

	year	month	passengers
0	1949	Jan	112
1	1949	Feb	118
2	1949	Mar	132

# Let's take a random sample from the data for demonstration purposes
data = data.sample(n=70)
data.shape

(70, 3)

# Add day info as the first day of the month
data["day"] = 1

# Convert month names to integers
data["month"] = pd.to_datetime(data["month"], format="%b").dt.month

# Generate datetime index from year, month and day
data["time"] = pd.to_datetime(data[["year", "month", "day"]])

# Convert the time to timestamp string with specific format (Year-Month-Day Hour:Minute:Second)
data["timestamp"] = data["time"].dt.strftime("%Y-%m-%d %H:%M:%S")

# Set the time as index
data = data.set_index("time")
data.head(2)

	year	month	passengers	day	timestamp
time
1951-03-01	1951	3	178	1	1951-03-01 00:00:00
1958-01-01	1958	1	340	1	1958-01-01 00:00:00

# Count how many values there are in the "month" column per year (frequency keyword: "A")
# Note: pandas 2.2 and later prefer the keyword "YE" over "A"
# The most typical ways to resample temporal data (e.g. at a minutely frequency) are listed here: https://stackoverflow.com/a/19821311
data["month"].resample("A").count()

time
1949-12-31    4
1950-12-31    9
1951-12-31    7
1952-12-31    6
1953-12-31    8
1954-12-31    6
1955-12-31    5
1956-12-31    6
1957-12-31    5
1958-12-31    7
1959-12-31    4
1960-12-31    3
Freq: A-DEC, Name: month, dtype: int64

As we can see, each year now has a different number of months because we randomly picked 70 months from our data.

We can plot the counts as an interactive graph as follows (.hvplot() draws a line chart by default; use .hvplot.bar() to get a bar graph):

data["timestamp"].resample("A").count().hvplot()
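
The frequency keyword passed to resample controls the size of the time bins. As a minimal sketch, assuming a small made-up series (the obs DataFrame and its timestamps are hypothetical and not part of the exercise data), counting observations per minute could look like this:

```python
import pandas as pd

# Hypothetical observations with irregular timestamps (not from the exercise data)
obs = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.to_datetime(
        [
            "2022-02-01 08:00:10",
            "2022-02-01 08:00:40",
            "2022-02-01 08:01:05",
            "2022-02-01 08:03:30",
            "2022-02-01 08:03:59",
        ]
    ),
)

# Count observations per minute; "min" is the minutely frequency keyword
per_minute = obs["value"].resample("min").count()
print(per_minute)
```

Minutes without any observations get a count of zero. Other frequency strings, such as "5min" or "D", can be used in the same way.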

How to plot an interactive histogram?¶

Plotting an interactive histogram can be done in a similar manner as we did above with the counts.

# Load some sample data
data = sns.load_dataset('penguins')
data.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

# Plot a histogram showing the counts of "flipper_length_mm" attribute
data.hvplot.hist("flipper_length_mm")

More detailed instructions for parsing trajectory statistics¶

When looping over the groups in Problem 3:

- Check if the maximum value of the column speed is zero. If it is, do not proceed further and continue to the next iteration. If it is not:
  - Create a TrajectoryCollection from the group using vehicle_id as the identifier.
  - Add speed to the TrajectoryCollection (overwrite the existing one).
  - If there are any trajectories in the TrajectoryCollection, select the first trajectory from it (each collection should only have a single Trajectory because we have grouped the data ourselves based on “vehicle_id”).
  - Split the trajectory so that whenever there is a 5-minute time gap between observations, the trajectory is split into multiple ones. This step is used to detect if there were multiple trips on the same route and direction (which is likely because the same vehicle travels the same route many times during the day). You can use mpd.ObservationGapSplitter (https://movingpandas.readthedocs.io/en/latest/trajectorysplitter.html#movingpandas.ObservationGapSplitter) to split the observations.
  - Iterate over the split trajectories and:
    - Calculate the speed in kilometers per hour (into column speed_kmph) based on the speed information, which is reported as meters per second.
    - Based on speed_kmph, select only the rows that have a speed over 1 kilometer per hour.
    - If there are no observations left after the selection in the previous step, do not proceed but continue to the next iteration. Otherwise:
      - Create a LineString geometry from the trajectory (you can take advantage of the .to_linestring() method that is part of the Trajectory class).
      - Calculate the average speed of the trajectory based on speed_kmph.
      - Calculate the standard deviation of the speed of the trajectory (again based on speed_kmph).
      - Parse other useful information from the trajectory (each should be a single value): vehicle_id, route_id, direction_id, start_time (i.e. the first timestamp of the trajectory).
      - After you have parsed all the previous information, make a GeoDataFrame out of it and store it in the results list.
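
The gap splitting, unit conversion and speed statistics described above can be illustrated with a small pandas-only sketch. Note that this is a simplified stand-in and not the movingpandas workflow itself: it splits trips with a plain time difference instead of mpd.ObservationGapSplitter, it skips the geometry step, and the obs data as well as the column names trip, avg_speed_kmph and std_speed_kmph are made up for illustration.

```python
import pandas as pd

# Hypothetical observations for one vehicle; speed is in meters per second
obs = pd.DataFrame(
    {
        "vehicle_id": ["bus_1"] * 6,
        "speed": [0.0, 5.0, 10.0, 0.2, 8.0, 12.0],
    },
    index=pd.to_datetime(
        [
            "2022-02-01 08:00:00",
            "2022-02-01 08:01:00",
            "2022-02-01 08:02:00",
            "2022-02-01 08:20:00",  # more than 5 minutes after the previous row
            "2022-02-01 08:21:00",
            "2022-02-01 08:22:00",
        ]
    ),
)

# Start a new trip whenever the gap to the previous observation exceeds 5 minutes
gaps = obs.index.to_series().diff() > pd.Timedelta(minutes=5)
obs["trip"] = gaps.cumsum()

results = []
for trip_id, trip in obs.groupby("trip"):
    trip = trip.copy()

    # Convert meters per second to kilometers per hour (1 m/s = 3.6 km/h)
    trip["speed_kmph"] = trip["speed"] * 3.6

    # Keep only the rows where the speed is over 1 km/h
    moving = trip.loc[trip["speed_kmph"] > 1]

    # Nothing left after the selection: continue to the next iteration
    if moving.empty:
        continue

    results.append(
        {
            "vehicle_id": trip["vehicle_id"].iloc[0],
            "start_time": trip.index[0],  # first timestamp of the trip
            "avg_speed_kmph": moving["speed_kmph"].mean(),
            "std_speed_kmph": moving["speed_kmph"].std(),
        }
    )

summary = pd.DataFrame(results)
print(summary)
```

In the actual exercise, each split trajectory would additionally provide a LineString geometry, and the per-trajectory results would be stored as GeoDataFrames rather than plain dictionaries.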