Exercise 3¶
Due date: Please complete this exercise by the end of day on Thursday the 10th of February.
Exercise 3 - Start your assignment
You can start working on your personal copy of the Exercise by:
Note that if you are using GitHub Classroom for the first time, it may ask for permission to verify your GitHub identity. In that case, choose “Authorize GitHub Classroom”.
You can also take a look at the open course copy of Exercise 3 in the course GitHub repository (does not require logging in). Note that you should not try to make changes to this copy of the exercise, but rather only to the copy available via GitHub Classroom.
Cloud computing environment¶
After you have your personal exercise in GitHub, start doing the programming using CSC Notebooks:
Using Git¶
Note
We will use git and GitHub when working with the exercises. You can find instructions for using git and the Jupyter Lab git plugin here.
Hints¶
How to easily create an interactive visualization from pandas or geopandas?¶
It is easy to create an interactive visualization from data stored in a pandas DataFrame or a geopandas GeoDataFrame using the hvplot library. To do this, we only need to import the pandas extension (hvplot.pandas), which adds plotting capabilities to our DataFrames and GeoDataFrames. GeoDataFrames additionally provide their own explore() method for interactive maps. For example:
import hvplot.pandas
import geopandas as gpd
# Fetch sample data (gpd.datasets was removed in geopandas 1.0, so this call needs an older version)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(2)
| | pop_est | continent | name | iso_a3 | gdp_md_est | geometry |
|---|---|---|---|---|---|---|
| 0 | 920938 | Oceania | Fiji | FJI | 8374.0 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
| 1 | 53950935 | Africa | Tanzania | TZA | 150600.0 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
# Plot all countries as an interactive map
world.explore()
# Plot African countries on top of CartoDB Positron baselayer style
africa = world.loc[world["continent"]=="Africa"].copy()
# See all parameters from: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html
africa.explore(width=500, height=500, tiles="CartoDB positron", color=None, style_kwds=dict(fillOpacity=0.4, color="green"))
How to count values in pandas and visualize them interactively?¶
import pandas as pd
import seaborn as sns
# Load some sample data
data = sns.load_dataset('flights')
data.head(3)
| | year | month | passengers |
|---|---|---|---|
| 0 | 1949 | Jan | 112 |
| 1 | 1949 | Feb | 118 |
| 2 | 1949 | Mar | 132 |
# Let's take a random sample from the data for demonstration purposes
data = data.sample(n=70)
data.shape
(70, 3)
# Add day info as the first day of the month
data["day"] = 1
# Convert month names to integers
data["month"] = pd.to_datetime(data["month"], format="%b").dt.month
# Generate datetime index from year, month and day
data["time"] = pd.to_datetime(data[["year", "month", "day"]])
# Convert the time to timestamp string with specific format (Year-Month-Day Hour:Minute:Second)
data["timestamp"] = data["time"].dt.strftime("%Y-%m-%d %H:%M:%S")
# Set the time as index
data = data.set_index("time")
data.head(2)
| time | year | month | passengers | day | timestamp |
|---|---|---|---|---|---|
| 1951-03-01 | 1951 | 3 | 178 | 1 | 1951-03-01 00:00:00 |
| 1958-01-01 | 1958 | 1 | 340 | 1 | 1958-01-01 00:00:00 |
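The date handling above can be reproduced on a tiny hand-made table. The sketch below is a minimal illustration, assuming the same column names as the flights data; the three rows are illustrative values and the variable name sample is hypothetical.

```python
import pandas as pd

# Hypothetical three-row table mimicking the structure of the flights data
sample = pd.DataFrame(
    {
        "year": [1949, 1949, 1950],
        "month": ["Jan", "Feb", "Mar"],
        "passengers": [112, 118, 141],
    }
)

# Convert abbreviated month names to month numbers (Jan -> 1, Feb -> 2, ...)
sample["month"] = pd.to_datetime(sample["month"], format="%b").dt.month

# Use the first day of each month
sample["day"] = 1

# pd.to_datetime can assemble a datetime from columns named year, month and day
sample["time"] = pd.to_datetime(sample[["year", "month", "day"]])

# A datetime index is what makes time-based resampling possible later on
sample = sample.set_index("time")
print(sample.index)
```

The column names year, month and day matter here: pd.to_datetime recognizes them when it is given a DataFrame.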
# Count how many values there are in the "month" column for each year (resampling keyword: "A")
# For the most common ways to resample temporal data (e.g. at a one-minute frequency), see: https://stackoverflow.com/a/19821311
data["month"].resample("A").count()
time
1949-12-31 4
1950-12-31 9
1951-12-31 7
1952-12-31 6
1953-12-31 8
1954-12-31 6
1955-12-31 5
1956-12-31 6
1957-12-31 5
1958-12-31 7
1959-12-31 4
1960-12-31 3
Freq: A-DEC, Name: month, dtype: int64
As we can see, each year now has a different number of months because we randomly picked 70 months from our data.
We can plot the counts interactively (hvplot() draws a line graph by default; use .hvplot.bar() for a bar graph) by:
data["timestamp"].resample("A").count().hvplot()
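As a cross-check on the yearly counts, the same numbers can be obtained by grouping on the calendar year of the index. This is a minimal sketch with made-up timestamps, offered as an alternative that does not depend on the resampling keyword (whose spelling has changed between pandas versions); the names values and counts_per_year are hypothetical.

```python
import pandas as pd

# Hypothetical observations: two in 1949, one in 1950, none in 1951, one in 1952
times = pd.to_datetime(["1949-01-01", "1949-05-01", "1950-03-01", "1952-07-01"])
values = pd.Series([1, 5, 3, 7], index=times, name="month")

# Group by the calendar year of the index and count the rows in each group
counts_per_year = values.groupby(values.index.year).count()
print(counts_per_year)
```

One difference to keep in mind: resampling reports a count of 0 for a year without observations, whereas grouping simply omits that year (1951 in this sketch).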
How to plot an interactive histogram?¶
Plotting an interactive histogram can be done in a similar manner to the monthly counts above.
# Load some sample data
data = sns.load_dataset('penguins')
data.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
# Plot a histogram showing the counts of "flipper_length_mm" attribute
data.hvplot.hist("flipper_length_mm")
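A histogram is a count of values per bin. The sketch below computes such counts directly with NumPy for the five flipper lengths shown in the table above, using three bins chosen only for illustration. It shows what the plot summarizes, not how hvplot computes it internally.

```python
import numpy as np
import pandas as pd

# The first five flipper lengths (mm) from the table, including one missing value
flipper = pd.Series([181.0, 186.0, 195.0, np.nan, 193.0])

# np.histogram cannot infer a bin range when NaN is present, so drop missing values first
clean = flipper.dropna()

# Split the value range into three equal-width bins and count the values in each
counts, edges = np.histogram(clean, bins=3)
print(counts)
print(edges)
```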
More detailed instructions for parsing trajectory statistics¶
When looping over the groups in Problem 3:
- Check if the maximum value of the column speed is zero. If it is, do not proceed further and continue to the next iteration. If it is not:
  - Create a TrajectoryCollection from the group using vehicle_id as the identifier.
  - Add speed to the TrajectoryCollection (overwrite the existing one).
  - If there are any trajectories in the TrajectoryCollection, select the first one (each collection should contain only a single Trajectory because we have grouped the data ourselves based on “vehicle_id”).
  - Split the trajectory so that a 5-minute time gap between observations starts a new trajectory. This step detects whether there were multiple trips along the same route and direction (which is likely because the same vehicle travels the same route many times during the day). You can use mpd.ObservationGapSplitter() (https://movingpandas.readthedocs.io/en/latest/trajectorysplitter.html#movingpandas.ObservationGapSplitter) to split the observations.
  - Iterate over the split trajectories and:
    - Calculate the speed in kilometers per hour (into column speed_kmph) based on the speed information, which is reported as meters per second.
    - Based on speed_kmph, select only the rows that have a speed over 1 kilometer per hour.
    - If there are no observations left after the selection in the previous step, do not proceed but continue to the next iteration. Otherwise:
      - Create a LineString geometry from the trajectory (you can take advantage of the .to_linestring() method that is part of the Trajectory class).
      - Calculate the average speed of the trajectory based on speed_kmph.
      - Calculate the standard deviation of the speed of the trajectory (again based on speed_kmph).
      - Parse other useful information from the trajectory (each should be a single value): vehicle_id, route_id, direction_id, start_time (i.e. the first timestamp of the trajectory).
      - After you have parsed all the previous information, make a GeoDataFrame out of it and store it in the results list.
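The filtering and summary logic of the steps above can be illustrated without movingpandas. The sketch below is a pandas-only stand-in, not the movingpandas API: it splits on time gaps with diff() and cumsum() instead of mpd.ObservationGapSplitter(), stores plain dictionaries instead of GeoDataFrames, and omits the LineString step. All data values and the names obs, trip_number and summary are hypothetical.

```python
import pandas as pd

# Hypothetical observations for one vehicle: two trips separated by a 20-minute gap
obs = pd.DataFrame(
    {
        "vehicle_id": [7, 7, 7, 7, 7, 7],
        "speed": [0.0, 5.0, 10.0, 0.1, 8.0, 12.0],  # meters per second
    },
    index=pd.to_datetime(
        [
            "2022-02-01 08:00:00",
            "2022-02-01 08:01:00",
            "2022-02-01 08:02:00",
            "2022-02-01 08:22:00",
            "2022-02-01 08:23:00",
            "2022-02-01 08:24:00",
        ]
    ),
)

results = []

# Skip the group entirely if the vehicle never moved
if obs["speed"].max() > 0:
    # A new trip starts wherever the gap to the previous observation exceeds 5 minutes
    gap = obs.index.to_series().diff() > pd.Timedelta(minutes=5)
    trip_number = gap.cumsum()

    for _, trip in obs.groupby(trip_number):
        trip = trip.copy()
        # Convert meters per second to kilometers per hour
        trip["speed_kmph"] = trip["speed"] * 3.6
        # Keep only rows where the vehicle moves faster than 1 km/h
        moving = trip.loc[trip["speed_kmph"] > 1]
        if moving.empty:
            continue
        results.append(
            {
                "vehicle_id": trip["vehicle_id"].iloc[0],
                "start_time": trip.index[0],
                "avg_speed_kmph": moving["speed_kmph"].mean(),
                "std_speed_kmph": moving["speed_kmph"].std(),
            }
        )

summary = pd.DataFrame(results)
print(summary)
```

In the actual exercise, each entry in results would also carry the trajectory geometry, route_id and direction_id.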