CNI Hackathon 2022

Data Science Challenge using BMTC Dataset

Background

Task

Dataset

Submission

Evaluation criteria

Leaderboard

Useful links

Background

We have a dataset containing information about the buses travelling in Bengaluru. We have obtained it from Bengaluru Metropolitan Transport Corporation (BMTC).
The region of interest is a square area of approximately 40 km by 40 km. See the following figure:
The data was collected from around two thousand buses over one day, between 7:00 am and 7:00 pm.
The buses follow different routes within the city.
Each bus is identified with a unique ID. A bus carries a device which records the data: latitude, longitude, speed, and timestamp.

Task

Create a model to estimate the travel time, in minutes, between source-destination pairs using the provided dataset.

Dataset

The dataset (download link) consists of the following three files:

BMTC.parquet.gzip: It contains the GPS traces of around two thousand buses.
Input.csv: It contains the geographical coordinates of various source-destination pairs.
GroundTruth.csv: It contains the ground truth travel times between the source-destination pairs provided in Input.csv. It is provided to help participants assess their solutions.

The contents of these files are described in detail below:

`BMTC.parquet.gzip`:

The file contains information in five columns, described as follows:

BusID: The (unique) ID associated with the device present in a bus.
Latitude: Latitude (geographical coordinate) of a bus, as recorded by the device.
Longitude: Longitude (geographical coordinate) of the bus, as recorded by the device.
Speed: Instantaneous speed of the bus in kmph.
Timestamp: Timestamp in Indian Standard Time (IST), in the format yyyy-mm-dd HH:MM:SS.

The following snapshot from the dataset illustrates the format:

	BusID	Latitude	Longitude	Speed	Timestamp
0	150212121	13.06593	77.45269	20	2019-08-01 18:59:18
1	150212121	13.06627	77.45211	27	2019-08-01 18:59:28
2	150212121	13.06661	77.45152	24	2019-08-01 18:59:38
3	150212121	13.06697	77.45089	28	2019-08-01 18:59:48
4	150212121	13.06727	77.45035	26	2019-08-01 18:59:58
5	150218000	13.00571	77.68619	46	2019-08-01 07:22:33
6	150218000	13.00525	77.68542	35	2019-08-01 07:22:42
7	150218000	13.00504	77.68509	0	2019-08-01 07:22:51
8	150218000	13.00504	77.68509	0	2019-08-01 07:23:01
9	150218000	13.00498	77.68497	13	2019-08-01 07:23:11

Note: The devices may not record data at uniform sampling intervals. The recordings may also be noisy.
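The note about sampling intervals can be checked directly on the traces. The following is a minimal sketch, not an official utility: it rebuilds the ten snapshot rows shown above as a pandas dataframe and computes the gap, in seconds, between consecutive records of the same bus. The function name `sampling_intervals` and the column name `IntervalSec` are illustrative assumptions; on the full dataset the dataframe would instead come from reading BMTC.parquet.gzip.

```python
import pandas as pd

# Rows copied from the snapshot above. For the full dataset one would load the
# parquet file instead, e.g. pd.read_parquet("data/BMTC.parquet.gzip")
# (the exact path is an assumption).
snapshot = pd.DataFrame(
    {
        "BusID": [150212121] * 5 + [150218000] * 5,
        "Latitude": [13.06593, 13.06627, 13.06661, 13.06697, 13.06727,
                     13.00571, 13.00525, 13.00504, 13.00504, 13.00498],
        "Longitude": [77.45269, 77.45211, 77.45152, 77.45089, 77.45035,
                      77.68619, 77.68542, 77.68509, 77.68509, 77.68497],
        "Speed": [20, 27, 24, 28, 26, 46, 35, 0, 0, 13],
        "Timestamp": [
            "2019-08-01 18:59:18", "2019-08-01 18:59:28", "2019-08-01 18:59:38",
            "2019-08-01 18:59:48", "2019-08-01 18:59:58", "2019-08-01 07:22:33",
            "2019-08-01 07:22:42", "2019-08-01 07:22:51", "2019-08-01 07:23:01",
            "2019-08-01 07:23:11",
        ],
    }
)


def sampling_intervals(traces):
    """Add IntervalSec: seconds since the previous record of the same bus."""
    traces = traces.copy()
    traces["Timestamp"] = pd.to_datetime(
        traces["Timestamp"], format="%Y-%m-%d %H:%M:%S"
    )
    traces = traces.sort_values(["BusID", "Timestamp"])
    # The first record of each bus has no predecessor, so its interval is NaN.
    gaps = traces.groupby("BusID")["Timestamp"].diff()
    traces["IntervalSec"] = gaps.dt.total_seconds()
    return traces


intervals = sampling_intervals(snapshot)
# Smallest and largest gap per bus; differing values indicate uneven sampling.
print(intervals.groupby("BusID")["IntervalSec"].agg(["min", "max"]))
```

In the snapshot, one bus reports every 10 seconds while the other alternates between 9 and 10 seconds, so any speed or travel-time calculation should use the actual timestamps rather than assume a fixed step.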

`Input.csv`:

The file contains four columns, described as follows:

Source_Lat: The latitude of a source.
Source_Long: The longitude of a source.
Dest_Lat: The latitude of a destination.
Dest_Long: The longitude of a destination.

The following illustrates the format of a typical input file:

	Source_Lat	Source_Long	Dest_Lat	Dest_Long
0	13.067272	77.45035	13.00525	77.68542
1	13.005042	77.68509	13.06627	77.45211
2	13.065925	77.45269	13.00498	77.68497
3	13.005247	77.68542	13.06661	77.45152

`GroundTruth.csv`:

The file contains a single column, TT, which holds the actual travel time for each source-destination pair. The value in the i-th row is the travel time for the i-th source-destination pair in Input.csv.

The following illustrates the format of a typical ground truth file:

	TT
0	1.99
1	6.21
2	7.34
3	5.20

You can use the ground truth from the dataset to check if your code is working well.
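Since the i-th row of GroundTruth.csv corresponds to the i-th row of Input.csv, the two can be placed side by side by row position. The sketch below uses the illustrative rows shown above instead of reading the files; the variable names are assumptions, and with the real files the two dataframes would come from `pd.read_csv` on the respective paths.

```python
import pandas as pd

# Illustrative rows from the Input.csv and GroundTruth.csv examples above.
pairs = pd.DataFrame(
    {
        "Source_Lat": [13.067272, 13.005042, 13.065925, 13.005247],
        "Source_Long": [77.45035, 77.68509, 77.45269, 77.68542],
        "Dest_Lat": [13.00525, 13.06627, 13.00498, 13.06661],
        "Dest_Long": [77.68542, 77.45211, 77.68497, 77.45152],
    }
)
truth = pd.DataFrame({"TT": [1.99, 6.21, 7.34, 5.20]})

# Rows are matched by position, so both files must have the same length.
if len(pairs) != len(truth):
    raise ValueError("Input and ground truth have different numbers of rows")

# Reset the indexes so the concatenation aligns purely by row position.
labelled = pd.concat(
    [pairs.reset_index(drop=True), truth.reset_index(drop=True)], axis=1
)
print(labelled)
```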

Output (Estimated Travel Time)

Your output will be the estimated travel time (ETT), in minutes, between a given source-destination pair. For each source-destination pair, you should fill this value in the ETT column of a pandas dataframe, as illustrated below:

	Source_Lat	Source_Long	Dest_Lat	Dest_Long	ETT
0	13.067272	77.45035	13.00525	77.68542	2.34
1	13.005042	77.68509	13.06627	77.45211	5.51
2	13.065925	77.45269	13.00498	77.68497	3.72
3	13.005247	77.68542	13.06661	77.45152	5.13
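As a rough illustration of how the ETT column might be filled, the sketch below uses a deliberately naive baseline that is not part of the challenge description: the straight-line (haversine) distance between source and destination divided by a constant speed. The function names, the Earth-radius constant, and the 20 kmph speed are all assumptions made for illustration; an actual submission would be expected to learn travel times from the GPS traces.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0     # mean Earth radius, a common approximation
ASSUMED_SPEED_KMPH = 20.0    # hypothetical constant speed, not from the dataset


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_baseline_ett(pairs, speed_kmph=ASSUMED_SPEED_KMPH):
    """Return a copy of `pairs` with an ETT column in minutes."""
    out = pairs.copy()
    distance_km = haversine_km(
        out["Source_Lat"], out["Source_Long"], out["Dest_Lat"], out["Dest_Long"]
    )
    out["ETT"] = distance_km / speed_kmph * 60.0  # hours to minutes
    return out


# Source-destination pairs from the illustration above.
pairs = pd.DataFrame(
    {
        "Source_Lat": [13.067272, 13.005042, 13.065925, 13.005247],
        "Source_Long": [77.45035, 77.68509, 77.45269, 77.68542],
        "Dest_Lat": [13.00525, 13.06627, 13.00498, 13.06661],
        "Dest_Long": [77.68542, 77.45211, 77.68497, 77.45152],
    }
)
result = add_baseline_ett(pairs)
print(result)
```

The straight-line distance ignores the road network, traffic, and stops, so this baseline is only useful as a sanity check on the output format and as a reference point for better models.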

Submission

Team size: At most two individuals.
Programming language: Python (3.5 or later)
Packages: Participants may use commonly available Python packages such as pandas, geopandas, numpy, scikit-learn, pyarrow, matplotlib, scipy, math, string, random, datetime, etc. If a submission uses other packages, we reserve the right not to consider or evaluate it for the Hackathon.
Step 1: Download the dataset (download link). Create a folder named data and copy the three dataset files into it.
Step 2: Create a GitHub repository as per the folder structure illustrated in the example below (see image):

where, for the purpose of illustration, srika_DS_456AB is the GitHub repository, and data contains the dataset files.
(We will provide the repository name to each participating individual or team.)

We require the participants to keep their respective repositories private for the duration of the hackathon.
Step 3: Build EstimatedTravelTime() within Predict.py:
Ensure that the following holds (refer to the above example for the folder structure):
- data folder contains the data file (BMTC.parquet.gzip), the input file (Input.csv), and the ground truth file (GroundTruth.csv). We are providing all these files. These files remain unchanged while a team works on the problem.
- The GitHub repository (srika_DS_456AB folder, in the above example) contains the following:
  1. Python code: Name the Python code file as Predict.py and build the function EstimatedTravelTime() that predicts the travel time between source-destination pairs.
    Use relative paths as per the Predict.py template given below:
  2. Instruction file: If there are any special instructions that are required for us to run your code, please add them in an Instructions.txt file.
Step 4 (before the hackathon deadline): Add cnihackathon22 as a collaborator to your GitHub repository and grant it Read access.
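For Step 3, the official Predict.py template is not reproduced in this text, so the following is only a hypothetical sketch of how such a file could be laid out with relative paths. The function name `EstimatedTravelTime` and the file names come from the instructions above; the function's arguments, the constant placeholder estimate, and the assumption that the script is run from the repository root (next to the data folder) are illustrative guesses and should be replaced by whatever the official template specifies.

```python
from pathlib import Path

import pandas as pd

# Assumption: the script is run from the repository root, so that the data
# folder can be reached through a relative path.
DATA_DIR = Path("data")
TRACES_PATH = DATA_DIR / "BMTC.parquet.gzip"
INPUT_PATH = DATA_DIR / "Input.csv"


def EstimatedTravelTime(traces, pairs):
    """Hypothetical signature: return `pairs` with an added ETT column (minutes).

    `traces` is the GPS trace dataframe and `pairs` holds the
    source-destination coordinates.
    """
    result = pairs.copy()
    # Placeholder only: a real model would derive this value from `traces`.
    result["ETT"] = 0.0
    return result


if __name__ == "__main__":
    if TRACES_PATH.exists() and INPUT_PATH.exists():
        traces = pd.read_parquet(TRACES_PATH)
        # Whether Input.csv carries an index column is not stated; adjust
        # read_csv arguments to match the actual file.
        pairs = pd.read_csv(INPUT_PATH)
        print(EstimatedTravelTime(traces, pairs).head())
    else:
        print("Dataset files not found under", DATA_DIR)
```

Keeping the paths relative means the same code runs unchanged when the repository is cloned elsewhere and evaluated against different input files.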

Evaluation criteria

Teams should build models that estimate the travel time between a source and a destination, and the code should also work on inputs beyond those provided. The model should aim to minimize the mean absolute difference between the actual and predicted values (\(L_1\) error).
We will fetch the last GitHub commit made before the submission deadline and evaluate it.
We will have our own test input files against which the submissions will be evaluated.
The Jury: We will shortlist the 5-6 best-performing teams and ask them to present and explain their submissions before a jury.
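The error measure named in the criteria above, the mean absolute difference \(\frac{1}{n}\sum_{i=1}^{n} |TT_i - ETT_i|\), can be computed locally against GroundTruth.csv. The sketch below is not the organizers' evaluation script; the function name is an assumption, and the numbers are the illustrative TT and ETT values shown in the Dataset and Output sections.

```python
import pandas as pd


def mean_absolute_error_minutes(actual, predicted):
    """Mean absolute difference between actual and predicted travel times."""
    actual = pd.Series(actual, dtype="float64").reset_index(drop=True)
    predicted = pd.Series(predicted, dtype="float64").reset_index(drop=True)
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return float((actual - predicted).abs().mean())


# Illustrative values from the ground truth and output examples.
ground_truth = pd.DataFrame({"TT": [1.99, 6.21, 7.34, 5.20]})
output = pd.DataFrame({"ETT": [2.34, 5.51, 3.72, 5.13]})

mae = mean_absolute_error_minutes(ground_truth["TT"], output["ETT"])
print(round(mae, 4))  # 1.185
```

For these four rows the absolute differences are 0.35, 0.70, 3.62, and 0.07, which average to 1.185 minutes.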

Leaderboard

The leaderboard lists the valid submissions and the final results. Many congratulations to the winners!

Useful links

Dataset: download link

Contact us

For queries, email us at: admin@cnihackathon.in
