https://github.com/brunocampos01/fleet-management-corporation-data-engineering
Challenge to Senior Data Engineer
https://github.com/brunocampos01/fleet-management-corporation-data-engineering
data-engineering data-engineering-pipeline pyspark python
Last synced: 4 months ago
JSON representation
Challenge to Senior Data Engineer
- Host: GitHub
- URL: https://github.com/brunocampos01/fleet-management-corporation-data-engineering
- Owner: brunocampos01
- Created: 2023-02-25T02:45:12.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-06-28T01:19:14.000Z (4 months ago)
- Last Synced: 2025-06-28T02:26:17.280Z (4 months ago)
- Topics: data-engineering, data-engineering-pipeline, pyspark, python
- Language: Python
- Homepage:
- Size: 71.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Fleet management corporation
# Introduction
As a newly hired data engineer in a fleet management corporation,
you are asked to write an application to detect possible speeding
events in the stream of data coming from the customer's trucks.
Further, the application will be used by the data scientists team.
They are currently working on a classifier that predicts whether
a speeding event is going to happen in the nearest future or not.
Thereby, your application will be used to test the classifier.# Problem Statement
Your task will be to prepare raw data for further verification of
model/classifier performance.
## DataYou will be working with a tabular data of the following structure:
* `customer_id` - identifier of a customer,
* `vehicle_id` - identifier of a vehicle,
* `driver_id` - identifier of a driver,
* `location_x`, `location_y` - columns with x and y positions respectivelly on a 2D grid (in kilometers)
* `timespan` - datetime in epoch seconds,
* `speed_limit` - speed limit the vehicle should obey in a given moment (in km/h),
* `will_be_speeding` - an output from the classifier, whether a speeding event
will likely happen in the next `N` further records of the same ride.The same ride means a chronological sequence of records for a given
`(customer_id, vehicle_id, driver_id)` tuple.## Tasks
### Task 1: Detecting speeding events
Your task is to implement `detect_speeding_events` function in `app.detector` module.
The function accepts `logs` containing the data as described in the *Data* section
and should output another `DataFrame` with an additional `is_speeding` column.The value in this column should represent whether speeding has happened between
the current record and the previous one.The speeding is defined as traveling with the speed higher than the speed limit
defined in the current record.For the first record in the sequence both `False` and `None` are valid values.
### Task 2: Prepare data for the classifier validation
Your task is to implement a `predict_speeding_event` function in `app.detector` module.
The function accepts `logs_with_speeding` and `prediction_horizon` and should output
another `DataFrame` with an additional `actually_speeding` column. The `logs_with_speeding`
is a `DataFrame` containing speeding events detected in the previous task.The function should calculate the ride status over a prediction horizon, so the
value in new column represents whether a speeding event is going
to happen in the next `prediction_horizon` steps of the same ride.