https://github.com/sayed-ashfaq/delhivery-dataanalysis

In this project, I conducted basic analysis, feature engineering, normalization, and outlier handling, along with statistical and non-parametric testing to extract insights.

feature-engineering normalization outlier-detection pandas python scikit-learn statistcal-tests statistical-analysis


# Delhivery Data Analysis

## About Delhivery
Delhivery is the largest and fastest-growing fully integrated logistics provider in India as of Fiscal 2021. The company aims to build the operating system for commerce through a blend of world-class infrastructure, high-quality logistics operations, and cutting-edge engineering and technology capabilities.

The data team at Delhivery leverages vast datasets to enhance business intelligence, drive operational efficiency, and maintain profitability, creating a significant competitive edge.

---

## Objective
The goal of this project is to process and analyze data generated by Delhivery's logistics operations to:
1. **Clean, sanitize, and manipulate raw data** to derive actionable insights.
2. **Create useful features** for the data science team to develop forecasting models.

---

## Dataset
The dataset consists of records from Delhivery's logistics and operational data pipeline.

### **Key Features**:
- **`data`**: Indicates if the record is training or testing data.
- **`trip_creation_time`**: Timestamp of trip creation.
- **`route_schedule_uuid`**: Unique identifier for a route schedule.
- **`route_type`**: Type of transportation (`FTL` or `Carting`).
  - **FTL**: Full Truck Load shipments, which are delivered faster because there are no intermediate pickups or drop-offs.
  - **Carting**: A delivery system that uses smaller vehicles (carts).
- **`trip_uuid`**: Unique identifier for a trip (a trip can involve multiple source and destination centers).
- **`source_center`**: ID of the trip's origin center.
- **`source_name`**: Name of the trip's origin center.
- **`destination_center`**: ID of the destination center.
- **`destination_name`**: Name of the destination center.
- **`od_start_time`**: Trip start time.
- **`od_end_time`**: Trip end time.
- **`start_scan_to_end_scan`**: Total time taken for delivery from source to destination.
- **`actual_distance_to_destination`**: Actual distance in kilometers between source and destination.
- **`actual_time`**: Cumulative time taken to complete the delivery.
- **`osrm_time`**: Time calculated by the Open-Source Routing Machine (OSRM) considering shortest paths and typical traffic conditions (cumulative).
- **`osrm_distance`**: Distance calculated by OSRM (cumulative).
- **`segment_actual_time`**: Time taken for a segment of the delivery.
- **`segment_osrm_time`**: OSRM-calculated time for a delivery segment.
- **`segment_osrm_distance`**: OSRM-calculated distance for a delivery segment.

### **Additional Fields**:
Some fields with currently unclear meanings, like `is_cutoff`, `cutoff_factor`, `cutoff_timestamp`, and `factor`, are included for completeness and may be explored further.

---

## Process Overview

### 1. **Feature Engineering**:
- Derived meaningful metrics such as:
  - **`time_diff_hours`**: Time difference between `od_start_time` and `od_end_time`.
- Extracted components from timestamps (e.g., month, year, day of the week).
- Split and standardized source and destination names into city, place code, and state.
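The feature-engineering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the derived column names other than `time_diff_hours` are hypothetical, and the name format `City_Place_Code (State)` is an assumption about how `source_name` is laid out.

```python
import pandas as pd

# Made-up sample rows; the name format "City_Place_Code (State)" is assumed.
df = pd.DataFrame({
    "trip_creation_time": ["2018-09-20 02:35:36", "2018-09-23 06:42:06"],
    "od_start_time": ["2018-09-20 03:00:00", "2018-09-23 06:00:00"],
    "od_end_time": ["2018-09-20 04:30:00", "2018-09-23 09:00:00"],
    "source_name": ["CityA_PlaceX_DC (StateOne)", "CityB_PlaceY_H (StateTwo)"],
})

# Parse the timestamp columns so datetime arithmetic works.
for col in ["trip_creation_time", "od_start_time", "od_end_time"]:
    df[col] = pd.to_datetime(df[col])

# Duration between od_start_time and od_end_time, in hours.
df["time_diff_hours"] = (
    (df["od_end_time"] - df["od_start_time"]).dt.total_seconds() / 3600
)

# Calendar components of the trip creation time (hypothetical column names).
df["trip_year"] = df["trip_creation_time"].dt.year
df["trip_month"] = df["trip_creation_time"].dt.month
df["trip_dayofweek"] = df["trip_creation_time"].dt.dayofweek  # Monday = 0

# Split the name into city, place code, and state under the assumed format.
pattern = (
    r"^(?P<source_city>[^_(]+?)"
    r"(?:_(?P<source_place_code>[^(]*?))?"
    r"\s*\((?P<source_state>[^)]*)\)$"
)
df = df.join(df["source_name"].str.extract(pattern))

print(df[["time_diff_hours", "source_city", "source_place_code", "source_state"]])
```

The same pattern could be applied to `destination_name` with different group names. Names that do not match the assumed format come back as missing values, which makes them easy to inspect separately.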

### 2. **Data Cleaning**:
- Handled missing values using appropriate imputation techniques.
- Detected outliers visually with boxplots and addressed them using the IQR (interquartile range) method.
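As an illustration of the IQR method, the sketch below flags values outside the usual 1.5 × IQR fences. The helper `iqr_bounds`, the sample values, and the choice to clip rather than drop flagged rows are assumptions made for this example; the README does not say which treatment was used.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up delivery times with one obviously extreme value.
actual_time = pd.Series(
    [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 900.0], name="actual_time"
)

lower, upper = iqr_bounds(actual_time)

# Boolean mask of values outside the fences.
is_outlier = (actual_time < lower) | (actual_time > upper)

# One possible treatment: cap values at the fences instead of dropping rows.
clipped = actual_time.clip(lower, upper)

print(f"fences: [{lower}, {upper}]")
print(actual_time[is_outlier])
```

Clipping keeps every row, which matters when rows belong to a trip that will be aggregated later; dropping flagged rows is the other common option.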

### 3. **Categorical Feature Handling**:
- Applied one-hot encoding to categorical variables like `route_type` so that downstream models can use them as numeric inputs.
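One way to one-hot encode `route_type` is `pandas.get_dummies`, sketched below on made-up rows. Whether the project used this function or scikit-learn's encoder is not stated, so treat the choice as an assumption.

```python
import pandas as pd

# Made-up rows using the two route types described in the dataset.
df = pd.DataFrame({"route_type": ["FTL", "Carting", "FTL"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(
    df, columns=["route_type"], prefix="route_type", dtype=int
)

print(encoded)
```

With only two categories, one of the two indicator columns is redundant; passing `drop_first=True` would keep a single column if a model is sensitive to that.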

### 4. **Normalization and Standardization**:
- Used `MinMaxScaler` and `StandardScaler` on numerical columns to bring features onto comparable scales.
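The two scalers behave differently: `MinMaxScaler` maps each column to the range [0, 1], while `StandardScaler` centers each column at mean 0 with unit variance. A minimal sketch on made-up values, with the column selection chosen only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric values; the column choice is an assumption for this example.
df = pd.DataFrame({
    "actual_time": [30.0, 60.0, 90.0],
    "osrm_distance": [10.0, 20.0, 60.0],
})
num_cols = ["actual_time", "osrm_distance"]

# Min-max scaling: (x - min) / (max - min), per column.
minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)

# Standardization: (x - mean) / std, per column (population std, ddof=0).
standard = pd.DataFrame(
    StandardScaler().fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)

print(minmax)
print(standard)
```

If the scaled features feed a model, the scaler should be fitted on the training rows only and then applied to the test rows, so that test data does not leak into the scaling parameters.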

---

## Key Insights

1. **Route Type Insights**:
   - FTL routes are faster and more efficient for long distances compared to Carting.

2. **Source and Destination Patterns**:
   - High-frequency routes indicate key operational hubs that could benefit from resource optimization.

3. **Time Efficiency**:
   - Delivery times vary significantly by route type, season, and traffic conditions.

4. **OSRM vs. Actual Metrics**:
   - Discrepancies between OSRM-calculated and actual times/distances highlight areas for improving routing algorithms.
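A comparison of OSRM estimates with actual times could look like the sketch below. The numbers are made up and do not reproduce the project's results, and the derived columns `time_gap` and `time_ratio` are hypothetical names introduced here.

```python
import pandas as pd

# Made-up trip-level values; not results from the Delhivery dataset.
trips = pd.DataFrame({
    "route_type": ["FTL", "FTL", "Carting", "Carting"],
    "actual_time": [120.0, 200.0, 60.0, 90.0],
    "osrm_time": [100.0, 160.0, 30.0, 60.0],
})

# Absolute and relative gap between actual and OSRM-estimated time.
trips["time_gap"] = trips["actual_time"] - trips["osrm_time"]
trips["time_ratio"] = trips["actual_time"] / trips["osrm_time"]

# Median gap per route type; the median is less sensitive to extreme trips.
summary = trips.groupby("route_type")[["time_gap", "time_ratio"]].median()

print(summary)
```

A ratio well above 1 for a route type would suggest that OSRM systematically underestimates travel time there, which is the kind of discrepancy the insight above refers to.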

---

## Tools and Libraries
This project utilized the following tools:
- **Python**:
  - `pandas` for data manipulation.
  - `Matplotlib` and `Seaborn` for visualization.
  - `scikit-learn` for preprocessing and scaling.
- **Jupyter Notebook**: For interactive analysis and documentation.

---

## Repository Structure
- **`data/`**: Contains the dataset used for analysis.
- **`notebooks/`**: Jupyter Notebooks documenting the analysis process.
- **`visualizations/`**: Saved plots and charts.
- **`README.md`**: Overview of the project (this file).

---

## Next Steps
Future directions for this project include:
1. Developing predictive models for delivery time and distance.
2. Investigating patterns in the unknown fields (`is_cutoff`, `cutoff_factor`, etc.).
3. Implementing clustering techniques to identify high-demand routes.

---

## Acknowledgments
- **Dataset Source**: Provided by Scaler for this analysis.
- **Python Libraries**: Thanks to the open-source Python community for providing versatile data analysis tools.

---

## License
This project is licensed for educational and non-commercial use only. If utilizing any part of this repository, please credit the author.