https://github.com/srikarkunta/de_coding_task

Python project to download, clean, and process order item CSV files. Includes data cleaning, logging, JSON stats, and monthly metrics generation.

# CSV Processor

A Python project that downloads a CSV file from a provided URL, saves it with a timestamp, cleans the data by removing duplicates and empty rows, logs discarded rows, and produces a JSON stats file and a monthly metrics CSV.

## 📌 Assumptions

- Input CSV has columns:
  - `order_date` (in `YYYY-MM-DD` format)
  - `item_price` (float)
  - `item_promo_discount` (float)
- System has enough memory to load a ~1GB CSV into Pandas (16GB+ RAM recommended).
- No extra invalid-row checks beyond:
  - Empty rows
  - Duplicates
  - Invalid date formats (discarded)
- Therefore, `"total_invalid_rows_discarded"` in stats is always `0`.
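The checks above can be sketched in pandas roughly as follows (a minimal illustration; the function and mask names here are assumptions, not necessarily what `clean.py` uses):

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (kept, discarded) rows after applying the three checks."""
    # Rows that are entirely empty
    empty_mask = df.isna().all(axis=1)
    # Duplicate rows (all but the first occurrence)
    dup_mask = df.duplicated()
    # Dates that fail to parse as YYYY-MM-DD become NaT with errors="coerce"
    parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    bad_date_mask = parsed.isna() & ~empty_mask

    discard_mask = empty_mask | dup_mask | bad_date_mask
    return df[~discard_mask].copy(), df[discard_mask].copy()
```

Since only these three masks contribute to `discard_mask`, nothing is ever counted as a separate "invalid" category, which is why the stats file always reports `0` there.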

## Setup

1. Install Poetry: `pip install poetry`
2. Install dependencies: `poetry install`

## 📂 Project Structure

```
csv-processor/
├── src/
│   ├── main.py                # CLI entrypoint
│   └── csv_processor/
│       ├── clean.py           # Data cleaning logic
│       ├── download.py        # Streaming CSV download
│       ├── metrics.py         # Monthly aggregation calculations
│       └── stats.py           # Processing statistics (JSON output)
│
├── tests/
│   ├── __init__.py
│   └── test_clean.py          # Unit tests for cleaning module
│
├── pyproject.toml             # Poetry dependencies & build config
├── README.md                  # Project documentation
└── .gitignore                 # Ignore large outputs and build artifacts
```
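The streaming download in `download.py` presumably works along these lines (an illustrative sketch using `requests`; the function name, filename pattern, and chunk size are assumptions):

```python
from datetime import datetime

import requests

def download_csv(url: str, prefix: str = "order_items") -> str:
    """Stream the CSV to disk in chunks, stamping the filename with a timestamp."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{prefix}_{timestamp}.csv"
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            # stream=True avoids loading the whole ~1GB response into memory
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
    return path
```

Streaming matters here because the input file is around 1GB; only the parsed DataFrame, not the raw HTTP response, needs to fit in memory at once.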

## How to Run

Run via CLI with the URL argument:

`poetry run python src/main.py --url https://storage.googleapis.com/nozzle-csv-exports/testing-data/order_items_data_2_.csv`

### Outputs

When you run the pipeline, the following files are generated in the working directory:

- **order_items_.csv** → raw downloaded file
- **discarded_rows.csv** → rows removed during cleaning
- **processing_stats.json** → summary of processing (rows kept, discarded, invalid)
- **monthly_metrics.csv** → monthly aggregated metrics

⚠️ Note: These files are listed in `.gitignore`, so they won't appear in the GitHub repo, but they will be created locally every time you run the program.
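The monthly aggregation behind `monthly_metrics.csv` might look roughly like this (a sketch only; the metric names and exact grouping in `metrics.py` are assumptions):

```python
import pandas as pd

def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned order items by calendar month of order_date."""
    out = df.copy()
    # Bucket each row into its YYYY-MM period
    out["month"] = pd.to_datetime(out["order_date"]).dt.to_period("M").astype(str)
    return (
        out.groupby("month")
        .agg(
            total_items=("item_price", "size"),
            total_revenue=("item_price", "sum"),
            total_discount=("item_promo_discount", "sum"),
        )
        .reset_index()
    )
```

The result would then be written out with `DataFrame.to_csv("monthly_metrics.csv", index=False)`.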

## How to Execute Tests

`poetry run pytest`

Tests cover basic cleaning functionality using a sample in-memory DataFrame.
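A test in that style might look as follows (illustrative only; the real assertions live in `tests/test_clean.py` and exercise the project's own cleaning function):

```python
import pandas as pd

def test_duplicates_and_empty_rows_are_flagged():
    """Build a small in-memory DataFrame and check the discard masks."""
    df = pd.DataFrame({
        "order_date": ["2024-01-01", "2024-01-01", None],
        "item_price": [10.0, 10.0, None],
        "item_promo_discount": [0.5, 0.5, None],
    })
    dup_mask = df.duplicated()          # second row repeats the first
    empty_mask = df.isna().all(axis=1)  # third row is entirely empty
    assert dup_mask.tolist() == [False, True, False]
    assert empty_mask.tolist() == [False, False, True]
```

Because the fixture is built in memory, the tests run without any network access or downloaded files.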

## 👨‍💻 Author
- **Srikar Kunta**