https://github.com/srikarkunta/de_coding_task
Python project to download, clean, and process order item CSV files. Includes data cleaning, logging, JSON stats, and monthly metrics generation.
- Host: GitHub
- URL: https://github.com/srikarkunta/de_coding_task
- Owner: SRIKARKUNTA
- Created: 2025-10-02T07:01:08.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-10-02T12:05:48.000Z (6 months ago)
- Last Synced: 2025-10-02T14:11:57.132Z (6 months ago)
- Topics: csv, datacleaning, dataengineering, pandas, python
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CSV Processor
A Python project that downloads a CSV file from a provided URL, saves it with a timestamp, cleans the data by removing duplicates and empty rows, logs the discarded rows, and produces a JSON stats file and a monthly metrics CSV.
## Assumptions
- Input CSV has columns:
  - `order_date` (in `YYYY-MM-DD` format)
  - `item_price` (float)
  - `item_promo_discount` (float)
- System has enough memory to load a ~1GB CSV into Pandas (16GB+ RAM recommended).
- The only row checks performed are:
  - Empty rows
  - Duplicates
  - Invalid date formats (discarded)
- Because no other validity checks run, `"total_invalid_rows_discarded"` in the stats file is always `0`.
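The three checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration under stated assumptions, not the code in `clean.py`: the function name `clean_orders` and the choice to return a `(kept, discarded)` pair are hypothetical.

```python
import pandas as pd


def clean_orders(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split df into (kept, discarded) rows. Hypothetical helper, not the project's API."""
    # A row counts as empty when every column is missing.
    is_empty = df.isna().all(axis=1)
    # Second and later copies of an identical row are duplicates.
    is_duplicate = df.duplicated(keep="first")
    # Parse order_date as YYYY-MM-DD; values that fail to parse become NaT.
    parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    bad_date = parsed.isna()

    discard = is_empty | is_duplicate | bad_date
    return df[~discard].copy(), df[discard].copy()
```

Returning the discarded rows alongside the kept ones makes it straightforward to write them to a log file afterwards.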
## Setup
1. Install Poetry: `pip install poetry`
2. Install dependencies: `poetry install`
## Project Structure
csv-processor/
├── src/
│   ├── main.py                # CLI entrypoint
│   └── csv_processor/
│       ├── clean.py           # Data cleaning logic
│       ├── download.py        # Streaming CSV download
│       ├── metrics.py         # Monthly aggregation calculations
│       └── stats.py           # Processing statistics (JSON output)
│
├── tests/
│   ├── __init__.py
│   └── test_clean.py          # Unit tests for cleaning module
│
├── pyproject.toml             # Poetry dependencies & build config
├── README.md                  # Project documentation
└── .gitignore                 # Ignore large outputs and build artifacts
## How to Run
Run via CLI with the URL argument:
`poetry run python src/main.py --url https://storage.googleapis.com/nozzle-csv-exports/testing-data/order_items_data_2_.csv`
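For orientation, a `--url` entrypoint with a streaming, timestamped download could look something like the sketch below. It uses only the standard library. The names `parse_args` and `download_csv` and the timestamp format are assumptions for illustration; the README does not show how `main.py` or `download.py` are actually written.

```python
import argparse
import shutil
import urllib.request
from datetime import datetime
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the single required --url argument."""
    parser = argparse.ArgumentParser(description="Download and process an order items CSV.")
    parser.add_argument("--url", required=True, help="URL of the CSV file to download")
    return parser.parse_args(argv)


def download_csv(url: str, out_dir: Path = Path(".")) -> Path:
    """Stream the file at url to a timestamped path and return that path."""
    # Hypothetical naming scheme: the README only says the file is saved with a timestamp.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = out_dir / f"order_items_{stamp}.csv"
    with urllib.request.urlopen(url) as response, open(target, "wb") as fh:
        # copyfileobj copies in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(response, fh)
    return target
```

A script would then call `download_csv(parse_args().url)` and pass the returned path to the cleaning step.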
Outputs:
When you run the pipeline, the following files are generated in the working directory:
- **order_items_.csv**: raw downloaded file
- **discarded_rows.csv**: rows removed during cleaning
- **processing_stats.json**: summary of processing (rows kept, discarded, invalid)
- **monthly_metrics.csv**: monthly aggregated metrics

Note: These files are listed in `.gitignore`, so they won't appear in the GitHub repo, but they will be created locally every time you run the program.
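As a rough illustration of the last two outputs, the sketch below builds a stats dictionary and a per-month aggregate with pandas. The README does not say which monthly metrics are computed, so summing `item_price` and `item_promo_discount` per month is an assumption, as are the function names and every stats key other than `total_invalid_rows_discarded`.

```python
import json

import pandas as pd


def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Sum price and discount per calendar month (assumed metrics, not confirmed)."""
    # Convert each order date to a "YYYY-MM" label to group on.
    month = pd.to_datetime(df["order_date"]).dt.to_period("M").astype(str)
    return (
        df.groupby(month)[["item_price", "item_promo_discount"]]
        .sum()
        .rename_axis("month")
        .reset_index()
    )


def build_stats(total_rows: int, kept_rows: int) -> dict:
    """Summarise a run; only the last key is named in the README, the rest are hypothetical."""
    return {
        "total_rows": total_rows,
        "rows_kept": kept_rows,
        "rows_discarded": total_rows - kept_rows,
        "total_invalid_rows_discarded": 0,
    }


def write_outputs(df: pd.DataFrame, stats: dict) -> None:
    """Write both files to the working directory under the names listed above."""
    monthly_metrics(df).to_csv("monthly_metrics.csv", index=False)
    with open("processing_stats.json", "w") as fh:
        json.dump(stats, fh, indent=2)
```

Keeping the aggregation and the stats as plain functions that return data, with file writing in a separate step, makes each part easy to unit test.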
## How to Execute Tests
`poetry run pytest`
Tests cover basic cleaning functionality using a sample in-memory DataFrame.
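A test of this kind might look like the sketch below. The function `drop_empty_and_duplicates` is a stand-in defined here so the example runs on its own; the real function name and signature in `clean.py` are not shown in this README.

```python
import pandas as pd


def drop_empty_and_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the project's cleaning function (hypothetical name).
    return df.dropna(how="all").drop_duplicates()


def test_removes_empty_and_duplicate_rows():
    # Small in-memory frame: two identical rows and one fully empty row.
    df = pd.DataFrame(
        {
            "order_date": ["2024-01-05", "2024-01-05", None],
            "item_price": [10.0, 10.0, None],
            "item_promo_discount": [1.0, 1.0, None],
        }
    )
    cleaned = drop_empty_and_duplicates(df)
    assert len(cleaned) == 1
    assert cleaned.iloc[0]["item_price"] == 10.0
```

pytest collects any function whose name starts with `test_`, so no extra registration is needed.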
## Author
- **Srikar Kunta**