https://github.com/srikarkunta/de_coding_task

Python project to download, clean, and process order item CSV files. Includes data cleaning, logging, JSON stats, and monthly metrics generation.

# CSV Processor

A Python project that downloads a CSV file from a provided URL, saves it with a timestamp, cleans the data by removing duplicates and empty rows, logs discarded rows, and produces a JSON stats file and a monthly metrics CSV.

## 📌 Assumptions

- Input CSV has columns:
  - `order_date` (in `YYYY-MM-DD` format)
  - `item_price` (float)
  - `item_promo_discount` (float)
- System has enough memory to load a ~1GB CSV into Pandas (16GB+ RAM recommended).
- No extra invalid-row checks beyond:
  - Empty rows
  - Duplicates
  - Invalid date formats (discarded)
- Therefore, `"total_invalid_rows_discarded"` in stats is always `0`.
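The checks above can be sketched in pandas roughly as follows (a minimal illustration; the function and mask names here are assumptions, not necessarily what `clean.py` uses):

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (kept, discarded) rows after applying the three checks."""
    # Rows that are entirely empty
    empty_mask = df.isna().all(axis=1)
    # Duplicate rows (all but the first occurrence)
    dup_mask = df.duplicated()
    # Dates that fail to parse as YYYY-MM-DD become NaT with errors="coerce"
    parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    bad_date_mask = parsed.isna() & ~empty_mask

    discard_mask = empty_mask | dup_mask | bad_date_mask
    return df[~discard_mask].copy(), df[discard_mask].copy()
```

Since only these three masks contribute to `discard_mask`, nothing is ever counted as a separate "invalid" category, which is why the stats file always reports `0` there.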

## Setup

1. Install Poetry: `pip install poetry`
2. Install dependencies: `poetry install`

## 📂 Project Structure

```
csv-processor/
├── src/
│   ├── main.py                # CLI entrypoint
│   └── csv_processor/
│       ├── clean.py           # Data cleaning logic
│       ├── download.py        # Streaming CSV download
│       ├── metrics.py         # Monthly aggregation calculations
│       └── stats.py           # Processing statistics (JSON output)
│
├── tests/
│   ├── __init__.py
│   └── test_clean.py          # Unit tests for cleaning module
│
├── pyproject.toml             # Poetry dependencies & build config
├── README.md                  # Project documentation
└── .gitignore                 # Ignore large outputs and build artifacts
```
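The streaming download in `download.py` presumably works along these lines (an illustrative sketch using `requests`; the function name, filename pattern, and chunk size are assumptions):

```python
from datetime import datetime

import requests

def download_csv(url: str, prefix: str = "order_items") -> str:
    """Stream the CSV to disk in chunks, stamping the filename with a timestamp."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{prefix}_{timestamp}.csv"
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            # stream=True avoids loading the whole ~1GB response into memory
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
    return path
```

Streaming matters here because the input file is around 1GB; only the parsed DataFrame, not the raw HTTP response, needs to fit in memory at once.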

## How to Run

Run via CLI with the URL argument:

`poetry run python src/main.py --url https://storage.googleapis.com/nozzle-csv-exports/testing-data/order_items_data_2_.csv`

### Outputs

When you run the pipeline, the following files are generated in the working directory:

- **order_items_.csv** → raw downloaded file
- **discarded_rows.csv** → rows removed during cleaning
- **processing_stats.json** → summary of processing (rows kept, discarded, invalid)
- **monthly_metrics.csv** → monthly aggregated metrics

⚠️ Note: These files are listed in `.gitignore`, so they won't appear in the GitHub repo, but they will be created locally every time you run the program.
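The monthly aggregation behind `monthly_metrics.csv` might look roughly like this (a sketch only; the metric names and exact grouping in `metrics.py` are assumptions):

```python
import pandas as pd

def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned order items by calendar month of order_date."""
    out = df.copy()
    # Bucket each row into its YYYY-MM period
    out["month"] = pd.to_datetime(out["order_date"]).dt.to_period("M").astype(str)
    return (
        out.groupby("month")
        .agg(
            total_items=("item_price", "size"),
            total_revenue=("item_price", "sum"),
            total_discount=("item_promo_discount", "sum"),
        )
        .reset_index()
    )
```

The result would then be written out with `DataFrame.to_csv("monthly_metrics.csv", index=False)`.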

## How to Execute Tests

`poetry run pytest`

Tests cover basic cleaning functionality using a sample in-memory DataFrame.
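A test in that style might look as follows (illustrative only; the real assertions live in `tests/test_clean.py` and exercise the project's own cleaning function):

```python
import pandas as pd

def test_duplicates_and_empty_rows_are_flagged():
    """Build a small in-memory DataFrame and check the discard masks."""
    df = pd.DataFrame({
        "order_date": ["2024-01-01", "2024-01-01", None],
        "item_price": [10.0, 10.0, None],
        "item_promo_discount": [0.5, 0.5, None],
    })
    dup_mask = df.duplicated()          # second row repeats the first
    empty_mask = df.isna().all(axis=1)  # third row is entirely empty
    assert dup_mask.tolist() == [False, True, False]
    assert empty_mask.tolist() == [False, False, True]
```

Because the fixture is built in memory, the tests run without any network access or downloaded files.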

## 👨‍💻 Author
- **Srikar Kunta**