{"id":26360938,"url":"https://github.com/kurulko/etl-project","last_synced_at":"2026-05-16T23:09:48.900Z","repository":{"id":282531202,"uuid":"948881808","full_name":"Kurulko/ETL-Project","owner":"Kurulko","description":"A simple ETL project in CLI that inserts data from a CSV into a single, flat table","archived":false,"fork":false,"pushed_at":"2025-03-15T07:55:47.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-15T08:26:36.504Z","etag":null,"topics":["csv","entity-framework-core","ms-sql-server","netcore"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kurulko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-15T06:59:00.000Z","updated_at":"2025-03-15T07:55:50.000Z","dependencies_parsed_at":"2025-03-15T08:37:40.413Z","dependency_job_id":null,"html_url":"https://github.com/Kurulko/ETL-Project","commit_stats":null,"previous_names":["kurulko/etl-project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kurulko%2FETL-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kurulko%2FETL-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kurulko%2FETL-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kurulko%2FETL-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/Kurulko","download_url":"https://codeload.github.com/Kurulko/ETL-Project/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243902320,"owners_count":20366262,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","entity-framework-core","ms-sql-server","netcore"],"created_at":"2025-03-16T17:18:55.485Z","updated_at":"2026-05-16T23:09:43.862Z","avatar_url":"https://github.com/Kurulko.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ETL Project\n\nA simple ETL project in CLI that inserts data from a CSV into a single, flat table\n\n--- \n\n## Table of Contents\n\n- [Test Assessment](#test-assessment)\n- [Installation](#installation)\n- [Questions and Answers](#questions-and-answers)\n  \n---\n\n## Test Assessment\n\nThe goal of this task is to implement a simple ETL project in CLI that inserts data from a CSV into a single, flat table. \n\n### Objectives\n\n1. Import the data from the CSV into an MS SQL table. We only want to store the following columns:\n    - `tpep_pickup_datetime`\n    - `tpep_dropoff_datetime`\n    - `passenger_count`\n    - `trip_distance`\n    - `store_and_fwd_flag`\n    - `PULocationID`\n    - `DOLocationID`\n    - `fare_amount`\n    - `tip_amount`\n2. Set up a SQL Server database (local or cloud-based, as per your convenience).\n3. Design a table schema that will hold the processed data; make sure you are using the proper data types.\n4. 
Users of the table will perform the following queries; ensure your schema is optimized for them:\n    - Find out which `PULocationId` (Pick-up location ID) has the highest tip_amount on average.\n    - Find the top 100 longest fares in terms of `trip_distance`.\n    - Find the top 100 longest fares in terms of time spent traveling.\n    - Search, where part of the conditions is `PULocationId`.\n5. Implement efficient bulk insertion of the processed records into the database.\n6. Identify and remove any duplicate records from the dataset based on a combination of `pickup_datetime`, `dropoff_datetime`, and `passenger_count`. Write all removed duplicates into a `duplicates.csv` file.\n7. For the `store_and_fwd_flag` column, convert any 'N' values to 'No' and any 'Y' values to 'Yes'.\n8. Ensure that all text-based fields are free from leading or trailing whitespace.\n9. Assume your program will be used on much larger data files. Describe in a few sentences what you would change if you knew it would be used for a 10GB CSV input file.\n10. (nice to have) The input data is in the EST timezone. Convert it to UTC when inserting into the DB.\n\n### Requirements\n\n- Use C# as the primary programming language.\n- Efficiency of data insertion into SQL Server.\n- Assume the data comes from a potentially unsafe source.\n\n## Installation\n\n### Step 1: **Clone the repository**\n   ```\n   git clone https://github.com/Kurulko/ETL-Project.git\n   cd ETL-Project\n  ```\n\n### Step 2. **Install the necessary dependencies**\n```\ndotnet restore\n```\n\n### Step 3. **Set up the database (SQL Server)**\nOpen SQL Server Management Studio (SSMS) or any SQL management tool.\n\n#### a. Create the database:\n```\nCREATE DATABASE ETL_db;\nGO\n```\n\n#### b.  
Apply the necessary SQL schema:\n```\nCREATE TABLE Vendors (\n\tVendorID INT IDENTITY PRIMARY KEY,\n  tpep_pickup_datetime DATETIME NOT NULL,\n  tpep_dropoff_datetime DATETIME NOT NULL,\n  passenger_count INT NULL,\n  trip_distance FLOAT NOT NULL,\n  store_and_fwd_flag VARCHAR(3) NULL,\n  fare_amount DECIMAL(5,2) NOT NULL,\n  tip_amount DECIMAL(5,2) NOT NULL,\n\tPULocationID INT NOT NULL,\n  DOLocationID INT NOT NULL,\n\n\tCONSTRAINT CK_Vendors_tpep_dropoff_datetime CHECK (tpep_dropoff_datetime \u003e tpep_pickup_datetime),\n  CONSTRAINT CK_Vendors_passenger_count CHECK (passenger_count IS NULL OR passenger_count \u003e= 0),\n  CONSTRAINT CK_Vendors_trip_distance CHECK (trip_distance \u003e= 0),\n  CONSTRAINT CK_Vendors_store_and_fwd_flag CHECK (store_and_fwd_flag IS NULL OR store_and_fwd_flag IN ('Yes', 'No')),\n  CONSTRAINT CK_Vendors_fare_amount CHECK (fare_amount \u003e= 0),\n  CONSTRAINT CK_Vendors_tip_amount CHECK (tip_amount \u003e= 0),\n);\nGO\n```\n\n### Step 4. **Configure appsettings.json**\n\n#### 1. Create the appsettings.json file in the project's root folder.\n#### 2. Add your sensitive settings (e.g., connection strings, CSV settings) to appsettings.json:\n```\n{\n  \"ConnectionStrings\": {\n    \"DefaultConnection\": \"Server=(localdb)\\\\mssqllocaldb; Database=ETL_db; Trusted_Connection=True;\"\n  },\n  \"CsvSettings\": {\n    \"DataFilePath\": \"path-to-sample-data-file\\\\sample-cab-data.csv\",\n    \"DuplicatesFilePath\": \"path-to-duplicates-data-file\\\\duplicates.csv\"\n  }\n}\n```\n\n### Step 5. **Run the project**\nIn the project's root folder\n```\ndotnet run\n```\n\n### Step 6. **Execute SQL tasks**\n#### 1. Open SQL Server Management Studio (SSMS) or any SQL management tool.\n#### 2. 
Execute SQL tasks.\n\n##### Finding out which `PULocationId` (Pick-up location ID) has the highest tip_amount on average\n```\nWITH AvgTipAmounts AS (\n  SELECT PULocationID, AVG(tip_amount) as avg_tip_amount \n\tFROM Vendors\n\tGROUP BY PULocationID\n)\nSELECT PULocationID, avg_tip_amount as highest_avg_tip_amount FROM AvgTipAmounts\nWHERE avg_tip_amount = (\n\tSELECT MAX(avg_tip_amount) FROM AvgTipAmounts\n);\n```\n\n##### Finding the top 100 longest fares in terms of `trip_distance`\n```\nSELECT TOP 100 trip_distance FROM Vendors\nORDER BY trip_distance DESC;\n```\n\n##### Finding the top 100 longest fares in terms of time spent traveling\n```\nSELECT TOP 100 tpep_pickup_datetime, tpep_dropoff_datetime, DATEDIFF(SECOND, tpep_pickup_datetime, tpep_dropoff_datetime) AS time_spent_traveling_in_seconds \nFROM Vendors\nORDER BY time_spent_traveling_in_seconds DESC;\n```\n\n##### Searching, where part of the conditions is `PULocationId`\n```\nSELECT * FROM Vendors\nWHERE PULocationId = 193;\n```\n\n## Questions and Answers\n\n**№1**\n\n**Question**: \nAssume your program will be used on much larger data files. Describe in a few sentences what you would change if you knew it would be used for a 10GB CSV input file.\n\n**Answer**:\nI would use the `SqlBulkCopy` class from the `Microsoft.Data.SqlClient` namespace for bulk insert operations. I would also implement streaming processing (e.g. 
reading the file in smaller portions) instead of loading the entire CSV file into memory. Finally, I would optimize the database schema by adding relevant indexes.\n\n---\n\n**№2**\n\n**Question**: \nNumber of rows in your table after running the program\n\n**Answer**: \n\nNumber of rows after running: **29889** (in db)\n\nNumber of duplicate rows: **111** (in duplicates.csv file)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkurulko%2Fetl-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkurulko%2Fetl-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkurulko%2Fetl-project/lists"}