{"id":18410960,"url":"https://github.com/slatawa/csv_parquet","last_synced_at":"2025-08-10T10:33:23.063Z","repository":{"id":156213218,"uuid":"372756335","full_name":"slatawa/csv_parquet","owner":"slatawa","description":"Project showing integration of upstream file into your data lake. we look at handling high volume customized data formats and converting them into parquet. ","archived":false,"fork":false,"pushed_at":"2021-07-25T06:51:50.000Z","size":394,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T23:20:04.869Z","etag":null,"topics":["parquet-files","pyspark","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slatawa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-01T08:27:28.000Z","updated_at":"2021-07-25T06:51:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"50f1eaeb-bec8-43a5-b446-df4fdd63c4c2","html_url":"https://github.com/slatawa/csv_parquet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/slatawa/csv_parquet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slatawa%2Fcsv_parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slatawa%2Fcsv_parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slatawa%2Fcsv_parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/slatawa%2Fcsv_parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slatawa","download_url":"https://codeload.github.com/slatawa/csv_parquet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slatawa%2Fcsv_parquet/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269711621,"owners_count":24463198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parquet-files","pyspark","python3"],"created_at":"2024-11-06T03:34:38.074Z","updated_at":"2025-08-10T10:33:22.999Z","avatar_url":"https://github.com/slatawa.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CSV To Parquet\r\n\r\nYou are tasked with integrating a new cargo container data source from a provider. \r\nThis dataset contains rows of information about different shipments that are occurring, with details about the goods being transferred to and from a specific country.\r\n\r\nThe project has a few rows of sample data (under ./incoming_data) for a specific country, alongside a file \r\nwith the schema (schema.png). \r\n\t\r\nAim:\r\n\t\r\n\t* Working with the data in the current format is very difficult. 
The goal of the exercise is to write the data out in the Parquet file format, which can be easily read by Apache Spark. The file needs to include column headers.\r\n\t\r\nThis must be done in the context of the following:\r\n\r\n\t* The size of a single file is very large (\u003e50GB). You can imagine that you have access to a computing cluster.\r\n\t* The format of this file is unconventional and helper methods for reading it in libraries such as Python, Pandas and Spark may not work.\r\n\t* There is a large amount of historical data to process (\u003e3TB), so making this process as efficient as possible is extremely important.\r\n\t* There will be heterogeneous datasets for different countries, each with its own schema, so your solution should allow room for this.\r\n\r\n# Prelude\r\n\r\nOn initial analysis, this task seems best suited to a PySpark job that reads the data and converts it to Parquet. \r\nIssues with using PySpark:\r\n1)\tThe custom file format makes it difficult to use the out-of-the-box pyspark.sql DataFrame reader (df.read.csv). This function does accept a custom record delimiter and column delimiter, but it cannot be used here because the CSV reader only supports a record delimiter of length 1, whereas the custom delimiter is ‘#@#@#’. \r\n\r\n2)\tWe could use the RDD structure and apply lambdas to split the file as required, but this is not going to be the most performant approach. 
Sample code to achieve this via RDD:\r\n\r\n```\r\nfrom pyspark.sql import SparkSession\r\n\r\nspark = SparkSession.builder.getOrCreate()\r\n\r\n# Split each input line on the record delimiter, then on the column delimiter\r\nrdd1 = (spark.sparkContext\r\n        .textFile(\"./sample_data.txt\", 10)\r\n        .flatMap(lambda line: line.split(\"#@#@#\"))\r\n        .map(lambda x: x.split(\"~\"))\r\n        .filter(lambda x: len(x) \u003e 1)\r\n       )\r\n```\r\n\r\n\t\r\n# Final Design\r\n\r\nThe requirement is best met by a hybrid solution of Python and PySpark.\r\nBelow are the two steps of the flow.\r\n\r\n## Step 1 - Preprocess the file \r\n\r\nIn this step we pick up the raw file from the ‘incoming data’ folder and convert it to a \r\ngeneric csv file (columns are ‘,’ separated and records are ‘\\n’ separated). \r\nWe can also use multiprocessing to increase performance. In addition, we run the preprocess script on all \r\nnodes in the cluster; each node picks up the files pending processing by using metadata \r\ninformation stored in config.txt. This file is kept up to date with the state of \r\neach file that has to be migrated. The final csv file is then placed under the ‘processed’ folder.\r\n\r\n## Step 2 – Convert to Parquet\r\nThis Python script polls on the master node, waiting for files to be added to the ‘processed’ \r\nfolder (it picks up files starting with the string con*). Once it detects a new file, it picks up the file and \r\nuses PySpark to convert the csv to parquet. \r\nThe final result is put into the ‘parquet’ folder, and the csv file is renamed to\r\n‘bkp_filename’ after migration so that it is ignored by the next run. \r\n\r\n## Flow Chart\r\n\r\n![img_3.png](./images/img_3.png)\r\n \r\n\r\n\r\nAssumptions:\r\n\tFolder values are hardcoded to local paths for local testing; we can replace these with hdfs or \r\ns3 based on the file system being used. Depending on that choice, the file-locking mechanism might have to be changed.\r\n\r\n\r\nPerformance Testing Baseline:\r\nTested the solution with a 25 GB input file; the end-to-end run time is approx. 6 mins. Extrapolating linearly, the approx. time for a 50 GB file is ~12 mins.\r\n\r\nAWS Serverless Solution\r\nThe flow chart below shows a possible serverless solution for this requirement using Glue ETL.\r\n\r\n ![img_4.png](./images/img_4.png)\r\n\r\nOther Possible Solutions \r\n\r\n\tWe could consider using Hive or Athena with CREATE TABLE to solve this, but this is constrained because the schema of the files is not clear.\r\n\r\n\tUse Hadoop streaming to run step 1 across multiple nodes in the cluster instead of the current approach. This would require screening logic to segregate the incoming files into different folders based on country.\r\n\r\n\r\n\r\n \r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslatawa%2Fcsv_parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslatawa%2Fcsv_parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslatawa%2Fcsv_parquet/lists"}