{"id":18594683,"url":"https://github.com/ernitingarg/very-large-file-processing-python","last_synced_at":"2025-05-16T12:12:56.856Z","repository":{"id":189173456,"uuid":"669782772","full_name":"ernitingarg/very-large-file-processing-python","owner":"ernitingarg","description":"Python solution which uses min-heap data structure and thread parallelism to process very large file","archived":false,"fork":false,"pushed_at":"2023-08-07T06:26:05.000Z","size":7,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-17T22:46:29.335Z","etag":null,"topics":["data-structures","large-files","min-heap","multiprocessing","python3","space-complexity","threading","time-complexity","unit-testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ernitingarg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-07-23T12:14:53.000Z","updated_at":"2023-10-22T11:15:06.000Z","dependencies_parsed_at":"2023-08-18T16:38:54.547Z","dependency_job_id":null,"html_url":"https://github.com/ernitingarg/very-large-file-processing-python","commit_stats":null,"previous_names":["ernitingarg09/file_data_processor","ernitingarg/very-large-file-processing-python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ernitingarg%2Fvery-large-file-processing-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ernitingarg%2Fvery-large-file-processing-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/ernitingarg%2Fvery-large-file-processing-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ernitingarg%2Fvery-large-file-processing-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ernitingarg","download_url":"https://codeload.github.com/ernitingarg/very-large-file-processing-python/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254527099,"owners_count":22085919,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-structures","large-files","min-heap","multiprocessing","python3","space-complexity","threading","time-complexity","unit-testing"],"created_at":"2024-11-07T01:16:39.234Z","updated_at":"2025-05-16T12:12:56.494Z","avatar_url":"https://github.com/ernitingarg.png","language":"Python","readme":"# File Data Processor\n\nThis solution is designed to find the unique IDs associated with the X-largest values in the rightmost column of a file with the specified format. The program reads the input data from either a file or standard input (stdin) and processes it to produce the desired output. 
The program also includes error handling for various failure scenarios.\n\n## Input data format\n\nThe input data should be in the following fixed format:\n\n```\n\u003cunique record identifier\u003e\u003cwhite_space\u003e\u003cnumeric value\u003e\n```\n\nFor example:\n\n```\n1426828011 9\n1426828028 350\n1426828037 25\n1426828056 231\n1426828058 109\n1426828066 111\n```\n\n## Output format\n\nThe program prints a list of the unique IDs associated with the X-largest values in the rightmost column, where X is specified as an input parameter. For X=3, the above input should produce the following output.\nNote: the output IDs may appear in any order.\n\n```\n1426828028\n1426828066\n1426828056\n```\n\n## Algorithm\n\nThe solution uses a min-heap data structure to efficiently find the X-largest values in the rightmost column while processing the input data. The `Record` class implements the comparison methods required by the min-heap to compare records based on their values. 
The `RecordUtils` class provides static method(s) to find the unique IDs associated with the X-largest values.\n\n- Initialize an empty min-heap to store the X-largest records.\n- Process each line of input data.\n- For each line, extract the unique record identifier and numeric value.\n- Create a new `Record` object with the unique identifier and numeric value.\n- If the min-heap has not reached its capacity (X), push the current record onto the heap.\n- If the min-heap is full, push the current record onto the heap and simultaneously pop the smallest element (the root), so the heap always retains the X largest records seen so far.\n- Repeat the above steps until all input data is processed.\n- Extract the unique IDs from the X-largest records in the min-heap and return them as the result.\n\n## Core and Thread Parallelism\n\nThe solution leverages both core and thread parallelism to optimize the processing of input data.\n\n- The number of CPU cores available on the system is determined using `multiprocessing.cpu_count()`.\n- The input data is split into smaller chunks to distribute the workload among threads.\n- Each chunk of data is processed concurrently by a separate thread, which significantly reduces the processing time for large input datasets.\n- The results from all chunks are then merged into a single list. 
The merged list contains all the records from the different chunks.\n- Finally, the X-largest values are extracted from the merged list.\n\n## Time Complexity\n\nFor a given min-heap of size X, let's assume the total number of records in the input data is N.\n\n- Reading and parsing each line of input data: O(N)\n- Heap insertion and extraction (proportional to the height of the heap): O(log X) per record\n- Overall time complexity: `O(N log X)`\n\n## Space Complexity\n\nFor a given min-heap of size X,\n\n- Min-heap to store the X-largest records: O(X)\n- Additional variables for processing: O(1)\n- Overall space complexity: `O(X)`\n\n## Usage\n\n- Open the Command Prompt (CMD) or PowerShell.\n- Navigate to this directory.\n- To read the input data from standard input (stdin):\n\n```\npython main.py\n```\n\n- To read the data from a file:\n\n```\npython main.py data.txt\n```\n\n- After running the script, the program will prompt you to enter the value of X (the number of largest values to find). Enter a positive integer value for X.\n\n## Unit tests\n\nRun the following command to execute the unit tests:\n\n```\npython -m unittest test_record_utils.py\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fernitingarg%2Fvery-large-file-processing-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fernitingarg%2Fvery-large-file-processing-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fernitingarg%2Fvery-large-file-processing-python/lists"}