https://github.com/ernitingarg/very-large-file-processing-python

Python solution which uses min-heap data structure and thread parallalism to process very large file
https://github.com/ernitingarg/very-large-file-processing-python

data-structures large-files min-heap multiprocessing python3 space-complexity threading time-complexity unit-testing

Last synced: 7 months ago
JSON representation

Python solution which uses min-heap data structure and thread parallalism to process very large file

Host: GitHub
URL: https://github.com/ernitingarg/very-large-file-processing-python
Owner: ernitingarg
Created: 2023-07-23T12:14:53.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2023-08-07T06:26:05.000Z (over 2 years ago)
Last Synced: 2025-02-17T22:46:29.335Z (10 months ago)
Topics: data-structures, large-files, min-heap, multiprocessing, python3, space-complexity, threading, time-complexity, unit-testing
Language: Python
Homepage:
Size: 6.84 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # File Data Processor

This solution is designed to find the unique IDs associated with the X-largest values in the rightmost column of a file with the specified format. The program reads the input data from either a file or standard input (stdin) and processes it to produce the desired output. The program also includes error handling to handle various scenarios.

## Input data format

The input data should be in the following fixed format:

```

```

For example:

```

1426828011 9

1426828028 350

1426828037 25

1426828056 231

1426828058 109

1426828066 111

```

## Output format

The program prints a list of the unique IDs associated with the X-largest values in the rightmost column, where X is specified as an input parameter. For given X=3, the above input should produce below output.

Note: The order of the output list may vary and does not follow any particular order.

```

1426828028

1426828066

1426828056

```

## Algorithm

The solution uses a min-heap data structure to efficiently find the X-largest values in the rightmost column while processing the input data. The `Record` class implements the comparison methods required by the min-heap to compare records based on their values. The `RecordUtils` class provides static method(s) to find the unique IDs associated with the X-largest values.

- Initialize an empty min-heap to store the X-largest records.

- Process each line of input data.

- For each line, extract the unique record identifier and numeric value.

- Create a new `Record` object with the unique identifier and numeric value.

- If the min-heap has not reached its capacity (X), push the current record to the heap.

- If the min-heap is full, push the current recird into the heap and simultaneously pop the smallest element (root) of the heap.

- Repeat above steps until all input data is processed.

- Extract the unique IDs from the X-largest records in the min-heap and return them as the result.

## Core and Thread Parallelism

The solution leverages both core and thread parallelism to optimize the processing of input data.

- The number of CPU cores available on the system is determined using multiprocessing.cpu_count().

- The input data is split into smaller chunks to distribute the workload among threads.

- Each chunk of data is processed concurrently by separate threads, which significantly reduces the processing time for large input datasets.

- The results from all chunks are then merged into a single list. The merged list contains all the records from different chunks.

- Finally, the X-largest values are extracted from the merged list

## Time Complexity

For a given min-heap of size X, lets assume the total number of records in the input data is N.

- Reading and parsing each line of input data: O(N)

- Heap insertion and extraction (should be equal to height of the heap): O(log X) for each record

- Overall time complexity: `O(N log X)`

## Space Complexity

For a given min-heap of size X,

- Min-heap to store X-largest records: O(X)

- Additional variables for processing: O(1)

- Overall space complexity: `O(X)`

## Usage

- Open the Command Prompt (CMD) or PowerShell.

- Navigate to this current directoty

- To read the input data from the standard input (stdin)

```

python main.py

```

- To read the data from a file

```

python main.py data.txt

```

- After running the script, the program will prompt you to enter the value of X (the number of largest values to find). Enter a positive integer value for X.

## Unit tests

Please run below command to execute unit tests

```

python -m unittest test_record_utils.py

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ernitingarg/very-large-file-processing-python

Awesome Lists containing this project

README