https://github.com/pushshift/parallel-ndjson-reader
Parallel NDJSON Reader for Python
json multiprocessing ndjson newline parallel parallel-processing python
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/pushshift/parallel-ndjson-reader
- Owner: pushshift
- Created: 2018-06-15T07:20:29.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-12-04T08:22:35.000Z (almost 6 years ago)
- Last Synced: 2025-04-07T07:51:19.194Z (6 months ago)
- Topics: json, multiprocessing, ndjson, newline, parallel, parallel-processing, python
- Language: Python
- Size: 1000 Bytes
- Stars: 16
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# Parallel NDJSON Reader
### Purpose
This script reads and processes newline-delimited data extremely quickly. For NDJSON files, my 12-core Xeon was able to decode (json.loads) 90,000 Twitter objects per second. Throughput is basically limited by the number of CPU cores you have and how fast your I/O subsystem is.

### Features
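The approach described above, together with the `n_chunks` setting listed in this section, can be sketched roughly as follows. This is not the repository's code: the function names `chunk_offsets`, `decode_chunk`, and `parallel_decode` are hypothetical, and the sketch assumes the file is split into byte ranges aligned to line starts, with each range decoded by `json.loads` in its own worker process.

```python
import json
import os
from multiprocessing import Pool


def chunk_offsets(path, n_chunks):
    """Split a file into up to n_chunks (start, end) byte ranges.

    Every range starts at the beginning of a line, so no record is cut in
    half. Small files simply yield fewer ranges than requested.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)  # jump to an approximate split point
            f.readline()                  # advance to the next line start
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def decode_chunk(args):
    """Decode every non-empty line in one byte range; return the record count."""
    path, start, end = args
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if line.strip():
                json.loads(line)  # replace with real per-record processing
                count += 1
    return count


def parallel_decode(path, n_chunks=4):
    """Decode all records using one worker process per byte range."""
    ranges = chunk_offsets(path, n_chunks)
    if not ranges:
        return 0
    tasks = [(path, start, end) for start, end in ranges]
    with Pool(processes=len(ranges)) as pool:
        return sum(pool.map(decode_chunk, tasks))
```

In a script, a call such as `parallel_decode("tweets.ndjson", n_chunks=12)` (the file name is only a placeholder) would normally sit under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform. Returning only a count keeps the data passed back between processes small; real processing would go where `json.loads` is called.
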
- Select the number of cores used by setting the value of the n_chunks variable.
- If the file is too small to split into N pieces, the script will scale to the maximum number of chunks possible. This script is not meant for small files since there is a little bit of startup time involved. This is meant to tear through big data (gigabytes / terabytes / petabytes).

jason@pushshift.io
https://pushshift.io/donations
### End