{"id":18779148,"url":"https://github.com/trisongz/pylines","last_synced_at":"2025-09-01T10:03:06.408Z","repository":{"id":44876328,"uuid":"307857896","full_name":"trisongz/pylines","owner":"trisongz","description":"Simplifying parsing of large jsonline files in NLP Workflows","archived":false,"fork":false,"pushed_at":"2022-01-21T08:01:17.000Z","size":250,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-14T16:16:28.812Z","etag":null,"topics":["json","jsonlines","jsonlines-data","nlp-datasets"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/pylines/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trisongz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-27T23:56:36.000Z","updated_at":"2024-05-28T12:44:26.000Z","dependencies_parsed_at":"2022-09-06T01:11:36.117Z","dependency_job_id":null,"html_url":"https://github.com/trisongz/pylines","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/trisongz/pylines","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Fpylines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Fpylines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Fpylines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Fpylines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trisongz","download_url":"https://codeload.github.com/trisongz/pylines/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trisongz%2Fpylines/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260109456,"owners_count":22960023,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json","jsonlines","jsonlines-data","nlp-datasets"],"created_at":"2024-11-07T20:18:59.026Z","updated_at":"2025-07-04T13:05:04.005Z","avatar_url":"https://github.com/trisongz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pylines\nSimplifying parsing of large jsonline files in NLP Workflows (but you can use it for anything)\n\n## Highlights\n- Memory Efficient Processing of Large (GB+) Jsonline files\n- High performance through use of iterators, multiprocessing, and simdjson\n- Deterministic open_function, allowing r/w to cloud storage objects (s3,gs)\n- Removes need to handle dumps, newlines, flush when writing to file\n- Minimal dependencies (only simdjson)\n- Allows passing of functions directly to iterators w/ multiprocessing, writing to file or returning results\n\n## Quickstart\n\n#### Quick Installation + Various Flavors\n```shell\n\n# only installs pysimdjson as the requirement\npip install --upgrade pylines \n\n # Installs tensorflow, smart_open[all], google-cloud-storage\npip install --upgrade pylines[cloud]\n\n# Installs above + transformers, torch\npip install --upgrade pylines[all] \n\n```\n\n#### Source Installation (probably a bad idea.)\n```shell\npip install --upgrade git+https://github.com/trisongz/pylines\n```\n\n\n#### Why Pylines\n\nAfter using numerous data formats, I came to appreciate jsonlines due to its transparency (being able to look at the data without it being in a binary format), as well as its ability to handle text-based serialization when compared to other data formats that often break. Pylines is used extensively in my private projects and is being released as a standalone project since I almost always use it when dealing with data.\n\nPylines is designed to be simplistic and deterministic, dealing with things such as:\n\n- open function: will determine open based on the filename prefix, including from gs, s3, https, and hfds\n- r/w/a: will determine write mode based on detecting a file present, unless overwrite is called\n- broken lines: by default, will skip any unparsable lines rather than returning an error, which can be a pain in the middle of a large data processing batch, and then having to figure out how to resume\n- flexibility in adding multiple files: deals with strings, lists, and/or globs\n- flushing the file: will automatically flush file buffers before closing to prevent unwritten files when writing to storage objects\n\nPylines is backed by [pysimdjson](https://github.com/TkTech/pysimdjson), a python binding around [simdjson](https://github.com/lemire/simdjson), allowing high performance read/write ops, as well as being highly memory efficient when dealing with extremely large files (30GB+) through extensive use of generators.\n\nSome additional current features include:\n\n- Multi-threaded Tokenization from input files\n- Tensorflow Records Writer with Tokenization\n- Object Storage r/w support\n- Useful iterators/writers that take external functions, and switching from one to another easily.\n- A LazyLoadFile class (wip) that allows for sampling of the dataset without loading everything into memory\n- functions that allow retrieving a line by index or with a key=value match\n\n\n#### Example of Usage\n```python3\n'''\nBase Default Variables for Pylines\n\ninput_fns: None, if no input file is set, can be later set or added. by calling , adding to existing files or  to override existing list of files, and set to the new files.\noutput_fn: None, if no output file is set, can be later set or added. - if no output file is set, can be later set by calling \nskip_broken: True, will skip unparsable lines by default\noverwrite_output: False, will control overwriting of the output file\nuse_lazy: False, doesn't do anything right now\nuse_mp: True, doesn't explicitly do anything right now\nuse_idx: False, doesn't do anything right now\ntotal_lines: 0, used to set the total lines expected, useful if you already know how many items are in your file and prevent counting of items during initialization.\n\n'''\nfrom pylines import Pylines\ninput_fn = 'myinputfile.json' # or ['inputfile1.json', 'inputfile2.json']\n\n# Quick Iterator through all items\nfor item in Pylines(input_fn):\n    # process items here\n    print(item)\n\n# Can be initialized as an object\npylines = Pylines(input_fn)\n# returns total count of all lines in all input files\nprint(len(pylines))\n\n# Can set output file and then iterate through input files, and writing to output file\noutput_file = 'myoutputfile.json' # only takes a string right now\npylines.set_writefile(output_file)\n\nwrite = pylines.write\nfor item in pylines.as_iterator():\n    # process items here\n    write(item)\n\n# alternatively, you can pass a function and run an iterator with mp, and have it write to the output file\npylines.run_function(processor_function)\n\n# or get the results from the processor to do something else\nfor result in pylines.as_function(processor_function):\n    # do something with results\n```\n\n\n#### Built in Generators and Writers\nPylines has several built in iterators that follow the same pattern, making it simple to switch between a generator and writing to a file\n\n```python3\n\nfrom pylines import Pylines\n\ninput_fn = 'myinputfile.json'\noutput_fn = 'tokenized_file.json'\n\n# as_* indicates a generator, yielding results\n# run_* indicates a writer, no returns\n\npylines = Pylines(input_file, output_file)\n\n# Assuming defined functions\n\nfor result in pylines.as_tokenizer(tokenizer_function):\n    yield result\n\n# Then to write it all to file after verifying the results\n\npylines.run_tokenizer(tokenizer_function)\n\n# Other functions\n\n# Maps each line in the file to a custom defined processor function. expects a result\npylines.as_processor(custom_processor_function) -\u003e Generator\npylines.run_processor(custom_processor_function) -\u003e Writer\n\n# Takes a filter function for each key, with 'bypass' being keys that are ignored.\n# As such, since the function can also return a different value, the output from the filter function will update the value\n# Filter function must either return None or a value.\n# Example:\n# filter_funcs = {'text': text_filter_fuc, 'target': text_processor_func, 'idx': filter_idx_func, 'bypass': ['key_1', 'key_2']}\n\npylines.as_filter(custom_processor_function) -\u003e Generator\npylines.run_filter(custom_processor_function) -\u003e Writer\n\n# Encoder only supports tensorflow records at this time.\n\n# Options \n# output_dir, dataset_features=None, tokenizer_fn=None, serialization='tf',\n# input_fns=None, start_idx=1, split_key='split', split='train', \n# write_string='{}_shard_{}.tfrecords', shard_size=50000, overwrite=False, \n# use_tempdir=False, use_mp=True\n\n# Requires an output_dir, dataset_features, and tokenizer_fn if not passed previously.\n# For gs backed object storage, you can write directly to gs. If it's s3, it will create a temporary dir, and upon completion, run a background\n# Thread to copy/move the file to s3.\n# You can explicitly set use_tempdir=True to use local_storage during write, and it will write to local first.\n\n# Write String requires two placeholders, which is formatted using 'split' and the current shard count.\n# If files are detected in the dir, the shard count will automatically be the next shard number, preventing overwrites, unless explicitly called, or start_idx!=1.\n\n# if multiple splits exist in the dataset, you can set split_key to the key that your split is defined as, and split to the value.\n# It will run a filter to find all the items that match split_key=split before encoding.\n# If no split_key exists in the dataset, set split_key=None which will process the entire dataset without filtering\n\n# Expected Dataset Features Format:\n\ndataset_features = {\n        'x': {\n            'names': [\"input_ids\", \"attention_mask\"]\n        },\n        'y': {\n            'names': [\"target_ids\", \"target_attention_mask\", \"shifted_target_input_ids\"]\n        }\n    }\n\npylines.as_encoder(dataset_features=dataset_features, tokenizer_fn=tokenize_fn, split_key=None, use_mp=1) -\u003e Generator\n\n# Only output_dir is required for run_encoder\npylines.run_encoder(dataset_features=dataset_features, tokenizer_fn=tokenize_fn, output_dir=output_dir, split_key=None, use_mp=1) -\u003e Writer\n\n```\n\n\n#### Example of Tokenization\n```python3\nfrom transformers import T5Tokenizer\nfrom pylines import Pylines\nimport numpy as np\n\ntokenizer = T5Tokenizer.from_pretrained('t5-base')\ninput_fn = 'myinputfile.json'\noutput_fn = 'tokenized_file.json'\n\ndef shift_to_right(input_ids, decoder_start_token_id):\n    shifted_input_ids = np.zeros(input_ids.shape, dtype=input_ids.dtype)\n    shifted_input_ids[..., 1:] = input_ids[..., :-1]\n    shifted_input_ids[..., 0] = decoder_start_token_id\n    return shifted_input_ids\n\ndef tokenize_fn(example):\n    encoder_tokens = tokenizer(example['source_text'], truncation='longest_first', max_length=416, padding='max_length', add_special_tokens=True)\n    decoder_tokens = tokenizer(example['target_text'], truncation='longest_first', max_length=96, padding='max_length', add_special_tokens=True)\n    target_input_ids = np.copy(decoder_tokens['input_ids'])\n\n    shifted_target_input_ids = shift_to_right(target_input_ids, tokenizer.pad_token_id)\n    target_input_ids[target_input_ids == tokenizer.pad_token_id] = -100\n\n    return {\n        'input_ids': encoder_tokens['input_ids'],\n        'attention_mask': encoder_tokens['attention_mask'],\n        'target_ids': target_input_ids.tolist(),\n        'target_attention_mask': decoder_tokens['attention_mask'],\n        'shifted_target_input_ids': shifted_target_input_ids.tolist()\n    }\n\n# Initialize Pylines with input_fn and output_fn\npylines = Pylines(input_fn, output_fn, overwrite_output=False)\n\n# Pass the above tokenization function through, which will be serialized and used to process every line.\n# By default, use_mp=True will use all cores.\npylines.run_tokenizer(tokenize_fn, use_mp=True)\n# or use_mp=int will use that many processes in mp.Pool\npylines.run_tokenizer(tokenize_fn, use_mp=4)\n\n# or use as a generator\nfor ex in pylines.as_tokenizer(tokenize_fn, use_mp=True):\n    # do something with ex\n\n# Or serialize to tfrecords by defining your dataset features\n\ndataset_features = {\n        'x': {\n            'names': [\"input_ids\", \"attention_mask\"]\n        },\n        'y': {\n            'names': [\"target_ids\", \"target_attention_mask\", \"shifted_target_input_ids\"]\n        }\n    }\n\n# will yield serialized examples after passing through tokenizer_fn if its provided.\nfor x, ex in enumerate(pylines.as_encoder(dataset_features=dataset_features, tokenizer_fn=tokenize_fn, use_mp=1)):\n    if x == 5:\n        print(ex)\n\n# define a output_dir and it will write tfrecords\noutput_dir = '/my/dataset/dir'\npylines.run_encoder(dataset_features=dataset_features, tokenizer_fn=tokenize_fn, output_dir=output_dir, split_key=None, use_mp=1)\n\n\n```\n#### Automatically Reading/Writing to Object Storage\n\nBy Default, upon init, Pylines will look for common environment variables including:\n- GOOGLE_APPLICATION_CREDENTIALS\n- AWS_SHARED_CREDENTIALS_FILE\n- AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY\n\nIf those are set, then the associated client will be initialized\n- google.cloud.storage - GCS\n- boto3 - S3\n\nThese will be passed to smart_open for S3, and GCS (if tensorflow is not installed).\n\nIf these are not automatically detected during first initialization, they can also be explicitly set\n```python3\n\nimport os\nfrom pylines import auth_cloud, Pylines\n\n# Can be set at the env level \nos.environ[\"AWS_ACCESS_KEY_ID\"] = ''\nos.environ[\"AWS_SECRET_ACCESS_KEY\"] = ''\nos.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = '/path/to/adc.json'\nos.environ[\"AWS_SHARED_CREDENTIALS_FILE\"] = '/path/to/file'\n\nauth_cloud()\n\n# or passed explictly\nmy_gcs_cred = '/path/to/adc.json'\nmy_s3_cred = '/path/to/file'\n\nauth_cloud(gcs=my_gcs_cred, s3=my_s3_cred)\n\n# or passed as a dict, only for s3\n\ns3 = {\n    'aws_access_key_id': 'key',\n    'aws_secret_access_key': 'key'\n}\nauth_cloud(gcs=my_gcs_cred, s3=s3)\n```\n\n\n## Quick Function Cheatsheet\n```python3\n'''\n\n# I/O\nPylines.add_files(input_files): str or list. Used to add new files to the input feed\nPylines.set_input_files(input_files): str or list. Used to override existing files.\n\nPylines.set_writefile(output_file, overwrite=False): str, bool. Used to set the output file. Will append by default if file exists.\n\nPylines.clear_input_files(): No args. Clears all input files. Use set_input_files or add_files to add new input files.\n\n# Json Functions - Majority of default json functions are wrappers around simdjson\nPylines.parse(x): Will read from binary and return as bytes. Use .loads() instead for a python dict. Useful for direct serialization, ie from json to pickle.\nPylines.loads(x): Will read from binary or any json string, and return as deserialized json.\n\nPylines.dumps(x): Serializes json input\nPylines.dump(filename, x): Serializes json input (non-jsonlines), and will take an open_fn or a raw file string.\n\nPylines.index(idx, filename=None): -\u003e dict\nReturns a line from a specific index. If filename is set, will use that specific file instead of existing files added. If no filename, will return the same index from all files added.\n\nPylines.write(x): Serializes and writes json to the output_fn and appends with a newline. Do not call with json.dumps() as this is already handled.\n\n# Other Functions\nPylines.find(key, value, results=['first', 'all'], filename=None): -\u003e dict [generator]\nYields results when key=value from file, assuming all lines have key within the line. If used with filename, will search that filename instead of added files, else will search through all files added. If results='first', will only return the first line that matches. If results='all', will return all lines in all file(s) that match.\n\nPylines.as_iterator(): -\u003e dict [generator]\nTakes no args. Iterates over all files added and yields deserialized json lines.\n\nPylines.merge(input_fns=None, output_fn=None):\nAdds input_fns to existing files, and sets output_fn if not none. Merges them all into the output fn.\n\nPylines.linecount(filename=None): -\u003e dict in {filename: num_lines} for each file\nIf filename, will find the total number of lines in the file. Else will iterate through input_files and returns dict\n\nPylines.stats -\u003e dict [property] in {filename: {'read': int, 'missed': int}} for each filename that have been processed. Will reset upon each call of iterators.\n\n# Example usage below.\nPylines.run_tokenizer(tokenizer_fn, use_mp=[True, or int for num_processes]): -\u003e tokenized examples to output_fn.\nIterates through all files and tokenizes with provided tokenizer_fn. use_mp = False will not use multiprocessing. Otherwise will use all available cores by default.\n\nPylines.run_processor(iter_fn, use_mp=[True, or int for num_processes]): -\u003e  results from function(line) to output_fn.\nIterates through all files and tokenizes with provided iter_fn. use_mp = False will not use multiprocessing. Otherwise will use all available cores by default.\n'''\n\n```\n\n\nRoadmap:\n\n- Different format of serialization outputs (pickle, torch) // tfrecords done\n- Creation of Dataset objects for Tensorflow and Pytorch, to easily plug into training pipelines, with or without caching\n- Support for sharding of files rather than a single big output file\n- Conditional merging of input files // done\n- Support for mapping of jsonlines into functions\n- Allow for caching through redis backend\n- Creating an index file that maps keys \u003c-\u003e int to save precious bytes when writing large files\n  ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrisongz%2Fpylines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrisongz%2Fpylines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrisongz%2Fpylines/lists"}