{"id":37074792,"url":"https://github.com/signaln/parallelio","last_synced_at":"2026-01-14T08:48:02.954Z","repository":{"id":57450716,"uuid":"102192064","full_name":"SignalN/parallelio","owner":"SignalN","description":"For reading from and writing to parallel data files in Python","archived":false,"fork":false,"pushed_at":"2017-09-07T15:41:44.000Z","size":11,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-25T08:46:36.520Z","etag":null,"topics":["machine-learning","natural-language-processing","pre-processing","preprocessing","text","text-data"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SignalN.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-02T11:30:34.000Z","updated_at":"2022-05-01T03:24:47.000Z","dependencies_parsed_at":"2022-09-26T17:31:34.089Z","dependency_job_id":null,"html_url":"https://github.com/SignalN/parallelio","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/SignalN/parallelio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SignalN%2Fparallelio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SignalN%2Fparallelio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SignalN%2Fparallelio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SignalN%2Fparallelio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SignalN","download_url":"https://codeload.
github.com/SignalN/parallelio/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SignalN%2Fparallelio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28414693,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","natural-language-processing","pre-processing","preprocessing","text","text-data"],"created_at":"2026-01-14T08:48:02.305Z","updated_at":"2026-01-14T08:48:02.944Z","avatar_url":"https://github.com/SignalN.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parallel I/O\n\n**Parallel I/O** is a library for easily reading from and writing to parallel data files in Python.\n\n***What are parallel data files?***\n\nParallel data files are two or more files that have the same number of lines, like columns in a spreadsheet.  
Their rows correspond to each other.\n\nWith Parallel I/O, data from the same row across multiple files can be read as input to functions, and the output of the functions can be written to new files.\n\nIt is especially intended for text data at scale, for which formats like CSV and TSV are not ideal.\n\n```\npip install parallelio\n```\n\n```\nfrom parallelio.parallelio import pread, papply, pwrite\n\na_b = pread(\"a.txt\", \"b.txt\")\nc = papply(your_magic_fn, a_b)\npwrite(c, \"c.txt\")\n```\n\n`pread`, `pwrite` and `papply` do not change the number of lines, but `pinsert` and `pfilter` do.\n\n### pread\n`pread` reads in a variable number of files, which must have the same number of lines.\n```\na_b = pread(\"a.txt\", \"b.txt\")\n```\nIt returns an iterator over tuples of corresponding lines.\n\n### papply\n`papply` applies a function to the items in the iterator.\n```\nc = papply(magic_fn, a_b)\n```\n`fn` should expect an argument for each item in the iterator's tuples, for example `lambda a, b: a + ' ' + b`, where `a` is a line in a.txt and `b` is the corresponding line in b.txt.  It can also take arbitrary keyword arguments.  It should return a single value.\n\n### pwrite\n`pwrite` writes lines to a file.\n```\npwrite(c, \"c.txt\")\n```\nIt expects an iterator of values, and writes out one value per line.  It returns only the path to the newly written file.\n\n### pinsert\n`pinsert` turns one line into multiple lines.\n```\nc = pinsert(insert_fn, c)\n```\n`fn` should have an argument for each item in the iterator's tuples.  It can also take arbitrary keyword arguments.  It should return a tuple of values.  The tuple can be empty, and if it is empty or it does not contain the original value then it is equivalent to filtering out the line.\n\n`pinsert` returns a new iterator.\n\n### pfilter\n`pfilter` is a way to remove certain lines.\n```\nc = pfilter(fn, c)\n```\n`fn` should have an argument for each item in the iterator's tuples.  
It can also take arbitrary keyword arguments.  Similar to built-in `filter`, only those items in the iterator for which `fn` returns something that evaluates to `True` are preserved.\n\n`pfilter` returns a new iterator.\n\n### pio\n`pio` is simply all operations in one - `pread`, `pinsert`, `papply`, `pfilter` and `pwrite`.\n\n```\nc_txt = pio(fn, \"a.txt\", \"b.txt\", insert_fn=fx, filter_fn=fy, path=\"c.txt\")\n```\n\nIf `path` is an extension, it is appended to the common prefix of the input paths.  For example, if the input files are `\"data/fifa/matches.location.txt\"` and `\"data/fifa/matches.date.txt\"`, and `path` is `\".weather.txt\"`, the output will be written to\n`\"data/fifa/matches.weather.txt\"`.\n\n## Keyword arguments\n\n`pinsert`, `papply`, `pfilter` and `pio` support keyword arguments that will be passed on to the functions `fn`.\n\n## Example\n\na.txt:\n```\nAleppo\nBellinzona\nChicago\nDetroit\n```\n\nb.txt:\n```\nAlla\nBoban\nCharles\nDino\n```\n\nYour code:\n```\nfrom parallelio.parallelio import pread, papply, pwrite\n\ndef your_magic_fn(a, b):\n  return a + ' ' + b\n\na_b = pread(\"a.txt\", \"b.txt\")\nc = papply(your_magic_fn, a_b)\npwrite(c, \"c.txt\")\n```\n\nOnce it runs, c.txt will be written with:\n```\nAleppo Alla\nBellinzona Boban\nChicago Charles\nDetroit Dino\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsignaln%2Fparallelio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsignaln%2Fparallelio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsignaln%2Fparallelio/lists"}