https://github.com/signaln/parallelio
For reading from and writing to parallel data files in Python
machine-learning natural-language-processing pre-processing preprocessing text text-data
- Host: GitHub
- URL: https://github.com/signaln/parallelio
- Owner: SignalN
- License: mit
- Created: 2017-09-02T11:30:34.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-09-07T15:41:44.000Z (over 8 years ago)
- Last Synced: 2025-08-25T08:46:36.520Z (5 months ago)
- Topics: machine-learning, natural-language-processing, pre-processing, preprocessing, text, text-data
- Language: Python
- Size: 10.7 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Parallel I/O
**Parallel I/O** is a library for easily reading from and writing to parallel data files in Python.
***What are parallel data files?***
Parallel data files are two or more files that have the same number of lines, like columns in a spreadsheet. Their rows correspond to each other.
With Parallel I/O, data from the same row across multiple files can be read as input to functions, and the output of the functions can be written to new files.
It is especially intended for text data at scale, for which formats like CSV and TSV are not ideal.
```
pip install parallelio
```
```
from parallelio.parallelio import pread, papply, pwrite
a_b = pread("a.txt", "b.txt")
c = papply(your_magic_fn, a_b)
pwrite(c, "c.txt")
```
`pread`, `pwrite` and `papply` do not change the number of lines, but `pinsert` and `pfilter` do.
### pread
`pread` reads in a variable number of files, which must have the same number of lines.
```
a_b = pread("a.txt", "b.txt")
```
It returns an iterator over tuples of corresponding lines.
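To make the shape of that return value concrete, here is a minimal standard-library sketch of the same idea. It is not the library's implementation: `read_parallel` is a hypothetical helper, stripping the trailing newline is an assumption, and plain `zip` silently stops at the shortest file instead of checking that the line counts match.

```python
import os
import tempfile
from contextlib import ExitStack


def read_parallel(*paths):
    """Yield one tuple per row, holding the corresponding line of each file."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p, encoding="utf-8")) for p in paths]
        # zip pairs up line 1 with line 1, line 2 with line 2, and so on.
        for lines in zip(*files):
            yield tuple(line.rstrip("\n") for line in lines)


# Demo with two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")
    with open(path_a, "w", encoding="utf-8") as f:
        f.write("Aleppo\nBellinzona\n")
    with open(path_b, "w", encoding="utf-8") as f:
        f.write("Alla\nBoban\n")
    rows = list(read_parallel(path_a, path_b))

print(rows)
```

Because the rows are produced lazily, only one line per file needs to be in memory at a time, which is what makes this pattern suitable for large text files.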
### papply
`papply` applies a function to the items in the iterator.
```
c = papply(magic_fn, a_b)
```
`fn` (here `magic_fn`) should expect an argument for each item in the iterator's tuples, for example `lambda a, b: a + ' ' + b`, where `a` is a line in a.txt and `b` is the corresponding line in b.txt. It can also take arbitrary keyword arguments. It should return a single value.
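The calling convention can be illustrated without the library: each tuple is unpacked into positional arguments, which is what the standard library's `itertools.starmap` does. This is a sketch of the convention only, assuming `papply` behaves like a lazy map over rows; `join_fields` and the sample rows are made up for illustration.

```python
from itertools import starmap

# Rows as an iterator of tuples, one item per input file.
rows = [("Aleppo", "Alla"), ("Bellinzona", "Boban")]


def join_fields(a, b):
    # One positional argument per item in the tuple, one return value.
    return a + " " + b


# starmap calls join_fields(*row) for every row.
joined = list(starmap(join_fields, rows))
print(joined)
```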
### pwrite
`pwrite` writes lines to a file.
```
pwrite(c, "c.txt")
```
It expects an iterator of values, and writes out one value per line. It returns only the path to the newly written file.
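The contract described above (consume an iterator, write one value per line, return the path) can be sketched with the standard library alone. `write_lines` is a hypothetical stand-in, not the library's code, and appending `"\n"` to each value is an assumption about the output format.

```python
import os
import tempfile


def write_lines(values, path):
    """Write one value per line and return the path that was written."""
    with open(path, "w", encoding="utf-8") as f:
        for value in values:
            f.write(f"{value}\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    out_path = write_lines(
        iter(["Aleppo Alla", "Bellinzona Boban"]),
        os.path.join(tmp, "c.txt"),
    )
    # Read the file back to check what ended up on disk.
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

Returning the path rather than the data makes it easy to feed the new file into a later read step.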
### pinsert
`pinsert` turns one line into multiple lines.
```
c = pinsert(insert_fn, c)
```
`fn` (here `insert_fn`) should have an argument for each item in the iterator's tuples. It can also take arbitrary keyword arguments. It should return a tuple of values, one per output line. If the tuple is empty, or does not contain the original value, the effect is the same as filtering out the line.
`pinsert` returns a new iterator.
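One way to picture this behaviour is as a "flat map": call the function on every row and chain the returned tuples into a single stream. The sketch below uses only the standard library and assumes the rows are tuples; `expand` and `split_sentences` are hypothetical names, not part of Parallel I/O.

```python
from itertools import chain


def expand(fn, rows, **kwargs):
    """Call fn on each row and flatten the returned tuples into one stream."""
    return chain.from_iterable(fn(*row, **kwargs) for row in rows)


def split_sentences(text):
    # Hypothetical insert function: one output value per sentence.
    # Blank input yields an empty tuple, so that line disappears.
    return tuple(s.strip() for s in text.split(".") if s.strip())


rows = [("One. Two.",), ("",), ("Three",)]
expanded = list(expand(split_sentences, rows))
print(expanded)
```

Three input rows become three output values here only by coincidence: the first row grows to two values and the second is dropped.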
### pfilter
`pfilter` is a way to remove certain lines.
```
c = pfilter(fn, c)
```
`fn` should have an argument for each item in the iterator's tuples. It can also take arbitrary keyword arguments. As with the built-in `filter`, only those items in the iterator for which `fn` returns a value that evaluates to `True` are preserved.
`pfilter` returns a new iterator.
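The "evaluates to `True`" rule means the predicate does not have to return a real boolean. The following standard-library sketch shows that rule under the assumption that rows are tuples unpacked into positional arguments; `keep_rows` and `both_present` are hypothetical names used only for illustration.

```python
def keep_rows(fn, rows, **kwargs):
    """Lazily keep only the rows for which fn(*row) is truthy."""
    return (row for row in rows if fn(*row, **kwargs))


def both_present(a, b):
    # Returns a string, not a bool: an empty string is falsy.
    return a and b


rows = [("Aleppo", "Alla"), ("", "Boban"), ("Chicago", "Charles")]
kept = list(keep_rows(both_present, rows))
print(kept)
```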
### pio
`pio` combines all operations in one call: `pread`, `pinsert`, `papply`, `pfilter` and `pwrite`.
```
c_txt = pio(fn, "a.txt", "b.txt", insert_fn=fx, filter_fn=fy, path="c.txt")
```
If `path` is an extension, it is appended to the common prefix of the input paths. For example, if the input files are `"data/fifa/matches.location.txt"` and `"data/fifa/matches.date.txt"`, and `path` is `".weather.txt"`, the output will be written to `"data/fifa/matches.weather.txt"`.
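The path rule above can be reproduced with `os.path.commonprefix`. This is a sketch under stated assumptions, not the library's logic: `resolve_output_path` is a hypothetical helper, and treating "starts with a dot and has no directory part" as the test for an extension is a guess at how that case is detected.

```python
import os


def resolve_output_path(path, *input_paths):
    """Return path unchanged, or prefix it if it looks like an extension."""
    is_extension = path.startswith(".") and os.path.dirname(path) == ""
    if not is_extension:
        return path
    # Character-wise common prefix, e.g. "data/fifa/matches."
    prefix = os.path.commonprefix(list(input_paths))
    # Drop the trailing dot so the extension's own dot is not doubled.
    return prefix.rstrip(".") + path


resolved = resolve_output_path(
    ".weather.txt",
    "data/fifa/matches.location.txt",
    "data/fifa/matches.date.txt",
)
print(resolved)
```

Note that `os.path.commonprefix` compares character by character, so inputs such as `matches.date.txt` and `matches.day.txt` would share the prefix `matches.da`; a real implementation may need to handle that case differently.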
## Keyword arguments
`pinsert`, `papply`, `pfilter` and `pio` support keyword arguments that will be passed on to the functions `fn`.
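A short sketch of what such forwarding could look like, using only plain Python. `apply_rows` is a hypothetical stand-in for `papply`, and the `sep` parameter is invented for the example; the point is only that extra keyword arguments reach `fn` unchanged on every call.

```python
def apply_rows(fn, rows, **kwargs):
    """Call fn(*row, **kwargs) for every row and yield the results."""
    return (fn(*row, **kwargs) for row in rows)


def join_fields(a, b, sep=" "):
    # sep is supplied by the caller of apply_rows, not by the rows.
    return a + sep + b


rows = [("Aleppo", "Alla"), ("Bellinzona", "Boban")]
joined = list(apply_rows(join_fields, rows, sep="\t"))
print(joined)
```

This keeps row data and configuration separate: the rows supply the positional arguments, and options such as a separator are passed once as keywords.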
## Example
a.txt:
```
Aleppo
Bellinzona
Chicago
Detroit
```
b.txt:
```
Alla
Boban
Charles
Dino
```
Your code:
```
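from parallelio.parallelio import pread, papply, pwrite

def your_magic_fn(a, b):
    # a is a line from a.txt, b is the corresponding line from b.txt
    return a + ' ' + b

a_b = pread("a.txt", "b.txt")   # iterator over (a, b) tuples
c = papply(your_magic_fn, a_b)  # one joined value per row
pwrite(c, "c.txt")              # writes one value per line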
```
Once it runs, c.txt will be written with:
```
Aleppo Alla
Bellinzona Boban
Chicago Charles
Detroit Dino
```