https://github.com/rtmigo/linecompress_py
Stores text strings in a set of .txt.gz files. Minimalist solution for storing logs
- Host: GitHub
- URL: https://github.com/rtmigo/linecompress_py
- Owner: rtmigo
- License: mit
- Created: 2022-01-19T19:17:19.000Z (almost 4 years ago)
- Default Branch: staging
- Last Pushed: 2022-02-13T20:13:24.000Z (almost 4 years ago)
- Last Synced: 2025-01-21T15:32:20.496Z (12 months ago)
- Topics: logs, python
- Language: Python
- Homepage:
- Size: 73.2 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README


# [linecompress_py](https://github.com/rtmigo/linecompress_py#readme)
Library for storing text lines in compressed files.
It uses the **.gz** compression format, so the data can be decompressed by any
utility that supports .gz.
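Because these are ordinary gzip files, they can also be read without the library at all, for example with Python's standard `gzip` module. A minimal sketch (the path is just an illustration, and UTF-8 text is assumed):

```python3
import gzip

# Any .txt.gz file written by the library is a regular gzip archive,
# so the standard library can read it directly (path is an example).
with gzip.open('/parent/dir/000/000/000.txt.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        print(line.rstrip('\n'))
```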
# Install
```
pip3 install git+https://github.com/rtmigo/linecompress_py#egg=linecompress
```
# Use
`LinesDir` saves data to multiple files, making sure the files don't get too
big.
The files are filled sequentially: first we write only to `000.txt.gz`, then
only to `001.txt.gz`, and so on.
```python3
from pathlib import Path
from linecompress import LinesDir

lines_dir = LinesDir(Path('/parent/dir'),
                     max_file_size=1000000)

lines_dir.append('Line one')
lines_dir.append('Line two')

# reading from oldest to newest
for line in lines_dir:
    print(line)

# reading from newest to oldest
for line in reversed(lines_dir):
    print(line)
```
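Since the library is aimed at log storage, one convenient pattern is to keep one JSON record per line. A minimal sketch using only the `append` call and iteration shown above (the path and record fields are made up):

```python3
import json
from pathlib import Path
from linecompress import LinesDir

# hypothetical log directory
logs = LinesDir(Path('/var/myapp/logs'))

# one JSON object per stored line keeps records easy to parse back
logs.append(json.dumps({'event': 'started', 'pid': 1234}))
logs.append(json.dumps({'event': 'stopped', 'pid': 1234}))

for line in logs:
    record = json.loads(line)
    print(record['event'])
```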
# Directory structure
```
000/000/000.txt.gz
000/000/001.txt.gz
000/000/002.txt.gz
...
000/000/999.txt.gz
000/001/000.txt.gz
...
000/001/233.txt.gz
000/001/234.txt
```
The last file usually contains raw text, not yet compressed.
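The numbering is easy to reason about: each level holds a thousand entries and the innermost name changes fastest. A sketch of how a sequential file index would map to such a path (this is not the library's own code, just an illustration of the layout above):

```python3
def index_to_path(index: int, subdirs: int = 2) -> str:
    # three decimal digits per level, innermost digits change fastest
    parts = []
    for _ in range(subdirs + 1):
        parts.append(f'{index % 1000:03d}')
        index //= 1000
    return '/'.join(reversed(parts)) + '.txt.gz'

print(index_to_path(0))     # 000/000/000.txt.gz
print(index_to_path(1233))  # 000/001/233.txt.gz
```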
# Limitations
The default maximum file size is 1 million bytes (a decimal megabyte).
This is the size of the text data *before* compression.
The directory will hold up to a billion such files, so the maximum total
storage size is one decimal petabyte.
The `subdirs` argument controls the maximum number of files: increasing
`subdirs` by one multiplies the maximum number of files by a thousand.
With the default file size of 1 MB we get the following limits:
| subdirs | file path | max total size |
|-------------|--------------------------|----------------|
| `subdirs=0` | `000.txt.gz` | gigabyte |
| `subdirs=1` | `000/000.txt.gz` | terabyte |
| `subdirs=2` | `000/000/000.txt.gz` | petabyte |
| `subdirs=3` | `000/000/000/000.txt.gz` | exabyte |
These are the data sizes before compression. The actual size of the files on
the disk will most likely be smaller.
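These limits are plain arithmetic: 1000 entries per level, `subdirs + 1` levels, 1 MB per file. A quick check, independent of the library:

```python3
# capacity before compression: 1 MB per file, 1000 ** (subdirs + 1) files
for subdirs in range(4):
    total_bytes = 1_000_000 * 1000 ** (subdirs + 1)
    print(f'subdirs={subdirs}: {total_bytes:.0e} bytes')
# subdirs=0: 1e+09 bytes  (gigabyte)
# subdirs=1: 1e+12 bytes  (terabyte)
# subdirs=2: 1e+15 bytes  (petabyte)
# subdirs=3: 1e+18 bytes  (exabyte)
```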
Adjusting the limits:
```python3
from pathlib import Path
from linecompress import LinesDir

# subdirs=0 caps the store at a gigabyte of uncompressed text
gb = LinesDir(Path('/max/gigabyte'),
              subdirs=0)

# subdirs=2 is the default value
pb = LinesDir(Path('/max/petabyte'))

eb = LinesDir(Path('/max/exabyte'),
              subdirs=3)
```
The file size can also be adjusted.
```python3
from pathlib import Path
from linecompress import LinesDir

# the default buffer size is 1 MB
pb = LinesDir(Path('/max/petabyte'))

# set file size to 5 MB
pb5 = LinesDir(Path('/max/5_petabytes'),
               buffer_size=5000000)
```
* With larger files, we get better compression and less load on the file system.
* With smaller files, iterating through the lines in reverse order is much more
  efficient (see the sketch below).
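A sketch of tailing the ten newest lines, built on the `reversed()` iteration shown earlier (the path is made up):

```python3
from itertools import islice
from pathlib import Path
from linecompress import LinesDir

logs = LinesDir(Path('/var/myapp/logs'))

# take the ten most recent lines from the newest-to-oldest iterator
for line in islice(reversed(logs), 10):
    print(line)
```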
# See also
* [linecompress_kt](https://github.com/rtmigo/linecompress_kt) – Kotlin/JVM
library