{"id":18221751,"url":"https://github.com/karlicoss/bleanser","last_synced_at":"2025-04-03T02:30:45.820Z","repository":{"id":38377901,"uuid":"154458524","full_name":"karlicoss/bleanser","owner":"karlicoss","description":"Tool for cleaning old and redundant backups","archived":false,"fork":false,"pushed_at":"2025-01-19T03:07:17.000Z","size":443,"stargazers_count":13,"open_issues_count":2,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-02T20:05:08.127Z","etag":null,"topics":["backup","dataliberation"],"latest_commit_sha":null,"homepage":"https://beepb00p.xyz/exobrain/projects/bleanser.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/karlicoss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-24T07:35:22.000Z","updated_at":"2025-02-01T18:55:54.000Z","dependencies_parsed_at":"2023-10-17T06:07:38.098Z","dependency_job_id":"abd886fb-f6f0-4f39-b561-de8cd68dd9cb","html_url":"https://github.com/karlicoss/bleanser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karlicoss%2Fbleanser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karlicoss%2Fbleanser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karlicoss%2Fbleanser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/karlicoss%2Fbleanser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/karlicoss","download_url":"https://codeload.github.com/karlicoss/bleanser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246925192,"owners_count":20855852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backup","dataliberation"],"created_at":"2024-11-03T22:04:12.754Z","updated_at":"2025-04-03T02:30:45.794Z","avatar_url":"https://github.com/karlicoss.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"bleanser or 'backup cleanser' is a tool for cleaning old and redundant backups\n\n## Installing\n\nTo install, run: `pip install bleanser`.\n\nThere are also extra install options. You can use none or multiple depending on your needs, i.e. `pip install bleanser[flavor1,flavor2]`:\n\n- `bleanser[extra]` : some recommended but optional extras\n- `bleanser[json]` : dependencies for JSON based modules\n- `bleanser[xml]` : dependencies for XML based modules\n- `bleanser[HPI]` : dependencies for [HPI](https://github.com/karlicoss/HPI) based modules\n\nSee `optional-dependencies` section in [pyproject.toml](pyproject.toml) if you're curious what libraries these extras pull.\n\n## What bleanser does\n\nIn this context, backup typically means something like a GDPR export, an XML or JSON file which includes your data from some website/API, or a sqlite database from an application\n\n\u003chttps://beepb00p.xyz/exobrain/projects/bleanser.html\u003e\n\nThis is used to find 'redundant backups'. 
As an example, say you save your data to a JSON file by making API requests to some API service once a day. If today's export is a [superset](https://en.wikipedia.org/wiki/Subset) of yesterday's export, you know you can safely delete the old file and still have a complete backup of your data. This helps:\n\n- save on disk space\n- save on data access time, i.e. how long it takes to parse all your input files (see [data access layer](https://beepb00p.xyz/exports.html#dal))\n\nThis works for both [full](https://beepb00p.xyz/exports.html#full) (you're able to get all your data from a service) and [incremental](https://beepb00p.xyz/exports.html#incremental) exports.\n\nThis is especially relevant for incremental data exports, as they're harder to reason about. So, this handles the complex bits of diffing adjacent backups.\n\nAs an example of an incremental export, imagine the service you were using only gave you access to the latest 3 items in your history (a real example of this is the [github activity feed](https://github.com/karlicoss/ghexport))\n\n| Day 1 | Day 2 | Day 3 |\n| ----- | ----- | ----- |\n| A     | B     | C     |\n| B     | C     | D     |\n| C     | D     | E     |\n\nTo parse this in your [data access layer](https://beepb00p.xyz/exports.html#dal), you could imagine something like this:\n\n```python\nevents = set()\nfor file in inputs:\n    for line in file:\n        events.add(line)\n# events is now {'A', 'B', 'C', 'D', 'E'}\n```\n\nYou might notice that if you removed 'Day 2', you'd still have an accurate backup with all 5 items, but it's not obvious you can remove it, since none of these are supersets of each other.\n\n`bleanser` is meant to solve this problem in a data agnostic way, so any export can be converted to a normalised representation, and those can be compared against each other to find redundant data\n\nSidenote: in particular this is describing how `--multiway` finds redundant files, see 
[`options.md`](./doc/options.md) for more info\n\n## How it works\n\nThis has `Normaliser`s for different data sources (see [modules](src/bleanser/modules)), and generally follows a pattern like this:\n\n```python\nfrom contextlib import contextmanager\nfrom pathlib import Path\nfrom typing import Iterator\n\nfrom bleanser.core.processor import BaseNormaliser, unique_file_in_tempdir\n\nclass Normaliser(BaseNormaliser):\n\n    @contextmanager\n    def normalise(self, *, path: Path) -\u003e Iterator[Path]:\n        # if the input file was compressed, the \"path\" you receive here will be decompressed\n\n        # a temporary file we write 'normalised' data to, that can be easily diffed/compared\n        normalised = unique_file_in_tempdir(input_filepath=path, dir=self.tmp_dir)\n\n        # some custom code here per-module that writes to 'normalised'\n\n        yield normalised\n\n\n# this script should be run as a module like\n# python3 -m bleanser.modules.smscalls --glob ...\nif __name__ == \"__main__\":\n    Normaliser.main()\n```\n\nThis **always** acts on data loaded into memory or temporary files; it never modifies the input files themselves. Once it determines an input file can be pruned, it will warn you by default, and you can specify `--move` or `--remove` with the CLI (see below) to move or remove it.\n\nThere are particular normalisers for different filetypes, e.g. 
[`json`](./src/bleanser/core/modules/json.py), [`xml`](./src/bleanser/core/modules/xml.py), [`sqlite`](./src/bleanser/core/modules/sqlite.py) which might work if your data is especially basic, but typically this requires subclassing one of those and writing some custom code to 'cleanup' the data, so it can be properly compared/diffed.\n\n### normalise\n\nThere are two ways you can think about `normalise` (creating a 'cleaned'/normalised representation of an input file) -- by specifying an 'upper' or 'lower' bound:\n\n- upper: specify which data you want to drop, dumping everything else to `normalised`\n- lower: specify which keys/data you want to keep, e.g. only returning a few keys which uniquely identify events in the data\n\nAs an example say you had a JSON export:\n\n```json\n[\n  { \"id\": 5, \"images\": [{}], \"href\": \"...\" },\n  { \"id\": 6, \"images\": [{}], \"href\": \"...\" },\n  { \"id\": 7, \"images\": [{}], \"href\": \"...\" }\n]\n```\n\nWhen comparing this, you could possibly:\n\n1. Just write the `id` to the file. This is somewhat risky as you don't know if the `href` will always remain the same, so you may be losing data\n2. Write the `id` and the `href`, by specifying those two keys you're interested in\n3. Write the `id` and the `href`, by deleting the `images` key (this is different from 2!)\n\nThere is a trade-off to be made here. For especially noisy exports with lots of metadata that might change over time that you're not interested in, number 3 means every couple months you might have to check and add new keys to delete (as an example see [spotify](./src/bleanser/modules/spotify.py)). 
This could be seen as a positive as well, as it means when the schema for the API/data changes underneath you, you may notice it sooner.\n\nWith option 2, you are more likely to remove redundant data files if additional metadata fields are added, and if you only really care about the `id` and `href` and you don't think the export format will change often, this is fine.\n\nOption 3 is generally the safest, but also the most verbose/tedious: it makes sure you're not removing files that may possibly contain new fields you want to preserve/parse.\n\nIdeally you meet somewhere in the middle; it depends a lot on the specific export data you're comparing.\n\nAs it can be a bit difficult to follow, generally this is doing something like:\n\n- Decompress the file into a `cleaned` file if it's a known compressed format (`unpacked` in [`BaseNormaliser`](./src/bleanser/core/processor.py)), see [`kompress`](https://github.com/karlicoss/kompress/) for supported compression formats\n- Create a temporary file to write data to (`unique_file_in_tempdir` in [`BaseNormaliser`](./src/bleanser/core/processor.py))\n- Parse the `cleaned` file into Python objects (`JsonNormaliser`, `XmlNormaliser`, or something custom)\n- Let the user `cleanup` the data to remove noisy keys/data (specific modules, e.g. [spotify](./src/bleanser/modules/spotify.py))\n- Diff those against each other to find and/or remove files which don't contribute new data (module agnostic, run in `main`)\n\n### Subclassing\n\nFor example, the JSON normaliser calls a `cleanup` function before it starts processing the data. 
If you wanted to remove the `images` key like shown above, you could do so like:\n\n```python\nfrom bleanser.core.modules.json import JsonNormaliser, delkeys, Json\n\n\nclass Normaliser(JsonNormaliser):\n    # here, j is a dict, each file that this gets passed from the CLI call\n    # below is pre-processed by the cleanup function\n    def cleanup(self, j: Json) -\u003e Json:\n        delkeys(j, keys={\n            'images',\n        })\n\n        return j\n\n\nif __name__ == '__main__':\n    Normaliser.main()\n```\n\nFor common formats, the helper classes handle all the tedious bits like loading/parsing data, managing the temporary files. The `Normaliser.main` calls the CLI, which looks like this:\n\n```\n $ python3 -m bleanser.core.modules.json prune --help\nUsage: python -m bleanser.core.modules.json prune [OPTIONS] PATH\n\nOptions:\n  --glob                 Treat the path as glob (in the glob.glob sense)\n  --sort-by [size|name]  how to sort input files  [default: name]\n  --dry                  Do not prune the input files, just print what would happen after pruning.\n  --remove               Prune the input files by REMOVING them (be careful!)\n  --move PATH            Prune the input files by MOVING them to the specified path. A bit safer than --remove mode.\n  --yes                  Do not prompt before pruning files (useful for cron etc)\n  --threads INTEGER      Number of threads (processes) to use. Without the flag won't use any, with the flag will try\n                         using all available, can also take a specific value. 
Passed down to PoolExecutor.\n  --from INTEGER\n  --to INTEGER\n  --multiway             force \"multiway\" cleanup\n  --prune-dominated\n  --help                 Show this message and exit.\n```\n\nYou'd provide input paths/globs to this file, and possibly `--remove` or `--move /tmp/removed` to remove/move files.\n\nIf you're not able to subclass one of those, you might be able to subclass [extract](./src/bleanser/core/modules/extract.py), which lets you just yield any sort of stringifiable data, which is then used to diff/compare the input files. For example, if you only wanted to return the `id` and `href` in the JSON example above, you could just yield a tuple:\n\n```python\nimport json\nfrom pathlib import Path\nfrom typing import Iterator, Any\n\nfrom bleanser.core.modules.extract import ExtractObjectsNormaliser\n\n\nclass Normaliser(ExtractObjectsNormaliser):\n    def extract_objects(self, path: Path) -\u003e Iterator[Any]:\n        data = json.loads(path.read_text())\n        for blob in data:\n            yield (blob[\"id\"], blob[\"href\"])\n\n\nif __name__ == \"__main__\":\n    Normaliser.main()\n```\n\nOtherwise, if you have some complex data source you need to handle yourself, you can override the `do_normalise` and `unpacked` (how the data gets uncompressed/pre-processed) methods; see the handling of [discord zip files](https://github.com/purarue/bleanser/blob/master/src/bleanser_pura/modules/discord.py) as an example.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarlicoss%2Fbleanser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkarlicoss%2Fbleanser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkarlicoss%2Fbleanser/lists"}