{"id":20456554,"url":"https://github.com/krassowski/data-vault","last_synced_at":"2025-04-13T04:05:42.052Z","repository":{"id":36469936,"uuid":"226589892","full_name":"krassowski/data-vault","owner":"krassowski","description":"IPython magic for simple, organized, compressed and encrypted: storage \u0026 transfer of files between notebooks.","archived":false,"fork":false,"pushed_at":"2022-12-08T15:48:22.000Z","size":83,"stargazers_count":12,"open_issues_count":11,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-13T04:05:20.815Z","etag":null,"topics":["data-storage","disk-cache","ipython-magic","jupyter-notebook","persistent-storage"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krassowski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-07T23:35:35.000Z","updated_at":"2024-12-26T18:52:46.000Z","dependencies_parsed_at":"2023-01-17T01:47:18.793Z","dependency_job_id":null,"html_url":"https://github.com/krassowski/data-vault","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fdata-vault","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fdata-vault/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fdata-vault/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krassowski%2Fdata-vault/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krassowski","downl
oad_url":"https://codeload.github.com/krassowski/data-vault/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248661707,"owners_count":21141450,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-storage","disk-cache","ipython-magic","jupyter-notebook","persistent-storage"],"created_at":"2024-11-15T11:23:04.007Z","updated_at":"2025-04-13T04:05:42.022Z","avatar_url":"https://github.com/krassowski.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IPython data-vault\n[![tests](https://github.com/krassowski/data-vault/workflows/tests/badge.svg)](https://github.com/krassowski/data-vault/actions/workflows/tests.yml)\n![CodeQL](https://github.com/krassowski/data-vault/workflows/CodeQL/badge.svg)\n[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/krassowski/data-vault/master?filepath=Example.ipynb)\n[![DOI](https://zenodo.org/badge/226589892.svg)](https://zenodo.org/badge/latestdoi/226589892)\n\nIPython magic for simple, organized, compressed and encrypted storage \u0026 transfer of files between notebooks.\n\n## Background and demo\n\n### Right tool for a simple job\n\nThe `%vault` magic provides a reproducible caching mechanism for variables exchange between notebooks.\nThe cache is compressed, persistent and safe.\n\nDifferently to the builtin `%store` magic, the variables are stored in plain sight,\nin a zipped archive, so that 
they can be easily accessed for manual inspection,\nor for the use by other tools.\n\n### Demonstration by usage:\n\nLet's open the vault (it will be created if not here yet):\n\n```python\n%open_vault -p data/storage.zip\n```\n\nGenerate some dummy dataset:\n```python\nfrom pandas import DataFrame\nfrom random import choice, randint\ncities = ['London', 'Delhi', 'Tokyo', 'Lagos', 'Warsaw', 'Chongqing']\nsalaries = DataFrame([\n    {'salary': randint(0, 100), 'city': choice(cities)}\n    for i in range(10000)\n])\n```\n\n#### Store variable in a module\n\nAnd store it in the vault:\n\n```python\n%vault store salaries in datasets\n```\n\n\u003e Stored salaries (None → 40CA7812) at Sunday, 08. Dec 2019 11:58\n\nA short description is printed out (including a CRC32 hashsum and a timestamp) by default, but can be disabled by passing `--timestamp False` to `%open_vault` magic.\nEven more information enhancing the reproducibility is [stored in the cell metadata](#metadata-for-storage-operations).\n\n#### Import variable from a module\n\nWe can now load the stored DataFrame in another (or the same) notebook:\n\n```python\n%vault import salaries from datasets\n```\n\n\u003e Imported salaries (40CA7812) at Sunday, 08. 
Dec 2019 12:02\n\nThanks to (optional) [memory optimizations](#memory-optimizations) we saved some RAM (87% as compared to unoptimized `pd.read_csv()` result).\nTo track how many MB were saved use `--report_memory_gain` setting which will display memory optimization results below imports, for example:\n\n\u003e Reduced memory usage by 87.28%, from 0.79 MB to 0.10 MB.\n\n#### Import variable as something else\n\nIf we already have the salaries variable, we can use `as`, just like in the Python import system.\n```python\n%vault import salaries from datasets as salaries_dataset\n```\n\n#### Store or import with a custom function\n\n```python\nfrom pandas import read_csv\nto_csv = lambda df: df.to_csv()\n%vault store salaries in datasets with to_csv as salaries_csv\n%vault import salaries_csv from datasets with read_csv\n```\n\n#### Import an arbitrary file\n\n```python\nfrom pandas import read_excel\n%vault import 'cars.xlsx' as cars_dataset with read_excel\n```\n\nMore examples are available in the [Examples.ipynb](https://github.com/krassowski/data-vault/blob/master/Example.ipynb) notebook, which can be [run interactively in the browser](https://mybinder.org/v2/gh/krassowski/data-vault/master?filepath=Example.ipynb).\n\n### Goals\n\nSyntax:\n- easy to understand in plain language (avoid abbreviations when possible),\n- while intuitive for Python developers,\n- ...but sufficiently different so that it would not be mistaken with Python constructs\n   - for example, we could have `%from x import y`, but this looks very like normal Python;\n     having `%vault from x import y` makes it sufficiently easy to distinguish\n- star imports are better avoided, thus not supported\n- as imports may be confusing if there is more than one\n\nReproducibility:\n- promote good reproducible and traceable organization of files:\n   - promote storage in plain text files and the use of DataFrame\n      \u003e pickling is often an easy solution, but it can cause hurtful problems in 
prototyping phase (which is what notebooks are often used for): if you pickle your objects, then change the class definition and attempt to load your data again, you are likely to fail severely; this is why plain text files are the default option in this package (but pickling is supported too!).\n   - print out a short hashsum and human-readable datetime (always in UTC),\n   - while providing even more details in cell metadata\n- allow tracing instances of the code being modified post execution\n\nSecurity:\n\n* think of it as a tool to minimize the damage in case of accidental `git add` of data files (even if those should have been elsewhere and `.gitignore`d in the first place),\n* or, as an additional layer of security for already anonymized data,\n* but this tool is **not** aimed at facilitating the storage of highly sensitive data\n* you have to set a password, or explicitly set `--secure False` to get rid of a security warning\n\n## Features overview\n\n### Metadata for storage operations\n\nEach operation will print out the timestamp and the CRC32 short checksum of the files involved.\nThe timestamp of the operation is reported in the UTC timezone in a human-readable format.\n\nThis can be disabled by setting `-t False` or `--timestamp False`; however, for the sake of reproducibility\nit is encouraged to keep this information visible in the notebook.\n\nMore precise information including the SHA256 checksum (with a lower probability of collisions),\nand a full timestamp (to detect potential race condition errors in file write operations) are\nembedded in the metadata of the cell. You can disable this by setting `--metadata False`.\n\nThe exact command line is also stored in the metadata, so that if you accidentally modify the code cell\nwithout re-running the code, the change can be tracked down.\n\n### Storage\n\nIn order to enforce interoperability, plain text files are used for pandas DataFrame and Series objects.\nOther variables are stored as pickle objects. The location of the storage archive on the disk defaults\nto `storage.zip` in the current directory, and can be changed using the `%open_vault` magic:\n\n```python\n%open_vault -p custom_storage.zip\n```\n\n#### Encryption\n\n\u003e **The encryption is not intended as a high security mechanism,\nbut only as an additional layer of protection for already anonymized data.**\n\nThe password to encrypt the storage archive is retrieved from an environment variable,\nusing a name provided in `encryption_variable` during the setup.\n\n```python\n%open_vault -e ENV_STORAGE_KEY\n```\n\n### Memory optimizations\n\nPandas DataFrames are by default memory-optimized by conversion of string variables to (ordered) categorical\ncolumns (the pandas equivalent of R's factors/levels). Each string column will be tested for the memory improvement\nand the optimization will be applied only if it does reduce the memory usage.\n\n\n### Why ZIP and not HDF?\n\nThe storage archive is conceptually similar to a Hierarchical Data Format (e.g. HDF5) object - it contains:\n  - a hierarchy of files, and\n  - metadata files\n\nI believe that HDF may be the future, but this future is not here yet - numerous issues with the packages handling\nthe HDF files, as well as low performance and compression rate prompted me to stay with a simple zip format for now.\n\nZIP is a popular file format with known features and limitations - files can be password encrypted, while the file\nlist is always accessible. This is okay given that the code of the project is assumed to be public, and only the\nfiles in the storage area are assumed to be encrypted, increasing the security in case of unauthorized access.\n\nAs the limitations of the ZIP encryption are assumed to be common knowledge, I hope that managing expectations\nof the level of security offered by this package will be easier.\n\n## Installation and requirements\n\nPre-requirements:\n- Python 3.6+\n- 7zip (16.02+) (see [below](#installing-7-zip) for Ubuntu and Mac commands)\n\n### Installation:\n\n```bash\npip3 install data_vault\n```\n\n### Installing 7-zip\n\nYou can use p7zip packages from the default repositories:\n\n#### Ubuntu\n\n```bash\nsudo apt-get install -y p7zip-full\n```\n\n#### Mac\n```bash\nbrew install p7zip\n```\n\n#### Windows\n\n\n\u003cs\u003eInstallers for Windows can be downloaded from the [7-zip website](https://www.7-zip.org/download.html).\u003c/s\u003e\n\nWindows is not supported as it has known issues.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrassowski%2Fdata-vault","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrassowski%2Fdata-vault","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrassowski%2Fdata-vault/lists"}