{"id":19382593,"url":"https://github.com/neuro-ml/tarn","last_synced_at":"2025-04-23T20:32:23.675Z","repository":{"id":39798592,"uuid":"470340743","full_name":"neuro-ml/tarn","owner":"neuro-ml","description":"An insanely customizable framework for key-value storage 💾","archived":false,"fork":false,"pushed_at":"2025-03-29T15:53:20.000Z","size":351,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-29T16:30:57.218Z","etag":null,"topics":["cache","datalake","memoization","persistent","python","storage"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neuro-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-15T21:43:05.000Z","updated_at":"2025-02-12T07:15:52.000Z","dependencies_parsed_at":"2023-02-19T16:01:10.821Z","dependency_job_id":"b8154537-bfa5-4f93-86d0-7d70f91d014c","html_url":"https://github.com/neuro-ml/tarn","commit_stats":{"total_commits":60,"total_committers":1,"mean_commits":60.0,"dds":0.0,"last_synced_commit":"6b13449b461bafa144428d76854fe47c73aef6f8"},"previous_names":[],"tags_count":41,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuro-ml%2Ftarn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuro-ml%2Ftarn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuro-ml%2Ftarn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/neuro-ml%2Ftarn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neuro-ml","download_url":"https://codeload.github.com/neuro-ml/tarn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250509827,"owners_count":21442507,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cache","datalake","memoization","persistent","python","storage"],"created_at":"2024-11-10T09:22:21.882Z","updated_at":"2025-04-23T20:32:22.205Z","avatar_url":"https://github.com/neuro-ml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![codecov](https://codecov.io/gh/neuro-ml/tarn/branch/master/graph/badge.svg)](https://codecov.io/gh/neuro-ml/tarn)\n[![pypi](https://img.shields.io/pypi/v/tarn?logo=pypi\u0026label=PyPi)](https://pypi.org/project/tarn/)\n![License](https://img.shields.io/github/license/neuro-ml/tarn)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/tarn)](https://pypi.org/project/tarn/)\n\nA generic framework for key-value storage\n\n# Install\n\n```shell\npip install tarn\n```\n\n# Recipes\n\n## A simple datalake\n\nLet's start small and create a simple disk-based datalake. 
It will store various files, and the keys will be their\n[sha256](https://en.wikipedia.org/wiki/SHA-2) digests:\n\n```python\nfrom tarn import HashKeyStorage\n\nstorage = HashKeyStorage('/path/to/some/folder')\n# here `key` is the sha256 digest\nkey = storage.write('/path/to/some/file.png')\n# now we can use the key to read the file at a later time\nwith storage.read(key) as value:\n    # this will output something like Path('/path/to/some/folder/a0/ff9ae8987..')\n    print(value.resolve())\n\n# you can also store values directly from memory\n# - either byte strings\nkey = storage.write(b'my-bytes')\n# - or file-like objects\n#  in this example we stream data from a URL directly to the datalake\n#  (`stream=True` keeps the body unread, so `.raw` still has data to read)\nimport requests\n\nkey = storage.write(requests.get('https://example.com', stream=True).raw)\n```\n\n## Smart cache to disk\n\nA really cool feature of `tarn` is [memoization](https://en.wikipedia.org/wiki/Memoization) with automatic invalidation:\n\n```python\nfrom tarn import smart_cache\n\n\n@smart_cache('/path/to/storage')\ndef my_expensive_function(x):\n    y = x ** 2\n    return my_other_function(x, y)\n\n\ndef my_other_function(x, y):\n    ...\n    z = x * y\n    return x + y + z\n```\n\nNow the calls to `my_expensive_function` will be automatically cached to disk.\n\nBut that's not all! Let's assume that `my_expensive_function` and `my_other_function` are prone to change,\nand we would like to invalidate the cache when they do. Just annotate these functions with a decorator:\n\n```python\nfrom tarn import smart_cache, mark_unstable\n\n\n@smart_cache('/path/to/storage')\n@mark_unstable\ndef my_expensive_function(x):\n    ...\n\n\n@mark_unstable\ndef my_other_function(x, y):\n    ...\n```\n\nNow any change to these functions will cause the cache to invalidate itself!\n\n## Other storage locations\n\nWe support multiple storage locations out of the box.\n\nDidn't find the location you were looking for? 
Create an [issue](https://github.com/neuro-ml/tarn/issues).\n\n### S3\n\n```python\nfrom tarn import HashKeyStorage, S3\n\nstorage = HashKeyStorage(S3('my-storage-url', 'my-bucket'))\n```\n\n### Redis\n\nIf your files are small and you want fast in-memory storage, [Redis](https://redis.io/) is a great option:\n\n```python\nfrom tarn import HashKeyStorage, RedisLocation\n\nstorage = HashKeyStorage(RedisLocation('localhost'))\n```\n\n### SFTP\n\n```python\nfrom tarn import HashKeyStorage, SFTP\n\nstorage = HashKeyStorage(SFTP('myserver', '/path/to/root/folder'))\n```\n\n### SCP\n\n```python\nfrom tarn import HashKeyStorage, SCP\n\nstorage = HashKeyStorage(SCP('myserver', '/path/to/root/folder'))\n```\n\n### Nginx\n\nNginx has an [autoindex](https://nginx.org/en/docs/http/ngx_http_autoindex_module.html#autoindex_format) option that\nallows it to serve files and list directory contents. This is useful when you want to access files over HTTP/HTTPS:\n\n```python\nfrom tarn import HashKeyStorage, Nginx\n\nstorage = HashKeyStorage(Nginx('https://example.com/storage'))\n```\n\n## Advanced\n\nHere we'll show more specific (but useful!) 
use cases.\n\n### Fanout\n\nYou might have several HDDs and want to spread your datalake across them without creating a RAID array:\n\n```python\nfrom tarn import HashKeyStorage, Fanout\n\nstorage = HashKeyStorage(Fanout(\n    '/mount/hdd1/lake',\n    '/mount/hdd2/lake',\n))\n```\n\nNow both disks are used, and we'll start writing to `/mount/hdd2/lake` after `/mount/hdd1/lake` becomes full.\n\nYou can even use other types of locations:\n\n```python\nfrom tarn import HashKeyStorage, Fanout, S3\n\nstorage = HashKeyStorage(Fanout(S3('server1', 'bucket1'), S3('server2', 'bucket2')))\n```\n\nOr mix and match them as you please:\n\n```python\nfrom tarn import HashKeyStorage, Fanout, S3\n\n# write to s3, then start writing to HDD1 after s3 becomes full\nstorage = HashKeyStorage(Fanout(S3('server2', 'bucket2'), '/mount/hdd1/lake'))\n```\n\n### Lazy migration\n\nLet's say you want to seamlessly replicate an old storage to a new location, but copy only the needed files first:\n\n```python\nfrom tarn import HashKeyStorage, Levels\n\nstorage = HashKeyStorage(Levels(\n    '/mount/new-hdd/lake',\n    '/mount/old-hdd/lake',\n))\n```\n\nThis will create something like a [cache hierarchy](https://en.wikipedia.org/wiki/Cache_hierarchy) with copy-on-read\nbehaviour. 
Each time we read a key, if we don't find it in `/mount/new-hdd/lake`, we read it from `/mount/old-hdd/lake`\nand save a copy to `/mount/new-hdd/lake`.\n\n### Cache levels\n\nThe same [cache hierarchy](https://en.wikipedia.org/wiki/Cache_hierarchy) logic can be used if you have a combination of\nHDDs and SSDs, which can seriously speed up reads:\n\n```python\nfrom tarn import HashKeyStorage, Levels, Level\n\nstorage = HashKeyStorage(Levels(\n    Level('/mount/fast-ssd/lake', write=False),\n    Level('/mount/slow-hdd/lake', write=False),\n    '/mount/slower-nfs/lake',\n))\n```\n\nThe setup above is similar to the one we use in our lab:\n\n- we have a slow but _huge_ NFS-mounted storage\n- a faster but smaller HDD\n- and a super fast but even smaller SSD\n\nNow, we only write to the NFS storage, but the data gets lazily replicated to the local HDD and SSD to speed up the\nreads.\n\n### Caching small files to Redis\n\nWe can take this approach even further and use ultra-fast in-memory storage, such as Redis:\n\n```python\nfrom tarn import HashKeyStorage, Levels, Small, RedisLocation\n\nstorage = HashKeyStorage(Levels(\n    # max file size = 100KiB\n    Small(RedisLocation('my-host'), max_size=100 * 1024),\n    '/mount/hdd/lake',\n))\n```\n\nHere we use `Small`, a wrapper that only allows small (\u003c=100KiB in this case) files to be written to it.\nIn our experiments we observed a 10x speedup for reading small files.\n\n## Composability\n\nBecause all the locations implement the same interface, you can start creating more complex storage logic specifically\ntailored to your needs. 
You can make setups as crazy as you want!\n\n```python\nfrom tarn import HashKeyStorage, Levels, Fanout, RedisLocation, Small, S3, SFTP\n\nstorage = HashKeyStorage(Levels(\n    Small(RedisLocation('my-host'), max_size=10 * 1024 ** 2),\n    '/mount/fast-ssd/lake',\n\n    Fanout(\n        '/mount/hdd1/lake',\n        '/mount/hdd2/lake',\n        '/mount/hdd3/lake',\n\n        # nested locations are not a problem!\n        Levels(\n            # apparently we want mirrored locations here\n            '/mount/hdd3/lake',\n            '/mount/old-hdd/lake',\n        ),\n    ),\n\n    '/mount/slower-nfs/lake',\n\n    S3('my-s3-host', 'my-bucket'),\n\n    # pull missing files over sftp when needed\n    SFTP('remote-host', '/path/to/remote/folder'),\n))\n```\n\n# Acknowledgements\n\nSome parts of our cache invalidation machinery were heavily inspired by\nthe [cloudpickle](https://github.com/cloudpipe/cloudpickle) project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuro-ml%2Ftarn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneuro-ml%2Ftarn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuro-ml%2Ftarn/lists"}