{"id":21587760,"url":"https://github.com/heikomuller/histore","last_synced_at":"2025-04-10T20:22:34.081Z","repository":{"id":52464531,"uuid":"73864747","full_name":"heikomuller/histore","owner":"heikomuller","description":"Library for maintaining snapshots of evolving tabular data sets","archived":false,"fork":false,"pushed_at":"2021-04-28T11:17:52.000Z","size":7127,"stargazers_count":2,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T18:02:35.189Z","etag":null,"topics":["data","version-control"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/heikomuller.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-11-15T23:31:15.000Z","updated_at":"2021-08-08T20:46:54.000Z","dependencies_parsed_at":"2022-09-01T13:20:26.884Z","dependency_job_id":null,"html_url":"https://github.com/heikomuller/histore","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heikomuller%2Fhistore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heikomuller%2Fhistore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heikomuller%2Fhistore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heikomuller%2Fhistore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/heikomuller","download_url":"https://codeload.github.com/heikomuller/histore/tar.gz/refs/heads/master","host":{"name":"G
itHub","url":"https://github.com","kind":"github","repositories_count":247909136,"owners_count":21016475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","version-control"],"created_at":"2024-11-24T15:17:39.171Z","updated_at":"2025-04-10T20:22:34.063Z","avatar_url":"https://github.com/heikomuller.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. image:: https://img.shields.io/pypi/pyversions/histore.svg\n    :target: https://pypi.org/pypi/histore\n\n.. image:: https://badge.fury.io/py/histore.svg\n    :target: https://badge.fury.io/py/histore\n\n.. image:: https://github.com/heikomuller/histore/workflows/build/badge.svg\n    :target: https://github.com/heikomuller/histore/actions?query=workflow%3A%22build%22\n\n.. image:: https://codecov.io/gh/heikomuller/histore/branch/master/graph/badge.svg\n    :target: https://codecov.io/gh/heikomuller/histore\n\n.. image:: https://img.shields.io/badge/License-BSD-green.svg\n    :target: https://github.com/heikomuller/histore/blob/master/LICENSE\n\n.. figure:: https://raw.githubusercontent.com/heikomuller/histore/master/docs/graphics/logo.png\n   :align: center\n   :alt: History Store\n\n\n\n**History Store** (HISTORE) is a Python package for maintaining snapshots of evolving data sets. This package reimplements the core functionality of the `XML Archiver (XArch) \u003chttp://xarch.sourceforge.net/\u003e`_. 
The package is a lightweight implementation that is intended for maintaining data set snapshots that are represented as pandas data frames.\n\n**HISTORE** is based on a nested merge approach that efficiently stores multiple data set snapshots in a compact archive `[Buneman, Khanna, Tajima, Tan. 2004] \u003chttps://dl.acm.org/citation.cfm?id=974752\u003e`_. The library allows one to create new archives, to merge new data set snapshots into an existing archive, and to retrieve data set snapshots from the archive.\n\n\nInstallation\n============\n\nInstall ``histore`` from the `Python Package Index (PyPI) \u003chttps://pypi.org/\u003e`_ using ``pip`` with:\n\n.. code-block:: bash\n\n  pip install histore\n\n\nExamples\n========\n\n**HISTORE** maintains data set versions (snapshots) in an archive. A separate archive is created for each data set. The package currently provides two different types of archive: a volatile archive that maintains all data set snapshots in main memory and a persistent archive that writes data set snapshots to disk.\n\n\nExample using Volatile Archive\n------------------------------\n\nStart by creating a new archive. For each archive, an optional primary key (list of column names) can be specified. If a primary key is given, the values in the key attributes are used as row keys when data set snapshots are merged into the archive. If no primary key is specified, the row index of the data frame is used to match rows during the merge phase.\n\nFor archives that have a primary key, the initial data set snapshot (or at least the data set schema) needs to be given when creating the archive.\n\n.. 
code-block:: python\n\n   # Create a new archive that merges snapshots\n   # based on a primary key attribute\n\n   import histore as hs\n   import pandas as pd\n\n   # First version\n   df = pd.DataFrame(\n       data=[['Alice', 32], ['Bob', 45], ['Claire', 27], ['Dave', 23]],\n       columns=['Name', 'Age']\n   )\n   archive = hs.Archive(doc=df, primary_key='Name', descriptor=hs.Descriptor('First snapshot'))\n\n\nMerge a second data set version into the archive:\n\n.. code-block:: python\n\n   # Second version: Change age for Alice and Bob\n   df = pd.DataFrame(\n       data=[['Alice', 33], ['Bob', 44], ['Claire', 27], ['Dave', 23]],\n       columns=['Name', 'Age']\n   )\n   archive.commit(df, descriptor=hs.Descriptor('Alice is 33 and Bob 44'))\n\n\nList information about all snapshots in the archive. This also shows how to use the checkout method to retrieve a particular data set version:\n\n.. code-block:: python\n\n   # Print all data frame versions\n   for s in archive.snapshots():\n       df = archive.checkout(s.version)\n       print('({}) {}\\n'.format(s.version, s.description))\n       print(df)\n       print()\n\nThe result should look like this:\n\n.. code-block:: console\n\n   (0) First snapshot\n\n        Name  Age\n   0   Alice   32\n   1     Bob   45\n   2  Claire   27\n   3    Dave   23\n\n   (1) Alice is 33 and Bob 44\n\n        Name  Age\n   0   Alice   33\n   1     Bob   44\n   2  Claire   27\n   3    Dave   23\n\n\nExample using Persistent Archive\n--------------------------------\n\nTo create a persistent archive that maintains all data on disk, use the ``PersistentArchive`` class:\n\n.. 
code-block:: python\n\n   archive = hs.PersistentArchive(basedir='path/to/archive/dir', create=True, doc=df, primary_key=['Name'])\n\nThe persistent archive maintains the data set snapshots in two files that are created in the directory that is given as the ``basedir`` argument.\n\nFor more examples see the notebooks in the `examples folder \u003chttps://github.com/heikomuller/histore/tree/pandas/examples\u003e`_.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheikomuller%2Fhistore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheikomuller%2Fhistore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheikomuller%2Fhistore/lists"}