https://github.com/purarue/git_doc_history
copy/track file history in git, with python bindings to traverse and extract history/files/lines at some date
https://github.com/purarue/git_doc_history
data git
Last synced: about 1 month ago
JSON representation
copy/track file history in git, with python bindings to traverse and extract history/files/lines at some date
- Host: GitHub
- URL: https://github.com/purarue/git_doc_history
- Owner: purarue
- License: mit
- Created: 2022-03-20T00:49:20.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-10-24T23:19:02.000Z (over 1 year ago)
- Last Synced: 2025-02-12T22:51:15.569Z (over 1 year ago)
- Topics: data, git
- Language: Python
- Homepage: https://pypi.org/project/git-doc-history/
- Size: 42 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# git_doc_history
Copy and track files in `git`, and a library to traverse the history
I use this to track my [`todo.txt`](https://github.com/todotxt/todo.txt-cli) files, changes to configuration files, any shell histories which don't support timestamps (see all of my config files [here](https://github.com/purarue/dotfiles/tree/master/.config/git_doc_history))
This copies the files to a different directory, so it doesn't interfere with the application/configuration
By copying those files to a separate directory, I can always roll back to previous file, or see what the file was like a couple days/months ago.
For shell histories/files which are unique lines of text (e.g., my `todo.txt` file) this also lets me estimate timestamps for when new lines were added to the history/text files, using the `iter_commit_snapshots` and `parse_snapshot_diffs` below, which emits added/removed events for individual lines with estimated times
This was mostly created for [HPI](https://github.com/purarue/HPI), so I don't have to rewrite the code to extract lines for git history over and over
This is a general purpose solution for tracking file history in `git` -- so its not extremely opinionated. In some cases it can be seen as a stop-gap solution, to have some file versioning in case you ever want to roll back. It may work particularly well for basic files with a couple dozen lines (e.g. I use it for RSS feeds, `todo.txt`, bookmarks, and a couple history files)
## Installation
Requires `python3.10+`
To install with pip, run:
```
pip install git_doc_history
```
## Usage
The main script to backup data is the bash script [`bin/git_doc_history`](bin/git_doc_history), which gets installed into your `~/.local/bin/` directory.
If uses a config file (parsed with [`python-dotenv`](https://github.com/theskumar/python-dotenv) -- so you can use bash-like syntax to grab environment variables) like:
```
SOURCE_DIR=~/.todo # copy from
BACKUP_DIR=~/data/todo_git_history # copy to
# multiple lines means multiple files
COPY_FILES="todo.txt
done.txt"
```
You can either provide the full path to that config file, or place the file in `~/.config/git_doc_history`
For example, after placing it at `~/.config/git_doc_history/todo` -- to copy/commit any changes, run:
```bash
$ git_doc_history todo
```
```
Generated configuration:
SOURCE_DIR: /home/username/data/todo
BACKUP_DIR: /home/username/data/todo_git_history
COPY_FILES: todo.txt
done.txt
'/home/username/data/todo/todo.txt' -> '/home/username/data/todo_git_history/todo.txt'
'/home/username/data/todo/done.txt' -> '/home/username/data/todo_git_history/done.txt'
'/home/username/data/todo/.gitignore' -> '/home/username/data/todo_git_history/.gitignore'
[master f927490] update
1 file changed, 1 insertion(+)
create mode 100644 .gitignore
```
That uses `python3 -m git_doc_history shell todo` to parse the configuration file, like:
```bash
eval "$(python3 -m git_doc_history shell todo)"
```
The python library comes with a small CLI interface to extract a file from some time ago:
```
$ python3 -m git_doc_history extract-file-at --at 2020-09-20 -c todo todo.txt -
setup command of completion
```
The `BACKUP_DIR` is of course just a regular git directory -- you can `reset --hard` to some point in the past to get rid of recent commits, `rebase`/`squash` to merge commits or do whatever you please
### Library Usage
Most things will be done with `git_doc_history.DocHistory`
This doesn't assume the filetype is readable text (you may be storing images/binary doc files in the git repository), so the default is to return the data as `bytes` -- you can `.decode("utf-8")` to convert that to readable text
To traverse the entire history:
```python
from git_doc_history import DocHistory
from git_doc_history.config import parse_config, resolve_config
# parse the config from the env file
doc = DocHistory.from_dict(parse_config(resolve_config("todo")))
# iterate through the history for the todo.txt file
for snapshot in doc.iter_commit_snapshots("todo.txt"):
print(str(snapshot.commit_sha))
print(str(snapshot.dt))
print(snapshot.data.decode("utf-8"))
```
#### Parsing Diffs
Iterates through the git history in chronological order, keeping track
of when data was added or removed. By default, this parses the `file`
given by splitting it into lines. If lines are added/removed, this returns an
event which specifies when in the history, and what was added/removed
Alternatively, can pass a `parse_func`, which is a function which
accepts the `DocHistorySnapshot`, and returns a list of hashable items
to store as state
For an example of parsing diffs, see [`examples/todotxt_diff.py`](examples/todotxt_diff.py):
Example output looks something like:
```
added 2022-03-08 12:14:45 (C) 2022-03-08 create shebang script +programming
removed 2022-03-08 13:14:58 (C) 2022-03-08 create shebang script +programming
added 2022-03-08 22:23:39 save formhistory.sqlite in browserexport
removed 2022-03-08 23:23:45 save formhistory.sqlite in browserexport
added 2022-03-09 02:49:58 (C) create a python fzf wrapper because apparently I can't find a good one
added 2022-03-10 16:24:24 (B) 2022-03-10 create plaintext playlist parser module +music
removed 2022-03-11 01:30:49 (B) 2022-03-10 create plaintext playlist parser module +music
added 2022-03-11 10:37:06 (C) 2022-03-11 sync tmux from home directory +programming
added 2022-03-12 03:44:24 install undotree +vim +programming
removed 2022-03-12 04:44:51 (C) 2022-03-11 sync tmux from home directory +programming
removed 2022-03-12 10:51:20 install undotree +vim +programming
```
In this case, 'removed' would mean I either changed the text on the line, or (more likely) I completed it
### Tests
```bash
git clone 'https://github.com/purarue/git_doc_history'
cd ./git_doc_history
pip install '.[testing]'
pytest
flake8 ./git_doc_history
mypy ./git_doc_history
```