https://github.com/xandkar/arkheia
Experiments with document archival and analysis (WIP)
https://github.com/xandkar/arkheia
Last synced: 7 months ago
JSON representation
Experiments with document archival and analysis (WIP)
- Host: GitHub
- URL: https://github.com/xandkar/arkheia
- Owner: xandkar
- Created: 2012-11-20T18:17:31.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2013-12-28T18:38:07.000Z (almost 12 years ago)
- Last Synced: 2024-10-19T03:06:25.887Z (about 1 year ago)
- Language: OCaml
- Homepage:
- Size: 340 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Αρχεια
======
Experiments with document archival and analysis.
Design ideas
------------
### Import
On import, a document is classified and parsed into meta data tags by a parser
for the determined type (such as: email message, research paper, photo, etc).
### Analysis
From meta data analysis, new entities can be discovered, such as people, and
links established between entities (a relationship graph). For example: I have
an email from, and a few photos of, John Smith, and he is also an author of
several papers (on certain topics, that he is now linked to) and a recipient of
some other emails from me directly or by way of CC (which implies either direct
or proxy relationship), he is also an author of emails to a set of mailing
lists and he usually replies to certain topics, etc. The rabbit hole is nearly
infinite.
### Storage
Component interfaces should be abstracted in such a way that initial
implementation can use SQLite for everything, and later move to an
eventually-consistent k/v store, like Riak.
Meta data makes the most sense as a relational table. Can we represent it as
k/v pairs-only (without relying on something like LevelDB 2i), and still
provide sane, complex querying?
How to represent/store a relationship graph?
doc-id = hash of document
3 collections:
#### Object storage
| Key | Value |
|--------|----------|
| doc-id | document |
#### Meta data
| Document ID | Tag | Value |
|-------------|----------|-----------|
| doc-id | tag-name | tag-value |
#### Index
Suffix tree. But how to store?
Abstract in such a way that initial implementation can use SQLite and later can
be swapped with an implementation of on-disk suffix tree.