https://github.com/internetarchive/trough

Trough: Big data, small databases.

Host: GitHub
URL: https://github.com/internetarchive/trough
Owner: internetarchive
License: bsd-2-clause
Created: 2016-12-16T01:38:52.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2024-07-25T13:47:49.000Z (over 1 year ago)
Last Synced: 2025-07-12T01:33:01.423Z (5 months ago)
Topics: database, python, python3, sqlite
Language: Python
Homepage:
Size: 740 KB
Stars: 42
Watchers: 14
Forks: 7
Open Issues: 9
Metadata Files:
- Readme: README.rst
- License: LICENSE

README

.. image:: https://travis-ci.org/internetarchive/trough.svg?branch=master
    :target: https://travis-ci.org/internetarchive/trough

=======
Trough
=======

Big data, small databases.
==========================

Big data is really just lots and lots of little data.
If you split a large dataset into lots of small SQL databases sharded on a well-chosen key,
they can work in concert to create a database system that can query very large datasets.
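
The idea can be sketched with nothing but Python's standard ``sqlite3`` module. This is an illustration of the sharding pattern, not Trough's own API: the ``captures`` table, the one-file-per-key naming, and the helper names below are all assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def shard_path(base_dir, shard_key):
    """Return the path of the small database that owns this sharding key."""
    # Assumption: keys are simple strings that are safe to use as file names.
    return os.path.join(base_dir, f"{shard_key}.sqlite")


def insert_row(base_dir, shard_key, url, size):
    """Write one row into the shard selected by the sharding key."""
    conn = sqlite3.connect(shard_path(base_dir, shard_key))
    with conn:  # commits the transaction on success
        conn.execute("CREATE TABLE IF NOT EXISTS captures (url TEXT, size INTEGER)")
        conn.execute("INSERT INTO captures VALUES (?, ?)", (url, size))
    conn.close()


def total_size(base_dir):
    """Run the same query on every shard and combine the partial results."""
    total = 0
    for name in sorted(os.listdir(base_dir)):
        if not name.endswith(".sqlite"):
            continue
        conn = sqlite3.connect(os.path.join(base_dir, name))
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(size), 0) FROM captures"
        ).fetchone()
        conn.close()
        total += partial
    return total


def demo():
    """Build two shards in a temporary directory and query across both."""
    with tempfile.TemporaryDirectory() as base_dir:
        insert_row(base_dir, "example.org", "http://example.org/", 100)
        insert_row(base_dir, "example.org", "http://example.org/a", 50)
        insert_row(base_dir, "example.com", "http://example.com/", 25)
        return sorted(os.listdir(base_dir)), total_size(base_dir)
```

Each shard stays small enough to inspect on its own, while the fan-out loop in ``total_size`` is what lets the shards answer a question about the whole dataset.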

Worst-case Performance is *important*
=====================================

A key insight when working with large datasets is that the performance of monolithic big data tools
is largely tied to having the full dataset completely loaded and working in a
production-quality cluster.

Trough is designed to have very predictable performance characteristics: simply determine your sharding key,
determine your largest shard, load it into a sqlite database locally, and you already know your worst-case
performance scenario.
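
As a rough sketch of that procedure, the snippet below builds a stand-in for the largest shard in a local sqlite file and times a query against it. The synthetic ``captures`` rows, the file name, and the helper names are assumptions for illustration; in practice you would load your real largest shard and run your real queries.

```python
import os
import sqlite3
import tempfile
import time


def build_shard(path, num_rows):
    """Create a local sqlite file filled with synthetic rows (a stand-in shard)."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE captures (url TEXT, size INTEGER)")
        conn.executemany(
            "INSERT INTO captures VALUES (?, ?)",
            ((f"http://example.org/{i}", i % 1000) for i in range(num_rows)),
        )
    conn.close()


def time_query(conn, sql, params=()):
    """Run a query and return its rows together with the elapsed seconds."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


def demo(num_rows=10000):
    """Time one aggregate query against the stand-in largest shard."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "largest-shard.sqlite")
        build_shard(path, num_rows)
        conn = sqlite3.connect(path)
        rows, elapsed = time_query(conn, "SELECT COUNT(*), MAX(size) FROM captures")
        conn.close()
        return rows, elapsed
```

The elapsed time measured this way on the largest shard gives an estimate of the slowest single-shard query, without needing a production cluster.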

Designed to leverage storage, not RAM
=====================================

Rather than having huge CPU and memory requirements to deliver performant queries over large datasets,
Trough relies on flat sqlite files, which are easily distributed to a cluster and queried against.
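
Because a shard is a single ordinary file, moving it to another machine is just a file copy. The sketch below imitates that with a local directory standing in for another node's disk, then opens the copy read-only through SQLite's URI filename support. It does not show how Trough itself distributes files; the directory layout and helper names are assumptions.

```python
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path


def distribute(src_path, node_dir):
    """Copy a shard file into a directory standing in for another node's disk."""
    os.makedirs(node_dir, exist_ok=True)
    return shutil.copy(src_path, node_dir)  # returns the path of the new copy


def query_read_only(path, sql):
    """Open a shard file without write access and run one query against it."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def demo():
    """Create a shard, copy it to a second directory, and query the copy."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "shard.sqlite")
        conn = sqlite3.connect(src)
        with conn:
            conn.execute("CREATE TABLE captures (url TEXT, size INTEGER)")
            conn.execute("INSERT INTO captures VALUES ('http://example.org/', 100)")
        conn.close()
        copy_path = distribute(src, os.path.join(root, "node-1"))
        return query_read_only(copy_path, "SELECT url, size FROM captures")
```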

Reliable parts, reliable whole
==============================

Each piece of technology in the stack was carefully selected and load tested to ensure that your data stays
reliably up and reliably queryable. The code is small enough for one programmer to audit.

Ease of installation
====================

One of the worst parts of setting up a big data system is generally choosing sensible defaults and
deploying it to staging and production environments. Trough has been designed to require as little
configuration as possible.

An example Ansible deployment specification has been removed from the trough
repo but can still be found at https://github.com/internetarchive/trough/tree/cc32d3771a7/ansible.
It is designed for a cluster of Ubuntu 16.04 Xenial nodes.
