=====
timak
=====

timak is a Python library for storing timelines (activity streams) in Riak. It is very alpha and rough around the edges.

It is loosely based on my understanding of Yammer's `Streamie`_.

Example
-------

Timelines are unique sets of objects (unique by the ID you provide) ordered by a datetime (which you also provide). They are bounded, so items fall off the end when a user-defined capacity is reached.

>>> from datetime import datetime
>>> import riak
>>> from timak.timelines import Timeline

>>> conn = riak.RiakClient()

>>> tl = Timeline(connection=conn, max_items=3)

>>> # tl.add(key, unique_id, score)
>>> tl.add("brett:tweets", 1, datetime(2011, 1, 1))
[1]
>>> tl.add("brett:tweets", 2, datetime(2011, 1, 2))
[2, 1]
>>> tl.add("brett:tweets", 3, datetime(2011, 1, 3))
[3, 2, 1]
>>> tl.add("brett:tweets", 4, datetime(2011, 1, 4))
[4, 3, 2]
>>> tl.delete("brett:tweets", 2, datetime(2011, 1, 2))
[4, 3]

If you provide a ``datetime.datetime`` value as the score, Timak will automatically convert it to a sortable score value.
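
For instance, seconds since the epoch already sort chronologically. A minimal sketch of one such conversion (``to_score`` is a hypothetical helper; timak's actual internal encoding may differ):

>>> from calendar import timegm
>>> def to_score(dt):
...     # Seconds since the epoch grow monotonically with time.
...     return timegm(dt.utctimetuple())
>>> to_score(datetime(2011, 1, 2)) > to_score(datetime(2011, 1, 1))
True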

As the example above shows, the default order is descending by the date you provide, and object IDs are returned by default. You can also pass an ``obj_data`` argument (which must be JSON-serializable); it will be returned instead of the ID.

>>> tl.add("brett:tweets", 5, datetime(2011, 1, 5), obj_data={'body': 'Hello world, this is my first tweet'})
[{'body': 'Hello world, this is my first tweet'}, 4, 3]

Why?
----

I needed *highly available*, *linearly scalable* timelines where readers and writers *don't block* one another. Because Riak is a Dynamo-based system, multiple writers can update a single value and I can merge the conflicts on a later read. I can also add a machine to the cluster for more throughput, and since it's simply fetching denormalized timelines by key it should be extremely fast.
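
To make the merge-on-read idea concrete: reconciling conflicting copies of a timeline is essentially a union by object ID followed by a re-sort and truncation. This is only a sketch of the general idea, not timak's actual merge code, with each copy represented as an ``{obj_id: score}`` dict:

>>> def merge(siblings, max_items=3):
...     # Union the conflicting copies, keeping the highest score per ID,
...     # then re-sort descending and enforce the capacity bound.
...     merged = {}
...     for sibling in siblings:
...         for obj_id, score in sibling.items():
...             merged[obj_id] = max(score, merged.get(obj_id, score))
...     return sorted(merged, key=merged.get, reverse=True)[:max_items]
>>> merge([{1: 101, 2: 102}, {2: 102, 3: 103, 4: 104}])
[4, 3, 2]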

So what? I could write this in...
---------------------------------

PostgreSQL or MySQL
```````````````````

This would be a very simple table in an RDBMS. It could even be boundless (though without some PL/SQL hackery, large ``OFFSET`` values are very expensive). You'd be hitting large indexes instead of fetching values directly by key. The biggest problem is that it all has to fit on a single system, unless you manually shard the data (and re-shard if you ever outgrow that size). Plus you'd have to deal with availability using read slaves and failover.
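
For comparison, a rough sketch of that approach (using ``sqlite3`` as a stand-in for PostgreSQL/MySQL; the schema is hypothetical):

>>> import sqlite3
>>> conn = sqlite3.connect(":memory:")
>>> cur = conn.execute("CREATE TABLE timeline (timeline_key TEXT, "
...                    "obj_id INTEGER, obj_date TEXT, "
...                    "PRIMARY KEY (timeline_key, obj_id))")
>>> rows = [("brett:tweets", i, "2011-01-0%d" % i) for i in (1, 2, 3, 4)]
>>> cur = conn.executemany("INSERT INTO timeline VALUES (?, ?, ?)", rows)
>>> cur = conn.execute("SELECT obj_id FROM timeline WHERE timeline_key = ? "
...                    "ORDER BY obj_date DESC LIMIT 3", ("brett:tweets",))
>>> [obj_id for (obj_id,) in cur]
[4, 3, 2]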

MongoDB
```````

The only possible difference I see from the RDBMSs above is that you could use Mongo's "auto-sharding." If that's your thing, and you trust it, then I wish you the best of luck. You may want to `read this`_.

Redis
`````

You can fake timelines in Redis using a list or a sorted set. As with an RDBMS, you have to handle all of the sharding yourself, re-shard on growth, and use slaves and failover for availability. On top of all that, and even more critical for my use case: all of your timelines would have to fit in RAM. If you have this problem and that kind of money, please send me some.
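
A rough sketch of the sorted-set approach (assuming a recent ``redis-py`` client; the key layout and capacity are hypothetical):

>>> import redis
>>> r = redis.Redis()
>>> def add_to_timeline(key, obj_id, when, max_items=3):
...     # Score by timestamp, then trim everything below the newest
...     # max_items members so the timeline stays bounded.
...     r.zadd(key, {obj_id: when.timestamp()})
...     r.zremrangebyrank(key, 0, -(max_items + 1))
...     return r.zrevrange(key, 0, max_items - 1)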

Cassandra
`````````

Probably another great fit. You could even store much longer timelines, though I'm not sure what the equivalent of a ``SELECT`` with ``OFFSET`` would cost across the columns of a Cassandra row.

TODO
----

1. Add a better API with cursors (last-seen ``obj_date``?) for pagination.
2. Built-in Django support for updates on ``post_save`` and ``post_delete`` (roughly sketched below).
3. Compress values.
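
For item 2, the wiring might look roughly like this (a sketch only; ``Tweet`` is a hypothetical model, and ``tl`` is the ``Timeline`` instance from the example above):

>>> from django.db.models.signals import post_save, post_delete
>>> def update_timeline(sender, instance, **kwargs):
...     # Hypothetical fields: assumes Tweet has user, pk, and created_at.
...     tl.add("%s:tweets" % instance.user, instance.pk, instance.created_at)
>>> def remove_from_timeline(sender, instance, **kwargs):
...     tl.delete("%s:tweets" % instance.user, instance.pk, instance.created_at)
>>> post_save.connect(update_timeline, sender=Tweet)
>>> post_delete.connect(remove_from_timeline, sender=Tweet)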