https://github.com/amccool/amccool.github.io

the IO pages

The Log (https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying), Jay Kreps

## projects mentioned

### Academic papers, systems, talks, and blogs:

The End

If you made it this far you know most of what I know about logs.

Here are a few interesting references you may want to check out.

Everyone seems to use different terms for the same things, so it is a bit of a puzzle to connect the database literature to the distributed systems stuff to the various enterprise software camps to the open source world. Nonetheless, here are a few pointers in the general direction.

Academic papers, systems, talks, and blogs:

  • A good overview of state machine and primary-backup replication


  • PacificA is a generic framework for implementing log-based distributed storage systems at Microsoft.


  • Spanner—Not everyone loves logical time for their logs. Google's new database tries to use physical time and models the uncertainty of clock drift directly by treating the timestamp as a range.


  • Datomic: Deconstructing the Database is a great presentation by Rich Hickey, the creator of Clojure, on his startup's database product.


  • A Survey of Rollback-Recovery Protocols in Message-Passing Systems. I found this to be a very helpful introduction to fault-tolerance and the practical application of logs to recovery outside databases.


  • Reactive Manifesto: I'm actually not quite sure what is meant by reactive programming, but I think it means the same thing as "event driven". This link doesn't have much info, but this class by Martin Odersky (of Scala fame) looks fascinating.

  • Paxos!
    • Original paper is here. Leslie Lamport has an interesting history of how the algorithm was created in the 1980s but not published until 1998 because the reviewers didn't like the Greek parable in the paper and he didn't want to change it.

    • Even once the original paper was published it wasn't well understood. Lamport tries again and this time even includes a few of the "uninteresting details" of how to put it to use using these new-fangled automatic computers. It is still not widely understood.


    • Fred Schneider and Butler Lampson each give a more detailed overview of applying Paxos in real systems.

    • A few Google engineers summarize their experience implementing Paxos in Chubby.

    • I actually found all the Paxos papers pretty painful to understand but dutifully struggled through. But you don't need to because this video by John Ousterhout (of log-structured filesystem fame!) will make it all very simple. Somehow these consensus algorithms are much better presented by drawing them as the communication rounds unfold, rather than in a static presentation in a paper. Ironically, this video was created in an attempt to show that Paxos was hard to understand.


    • Using Paxos to Build a Scalable Consistent Data Store: This is a cool paper on using a log to build a data store. Jun, one of the co-authors, is also one of the earliest engineers on Kafka.


  • Paxos has competitors! Actually, each of these maps a lot more closely to the implementation of a log and is probably more suitable for practical implementation:

    • Viewstamped Replication by Barbara Liskov is an early algorithm to directly model log replication.


    • Zab is the algorithm used by Zookeeper.


    • RAFT is an attempt at a more understandable consensus algorithm. The video presentation, also by John Ousterhout, is great too.


  • You can see the role of the log in action in different real distributed databases.

    • PNUTS is a system which attempts to apply the log-centric design of traditional distributed databases at large scale.


    • HBase and Bigtable are both further examples of logs in modern databases.

    • LinkedIn's own distributed database Espresso, like PNUTS, uses a log for replication, but takes a slightly different approach, using the underlying table itself as the source of the log.


  • If you find yourself comparison shopping for a replication algorithm, this paper may help you out.


  • Replication: Theory and Practice is a great book that collects a bunch of summary papers on replication in distributed systems. Many of the chapters are online (e.g. 1, 4, 5, 6, 7, 8).

  • Stream processing. This is a bit too broad to summarize, but here are a few things I liked.
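
The Spanner entry in the list above describes treating a timestamp as a range that models clock uncertainty. A minimal sketch of that idea follows. It is not Spanner's API: the names `TimeInterval`, `interval_now`, and `order`, and the fixed `max_drift` bound, are all hypothetical and chosen only for illustration. The point it shows is that two events can be ordered with certainty only when their ranges do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A timestamp with explicit uncertainty (hypothetical type).

    The true time is assumed to lie somewhere in [earliest, latest].
    """
    earliest: float
    latest: float

    def definitely_before(self, other: "TimeInterval") -> bool:
        # Certain only if this whole range ends before the other begins.
        return self.latest < other.earliest


def interval_now(clock_reading: float, max_drift: float) -> TimeInterval:
    """Turn a local clock reading into a range, assuming the local clock
    is never off by more than max_drift (an assumption of this sketch)."""
    return TimeInterval(clock_reading - max_drift, clock_reading + max_drift)


def order(a: TimeInterval, b: TimeInterval) -> str:
    """Report the order of two events, or admit that it is unknown."""
    if a.definitely_before(b):
        return "a before b"
    if b.definitely_before(a):
        return "b before a"
    return "uncertain"


a = interval_now(100.0, max_drift=2.0)  # range [98.0, 102.0]
b = interval_now(105.0, max_drift=2.0)  # range [103.0, 107.0]
c = interval_now(101.0, max_drift=2.0)  # range [99.0, 103.0]
print(order(a, b))
print(order(a, c))
```

With overlapping ranges such as `a` and `c`, a system built this way has to either wait until the uncertainty has passed or fall back on some other tie-breaker; which of those a real database does is outside the scope of this sketch.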

Enterprise software has all the same problems but with different names, a smaller scale, and XML. Ha ha, just kidding. Kind of.

  • Event Sourcing: As far as I can tell, this is basically the enterprise software engineer's way of saying "state machine replication". It's interesting that the same idea would be invented again in such a different context. Event sourcing seems to focus on smaller, in-memory use cases. This approach to application development seems to combine the "stream processing" that occurs on the log of events with the application. Since this becomes pretty non-trivial when the processing is large enough to require data partitioning for scale, I focus on stream processing as a separate infrastructure primitive.


  • Change Data Capture—There is a small industry around getting data out of databases, and this is the most log-friendly style of data extraction.


  • Enterprise Application Integration seems to be about solving the data integration problem when what you have is a collection of off-the-shelf enterprise software like CRM or supply-chain management software.


  • Complex Event Processing (CEP): Fairly certain nobody knows what this means or how it actually differs from stream processing. The difference seems to be that the focus is on unordered streams and on event filtering and detection rather than aggregation, but this, in my opinion, is a distinction without a difference. I think any system that is good at one should be good at the other.


  • Enterprise Service Bus—I think the enterprise service bus concept is very similar to some of the ideas I have described around data integration. This idea seems to have been moderately successful in enterprise software communities and is mostly unknown among web folks or the distributed data infrastructure crowd.
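
The Event Sourcing entry above equates it with state machine replication: state is never stored directly, it is derived by applying an ordered log of events with a deterministic function. A minimal in-memory sketch of that idea follows. The event shape `(kind, account, amount)` and the function names `apply_event` and `replay` are hypothetical, picked for illustration rather than taken from any particular framework.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape for this sketch: (kind, account, amount).
Event = Tuple[str, str, int]


def apply_event(state: Dict[str, int], event: Event) -> Dict[str, int]:
    """Deterministic transition: the same state and event always
    produce the same new state. The input state is not mutated."""
    kind, account, amount = event
    new_state = dict(state)
    balance = new_state.get(account, 0)
    if kind == "deposit":
        new_state[account] = balance + amount
    elif kind == "withdraw":
        new_state[account] = balance - amount
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return new_state


def replay(log: Iterable[Event]) -> Dict[str, int]:
    """Rebuild state from scratch by applying every event in log order."""
    state: Dict[str, int] = {}
    for event in log:
        state = apply_event(state, event)
    return state


log = [
    ("deposit", "alice", 100),
    ("withdraw", "alice", 30),
    ("deposit", "bob", 5),
]

# Two "replicas" that read the same log in the same order agree on the state.
replica_1 = replay(log)
replica_2 = replay(list(log))
print(replica_1 == replica_2)
```

Because `apply_event` is deterministic, replaying a prefix of the log gives the state as of that point, which is what makes the log usable for both recovery and replication.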

Interesting open source stuff:

  • Kafka is the "log as a service" project that is the basis for much of this post.


  • BookKeeper and Hedwig comprise another open source "log as a service". They seem to be more targeted at data system internals than at event data.


  • Databus is a system that provides a log-like overlay for database tables.


  • Akka is an actor framework for Scala. It has an add on, eventsourced, that provides persistence and journaling.


  • Samza is a stream processing framework we are working on at LinkedIn. It uses a lot of the ideas in this article as well as integrating with Kafka as the underlying log.


  • Storm is a popular stream processing framework that integrates well with Kafka.


  • Spark Streaming is a stream processing framework that is part of Spark.


  • Summingbird is a layer on top of Storm or Hadoop that provides a convenient computing abstraction.