{"id":13632506,"url":"https://github.com/TimelyDataflow/differential-dataflow","last_synced_at":"2025-04-18T02:33:25.965Z","repository":{"id":31532044,"uuid":"35096584","full_name":"TimelyDataflow/differential-dataflow","owner":"TimelyDataflow","description":"An implementation of differential dataflow using timely dataflow on Rust.","archived":false,"fork":false,"pushed_at":"2025-04-06T18:45:24.000Z","size":23508,"stargazers_count":2676,"open_issues_count":109,"forks_count":187,"subscribers_count":46,"default_branch":"master","last_synced_at":"2025-04-13T16:11:24.178Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TimelyDataflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-05-05T12:00:16.000Z","updated_at":"2025-04-13T02:03:03.000Z","dependencies_parsed_at":"2023-02-17T06:45:40.274Z","dependency_job_id":"722e6933-3daa-40cb-9eba-84560548abd4","html_url":"https://github.com/TimelyDataflow/differential-dataflow","commit_stats":{"total_commits":1067,"total_committers":40,"mean_commits":26.675,"dds":0.1030927835051546,"last_synced_commit":"438804d98d1888416ef20288570b804bdba8bea9"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TimelyDataflow%2Fdifferential-dataflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TimelyDataflow%2Fdifferential-dataflow/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/TimelyDataflow%2Fdifferential-dataflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TimelyDataflow%2Fdifferential-dataflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TimelyDataflow","download_url":"https://codeload.github.com/TimelyDataflow/differential-dataflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248741199,"owners_count":21154255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T22:03:05.062Z","updated_at":"2025-04-18T02:33:25.416Z","avatar_url":"https://github.com/TimelyDataflow.png","language":"Rust","readme":"# Differential Dataflow\nAn implementation of [differential dataflow](https://github.com/timelydataflow/differential-dataflow/blob/master/differentialdataflow.pdf) over [timely dataflow](https://github.com/timelydataflow/timely-dataflow) on [Rust](http://www.rust-lang.org).\n\n## Background\n\nDifferential dataflow is a data-parallel programming framework designed to efficiently process large volumes of data and to quickly respond to arbitrary changes in input collections. You can read more in the [differential dataflow mdbook](https://timelydataflow.github.io/differential-dataflow/) and in the [differential dataflow documentation](https://docs.rs/differential-dataflow/).\n\nDifferential dataflow programs are written as functional transformations of collections of data, using familiar operators like `map`, `filter`, `join`, and `reduce`. 
Differential dataflow also includes more exotic operators such as `iterate`, which repeatedly applies a differential dataflow fragment to a collection. The programs are compiled down to [timely dataflow](https://github.com/timelydataflow/timely-dataflow) computations.\n\nFor example, here is a differential dataflow fragment to compute the out-degree distribution of a directed graph (for each degree, the number of nodes with that many outgoing edges):\n\n```rust\nlet out_degr_dist =\nedges.map(|(src, _dst)| src)    // extract source\n     .count()                   // count occurrences of source\n     .map(|(_src, deg)| deg)    // extract degree\n     .count();                  // count occurrences of degree\n```\n\nAlternately, here is a fragment that computes the set of nodes reachable from a set `roots` of starting nodes:\n\n```rust\nlet reachable =\nroots.iterate(|reach|\n    edges.enter(\u0026reach.scope())\n         .semijoin(reach)\n         .map(|(_src, dst)| dst)\n         .concat(reach)\n         .distinct()\n);\n```\n\nOnce written, a differential dataflow computation responds to arbitrary changes to its initially empty input collections, reporting the corresponding changes to each of its output collections. Differential dataflow can react quickly because it only acts where changes in collections occur, and does no work elsewhere.\n\nIn the examples above, we can add to and remove from `edges`, dynamically altering the graph, and get immediate feedback on how the results change: if the degree distribution shifts we'll see the changes, and if nodes are now (or no longer) reachable we'll hear about that too. 
We could also add to and remove from `roots`, more fundamentally altering the reachability query itself.\n\nBe sure to check out the [differential dataflow documentation](https://docs.rs/differential-dataflow), which is continually improving.\n\n## An example: counting degrees in a graph.\n\nLet's check out that out-degree distribution computation, to get a sense for how differential dataflow actually works. This example is [examples/hello.rs](https://github.com/TimelyDataflow/differential-dataflow/blob/master/examples/hello.rs) in this repository, if you'd like to follow along.\n\nA graph is a collection of pairs `(Node, Node)`, and one standard analysis is to determine the number of times each `Node` occurs in the first position, its \"degree\". The number of nodes with each degree is a helpful graph statistic.\n\nTo determine the out-degree distribution, we create a new timely dataflow scope in which we describe our computation and how we plan to interact with it.\n\n```rust\n// create a degree counting differential dataflow\nlet (mut input, probe) = worker.dataflow(|scope| {\n\n    // create edge input, count a few ways.\n    let (input, edges) = scope.new_collection();\n\n    let out_degr_distr =\n    edges.map(|(src, _dst)| src)    // extract source\n         .count()                   // count occurrences of source\n         .map(|(_src, deg)| deg)    // extract degree\n         .count();                  // count occurrences of degree\n\n    // show us something about the collection, notice when done.\n    let probe =\n    out_degr_distr\n        .inspect(|x| println!(\"observed: {:?}\", x))\n        .probe();\n\n    (input, probe)\n});\n```\n\nThe `input` and `probe` we return are how we get data into the dataflow, and how we notice when some amount of computation is complete. 
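As a point of comparison, the collection this fragment produces can be sketched as a hypothetical plain-Rust batch computation (std only; the function and names below are illustrative, not part of this repository, and compute the result in one shot rather than incrementally):\n\n```rust\nuse std::collections::HashMap;\n\n// One-shot sketch of the degree-distribution logic: map to sources,\n// count per source, then count per degree.\nfn degree_distribution(edges: \u0026[(u32, u32)]) -\u003e HashMap\u003cusize, usize\u003e {\n    // node -\u003e out-degree (count occurrences of each source)\n    let mut degrees: HashMap\u003cu32, usize\u003e = HashMap::new();\n    for \u0026(src, _dst) in edges {\n        *degrees.entry(src).or_insert(0) += 1;\n    }\n    // degree -\u003e number of nodes with that degree\n    let mut distribution: HashMap\u003cusize, usize\u003e = HashMap::new();\n    for \u0026deg in degrees.values() {\n        *distribution.entry(deg).or_insert(0) += 1;\n    }\n    distribution\n}\n\nfn main() {\n    // nodes 0 and 2 have out-degree 2, node 1 has out-degree 1\n    let edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 1)];\n    println!(\"{:?}\", degree_distribution(\u0026edges));\n}\n```\n\nThe differential version computes the same map, but then keeps it up to date as `edges` changes, which is what the rest of this walkthrough demonstrates. 

```rust
use std::collections::HashMap;

// One-shot sketch of the degree-distribution logic: map to sources,
// count per source, then count per degree.
fn degree_distribution(edges: &[(u32, u32)]) -> HashMap<usize, usize> {
    // node -> out-degree (count occurrences of each source)
    let mut degrees: HashMap<u32, usize> = HashMap::new();
    for &(src, _dst) in edges {
        *degrees.entry(src).or_insert(0) += 1;
    }
    // degree -> number of nodes with that degree
    let mut distribution: HashMap<usize, usize> = HashMap::new();
    for &deg in degrees.values() {
        *distribution.entry(deg).or_insert(0) += 1;
    }
    distribution
}

fn main() {
    // nodes 0 and 2 have out-degree 2, node 1 has out-degree 1
    let edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 1)];
    println!("{:?}", degree_distribution(&edges));
}
```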
These are timely dataflow idioms, and we won't go into them in more detail here (check out [the timely dataflow repository](https://github.com/timelydataflow/timely-dataflow)).\n\nIf we feed this computation with some random graph data, say fifty random edges among ten nodes, we get output like\n\n    Echidnatron% cargo run --release --example hello -- 10 50 1 inspect\n        Finished release [optimized + debuginfo] target(s) in 0.05s\n        Running `target/release/examples/hello 10 50 1 inspect`\n    observed: ((3, 1), 0, 1)\n    observed: ((4, 2), 0, 1)\n    observed: ((5, 4), 0, 1)\n    observed: ((6, 2), 0, 1)\n    observed: ((7, 1), 0, 1)\n    round 0 finished after 772.464µs (loading)\n\nThis shows us the records that passed the `inspect` operator, revealing the contents of the collection: there are five distinct degrees, three through seven. The records have the form `((degree, count), time, delta)` where the `time` field says this is the first round of data, and the `delta` field tells us that each record is coming into existence. If the corresponding record were departing the collection, the `delta` would instead be negative.\n\nLet's update the input by removing one edge and adding a new random edge:\n\n    observed: ((2, 1), 1, 1)\n    observed: ((3, 1), 1, -1)\n    observed: ((7, 1), 1, -1)\n    observed: ((8, 1), 1, 1)\n    round 1 finished after 149.701µs\n\nWe see here some changes! Those degree three and seven nodes have been replaced by degree two and eight nodes; looks like one node lost an edge and gave it to the other!\n\nHow about a few more changes?\n\n    round 2 finished after 127.444µs\n    round 3 finished after 100.628µs\n    round 4 finished after 130.609µs\n    observed: ((5, 3), 5, 1)\n    observed: ((5, 4), 5, -1)\n    observed: ((6, 2), 5, -1)\n    observed: ((6, 3), 5, 1)\n    observed: ((7, 1), 5, 1)\n    observed: ((8, 1), 5, -1)\n    round 5 finished after 161.82µs\n\nWell a few weird things happen here. 
First, rounds 2, 3, and 4 don't print anything. Seriously? It turns out that the random changes we made didn't affect any of the degree counts; we moved edges between nodes, preserving degrees. It can happen.\n\nThe second weird thing is that in round 5, with only two edge changes we have six changes in the output! It turns out we can have up to eight. The degree eight gets turned back into a seven, and a five gets turned into a six. But: going from five to six *changes* the count for each, and each change requires two record differences. Eight and seven were more concise because their counts were only one, meaning just arrival and departure of records rather than changes.\n\n### Scaling up\n\nThe appealing thing about differential dataflow is that it only does work where changes occur, so even if there is a lot of data, if not much changes it can still go quite fast. Let's scale our 10 nodes and 50 edges up by a factor of one million:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 1 inspect\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 1 inspect`\n    observed: ((1, 336908), 0, 1)\n    observed: ((2, 843854), 0, 1)\n    observed: ((3, 1404462), 0, 1)\n    observed: ((4, 1751921), 0, 1)\n    observed: ((5, 1757099), 0, 1)\n    observed: ((6, 1459805), 0, 1)\n    observed: ((7, 1042894), 0, 1)\n    observed: ((8, 653178), 0, 1)\n    observed: ((9, 363983), 0, 1)\n    observed: ((10, 181423), 0, 1)\n    observed: ((11, 82478), 0, 1)\n    observed: ((12, 34407), 0, 1)\n    observed: ((13, 13216), 0, 1)\n    observed: ((14, 4842), 0, 1)\n    observed: ((15, 1561), 0, 1)\n    observed: ((16, 483), 0, 1)\n    observed: ((17, 143), 0, 1)\n    observed: ((18, 38), 0, 1)\n    observed: ((19, 8), 0, 1)\n    observed: ((20, 3), 0, 1)\n    observed: ((22, 1), 0, 1)\n    round 0 finished after 15.470465014s (loading)\n\nThere are a lot more distinct degrees 
here. I sorted them because it was too painful to look at the unsorted data. You would normally see the output unsorted, because the records are just changes to values in a collection.\n\nLet's perform a single change again.\n\n    observed: ((5, 1757098), 1, 1)\n    observed: ((5, 1757099), 1, -1)\n    observed: ((6, 1459805), 1, -1)\n    observed: ((6, 1459807), 1, 1)\n    observed: ((7, 1042893), 1, 1)\n    observed: ((7, 1042894), 1, -1)\n    round 1 finished after 228.451µs\n\nAlthough the initial computation took about fifteen seconds, we get our changes in about 230 microseconds; that's tens of thousands of times faster than re-running the computation. That's pretty nice. Actually, it is small enough that the time to print things to the screen is a bit expensive, so let's stop doing that.\n\nNow we can just watch as changes roll past and look at the times.\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 1 no_inspect\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 1 no_inspect`\n    round 0 finished after 15.586969662s (loading)\n    round 1 finished after 1.070239ms\n    round 2 finished after 2.303187ms\n    round 3 finished after 208.45µs\n    round 4 finished after 163.224µs\n    round 5 finished after 118.792µs\n    ...\n\nNice. This is some hundreds of microseconds per update, which means maybe ten thousand updates per second. It's not a horrible number for my laptop, but it isn't the right answer yet.\n\n### Scaling .. \"along\"?\n\nDifferential dataflow is designed for throughput in addition to latency. We can increase the number of rounds of updates it works on concurrently, which can increase its effective throughput. This does not change the output of the computation, except that we see larger batches of output changes at once.\n\nNotice that those times above are a few hundred microseconds for each single update. 
If we work on ten rounds of updates at once, we get times that look like this:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 10 no_inspect\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 10 no_inspect`\n    round 0 finished after 15.556475008s (loading)\n    round 10 finished after 421.219µs\n    round 20 finished after 1.56369ms\n    round 30 finished after 338.54µs\n    round 40 finished after 351.843µs\n    round 50 finished after 339.608µs\n    ...\n\nThis is appealing in that rounds of ten aren't much more expensive than single updates, and we finish the first ten rounds in much less time than it takes to perform the first ten updates one at a time. Every round after that is just bonus time.\n\nAs we turn up the batching, performance improves. Here we work on one hundred rounds of updates at once:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100 no_inspect\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100 no_inspect`\n    round 0 finished after 15.528724145s (loading)\n    round 100 finished after 2.567577ms\n    round 200 finished after 1.861168ms\n    round 300 finished after 1.753794ms\n    round 400 finished after 1.528285ms\n    round 500 finished after 1.416605ms\n    ...\n\nWe are still improving, and continue to do so as we increase the batch sizes. When processing 100,000 updates at a time we take about half a second for each batch. 
This is less \"interactive\" but a higher throughput.\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100000 no_inspect\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100000 no_inspect`\n    round 0 finished after 15.65053789s (loading)\n    round 100000 finished after 505.210924ms\n    round 200000 finished after 524.069497ms\n    round 300000 finished after 470.77752ms\n    round 400000 finished after 621.325393ms\n    round 500000 finished after 472.791742ms\n    ...\n\nThis averages to about five microseconds on average; a fair bit faster than the hundred microseconds for individual updates! And now that I think about it each update was actually two changes, wasn't it. Good for you, differential dataflow!\n\n### Scaling out\n\nDifferential dataflow is built on top of [timely dataflow](https://github.com/timelydataflow/timely-dataflow), a distributed data-parallel runtime. Timely dataflow scales out to multiple independent workers, increasing the capacity of the system (at the cost of some coordination that cuts into latency).\n\nIf we bring two workers to bear, our 10 million node, 50 million edge computation drops down from fifteen seconds to just over eight seconds.\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 1 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 1 no_inspect -w2`\n    round 0 finished after 8.065386177s (loading)\n    round 1 finished after 275.373µs\n    round 2 finished after 759.632µs\n    round 3 finished after 171.671µs\n    round 4 finished after 745.078µs\n    round 5 finished after 213.146µs\n    ...\n\nThat is a so-so reduction. You might notice that the times *increased* for the subsequent rounds. 
It turns out that multiple workers just get in each other's way when there isn't much work to do.\n\nFortunately, as we work on more and more rounds of updates at the same time, the benefit of multiple workers increases. Here are the numbers for ten rounds at a time:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 10 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 10 no_inspect -w2`\n    round 0 finished after 8.083000954s (loading)\n    round 10 finished after 1.901946ms\n    round 20 finished after 3.092976ms\n    round 30 finished after 889.63µs\n    round 40 finished after 409.001µs\n    round 50 finished after 320.248µs\n    ...\n\nOne hundred rounds at a time:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100 no_inspect -w2`\n    round 0 finished after 8.121800831s (loading)\n    round 100 finished after 2.52821ms\n    round 200 finished after 3.119036ms\n    round 300 finished after 1.63147ms\n    round 400 finished after 1.008668ms\n    round 500 finished after 941.426µs\n    ...\n\nOne hundred thousand rounds at a time:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100000 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100000 no_inspect -w2`\n    round 0 finished after 8.200755198s (loading)\n    round 100000 finished after 275.262419ms\n    round 200000 finished after 279.291957ms\n    round 300000 finished after 259.137138ms\n    round 400000 finished after 340.624124ms\n    round 500000 finished after 259.870938ms\n    ...\n\nThese last numbers were about half a second with one worker, and are decently improved 
with the second worker.\n\n### Going even faster\n\nThere are several performance optimizations in differential dataflow designed to make the underlying operators as close as possible to what you would expect to write by hand. Additionally, by building on timely dataflow, you can drop in your own implementations a la carte where you know best.\n\nFor example, we know in this case that the underlying collections go through a *sequence* of changes, meaning their timestamps are totally ordered. In this case we can use a much simpler implementation, `count_total`. This reduces the update times substantially, for each batch size:\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 10 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 10 no_inspect -w2`\n    round 0 finished after 5.985084002s (loading)\n    round 10 finished after 1.802729ms\n    round 20 finished after 2.202838ms\n    round 30 finished after 192.902µs\n    round 40 finished after 198.342µs\n    round 50 finished after 187.725µs\n    ...\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100 no_inspect -w2`\n    round 0 finished after 5.588270073s (loading)\n    round 100 finished after 3.114716ms\n    round 200 finished after 2.657691ms\n    round 300 finished after 890.972µs\n    round 400 finished after 448.537µs\n    round 500 finished after 384.565µs\n    ...\n\n    Echidnatron% cargo run --release --example hello -- 10000000 50000000 100000 no_inspect -w2\n        Finished release [optimized + debuginfo] target(s) in 0.04s\n        Running `target/release/examples/hello 10000000 50000000 100000 no_inspect -w2`\n    round 0 finished after 6.486550581s (loading)\n    round 100000 finished after 
89.096615ms\n    round 200000 finished after 79.469464ms\n    round 300000 finished after 72.568018ms\n    round 400000 finished after 93.456272ms\n    round 500000 finished after 73.954886ms\n    ...\n\nThese times have now dropped quite a bit from where we started; we now absorb over one million rounds of updates per second, and produce correct (not just consistent) answers even while distributed across multiple workers.\n\n## A second example: k-core computation\n\nThe k-core of a graph is the largest subset of its edges so that all vertices with any incident edges have degree at least k. One way to find the k-core is to repeatedly delete all edges incident on vertices with degree less than k. Those edges going away might lower the degrees of other vertices, so we need to *iteratively* throw away edges on vertices with degree less than k until we stop. Maybe we throw away all the edges, maybe we stop with some left over.\n\nHere is a direct implementation, in which we repeatedly determine the set of active nodes (those with at least\n`k` edges pointing to or from them), and restrict the set `edges` to those with both `src` and `dst` present in `active`.\n\n```rust\nlet k = 5;\n\n// iteratively thin edges.\nedges.iterate(|inner| {\n\n    // determine the active vertices        /-- this is a lie --\\\n    let active = inner.flat_map(|(src,dst)| [src,dst].into_iter())\n                      .map(|node| (node, ()))\n                      .group(|_node, s, t| if s[0].1 \u003e k { t.push(((), 1)); })\n                      .map(|(node,_)| node);\n\n    // keep edges between active vertices\n    edges.enter(\u0026inner.scope())\n         .semijoin(active)\n         .map(|(src,dst)| (dst,src))\n         .semijoin(active)\n         .map(|(dst,src)| (src,dst))\n});\n```\n\nTo be totally clear, the syntax with `into_iter()` doesn't work, because Rust, and instead there is a more horrible syntax needed to get a non-heap allocated iterator over two elements. 
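For intuition, the iterative thinning can also be written as a hypothetical std-only Rust sketch (a one-shot batch version with illustrative names, not the incremental differential operator): repeatedly recompute degrees, drop edges with a low-degree endpoint, and stop at a fixed point.\n\n```rust\nuse std::collections::HashMap;\n\n// Batch sketch of k-core thinning: delete edges incident on vertices of\n// degree \u003c k until the edge set stops shrinking.\nfn k_core(mut edges: Vec\u003c(u32, u32)\u003e, k: usize) -\u003e Vec\u003c(u32, u32)\u003e {\n    loop {\n        // degree = number of edges pointing to or from each node\n        let mut degree: HashMap\u003cu32, usize\u003e = HashMap::new();\n        for \u0026(src, dst) in \u0026edges {\n            *degree.entry(src).or_insert(0) += 1;\n            *degree.entry(dst).or_insert(0) += 1;\n        }\n        let before = edges.len();\n        // keep edges whose endpoints are both active (degree \u003e= k)\n        edges.retain(|\u0026(src, dst)| degree[\u0026src] \u003e= k \u0026\u0026 degree[\u0026dst] \u003e= k);\n        if edges.len() == before {\n            return edges; // fixed point reached\n        }\n    }\n}\n\nfn main() {\n    // a triangle plus a pendant edge; the 2-core is the triangle\n    let edges = vec![(0, 1), (1, 2), (2, 0), (2, 3)];\n    println!(\"{:?}\", k_core(edges, 2)); // [(0, 1), (1, 2), (2, 0)]\n}\n```\n\nDifferential dataflow effectively maintains the fixed point of this loop as `edges` changes, which is what makes the update times below remarkable. 

```rust
use std::collections::HashMap;

// Batch sketch of k-core thinning: delete edges incident on vertices of
// degree < k until the edge set stops shrinking.
fn k_core(mut edges: Vec<(u32, u32)>, k: usize) -> Vec<(u32, u32)> {
    loop {
        // degree = number of edges pointing to or from each node
        let mut degree: HashMap<u32, usize> = HashMap::new();
        for &(src, dst) in &edges {
            *degree.entry(src).or_insert(0) += 1;
            *degree.entry(dst).or_insert(0) += 1;
        }
        let before = edges.len();
        // keep edges whose endpoints are both active (degree >= k)
        edges.retain(|&(src, dst)| degree[&src] >= k && degree[&dst] >= k);
        if edges.len() == before {
            return edges; // fixed point reached
        }
    }
}

fn main() {
    // a triangle plus a pendant edge; the 2-core is the triangle
    let edges = vec![(0, 1), (1, 2), (2, 0), (2, 3)];
    println!("{:?}", k_core(edges, 2)); // [(0, 1), (1, 2), (2, 0)]
}
```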
But, it works, and\n\n    Running `target/release/examples/degrees 10000000 50000000 1 5 kcore1`\n    Loading finished after 72204416910\n\nWell that is a thing. Who knows if 72 seconds is any good? (*ed:* it is worse than the numbers in the previous version of this readme).\n\nThe amazing thing, though, is what happens next:\n\n    worker 0, round 1 finished after Duration { secs: 0, nanos: 567171 }\n    worker 0, round 2 finished after Duration { secs: 0, nanos: 449687 }\n    worker 0, round 3 finished after Duration { secs: 0, nanos: 467143 }\n    worker 0, round 4 finished after Duration { secs: 0, nanos: 480019 }\n    worker 0, round 5 finished after Duration { secs: 0, nanos: 404831 }\n\nWe are taking about half a millisecond to *update* the k-core computation. Each edge addition and deletion could cause other edges to drop out of, or more confusingly *return* to, the k-core, and differential dataflow is correctly updating all of that for you. And it is doing it in sub-millisecond timescales.\n\nIf we crank the batching up to one thousand, we improve the throughput a fair bit:\n\n    Running `target/release/examples/degrees 10000000 50000000 1000 5 kcore1`\n    Loading finished after Duration { secs: 73, nanos: 507094824 }\n    worker 0, round 1000 finished after Duration { secs: 0, nanos: 55649900 }\n    worker 0, round 2000 finished after Duration { secs: 0, nanos: 51793416 }\n    worker 0, round 3000 finished after Duration { secs: 0, nanos: 57733231 }\n    worker 0, round 4000 finished after Duration { secs: 0, nanos: 50438934 }\n    worker 0, round 5000 finished after Duration { secs: 0, nanos: 55020469 }\n\nEach batch is doing one thousand rounds of updates in just over 50 milliseconds, averaging out to about 50 microseconds for each update, and corresponding to roughly 20,000 distinct updates per second.\n\nI think this is all great, both that it works at all and that it even seems to work pretty well.\n\n## Roadmap\n\nThe [issue 
tracker](https://github.com/timelydataflow/differential-dataflow/issues) has several open issues relating to current performance defects or missing features. If you are interested in contributing, that would be great! If you have other questions, don't hesitate to get in touch.\n\n## Acknowledgements\n\nIn addition to contributions to this repository, differential dataflow is based on work at the now defunct Microsoft Research lab in Silicon Valley, and continued at the Systems Group of ETH Zürich. Numerous collaborators at each institution (among others) have contributed both ideas and implementations.\n","funding_links":[],"categories":["Rust","\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTimelyDataflow%2Fdifferential-dataflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTimelyDataflow%2Fdifferential-dataflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTimelyDataflow%2Fdifferential-dataflow/lists"}