{"id":19610580,"url":"https://github.com/superlinear-ai/graphchain","last_synced_at":"2025-04-27T22:32:53.160Z","repository":{"id":33168817,"uuid":"124039460","full_name":"superlinear-ai/graphchain","owner":"superlinear-ai","description":"⚡️ An efficient cache for the execution of dask graphs.","archived":false,"fork":false,"pushed_at":"2023-11-01T05:56:43.000Z","size":301,"stargazers_count":71,"open_issues_count":10,"forks_count":14,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-05T03:41:23.655Z","etag":null,"topics":["cache","dask","s3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/superlinear-ai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-06T07:35:59.000Z","updated_at":"2024-06-19T04:59:57.000Z","dependencies_parsed_at":"2024-09-12T07:28:26.295Z","dependency_job_id":"f560e7c3-e99f-435a-929c-80d90ab602c7","html_url":"https://github.com/superlinear-ai/graphchain","commit_stats":null,"previous_names":["superlinear-ai/graphchain","radix-ai/graphchain"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2Fgraphchain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2Fgraphchain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2Fgraphchain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/superlinear-ai%2Fgraph
chain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/superlinear-ai","download_url":"https://codeload.github.com/superlinear-ai/graphchain/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251219600,"owners_count":21554444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cache","dask","s3"],"created_at":"2024-11-11T10:30:42.394Z","updated_at":"2025-04-27T22:32:48.144Z","avatar_url":"https://github.com/superlinear-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![License](https://img.shields.io/github/license/mashape/apistatus.svg)](https://choosealicense.com/licenses/mit/) [![PyPI](https://img.shields.io/pypi/v/graphchain.svg)](https://pypi.python.org/pypi/graphchain/)\n\n# Graphchain\n\n## What is graphchain?\n\nGraphchain is like [joblib.Memory](https://joblib.readthedocs.io/en/latest/memory.html) for dask graphs. [Dask graph computations](https://docs.dask.org/en/latest/spec.html) are cached to a local or remote location of your choice, specified by a [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html).\n\nWhen you change your dask graph (by changing a computation's implementation or its inputs), graphchain will take care to only recompute the minimum number of computations necessary to fetch the result. 
This allows you to iterate quickly over your graph without spending time on recomputing previously computed keys.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://imgs.xkcd.com/comics/is_it_worth_the_time_2x.png\" width=\"400\" /\u003e\u003cbr /\u003e\n    \u003cspan\u003eSource: \u003ca href=\"https://xkcd.com/1205/\"\u003exkcd.com/1205/\u003c/a\u003e\u003c/span\u003e\n\u003c/p\u003e\n\nThe main difference between graphchain and joblib.Memory is that in graphchain a computation's materialised inputs are _not_ serialised and hashed (which can be very expensive when the inputs are large objects such as pandas DataFrames). Instead, a chain of hashes (hence the name graphchain) of the computation object and its dependencies (which are also computation objects) is used to identify the cache file.\n\nAdditionally, the result of a computation is only cached if it is estimated that loading that computation from cache will save time compared to simply computing the computation. The decision on whether to cache depends on the characteristics of the cache location, which are different when caching to the local filesystem compared to caching to S3 for example.\n\n## Usage by example\n\n### Basic usage\n\nInstall graphchain with pip to get started:\n\n```sh\npip install graphchain\n```\n\nTo demonstrate how graphchain can save you time, let's first create a simple dask graph that (1) creates a few pandas DataFrames, (2) runs a relatively heavy operation on these DataFrames, and (3) summarises the results.\n\n```python\nimport dask\nimport graphchain\nimport pandas as pd\n\ndef create_dataframe(num_rows, num_cols):\n    print(\"Creating DataFrame...\")\n    return pd.DataFrame(data=[range(num_cols)]*num_rows)\n\ndef expensive_computation(df, num_quantiles):\n    print(\"Running expensive computation on DataFrame...\")\n    return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])\n\ndef summarize_dataframes(*dfs):\n    print(\"Summing 
DataFrames...\")\n    return sum(df.sum().sum() for df in dfs)\n\ndsk = {\n    \"df_a\": (create_dataframe, 10_000, 1000),\n    \"df_b\": (create_dataframe, 10_000, 1000),\n    \"df_c\": (expensive_computation, \"df_a\", 2048),\n    \"df_d\": (expensive_computation, \"df_b\", 2048),\n    \"result\": (summarize_dataframes, \"df_c\", \"df_d\")\n}\n```\n\nUsing `dask.get` to fetch the `\"result\"` key takes about 6 seconds:\n\n```python\n\u003e\u003e\u003e %time dask.get(dsk, \"result\")\n\nCreating DataFrame...\nRunning expensive computation on DataFrame...\nCreating DataFrame...\nRunning expensive computation on DataFrame...\nSumming DataFrames...\n\nCPU times: user 7.39 s, sys: 686 ms, total: 8.08 s\nWall time: 6.19 s\n```\n\nOn the other hand, using `graphchain.get` for the first time to fetch `\"result\"` takes only 4 seconds:\n\n```python\n\u003e\u003e\u003e %time graphchain.get(dsk, \"result\")\n\nCreating DataFrame...\nRunning expensive computation on DataFrame...\nSumming DataFrames...\n\nCPU times: user 4.7 s, sys: 519 ms, total: 5.22 s\nWall time: 4.04 s\n```\n\n`graphchain.get` is faster than `dask.get` here because `df_b` and `df_d` are defined by the same computations as `df_a` and `df_c`, so it can load them from cache once `df_a` and `df_c` have been computed and cached. 
Note that graphchain will only cache the result of a computation if loading that computation from cache is estimated to be faster than simply running the computation.\n\nRunning `graphchain.get` a second time to fetch `\"result\"` will be almost instant since this time the result itself is also available from cache:\n\n```python\n\u003e\u003e\u003e %time graphchain.get(dsk, \"result\")\n\nCPU times: user 4.79 ms, sys: 1.79 ms, total: 6.58 ms\nWall time: 5.34 ms\n```\n\nNow let's say we want to change how the result is summarised from a sum to an average:\n\n```python\ndef summarize_dataframes(*dfs):\n    print(\"Averaging DataFrames...\")\n    return sum(df.mean().mean() for df in dfs) / len(dfs)\n```\n\nIf we then ask graphchain to fetch `\"result\"`, it will detect that only `summarize_dataframes` has changed and therefore only recompute this function with inputs loaded from cache:\n\n```python\n\u003e\u003e\u003e %time graphchain.get(dsk, \"result\")\n\nAveraging DataFrames...\n\nCPU times: user 123 ms, sys: 37.2 ms, total: 160 ms\nWall time: 86.6 ms\n```\n\n### Storing the graphchain cache remotely\n\nGraphchain's cache is by default `./__graphchain_cache__`, but you can ask graphchain to use a cache at any [PyFilesystem FS URL](https://docs.pyfilesystem.org/en/latest/openers.html) such as `s3://mybucket/__graphchain_cache__`:\n\n```python\ngraphchain.get(dsk, \"result\", location=\"s3://mybucket/__graphchain_cache__\")\n```\n\n### Excluding keys from being cached\n\nIn some cases you may not want a key to be cached. 
To avoid writing certain keys to the graphchain cache, you can use the `skip_keys` argument:\n\n```python\ngraphchain.get(dsk, \"result\", skip_keys=[\"result\"])\n```\n\n### Using graphchain with dask.delayed\n\nAlternatively, you can use graphchain together with dask.delayed for easier dask graph creation:\n\n```python\nimport dask\nimport pandas as pd\n\n@dask.delayed\ndef create_dataframe(num_rows, num_cols):\n    print(\"Creating DataFrame...\")\n    return pd.DataFrame(data=[range(num_cols)]*num_rows)\n\n@dask.delayed\ndef expensive_computation(df, num_quantiles):\n    print(\"Running expensive computation on DataFrame...\")\n    return df.quantile(q=[i / num_quantiles for i in range(num_quantiles)])\n\n@dask.delayed\ndef summarize_dataframes(*dfs):\n    print(\"Summing DataFrames...\")\n    return sum(df.sum().sum() for df in dfs)\n\ndf_a = create_dataframe(num_rows=10_000, num_cols=1000)\ndf_b = create_dataframe(num_rows=10_000, num_cols=1000)\ndf_c = expensive_computation(df_a, num_quantiles=2048)\ndf_d = expensive_computation(df_b, num_quantiles=2048)\nresult = summarize_dataframes(df_c, df_d)\n```\n\nYou can then compute `result` by setting dask's `delayed_optimize` configuration option to `graphchain.optimize`:\n\n```python\nimport graphchain\nfrom functools import partial\n\noptimize_s3 = partial(graphchain.optimize, location=\"s3://mybucket/__graphchain_cache__/\")\n\nwith dask.config.set(scheduler=\"sync\", delayed_optimize=optimize_s3):\n    print(result.compute())\n```\n\n### Using a custom serializer/deserializer\n\nBy default graphchain will cache dask computations with [joblib.dump](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html) and LZ4 compression. However, you may also supply custom `serialize` and `deserialize` functions that write and read computations to and from a [PyFilesystem filesystem](https://docs.pyfilesystem.org/en/latest/introduction.html), respectively. 
For example, the following snippet shows how to serialize dask DataFrames with [dask.dataframe.to_parquet](https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html), while other objects are serialized with joblib:\n\n```python\nimport dask.dataframe\nimport graphchain\nimport fs.osfs\nimport joblib\nimport os\nfrom functools import partial\nfrom typing import Any\n\ndef custom_serialize(obj: Any, fs: fs.osfs.OSFS, key: str) -\u003e None:\n    \"\"\"Serialize dask DataFrames with to_parquet, and other objects with joblib.dump.\"\"\"\n    if isinstance(obj, dask.dataframe.DataFrame):\n        obj.to_parquet(os.path.join(fs.root_path, \"parquet\", key))\n    else:\n        with fs.open(f\"{key}.joblib\", \"wb\") as fid:\n            joblib.dump(obj, fid)\n\ndef custom_deserialize(fs: fs.osfs.OSFS, key: str) -\u003e Any:\n    \"\"\"Deserialize dask DataFrames with read_parquet, and other objects with joblib.load.\"\"\"\n    if fs.exists(f\"{key}.joblib\"):\n        with fs.open(f\"{key}.joblib\", \"rb\") as fid:\n            return joblib.load(fid)\n    else:\n        return dask.dataframe.read_parquet(os.path.join(fs.root_path, \"parquet\", key))\n\noptimize_parquet = partial(\n    graphchain.optimize,\n    location=\"./__graphchain_cache__/custom/\",\n    serialize=custom_serialize,\n    deserialize=custom_deserialize\n)\n\nwith dask.config.set(scheduler=\"sync\", delayed_optimize=optimize_parquet):\n    print(result.compute())\n```\n\n## Contributing\n\n\u003cdetails\u003e\n\u003csummary\u003eSetup: once per device\u003c/summary\u003e\n\n1. [Generate an SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent#generating-a-new-ssh-key) and [add the SSH key to your GitHub account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).\n1. 
Configure SSH to automatically load your SSH keys:\n    ```sh\n    cat \u003c\u003c EOF \u003e\u003e ~/.ssh/config\n    Host *\n      AddKeysToAgent yes\n      IgnoreUnknown UseKeychain\n      UseKeychain yes\n    EOF\n    ```\n1. [Install Docker Desktop](https://www.docker.com/get-started).\n    - Enable _Use Docker Compose V2_ in Docker Desktop's preferences window.\n    - _Linux only_:\n        - [Configure Docker and Docker Compose to use the BuildKit build system](https://docs.docker.com/develop/develop-images/build_enhancements/#to-enable-buildkit-builds). On macOS and Windows, BuildKit is enabled by default in Docker Desktop.\n        - Export your user's user id and group id so that [files created in the Dev Container are owned by your user](https://github.com/moby/moby/issues/3206):\n            ```sh\n            cat \u003c\u003c EOF \u003e\u003e ~/.bashrc\n            export UID=$(id --user)\n            export GID=$(id --group)\n            EOF\n            ```\n1. [Install VS Code](https://code.visualstudio.com/) and [VS Code's Remote-Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers). Alternatively, install [PyCharm](https://www.jetbrains.com/pycharm/download/).\n    - _Optional:_ Install a [Nerd Font](https://www.nerdfonts.com/font-downloads) such as [FiraCode Nerd Font](https://github.com/ryanoasis/nerd-fonts/tree/master/patched-fonts/FiraCode) with `brew tap homebrew/cask-fonts \u0026\u0026 brew install --cask font-fira-code-nerd-font` and [configure VS Code](https://github.com/tonsky/FiraCode/wiki/VS-Code-Instructions) or [configure PyCharm](https://github.com/tonsky/FiraCode/wiki/Intellij-products-instructions) to use `'FiraCode Nerd Font'`.\n\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003eSetup: once per project\u003c/summary\u003e\n\n1. Clone this repository.\n2. 
Start a [Dev Container](https://code.visualstudio.com/docs/remote/containers) in your preferred development environment:\n    - _VS Code_: open the cloned repository and run \u003ckbd\u003eCtrl/⌘\u003c/kbd\u003e + \u003ckbd\u003e⇧\u003c/kbd\u003e + \u003ckbd\u003eP\u003c/kbd\u003e → _Remote-Containers: Reopen in Container_.\n    - _PyCharm_: open the cloned repository and [configure Docker Compose as a remote interpreter](https://www.jetbrains.com/help/pycharm/using-docker-compose-as-a-remote-interpreter.html#docker-compose-remote).\n    - _Terminal_: open the cloned repository and run `docker compose run --rm dev` to start an interactive Dev Container.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eDeveloping\u003c/summary\u003e\n\n- This project follows the [Conventional Commits](https://www.conventionalcommits.org/) standard to automate [Semantic Versioning](https://semver.org/) and [Keep A Changelog](https://keepachangelog.com/) with [Commitizen](https://github.com/commitizen-tools/commitizen).\n- Run `poe` from within the development environment to print a list of [Poe the Poet](https://github.com/nat-n/poethepoet) tasks available to run on this project.\n- Run `poetry add {package}` from within the development environment to install a run time dependency and add it to `pyproject.toml` and `poetry.lock`.\n- Run `poetry remove {package}` from within the development environment to uninstall a run time dependency and remove it from `pyproject.toml` and `poetry.lock`.\n- Run `poetry update` from within the development environment to upgrade all dependencies to the latest versions allowed by `pyproject.toml`.\n- Run `cz bump` to bump the package's version, update the `CHANGELOG.md`, and create a git tag.\n\n\u003c/details\u003e\n\n## Developed by Radix\n\n[Radix](https://radix.ai) is a Belgium-based Machine Learning company.\n\nOur vision is to make technology work for and with us. 
We believe that if technology is used in a creative way, jobs become more fulfilling, people become the best version of themselves, and companies grow.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinear-ai%2Fgraphchain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuperlinear-ai%2Fgraphchain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuperlinear-ai%2Fgraphchain/lists"}