{"id":13647781,"url":"https://github.com/airbnb/chronon","last_synced_at":"2025-09-04T17:13:36.738Z","repository":{"id":225203495,"uuid":"300087031","full_name":"airbnb/chronon","owner":"airbnb","description":"Chronon is a data platform for serving for AI/ML applications.","archived":false,"fork":false,"pushed_at":"2025-04-17T22:14:10.000Z","size":12664,"stargazers_count":791,"open_issues_count":65,"forks_count":66,"subscribers_count":36,"default_branch":"main","last_synced_at":"2025-04-18T04:57:37.816Z","etag":null,"topics":["ai","ml"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airbnb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":"GOVERNANCE.md","roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-09-30T23:20:42.000Z","updated_at":"2025-04-17T22:14:13.000Z","dependencies_parsed_at":"2024-03-04T21:26:18.679Z","dependency_job_id":"5c631349-fe91-44cb-b788-848adc7e863b","html_url":"https://github.com/airbnb/chronon","commit_stats":null,"previous_names":["airbnb/chronon"],"tags_count":249,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airbnb%2Fchronon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airbnb%2Fchronon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airbnb%2Fchronon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airbnb%2Fchronon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ai
rbnb","download_url":"https://codeload.github.com/airbnb/chronon/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250163794,"owners_count":21385317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ml"],"created_at":"2024-08-02T01:03:46.293Z","updated_at":"2025-04-22T02:32:53.742Z","avatar_url":"https://github.com/airbnb.png","language":"Scala","funding_links":[],"categories":["Scala","人工智能"],"sub_categories":[],"readme":"# Chronon: A Data Platform for AI/ML\n\nChronon is a platform that abstracts away the complexity of data computation and serving for AI/ML applications. Users define features as transformation of raw data, then Chronon can perform batch and streaming computation, scalable backfills, low-latency serving, guaranteed correctness and consistency, as well as a host of observability and monitoring tools.\n\nIt allows you to utilize all of the data within your organization, from batch tables, event streams or services to power your AI/ML projects, without needing to worry about all the complex orchestration that this would usually entail.\n\nMore information about Chronon can be found at [chronon.ai](https://chronon.ai/).\n\n![High Level](https://chronon.ai/_images/intro.png)\n\n\n## Platform Features\n\n### Online Serving\n\nChronon offers an API for realtime fetching which returns up-to-date values for your features. 
It supports:\n\n- Managed pipelines for batch and realtime feature computation and updates to the serving backend\n- Low latency serving of computed features \n- Scalable for high fanout feature sets\n\n### Backfills \n\nML practitioners often need historical views of feature values for model training and evaluation. Chronon's backfills are:\n\n- Scalable for large time windows\n- Resilient to highly skewed data\n- Point-in-time accurate such that consistency with online serving is guaranteed\n\n### Observability, monitoring and data quality\n\nChronon offers visibility into:\n\n- Data freshness - ensure that online values are being updated in realtime\n- Online/Offline consistency - ensure that backfill data for model training and evaluation is consistent with what is being observed in online serving \n\n### Complex transformations and windowed aggregations\n\nChronon supports a range of aggregation types. For a full list see the documentation [here](https://chronon.ai/Aggregations.html).\n\nThese aggregations can all be configured to be computed over arbitrary window sizes.\n\n# Quickstart\n\nThis section walks you through the steps to create a training dataset with Chronon, using a fabricated underlying raw dataset.\n\nIncludes:\n- Example implementation of the main API components for defining features - `GroupBy` and `Join`.\n- The workflow for authoring these entities.\n- The workflow for backfilling training data.\n- The workflows for uploading and serving this data.\n- The workflow for measuring consistency between backfilled training data and online inference data.\n\nDoes not include:\n- A deep dive on the various concepts and terminologies in Chronon. 
For that, please see the [Introductory](https://chronon.ai/authoring_features/GroupBy.html) documentation.\n- Running streaming jobs.\n\n## Requirements\n\n- Docker\n\n## Setup\n\nTo get started with Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/main/docker-compose.yml) file and run it locally:\n\n```bash\ncurl -o docker-compose.yml https://chronon.ai/docker-compose.yml\ndocker-compose up\n```\n\nOnce you see some data printed with an `only showing top 20 rows` notice, you're ready to proceed with the tutorial.\n\n## Introduction\n\nIn this example, let's assume that we're a large online retailer, and we've detected a fraud vector based on users making purchases and later returning items. We want to train a model that will be called when the **checkout** flow commences and predicts whether this transaction is likely to result in a fraudulent return.\n\n## Raw data sources\n\nFabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/data) directory. It includes four tables:\n\n1. Users - includes basic information about users such as account created date; modeled as a batch data source that updates daily\n2. Purchases - a log of all purchases by users; modeled as a log table with a streaming (e.g. Kafka) event-bus counterpart\n3. Returns - a log of all returns made by users; modeled as a log table with a streaming (e.g. Kafka) event-bus counterpart\n4. 
Checkouts - a log of all checkout events; **this is the event that drives our model predictions**\n\n### Start a shell session in the Docker container\n\nIn a new terminal window, run:\n\n```shell\ndocker-compose exec main bash\n```\n\nThis will open a shell within the chronon docker container.\n\n## Chronon Development\n\nNow that the setup steps are complete, we can start creating and testing various Chronon objects to define transformation and aggregations, and generate data.\n\n### Step 1 - Define some features\n\nLet's start with three feature sets, built on top of our raw input sources.\n\n**Note: These python definitions are already in your `chronon` image. There's nothing for you to run until [Step 3 - Backfilling Data](#step-3---backfilling-data) when you'll run computation for these definitions.**\n\n**Feature set 1: Purchases data features**\n\nWe can aggregate the purchases log data to the user level, to give us a view into this user's previous activity on our platform. Specifically, we can compute `SUM`s `COUNT`s and `AVERAGE`s of their previous purchase amounts over various windows.\n\nBecause this feature is built upon a source that includes both a table and a topic, its features can be computed in both batch and streaming.\n\n```python\nsource = Source(\n    events=EventSource(\n        table=\"data.purchases\", # This points to the log table with historical purchase events\n        topic=None, # Streaming is not currently part of quickstart, but this would be where you define the topic for realtime events\n        query=Query(\n            selects=select(\"user_id\",\"purchase_price\"), # Select the fields we care about\n            time_column=\"ts\") # The event time\n    ))\n\nwindow_sizes = [Window(length=day, timeUnit=TimeUnit.DAYS) for day in [3, 14, 30]] # Define some window sizes to use below\n\nv1 = GroupBy(\n    sources=[source],\n    keys=[\"user_id\"], # We are aggregating by user\n    aggregations=[Aggregation(\n            
input_column=\"purchase_price\",\n            operation=Operation.SUM,\n            windows=window_sizes\n        ), # The sum of purchase prices in various windows\n        Aggregation(\n            input_column=\"purchase_price\",\n            operation=Operation.COUNT,\n            windows=window_sizes\n        ), # The count of purchases in various windows\n        Aggregation(\n            input_column=\"purchase_price\",\n            operation=Operation.AVERAGE,\n            windows=window_sizes\n        ) # The average purchase price by user in various windows\n    ],\n)\n```\n\nSee the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data). \n\n**Feature set 2: Returns data features**\n\nWe perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example.\n\n**Feature set 3: User data features**\n\nTurning User data into features is a little simpler, primarily because there are no aggregations to include. 
In this case, the primary key of the source data is the same as the primary key of the feature, so we're simply extracting column values rather than performing aggregations over rows:\n\n```python\nsource = Source(\n    entities=EntitySource(\n        snapshotTable=\"data.users\", # This points to a table that contains daily snapshots of all users\n        query=Query(\n            selects=select(\"user_id\",\"account_created_ds\",\"email_verified\"), # Select the fields we care about\n        )\n    ))\n\nv1 = GroupBy(\n    sources=[source],\n    keys=[\"user_id\"], # Primary key is the same as the primary key for the source table\n    aggregations=None # In this case, there are no aggregations or windows to define\n) \n```\n\nTaken from the [users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py).\n\n\n### Step 2 - Join the features together\n\nNext, we need the features that we previously defined backfilled in a single table for model training. This can be achieved using the `Join` API.\n\nFor our use case, it's very important that features are computed as of the correct timestamp. Because our model runs when the checkout flow begins, we'll want to be sure to use the corresponding timestamp in our backfill, such that feature values for model training logically match what the model will see in online inference.\n\n`Join` is the API that drives feature backfills for training data. It primarily performs the following functions:\n\n1. Combines many features together into a wide view (hence the name `Join`).\n2. Defines the primary keys and timestamps for which feature backfills should be performed. Chronon can then guarantee that feature values are correct as of this timestamp.\n3. 
Performs scalable backfills.\n\nHere is what our join looks like:\n\n```python\nsource = Source(\n    events=EventSource(\n        table=\"data.checkouts\", \n        query=Query(\n            selects=select(\"user_id\"), # The primary key used to join various GroupBys together\n            time_column=\"ts\",\n            ) # The event time used to compute feature values as-of\n    ))\n\nv1 = Join(  \n    left=source,\n    right_parts=[JoinPart(group_by=group_by) for group_by in [purchases_v1, refunds_v1, users]] # Include the three GroupBys\n)\n```\n\nTaken from the [training_set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py). \n\nThe `left` side of the join is what defines the timestamps and primary keys for the backfill (notice that it is built on top of the `checkout` event, as dictated by our use case).\n\nNote that this `Join` combines the above three `GroupBy`s into one data definition. In the next step, we'll run the command to execute computation for this whole pipeline.\n\n### Step 3 - Backfilling Data\n\nOnce the join is defined, we compile it using this command:\n\n```shell\ncompile.py --conf=joins/quickstart/training_set.py\n```\n\nThis converts it into a thrift definition that we can submit to spark with the following command:\n\n\n```shell\nrun.py --conf production/joins/quickstart/training_set.v1\n```\n\nThe output of the backfill would contain the user_id and ts columns from the left source, as well as the 11 feature columns from the three GroupBys that we created.\n\nFeature values would be computed for each user_id and ts on the left side, with guaranteed temporal accuracy. 
So, for example, if one of the rows on the left was for `user_id = 123` and `ts = 2023-10-01 10:11:23.195`, then the `purchase_price_avg_30d` feature would be computed for that user with a precise 30 day window ending on that timestamp.\n\nYou can now query the backfilled data using the spark sql shell:\n\n```shell\nspark-sql\n```\n\nAnd then: \n\n```sql\nspark-sql\u003e SELECT user_id, quickstart_returns_v1_refund_amt_sum_30d, quickstart_purchases_v1_purchase_price_sum_14d, quickstart_users_v1_email_verified from default.quickstart_training_set_v1 limit 100;\n```\n\nNote that this only selects a few columns. You can also run a `select * from default.quickstart_training_set_v1 limit 100` to see all columns, however, note that the table is quite wide and the results might not be very readable on your screen.\n\nTo exit the sql shell you can run:\n\n```shell\nspark-sql\u003e quit;\n```\n\n## Online Flows\n\nNow that we've created a join and backfilled data, the next step would be to train a model. That is not part of this tutorial, but assuming it was complete, the next step after that would be to productionize the model online. To do this, we need to be able to fetch feature vectors for model inference. That's what this next section covers.\n\n### Uploading data\n\nIn order to serve online flows, we first need the data uploaded to the online KV store. This is different than the backfill that we ran in the previous step in two ways:\n\n1. The data is not a historic backfill, but rather the most up-to-date feature values for each primary key.\n2. The datastore is a transactional KV store suitable for point lookups. 
We use MongoDB in the docker image, however you are free to integrate with a database of your choice.\n\n\nUpload the purchases GroupBy:\n\n```shell\nrun.py --mode upload --conf production/group_bys/quickstart/purchases.v1 --ds  2023-12-01\n\nspark-submit --class ai.chronon.quickstart.online.Spark2MongoLoader --master local[*] /srv/onlineImpl/target/scala-2.12/mongo-online-impl-assembly-0.1.0-SNAPSHOT.jar default.quickstart_purchases_v1_upload mongodb://admin:admin@mongodb:27017/?authSource=admin\n```\n\nUpload the returns GroupBy:\n\n```shell\nrun.py --mode upload --conf production/group_bys/quickstart/returns.v1 --ds  2023-12-01\n\nspark-submit --class ai.chronon.quickstart.online.Spark2MongoLoader --master local[*] /srv/onlineImpl/target/scala-2.12/mongo-online-impl-assembly-0.1.0-SNAPSHOT.jar default.quickstart_returns_v1_upload mongodb://admin:admin@mongodb:27017/?authSource=admin\n```\n\n### Upload Join Metadata\n\nIf we want to use the `FetchJoin` api rather than `FetchGroupby`, then we also need to upload the join metadata:\n\n```bash\nrun.py --mode metadata-upload --conf production/joins/quickstart/training_set.v2\n```\n\nThis makes it so that the online fetcher knows how to take a request for this join and break it up into individual GroupBy requests, returning the unified vector, similar to how the Join backfill produces the wide view table with all features.\n\n### Fetching Data\n\nWith the above entities defined, you can now easily fetch feature vectors with a simple API call.\n\nFetching a join:\n\n```bash\nrun.py --mode fetch --type join --name quickstart/training_set.v2 -k '{\"user_id\":\"5\"}'\n```\n\nYou can also fetch a single GroupBy (this would not require the Join metadata upload step performed earlier):\n\n```bash\nrun.py --mode fetch --type group-by --name quickstart/purchases.v1 -k '{\"user_id\":\"5\"}'\n```\n\nFor production, the Java client is usually embedded directly into services.\n\n```Java\nMap\u003cString, String\u003e keyMap = new 
HashMap\u003c\u003e();\nkeyMap.put(\"user_id\", \"123\");\nFetcher.fetch_join(new Request(\"quickstart/training_set_v1\", keyMap))\n```\nSample response:\n```\n\u003e '{\"purchase_price_avg_3d\":14.3241, \"purchase_price_avg_14d\":11.89352, ...}'\n```\n\n**Note: This Java code is not runnable in the docker env, it is just an illustrative example.**\n\n## Log fetches and measure online/offline consistency\n\nAs discussed in the introductory sections of this [README](https://github.com/airbnb/chronon?tab=readme-ov-file#platform-features), one of Chronon's core guarantees is online/offline consistency. This means that the data that you use to train your model (offline) matches the data that the model sees for production inference (online).\n\nA key element of this is temporal accuracy. This can be phrased as: **when backfilling features, the value that is produced for any given `timestamp` provided by the left side of the join should be the same as what would have been returned online if that feature was fetched at that particular `timestamp`**.\n\nChronon not only guarantees this temporal accuracy, but also offers a way to measure it.\n\nThe measurement pipeline starts with the logs of the online fetch requests. These logs include the primary keys and timestamp of the request, along with the fetched feature values. Chronon then passes the keys and timestamps to a Join backfill as the left side, asking the compute engine to backfill the feature values. It then compares the backfilled values to actual fetched values to measure consistency.\n\nStep 1: log fetches\n\nFirst, make sure you've run a few fetch requests. 
Run:\n\n`run.py --mode fetch --type join --name quickstart/training_set.v2 -k '{\"user_id\":\"5\"}'` \n\nRepeat it a few times to generate some fetches.\n\nWith that complete, you can run this to create a usable log table (these commands produce a logging hive table with the correct schema):\n\n```bash\nspark-submit --class ai.chronon.quickstart.online.MongoLoggingDumper --master local[*] /srv/onlineImpl/target/scala-2.12/mongo-online-impl-assembly-0.1.0-SNAPSHOT.jar default.chronon_log_table mongodb://admin:admin@mongodb:27017/?authSource=admin\ncompile.py --conf group_bys/quickstart/schema.py\nrun.py --mode backfill --conf production/group_bys/quickstart/schema.v1\nrun.py --mode log-flattener --conf production/joins/quickstart/training_set.v2 --log-table default.chronon_log_table --schema-table default.quickstart_schema_v1\n```\n\nThis creates a `default.quickstart_training_set_v2_logged` table that contains the results of each of the fetch requests that you previously made, along with the timestamp at which you made them and the `user` that you requested.\n\n**Note:** Once you run the above command, it will create and \"close\" the log partitions, meaning that if you make additional fetches on the same day (UTC time) it will not append. If you want to go back and generate more requests for online/offline consistency, you can drop the table (run `DROP TABLE default.quickstart_training_set_v2_logged` in a `spark-sql` shell) before rerunning the above command. \n\nNow you can compute consistency metrics with this command:\n\n```bash\nrun.py --mode consistency-metrics-compute --conf production/joins/quickstart/training_set.v2\n```\n\nThis job takes the primary key(s) and timestamps from the log table (`default.quickstart_training_set_v2_logged` in this case), and uses those to create and run a join backfill. It then compares the backfilled results to the actual logged values that were fetched online.\n\nIt produces two output tables:\n\n1. 
`default.quickstart_training_set_v2_consistency`: A human readable table that you can query to see the results of the consistency checks.\n   1. You can enter a sql shell by running `spark-sql` from your docker bash session, then query the table.\n   2.  Note that it has many columns (multiple metrics per feature), so you might want to run a `DESC default.quickstart_training_set_v2_consistency` first, then select a few columns that you care about to query.\n2. `default.quickstart_training_set_v2_consistency_upload`: A list of KV bytes that is uploaded to the online KV store, that can be used to power online data quality monitoring flows. Not meant to be human readable.\n\n\n## Conclusion\n\nUsing Chronon for your feature engineering work simplifies and improves your ML workflow in a number of ways:\n\n1. You can define features in one place, and use those definitions both for training data backfills and for online serving.\n2. Backfills are automatically point-in-time correct, which avoids label leakage and inconsistencies between training data and online inference.\n3. Orchestration for batch and streaming pipelines to keep features up to date is made simple.\n4. Chronon exposes easy endpoints for feature fetching.\n5. Consistency is guaranteed and measurable.\n\nFor a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/main?tab=readme-ov-file#benefits-of-chronon-over-other-approaches).\n\n\n# Benefits of Chronon over other approaches\n\nChronon offers the most value to AI/ML practitioners who are trying to build \"online\" models that are serving requests in real-time as opposed to batch workflows.\n\nWithout Chronon, engineers working on these projects need to figure out how to get data to their models for training/eval as well as production inference. 
As the complexity of data going into these models increases (multiple sources, complex transformation such as windowed aggregations, etc), so does the infrastructure challenge of supporting this data plumbing.\n\nGenerally, we observed ML practitioners taking one of two approaches:\n\n## The log-and-wait approach\n\nWith this approach, users start with the data that is available in the online serving environment from which the model inference will run. They log relevant features to the data warehouse. Once enough data has accumulated, they train the model on the logs and serve it with the same data.\n\nPros:\n- Features used to train the model are guaranteed to be available at serving time\n- The model can access service call features \n- The model can access data from the request context\n\n\nCons:\n- It might take a long time to accumulate enough data to train the model\n- Performing windowed aggregations is not always possible (running large range queries against production databases doesn't scale, same for event streams)\n- Cannot utilize the wealth of data already in the data warehouse\n- Maintaining data transformation logic in the application layer is messy\n\n## The replicate offline-online approach\n\nWith this approach, users train the model with data from the data warehouse, then figure out ways to replicate those features in the online environment.\n\nPros:\n- You can use a broad set of data for training\n- The data warehouse is well suited for large aggregations and other computationally intensive transformation\n\nCons:\n- Often very error prone, resulting in inconsistent data between training and serving\n- Requires maintaining a lot of complicated infrastructure to even get started with this approach\n- Serving features with realtime updates gets even more complicated, especially with large windowed aggregations\n- Unlikely to scale well to many models\n\n**The Chronon approach** \n\nWith Chronon you can use any data available in your organization, including 
everything in the data warehouse, any streaming source, service calls, etc., with guaranteed consistency between online and offline environments. It abstracts away the infrastructure complexity of orchestrating and maintaining this data plumbing, so that users can simply define features in a simple API, and trust Chronon to handle the rest.\n\n# Contributing\n\nWe welcome contributions to the Chronon project! Please read [CONTRIBUTING](CONTRIBUTING.md) for details.\n\n# Support\n\nUse the GitHub issue tracker for reporting bugs or feature requests.\nJoin our [community Slack workspace](https://join.slack.com/t/chrononworkspace/shared_invite/zt-2r621b6hw-pm552u71Y257Vtpt4RTiyg) for discussions, tips, and support.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairbnb%2Fchronon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairbnb%2Fchronon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairbnb%2Fchronon/lists"}