{"id":14973844,"url":"https://github.com/suryadev99/stream_processing_website_click_data","last_synced_at":"2026-03-10T13:05:15.132Z","repository":{"id":228574014,"uuid":"774363337","full_name":"suryadev99/stream_processing_website_click_data","owner":"suryadev99","description":"Stream Processing of website click data using Kafka and monitored and visualised using Prometheus and Grafana","archived":false,"fork":false,"pushed_at":"2024-07-11T17:52:46.000Z","size":520,"stargazers_count":0,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-10T04:56:38.866Z","etag":null,"topics":["clickdata","data","dataengineering","docker","flink-kafka","flink-metrics","flink-stream-processing","git","grafana","kafka","kafka-streams","kafka-topic","prometheus","psql","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/suryadev99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-19T12:23:27.000Z","updated_at":"2024-07-11T17:52:50.000Z","dependencies_parsed_at":"2024-06-02T19:56:34.683Z","dependency_job_id":"372d8074-7c14-48e1-9e64-81d7ef591977","html_url":"https://github.com/suryadev99/stream_processing_website_click_data","commit_stats":{"total_commits":13,"total_committers":2,"mean_commits":6.5,"dds":"0.23076923076923073","last_synced_commit":"1e18f79d95c0c0042e60a9d05ad18464c98bfd8c"},"previous_names":["suryadev99/streaming_processing_website_click_data"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:gi
thub/suryadev99/stream_processing_website_click_data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryadev99%2Fstream_processing_website_click_data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryadev99%2Fstream_processing_website_click_data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryadev99%2Fstream_processing_website_click_data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryadev99%2Fstream_processing_website_click_data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/suryadev99","download_url":"https://codeload.github.com/suryadev99/stream_processing_website_click_data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryadev99%2Fstream_processing_website_click_data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30334412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-10T12:41:07.687Z","status":"ssl_error","status_checked_at":"2026-03-10T12:41:06.728Z","response_time":106,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clickdata","data","dataengineering","docker","flink-kafka","flink-metrics","flink-stream-processing","git","grafana","kafka","kafka-streams","kafka-topic","prometheus","psql","python"],"created_at":"2024-09-24T13:49:33.780Z","updated_at":"2026-03-10T13:05:15.103Z","avatar_url":"https://github.com/suryadev99.png","language":"Python","readme":"# Stream Processing of Website Click Data\n\n## Project\n\nSuppose we run an e-commerce website. An everyday use case in e-commerce is to identify, for every product purchased, the click that led to that purchase. Attribution is the joining of a checkout (purchase) of a product to a click. There are multiple types of **[attribution](https://www.shopify.com/blog/marketing-attribution#3)**; we will focus on `First Click Attribution`.\n\nOur objectives are:\n 1. Enrich checkout data with the user name. The user data is in a transactional database.\n 2. Identify which click leads to a checkout (aka attribution). For every product checkout, we consider **the earliest click a user made on that product in the previous hour to be the click that led to a checkout**.\n 3. Log the checkouts and their corresponding attributed clicks (if any) into a table.\n\n## Prerequisites\n\nTo run the code, you'll need the following:\n\n1. [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)\n2. [Docker](https://docs.docker.com/engine/install/) with at least 4GB of RAM and [Docker Compose](https://docs.docker.com/compose/install/) v1.27.0 or later\n3. 
[psql](https://blog.timescale.com/tutorials/how-to-install-psql-on-mac-ubuntu-debian-windows/)\n\nIf you are using Windows, please set up WSL and a local Ubuntu virtual machine by following **[the instructions here](https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview)**. Install the above prerequisites in your Ubuntu terminal; if you have trouble installing Docker, follow **[the steps here](https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04#step-1-installing-docker)**.\n\n## Architecture\n\nOur streaming pipeline architecture is as follows (from left to right):\n\n1. **`Application`**: The website generates click and checkout event data.\n2. **`Queue`**: The click and checkout data are sent to their corresponding Kafka topics.\n3. **`Stream processing`**:\n   1. Flink reads data from the Kafka topics.\n   2. The click data is stored in our cluster state. Note that we only store click information for the last hour, and we only store one click per user-product combination.\n   3. The checkout data is enriched with user information by querying the user table in Postgres.\n   4. The checkout data is left joined with the click data (in the cluster state) to see if the checkout can be attributed to a click.\n   5. The enriched and attributed checkout data is logged into a Postgres sink table.\n4. **`Monitoring \u0026 Alerting`**: Apache Flink metrics are pulled by Prometheus and visualized using Grafana.\n\n![Architecture](./assets/images/arch.png)\n\n## Code design\n\nWe use Flink's Table API to:\n\n1. Define source systems: **clicks, checkouts, and users**, which generate fake click and checkout data.\n2. Define how to process the data (enrich and attribute): **enriching with user data and attributing checkouts**.\n3. Define the sink system: **sink**.\n\nWe store the SQL DDL and DML in the folders `source`, `process`, and `sink`, corresponding to the above steps.
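\n\nAs a rough illustration of what these steps look like (a sketch only: the table and column names below are assumptions for illustration, not copied from the repo's DDL), a Kafka-backed clicks source and a first-click attribution join in Flink SQL might resemble:\n\n```sql\n-- Hypothetical sketch: a Kafka-backed clicks source table\nCREATE TABLE clicks (\n    click_id STRING,\n    user_id INT,\n    product_id STRING,\n    click_time TIMESTAMP(3),\n    WATERMARK FOR click_time AS click_time - INTERVAL '15' SECOND\n) WITH (\n    'connector' = 'kafka',\n    'topic' = 'clicks',\n    'properties.bootstrap.servers' = 'kafka:9092',\n    'format' = 'json'\n);\n\n-- Hypothetical sketch: attribute each checkout to the user's earliest\n-- click on that product in the previous hour (first-click attribution)\nSELECT\n    co.checkout_id,\n    co.checkout_time,\n    MIN(cl.click_time) AS click_time\nFROM checkouts AS co\nLEFT JOIN clicks AS cl\n    ON co.user_id = cl.user_id\n    AND co.product_id = cl.product_id\n    AND cl.click_time BETWEEN co.checkout_time - INTERVAL '1' HOUR AND co.checkout_time\nGROUP BY co.checkout_id, co.checkout_time;\n```\n\nThe actual DDL and DML live in the `source`, `process`, and `sink` folders.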
## Run streaming job\n\nClone the repo and run the streaming job (via terminal) as shown below:\n\n```bash\ngit clone https://github.com/suryadev99/stream_processing_website_click_data.git\ncd stream_processing_website_click_data\nmake run # restart all containers \u0026 start the streaming job\n```\n\n1. **Apache Flink UI**: Open [http://localhost:8081/](http://localhost:8081/) or run `make ui`, then click on `Jobs -\u003e Running Jobs -\u003e checkout-attribution-job` to see our running job.\n2. **Grafana**: Visualize system metrics with Grafana: use the `make open` command or go to [http://localhost:3000](http://localhost:3000) in your browser (username: `admin`, password: `flink`).\n\n## Check output\n\nOnce we start the job, it will run asynchronously. We can open the Flink UI ([http://localhost:8081/](http://localhost:8081/) or `make ui`) and click on `Jobs -\u003e Running Jobs -\u003e checkout-attribution-job` to see our running job.\n\n![Flink UI](assets/images/flink_ui_dag.png)\n\nWe can check the output of our job by looking at the attributed checkouts.\n\nOpen a Postgres terminal as shown below:\n\n```bash\npgcli -h localhost -p 5432 -U postgres -d postgres\n# password: postgres\n```\n\nUse the query below to check that the output updates every few seconds:\n\n```sql\nSELECT checkout_id, click_id, checkout_time, click_time, user_name FROM commerce.attributed_checkouts ORDER BY checkout_time DESC LIMIT 5;\n```\n\n## Tear down\n\nUse `make down` to spin down the containers.\n\n## Contributing\n\nContributions are welcome. If you would like to contribute, you can help by opening a GitHub issue or putting up a PR.\n\n## References\n\n1. [Apache Flink docs](https://nightlies.apache.org/flink/flink-docs-release-1.17/)\n2. 
[Flink Prometheus example project](https://github.com/mbode/flink-prometheus-example)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuryadev99%2Fstream_processing_website_click_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuryadev99%2Fstream_processing_website_click_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuryadev99%2Fstream_processing_website_click_data/lists"}