{"id":25578687,"url":"https://github.com/as2811-project/monolith-pipeline","last_synced_at":"2025-07-07T08:33:39.268Z","repository":{"id":263707124,"uuid":"890930311","full_name":"as2811-project/Monolith-Pipeline","owner":"as2811-project","description":"A scale replica of the pipeline used for training TikTok's monolith recommender system","archived":false,"fork":false,"pushed_at":"2024-11-25T13:19:37.000Z","size":15,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-25T15:48:14.226Z","etag":null,"topics":["etl-pipeline","monolith","recommender-system","tiktok"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/as2811-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-19T12:37:55.000Z","updated_at":"2024-12-24T02:00:43.000Z","dependencies_parsed_at":"2024-11-20T02:24:26.946Z","dependency_job_id":"029e3eae-88e1-4d9b-a300-486efef0e808","html_url":"https://github.com/as2811-project/Monolith-Pipeline","commit_stats":null,"previous_names":["as2811-project/monolith-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/as2811-project/Monolith-Pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/as2811-project%2FMonolith-Pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/as2811-project%2FMonolith-Pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/as2811-project%2FMonolith-Pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/as2811-project%2FMonolith-Pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/as2811-project","download_url":"https://codeload.github.com/as2811-project/Monolith-Pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/as2811-project%2FMonolith-Pipeline/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264043191,"owners_count":23548518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl-pipeline","monolith","recommender-system","tiktok"],"created_at":"2025-02-21T03:16:52.900Z","updated_at":"2025-07-07T08:33:39.255Z","avatar_url":"https://github.com/as2811-project.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"### Scale Replica of TikTok's Monolith Recommender System's Pipeline\n\nI recently started reading TikTok's Monolith research paper and thought I'd replicate the ETL pipeline they built for the system. In this repo, you will find Terraform files to create an EC2 instance\nand install the necessary dependencies (i.e., Kafka, Java) on launch. But you can ignore these as I've shelved IaC and AWS as a whole temporarily as I highly doubt I'll stay within the free tier limits.\n\nYou will find a notebook with some code to periodically send a row of data from the 'ratings.dat' file to the Kafka\ntopic. A separate Spark job runs to join the ratings data with the movies metadata (movies.dat) which is preloaded into memory.\n\nTo try this out:\n* Clone the repo\n* Create a `.env` file in the `spark` directory and add your Redshift credentials\n* Run `docker compose up`\n* Once everything's up and running, run the `payload.ipynb` notebook\n* If there are no errors, you will be able to see `MicroBatchExecution` logs on your terminal\n\nData Source:\nhttps://grouplens.org/datasets/movielens/10m/ (ensure this folder is in the project root)\n\nMonolith:\nhttps://arxiv.org/pdf/2209.07663\n\nI might figure out a way to deploy this on AWS whilst remaining within the free tier limits later.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fas2811-project%2Fmonolith-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fas2811-project%2Fmonolith-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fas2811-project%2Fmonolith-pipeline/lists"}