{"id":28378565,"url":"https://github.com/mage-ai/machine_learning","last_synced_at":"2025-07-14T20:33:06.700Z","repository":{"id":233171528,"uuid":"786117203","full_name":"mage-ai/machine_learning","owner":"mage-ai","description":"The definitive end-to-end machine learning (ML lifecycle) guide and tutorial for data engineers.","archived":false,"fork":false,"pushed_at":"2024-11-14T07:20:14.000Z","size":1323,"stargazers_count":21,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-25T13:48:13.319Z","etag":null,"topics":["artificial-intelligence","data-engineering","machine-learning"],"latest_commit_sha":null,"homepage":"https://www.mage.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mage-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-13T13:38:12.000Z","updated_at":"2025-06-24T09:27:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"fecd22d2-17af-4ff8-8e4f-2576753566cb","html_url":"https://github.com/mage-ai/machine_learning","commit_stats":null,"previous_names":["mage-ai/machine_learning"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mage-ai/machine_learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mage-ai%2Fmachine_learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mage-ai%2Fmachine_learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mage-ai%2Fmachine_learning/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/mage-ai%2Fmachine_learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mage-ai","download_url":"https://codeload.github.com/mage-ai/machine_learning/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mage-ai%2Fmachine_learning/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265344832,"owners_count":23750566,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","data-engineering","machine-learning"],"created_at":"2025-05-30T02:06:44.461Z","updated_at":"2025-07-14T20:33:06.683Z","avatar_url":"https://github.com/mage-ai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [The definitive end-to-end machine learning (ML lifecycle) guide and tutorial for data engineers](https://www.notion.so/mageai/The-definitive-end-to-end-machine-learning-ML-lifecycle-guide-and-tutorial-for-data-engineers-ea24db5e562044c29d7227a67e70fd56?pvs=4)\n\n\u003cimg src=\"https://github.com/mage-ai/assets/blob/main/machine-learning/mage-ml-guide.png?raw=true\" /\u003e\n\n## TLDR\n\n1. Define problem\n1. Prepare data\n1. Train and evaluate\n1. Deploy and integrate\n1. Observe\n1. Experiment\n1. Retrain\n\n\u003cimg src=\"https://github.com/mage-ai/assets/blob/main/machine-learning/ml.jpg?raw=true\" /\u003e\n\n---\n\n## Setup\n\n1. Clone the repository: `git clone https://github.com/mage-ai/machine_learning.git`.\n    1. 
Stay in the same directory that you executed this command in; don’t change directory.\n\n1. Run Docker:\n    ```bash\n    docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start machine_learning\n    ```\n\n    If you don’t use macOS or Linux, check out other examples in Mage’s [quick start guide](https://docs.mage.ai/getting-started/setup).\n\n1. Open a browser and go to [http://localhost:6789](http://localhost:6789).\n\n---\n\n## 🕵️‍♀️ Define problem\n\nClearly state the business problem you're trying to solve with machine learning and your hypothesis for how it can be solved.\n\n1. Open pipeline [`define_problem`](http://localhost:6789/pipelines/define_problem/edit).\n\n1. Define the problem and your hypothesis.\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/23d45e9e-cd03-4598-973d-590008788eb6\"\u003e\u003c/video\u003e\n\n---\n\n## 💾 Prepare data\n\nCollect data from various sources, generate additional training data if needed, and\nperform feature engineering to transform the raw data into a set of useful input features.\n\n1. The pipeline [`core_data_users_v0`](http://localhost:6789/pipelines/core_data_users_v0/edit)\n   contains 3 tables that are joined together.\n\n1. 
Pipeline [`prepare_data`](http://localhost:6789/pipelines/prepare_data/edit) is used in multiple\n   other pipelines to perform data preparation on input datasets.\n\n    For example, the [`ml_training`](http://localhost:6789/pipelines/ml_training/edit)\n    pipeline that’s responsible for training an ML model will first run the above 2 pipelines to\n    build the training set that’s used to train and test the model.\n\n### Collecting and combining core user data\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/06334154-96c1-48ae-9045-615175184ffa\"\u003e\u003c/video\u003e\n\n### Feature engineering\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/5c8749aa-630e-4622-b7a9-35273feda140\"\u003e\u003c/video\u003e\n\n---\n\n## 🦾 Train and evaluate\n\nUse the training data to teach the machine learning model to make accurate predictions.\nEvaluate the trained model's performance on a test set.\n\n\n1. The [`ml_training`](http://localhost:6789/pipelines/ml_training/edit) pipeline takes in a\n    training set and trains an XGBoost classifier to predict in what scenarios a user would unsubscribe\n    from a marketing email.\n\n1. This pipeline will also evaluate the model’s performance on a test data set.\n    It’ll provide visualizations and explain which features are important using SHAP values.\n\n1. Finally, this pipeline will serialize the model and its weights to disk to be used during\n    the inference phase.\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/5a4d86f8-3f0b-41c2-9127-99620bd5fe0e\"\u003e\u003c/video\u003e\n\n---\n\n## 🤖 Deploy and integrate\n\nDeploy the trained model to a production environment to generate predictions on new data,\neither in real time via an API or in batch pipelines.\nIntegrate the model's predictions with other business applications.\n\n1. 
Once the model is done training and has been packaged for deployment, before we can use it to\n    make predictions, we’ll need to set up our feature store that’ll serve user features on demand\n    when making a prediction.\n\n1. Use the [`ml_feature_fetching`](http://localhost:6789/pipelines/ml_feature_fetching/edit)\n    pipeline to prepare the features for each user ahead of time before progressing to the inference\n    phase.\n\n1. The [`ml_inference_offline`](http://localhost:6789/pipelines/ml_inference_offline/edit)\n    pipeline is responsible for making batch predictions offline on the entire set of users.\n\n1. The [`ml_inference_online`](http://localhost:6789/pipelines/ml_inference_online/edit)\n    pipeline serves real-time model predictions and can be interacted with via an API request.\n    Use the [`ML playground`](http://localhost:6789/pipelines/ml_playground/edit)\n    to interact with this model and make online predictions.\n\n### Feature store and fetching\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/7deeee51-0fcb-44bf-8192-ae48b2bac0c7\"\u003e\u003c/video\u003e\n\n### Batch offline predictions\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/0ce55744-8058-4b79-8699-d09c16f6aa0e\"\u003e\u003c/video\u003e\n\n### Real-time online predictions\n\n1. The pipeline used for online inference is called\n    [`ml_inference_online`](http://localhost:6789/pipelines/ml_inference_online/edit).\n\n1. Before interacting with the online predictions pipeline, you must first create an API trigger for the\n    [`ml_inference_online`](http://localhost:6789/pipelines/ml_inference_online/edit) pipeline.\n    You can follow the [general instructions](https://docs.mage.ai/orchestration/triggers/trigger-pipeline-api)\n    to create an API trigger.\n\n1. 
The video below is for the pipeline named\n    [`ml_playground`](http://localhost:6789/pipelines/ml_playground/edit), which contains\n    [no-code UI interactions](https://docs.mage.ai/interactions/overview) to make it easy to\n    play around with the online predictions.\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/f101bda8-603b-47bb-ae88-7c825dbdba08\"\u003e\u003c/video\u003e\n\n---\n\n## 🔭 Observe\n\nMonitor the deployed model's prediction performance, latency, and system health in the production environment.\n\n*Example coming soon.*\n\n\u003cimg src=\"https://github.com/mage-ai/assets/blob/main/machine-learning/observe.png?raw=true\" /\u003e\n\n---\n\n## 🧪 Experiment\n\nConduct controlled experiments like A/B tests to measure the impact of the model's predictions on\nbusiness metrics. Compare the new model's performance to a control model or previous model versions.\n\n*Example coming soon.*\n\n\u003cimg src=\"https://github.com/mage-ai/assets/blob/main/machine-learning/experiment.png?raw=true\" /\u003e\n\n---\n\n## 🏋️ Retrain\n\nContinuously gather new training data and retrain the model periodically to maintain and\nimprove prediction performance.\n\n1. Every 2 hours, the retraining pipeline named\n    [`ml_retraining_model`](http://localhost:6789/pipelines/ml_retraining_model/edit) will run.\n\n1. 
The retraining pipeline triggers the [`ml_training`](http://localhost:6789/pipelines/ml_training/edit)\n    pipeline if the following contrived condition is met:\n\n    The number of partitions created for the `core_data.users_v0` data product is divisible by 4.\n\n\u003cvideo src=\"https://github.com/mage-ai/assets/assets/1066980/885eec0f-71b2-4485-87b1-e0931ec16537\"\u003e\u003c/video\u003e\n\n---\n\n## Conclusion\n\n\u003cimg\n    src=\"https://github.com/mage-ai/assets/blob/main/machine-learning/ml%20tools.jpg?raw=true\"\n/\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmage-ai%2Fmachine_learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmage-ai%2Fmachine_learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmage-ai%2Fmachine_learning/lists"}