{"id":19676809,"url":"https://github.com/victordibia/cml_churn","last_synced_at":"2025-07-23T04:33:45.834Z","repository":{"id":77574362,"uuid":"286799224","full_name":"victordibia/cml_churn","owner":"victordibia","description":null,"archived":false,"fork":false,"pushed_at":"2020-08-11T17:11:37.000Z","size":1156,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-27T05:52:35.076Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/victordibia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-11T16:51:39.000Z","updated_at":"2020-08-11T17:11:39.000Z","dependencies_parsed_at":"2023-04-14T23:25:00.408Z","dependency_job_id":null,"html_url":"https://github.com/victordibia/cml_churn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/victordibia/cml_churn","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victordibia%2Fcml_churn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victordibia%2Fcml_churn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victordibia%2Fcml_churn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victordibia%2Fcml_churn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/victordibia","download_url":"https://codeload.github.com/victordibia/c
ml_churn/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/victordibia%2Fcml_churn/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266618800,"owners_count":23957273,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T17:30:17.663Z","updated_at":"2025-07-23T04:33:45.806Z","avatar_url":"https://github.com/victordibia.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Churn Prediction Prototype\nThis project is a Cloudera Machine Learning \n([CML](https://www.cloudera.com/products/machine-learning.html)) **Applied Machine Learning \nProject Prototype**. It has all the code and data needed to deploy an end-to-end machine \nlearning project in a running CML instance.\n\n## Project Overview\nThis project builds the telco churn with model interpretability project discussed in more \ndetail in [this blog post](https://blog.cloudera.com/visual-model-interpretability-for-telco-churn-in-cloudera-data-science-workbench/). 
\nThe initial idea and code come from the FFL Interpretability report, which is now freely \navailable; you can read the full report [here](https://ff06-2020.fastforwardlabs.com/).\n\n![table_view](images/table_view.png)\n\nThe goal is to build a classifier model using Logistic Regression to predict the churn \nprobability for a group of customers from a telecoms company. On top of that, the model \ncan then be interpreted using [LIME](https://github.com/marcotcr/lime). Both the Logistic \nRegression and LIME models are then deployed using CML's real-time model deployment \ncapability, and finally a basic Flask-based web application is deployed that will let \nyou interact with the real-time model to see which factors in the data have the most \ninfluence on the churn probability.\n\nBy following the notebooks in this project, you will understand how to perform similar \nclassification tasks on CML as well as how to use the platform's major features to your \nadvantage. These features include **streamlined model experimentation**, \n**point-and-click model deployment**, and **ML app hosting**.\n\nWe will focus our attention on working within CML, using all it has to offer, while\nglossing over the details that are simply standard data science.\nWe trust that you are familiar with typical data science workflows\nand do not need detailed explanations of the code.\nNotes that are *specific to CML* will be emphasized in **block quotes**.\n\n### Initialize the Project\nThere are a couple of steps needed at the start to configure the Project and Workspace \nsettings so each step will run successfully. You **must** run the project bootstrap \nbefore running other steps. If you just want to launch the model interpretability \napplication without going through each step manually, then you can also deploy the \ncomplete project. \n\n***Project bootstrap***\n\nOpen the file `0_bootstrap.py` in a normal workbench python3 session. You only need a \n1 vCPU / 2 GiB instance. 
Once the session is loaded, click **Run \u003e Run All Lines**. \nThis file will create an Environment Variable for the project called **STORAGE**, \nwhich is the root of the default file storage location for the Hive Metastore in the \nDataLake (e.g. `s3a://my-default-bucket`). It will also upload the data used in the \nproject to `$STORAGE/datalake/data/churn/`. The original file comes as part of this \ngit repo in the `raw` folder.\n  \n***Deploy the Complete Project***\n\nIf you just wish to build the project artifacts without going through each step manually, \nrun the `8_build_projet.py` file in a python3 session. Again a 1 vCPU / 2 GiB instance \nwill be sufficient. This script will: \n* run the bootstrap\n* then create the Hive Table and import the data\n* deploy the model\n* update the application files to use this new model\n* deploy the application\n* run the model drift simulation\n\nOnce the script has completed you will see the new model and application are now available \nin the project.\n\n## Project Build\nIf you want to go through each of the steps manually to build and understand how the project \nworks, follow the steps below. There is a lot more detail and explanation/comments in each \nof the files/notebooks so it's worth looking into those. Follow the steps below and you \nwill end up with a running application.\n\n### 0 Bootstrap\nJust to reiterate: you must run the bootstrap for this project before anything else. \nSo make sure you run step 0 first. \n\nOpen the file `0_bootstrap.py` in a normal workbench python3 session. You only need a \n1 CPU / 2 GB instance. Then click **Run \u003e Run All Lines**.\n\n### 1 Ingest Data\nThis script will read in the data CSV from the file uploaded to the object store (s3/adls) set up \nduring the bootstrap and create a managed table in Hive. This is all done using Spark.\n\nOpen `1_data_ingest.py` in a Workbench session: python3, 1 CPU, 2 GB. 
Run the file.\n\n### 2 Explore Data\nThis is a Jupyter Notebook that does some basic data exploration and visualisation. It \nshows how this would be part of the data science workflow.\n\n![data](https://raw.githubusercontent.com/fletchjeff/cml_churn_demo_mlops/master/images/data.png)\n\nOpen a Jupyter Notebook session (rather than a workbench): python3, 1 CPU, 2 GB and \nopen the `2_data_exploration.ipynb` file. \n\nAt the top of the page click **Cells \u003e Run All**.\n\n### 3 Model Building\nThis is also a Jupyter Notebook to show the process of selecting and building the model \nto predict churn. It also shows more details on how the LIME model is created and a bit \nmore on what LIME is actually doing.\n\nOpen a Jupyter Notebook session (rather than a workbench): python3, 1 CPU, 2 GB and \nopen the `3_model_building.ipynb` file. \n\nAt the top of the page click **Cells \u003e Run All**.\n\n### 4 Model Training\nA pre-trained model is saved with the repo and has been placed in the `models` directory. \nIf you want to retrain the model, open the `4_train_models.py` file in a workbench session: \npython3, 1 vCPU, 2 GiB and run the file. The newly trained model will be saved in the models directory \nnamed `telco_linear`. \n\nThere are 2 other ways of running the model training process:\n\n***1. Jobs***\n\nThe **[Jobs](https://docs.cloudera.com/machine-learning/cloud/jobs-pipelines/topics/ml-creating-a-job.html)**\nfeature allows for ad hoc, recurring and dependent jobs to run specific scripts. To run this model \ntraining process as a job, create a new job by going to the Project window and clicking _Jobs \u003e\nNew Job_ and entering the following settings:\n* **Name** : Train Model\n* **Script** : 4_train_models.py\n* **Arguments** : _Leave blank_\n* **Kernel** : Python 3\n* **Schedule** : Manual\n* **Engine Profile** : 1 vCPU / 2 GiB\n\nThe rest can be left as is. Once the job has been created, click **Run** to start a manual \nrun for that job.\n\n***2. 
Experiments***\n\nThe other option is running an **[Experiment](https://docs.cloudera.com/machine-learning/cloud/experiments/topics/ml-running-an-experiment.html)**. Experiments run immediately and are used for testing different parameters in a model training process. In this instance it would be used for hyperparameter optimisation. To run an experiment, from the Project window click Experiments \u003e Run Experiment with the following settings:\n* **Script** : 4_train_models.py\n* **Arguments** : 5 lbfgs 100 _(these are the cv, solver and max_iter parameters to be passed to the \nLogisticRegressionCV() function)_\n* **Kernel** : Python 3\n* **Engine Profile** : 1 vCPU / 2 GiB\n\nClick **Start Run** and the experiment will be scheduled to build and run. Once the Run is \ncompleted you can view the outputs that are tracked with the experiment using the \n`cdsw.track_metrics` function. It's worth reading through the code to get a sense of what \nis going on.\n\n\n### 5 Serve Model\nThe **[Models](https://docs.cloudera.com/machine-learning/cloud/models/topics/ml-creating-and-deploying-a-model.html)** \nfeature is used to deploy a machine learning model into production for real-time prediction. 
To \ndeploy the model trained in the previous step, go to the Project page, click **Models \u003e New\nModel** and create a new model with the following details:\n\n* **Name**: Explainer\n* **Description**: Explain customer churn prediction\n* **File**: 5_model_serve_explainer.py\n* **Function**: explain\n* **Input**: \n```\n{\n\t\"StreamingTV\": \"No\",\n\t\"MonthlyCharges\": 70.35,\n\t\"PhoneService\": \"No\",\n\t\"PaperlessBilling\": \"No\",\n\t\"Partner\": \"No\",\n\t\"OnlineBackup\": \"No\",\n\t\"gender\": \"Female\",\n\t\"Contract\": \"Month-to-month\",\n\t\"TotalCharges\": 1397.475,\n\t\"StreamingMovies\": \"No\",\n\t\"DeviceProtection\": \"No\",\n\t\"PaymentMethod\": \"Bank transfer (automatic)\",\n\t\"tenure\": 29,\n\t\"Dependents\": \"No\",\n\t\"OnlineSecurity\": \"No\",\n\t\"MultipleLines\": \"No\",\n\t\"InternetService\": \"DSL\",\n\t\"SeniorCitizen\": \"No\",\n\t\"TechSupport\": \"No\"\n}\n```\n* **Kernel**: Python 3\n* **Engine Profile**: 1 vCPU / 2 GiB Memory\n\nLeave the rest unchanged. Click **Deploy Model** and the model will go through the build \nprocess and deploy a REST endpoint. Once the model is deployed, you can test that it is working \nfrom the Model Overview page.\n\n_**Note: This is important**_\n\nOnce the model is deployed, you must disable the additional model authentication feature. In the model settings page, untick **Enable Authentication**.\n\n![disable_auth](images/disable_auth.png)\n\n### 6 Deploy Application\nThe next step is to deploy the Flask application. The **[Applications](https://docs.cloudera.com/machine-learning/cloud/applications/topics/ml-applications.html)** feature is still quite new for CML. For this project it is used to deploy a web-based application that interacts with the underlying model created in the previous step.\n\n_**Note: This next step is important**_\n\n_In the deployed model from step 5, go to **Model \u003e Settings** and make a note of (i.e. copy) the \n\"Access Key\". 
It will look something like this (e.g. mukd9sit7tacnfq2phhn3whc4unq1f38)._\n\n_From the Project level click on \"Open Workbench\" (note you don't actually have to launch a \nsession) in order to edit a file. Select the flask/single_view.html file and paste the Access \nKey in at line 19._\n\n`        const accessKey = \"mp3ebluylxh4yn5h9xurh1r0430y76ca\";`\n\n_Save the file (if it has not auto saved already) and go back to the Project._\n\nGo to the **Applications** section and select \"New Application\" with the following:\n* **Name**: Churn Analysis App\n* **Subdomain**: churn-app _(note: this needs to be unique, so if you've done this before, \npick a more random subdomain name)_\n* **Script**: 6_application.py\n* **Kernel**: Python 3\n* **Engine Profile**: 1 vCPU / 2 GiB Memory\n\n\nAfter the Application deploys, click on the blue arrow next to the name. The initial view is a \ntable of rows randomly selected from the dataset. This shows a global view of which features are \nmost important for the predictor model. The reds show increased importance for predicting a \ncustomer that will churn and the blues for customers that will not.\n\n![table_view](images/table_view.png)\n\nClicking on any single row will show a \"local\" interpreted model for that particular data point \ninstance. Here you can see how adjusting any one of the features will change the instance's \nchurn prediction.\n\n\n![single_view_1](images/single_view_1.png)\n\nChanging the InternetService to DSL lowers the probability of churn. 
*Note: this does not mean \nthat changing the Internet Service to DSL causes the probability to go down; this is just what \nthe model would predict for a customer with those data points.*\n\n\n![single_view_2](images/single_view_2.png)\n\n### 7 Model Operations\nThe final step is the model operations, which consist of [Model Metrics](https://docs.cloudera.com/machine-learning/cloud/model-metrics/topics/ml-enabling-model-metrics.html)\nand [Model Governance](https://docs.cloudera.com/machine-learning/cloud/model-governance/topics/ml-enabling-model-governance.html).\n\n**Model Governance** is set up in the `0_bootstrap.py` script, which writes out the lineage.yml file at\nthe start of the project. For the **Model Metrics** open a workbench session (1 vCPU / 2 GiB) and open the\n`7a_ml_ops_simulation.py` file. You need to set the `model_id` number from the model created in step 5 on line\n113. The model number is on the model's main page.\n\n![model_id](images/model_id.png)\n\n`model_id = \"95\"`\n\nFrom there, run the file. This goes through a process of simulating a model that drifts over \n1000 calls to the model. The file contains comments with details of how this is done.\n\nIn the next step you can interact with and display the model metrics. Open a workbench \nsession (1 vCPU / 2 GiB) and open and run the `7b_ml_ops_visual.py` file. Again you \nneed to set the `model_id` number from the model created in step 5 on line 53. \nThe model number is on the model's main page.\n\n![model_accuracy](images/model_accuracy.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvictordibia%2Fcml_churn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvictordibia%2Fcml_churn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvictordibia%2Fcml_churn/lists"}