{"id":27364834,"url":"https://github.com/razorcd/ml-project","last_synced_at":"2025-04-13T05:36:10.936Z","repository":{"id":46482091,"uuid":"415130871","full_name":"razorcd/ml-project","owner":"razorcd","description":"Machine Learning Ops project","archived":false,"fork":false,"pushed_at":"2022-08-29T16:39:25.000Z","size":1001149,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T11:35:20.609Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/razorcd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-08T21:25:10.000Z","updated_at":"2024-03-19T20:45:07.000Z","dependencies_parsed_at":"2023-01-16T18:45:24.568Z","dependency_job_id":null,"html_url":"https://github.com/razorcd/ml-project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razorcd%2Fml-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razorcd%2Fml-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razorcd%2Fml-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/razorcd%2Fml-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/razorcd","download_url":"https://codeload.github.com/razorcd/ml-project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248670509,"owners_count":2114
2897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-13T05:36:10.513Z","updated_at":"2025-04-13T05:36:10.928Z","avatar_url":"https://github.com/razorcd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML project\n\nOnce you move to Berlin, the hardest thing you face is renting an apartment. The competition is very high and prices vary a lot.\nI built this ML application to predict the baseRent of an apartment based on features like living space, rooms, area, etc.\nThis way you can check what you can afford or, if you have already found an apartment, check what a fair price would be.\n\nThe dataset is based on data from 2018 - 2019. I suggest adding 10% to the final prediction to reflect 2021 rental prices.\n\n## Data Set\n\nhttps://www.kaggle.com/corrieaar/apartment-rental-offers-in-germany\n\nThe dataset contains listings from all of Germany. \nI selected only the data from Berlin, which gives around 10000 records.\n\n## Development System\n  - OS: x64 Linux Ubuntu\n\n# Project progress\n\nThe Jupyter notebooks have progress comments on each step.\n\n1. Prepare data: \n    - source: [data_analysis.ipynb](data_analysis.ipynb)\n    - selected only Berlin data\n    - checked and removed invalid data\n    - selected features\n    - split data 60/20/20\n    \n2. 
Trained a Linear Regression model:\n    - source: [capstoneProject_linearRegression.ipynb](capstoneProject_linearRegresion.ipynb)\n    - note that this notebook is linked to the data_analysis notebook\n    - tried different columns and countries to find the most accurate combination\n    - Columns selected for training: \n        - `x = [cellar\tbaseRent\tlivingSpace\tnoRooms\theating\tneighborhoods]`,\n        - `y = 'baseRent'` (numerical)\n    - found `MAE = 257.0` and `Model max deviation for 50: 15.041 percent`\n3. Trained an xgboost model:\n    - source: [capstoneProject__xgboost.ipynb](capstoneProject__xgboost.ipynb)\n    - source2: [target_encoding/capstoneProject__xgboost_target_encoding.ipynb](target_encoding/capstoneProject__xgboost_target_encoding.ipynb)\n    - note that this notebook is linked to the data_analysis notebook\n    - tried different xgboost parameters: max_depth, eta\n    - found best xgboost arguments with smallest depth:  `max_depth: 20, eta: 0.6`\n    - Columns selected for training: \n        - `x = [newlyConst\tbalcony\thasKitchen\tcellar\tlivingSpace\tlift\tnoRooms\tgarden\theating\tneighborhoods]`,\n        - `y = 'baseRent'` (numerical)\n    - found `MAE = 219` and `Model max deviation for 50: 27.391 percent`\n4. Trained a Keras neural network model:\n    - source: [capstoneProject_keras.ipynb](capstoneProject_keras.ipynb)\n    - note that this notebook is linked to the data_analysis notebook\n    - converted categorical values to numerical values using label encoding\n    - converted booleans to ints\n    - used different Dense layers with various units\n    - tried different Keras parameters: learning_rate, batch_size, epochs, optimizer\n    - found best Keras parameters:  `learning_rate: 0.01, batch_size: 50, epochs: 40`\n    - Columns selected for training: \n        - `x = [newlyConst\tbalcony\thasKitchen\tcellar\tlivingSpace\tlift\tnoRooms\tgarden\theating\tneighborhood]`,\n        - `y = 'baseRent'` (numerical)\n    - found `MAE = 265`\n5. 
Built the model training script using xgboost because it was the most accurate (lowest MAE).\n    - source: [server/train_model.py](server/train_model.py)\n6. Created a web server to serve the model through an API\n    - server app source: [server/](server/)\n    - server python file source: [server/serve.py](server/serve.py)\n    - note that the web server can do batch predictions to improve performance. The request payload accepts an array of data and returns an array of predictions in the same order.\n    - the web server catches some exceptions to return user-friendly error messages and the correct status code.\n    - the server can be started using vanilla Python or Gunicorn.\n    - see below how to start it and how to call it\n7. Created a Docker image with the web server\n    - source: [server/Dockerfile](server/Dockerfile)\n    - the container serves the API on port 9696, published on host port 9000 in the run example below\n    - see below how to build and run the docker image\n8. Deployed to DigitalOcean using the docker image.\n    - see below how to call ML-project1 running in cloud\n    - this was deployed manually due to lack of time to do proper CI\n\n# Steps to run the application.\n\n## Unzip the dataset\nUnzip `immo_data.csv.zip`; the CSV file is too big for GitHub.\n\n## Build final model\nRun the following commands:\n\n```bash\n(base) ➜  project1 git:(main) ✗ cd server \n\n(base) ➜  server git:(main) ✗ pipenv install\nInstalling dependencies from Pipfile.lock (0baaa4)...\n  🎃   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0/0 — 00:00:00\n\n(base) ➜  server git:(main) ✗ python train_model.py \nData file loaded. records count: 10406\ndoing validation with eta=0.6, max_depth=20\nmae: 239.929\nmae: 244.092\nmae: 244.963\nmae: 246.842\nmae: 241.170\nmae: 233.049\nmae: 247.643\nmae: 237.237\nmae: 229.175\nmae: 240.148\nvalidation mean mae=240.000, +-5.459\ntraining the final model. 
records count= 8324\ntest mae=245.7104721089857\nthe model is saved to model_xg_0.6_20.bin\n\n(base) ➜  server git:(main) ✗ ls\nmodel_xg_0.6_20.bin  Pipfile  Pipfile.lock  train_model.py\n```\n\n## Start the server using Python\n```bash\n(base) ➜  server git:(main) ✗ python serve.py\n * Serving Flask app \"Berlin rent\" (lazy loading)\n * Environment: production\n   WARNING: This is a development server. Do not use it in a production deployment.\n   Use a production WSGI server instead.\n * Debug mode: on\n * Running on http://0.0.0.0:9696/ (Press CTRL+C to quit)\n * Restarting with inotify reloader\n * Debugger is active!\n * Debugger PIN: 201-502-766\n```\n\n### Start the server using Gunicorn\n```bash\n(base) ➜  server git:(main) ✗ gunicorn --bind 0.0.0.0:9696 serve:app               \n[2021-12-12 18:24:27 +0100] [7726] [INFO] Starting gunicorn 20.1.0\n[2021-12-12 18:24:27 +0100] [7726] [INFO] Listening at: http://0.0.0.0:9696 (7726)\n[2021-12-12 18:24:27 +0100] [7726] [INFO] Using worker: sync\n[2021-12-12 18:24:27 +0100] [7728] [INFO] Booting worker with pid: 7728\n```\n\n## Request prediction using server API:\n```bash\n(base) ➜  ~ curl -v -X POST \\\n  http://localhost:9696/predict \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n        \"data\":[\n                {\n                    \"neighbourhood\": \"friedrichshain\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": true,\n                    \"hasKitchen\": true,\n                    \"cellar\": true,\n                    \"livingSpace\": 62,\n                    \"lift\": false,\n                    \"noRooms\": 2,\n                    \"garden\": false\n                },\n                {\n                    \"neighbourhood\": \"Steglitz\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": false,\n                    \"hasKitchen\": 
true,\n                    \"cellar\": true,\n                    \"livingSpace\":75,\n                    \"lift\": false,\n                    \"noRooms\": 3,\n                    \"garden\": true\n                }\n        ]\n}'\nNote: Unnecessary use of -X or --request, POST is already inferred.\n*   Trying 127.0.0.1:9696...\n* Connected to localhost (127.0.0.1) port 9696 (#0)\n\u003e POST /predict HTTP/1.1\n\u003e Host: localhost:9696\n\u003e User-Agent: curl/7.71.1\n\u003e Accept: */*\n\u003e Content-Type: application/json\n\u003e Content-Length: 530\n\u003e \n* upload completely sent off: 530 out of 530 bytes\n* Mark bundle as not supporting multiuse\n\u003c HTTP/1.1 200 OK\n\u003c Server: gunicorn\n\u003c Date: Sun, 12 Dec 2021 17:25:50 GMT\n\u003c Connection: close\n\u003c Content-Type: application/json\n\u003c Content-Length: 52\n\u003c \n{\n    \"prediction\":[\n        {\n            \"baseRent\": 907\n        },\n        {\n            \"baseRent\":1040\n        }\n    ]\n}\n```\n\n## Build docker image\n```bash\n(base) ➜  project1 git:(main) ✗ cd server \n\n(base) ➜  server git:(main) ✗ pipenv install\nPipfile.lock not found, creating...\nLocking [dev-packages] dependencies...\nLocking [packages] dependencies...\nBuilding requirements...\nResolving dependencies...\n✔ Success! \nUpdated Pipfile.lock (73e6b9)!\nInstalling dependencies from Pipfile.lock (73e6b9)...\n  🎃   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 3/3 — 00:00:00\n\n(base) ➜  server git:(main) docker build -t capstone_project:v0.1 . 
\nSending build context to Docker daemon  1.674MB\nStep 1/8 : FROM python:3.8.12-slim\n ---\u003e 32a5625aad35\nStep 2/8 : RUN pip install pipenv\n ---\u003e Using cache\n ---\u003e 262147d37546\nStep 3/8 : WORKDIR /app\n ---\u003e Using cache\n ---\u003e 153736e2bb7e\nStep 4/8 : COPY [\"Pipfile\", \"Pipfile.lock\", \"./\"]\n ---\u003e 9789f4690cef\nStep 5/8 : RUN pipenv install --system --deploy\n ---\u003e Running in c97d59df307c\nInstalling dependencies from Pipfile.lock (73e6b9)...\nRemoving intermediate container c97d59df307c\n ---\u003e a0da06768ded\nStep 6/8 : COPY [\"serve.py\", \"model_xg_0.6_20.bin\", \"./\"]\n ---\u003e 776a41c6989a\nStep 7/8 : EXPOSE 9696\n ---\u003e Running in 9ee1152f0d87\nRemoving intermediate container 9ee1152f0d87\n ---\u003e cbbecfa9784d\nStep 8/8 : ENTRYPOINT [\"gunicorn\", \"--bind=0.0.0.0:9696\", \"serve:app\"]\n ---\u003e Running in 7e5fdaa73af3\nRemoving intermediate container 7e5fdaa73af3\n ---\u003e 3b7063c50e92\nSuccessfully built 3b7063c50e92\nSuccessfully tagged capstone_project:v0.1\n```\n\n## Run docker image\n```bash\n(base) ➜  server git:(main) ✗ docker run -ti --rm -p 9000:9696 capstone_project:v0.1\n[2021-12-12 17:36:42 +0000] [1] [INFO] Starting gunicorn 20.1.0\n[2021-12-12 17:36:42 +0000] [1] [INFO] Listening at: http://0.0.0.0:9696 (1)\n[2021-12-12 17:36:42 +0000] [1] [INFO] Using worker: sync\n[2021-12-12 17:36:42 +0000] [7] [INFO] Booting worker with pid: 7\n```\n! 
notice Docker server API is exposed on port 9000\n\n## Call API on Dockerized server:\n```bash\n(base) ➜  ~ curl -v -X POST \\\n  http://localhost:9000/predict \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n        \"data\":[\n                {\n                    \"neighbourhood\": \"friedrichshain\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": true,\n                    \"hasKitchen\": true,\n                    \"cellar\": true,\n                    \"livingSpace\": 62,\n                    \"lift\": false,\n                    \"noRooms\": 2,\n                    \"garden\": false\n                },\n                {\n                    \"neighbourhood\": \"Steglitz\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": false,\n                    \"hasKitchen\": true,\n                    \"cellar\": true,\n                    \"livingSpace\":75,\n                    \"lift\": false,\n                    \"noRooms\": 3,\n                    \"garden\": true\n                }\n        ]\n}'\nNote: Unnecessary use of -X or --request, POST is already inferred.\n*   Trying 127.0.0.1:9000...\n* Connected to localhost (127.0.0.1) port 9000 (#0)\n\u003e POST /predict HTTP/1.1\n\u003e Host: localhost:9000\n\u003e User-Agent: curl/7.71.1\n\u003e Accept: */*\n\u003e Content-Type: application/json\n\u003e Content-Length: 530\n\u003e \n* upload completely sent off: 530 out of 530 bytes\n* Mark bundle as not supporting multiuse\n\u003c HTTP/1.1 200 OK\n\u003c Server: gunicorn\n\u003c Date: Sun, 12 Dec 2021 17:37:45 GMT\n\u003c Connection: close\n\u003c Content-Type: application/json\n\u003c Content-Length: 52\n\u003c \n{\n    \"prediction\":[\n        {\n            \"baseRent\": 907\n        },\n        {\n            \"baseRent\":1040\n        }\n    ]\n}\n```\n\n## Run docker image from my docker hub 
repository:\n- public Docker image: https://hub.docker.com/repository/docker/razorcd/capstone_project/general\n```bash\ndocker run -ti --rm -p 80:9696 razorcd/capstone_project:v0.1\n```\n\n\n## Access ML project deployed in DigitalOcean Cloud\n```bash\n(base) ➜  ~ curl -v -X POST http://206.189.61.226/predict \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n        \"data\":[\n                {\n                    \"neighbourhood\": \"friedrichshain\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": true,\n                    \"hasKitchen\": true,\n                    \"cellar\": true,\n                    \"livingSpace\": 62,\n                    \"lift\": false,\n                    \"noRooms\": 2,\n                    \"garden\": false\n                },\n                {\n                    \"neighbourhood\": \"Steglitz\",\n                    \"heating\": \"normal\",\n                    \"newlyConst\": false,\n                    \"balcony\": false,\n                    \"hasKitchen\": true,\n                    \"cellar\": true,\n                    \"livingSpace\":75,\n                    \"lift\": false,\n                    \"noRooms\": 3,\n                    \"garden\": true\n                }\n        ]\n}'\nNote: Unnecessary use of -X or --request, POST is already inferred.\n*   Trying 206.189.61.226:80...\n* Connected to 206.189.61.226 (206.189.61.226) port 80 (#0)\n\u003e POST /predict HTTP/1.1\n\u003e Host: 206.189.61.226\n\u003e User-Agent: curl/7.71.1\n\u003e Accept: */*\n\u003e Content-Type: application/json\n\u003e Content-Length: 884\n\u003e \n* upload completely sent off: 884 out of 884 bytes\n* Mark bundle as not supporting multiuse\n\u003c HTTP/1.1 200 OK\n\u003c Server: gunicorn\n\u003c Date: Sun, 12 Dec 2021 18:24:56 GMT\n\u003c Connection: close\n\u003c Content-Type: application/json\n\u003c Content-Length: 52\n\u003c 
\n{\"prediction\":[{\"baseRent\":907},{\"baseRent\":1040}]}\n```\n\n\n## TODO checklist:\n\n - [x] find dataset\n - [x] clean up data\n - [x] perform EDA (exploratory data analysis)\n - [x] prepare data for model training\n - [x] train with linear regression\n - [x] train with xgboost\n - [x] train with Keras/Tensorflow (optional)\n - [x] create server and dockerize\n - [x] deploy to cloud (optional)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazorcd%2Fml-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frazorcd%2Fml-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazorcd%2Fml-project/lists"}