{"id":18339986,"url":"https://github.com/mxagar/disaster_response_pipeline","last_synced_at":"2026-04-19T14:34:33.550Z","repository":{"id":131493653,"uuid":"611252082","full_name":"mxagar/disaster_response_pipeline","owner":"mxagar","description":"This repository contains a Machine Learning (ML) pipeline which predicts the response to messages in disaster situations. An ETL pipeline is also developed and everything is deployed with a web app based in Flask.","archived":false,"fork":false,"pushed_at":"2023-03-13T15:16:40.000Z","size":6943,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-23T01:49:32.007Z","etag":null,"topics":["classification","etl-pipeline","flask","machine-learning","machine-learning-pipeline","pipeline","sqlite"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mxagar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-08T12:51:10.000Z","updated_at":"2023-03-08T16:10:39.000Z","dependencies_parsed_at":"2023-05-14T17:30:40.065Z","dependency_job_id":null,"html_url":"https://github.com/mxagar/disaster_response_pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mxagar/disaster_response_pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Fdisaster_response_pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Fdisaster_resp
onse_pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Fdisaster_response_pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Fdisaster_response_pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mxagar","download_url":"https://codeload.github.com/mxagar/disaster_response_pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mxagar%2Fdisaster_response_pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32009974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","etl-pipeline","flask","machine-learning","machine-learning-pipeline","pipeline","sqlite"],"created_at":"2024-11-05T20:20:30.063Z","updated_at":"2026-04-19T14:34:33.519Z","avatar_url":"https://github.com/mxagar.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Disaster Response Pipeline\n\nThis repository contains a Machine Learning (ML) pipeline which predicts message categories in disaster situations. 
It is precisely during disaster situations that response organizations have the least capacity to evaluate and react properly to each message that reaches them (via direct contact, social media, etc.). In this project, NLP is applied and a classification model is trained so that the category of each message can be predicted automatically; then, the messages can be directed to the appropriate relief agencies. In total, 36 message categories are predicted, which are related to the type of possible emergency, e.g., `earthquake`, `fire`, `missing_people`, etc.\n\nAll in all, the following methods/techniques are implemented and documented:\n\n- [x] An ETL pipeline (Extract, Transform, Load).\n- [x] A Machine Learning pipeline which applies NLP to messages and predicts message categories.\n- [x] Testing (Pytest) and linting.\n- [x] Error checks, data validation with Pydantic and exception handling.\n- [x] Logging.\n- [x] Continuous Integration with GitHub Actions.\n- [x] Python packaging.\n- [x] Containerization (Docker).\n- [x] Flask web app, deployed locally.\n\nThe Flask web app enables interaction, i.e., the user inputs a text message and the trained classifier predicts candidate categories:\n\n\u003cp style=\"text-align:center\"\u003e\n  \u003cimg src=\"./assets/snapshot_prediction.jpg\" alt=\"A snapshot of the disaster response app.\" width=1000px\u003e\n\u003c/p\u003e\n\nI took the [`starter`](starter) code for this project from the [Udacity Data Scientist Nanodegree](https://www.udacity.com/course/data-scientist-nanodegree--nd025) and modified it to the present form, which deviates significantly from the original version.\n\n## Table of Contents\n\n- [Disaster Response Pipeline](#disaster-response-pipeline)\n  - [Table of Contents](#table-of-contents)\n  - [How to Use This Project](#how-to-use-this-project)\n    - [Installing Dependencies for Custom Environments](#installing-dependencies-for-custom-environments)\n  - [Dataset](#dataset)\n  - [Notes on the 
Implementation](#notes-on-the-implementation)\n    - [The `disaster_response` Package](#the-disaster_response-package)\n      - [ETL Pipeline](#etl-pipeline)\n      - [Machine Learning Training Pipeline](#machine-learning-training-pipeline)\n    - [Flask Web App](#flask-web-app)\n    - [Tests](#tests)\n    - [Continuous Integration with GitHub Actions](#continuous-integration-with-github-actions)\n    - [Docker Container](#docker-container)\n  - [Next Steps, Improvements](#next-steps-improvements)\n  - [References and Links](#references-and-links)\n  - [Authorship](#authorship)\n\n## How to Use This Project\n\nThe directory of the project consists of the following files:\n\n```\n.\n├── Instructions.md                             # Original challenge/project instructions\n├── README.md                                   # This file\n├── app                                         # Web app\n│   ├── run.py                                  # Implementation of the Flask app\n│   └── templates                               # Web app HTML/CSS templates\n│       ├── go.html\n│       └── master.html\n├── assets/                                     # Images, etc.\n├── data                                        # Datasets\n│   ├── DisasterResponse.db                     # Generated database\n│   ├── categories.csv                          # Categories dataset\n│   └── messages.csv                            # Messages dataset\n├── disaster_response                           # Package\n│   ├── __init__.py\n│   ├── file_manager.py                         # General structures, loading/persistence manager\n│   ├── process_data.py                         # ETL pipeline\n│   └── train_classifier.py                     # ML pipeline and training\n├── main.py                                     # Script which runs both pipelines: ETL and ML (training)\n├── models                                      # Inference and evaluation artifacts\n│   ├── classifier.pkl                         
 # Trained pipeline (not committed)\n│   └── evaluation_report.txt                   # Evaluation metrics: F1, etc.\n├── disaster_response_pipeline.log              # Logs\n├── notebooks                                   # Research notebooks\n│   ├── ETL_Pipeline_Preparation.ipynb\n│   └── ML_Pipeline_Preparation.ipynb\n├── config.yaml                                 # Configuration file\n├── conda.yaml                                  # Conda environment\n├── requirements.txt                            # Dependencies for pip\n├── Dockerfile                                  # Docker image definition\n├── docker-compose.yaml                         # Docker compose YAML\n├── run.sh                                      # Execution script for Docker\n├── setup.py                                    # Package setup\n├── starter/                                    # Original starter material\n└── tests                                       # Tests\n    ├── __init__.py\n    ├── conftest.py                             # Pytest configuration, fixtures, etc.\n    └── test_library.py                         # disaster_response package tests\n```\n\nTo run the pipelines and the web app, first the dependencies need to be installed, as explained in the [next section](#installing-dependencies-for-custom-environments). 
Then, we can execute the following commands:\n\n```bash\n# This runs the ETL pipeline, which creates the DisasterResponse.db database\n# It also runs the ML pipeline, which trains the models and outputs classifier.pkl\n# WARNING: The training might take some hours, because hyperparameter search\n# with cross-validation is performed.\npython main.py\n\n# Spin up the web app\n# Wait 10 seconds and open http://localhost:3000\n# We see some visualizations there; if we enter a message,\n# we should get the predicted categories.\npython app/run.py\n```\n\nNotes: \n\n- [`main.py`](./main.py) uses [`config.yaml`](./config.yaml); that configuration file defines all necessary parameters for both pipelines (ETL and ML training). However, some parameters can be overridden via CLI arguments \u0026mdash; try `python main.py --help` for more information.\n- :warning: The training might take some hours, because hyperparameter search with cross-validation is performed.\n- The outputs from executing both pipelines are the following:\n  - `DisasterResponse.db`: cleaned and merged SQLite database, product of the ETL process.\n  - `classifier.pkl`: trained classifier, used by the web app.\n  - `evaluation_report.txt`: evaluation metrics of the trained classifier.\n\n### Installing Dependencies for Custom Environments\n\nYou can create an environment with [conda](https://docs.conda.io/en/latest/) and install the dependencies with the following recipe:\n\n```bash\n# Create environment with YAML, incl. packages\nconda env create -f conda.yaml\nconda activate dis-res\npip install . # install the disaster_response package\n\n# Alternatively, if you prefer, create your own environment\n# and install the dependencies with pip\nconda create --name dis-res pip\nconda activate dis-res\npip install -r requirements.txt\npip install . 
# install the disaster_response package\n```\n\nNote that both [`conda.yaml`](./conda.yaml) and [`requirements.txt`](./requirements.txt) contain the same packages; however, [`requirements.txt`](./requirements.txt) has the specific package versions I have used with Python `3.9.16`.\n\n## Dataset\n\nThe dataset is contained in the folder [`data`](data), and it consists of the following files:\n\n- `messages.csv`: a CSV of shape `(26248, 4)`, which contains the help messages (in original and translated form) as well as information on the source.\n- `categories.csv`: a CSV of shape `(26248, 2)` which matches each message id from `messages.csv` with 36 categories, related to the type of disaster message. All categories are in text form in one column. All those target categories are listed in [`config.yaml`](config.yaml).\n\nThe [`notebooks`](notebooks) provide a good first exposure to the contents of the datasets. After running the [ETL pipeline](#etl-pipeline), the SQLite database `DisasterResponse.db` is created, which contains a clean merge of the aforementioned files.\n\n## Notes on the Implementation\n\nIn the following subsections, information on different aspects of the implementation is provided.\n\n### The `disaster_response` Package\n\nThe Machine Learning (ML) functionalities are implemented in this package, which can be used as shown in [`main.py`](./main.py). 
The package consists of the following files:\n\n- [`file_manager.py`](./disaster_response/file_manager.py): loading, validation and persistence manager.\n- [`process_data.py`](./disaster_response/process_data.py): ETL pipeline.\n- [`train_classifier.py`](./disaster_response/train_classifier.py): ML/training pipeline.\n\nHaving a file loading/validation/persistence manager makes the other modules clearer, abstracts the access to 3rd party modules and improves maintainability.\n\n#### ETL Pipeline\n\nThe ETL (Extract, Transform, Load) pipeline implemented in [`process_data.py`](./disaster_response/process_data.py) performs the following tasks:\n\n- Load the source CSV datasets from [`data`](data).\n- Clean and merge the datasets:\n  - Transform categories into booleans.\n  - Check that category values are correct.\n  - Drop duplicates and NaNs.\n- Save the processed dataset to the SQLite database `DisasterResponse.db` (or the filename defined in `config.yaml`).\n\nIf we install [`sqlite3`](https://www.tutorialspoint.com/sqlite/sqlite_installation.htm), we can query the SQLite database `DisasterResponse.db` produced by the ETL pipeline with SQL from the CLI:\n\n```bash\ncd data\n# Enter SQLite terminal\nsqlite3\n# Open a DB\n.open DisasterResponse.db\n# Show tables\n.tables # Message\n# Get table info/columns \u0026 types\nPRAGMA table_info(Message);\n# Get first 5 entries\nSELECT * FROM Message LIMIT 5;\n# ...\n# Exit SQLite CLI terminal\n.quit\n```\n\nFor more information on how to interact with relational/SQL databases using Python, visit my [sql_guide](https://github.com/mxagar/sql_guide).\n\n#### Machine Learning Training Pipeline\n\nThe ML training pipeline implemented in [`train_classifier.py`](./disaster_response/train_classifier.py) loads the dataset from `DisasterResponse.db` and fits a `RandomForestClassifier` using `GridSearchCV`. 
Since we have multiple targets, the random forest is wrapped with a `MultiOutputClassifier`; as stated in the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html), \n\n\u003e The strategy [of a `MultiOutputClassifier`] consists of fitting one classifier per target.\n\nThus, the training might take some hours, especially because we perform cross-validation and hyperparameter tuning. The final output consists of two files placed in the folder [`models`](models):\n\n- `classifier.pkl`: trained classifier, serialized as a pickle.\n- `evaluation_report.txt`: evaluation metrics of the trained classifier; for each target a classification report is provided (with F1 metrics).\n\nThe `classifier.pkl` contains not only the model, but also the data processing pipeline that transforms the features and the targets to train the model. More details on that can be found in the associated function `build_model()` from [`train_classifier.py`](./disaster_response/train_classifier.py).\n\n:construction: Notes:\n\n- At this stage, the focus of the project is not on optimizing the model but on creating an MVP of the app; future revisions should improve the model performance.\n- The message category distribution (i.e., the target counts) is very imbalanced, as shown in the next figure (and future work should address that, too):\n\n\u003cp style=\"text-align:center\"\u003e\n  \u003cimg src=\"./assets/targets_distribution.png\" alt=\"Message category distribution (targets).\" width=400\u003e\n\u003c/p\u003e\n\n\n### Flask Web App\n\nThe [Flask](https://flask.palletsprojects.com/en/2.2.x/) web app is implemented in [`app/run.py`](./app/run.py). 
It consists of two routes that render one page each:\n\n- The index/default page visualizes 3 plots of the data with [Plotly](https://plotly.com/); it also offers an input box for the user to insert a message to be classified.\n- The classification page appears when the user hits the \"classify\" button and the model predicts the categories.\n\nMore information on how to create Flask dashboards with Plotly: [data_science_udacity](https://github.com/mxagar/data_science_udacity/blob/main/02_SoftwareEngineering/DSND_SWEngineering.md#6-web-development).\n\n### Tests\n\nOnce we have [Pytest](https://docs.pytest.org/en/7.2.x/) installed, we can run the tests as follows:\n\n```bash\npytest tests\n```\n\nThe [`tests`](./tests) folder contains these two files:\n\n- [`tests/conftest.py`](./tests/conftest.py): configuration and fixtures definition.\n- [`tests/test_library.py`](./tests/test_library.py): tests of functions defined in the `disaster_response` package.\n\n:construction: Currently, only a few shallow tests are implemented; even though the loading/persistence module [`file_manager.py`](./disaster_response/file_manager.py) validates many objects with [pydantic](https://docs.pydantic.dev/) and error-detection checks, the tests should be extended.\n\n### Continuous Integration with GitHub Actions\n\nI have implemented Continuous Integration (CI) using GitHub Actions. The workflow file [`python-app.yml`](.github/workflows/python-app.yml) performs the following tasks every time we push changes to the `main` branch:\n\n- Requirements are installed.\n- `flake8` is run to lint the code; note that [`.flake8`](.flake8) contains the files/folders to be ignored.\n- Tests are run as explained above: `pytest tests`.\n\n### Docker Container\n\nContainerization is a common step before deploying/shipping an application. 
Thanks to the simple [`Dockerfile`](./Dockerfile) in the repository, we can create an image of the web app and run it as a container as follows:\n\n```bash\n# Build the Dockerfile to create the image\n# docker build -t \u003cimage_name[:version]\u003e \u003cpath/to/Dockerfile\u003e\ndocker build -t disaster_response_app:latest .\n \n# Check the image is there: watch the size (e.g., ~1GB)\ndocker image ls\n\n# Run the container locally from a built image\n# Remember to: forward ports (-p) and pass PORT env variable (-e), because run.sh expects it!\n# Optional: \n# -d to detach/get the shell back,\n# --name if we want to choose container name (else, one randomly chosen)\n# --rm: automatically remove container after finishing (irrelevant in our case, but...)\ndocker run -d --rm -p 3000:3000 -e PORT=3000 --name disaster_response_app disaster_response_app:latest\n\n# Check the API locally: open the browser\n#   WAIT 30 seconds...\n#   http://localhost:3000\n#   Use the web app\n \n# Check the running containers: check the name/id of our container,\n# e.g., disaster_response_app\ndocker container ls\ndocker ps\n\n# Get a terminal into the container: in general, BAD practice\n# docker exec -it \u003cid|name\u003e sh\ndocker exec -it disaster_response_app sh\n# (we get inside)\ncd /opt/disaster_response_pipeline\nls\ncat disaster_response_pipeline.log\nexit\n\n# Stop container and remove it (erase all files in it, etc.)\n# docker stop \u003cid/name\u003e\n# docker rm \u003cid/name\u003e\ndocker stop disaster_response_app\n# Only needed if the container was started without --rm\ndocker rm disaster_response_app\n```\n\nAlternatively, I have written a [`docker-compose.yaml`](./docker-compose.yaml) file which spins up the one-container service with the required parameters:\n\n```bash\n# Run container(s), detached; local docker-compose.yaml is used\ndocker-compose up -d\n\n# Check containers, logs\ndocker-compose ps\ndocker-compose logs\n\n# Stop containers\ndocker-compose down\n```\n\nNote: in order to keep image size in line, 
[`.dockerignore`](.dockerignore) lists all files that can be excluded, similar to `.gitignore`.\n\n## Next Steps, Improvements\n\n- [x] Add logging.\n- [x] Lint with `flake8` and `pylint`.\n- [ ] Deploy it, e.g., to Heroku or AWS; another example project in which I have deployed the app that way: [census_model_deployment_fastapi](https://github.com/mxagar/census_model_deployment_fastapi).\n- [ ] Extend tests; currently, the test package contains very few tests that serve as a blueprint for further implementations.\n- [ ] Add type hints to `process_data.py` and `train_classifier.py`; currently type hints and `pydantic` are used only in `file_manager.py` to clearly define loading and persistence functionalities and to validate the objects they handle.\n- [ ] Properly optimize the machine learning model, improving its performance: \n  - [ ] Try alternative models.\n  - [ ] Perform a thorough hyperparameter tuning (e.g., with [Optuna](https://optuna.org)).\n- [ ] Address the imbalanced nature of the dataset.\n- [ ] Add more visualizations to the web app.\n- [ ] Based on the detected categories, suggest organizations to connect to.\n- [ ] Improve the front-end design.\n\n## References and Links\n\n- My personal notes on the [Udacity MLOps](https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821) nanodegree: [mlops_udacity](https://github.com/mxagar/mlops_udacity).\n- My personal notes on the [Udacity Data Science](https://www.udacity.com/course/data-scientist-nanodegree--nd025) nanodegree: [data_science_udacity](https://github.com/mxagar/data_science_udacity)\n- Notes on how to transform research code into production-level packages: [customer_churn_production](https://github.com/mxagar/customer_churn_production).\n- My summary of data processing and modeling techniques: [eda_fe_summary](https://github.com/mxagar/eda_fe_summary).\n\n## Authorship\n\nMikel Sagardia, 2022.  
\nNo guarantees.\n\nIf you find this repository useful, you're free to use it, but please link back to the original source.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxagar%2Fdisaster_response_pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmxagar%2Fdisaster_response_pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmxagar%2Fdisaster_response_pipeline/lists"}