{"id":23922610,"url":"https://github.com/datenhahn/disaster-response-pipeline-project","last_synced_at":"2025-10-13T11:05:01.123Z","repository":{"id":170556894,"uuid":"646706131","full_name":"datenhahn/disaster-response-pipeline-project","owner":"datenhahn","description":"The goal of this project is to build a Natural Language Processing (NLP) model to categorize messages sent during disasters.","archived":false,"fork":false,"pushed_at":"2023-05-31T20:48:46.000Z","size":4713,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-13T11:04:21.244Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datenhahn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-05-29T06:45:20.000Z","updated_at":"2023-10-06T17:00:31.000Z","dependencies_parsed_at":"2024-01-04T16:44:04.052Z","dependency_job_id":null,"html_url":"https://github.com/datenhahn/disaster-response-pipeline-project","commit_stats":null,"previous_names":["ecodia/disaster-response-pipeline-project","datenhahn/disaster-response-pipeline-project"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/datenhahn/disaster-response-pipeline-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenhahn%2Fdisaster-response-pipeline-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenhahn%2Fdisaster-response-pipeline-project/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/datenhahn%2Fdisaster-response-pipeline-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenhahn%2Fdisaster-response-pipeline-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datenhahn","download_url":"https://codeload.github.com/datenhahn/disaster-response-pipeline-project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datenhahn%2Fdisaster-response-pipeline-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279014750,"owners_count":26085593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-05T17:15:24.531Z","updated_at":"2025-10-13T11:05:01.105Z","avatar_url":"https://github.com/datenhahn.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Project: Disaster Response Pipeline\n\n![Flooded houses as image for disaster](.assets/disaster.png)\n\nThe goal of this project is to build a Natural Language Processing (NLP) model to categorize messages sent during disasters. The model is trained on a dataset provided by Appen (formerly Figure Eight) containing real messages that were sent during disaster events. 
The model is then used to classify new messages.\n\nDemo (not permanently available):\nhttp://disaster-response-project.ecodia.de/\n\nThe project consists of three parts:\n\n* ETL Pipeline: Loads the messages and categories datasets, merges the two datasets, cleans the data, and stores it in a SQLite database.\n* ML Pipeline: Loads data from the SQLite database, splits the dataset into training and test sets, builds a text processing and machine learning pipeline and exports the final model as a pickle file.\n* Flask Web App: Loads the SQLite database and the trained model and provides a web interface to classify new messages.\n\n**Quickstart (Running the Web-App)**\n\nThe webapp requires at least 6GB of RAM to run.\n\nTo run the project with the pretrained model and prepared database, execute:\n\n```\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\ncurl -L https://github.com/ecodia/disaster-response-pipeline-project/releases/download/model/classifier.pkl -o models/classifier.pkl\nPYTHONPATH=$PWD python app/run.py\n```\n\nGo to http://0.0.0.0:3001/ to access the webapp.\n\n**Training the Model**\n\nThe Quickstart section above uses a pretrained model and a prepared database.\n\nTo train the model yourself, follow these steps, which are explained in more detail below.\n\n* Set up a virtual environment and install the required libraries.\n* Prepare the message database (ETL Pipeline).\n* Train the model (ML Pipeline).\n\n## Screenshots\n\nThe start page of the webapp.\n\n![Screenshot of the webapp](.assets/screenshot-index-page.png)\n\nThe result page of a query.\n\n![Screenshot of a query](.assets/screenshot-query.png)\n\n## Technical Requirements\n\nTested with the following versions, but might also work with other versions.\n\n* Python 3.10.6\n\nLibraries:\n\n* pandas==2.0.1\n* sqlalchemy==2.0.15\n* jupyter==1.0.0\n* nltk==3.8.1\n* scikit-learn==1.2.2\n* plotly==5.14.1\n* flask==2.3.2\n* pytest==7.3.1\n\n### 
Hardware Requirements\n\nThe webapp requires at least 6GB of RAM to run.\n\n## About the Dataset\n\nThe dataset contains real messages that were sent during disaster events. The messages were collected by Appen (formerly Figure Eight) and were pre-labelled by them into 36 categories. The dataset contains the original message as well as the translated message.\n\nThe dataset contains 26,248 messages in total from different sources.\n\n* Direct Communication: 10766\n* News: 13054\n* Social Media: 2396\n\nThe following categories are available:\n\nFloods, Hospitals, Aid Related, Child Alone, Related, Water, Food, Weather Related, Fire, Shops, Direct Report, Military, Aid Centers, Other Aid, Cold, Shelter, Storm, Other Weather, Offer, Money, Missing People, Tools, Search And Rescue, Earthquake, Death, Buildings, Other Infrastructure, Infrastructure Related, Refugees, Request, Medical Products, Security, Clothing, Medical Help, Transport, Electricity\n\n## Training the Model\n\n### Setup Virtual Environment\n\n```bash\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\n### Prepare Message Database\n\nMerge the disaster messages and the manually classified categories into one database.\n\nAfter this step the file `data/DisasterResponse.db` should have been created.\n\n```bash\ncd data\npython ./process_data.py disaster_messages.csv disaster_categories.csv DisasterResponse.db\n```\n\n### Train the Model\n\nThis section describes how to train a model with the parameters used in the web app, and how to explore different parameter combinations\nto find a better model.\n\n_Make sure you have prepared and activated the virtual environment before you start._\n\n#### A) Train the model with the training pipeline\n\nThe `train_classifier.py` script trains the model with data\nfrom the SQLite database prepared in the previous step.\n\nIt takes two arguments: the path to the SQLite database and the path to the pickle file to store the 
model.\n\n```bash\ncd models\npython ./train_classifier.py ../data/DisasterResponse.db classifier.pkl\n```\n\n#### B) Explore different parameter combinations or try out new models\n\nThe `models` folder also contains a Jupyter notebook `ML Pipeline Preparation.ipynb` that can be used to explore different parameter combinations or try out new models.\n\nDuring development the following parameters were tested with\nthe notebook in a grid search:\n\n```\nfrom joblib import parallel_backend\nfrom sklearn.model_selection import GridSearchCV\n\n# Assumes pipeline, cpus, X_train and Y_train are already defined (e.g. earlier in the notebook)\nparameters = {\n    'vect__max_df': (0.8, 0.9, 1.0),\n    'clf__estimator__n_estimators': [50, 100, 200],\n}\n\nwith parallel_backend('multiprocessing'):\n    # Initialize GridSearchCV\n    cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=cpus, verbose=2)\n\n    # Fit and tune model\n    cv.fit(X_train, Y_train)\n```\n\nIt turned out that the default parameters of the CountVectorizer and the RandomForestClassifier performed best among the tested combinations.\n\n```\ncv.best_params_\n{'clf__estimator__n_estimators': 100, 'vect__max_df': 1.0}\n```\n\nFeel free to experiment with other vectorizers or classifiers.\n\n\n## License\n\nCopyright 2023 Ecodia GmbH \u0026 Co. 
KG\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatenhahn%2Fdisaster-response-pipeline-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatenhahn%2Fdisaster-response-pipeline-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatenhahn%2Fdisaster-response-pipeline-project/lists"}