{"id":13339521,"url":"https://github.com/kevinmfreire/wheres_waldo","last_synced_at":"2025-03-11T14:31:40.836Z","repository":{"id":37366997,"uuid":"504921145","full_name":"kevinmfreire/wheres_waldo","owner":"kevinmfreire","description":"This project was developed to identify the name, address and organization name within text.","archived":false,"fork":false,"pushed_at":"2023-01-04T03:49:02.000Z","size":247,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-24T02:30:14.359Z","etag":null,"topics":["data-structures","database","nlp-machine-learning","notebooks","pandas-dataframe","spacy-nlp","tensorflow2","text-classification","unit-testing","webscraping"],"latest_commit_sha":null,"homepage":"https://kevinmfreire-wheres-waldo-st-app-ikv0ya.streamlit.app/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kevinmfreire.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-18T18:27:44.000Z","updated_at":"2023-01-04T03:53:39.000Z","dependencies_parsed_at":"2023-02-01T19:15:27.624Z","dependency_job_id":null,"html_url":"https://github.com/kevinmfreire/wheres_waldo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevinmfreire%2Fwheres_waldo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevinmfreire%2Fwheres_waldo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevinmfreire%2Fwheres_waldo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/kevinmfreire%2Fwheres_waldo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kevinmfreire","download_url":"https://codeload.github.com/kevinmfreire/wheres_waldo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243051896,"owners_count":20228288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-structures","database","nlp-machine-learning","notebooks","pandas-dataframe","spacy-nlp","tensorflow2","text-classification","unit-testing","webscraping"],"created_at":"2024-07-29T19:20:21.515Z","updated_at":"2025-03-11T14:31:40.497Z","avatar_url":"https://github.com/kevinmfreire.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wheres_waldo\n\n![app](/img/app.png)\n\n### Requirements for this project\n* Python 3.8.x\n* Linux OS\n\n## Table of contents\n* [Overview](https://github.com/kevinmfreire/wheres_waldo#overview)\n* [Goals](https://github.com/kevinmfreire/wheres_waldo#goals)\n* [Practical Applications](https://github.com/kevinmfreire/wheres_waldo#practical-applications)\n* [Usage](https://github.com/kevinmfreire/wheres_waldo#usage)\n* [Conclusion](https://github.com/kevinmfreire/wheres_waldo#conclusion)\n* [Requirements](https://github.com/kevinmfreire/wheres_waldo#requirements)\n\n## Overview\nNatural Language Processing (NLP) models are popular worldwide because they can be used in many cases, such as language translation, speech-to-text or vice versa, fraud detection,\nor even the classification of highly sensitive 
data. In other words, NLP can make our lives easier.\n\nThis project extracts names, organizations and locations from news articles, specifically from NBC News.  It uses the spaCy pretrained model `en_core_web_sm`,\nwhich is a lightweight model for Named Entity Recognition (NER).  The model can extract much more than names, locations and organizations; it can also classify words as dates, laws, etc.\n\n## Goals\nThe following bullet points are the challenges we want to complete.\n\n* Design and implement an NLP model using TensorFlow 2.0 to identify names, locations, and organizations within a text.\n* Write a Python application that uses a web scraper to extract text from a news article (NBC News).\n* Use the created model to identify names, locations and organizations in the extracted text.\n* Store the results in a database.\n* Set up a unit test for the code.\n\n## Practical Applications\n* Extract information from certain articles on the web to detect privacy misconduct.\n* Analyze multiple articles to find rising trends (e.g. which company/person/location is mentioned most).\n* Quickly parse a resume to find the name, location and companies that the applicant was involved in.\n\n## Usage\n* Clone the repo:\n```\ngit clone https://github.com/kevinmfreire/wheres_waldo.git\n```\n* Set up a virtual environment:\n```\nvirtualenv .virtualenv/wheres_waldo\n```\n* Activate the virtual environment:\n```\nsource .virtualenv/wheres_waldo/bin/activate\n```\n* Install all requirements:\n```\npip install -r requirements.txt\n```\n* If you want to see how the web scraping works, go to the `src/` directory and run:\n```\npython ws_nbc.py\n```\nIt will save the dataframe as a `.csv` file to `data/ws_data/` so you can take a look at the output.\n\n* If you would like to see how the model works, go to the `src/` directory and run:\n```\npython ner_model.py\n```\nThe output of the model is saved under `data/model_output/` as a `.json` and 
`.csv` file.\n* To see how the model works on a single article and search for the NAME, ORGANIZATION, and LOCATION entities mentioned in it, run:\n```\npython main.py\n```\nOnce you run `main.py`, it will ask you for an NBC News article.  Navigate to your NBC News article of interest and copy/paste the link into your CLI.\nIt will then ask you to input an SQL search query for the NER extraction of the article.  Example search queries:\n```\nSELECT * FROM article\nSELECT NAME FROM article\nSELECT ORGANIZATION FROM article\nSELECT LOCATION FROM article\n```\n* To see how the model works on multiple articles and search for the NAME, ORGANIZATION, and LOCATION entities mentioned in them, run:\n```\npython main.py --multi_article True\n```\nThe default number of articles is 5; if you want more, add the `--num_articles` argument followed by the desired number.\n* To run a basic unit test, run:\n```\npython -m unittest basic_test.py\n```\n\n## Conclusion\nThe model is lightweight, so it doesn't classify the text perfectly.  By observing the output of the model in `data/model_output/output.json` you can see that it made a few mistakes.  Nevertheless, it works pretty well.  \nThe model can definitely be improved.  Regarding the output, I've decided to tie the link of each article to the model's results for that article, so that anyone who\nwants to see what the article talks about based on the outputs can easily access it. Keep in mind that the web scraping results will differ from day to day because NBC News updates its content, so the results will be different for everyone.  
Please feel free to contribute, and if you run into any issues, feel free to reach out.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevinmfreire%2Fwheres_waldo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkevinmfreire%2Fwheres_waldo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevinmfreire%2Fwheres_waldo/lists"}