https://github.com/ejw-data/web-scraping-nasa
Uses Selenium, Flask, and MongoDB to create a news aggregator of NASA information.
- Host: GitHub
- URL: https://github.com/ejw-data/web-scraping-nasa
- Owner: ejw-data
- Created: 2019-11-10T06:27:12.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-06-23T16:31:52.000Z (almost 4 years ago)
- Last Synced: 2025-01-22T06:47:15.517Z (over 1 year ago)
- Topics: beautifulsoup4, flask, mongodb, pandas, python, requests, selenium, splinter
- Language: Jupyter Notebook
- Homepage:
- Size: 10.7 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
# web-scraping-nasa
Author: Erin James Wills, ejw.data@gmail.com

Photo by [NASA](https://unsplash.com/@nasa?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/space?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)
## Overview
This repo uses Requests, Splinter (with Selenium), and BeautifulSoup to collect data from several NASA websites. The information collected includes the latest news stories, images, and tables related to Mars. The web scraping for this project was tested in a Jupyter Notebook called `mission_to_mars.ipynb`; no other tools are needed to run that logic. See the instructions below for details.
To create a user-friendly version, the scrape logic was moved into a Python script (`scrape_mars.py`) that is called by a Flask app (`app.py`). The app serves a simple webpage with a button that triggers the scrape, and all collected data is stored in a MongoDB collection. The Flask app can be run locally if you also have MongoDB installed; instructions are provided below.
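The scrape-and-store flow described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: `scrape()` and `store()` are hypothetical stand-ins for the logic in `scrape_mars.py` and the MongoDB upsert in `app.py`, and a plain dict stands in for the `listings` collection.

```python
def scrape():
    """Stand-in for the scrape logic in scrape_mars.py; returns collected data."""
    return {
        "news_title": "Example Mars headline",
        "featured_image_url": "https://example.com/mars.jpg",
    }

def store(collection, data):
    """Mimics collection.update_one({}, {"$set": data}, upsert=True) in pymongo."""
    collection.update(data)
    return collection

listings = {}              # stand-in for mongo.db.listings
store(listings, scrape())
print(listings["news_title"])
```

In the real app, clicking the scrape button triggers a Flask route that runs the scrape and upserts the result into MongoDB before re-rendering the page.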
## Technologies
* Python
* Flask
* MongoDB
* Splinter (with Selenium)
* BeautifulSoup
* Requests
## Data Source
The dataset is generated by scraping the following websites:
* [https://mars.nasa.gov/news/](https://mars.nasa.gov/news/)
* [https://www.jpl.nasa.gov](https://www.jpl.nasa.gov)
* [https://twitter.com/marswxreport?lang=en](https://twitter.com/marswxreport?lang=en)
* [https://space-facts.com/mars/](https://space-facts.com/mars/)
* [https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars](https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars)
**Note:** The links above could change and break the project. This project does not emphasize long-term reliability; instead, it tests out different technologies, such as when to use Splinter versus BeautifulSoup.
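The BeautifulSoup side of that comparison amounts to parsing fetched HTML and pulling out elements by class. The snippet below is a hedged illustration using made-up markup; the real NASA news page uses its own (and changing) structure and class names.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the live page's markup differs and may change.
html = """
<ul class="item_list">
  <li class="slide">
    <div class="content_title">Mars Rover Sends New Images</div>
    <div class="article_teaser_body">The rover photographed new terrain.</div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
news_title = soup.find("div", class_="content_title").get_text(strip=True)
news_teaser = soup.find("div", class_="article_teaser_body").get_text(strip=True)
print(news_title)
```

Splinter, by contrast, drives a real browser, which is useful when the page needs JavaScript to render before this kind of parsing can happen.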
## Setup and Installation
1. Your environment needs the following:
* Python 3.6+
* pandas
* webdriver-manager
* splinter
* beautifulsoup4
* requests
* flask
* flask-pymongo
(`time` is part of the Python standard library and needs no separate install.)
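Assuming a pip-based environment, the dependencies above can be installed with something like:

```shell
pip install pandas webdriver-manager splinter beautifulsoup4 requests flask flask-pymongo
```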
1. Verify that your MongoDB server is running by starting MongoDB Compass and connecting to the service.
1. Activate your environment
1. Clone the repo to your local machine
1. Start Jupyter Notebook within the environment from the repo
1. To run and/or troubleshoot the scraping, run `mission_to_mars.ipynb`.
1. To run the Flask app:
* Open a new terminal with a present working directory being your repo folder
* Activate your environment
* In terminal run: `python app.py`
* You should be able to open a browser and view the scrape website at `http://localhost:5000`.
* The scrape will run and update the MongoDB database and webpage after clicking the scrape button.
1. The MongoDB database is created automatically if it does not exist; it will be named `mars_app`, and the collection will be named `listings`.
