Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chaudharypraveen98/scraper
A Python script built on the Flask and FastAPI frameworks. Ngrok is used to expose the local server to the web.
cli fastapi flask pandas requestshtml scraping script
Last synced: 5 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/chaudharypraveen98/scraper
- Owner: chaudharypraveen98
- Created: 2020-06-13T17:56:50.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-09T16:52:10.000Z (over 3 years ago)
- Last Synced: 2023-03-06T20:48:52.476Z (almost 2 years ago)
- Topics: cli, fastapi, flask, pandas, requestshtml, scraping, script
- Language: Python
- Homepage:
- Size: 180 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Scraper
A scraper script, built on Python's Flask and FastAPI frameworks, that scrapes the BoxOfficeMojo website. Ngrok is used to expose the local server to the web.
### Web preview
### What are we going to build?
1. Create a script that scrapes BoxOfficeMojo and exports the data to a CSV file using the [pandas library](https://pandas.pydata.org/), building a DataFrame from the scraped table.
2. Set up a [Flask](https://flask.palletsprojects.com/en/2.0.x/) server so that periodic scraping can be triggered by a POST request to a particular URL endpoint at a certain interval.
3. Make a URL endpoint (with FastAPI) that serves the data scraped so far.
4. Make our localhost server accessible from anywhere using [Ngrok](https://ngrok.com/).
5. Lastly, write a PowerShell (.ps1) script so the servers can be started from anywhere, whether Command Prompt or Windows PowerShell.
## Step 1 -> Writing the scraper and exporting data (scraper.py)
### Writing scraper
We will use requests-html to scrape the data with the help of CSS selectors. But what are selectors/locators?
A CSS Selector is a combination of an element selector and a value which identifies the web element within a web page.
The choice of locator depends largely on your Application Under Test.
Id
An element's id is written in XPath as `[@id='example']` and in CSS as `#`. IDs must be unique within the DOM.
Examples:
```
XPath: //div[@id='example']
CSS: #example
```

Element Type
The previous example showed `//div` in the XPath. That is the element type, which could be `input` for a text box or button, `img` for an image, or `a` for a link.
```
XPath: //input
CSS: input
```

Direct Child
HTML pages are structured like XML, with children nested inside parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child in XPath is defined by the use of "/", while in CSS it's defined using ">".
Examples:
```
XPath: //div/a
CSS: div > a
```Child or Sub-Child
Writing nested divs can get tiring, and result in code that is brittle. Sometimes you expect the code to change, or want to skip layers. If an element could be inside another or one of its children, it's written in XPath using "//" and in CSS just with a space.
Examples:
```
XPath: //div//a
CSS: div a
```

Class
For classes, things are pretty similar: in XPath it's `[@class='example']`, while in CSS it's just `.`
Examples:
```
XPath: //div[@class='example']
CSS: .example
```
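These same CSS selectors are what requests-html's `find()` method accepts. Below is a minimal sketch of how that looks in Python; the inline markup and variable names are illustrative only, not taken from BoxOfficeMojo:
```
from requests_html import HTML

# illustrative markup, not from BoxOfficeMojo
doc = HTML(html="<div id='example' class='example'><a href='/movies'>Movies</a></div>")

by_id = doc.find("#example", first=True)   # select by id
links = doc.find("div.example > a")        # class + direct-child selector
print(by_id.attrs["id"], links[0].text)    # -> example Movies
```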
Libraries Required:
- requests-html
- Requests
- Pandas
So, how do we install them?
Run the following commands in a terminal:
```
pip install requests-html
pip install requests
pip install pandas
```

Getting the HTML code using the Requests library
The function accepts a filename as a keyword argument and, if `save` is true, writes the HTML to that file; by default nothing is saved.
We make a GET request and check the response status code: a status of 200 means OK.
The HTML text is then fed into the `pharse_and_extract` function.
```
import requests


def url_to_text(url, filename="world.html", save=False):
    # fetch the page and return its HTML text; optionally save it to a file
    r = requests.get(url)
    if r.status_code == 200:
        html_text = r.text
        if save:
            with open(filename, 'w') as f:
                f.write(html_text)
        return html_text
    return None
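# Usage sketch (not part of the original script): the BoxOfficeMojo URL and the
# filename below are assumptions for illustration.
# html = url_to_text("https://www.boxofficemojo.com/year/2020/", filename="2020.html", save=True)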
```

HTML Parsing
The response text is parsed with the requests-html HTML parser.
```
from requests_html import HTML


def pharse_and_extract(url, name=2020):
    html_text = url_to_text(url)
    if html_text is None:
        return ""
    r_html = HTML(html=html_text)
```

Scraping
We will use CSS selectors to locate the elements and extract the required data.
```
def pharse_and_extract(url, name=2020):
    html_text = url_to_text(url)
    if html_text is None:
        return ""
    r_html = HTML(html=html_text)

    # scraping starts from here
    table_class = ".imdb-scroll-table"
    r_table = r_html.find(table_class)
    table_data = []
    if len(r_table) == 1:
        table = r_table[0]
        header_col = table.find("th")
        header_names = [x.text for x in header_col]
        rows = table.find("tr")
        for row in rows[1:]:
            cols = row.find("td")
            row_data = []
            for i, col in enumerate(cols):
                row_data.append(col.text)
            table_data.append(row_data)
```

Exporting into a CSV file
Once the data is scraped, it is loaded into a pandas DataFrame and exported to a CSV file.
```
import os

import pandas as pd
from requests_html import HTML

# BASE_DIR is assumed to be the project directory
# (its definition is not shown in the original snippet)
BASE_DIR = os.path.dirname(os.path.abspath(__file__))


def pharse_and_extract(url, name=2020):
    html_text = url_to_text(url)
    if html_text is None:
        return ""
    r_html = HTML(html=html_text)
    table_class = ".imdb-scroll-table"
    r_table = r_html.find(table_class)
    table_data = []
    if len(r_table) == 1:
        table = r_table[0]
        header_col = table.find("th")
        header_names = [x.text for x in header_col]
        rows = table.find("tr")
        for row in rows[1:]:
            cols = row.find("td")
            row_data = []
            for i, col in enumerate(cols):
                row_data.append(col.text)
            table_data.append(row_data)

        # exporting data starts from here
        path = os.path.join(BASE_DIR, 'data')
        os.makedirs(path, exist_ok=True)
        filepath = os.path.join('data', f'{name}.csv')

        # loading into dataframe
        df = pd.DataFrame(table_data, columns=header_names)
        df.to_csv(filepath, index=False)
```

## Step 2 -> Setting up the Flask Server
We create a URL endpoint so that periodic scraping can be triggered through a POST request.
```
from flask import Flask

from scraper import run as scraper_runner
from logger import log_save

app = Flask(__name__)


@app.route("/box-office", methods=['POST'])
def box_office_view():
    log_save()
    scraper_runner()
    return "done"
```
## Step 3 -> Serving the scraped data using FastAPI
We will load the data from the CSV file into a DataFrame and return it as a dictionary to the frontend.
```
from fastapi import FastAPI

from scraper import run as scraper_runner
from logger import log_save
import os
import pandas as pd

app = FastAPI()

# Getting the file location
current = os.path.join(os.getcwd(), 'data')
os.makedirs(current, exist_ok=True)
file = os.path.join(current, "box_office_cleaned.csv")


# Serving the file on a GET request
@app.get("/box-office-collect")
def abc():
    df = pd.read_csv(file)
    # "records" returns a list of row dictionaries; the original "Rank"
    # argument is not a valid pandas to_dict() orient
    return df.to_dict("records")
```
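Once the FastAPI server is running (see Step 5), the endpoint can also be queried from Python. A minimal sketch, assuming the server is listening on 127.0.0.1:8888 as in the `uvicorn` command in Step 5:
```
import requests

# assumes the FastAPI server above is running locally on port 8888
response = requests.get("http://127.0.0.1:8888/box-office-collect")
print(response.status_code)
print(response.json())  # list of row dictionaries from the scraped CSV
```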
We can set up the same periodic-scraping trigger in FastAPI with the endpoint below:
```
@app.post("/box-office")
def box_office():
    log_save()
    scraper_runner()
    return {"data": "success"}
```
## Step 4 -> Deployment
For deployment, we use Ngrok to expose our localhost server to the web.
Make a Python file (trigger.py).
Once our Flask server is exposed through Ngrok, we can make POST requests to trigger periodic scraping. Tools such as the following can be used for that purpose:
- UptimeRobot
- Freshworks
- Uptrends
```
import requests

# the ngrok URL below is the tunnel address generated for this deployment
url = "http://9d44e5091b7d.ngrok.io"
endpoint = f'{url}/box-office'
r = requests.post(endpoint, json={})
print(r.text)
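# Optional sketch (not part of the original script): instead of an external
# uptime service, a simple loop could fire the trigger at a fixed interval.
# The one-hour interval and the RUN_FOREVER flag are assumptions.
RUN_FOREVER = False
if RUN_FOREVER:
    import time
    while True:
        requests.post(endpoint, json={})
        time.sleep(60 * 60)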
```

## Step 5 -> Making a ps1 script
It lets us start the servers from anywhere, whether Command Prompt or PowerShell. Make a file with the .ps1 extension.
For the Flask server:
```
# Command Prompt (cmd)
set FLASK_APP=flask_server.py
# PowerShell
$env:FLASK_APP = "flask_server.py"
flask run --host=127.0.0.1 --port=8000
```

For the FastAPI server:
`uvicorn fast_server:app --port 8888`