https://github.com/jokerdii/web-scrapping-projects
mongodb scrapy selenium splash sqlite3
- Host: GitHub
- URL: https://github.com/jokerdii/web-scrapping-projects
- Owner: JoKerDii
- Created: 2022-05-27T02:21:12.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-05-30T02:13:33.000Z (about 4 years ago)
- Last Synced: 2025-08-20T15:58:18.417Z (10 months ago)
- Topics: mongodb, scrapy, selenium, splash, sqlite3
- Language: Python
- Homepage:
- Size: 7.89 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Scraping Notes and Logs
## Using Scrapy
`scrapy bench`: Run quick benchmark test.
`scrapy fetch <url>`: Fetch a URL using the Scrapy downloader.
`scrapy genspider`: Generate new spider using pre-defined templates.
## CSS Selectors
[CSS playground](https://try.jsoup.org/)
Use `#` to query tags with id. Use `.` to query tags with class. Use `.A.B` to query double classes `class="A B"`.
Use square brackets to query tags by attribute, e.g. use `[data-identifier=7]` or `li[data-identifier=7]` to query `<li data-identifier="7">`.

Select tags with specific attributes: `a[href^='https']` matches links whose `href` starts with `https` (the `Google` link), `a[href$='fr']` matches links whose `href` ends with `fr` (the `Google France` link).

Select nested tags: `div.intro p, span#location` for the `<p>` tags nested in `<div class="intro">` plus `<span id="location">`.

Select direct children: `div.intro > p` for all `<p>` tags that are direct children of `<div class="intro">`.

Select a particular tag immediately after a tag: `div.intro + p` for the `<p>` immediately after `<div class="intro">`.

Select a tag by position: `li:nth-child(1)` to get the first `<li>`.

## XPath Selectors

[XPath playground](https://scrapinghub.github.io/xpath-playground/)

Select `<a>` with href starting with 'https': `//a[starts-with(@href,"https")]`.

Select `<a>` with href ending with 'fr': `//a[ends-with(@href,"fr")]`. Note that `ends-with` is an XPath 2.0 function; Scrapy's selectors use XPath 1.0, where it is not available.

Select `<a>` with href containing 'google': `//a[contains(@href,"google")]`.

Select `<a>` with text containing 'France': `//a[contains(text(),"France")]`. (Note that this is case sensitive.)

Select the first `<li>`: `//ul[@id="items"]/li[1]`.

Select the first and the fourth `<li>`: `//ul[@id="items"]/li[position() = 1 or position() = 4]`. If the fourth one is the last, `last()` works as well: `//ul[@id="items"]/li[position() = 1 or position() = last()]`.

Select the immediate parent `<div>` of `<p id="unique">`: `//p[@id='unique']/parent::div`.

Select any immediate parent tag of `<p id="unique">`: `//p[@id='unique']/parent::node()`.

Select all parent tags of `<p id="unique">` (p is excluded): `//p[@id='unique']/ancestor::node()`.

Select all parent tags of `<p id="unique">` (p is included): `//p[@id='unique']/ancestor-or-self::node()`.

Select the `<h1>` tags that precede `<p id="unique">` (not parents): `//p[@id='unique']/preceding::h1`.

Select all tags that precede `<p id="unique">` (not parents): `//p[@id='unique']/preceding::node()`.

Select all sibling tags listed before `<p id="unique">`: `//p[@id='unique']/preceding-sibling::node()`.

The forward axes work the same way: `child::node()` selects the immediate children of a tag, `following::node()` selects all tags listed after it, `following-sibling::node()` selects the siblings listed after it, and `descendant::node()` selects all children nested inside it.

## Basic steps to web scraping

Start a project

```
mkdir projects
cd projects
scrapy startproject worldometers
cd worldometers
scrapy genspider countries https://www.worldometers.info/world-population/population-by-country/
```

In `countries.py`, change `start_urls = ['http://www.worldometers.info/']` to `start_urls = ['https://www.worldometers.info/world-population/population-by-country/']`.

```
scrapy shell # shows some available Scrapy objects.
# the following lines are typed inside the Scrapy shell
fetch('https://www.worldometers.info/world-population/population-by-country/')
r = scrapy.Request(url = 'https://www.worldometers.info/world-population/population-by-country/')
fetch(r)
response.body
view(response)
```

Note that Scrapy cannot interpret JavaScript. Scrapy returns the raw HTML markup without JS, so to see the page the way Scrapy does we need to disable JS in the browser: 'Command + Shift + I' -> 'Command + Shift + P' -> disable JavaScript.

XPath expressions & CSS Selectors to get the title.

```
title = response.xpath('//h1')
title
title = response.xpath('//h1/text()')
title
title.get()
```

```
title_css = response.css('h1::text')
title_css
title_css.get()
```

XPath expressions & CSS Selectors to get all the countries.

```
countries = response.xpath('//td/a/text()').getall()
countries
```

```
countries_css = response.css('td a ::text').getall()
countries_css
```

Yield the result from the spider's `parse()` method

```
yield {
    'title': title,
    'countries': countries
}
```

## Splash

JavaScript requires an engine to be executed. Chrome has the V8 engine. Firefox has SpiderMonkey. Safari has Apple WebKit (the same engine used by Splash). Microsoft Edge (legacy) has Chakra. For scraping websites on which we really need JavaScript, we can use Splash or Selenium.

To download Splash, we first install Docker and run

```
docker pull scrapinghub/splash
```

To start Splash for the first time, we run

```
docker run -it -p 8050:8050 scrapinghub/splash
```

Then, open 'http://0.0.0.0:8050' in a browser to use Splash. To start Splash the next time, we can open Docker Desktop, go to the dashboard, and click the start button of the specific app.

To render the target website https://duckduckgo.com, on http://0.0.0.0:8050 do

```
function main(splash, args)
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  return {
    splash:png(),
    splash:html()
  }
end
```

We can use `select()` or `select_all()` to select elements. When searching, we sometimes need to wait a few more seconds for the webpage to render. We can submit the search by either clicking the button

```
btn = assert(splash:select("#search_button_homepage"))
btn:mouse_click()
```

or pressing Enter in the input box

```
input_box:send_keys("<Enter>")
```

The full code is

```
function main(splash, args)
  url = args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  input_box = assert(splash:select("#search_form_input_homepage"))
  input_box:focus()
  input_box:send_text("my user agent")
  assert(splash:wait(0.5))
  --[[
  btn = assert(splash:select("#search_button_homepage"))
  btn:mouse_click()
  --]]
  input_box:send_keys("<Enter>")
  assert(splash:wait(5))
  return {
    splash:png(),
    splash:html()
  }
end
```

To overwrite request headers (set the user agent) we can do

```
splash:set_user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
```

or

```
headers = {
  ['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36"
}
splash:set_custom_headers(headers)
```

or

```
splash:on_request(function(request)
  request:set_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36')
end)
```

## Selenium

```
pip install scrapy_selenium
```

## Store data in MongoDB

```
pip install pymongo dnspython
```

Modify `pipelines.py` and `settings.py` (change to MongodbPipeline).

Create an account on MongoDB cloud. Create a new cluster. Configure database access and network access (0.0.0.0/0). Connect to the cluster: Connect -> connect to your application -> choose the language -> copy the application code and paste it into `pipelines.py` -> replace `<password>` with the actual password. Run `scrapy crawl best_moviews` and check the collection on MongoDB. The data are stored in it.

## Store data in SQLite3

Note that `sqlite3` is already included in the Python standard library, so we don't need to install it. Modify `pipelines.py` and `settings.py` (change to SQLlitePipeline). Run `scrapy crawl best_moviews`, which produces an `imdb.db` file. Install the SQLite extension in VS Code, then right-click `imdb.db` and open it.
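
The `pipelines.py` change mentioned in these notes is not shown, so here is a minimal sketch of what a SQLite-backed pipeline could look like. It uses the three hook methods Scrapy calls on an item pipeline (`open_spider`, `process_item`, `close_spider`). The class name `SQLitePipelineSketch`, the table name `best_movies`, and the `title`, `year`, `rating` fields are assumptions for illustration, not taken from the repository.

```python
import sqlite3


class SQLitePipelineSketch:
    """Hypothetical item pipeline that stores scraped items in SQLite."""

    def __init__(self, db_path="imdb.db"):
        # Scrapy instantiates pipelines without arguments, so the default
        # path is what a crawl would use; the parameter helps with testing.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS best_movies "
            "(title TEXT, year TEXT, rating TEXT)"  # assumed columns
        )
        self.connection.commit()

    def process_item(self, item, spider):
        # Called for every yielded item: insert one row, pass the item on.
        self.connection.execute(
            "INSERT INTO best_movies (title, year, rating) VALUES (?, ?, ?)",
            (item.get("title"), item.get("year"), item.get("rating")),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()
```

To enable a pipeline like this, `settings.py` would list its import path in `ITEM_PIPELINES`, for example `ITEM_PIPELINES = {"<project>.pipelines.SQLitePipelineSketch": 300}`, where `<project>` stands for the actual project package. A MongoDB variant would keep the same three methods and swap the `sqlite3` calls for `pymongo` ones.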