{"id":23955578,"url":"https://github.com/jokerdii/web-scrapping-projects","last_synced_at":"2026-05-03T12:33:03.321Z","repository":{"id":164002802,"uuid":"496835056","full_name":"JoKerDii/web-scrapping-projects","owner":"JoKerDii","description":null,"archived":false,"fork":false,"pushed_at":"2022-05-30T02:13:33.000Z","size":8277,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-20T15:58:18.417Z","etag":null,"topics":["mongodb","scrapy","selenium","splash","sqlite3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JoKerDii.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-27T02:21:12.000Z","updated_at":"2022-05-29T01:53:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"20d25fd1-f029-4a16-a20f-682a3b0ae06a","html_url":"https://github.com/JoKerDii/web-scrapping-projects","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JoKerDii/web-scrapping-projects","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoKerDii%2Fweb-scrapping-projects","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoKerDii%2Fweb-scrapping-projects/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoKerDii%2Fweb-scrapping-projects/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoKerDii%2Fweb-scrapping-projects/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JoKerDii","download_url":"https://codeload.github.com/JoKerDii/web-scrapping-projects/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoKerDii%2Fweb-scrapping-projects/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32569712,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mongodb","scrapy","selenium","splash","sqlite3"],"created_at":"2025-01-06T15:36:08.146Z","updated_at":"2026-05-03T12:33:03.287Z","avatar_url":"https://github.com/JoKerDii.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scrapping Notes and Logs\n\n## Using Scrapy\n\n`scrapy bench`: Run quick benchmark test.\n`scrapy fetch \u003cURL\u003e`: Fetch a URL using the Scrapy downloader.\n`scrapy genspider`: Generate new spider using pre-defined templates.\n\n## CSS Selectors\n\n[CSS playground](https://try.jsoup.org/)\n\nUse `#` to query tags with id. Use `.` to query tags with class. Use `.A.B` to query double classes `class=\"A B\"`.\nUse square bracket to query tags with attributes, e.g. 
use `[data-identifier='7']` or `li[data-identifier='7']` to query the `\u003cli data-identifier=\"7\"\u003e` tag.\nSelect tags by attribute prefix or suffix: `a[href^='https']` matches `\u003ca href=\"https://www.google.com\"\u003eGoogle\u003c/a\u003e`, and `a[href$='fr']` matches `\u003ca href=\"https://www.google.fr\"\u003eGoogle France\u003c/a\u003e`.\nSelect descendants and group selectors: `div.intro p, span#location` matches every `\u003cp\u003e` nested anywhere inside `\u003cdiv class=\"intro\"\u003e`, plus `\u003cspan id=\"location\"\u003e`.\nSelect direct children: `div.intro \u003e p` matches only the `\u003cp\u003e` tags that are immediate children of `\u003cdiv class=\"intro\"\u003e`.\nSelect an adjacent sibling: `div.intro + p` matches the `\u003cp\u003e` that comes immediately after `\u003cdiv class=\"intro\"\u003e`.\nSelect by position: `li:nth-child(1)` gets the first `\u003cli\u003e`, `li:nth-child(3)` gets the third `\u003cli\u003e`, and `li:nth-child(odd)` gets the odd-numbered `\u003cli\u003e` tags.\n\n## XPath Selectors\n\n[XPath playground](https://scrapinghub.github.io/xpath-playground/).\n\n\nSelect all `\u003cp\u003e` within `\u003cdiv class=\"intro\"\u003e` and `\u003cdiv class=\"outro\"\u003e`: `//div[@class=\"intro\" or @class=\"outro\"]/p`. Select text only: `//div[@class=\"intro\" or @class=\"outro\"]/p/text()`.\nSelect `\u003ca\u003e` tags whose href starts with 'https': `//a[starts-with(@href,\"https\")]`.\nSelect `\u003ca\u003e` tags whose href ends with 'fr': XPath 1.0, which Scrapy uses, has no `ends-with()` function, so compare the tail of the string instead: `//a[substring(@href, string-length(@href) - string-length(\"fr\") + 1) = \"fr\"]`.\nSelect `\u003ca\u003e` tags whose href contains 'google': `//a[contains(@href,\"google\")]`.\nSelect `\u003ca\u003e` tags whose text contains 'France': `//a[contains(text(),\"France\")]`. (note that this is case sensitive)\nSelect the first `\u003cli\u003e` in `\u003cul\u003e`: `//ul[@id=\"items\"]/li[1]`. Select the first and the fourth `\u003cli\u003e` in `\u003cul\u003e`: `//ul[@id=\"items\"]/li[position() = 1 or position() = 4]`. 
If the fourth one is also the last, use `last()` instead of a fixed index: `//ul[@id=\"items\"]/li[position() = 1 or position() = last()]`.\n\n\nSelect the immediate parent `\u003cdiv\u003e` of `\u003cp id=\"unique\"\u003e`: `//p[@id='unique']/parent::div`.\nSelect the immediate parent of `\u003cp id=\"unique\"\u003e`, whatever its tag: `//p[@id='unique']/parent::node()`.\nSelect all ancestors of `\u003cp id=\"unique\"\u003e` (p is excluded): `//p[@id='unique']/ancestor::node()`.\nSelect all ancestors of `\u003cp id=\"unique\"\u003e` (p is included): `//p[@id='unique']/ancestor-or-self::node()`.\nSelect every `\u003ch1\u003e` that precedes `\u003cp id=\"unique\"\u003e` (ancestors excluded): `//p[@id='unique']/preceding::h1`.\nSelect all nodes that precede `\u003cp id=\"unique\"\u003e` (ancestors excluded): `//p[@id='unique']/preceding::node()`.\nSelect all siblings that come before `\u003cp id=\"unique\"\u003e`: `//p[@id='unique']/preceding-sibling::node()`.\n\n\nSelect the immediate child `\u003cp\u003e` tags of `\u003cdiv class=\"intro\"\u003e`: `//div[@class='intro']/child::p`.\nSelect any immediate child of `\u003cdiv class=\"intro\"\u003e`: `//div[@class='intro']/child::node()`.\nSelect all nodes that come after `\u003cdiv class=\"intro\"\u003e` in the document: `//div[@class='intro']/following::node()`.\nSelect all nodes that come after `\u003cdiv class=\"intro\"\u003e` and share its parent `\u003cbody\u003e`: `//div[@class='intro']/following-sibling::node()`.\nSelect all descendants of `\u003cdiv class=\"intro\"\u003e`, at any depth: `//div[@class='intro']/descendant::node()`.\n\n\n## Basic steps to web-scrapping\n\nStart a project\n\n```\nmkdir projects\ncd projects\nscrapy startproject worldometers\n\ncd worldometers\nscrapy genspider countries https://www.worldometers.info/world-population/population-by-country/\n```\n\nIn `countries.py`, change `start_urls = ['http://www.worldometers.info/']` to `start_urls = ['https://www.worldometers.info/world-population/population-by-country/']`.\n\n\n```\nscrapy shell # shows some available Scrapy 
objects.\nfetch('https://www.worldometers.info/world-population/population-by-country/')\nr = scrapy.Request(url = 'https://www.worldometers.info/world-population/population-by-country/')\nfetch(r)\nresponse.body\nview(response)\n```\n\nNote that Scrapy cannot interpret JavaScript; it returns the raw HTML markup as served, before any JS runs. To see the page the way Scrapy sees it, disable JavaScript in the browser: in Chrome, open DevTools ('Command + Option + I' on macOS, 'Ctrl + Shift + I' on Windows/Linux), open the command menu ('Command + Shift + P'), and run 'Disable JavaScript'.\n\nXPath expressions \u0026 CSS Selectors to get the title.\n\n```\ntitle = response.xpath('//h1')\ntitle\ntitle = response.xpath('//h1/text()')\ntitle\ntitle.get()\n```\n\n```\ntitle_css = response.css('h1::text')\ntitle_css\ntitle_css.get()\n```\n\nXPath expressions \u0026 CSS Selectors to get all the countries.\n\n```\ncountries = response.xpath('//td/a/text()').getall()\ncountries \n```\n\n```\ncountries_css = response.css('td a ::text').getall()\ncountries_css\n```\n\nYield the scraped data from the spider's `parse` method\n\n```\nyield {\n    'title': title.get(),\n    'countries': countries\n}\n```\n\n## Splash\n\nJS needs an engine to be executed. Chrome uses V8. Firefox uses SpiderMonkey. Safari uses JavaScriptCore, part of Apple's WebKit (Splash is also built on WebKit). The legacy Microsoft Edge used Chakra. For scraping websites that really need JavaScript, we can use Splash or Selenium.\n\nTo download Splash, we first install Docker and run\n\n```\ndocker pull scrapinghub/splash\n```\n\nTo start Splash for the first time, we run\n\n```\ndocker run -it -p 8050:8050 scrapinghub/splash \n```\n\nThen, open 'http://0.0.0.0:8050' in a browser to use Splash.\n\nTo start Splash the next time, we can open Docker Desktop, go to the dashboard, and click the start button of the Splash container.\n\nTo render the target website https://duckduckgo.com, run the following script on http://0.0.0.0:8050\n\n```\nfunction main(splash, args)\n  url = args.url\n  assert(splash:go(url))\n  assert(splash:wait(1))\n  return {\n      splash:png(),\n      splash:html()\n  }\nend\n```\n\nWe can use `select()` or `select_all()` to select elements. 
When searching, we sometimes need to wait a few more seconds for the results page to render.\n\n\nWe can submit the search by either clicking the button\n\n```\nbtn = assert(splash:select(\"#search_button_homepage\"))\nbtn:mouse_click()\n```\nor pressing Enter in the input box\n\n```\ninput_box:send_keys(\"\u003cEnter\u003e\")\n```\n\nThe full code is \n\n```\nfunction main(splash, args)\n\n  url = args.url\n  assert(splash:go(url))\n  assert(splash:wait(1))\n  \n  input_box = assert(splash:select(\"#search_form_input_homepage\"))\n  input_box:focus()\n  input_box:send_text(\"my user agent\")\n  assert(splash:wait(0.5))\n  \n  --[[\n  btn = assert(splash:select(\"#search_button_homepage\"))\n  btn:mouse_click()\n  --]]\n  input_box:send_keys(\"\u003cEnter\u003e\")\n  assert(splash:wait(5))\n\n  return {\n      splash:png(),\n      splash:html()\n  }\nend\n```\n\nTo overwrite request headers (set the user agent) we can do\n\n```\nsplash:set_user_agent(\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36\")\n```\nor \n\n```\nheaders = {\n    ['User-Agent'] = \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36\"\n}\nsplash:set_custom_headers(headers)\n```\nor \n\n```\nsplash:on_request(function(request)\n    request:set_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36')\nend)\n```\n\n## Selenium\n\n```\npip install scrapy_selenium\n```\n\n## Store data in MongoDB\n\n```\npip install pymongo dnspython\n```\nModify `pipelines.py` and `settings.py` (register `MongodbPipeline` in `ITEM_PIPELINES`).\n\nCreate an account on MongoDB cloud. Create a new cluster. Configure database access and network access (0.0.0.0/0). \n\nConnect to the cluster. 
Connect -\u003e connect to your application -\u003e choose the language -\u003e copy the application code and paste it into `pipelines.py` -\u003e replace \u003cpassword\u003e with the actual password.\n\nRun `scrapy crawl best_moviews` and check the collection on MongoDB. The data are stored in it.\n\n\n## Store data in SQLite3\n\nNote that sqlite3 is already included in the Python standard library, so we don't need to install it.\n\nModify `pipelines.py` and `settings.py` (change to SQLlitePipeline).\n\nRun `scrapy crawl best_moviews`, which produces an `imdb.db` file.\n\nInstall the SQLite extension in VS Code. Right-click `imdb.db` and open it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjokerdii%2Fweb-scrapping-projects","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjokerdii%2Fweb-scrapping-projects","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjokerdii%2Fweb-scrapping-projects/lists"}