{"id":13936316,"url":"https://github.com/nicodds/chesf","last_synced_at":"2025-07-19T21:32:21.333Z","repository":{"id":136764446,"uuid":"110899746","full_name":"nicodds/chesf","owner":"nicodds","description":"CHeSF is the Chrome Headless Scraping Framework, a very very alpha code to scrape javascript intensive web pages","archived":false,"fork":false,"pushed_at":"2018-01-26T19:41:50.000Z","size":118,"stargazers_count":20,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-27T04:30:53.924Z","etag":null,"topics":["chrome-headless","scraping","selenium","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nicodds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-15T23:43:41.000Z","updated_at":"2023-09-23T00:23:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"23f90804-3bb5-48c2-b1a6-e120e6ea8a95","html_url":"https://github.com/nicodds/chesf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nicodds/chesf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicodds%2Fchesf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicodds%2Fchesf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicodds%2Fchesf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicodds%2Fchesf/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/nicodds","download_url":"https://codeload.github.com/nicodds/chesf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicodds%2Fchesf/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chrome-headless","scraping","selenium","webscraping"],"created_at":"2024-08-07T23:02:33.797Z","updated_at":"2025-07-19T21:32:16.282Z","avatar_url":"https://github.com/nicodds.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"## Introduction ##\nIn the era of Big Data, the web is an endless source of information.\nFor this reason, there are plenty of good tools/frameworks to\nperform _scraping_ of web pages.\n\nSo, I guess, in an ideal world there should be no need for a new\nweb scraping framework. Nevertheless, there are always subtle\ndifferences between theory and practice, and web scraping is\nno exception.\n\nReal-world web pages are often full of JavaScript code that alters the\nDOM as the user requests/navigates pages. Consequently, scraping\nJavaScript-intensive web pages can be impossible with tools that only\nfetch the static HTML.\n\nSuch considerations were the sparks that gave birth to *CHeSF*, the\nChrome Headless Scraping Framework. 
To make a long story short, CHeSF\nrelies on both\n[selenium-python](https://github.com/baijum/selenium-python) and\n[ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/)\nto scrape web pages even when JavaScript would otherwise make it\nimpossible.\n\nI know that some nice solutions to this problem already exist, but from my\npoint of view CHeSF is simpler: you just create a class that inherits from\nit, define the parse method and launch it with a start URL.\n\nThe framework is still very alpha. You should expect that things could\nchange rapidly. Currently, there is neither documentation nor packaging. There is\njust an example showing how you could use the framework to easily scrape\nTripAdvisor reviews. Personally, I used it to collect this\n[dataset](https://www.kaggle.com/nicodds/rome-b-and-bs), i.e. a collection\nof more than 220k TripAdvisor reviews.\n\n## Basic usage ##\n\nCHeSF borrows its working philosophy (in part) from\n[Scrapy](http://www.scrapy.org), i.e. making a scraping tool means\ncreating (at least) a Python class.\n\n\n```python\nimport sys\nimport os\n\n# the path to the Chrome driver executable\npath_to_chrome_driver_exe = 'path_to_chromedriver.exe'\n# currently, no package exists for CHeSF, so use this hack until\n# I have some free time to implement packaging\npath_to_chesf = 'path_to_chesf_in_your_system'\n\nsys.path.insert(0, os.path.abspath(path_to_chesf))\nfrom chesf import CHeSF, MAX_ATTEMPTS\n\nclass TripAdvisorScraper(CHeSF):\n    def __init__(self):\n        super().__init__(path_to_chrome_driver_exe, debug=False)\n        \n\n    # this is the core of the scraper: you must define it since, by\n    # convention, it is the callback called with the first url passed;\n    # afterwards, you can define other callbacks\n    def parse(self):\n        # the main advantage of CHeSF is that you can use JavaScript\n        # directly to parse the page\n        script = \"\"\"\n\t       let urls = [];\n\t       let anchors = 
document.querySelectorAll(\"a.property_title.prominent\");\n        \n    \t   for (let a of anchors)\n                urls.push(a.href);\n\n    \t   return urls;\n        \"\"\"\n\n        # the array returned from the JavaScript is automagically\n        # transformed to a Python list (this is selenium magic)\n        links = self.call_js(script)\n\n        for link in links:\n            print(link)\n\n        # you can use either xpath or css selectors (just change the\n        # method you use)\n        next_page = self.css('a.nav.next.taLnk.ui_button.primary', timeout=1)\n        \n        if len(next_page) \u003e 0:\n            # clicks are immediately executed\n            self.enqueue_click(next_page[0], self.parse)\n            \nstart_url = 'https://www.tripadvisor.com/Hotels-g187791-c2-Rome_Lazio-Hotels.html'\nscraper = TripAdvisorScraper()\n\ntry:\n    scraper.start(start_url)\nexcept:\n    scraper.quit()\n    raise\n\n```\n\n## Contacts ##\nIf you have questions and/or suggestions, write me a note using my GitHub contact email.\n\n## Mini FAQ ##\n\nQ. Hey man, it absolutely doesn't work! What's wrong?\nA. Please check that your ChromeDriver version is compatible with the Chrome version you are using.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicodds%2Fchesf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicodds%2Fchesf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicodds%2Fchesf/lists"}