{"id":16161443,"url":"https://github.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup","last_synced_at":"2025-10-24T03:32:10.134Z","repository":{"id":217226903,"uuid":"743330222","full_name":"gamemann/How-To-Use-Selenium-And-BeautifulSoup","owner":"gamemann","description":"A full lab and how-to guide on how to use Selenium paired with Beautiful Soup to parse and extract data from a website using Python.","archived":false,"fork":false,"pushed_at":"2024-01-15T05:25:17.000Z","size":233,"stargazers_count":14,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-04T04:55:04.049Z","etag":null,"topics":["beautifulsoup","beautifulsoup4","bs4","firefox","geckodriver","node","nodejs","python","react","selenium","selenium-python","selenium-webdriver","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"https://deaconn.net/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gamemann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-15T01:43:08.000Z","updated_at":"2024-12-30T22:29:00.000Z","dependencies_parsed_at":"2024-01-15T05:59:01.963Z","dependency_job_id":null,"html_url":"https://github.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup","commit_stats":null,"previous_names":["gamemann/selenium-and-beautifulsoup-lab"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemann%2FHow-To-Use-Selenium-And-BeautifulSoup","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemann%2FHow-To-Use-Seleniu
m-And-BeautifulSoup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemann%2FHow-To-Use-Selenium-And-BeautifulSoup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamemann%2FHow-To-Use-Selenium-And-BeautifulSoup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gamemann","download_url":"https://codeload.github.com/gamemann/How-To-Use-Selenium-And-BeautifulSoup/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237910078,"owners_count":19385830,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","beautifulsoup4","bs4","firefox","geckodriver","node","nodejs","python","react","selenium","selenium-python","selenium-webdriver","webscraper","webscraping"],"created_at":"2024-10-10T02:25:16.721Z","updated_at":"2025-10-24T03:32:09.712Z","avatar_url":"https://github.com/gamemann.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This repository will show how to use [Selenium](https://www.selenium.dev/) paired with [Beautiful Soup (V4)](https://pypi.org/project/beautifulsoup4/) in Python (3+) to parse and extract data from websites. I've included example(s) of using JavaScript as well (e.g. button clicks to open menus and then extract more hidden data). I also plan on making blog articles under [Deaconn](https://deaconn.net/) using these examples in the future!\n\nThese tools are commonly used with web browser automation, web scraping, and development tests. 
Additionally, you can use the combination of these tools in other projects such as creating a follow bot (obviously using at your own risk)!\n\n## What Is Selenium \u0026 Beautiful Soup?\n**Selenium** is a powerful tool for controlling web browsers through programs and performing browser automation/tasks. A driver is included for most web browsers and a wide range of programming languages are supported!\n\n**Beautiful Soup** is a Python library for pulling data out of HTML and XML files. It parses anything you give it, and does the tree traversal stuff for you!\n\n## Requirements \u0026 Setup\nI've created and tested the programs made in this repository on a Debian 12 virtual machine I have running on one of my [home servers](https://github.com/gamemann/Home-Lab?tab=readme-ov-file#two-powerball). While I don't have specific instructions for setting up this repository on non-Debian/Ubuntu-based systems, there shouldn't be many changes you need to make to the instructions below. In fact, it may be easier since you may not have to worry about your OS's package manager handling the Python installation.\n\n### Debian/Ubuntu-Based Systems\nDebian/Ubuntu-based systems typically use the `apt` package manager to manage the server's Python installation and its libraries. This is fine in most cases, but sometimes there are packages that aren't included with `apt` and when using the `pip` or `pip3` commands to install the package, you'll receive an error like below.\n\n```bash\nerror: externally-managed-environment\n\n× This environment is externally managed\n╰─\u003e To install Python packages system-wide, try apt install\n    python3-xyz, where xyz is the package you are trying to\n    install.\n    \n    If you wish to install a non-Debian-packaged Python package,\n    create a virtual environment using python3 -m venv path/to/venv.\n    Then use path/to/venv/bin/python and path/to/venv/bin/pip. 
Make\n    sure you have python3-full installed.\n    \n    If you wish to install a non-Debian packaged Python application,\n    it may be easiest to use pipx install xyz, which will manage a\n    virtual environment for you. Make sure you have pipx installed.\n    \n    See /usr/share/doc/python3.11/README.venv for more information.\n\nnote: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.\nhint: See PEP 668 for the detailed specification.\n```\n\nYou could pass the `--break-system-packages` flag to the `pip` or `pip3` commands, but as stated in the error, this risks breaking packages in your global Python installation. A solution to this is using virtual Python environments which is detailed below.\n\nIf you do want to use `apt` to manage the packages, you can install Selenium and BeautifulSoup4 using the command below.\n\n```bash\nsudo apt install -y python3-bs4 python3-selenium\n```\n\n### Virtual Python Environments\nI personally recommend creating a virtual Python environment so that you don't risk breaking your Python installation if you need to install a package that isn't included in the `apt` package manager. It is pretty easy to create a virtual environment as well. In our case, we can do so by using the command below.\n\n```bash\npython3 -m venv venv/\n```\n\nThis will create a `venv/` directory in your current working directory. Afterwards, you will want to source `venv/bin/activate` and then you will be able to use the `pip` or `pip3` commands to install the required packages.\n\n```bash\nsource venv/bin/activate\n```\n\nI've also included a `requirements.txt` file which allows you to easily install the required packages using the `pip` or `pip3` commands. 
You may use the command below.\n\n```bash\npip3 install -r requirements.txt\n```\n\n**Note** - The `requirements.txt` file includes `beautifulsoup4` (version `4.12.2`) and `selenium` (version `4.16.0`). There may be updates available to these packages, but these are the versions I've made this repository with.\n\n### Firefox \u0026 Geckodriver\nIn this repository, we use Selenium's Firefox driver paired with [geckodriver](https://github.com/mozilla/geckodriver). I'd recommend heading to the [releases page](https://github.com/mozilla/geckodriver/releases) and downloading the latest release. Otherwise, you can use the version I've tested below.\n\n```bash\n# Download version '0.34.0' for Linux 64-bit.\nwget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz\n\n# Uncompress and extract the file using the 'tar' command.\ntar -xzvf geckodriver-v0.34.0-linux64.tar.gz\n\n# Move to '/usr/bin' using sudo/root.\nsudo mv geckodriver /usr/bin\n```\n\nYou'll also need to install Firefox. You can do so using `apt` below.\n\n```bash\nsudo apt install -y firefox-esr\n```\n\n## Website Setup \u0026 Running\nThe website we've made to test the Python programs utilizes [React](https://react.dev/) and [Node.js](https://nodejs.org/en). The website's source code is located in the [`site/`](./site) directory.\n\n### Requirements\nYou will need to install **Node.js** and **NPM** on your system. You can read [this guide](https://nodejs.org/en/download/package-manager/) on how to install these packages using a package manager. You can use the following command to install Node.js and NPM using the `apt` package manager. 
However, note that the Node.js and NPM versions in the standard `apt` repositories are fairly old (stable), but they should work for the website in this repository.\n\n```bash\nsudo apt install -y nodejs npm\n```\n\n### Installing Packages\nAfter installing Node.js and NPM, you can change your directory to our website using the `cd site/` command and run the following to install the needed packages via NPM.\n\n```bash\nnpm install\n```\n\nAfterwards, you can run the following command to start the web development server.\n\n```bash\nnpm start\n```\n\nBy default, the website should be listening at [http://localhost:3000](http://localhost:3000). However, if you want to change the bind IP or port, you can set the `HOST` and `PORT` environment variables. Here's an example.\n\n```bash\nHOST=0.0.0.0 PORT=3001 npm start\n```\n\nIf you use a different host or port, please make sure to pass the matching URL to the Python program using the `-s`/`--site` argument. Read **Command Line Usage** for more information.\n\n## Command Line Usage\nEach Python program utilizes [`src/base/cmdline.py`](./src/base/cmdline.py) to parse the command line arguments. Arguments are listed below.\n\n* `-b --binary` - The path to the Geckodriver binary file (default =\u003e `/usr/bin/geckodriver`).\n* `-s --site` - The full URL of the website to parse and extract information from (default =\u003e `http://localhost:3000`).\n* `-u --ua` - The web browser's user agent to use when sending requests (default =\u003e `Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0`).\n\n## Programs\nAll Python programs are located in the [`src/`](./src) directory. You may execute them using the following command. 
Please make sure you have the website started in another terminal!\n\n```bash\npython3 src/\u003cprogram\u003e.py\n```\n\nHere's a list of programs we've made so far!\n\n### [`simple-image-collector.py`](./src/simple-image-collector.py)\nThis Python program parses our website and extracts all image sources inside of elements with the class name `image-row`.\n\nThe expected output is the following.\n\n```bash\n$ python3 src/simple-image-collector.py \nStarting simple-image-collector...\nParsing arguments...\nSetting up Selenium driver...\nParsing website 'http://localhost:3000'...\nFound the following image URLs.\n        - /images/testimage01.png\n        - /images/testimage02.png\n        - /images/testimage03.png\n        - /images/testimage04.png\nExiting...\n```\n\n### [`simple-card-collector.py`](./src/simple-card-collector.py)\nThis Python program parses our website and extracts the title and description of all elements with the class name `card-row`. The title is found inside of the `\u003ch2\u003e` tag while the description is found inside of the `\u003cp\u003e` tag inside the card row element.\n\nThe expected output is the following.\n\n```bash\n$ python3 src/simple-card-collector.py \nStarting simple-card-collector...\nParsing arguments...\nSetting up Selenium driver...\nParsing website 'http://localhost:3000'...\nFound the following cards.\n        Card #1\n                Title =\u003e Card Title #1\n                Description =\u003e This is the description of card #1!\n        Card #2\n                Title =\u003e Card Title #2\n                Description =\u003e This is the description of card #2!\n        Card #3\n                Title =\u003e Card Title #3\n                Description =\u003e This is the description of card #3!\nExiting...\n```\n\n### [`adv-clickdiv-collector.py`](./src/adv-clickdiv-collector.py)\nThis Python program parses our website, clicks all the dividers with the class name `clickDiv-row`, and then extracts the divider's 
title and hidden content. This is a more advanced example since it uses JavaScript to click buttons.\n\nThe expected output is the following.\n\n```bash\n$ python3 src/adv-clickdiv-collector.py\nStarting adv-clickdiv-collector...\nParsing arguments...\nSetting up Selenium driver...\nParsing website 'http://localhost:3000'...\nFound the following clickable dividers.\n        ClickDiv #1\n                Title =\u003e Clickable Div #1\n                Description =\u003e These are the hidden contents of clickable div #1!\n        ClickDiv #2\n                Title =\u003e Clickable Div #2\n                Description =\u003e These are the hidden contents of clickable div #2!\n        ClickDiv #3\n                Title =\u003e Clickable Div #3\n                Description =\u003e These are the hidden contents of clickable div #3!\n        ClickDiv #4\n                Title =\u003e Clickable Div #4\n                Description =\u003e These are the hidden contents of clickable div #4!\nExiting...\n```\n\n## Credits\n* [Christian Deacon](https://github.com/gamemann)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamemann%2FHow-To-Use-Selenium-And-BeautifulSoup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgamemann%2FHow-To-Use-Selenium-And-BeautifulSoup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamemann%2FHow-To-Use-Selenium-And-BeautifulSoup/lists"}