https://github.com/teticio/lambda-selenium
Use AWS Lambda functions as a proxy pool to scrape web pages with Selenium.
lambda-functions proxy scraping selenium terraform
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/teticio/lambda-selenium
- Owner: teticio
- License: bsd-3-clause
- Created: 2023-09-16T13:16:49.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-02T10:01:31.000Z (about 1 year ago)
- Last Synced: 2024-04-14T03:16:10.707Z (10 months ago)
- Topics: lambda-functions, proxy, scraping, selenium, terraform
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Lambda Selenium
(See also [lambda-scraper](https://github.com/teticio/lambda-scraper))
Use AWS Lambda functions as a proxy to scrape web pages with Selenium. This is a cost-effective way to get access to a large pool of IP addresses. The commands under Usage create as many Lambda functions as you need (one for each IP address); the number of functions and the region can be set in `variables.tf`.

Each Lambda function changes IP address after approximately 6 minutes of inactivity. For example, you could create 360 Lambda functions and cycle through them at a rate of one per second, making as many requests as possible through each function's IP address during its turn. Note that, in practice, AWS will sometimes assign the same IP address to more than one Lambda function.
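The figure of 360 functions comes from dividing the roughly 6-minute idle period by the one second spent on each function. The sketch below works through that arithmetic and shows a simple round-robin rotation. The helpers `pool_size` and `rotation` and the `proxy-N` names are hypothetical, chosen for illustration; the real function names are whatever the Terraform configuration creates.

```python
import itertools
import math

# Approximate idle time after which a function gets a new IP (figure from the text).
IDLE_SECONDS = 6 * 60


def pool_size(seconds_per_function, idle_seconds=IDLE_SECONDS):
    """Functions needed so that one full cycle through the pool lasts about idle_seconds."""
    return math.ceil(idle_seconds / seconds_per_function)


def rotation(function_names, slots):
    """Return the function to use in each of the first `slots` time slots, round-robin."""
    return list(itertools.islice(itertools.cycle(function_names), slots))


# Hypothetical names; the real ones are whatever Terraform creates.
names = [f"proxy-{i}" for i in range(pool_size(seconds_per_function=1))]
print(len(names))          # 360
print(rotation(names, 3))  # ['proxy-0', 'proxy-1', 'proxy-2']
```

Spending longer on each function shrinks the pool needed for the same cycle length: at two seconds per function, `pool_size(2)` gives 180.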
## Pre-requisites
You will need Terraform and Docker installed.
## Usage
```bash
git clone https://github.com/teticio/lambda-selenium.git
cd lambda-selenium
terraform init
terraform apply -auto-approve
# run "terraform apply -destroy -auto-approve" in the same directory to tear all this down again
```
You can specify an `AWS_PROFILE` and `AWS_REGION` with:
```bash
terraform apply -auto-approve -var 'region=AWS_REGION' -var 'profile=AWS_PROFILE'
```
An example of how to use this from Python is provided in `test_selenium.py`. It runs the script in `example.py` to search for descriptions of dog breeds in Google.
```bash
AWS_DEFAULT_REGION=AWS_REGION python test_selenium.py
```
There are also examples of running the Lambda functions in parallel and asynchronously, either of which greatly speeds up the process.
```bash
# Multi-processing (uses multiple CPU cores)
AWS_DEFAULT_REGION=AWS_REGION python test_selenium_parallel.py
# Asynchronous
AWS_DEFAULT_REGION=AWS_REGION python test_selenium_async.py
```
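The test scripts above are the reference for how the functions are actually called. As a rough sketch of the general pattern only, the following fans a batch of invocations out over a thread pool. The helpers `invoke_script` and `invoke_all`, the function names, and the `{"script": ...}` payload shape are hypothetical and not taken from the repository. `client` is assumed to be a boto3 Lambda client (`boto3.client("lambda")`), whose `invoke` method accepts `FunctionName`, `InvocationType` and `Payload` and returns a response with a readable `Payload` stream. Threads are used here for brevity, in place of the multi-processing and asynchronous approaches in the repository's scripts.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def invoke_script(client, function_name, script):
    """Synchronously invoke one Lambda function and decode its JSON reply.

    The {"script": ...} payload is an assumed, illustrative format.
    """
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps({"script": script}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def invoke_all(client, function_names, scripts, max_workers=8):
    """Assign scripts to functions round-robin and run the invocations concurrently.

    Results are returned in the same order as `scripts`.
    """
    targets = [function_names[i % len(function_names)] for i in range(len(scripts))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda pair: invoke_script(client, *pair), zip(targets, scripts))
        )
```

Because the client is passed in as an argument, the same code can be exercised with a stub object before pointing it at real Lambda functions.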