Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/teticio/lambda-selenium

Use AWS Lambda functions as a proxy pool to scrape web pages with Selenium.
https://github.com/teticio/lambda-selenium

lambda-functions proxy scraping selenium terraform

Last synced: 2 months ago
JSON representation

Use AWS Lambda functions as a proxy pool to scrape web pages with Selenium.

Awesome Lists containing this project

README

        

# Lambda Selenium

(See also [lambda-scraper](https://github.com/teticio/lambda-scraper))

Use AWS Lambda functions as a proxy to scrape web pages with Selenium. This is a cost effective way to have access to a large pool of IP addresses. Run the following to create as many Lambda functions as you need (one for each IP address). The number of functions as well as the region can be specified in `variables.tf`. Each Lambda function changes IP address after approximately 6 minutes of inactivity. For example, you could create 360 Lambda functions which you cycle through one per second, while making as many requests as possible via each corresponding IP address. Note that, in practice, AWS will sometimes assign the same IP address to more than one Lambda function.

## Pre-requisites

You will need to have installed Terraform and Docker.

## Usage

```bash
git clone https://github.com/teticio/lambda-selenium.git
cd lambda-selenium
terraform init
terraform apply -auto-approve
# run "terraform apply -destroy -auto-approve" in the same directory to tear all this down again
```

You can specify an `AWS_PROFILE` and `AWS_REGION` with

```bash
terraform apply -auto-approve -var 'region=AWS_REGION' -var 'profile=AWS_PROFILE'
```

An example of how to use this from Python is provided in `test_selenium.py`. It runs the script in `example.py` to search for descriptions of dog breeds in Google.

```bash
AWS_DEFAULT_REGION=AWS_REGION python test_selenium.py
```

There are also examples of running the Lambda functions in parallel and asynchronously, which greatly speed up the process.

```bash
# Multi-processing (uses multiple CPU cores)
AWS_DEFAULT_REGION=AWS_REGION python test_selenium_parallel.py
# Asynchronous
AWS_DEFAULT_REGION=AWS_REGION python test_selenium_async.py
```