# Lambda Scraper

(See also [lambda-selenium](https://github.com/teticio/lambda-selenium))

Use AWS Lambda functions as an HTTPS proxy. This is a cost-effective way to get access to a large pool of IP addresses. Run the following to create as many Lambda functions as you need (one for each IP address). The number of functions, as well as the region, can be specified in `variables.tf`. Each Lambda function changes IP address after approximately 6 minutes of inactivity. For example, you could create 360 Lambda functions and cycle through them at one per second, making as many requests as possible via each corresponding IP address. Note that, in practice, AWS will sometimes assign the same IP address to more than one Lambda function.

I have rewritten this in Node.js to take advantage of streaming Lambda function URLs, so that you can make (asynchronous) proxy requests simply by prepending the proxy URL. If you are looking for the original Python version, it is available in the [`old`](old) directory.

## Pre-requisites

You will need Terraform and Docker installed.

## Usage

```bash
git clone https://github.com/teticio/lambda-scraper.git
cd lambda-scraper
terraform init
terraform apply -auto-approve
# run "terraform apply -destroy -auto-approve" in the same directory to tear all this down again
```

You can specify the AWS region and profile as well as the number of proxies in a `terraform.tfvars` file:

```terraform
num_proxies = 10
region = "eu-west-2"
profile = "default"
```

The `proxy` Lambda function forwards requests to a random `proxy-` Lambda function. To obtain its URL, run

```bash
echo $(terraform output -json | jq -r '.lambda_proxy_url.value')
```

Then you can make requests via the proxy by prepending the URL.

```bash
curl https://<url-id>.lambda-url.<region>.on.aws/ipinfo.io/ip
# or
curl https://<url-id>.lambda-url.<region>.on.aws/http://ipinfo.io/ip
```

If you make a number of cURL requests to this URL, you should see several different IP addresses. A script that does exactly this is provided in `test.sh`. You will notice that there is a cold-start latency the first time each Lambda function is invoked.
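The prepending step from the curl examples can be sketched in Python (the helper name and placeholder URL are illustrative, not part of the project):

```python
# Replace with your own proxy URL from `terraform output`
PROXY = "https://<url-id>.lambda-url.<region>.on.aws/"


def via_proxy(proxy_url: str, target_url: str) -> str:
    """Build a proxied URL by prepending the proxy URL, as in the curl examples.

    The target's scheme may be kept or dropped; the proxy accepts both forms.
    """
    return proxy_url + target_url


# Once deployed: urllib.request.urlopen(via_proxy(PROXY, "ipinfo.io/ip"))
```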

## Headers

Certain headers (`host` and those starting with `x-amz` or `x-forwarded-`) are stripped out because they interfere with the mechanism AWS uses to invoke the endpoint via HTTP. If you need these headers to be set in your request, you can do so by prefixing them with `lambda-scraper-` (e.g., `lambda-scraper-host: example.com`). A special header, `lambda-scraper-raw-query-params`, ensures the query parameters are passed straight through without being altered by encoding and decoding. Similarly, some response headers (those starting with `x-amz`) are prefixed with `lambda-scraper-` so that they can be returned without affecting the response itself.
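The request-header rewriting described above can be sketched as follows. This is a simplified model of the behaviour, not the actual `proxy.js` code; only the header names and prefixes come from this section:

```python
STRIPPED_PREFIXES = ("x-amz", "x-forwarded-")
ESCAPE_PREFIX = "lambda-scraper-"


def prepare_request_headers(headers):
    """Drop headers that would clash with AWS's invocation mechanism, and
    unescape any the caller smuggled through with the lambda-scraper- prefix."""
    out = {}
    for name, value in headers.items():
        lower = name.lower()
        if lower.startswith(ESCAPE_PREFIX):
            # lambda-scraper-host: example.com  ->  host: example.com
            out[lower[len(ESCAPE_PREFIX):]] = value
        elif lower == "host" or lower.startswith(STRIPPED_PREFIXES):
            continue  # stripped: AWS needs these for its own invocation
        else:
            out[lower] = value
    return out
```

(The special `lambda-scraper-raw-query-params` header is handled separately by the proxy and is not shown here.)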

## Authentication

Currently, the `proxy` Lambda function URL is configured to be publicly accessible, although the hash in the URL serves as a "key". The underlying `proxy-` Lambda function URLs can only be accessed directly by signing the request with the appropriate AWS credentials. If you prefer to cycle through the underlying proxy URLs explicitly and avoid going through two Lambda functions per request, examples of how to sign the request are provided in `proxy.js` and `test_with_iam.py`. The list of underlying proxy URLs created by Terraform can be found in `lambda/proxy-urls.json`.

```bash
pip install -r requirements.txt
python test_with_iam.py
```

If you decide to also enforce IAM authentication for the `proxy` Lambda function URL, it is a simple matter of changing the `authorization_type` to `AWS_IAM` in `main.tf`.
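If you do cycle through the underlying proxy URLs yourself, round-robin selection can be sketched as below. The JSON file is assumed here to contain a flat list of URL strings (check `lambda/proxy-urls.json` for the actual shape), and each direct request must still be signed as shown in `proxy.js` and `test_with_iam.py`:

```python
import itertools
import json


def load_proxy_cycle(path="lambda/proxy-urls.json"):
    """Return an infinite round-robin iterator over the underlying proxy URLs."""
    with open(path) as f:
        urls = json.load(f)  # assumed: a flat JSON list of URL strings
    return itertools.cycle(urls)


# proxies = load_proxy_cycle()
# next(proxies)  # a different proxy URL on each call, wrapping around
```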

## Concurrency

The ability to call the Lambda functions asynchronously makes numerous parallel requests possible without resorting to multi-threading, while the proxy avoids being rate limited. In Python you can use the `aiohttp` library to make asynchronous HTTP requests as follows:

```python
import asyncio

import aiohttp

# Replace with your proxy URL
PROXY = "https://<url-id>.lambda-url.<region>.on.aws/"


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url.replace("https://", PROXY)) for url in urls]
        htmls = await asyncio.gather(*tasks)
        return htmls


urls = [
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/news/uk",
]
print(asyncio.run(fetch_all(urls)))
```
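If you fetch many URLs at once, you may also want to cap how many requests are in flight at a time. A minimal sketch using `asyncio.Semaphore` (the helper is illustrative and not part of this repo):

```python
import asyncio


async def bounded_gather(coros, limit=50):
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # blocks while `limit` coroutines are already running
            return await coro

    # gather preserves input order in its results
    return await asyncio.gather(*(run(c) for c in coros))
```

You would wrap the `fetch(session, ...)` calls above with this instead of passing them straight to `asyncio.gather`.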

## "Serverless VPN" (well, almost)

It is possible to set up a local proxy server that forwards all HTTP requests (but not WebSockets) to the Lambda proxy. To do this, first create a Certificate Authority with

```bash
openssl req -x509 -new -nodes -keyout testCA.key -sha256 -days 365 -out testCA.pem -subj '/CN=Mockttp Testing CA - DO NOT TRUST'
```

Then add and trust the `testCA.pem` certificate in a browser and set the proxy host to `localhost` and port to `8080`. Add a `.env` file with the following contents:

```bash
PROXY_HOST=<url-id>.lambda-url.<region>.on.aws
```

Install the Node.js packages

```bash
cd proxy_server
npm install
cd -
```

Then run the server with

```bash
node proxy_server/app.js
```

You should now be able to navigate to a webpage in your browser, and all HTTP requests will be proxied via the Lambda function. Note that some sensitive endpoints may not work (for example, those that use a pre-signed URL). You can toggle the proxy on and off by pressing `X`.