https://github.com/fschaeffler/lambda-web-scrapper



# AWS Lambda Web Scrapper

This web scraper runs on AWS Lambda and uses Puppeteer to drive a headless Chromium browser.

## Development Setup

- install Node.js 8.10, the runtime version used by the Serverless service

- install yarn from https://yarnpkg.com

- install dependencies via `yarn install`

- install serverless globally via `yarn global add serverless`

## Run (Local)

- run the service via `yarn serve`

## Deploy (Production)

- deploy the Serverless service via `yarn deploy`

- note the API key shown in the deployment output

## Usage (Production)

The actual endpoint URL is displayed in the deployment output. It looks like `https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`.

To ensure that only authorized clients can call this API, every call requires an API key. Pass the key in the `x-api-key` HTTP request header.

An example call to the endpoint looks like this.

Content of `request.json`:

```
{
  "url": "https://www.sfit.services",
  "xpathes": [
    "//*/div/div/div/a",
    "//*/div/div/div"
  ],
  "uniqueResults": true
}
```

`curl -X POST -H "x-api-key: API-SECRET-KEY" -d @request.json https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape`
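
The same call can be made from Node.js. The following is a minimal sketch, not code from this repository. It assumes Node.js 18 or later on the client machine (for the global `fetch`), which is independent of the Node.js 8.10 runtime used by the service. `buildScrapeRequest` and `callScraper` are hypothetical helper names, and the `Content-Type` header is an addition that the `curl` call above does not send. The response format is not documented here, so the body is returned as plain text.

```javascript
// Hypothetical helper: assembles the URL and fetch options for a scrape call.
function buildScrapeRequest(endpoint, apiKey, options) {
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: {
        "x-api-key": apiKey, // API key from the deployment output
        "Content-Type": "application/json", // assumption: not part of the curl example
      },
      body: JSON.stringify(options),
    },
  };
}

// Hypothetical helper: sends the request and returns the raw response body.
async function callScraper(endpoint, apiKey, options) {
  const { url, init } = buildScrapeRequest(endpoint, apiKey, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Scrape request failed with HTTP status ${response.status}`);
  }
  return response.text();
}

// Example usage (replace the GATEWAY-ID and API-SECRET-KEY placeholders first):
// callScraper(
//   "https://GATEWAY-ID.execute-api.eu-central-1.amazonaws.com/prod/scrape",
//   "API-SECRET-KEY",
//   { url: "https://www.sfit.services", xpathes: ["//*/div/div/div/a"], uniqueResults: true }
// ).then(console.log).catch(console.error);
```

Keeping the request assembly separate from the network call makes it possible to inspect the headers and body without contacting the endpoint.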

### Request Options

The request options are passed as JSON in the request body. The data must adhere to the following structure.

- `url`: [string] The URL to scrape

- `xpathes`: [array of strings] A list of XPath expressions whose matching content should be retrieved

- `uniqueResults` (optional): [boolean] When set to `true`, only unique results are returned
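
A client can check a request body against this structure before sending it. The following is a minimal sketch under the assumption that the three fields above are the complete set of options; `validateScrapeOptions` is a hypothetical helper, not a function provided by this service, and the error messages are illustrative.

```javascript
// Hypothetical validator: returns a list of problems, empty when the options are valid.
function validateScrapeOptions(options) {
  if (options === null || typeof options !== "object" || Array.isArray(options)) {
    return ["request body must be a JSON object"];
  }
  const errors = [];
  // `url` is required and must be a string.
  if (typeof options.url !== "string" || options.url.length === 0) {
    errors.push("`url` must be a non-empty string");
  }
  // `xpathes` is required and must contain only strings.
  if (!Array.isArray(options.xpathes) || !options.xpathes.every((x) => typeof x === "string")) {
    errors.push("`xpathes` must be an array of strings");
  }
  // `uniqueResults` is optional, but must be a boolean when present.
  if (options.uniqueResults !== undefined && typeof options.uniqueResults !== "boolean") {
    errors.push("`uniqueResults` must be a boolean when present");
  }
  return errors;
}

// A body matching the documented structure produces no errors.
console.log(validateScrapeOptions({
  url: "https://www.sfit.services",
  xpathes: ["//*/div/div/div/a", "//*/div/div/div"],
  uniqueResults: true,
})); // prints []
```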