Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/smallstepman/puppeteer-real-browser-dockerized

Last synced: about 1 month ago
JSON representation

Host: GitHub
URL: https://github.com/smallstepman/puppeteer-real-browser-dockerized
Owner: smallstepman
Created: 2024-10-30T14:53:08.000Z (2 months ago)
Default Branch: main
Last Pushed: 2024-11-26T23:06:22.000Z (about 1 month ago)
Last Synced: 2024-11-27T00:20:52.583Z (about 1 month ago)
Language: JavaScript
Size: 19.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # Puppeteer Real Browser Scraper

This project allows you to scrape web pages using Puppeteer wrapped in a Docker container. It supports two modes: **Command-Line Mode** & **HTTP Server Mode**

## Installation 

Pull the Docker Image:

```bash

docker pull ghcr.io/smallstepman/puppeteer-real-browser-dockerized:latest

```

## Usage

### Command-Line Mode

Scrape a URL and output the HTML to the console:

Usage (console):

```bash

docker run ghcr.io/smallstepman/puppeteer-real-browser-dockerized:latest http://example.com

```

Usage (python):

```python

import subprocess

def scrape_url(url):

    try:

        result = subprocess.run(

            ['docker', 'run', 'ghcr.io/smallstepman/puppeteer-real-browser-dockerized:latest', url],

            stdout=subprocess.PIPE,

            stderr=subprocess.PIPE,

            text=True,

            check=True,

            timeout=5

        )

    except subprocess.CalledProcessError as e:

        print(e)  

    else: 

        html_output = result.stdout

        return html_output

print(scrape_url('http://example.com'))

```

### HTTP Server Mode

Run an HTTP server that accepts POST requests to scrape URLs:

```bash

docker run -p 3000:3000 ghcr.io/smallstepman/puppeteer-real-browser-dockerized:latest serve

```

Usage (console):

```bash

curl -X POST -H "Content-Type: application/json" -d '{"url":"http://example.com"}' http://localhost:3000/scrape

```

Usage (python):

```python

import requests

def scrape_url_via_api(url):

    response = requests.post('http://localhost:3000/scrape', json={'url': url})

    response.raise_for_status()

    return response.text

print(scrape_url_via_api('http://example.com'))

```