https://github.com/raznem/parsera

Lightweight library for scraping web-sites with LLMs

ai ai-scraping data-extraction llm opensource playwright python scraping webscraping

Last synced: 3 months ago


Host: GitHub
URL: https://github.com/raznem/parsera
Owner: raznem
License: gpl-2.0
Created: 2024-08-12T13:04:33.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-03-19T15:35:27.000Z (3 months ago)
Last Synced: 2025-04-03T22:05:42.497Z (3 months ago)
Topics: ai, ai-scraping, data-extraction, llm, opensource, playwright, python, scraping, webscraping
Language: Python
Homepage: https://docs.parsera.org
Size: 1.83 MB
Stars: 1,059
Watchers: 16
Forks: 63
Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE

Awesome Lists containing this project

alan_awesome_llm - Parsera - sites with LLMs. (Data)
awesome-hacking-lists - raznem/parsera - Lightweight library for scraping web-sites with LLMs (Python)
StarryDivineSky - raznem/parsera
awesome-LLM-resources - Parsera - sites with LLMs. (Data)

README

# 📦 Parsera

[![Discord](https://img.shields.io/badge/Discord-7289da?style=for-the-badge)](https://discord.gg/gYXwgQaT7p)

[![Downloads](https://img.shields.io/pepy/dt/parsera?style=for-the-badge)](https://pepy.tech/project/parsera)



Lightweight Python library for scraping websites with LLMs. 

You can try it out on the [Parsera website](https://parsera.org).

## Why Parsera?

Because it's simple and lightweight, with an interface as simple as:

```python
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
```

## Table of Contents

- [Installation](#Installation)
- [Documentation](#Documentation)
- [Basic usage](#Basic-usage)
- [Running with Jupyter Notebook](#Running-with-Jupyter-Notebook)
- [Running with CLI](#Running-with-CLI)
- [Running in Docker](#Running-in-Docker)

## Installation

```shell
pip install parsera
playwright install
```

## Documentation

Check out the [documentation](https://docs.parsera.org) to learn more about other features, such as running custom models and Playwright scripts.

## Basic usage

First, set the `PARSERA_API_KEY` environment variable. (To run a custom LLM instead, see [Custom Models](https://docs.parsera.org/features/custom-models/).)

You can do this from Python with:

```python
import os

os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"
```
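
A script that depends on the key can fail early with a clear message when the variable is missing. A minimal sketch using only the standard library; the helper name `require_api_key` is hypothetical and not part of Parsera:

```python
import os


def require_api_key(name: str = "PARSERA_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of Parsera.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the scraper")
    return key
```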

Next, run a basic example:

```python
from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)
```

The `result` variable will contain JSON with a list of records:

```json
[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
   ...
]
```
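
The sample record above carries `Points` and `Comments` as strings, so sorting or arithmetic needs a conversion step first. A minimal sketch, assuming `result` is a list of dicts shaped like the sample; the helper `to_int_fields` is hypothetical and not part of Parsera, and values that don't parse as integers are left unchanged:

```python
def to_int_fields(records, fields=("Points", "Comments")):
    """Return copies of the records with the named fields converted to int where possible."""
    converted = []
    for record in records:
        row = dict(record)  # copy so the original result is not mutated
        for field in fields:
            try:
                row[field] = int(row[field])
            except (KeyError, TypeError, ValueError):
                pass  # missing or non-numeric value: keep whatever was there
        converted.append(row)
    return converted


# Sample record copied from the output shown above.
sample = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    }
]
rows = to_int_fields(sample)
```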

An async `arun` method is also available:

```python
result = await scraper.arun(url=url, elements=elements)
```
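
In a plain script, `await` has to run inside an event loop. A minimal sketch of one way to do that while scraping several pages concurrently with `asyncio.gather`; the helper `scrape_many` is hypothetical, and the only Parsera call it assumes is `arun(url=..., elements=...)` as shown above:

```python
import asyncio


async def scrape_many(scraper, urls, elements):
    """Run scraper.arun for every URL concurrently and return {url: result}.

    `scraper` is any object with an async `arun(url=..., elements=...)` method,
    such as a `Parsera()` instance. Hypothetical helper, not part of Parsera.
    """
    tasks = [scraper.arun(url=url, elements=elements) for url in urls]
    results = await asyncio.gather(*tasks)  # results keep the order of `urls`
    return dict(zip(urls, results))
```

From a script this could be called as `asyncio.run(scrape_many(Parsera(), urls, elements))`. Whether one `Parsera` instance can safely serve concurrent `arun` calls is not confirmed here; if sharing an instance causes problems, create one scraper per URL or await the calls one at a time.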

## Running with Jupyter Notebook

Either place this code at the beginning of your notebook:

```python
import nest_asyncio

nest_asyncio.apply()
```

Or, instead of calling the `run` method, use the async `arun` method.

## Running with CLI

Before running `Parsera` as a command-line tool, put your `OPENAI_API_KEY` in your environment variables or in a `.env` file.

### Usage

You can specify the elements to parse either as a JSON string (`--scheme`) or in a `FILE` (`--file`).

Optionally, you can provide a `FILE` to write the output to (`--output`) and the number of `SCROLLS` to perform on the page (`--scrolls`).

```sh
python -m parsera.main URL {--scheme '{"title":"h1"}' | --file FILENAME} [--scrolls SCROLLS] [--output FILENAME]
```
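
To use `--file`, the element descriptions have to be saved to disk first. A minimal sketch that writes them with `json.dumps` and assembles the command from the synopsis above; it assumes the file holds the same JSON object that `--scheme` accepts inline (not confirmed here), and the helper `build_command` and its parameter names are hypothetical:

```python
import json
import sys
from pathlib import Path


def build_command(url, elements, scheme_path, output_path=None, scrolls=None):
    """Write `elements` to `scheme_path` and return an argv list for parsera.main."""
    # Assumption: --file expects the same JSON object that --scheme takes inline.
    Path(scheme_path).write_text(json.dumps(elements, indent=2), encoding="utf-8")
    command = [sys.executable, "-m", "parsera.main", url, "--file", str(scheme_path)]
    if scrolls is not None:
        command += ["--scrolls", str(scrolls)]
    if output_path is not None:
        command += ["--output", str(output_path)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues with the JSON.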

## Running in Docker

If you run into issues with your local environment, you can run Parsera with Docker; see the [documentation](https://docs.parsera.org/features/docker/).
