https://github.com/trancethehuman/entities-extraction-web-scraper
A web scraper that uses OpenAI Functions for selective scraping.
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/trancethehuman/entities-extraction-web-scraper
- Owner: trancethehuman
- Created: 2023-07-31T15:05:09.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-01T12:24:03.000Z (over 1 year ago)
- Last Synced: 2024-12-01T02:26:08.617Z (5 months ago)
- Language: Python
- Homepage:
- Size: 117 KB
- Stars: 294
- Watchers: 9
- Forks: 113
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- jimsghstars - trancethehuman/entities-extraction-web-scraper - A web scraper that uses OpenAI Functions for selective scraping. (Python)
README
# Scrape the Web with Entity Extraction Using OpenAI Functions
## What is this?
This codebase allows you to scrape any website and extract relevant data points easily using [OpenAI Functions](https://openai.com/blog/function-calling-and-other-api-updates) and [LangChain](https://python.langchain.com/docs/get_started/introduction).
Create a schema in `schemas.py`, pick a URL, and pass both to `scrape_with_playwright()` in `main.py` to start scraping.

Tip: each website keeps the bulk of its content in a few HTML tags, such as `span` in the example below. For best performance, choose a combination of tags that works for you.
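To see why the choice of tags matters, here is a minimal sketch of tag-based filtering using only the standard library. It is not this repo's implementation: `TagTextExtractor`, `extract_text_by_tags`, and `sample_html` are hypothetical names made up for illustration, and the real pipeline may filter content differently.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text found inside any of the chosen tags (illustrative helper)."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0    # number of chosen tags currently open
        self.chunks = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only while we are inside at least one chosen tag
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text_by_tags(html, tags):
    """Return the text inside the given tags, joined by single spaces."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page used only for this sketch
sample_html = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <h1>Site title</h1>
  <span>Headline one</span>
  <p>Summary of the first story.</p>
  <span>Headline two</span>
</body></html>
"""

print(extract_text_by_tags(sample_html, ["span"]))
print(extract_text_by_tags(sample_html, ["span", "p"]))
```

Fewer tags means less text sent to the model, which is cheaper but can drop fields your schema needs. Adding `p` here brings in the summary text that `span` alone leaves out.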
### Example
1. Define the schema for the website you want to scrape in `schemas.py` (either a Pydantic class or a dictionary works):
```python
from pydantic import BaseModel

class SchemaNewsWebsites(BaseModel):
    # Fields to extract for each news item
    news_headline: str
    news_short_summary: str
```

2. To start scraping, in `main.py`, run something like this:
```python
import asyncio

# Scrape the BBC homepage, keep only <span> content, and extract the schema fields
asyncio.run(scrape_with_playwright(
    url="https://www.bbc.com",
    tags=["span"],
    schema_pydantic=SchemaNewsWebsites
))
```

## Setup
### 1. Create a new Python virtual environment
`python -m venv virtual-env` or `python3 -m venv virtual-env` (Mac)
`py -m venv virtual-env` (Windows 11)
### 2. Activate virtual environment
`.\virtual-env\Scripts\activate` (Windows)
`source virtual-env/bin/activate` (Mac)
### 3. Install dependencies using Poetry
Run `poetry install --sync` or `poetry install`
### 4. Install Playwright
```bash
playwright install
```

### 5. Create a new `.env` file to store your OpenAI API key
```text
OPENAI_API_KEY=XXXXXX
```

## Usage
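Before running, it can help to confirm that the key from your `.env` file is actually visible to Python. The sketch below is a standard-library stand-in, not the loader this repo uses: `load_env_file` and `require_openai_key` are hypothetical helpers, and the parser only handles plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Variables already set in the environment are left untouched.
    Returns the key/value pairs found in the file.
    """
    env_path = Path(path)
    loaded = {}
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_openai_key():
    """Return the API key, or raise a clear error if it is missing."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

Calling `load_env_file()` and then `require_openai_key()` at the top of a script makes it fail early with a readable message, instead of partway through a scrape.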
### Run locally
```bash
python main.py
```

## Additional Information
- For ease of use, consider adding a FastAPI server on top of this to expose the scraper as an API endpoint.
- Use caution when scraping. Don't do anything I wouldn't do (illegal)
- P.S. I've added this functionality to LangChain [in this PR](https://github.com/langchain-ai/langchain/pull/8732). You can read [the official docs here](https://python.langchain.com/docs/use_cases/web_scraping#quickstart).