https://github.com/darkdk123/ai-web-scraper
AI Web Scraper to scrape simple webpages using an LLM.
https://github.com/darkdk123/ai-web-scraper
hugging langchain python selenium
Last synced: 5 months ago
JSON representation
AI Web Scraper to scrape simple webpages using an LLM.
- Host: GitHub
- URL: https://github.com/darkdk123/ai-web-scraper
- Owner: DarkDk123
- Created: 2024-09-07T03:10:59.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-10T12:06:29.000Z (10 months ago)
- Last Synced: 2024-12-28T17:14:32.779Z (7 months ago)
- Topics: hugging, langchain, python, selenium
- Language: Python
- Homepage:
- Size: 24.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: Readme.md
Awesome Lists containing this project
README
# AI Web Scraper 🤖
An **AI Web Scraper** using LangChain, HuggingFace, selenium etc.
## Usage1. Install the required packages: `pip install -r requirements.txt`.
2. Set the environments variables as explained [below.](#environment-variables)
3. Run the Streamlit app: `streamlit run streamlit_main.py`.
4. Enter a URL and a description of what you want to parse from the website.
5. The app will scrape the website, extract the relevant text, and use the HuggingFace model to parse the text.## Example: Scraping Github profiles
* URL: `https://github.com/techwithtim`
* query: `Provide info about the Github profile`
## Environment Variables
The AI Web Scraper uses the following environment variables:
* `HUGGINGFACE_MODEL_ID`: The ID of the HuggingFace model to use for parsing the text.
* `UGGINGFACEHUB_API_TOKEN` : HuggingFace Hub API token.* `SBR_WEBDRIVER` (Optional for captcha support): The URL of the Bright Data Webdriver to use for solving captchas.
## Development
The AI Web Scraper is built using the following technologies:
* `streamlit`: The web app framework.
* `langchain_huggingface`: The library for using HuggingFace models in langchain.
* `langchain`: Main langchain library.
* `selenium`: The library for interacting with the browser.
* `bs4`: The library for parsing HTML.