Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/darkdk123/ai-web-scraper
An AI Web Scraper to scrape simple websites.
- Host: GitHub
- URL: https://github.com/darkdk123/ai-web-scraper
- Owner: DarkDk123
- Created: 2024-09-07T03:10:59.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-09-10T12:06:29.000Z (about 2 months ago)
- Last Synced: 2024-09-10T13:43:08.780Z (about 2 months ago)
- Language: Python
- Size: 24.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
# AI Web Scraper 🤖
An **AI Web Scraper** built with LangChain, HuggingFace, Selenium, and related libraries.
## Usage
1. Install the required packages: `pip install -r requirements.txt`.
2. Set the environment variables as explained [below](#environment-variables).
3. Run the Streamlit app: `streamlit run streamlit_main.py`.
4. Enter a URL and a description of what you want to parse from the website.
5. The app will scrape the website, extract the relevant text, and use the HuggingFace model to parse the text.
## Example: Scraping GitHub profiles
* URL: `https://github.com/techwithtim`
* Query: `Provide info about the Github profile`

![demo](./example.gif)
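
The "extract the relevant text" step from the usage list can be illustrated with a minimal sketch. The project lists `bs4` for HTML parsing; this example swaps in the standard library's `html.parser` so that it runs without extra dependencies. The class and function names (`VisibleTextExtractor`, `extract_visible_text`) are hypothetical and not taken from the repository.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes while ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        # One text node per line keeps the output easy to scan.
        return "\n".join(self._parts)


def extract_visible_text(html):
    """Return the human-visible text of an HTML document (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><head><style>p{color:red}</style></head><body><h1>Profile</h1><p>Hello</p></body></html>"
print(extract_visible_text(sample))
```

Dropping scripts and styles before the text reaches the model keeps the prompt shorter and avoids feeding it markup noise.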
## Environment Variables
The AI Web Scraper uses the following environment variables:
* `HUGGINGFACE_MODEL_ID`: The ID of the HuggingFace model to use for parsing the text.
* `HUGGINGFACEHUB_API_TOKEN`: HuggingFace Hub API token.
* `SBR_WEBDRIVER` (optional, for captcha support): The URL of the Bright Data Webdriver to use for solving captchas.
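
As a rough illustration, the variables above could be read and validated once at startup. This sketch assumes the token variable is spelled `HUGGINGFACEHUB_API_TOKEN` (the name LangChain's HuggingFace integration reads) and treats `SBR_WEBDRIVER` as optional. `ScraperSettings` and `load_settings` are hypothetical names, and the values in the usage lines are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these two variables are required, SBR_WEBDRIVER is optional.
REQUIRED_VARS = ("HUGGINGFACE_MODEL_ID", "HUGGINGFACEHUB_API_TOKEN")


@dataclass(frozen=True)
class ScraperSettings:
    model_id: str
    api_token: str
    sbr_webdriver: Optional[str] = None  # only needed for captcha support


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Read settings from a mapping (the process environment by default)."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return ScraperSettings(
        model_id=env["HUGGINGFACE_MODEL_ID"],
        api_token=env["HUGGINGFACEHUB_API_TOKEN"],
        sbr_webdriver=env.get("SBR_WEBDRIVER") or None,
    )


# Usage with placeholder values; call load_settings() with no arguments
# to read the real process environment instead.
settings = load_settings(
    {
        "HUGGINGFACE_MODEL_ID": "example-org/example-model",
        "HUGGINGFACEHUB_API_TOKEN": "hf_placeholder",
    }
)
print(settings.model_id)
```

Failing fast with the names of the missing variables tends to be easier to debug than an authentication error raised later, in the middle of a scrape.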
## Development
The AI Web Scraper is built using the following technologies:
* `streamlit`: The web app framework.
* `langchain_huggingface`: The library for using HuggingFace models in LangChain.
* `langchain`: The main LangChain library.
* `selenium`: The library for interacting with the browser.
* `bs4`: The library for parsing HTML.
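
The README does not describe how scraped text is handed to the model, so the following is only a sketch of one plausible arrangement: split the page text into chunks, build a prompt per chunk, and join the non-empty answers. The chunk size, the prompt wording, and the function names are all assumptions. `model` is any callable from prompt string to answer string, standing in for the HuggingFace model that the real app would create through `langchain_huggingface`.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 6000) -> List[str]:
    """Cut text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_prompt(chunk: str, query: str) -> str:
    """Assumed prompt wording; the repository's actual prompt may differ."""
    return (
        "Extract only the information that matches the description below.\n"
        f"Description: {query}\n"
        f"Page text:\n{chunk}\n"
        "If nothing matches, return an empty string."
    )


def parse_text(
    text: str,
    query: str,
    model: Callable[[str], str],
    max_chars: int = 6000,
) -> str:
    """Run the model over each chunk and join the non-empty answers."""
    answers = []
    for chunk in split_into_chunks(text, max_chars):
        answer = model(build_prompt(chunk, query)).strip()
        if answer:
            answers.append(answer)
    return "\n".join(answers)


# Usage with a stand-in model that just reports the prompt length.
def fake_model(prompt: str) -> str:
    return f"prompt of {len(prompt)} characters"


print(parse_text("some scraped page text", "Provide info about the profile", fake_model))
```

Taking the model as a plain callable keeps this logic testable without network access or an API token.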