Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/brianlesko/web-scraper
a web scraping app, paste a URL and download the text or links on the website
- Host: GitHub
- URL: https://github.com/brianlesko/web-scraper
- Owner: BrianLesko
- License: mit
- Created: 2023-11-17T22:54:46.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-23T19:53:06.000Z (5 months ago)
- Last Synced: 2024-06-23T20:50:53.854Z (5 months ago)
- Language: Python
- Size: 6.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE.md
README
# Web Scraping
This code implements a web scraper that turns a webpage into a text file and a list of links. Although this is a simple implementation, similar approaches are used to gather internet data for training AI models, especially machine learning models like OpenAI's ChatGPT. This implementation is written in pure Python. Created for learning purposes.
## Dependencies
This code uses the following libraries:
- `streamlit`: for building the user interface.
- `numpy`: for creating arrays.
- `pandas`: for creating dataframes.
- `bs4`: for extracting the text and links from a webpage's HTML, a process known as parsing.
- `requests`: for retrieving the HTML of a webpage.
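As a minimal sketch of the parsing step, this is roughly how `bs4` pulls text and links out of HTML (using a hard-coded snippet in place of a fetched page; the names here are illustrative, not the repo's actual code):

```python
from bs4 import BeautifulSoup

# In the app, this string would come from requests.get(url).text
html = """
<html><body>
  <h1>Example Page</h1>
  <p>Some paragraph text.</p>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)             # all visible text
links = [a["href"] for a in soup.find_all("a", href=True)]  # every href

print(text)   # Example Page Some paragraph text. First link Second link
print(links)  # ['https://example.com/a', 'https://example.com/b']
```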
## Usage
Run the following commands in your terminal:
```
python3 -m venv my_env
source my_env/bin/activate # Mac OS or Linux
.\my_env\Scripts\activate # Windows
pip install --upgrade streamlit numpy pandas bs4 requests
streamlit run https://raw.githubusercontent.com/BrianLesko/web-scraper/main/app.py
```
This will start the Streamlit server. Access the app by opening a web browser and navigating to `http://localhost:8501`.
## How it Works
The web scraper works as follows:
1. The user enters a URL in the input field.
2. Requests retrieves the relevant HTML based on the user's URL.
3. bs4 parses the HTML code that makes up the website into text and links.
4. The app displays some information about the parsed text.
5. The option to download the text or links appears.
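The steps above can be sketched with two hypothetical helpers (not the repo's actual functions): a pure parser that works on any HTML string, and a fetcher that wires `requests` in front of it. `pandas` turns the links into a dataframe, a convenient shape for display and download in Streamlit:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def parse_page(html: str):
    """Step 3: split HTML into visible text and a dataframe of links."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    links = pd.DataFrame(
        [{"text": a.get_text(strip=True), "href": a["href"]}
         for a in soup.find_all("a", href=True)]
    )
    return text, links

def scrape(url: str):
    """Step 2: retrieve the HTML for the user's URL, then parse it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return parse_page(response.text)
```

Keeping the parser separate from the fetch means it can be tested on static HTML without touching the network.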
## Repository Structure
```
web-scraper/
├── .streamlit/
│ └── config.toml # theme info for the UI
├── docs/
│ └── preview.png
├── app.py # the code and UI integrated together live here
├── customize_gui # for adding gui elements like the about sidebar
├── requirements.txt # the python packages needed to run locally
└── .gitignore # includes the local virtual environment named my_env
```
## Topics
```
Python | Streamlit | Git | Low Code UI
Chat interface | Web scraping | HTML Parsing
Self taught coding | Mechanical engineer | Robotics engineer
```
╭━━╮╭━━━┳━━┳━━━┳━╮╱╭╮ ╭╮╱╱╭━━━┳━━━┳╮╭━┳━━━╮
┃╭╮┃┃╭━╮┣┫┣┫╭━╮┃┃╰╮┃┃ ┃┃╱╱┃╭━━┫╭━╮┃┃┃╭┫╭━╮┃
┃╰╯╰┫╰━╯┃┃┃┃┃╱┃┃╭╮╰╯┃ ┃┃╱╱┃╰━━┫╰━━┫╰╯╯┃┃╱┃┃
┃╭━╮┃╭╮╭╯┃┃┃╰━╯┃┃╰╮┃┃ ┃┃╱╭┫╭━━┻━━╮┃╭╮┃┃┃╱┃┃
┃╰━╯┃┃┃╰┳┫┣┫╭━╮┃┃╱┃┃┃ ┃╰━╯┃╰━━┫╰━╯┃┃┃╰┫╰━╯┃
╰━━━┻╯╰━┻━━┻╯╱╰┻╯╱╰━╯ ╰━━━┻━━━┻━━━┻╯╰━┻━━━╯