Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/brianlesko/web-scraper
a web scraping app, paste a URL and download the text or links on the website
- Host: GitHub
- URL: https://github.com/brianlesko/web-scraper
- Owner: BrianLesko
- License: mit
- Created: 2023-11-17T22:54:46.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-23T19:53:06.000Z (5 months ago)
- Last Synced: 2024-06-23T20:50:53.854Z (5 months ago)
- Language: Python
- Size: 6.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE.md
README
# Web Scraping
This code implements a web scraper that turns a webpage into a text file and a list of links. Although this is a simple implementation, similar approaches are used to gather internet data for training AI models, especially machine learning models like OpenAI's ChatGPT. This implementation is written in pure Python. Created for learning purposes.
## Dependencies
This code uses the following libraries:
- `streamlit`: for building the user interface.
- `numpy`: for creating arrays.
- `pandas`: for creating dataframes.
- `bs4`: for extracting the text and links from a webpage's HTML, a process known as parsing.
- `requests`: for retrieving the HTML of a webpage.
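As a minimal sketch of the parsing step, this is roughly how `bs4` pulls text and links out of HTML (using a hard-coded snippet in place of a fetched page; the names here are illustrative, not the repo's actual code):

```python
from bs4 import BeautifulSoup

# In the app, this string would come from requests.get(url).text
html = """
<html><body>
  <h1>Example Page</h1>
  <p>Some paragraph text.</p>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)             # all visible text
links = [a["href"] for a in soup.find_all("a", href=True)]  # every href

print(text)   # Example Page Some paragraph text. First link Second link
print(links)  # ['https://example.com/a', 'https://example.com/b']
```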
## Usage
Run the following commands in your terminal:
```
python3 -m venv my_env
source my_env/bin/activate # Mac OS or Linux
.\my_env\Scripts\activate # Windows
pip install --upgrade streamlit numpy pandas bs4 requests
streamlit run https://raw.githubusercontent.com/BrianLesko/web-scraper/main/app.py
```
This will start the Streamlit server. Access the app by opening a web browser and navigating to `http://localhost:8501`.
## How it Works
The web scraper works as follows:
1. The user enters a URL in the input field.
2. Requests retrieves the relevant HTML based on the user's URL.
3. bs4 parses the HTML code that makes up the website into text and links.
4. The app displays some information about the parsed text.
5. The option to download the text or links appears.
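The steps above can be sketched with two hypothetical helpers (not the repo's actual functions): a pure parser that works on any HTML string, and a fetcher that wires `requests` in front of it. `pandas` turns the links into a dataframe, a convenient shape for display and download in Streamlit:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def parse_page(html: str):
    """Step 3: split HTML into visible text and a dataframe of links."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    links = pd.DataFrame(
        [{"text": a.get_text(strip=True), "href": a["href"]}
         for a in soup.find_all("a", href=True)]
    )
    return text, links

def scrape(url: str):
    """Step 2: retrieve the HTML for the user's URL, then parse it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return parse_page(response.text)
```

Keeping the parser separate from the fetch means it can be tested on static HTML without touching the network.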
## Repository Structure
```
web-scraper/
├── .streamlit/
│ └── config.toml # theme info for the UI
├── docs/
│ └── preview.png
├── app.py # the code and UI integrated together live here
├── customize_gui # for adding gui elements like the about sidebar
├── requirements.txt # the python packages needed to run locally
└── .gitignore # includes the local virtual environment named my_env
```
## Topics
```
Python | Streamlit | Git | Low Code UI
Chat interface | Web scraping | HTML Parsing
Self taught coding | Mechanical engineer | Robotics engineer
```
╭━━╮╭━━━┳━━┳━━━┳━╮╱╭╮ ╭╮╱╱╭━━━┳━━━┳╮╭━┳━━━╮
┃╭╮┃┃╭━╮┣┫┣┫╭━╮┃┃╰╮┃┃ ┃┃╱╱┃╭━━┫╭━╮┃┃┃╭┫╭━╮┃
┃╰╯╰┫╰━╯┃┃┃┃┃╱┃┃╭╮╰╯┃ ┃┃╱╱┃╰━━┫╰━━┫╰╯╯┃┃╱┃┃
┃╭━╮┃╭╮╭╯┃┃┃╰━╯┃┃╰╮┃┃ ┃┃╱╭┫╭━━┻━━╮┃╭╮┃┃┃╱┃┃
┃╰━╯┃┃┃╰┳┫┣┫╭━╮┃┃╱┃┃┃ ┃╰━╯┃╰━━┫╰━╯┃┃┃╰┫╰━╯┃
╰━━━┻╯╰━┻━━┻╯╱╰┻╯╱╰━╯ ╰━━━┻━━━┻━━━┻╯╰━┻━━━╯