https://github.com/brianlesko/web-scraper

A web scraping app: paste a URL and download the text or links found on that website.

# Web Scraping
This code implements a web scraper that saves a webpage's text and a list of its links as downloadable files. Although this is a simple implementation, similar approaches are used to collect internet data for training machine learning models such as OpenAI's ChatGPT. The implementation is written in pure Python and was created for learning purposes.

 

 

## Dependencies

This code uses the following libraries:
- `streamlit`: for building the user interface.
- `numpy`: for creating arrays.
- `pandas`: for creating dataframes.
- `bs4`: for picking the text out of a webpage's HTML code, a process known as parsing.
- `requests`: for retrieving the HTML of a webpage.
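
As a rough illustration of how the array and dataframe libraries could fit together, here is a minimal sketch. It is an assumption, not code taken from `app.py`: it supposes the scraped links are de-duplicated with `numpy` and exported as CSV with `pandas`. The helper name `links_to_csv` is hypothetical.

```python
import numpy as np
import pandas as pd


def links_to_csv(links: list[str]) -> str:
    """Return the links as CSV text with a single 'link' column (hypothetical helper)."""
    # np.unique sorts the values and removes duplicates.
    unique_links = np.unique(np.array(links, dtype=str))

    # A one-column dataframe keeps the export step to a single call.
    frame = pd.DataFrame({"link": unique_links})

    # With no path argument, to_csv returns the CSV as a string.
    return frame.to_csv(index=False)
```

A string like this could then be offered to the user as a file download.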

 

## Usage

Run the following commands in your terminal:
```
python3 -m venv my_env
source my_env/bin/activate # macOS or Linux
.\my_env\Scripts\activate # Windows (use instead of the line above)
pip install --upgrade streamlit numpy pandas bs4 requests
streamlit run https://raw.githubusercontent.com/BrianLesko/web-scraper/main/app.py
```

This starts the Streamlit server. To use the web scraper, open a web browser and navigate to `http://localhost:8501`.

 

## How it Works

The web scraper works as follows:
1. The user enters a URL in the input field.
2. `requests` retrieves the HTML of the page at the user's URL.
3. `bs4` parses that HTML into text and links.
4. The app displays some information about the text it parsed.
5. The option to download the text or links appears.
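
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `app.py`: the function names `fetch_html`, `parse_page`, and `summarize` are hypothetical, the 10-second timeout is an arbitrary choice, and removing `script` and `style` elements before extracting text is one common approach that the app may or may not take.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 2: download the raw HTML, raising an exception on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_page(html: str, base_url: str = "") -> tuple[str, list[str]]:
    """Step 3: split HTML into visible text and a list of link URLs."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents stay out of the text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # One line per non-empty text fragment, with surrounding whitespace stripped.
    text = "\n".join(soup.stripped_strings)

    # Resolve each href against the page URL so relative links become absolute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links


def summarize(text: str, links: list[str]) -> dict[str, int]:
    """Step 4: a few basic counts that a UI could display."""
    return {
        "characters": len(text),
        "words": len(text.split()),
        "links": len(links),
    }
```

For a given `url`, calling `parse_page(fetch_html(url), base_url=url)` returns the text and the links. For step 5, the text and `"\n".join(links)` are plain strings, which a Streamlit app could hand to `st.download_button` as file contents.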

 

## Repository Structure
```
web-scraper/
├── .streamlit/
│ └── config.toml # theme info for the UI
├── docs/
│ └── preview.png
├── app.py # the code and UI integrated together live here
├── customize_gui # for adding gui elements like the about sidebar
├── requirements.txt # the python packages needed to run locally
└── .gitignore # includes the local virtual environment named my_env
```

 

## Topics
```
Python | Streamlit | Git | Low Code UI
Chat interface | Web scraping | HTML Parsing
Self taught coding | Mechanical engineer | Robotics engineer
```
 


 

╭━━╮╭━━━┳━━┳━━━┳━╮╱╭╮ ╭╮╱╱╭━━━┳━━━┳╮╭━┳━━━╮
┃╭╮┃┃╭━╮┣┫┣┫╭━╮┃┃╰╮┃┃ ┃┃╱╱┃╭━━┫╭━╮┃┃┃╭┫╭━╮┃
┃╰╯╰┫╰━╯┃┃┃┃┃╱┃┃╭╮╰╯┃ ┃┃╱╱┃╰━━┫╰━━┫╰╯╯┃┃╱┃┃
┃╭━╮┃╭╮╭╯┃┃┃╰━╯┃┃╰╮┃┃ ┃┃╱╭┫╭━━┻━━╮┃╭╮┃┃┃╱┃┃
┃╰━╯┃┃┃╰┳┫┣┫╭━╮┃┃╱┃┃┃ ┃╰━╯┃╰━━┫╰━╯┃┃┃╰┫╰━╯┃
╰━━━┻╯╰━┻━━┻╯╱╰┻╯╱╰━╯ ╰━━━┻━━━┻━━━┻╯╰━┻━━━╯

 

X | GitHub | LinkedIn

follow all of these or i will kick you