https://github.com/mirkotrotta/streamlit_web_scraper
This Streamlit Web Scraper extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.
https://github.com/mirkotrotta/streamlit_web_scraper
beutifulsoup markdown python scraper selenium showcase slack-bot sqllite streamlit
Last synced: 3 months ago
JSON representation
This Streamlit Web Scraper extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.
- Host: GitHub
- URL: https://github.com/mirkotrotta/streamlit_web_scraper
- Owner: mirkotrotta
- Created: 2024-09-26T19:50:52.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-07T22:48:33.000Z (8 months ago)
- Last Synced: 2025-05-16T12:16:40.590Z (5 months ago)
- Topics: beutifulsoup, markdown, python, scraper, selenium, showcase, slack-bot, sqllite, streamlit
- Language: Python
- Homepage:
- Size: 71.3 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Streamlit Web Scraper
![]()
![]()
![]()
![]()
## 🌐 A Clean & Simple Web Scraper Built With Streamlit, BeautifulSoup, and Selenium
This **Streamlit Web Scraper** extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.
## ✨ Key Features
- **Scrape Dynamic and Static Websites**: Supports JavaScript-rendered content using **Selenium** and traditional HTML scraping using **BeautifulSoup**.
- **Single Markdown File Output**: Consolidates all scraped data into one clean and structured Markdown file, organized by page and section.
- **SQLite-Based Scrape History**: Logs all scraped sessions for future access, allowing you to view or download previous scrapes at any time.
- **Slack Integration (Coming Soon)**: In an upcoming release, scraped data can be sent directly to a Slack channel for easy sharing and collaboration.
- **Cross-Device & Version-Control Friendly**: Built with **Git** in mind, enabling seamless version control and multi-device collaboration.---
## 🚀 Getting Started
### Prerequisites
Make sure you have the following installed:
- [Python 3.7+](https://www.python.org/)
- [Git](https://git-scm.com/)
- [Virtualenv](https://virtualenv.pypa.io/) (optional but recommended)### Installation
1. **Clone the repository** to your local machine:
```bash
git clone https://github.com/mirkotrotta/streamlit_web_scraper.git
cd streamlit_web_scraper
```2. **Set up a virtual environment** (optional, but recommended):
```bash
python3 -m venv venv
source venv/bin/activate
```3. **Install the required dependencies**:
```bash
pip install -r requirements.txt
```4. **Set up environment variables** (if required, e.g., for future Slack integration):
- Create a `.env` file and add any necessary environment variables, such as `SLACK_BOT_TOKEN` (for future integration).5. **Run the Streamlit App**:
```bash
streamlit run app.py
```6. Open your browser and go to `http://localhost:8501` to use the app.
---
## 🛠 How to Use
### Scraping a Website
1. **Enter the URL** of the website you want to scrape.
2. **Select if the website is dynamic** (i.e., JavaScript-heavy) or static.
3. **Click "Scrape Website"**: The scraper will retrieve text-based content from the site, organizing it by page and section into a Markdown file.
4. **Download the Markdown file**: Once the scrape is complete, download the file directly through the interface or access previously saved scrapes.### Features in Progress
- **Slack Integration**: Soon, you'll be able to send scraped data to a specified Slack channel.
- **More Framework Support**: Future experiments include integrating with additional frameworks and APIs for advanced scraping scenarios.## 🔄 Roadmap
### Planned Features
- **Slack Integration**: Scrape data and automatically send it to a Slack channel for quick collaboration.
- **API Integration**: Adding support for scraping APIs and handling authentication where required.
- **Advanced Scraping Techniques**: Experimenting with frameworks like Playwright for even better dynamic content handling.---
## 🤝 Contributing
Contributions are welcome! If you have suggestions for improvements, feel free to:
1. Fork the repository
2. Create a new branch (`git checkout -b feature/YourFeature`)
3. Commit your changes (`git commit -m 'Add YourFeature'`)
4. Push to the branch (`git push origin feature/YourFeature`)
5. Open a Pull Request---
## 📝 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---
## 👤 Author
**Mirkotrotta**
- [GitHub](https://github.com/mirkotrotta)
- [Twitter](https://twitter.com/mirkotrotta)---
## 💬 Contact
For any inquiries, questions, or feedback, feel free to open an issue or contact me.
## Development Workflow
- All new features and fixes should be made in the `dev` branch.
- After testing, merge `dev` into `main` via a pull request.
- Use feature branches (e.g., `feature-docker`) for major updates.