{"id":28193136,"url":"https://github.com/mirkotrotta/streamlit_web_scraper","last_synced_at":"2025-07-06T05:32:42.229Z","repository":{"id":276387496,"uuid":"863724953","full_name":"mirkotrotta/streamlit_web_scraper","owner":"mirkotrotta","description":"This Streamlit Web Scraper extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.","archived":false,"fork":false,"pushed_at":"2025-02-07T22:48:33.000Z","size":73,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-16T12:16:40.590Z","etag":null,"topics":["beutifulsoup","markdown","python","scraper","selenium","showcase","slack-bot","sqllite","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mirkotrotta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-26T19:50:52.000Z","updated_at":"2025-04-16T14:58:47.000Z","dependencies_parsed_at":"2025-02-07T23:22:58.542Z","dependency_job_id":"217513b7-8825-4740-9205-88d54e9c1247","html_url":"https://github.com/mirkotrotta/streamlit_web_scraper","commit_stats":null,"previous_names":["mirkotrotta/streamlit_web_scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mirkotrotta/streamlit_web_scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirkotrotta%2Fstreamlit_web_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirkotrotta%2Fstreamlit_web_scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirkotrotta%2Fstreamlit_web_scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirkotrotta%2Fstreamlit_web_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mirkotrotta","download_url":"https://codeload.github.com/mirkotrotta/streamlit_web_scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mirkotrotta%2Fstreamlit_web_scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263853361,"owners_count":23520128,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beutifulsoup","markdown","python","scraper","selenium","showcase","slack-bot","sqllite","streamlit"],"created_at":"2025-05-16T12:16:44.470Z","updated_at":"2025-07-06T05:32:42.213Z","avatar_url":"https://github.com/mirkotrotta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streamlit Web Scraper\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Python-3.7%2B-blue\" alt=\"Python Version\"\u003e\n  \u003cimg src=\"https://img.shields.io/github/license/mirkotrotta/streamlit_web_scraper\" alt=\"License\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Streamlit-%F0%9F%93%88%20Web%20App-success\" alt=\"Streamlit Web App\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Coming%20Soon-Slack%20Integration-orange\" alt=\"Coming Soon\"\u003e\n\u003c/p\u003e\n\n## 🌐 A Clean \u0026 Simple Web Scraper Built With Streamlit, BeautifulSoup, and Selenium\n\nThis **Streamlit Web Scraper** extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.\n\n## ✨ Key Features\n\n- **Scrape Dynamic and Static Websites**: Supports JavaScript-rendered content using **Selenium** and traditional HTML scraping using **BeautifulSoup**.\n- **Single Markdown File Output**: Consolidates all scraped data into one clean and structured Markdown file, organized by page and section.\n- **SQLite-Based Scrape History**: Logs all scraped sessions for future access, allowing you to view or download previous scrapes at any time.\n- **Slack Integration (Coming Soon)**: In an upcoming release, scraped data can be sent directly to a Slack channel for easy sharing and collaboration.\n- **Cross-Device \u0026 Version-Control Friendly**: Built with **Git** in mind, enabling seamless version control and multi-device collaboration.\n\n---\n\n## 🚀 Getting Started\n\n### Prerequisites\n\nMake sure you have the following installed:\n\n- [Python 3.7+](https://www.python.org/)\n- [Git](https://git-scm.com/)\n- [Virtualenv](https://virtualenv.pypa.io/) (optional but recommended)\n\n### Installation\n\n1. **Clone the repository** to your local machine:\n   ```bash\n   git clone https://github.com/mirkotrotta/streamlit_web_scraper.git\n   cd streamlit_web_scraper\n   ```\n\n2. **Set up a virtual environment** (optional, but recommended):\n   ```bash\n   python3 -m venv venv\n   source venv/bin/activate\n   ```\n\n3. **Install the required dependencies**:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. **Set up environment variables** (if required, e.g., for future Slack integration):\n   - Create a `.env` file and add any necessary environment variables, such as `SLACK_BOT_TOKEN` (for future integration).\n\n5. **Run the Streamlit App**:\n   ```bash\n   streamlit run app.py\n   ```\n\n6. Open your browser and go to `http://localhost:8501` to use the app.\n\n---\n\n## 🛠 How to Use\n\n### Scraping a Website\n1. **Enter the URL** of the website you want to scrape.\n2. **Select if the website is dynamic** (i.e., JavaScript-heavy) or static.\n3. **Click \"Scrape Website\"**: The scraper will retrieve text-based content from the site, organizing it by page and section into a Markdown file.\n4. **Download the Markdown file**: Once the scrape is complete, download the file directly through the interface or access previously saved scrapes.\n\n### Features in Progress\n- **Slack Integration**: Soon, you'll be able to send scraped data to a specified Slack channel.\n- **More Framework Support**: Future experiments include integrating with additional frameworks and APIs for advanced scraping scenarios.\n\n## 🔄 Roadmap\n\n### Planned Features\n- **Slack Integration**: Scrape data and automatically send it to a Slack channel for quick collaboration.\n- **API Integration**: Adding support for scraping APIs and handling authentication where required.\n- **Advanced Scraping Techniques**: Experimenting with frameworks like Playwright for even better dynamic content handling.\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! If you have suggestions for improvements, feel free to:\n\n1. Fork the repository\n2. Create a new branch (`git checkout -b feature/YourFeature`)\n3. Commit your changes (`git commit -m 'Add YourFeature'`)\n4. Push to the branch (`git push origin feature/YourFeature`)\n5. Open a Pull Request\n\n---\n\n## 📝 License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n\n---\n\n## 👤 Author\n\n**Mirkotrotta**  \n- [GitHub](https://github.com/mirkotrotta)  \n- [Twitter](https://twitter.com/mirkotrotta)\n\n---\n\n## 💬 Contact\n\nFor any inquiries, questions, or feedback, feel free to open an issue or contact me.\n\n## Development Workflow\n\n- All new features and fixes should be made in the `dev` branch.\n- After testing, merge `dev` into `main` via a pull request.\n- Use feature branches (e.g., `feature-docker`) for major updates.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirkotrotta%2Fstreamlit_web_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmirkotrotta%2Fstreamlit_web_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmirkotrotta%2Fstreamlit_web_scraper/lists"}