{"id":24154864,"url":"https://github.com/manishpjha/web-scraper-analyzer","last_synced_at":"2026-04-12T19:57:06.715Z","repository":{"id":271552350,"uuid":"913806197","full_name":"ManishPJha/web-scraper-analyzer","owner":"ManishPJha","description":"This is a Python-based web scraping and data analysis application built with Streamlit. It allows users to scrape data from websites using sitemap URLs, export the scraped data to CSV or JSON, analyze the data, and reset the database.","archived":false,"fork":false,"pushed_at":"2025-01-08T12:21:56.000Z","size":5,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-08T13:39:48.690Z","etag":null,"topics":["fine-tuning","matplotlib","pandas","python","sqlite","streamlit","training-dataset","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ManishPJha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-08T11:42:39.000Z","updated_at":"2025-01-08T12:21:59.000Z","dependencies_parsed_at":"2025-01-09T00:52:37.122Z","dependency_job_id":null,"html_url":"https://github.com/ManishPJha/web-scraper-analyzer","commit_stats":null,"previous_names":["manishpjha/web-scraper-analyzer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManishPJha%2Fweb-scraper-analyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManishPJha%2Fweb-scraper-analyzer/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManishPJha%2Fweb-scraper-analyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManishPJha%2Fweb-scraper-analyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ManishPJha","download_url":"https://codeload.github.com/ManishPJha/web-scraper-analyzer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241435116,"owners_count":19962399,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fine-tuning","matplotlib","pandas","python","sqlite","streamlit","training-dataset","webscraping"],"created_at":"2025-01-12T12:26:17.885Z","updated_at":"2025-11-25T22:04:36.820Z","avatar_url":"https://github.com/ManishPJha.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Web Scraper and Data Analyzer**\n\nThis is a Python-based web scraping and data analysis application built with **Streamlit**. It allows users to scrape data from websites using sitemap URLs, export the scraped data to CSV or JSON, analyze the data, and reset the database.\n\n---\n\n## **Features**\n\n1. **Scrape Data**:\n\n   - Scrape data from a sitemap URL.\n   - Save the scraped data to an SQLite database.\n\n2. **Export Data**:\n\n   - Export the scraped data to **CSV** or **JSON**.\n\n3. **Analyze Data**:\n\n   - Analyze the scraped data and visualize the number of pages per domain.\n\n4. 
**Reset Database**:\n   - Delete all scraped data from the database.\n\n---\n\n## **Installation**\n\n1. Clone the repository:\n\n   ```bash\n   git clone https://github.com/ManishPJha/web-scraper-analyzer.git\n   cd web-scraper-analyzer\n   ```\n\n2. Install the required dependencies:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n3. Run the Streamlit app:\n   ```bash\n   streamlit run streamlit_app.py\n   ```\n\n---\n\n## **Usage**\n\n1. **Scrape Data**:\n\n   - Enter a sitemap URL and specify the number of pages to scrape.\n   - Click **Start Scraping** to begin the scraping process.\n\n2. **Export Data**:\n\n   - Choose the export format (**CSV** or **JSON**).\n   - Click **Export Data** to save the scraped data to a file.\n\n3. **Analyze Data**:\n\n   - View a bar chart showing the number of pages per domain.\n\n4. **Reset Database**:\n   - Click **Reset Database** to delete all scraped data.\n\n---\n\n## **File Structure**\n\n```\nweb-scraper-analyzer/\n├── database.py          # Database operations\n├── scraper.py           # Web scraping logic\n├── exporter.py          # Data export logic\n├── analyzer.py          # Data analysis logic\n├── streamlit_app.py     # Main Streamlit app\n├── requirements.txt     # List of dependencies\n└── README.md            # Project documentation\n```\n\n---\n\n## **Dependencies**\n\n- `streamlit`\n- `aiohttp`\n- `beautifulsoup4`\n- `fake-useragent`\n- `pandas`\n- `matplotlib`\n- `sqlite3` (part of the Python standard library; no separate installation needed)\n- `python-dotenv`\n\nInstall all dependencies using:\n\n```bash\npip install -r requirements.txt\n```\n\n---\n\n## **Contributing**\n\nContributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.\n\n---\n\n## **License**\n\nThis project is licensed under the MIT License. 
See the [LICENSE](LICENSE) file for details.\n\n---\n\n## **Author**\n\n- **Manish Jha**\n- GitHub: [ManishPJha](https://github.com/ManishPJha)\n- Email: mjha205@rku.ac.in\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishpjha%2Fweb-scraper-analyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanishpjha%2Fweb-scraper-analyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanishpjha%2Fweb-scraper-analyzer/lists"}