{"id":22160622,"url":"https://github.com/mandarwagh9/web-scraper","last_synced_at":"2025-10-12T16:31:19.755Z","repository":{"id":253939888,"uuid":"844993372","full_name":"mandarwagh9/web-scraper","owner":"mandarwagh9","description":"This project is a Flask-based web application designed to scrape various types of content from a specified URL.","archived":false,"fork":false,"pushed_at":"2024-08-20T11:40:15.000Z","size":32,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-02T04:08:58.136Z","etag":null,"topics":["flask","flask-application","pytho","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"https://youtu.be/5jNEx0zlp20","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mandarwagh9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-20T11:24:07.000Z","updated_at":"2024-08-30T19:23:13.000Z","dependencies_parsed_at":"2024-08-20T13:30:52.294Z","dependency_job_id":"857e7881-8f02-4d7f-8e82-2fb86a0b27e1","html_url":"https://github.com/mandarwagh9/web-scraper","commit_stats":null,"previous_names":["mandarwagh9/web-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fweb-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fweb-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fweb-scraper/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fweb-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mandarwagh9","download_url":"https://codeload.github.com/mandarwagh9/web-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236249488,"owners_count":19118705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flask","flask-application","pytho","webscraper","webscraping"],"created_at":"2024-12-02T04:09:04.542Z","updated_at":"2025-10-12T16:31:14.428Z","avatar_url":"https://github.com/mandarwagh9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraper\r\n\r\n## Overview\r\n\r\nThis project is a versatile web scraping tool with two interfaces: a Flask-based web application and a terminal-based script. 
Both tools are designed to scrape various types of content from a specified URL and save them to designated folders.\r\n\r\n## Features\r\n\r\n- **Web Application**:\r\n  - Scrapes text from `\u003cp\u003e` and header tags (`\u003ch1\u003e`, `\u003ch2\u003e`, etc.)\r\n  - Extracts all hyperlinks from the page\r\n  - Downloads images, videos, audio files, and documents based on user selection\r\n  - Saves scraped content to specified folders\r\n  - Provides a user-friendly interface for scraping configuration\r\n\r\n- **Terminal Script**:\r\n  - Allows scraping of text, links, and media (images, videos, audio files) directly from the terminal\r\n  - Asks for the URL, types of content to scrape, and the locations to save the scraped content\r\n\r\n## Setup\r\n\r\n### Prerequisites\r\n\r\n- Python 3.x\r\n- Flask\r\n- Requests\r\n- BeautifulSoup4\r\n\r\n### Installation\r\n\r\n1. **Clone the repository:**\r\n\r\n    ```bash\r\n    git clone https://github.com/mandarwagh9/web-scraper.git\r\n\r\n    cd web-scraper\r\n\r\n    ```\r\n\r\n2. **Create and activate a virtual environment (optional but recommended):**\r\n\r\n    ```bash\r\n    python -m venv venv\r\n    source venv/bin/activate  # On Windows use `venv\\Scripts\\activate`\r\n    ```\r\n\r\n3. **Install the required packages:**\r\n\r\n    ```bash\r\n    pip install -r requirements.txt\r\n    ```\r\n\r\n### Requirements\r\n\r\nCreate a `requirements.txt` file with the following content:\r\n\r\n```\r\nFlask==2.2.2\r\nrequests==2.28.2\r\nbeautifulsoup4==4.12.2\r\n```\r\n\r\n## Usage\r\n\r\n### Web Application\r\n\r\n1. **Run the Flask application:**\r\n\r\n    ```bash\r\n    python app.py\r\n    ```\r\n\r\n2. **Open your web browser and navigate to:**\r\n\r\n    ```\r\n    http://127.0.0.1:5000\r\n    ```\r\n\r\n3. **Use the form to specify the URL and choose the types of content you want to scrape.** You can select:\r\n   - Text\r\n   - Links\r\n   - Images\r\n   - Videos\r\n   - Audio\r\n   - Documents\r\n\r\n4. 
**Specify the folders where you want to save each type of content.** For optional fields, leave them blank if you don't want to save that type of content.\r\n\r\n5. **Click the \"Start Scraping\" button to begin the scraping process.** A popup will appear to notify you when the scraping is complete.\r\n\r\n### Terminal Script\r\n\r\n1. **Run the terminal script:**\r\n\r\n    ```bash\r\n    python terminal_scraper.py\r\n    ```\r\n\r\n2. **Follow the prompts in the terminal:**\r\n   - Enter the URL to scrape\r\n   - Choose the types of content you want to scrape by entering the corresponding numbers (e.g., text, links, media)\r\n   - Provide the folder names for each type of content (if applicable)\r\n\r\n3. **The script will process the URL and save the scraped content to the specified folders.** You will be notified once the scraping is complete.\r\n\r\n## Example\r\n\r\n![Example Screenshot](https://github.com/mandarwagh9/web-scraper/blob/main/webscraper.PNG?raw=true)\r\n\r\n## Contributing\r\n\r\nFeel free to submit issues or pull requests. Please make sure to follow the coding style and include relevant tests with your contributions.\r\n\r\n1. Fork the repository\r\n2. Create a new branch (`git checkout -b feature-branch`)\r\n3. Commit your changes (`git commit -am 'Add new feature'`)\r\n4. Push to the branch (`git push origin feature-branch`)\r\n5. Create a new Pull Request\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmandarwagh9%2Fweb-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmandarwagh9%2Fweb-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmandarwagh9%2Fweb-scraper/lists"}