{"id":23903101,"url":"https://github.com/jmitander/jmscraper","last_synced_at":"2025-09-06T10:37:14.779Z","repository":{"id":269800800,"uuid":"908497417","full_name":"JMitander/JMScraper","owner":"JMitander","description":"Scrape web pages and effortlessly extract the data you need. Easy, robust, efficient, and intuitively user-friendly.","archived":false,"fork":false,"pushed_at":"2024-12-26T17:04:50.000Z","size":25,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-04T22:53:06.453Z","etag":null,"topics":["extract-data","extract-media","extract-metadata","extractor","scraping","scraping-web","scraping-websites","webscraper","webscraping","website-scraper","webtool"],"latest_commit_sha":null,"homepage":"https://github.com/JMitander/JMScraper","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JMitander.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-26T08:06:41.000Z","updated_at":"2024-12-26T09:57:21.000Z","dependencies_parsed_at":"2024-12-26T09:36:14.262Z","dependency_job_id":null,"html_url":"https://github.com/JMitander/JMScraper","commit_stats":null,"previous_names":["jmitander/jmscraper"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JMitander%2FJMScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JMitander%2FJMScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JMitander%2
FJMScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JMitander%2FJMScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JMitander","download_url":"https://codeload.github.com/JMitander/JMScraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240304562,"owners_count":19780312,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extract-data","extract-media","extract-metadata","extractor","scraping","scraping-web","scraping-websites","webscraper","webscraping","website-scraper","webtool"],"created_at":"2025-01-04T22:52:23.478Z","updated_at":"2025-02-23T10:45:12.540Z","avatar_url":"https://github.com/JMitander.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# JMScraper\n\n![License](https://img.shields.io/badge/license-MIT-blue.svg)\n![Python](https://img.shields.io/badge/python-3.7%2B-blue.svg)\n![Version](https://img.shields.io/badge/version-1.0.0-brightgreen.svg)\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Features](#features)\n- [Prerequisites](#prerequisites)\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Interactive Mode (Default)](#interactive-mode-default)\n  - [Advanced Mode with Command-Line Arguments](#advanced-mode-with-command-line-arguments)\n- [Configuration](#configuration)\n- [Logging](#logging)\n- [Media Downloading](#media-downloading)\n- [Respectful Scraping](#respectful-scraping)\n- [Contributing](#contributing)\n- [License](#license)\n- 
[Contact](#contact)\n\n## Overview\n**JMScraper** is a powerful and ethical web scraping tool designed to efficiently extract data from multiple web pages. Whether you're a beginner needing basic metadata or an advanced user requiring comprehensive data extraction with media downloads, JMScraper caters to all your scraping needs with ease and reliability.\n\n## Features\n\n- **Interactive Sequential Prompts**: Guided prompts to configure scraping settings one at a time for a seamless user experience.\n- **Optional Command-Line Arguments**: Advanced users can bypass interactive prompts by providing command-line arguments for more control.\n- **Asynchronous Requests**: Utilizes `aiohttp` and `asyncio` to handle multiple requests concurrently, enhancing performance.\n- **Real-Time Progress Bar**: Dynamic loading bar displays scraping progress in real-time without cluttering the terminal.\n- **Robust Fallback Methods**: Ensures data retrieval even if the primary extraction method fails by implementing fallback strategies using regex.\n- **Comprehensive Logging**: Detailed logs are maintained both in the terminal and in a log file (`scraper.log`) using the `Rich` library.\n- **Flexible Output Formats**: Supports both JSON and CSV formats for saving scraped data.\n- **Media Downloading**: Automatically downloads and organizes media files (images, videos, documents) into structured directories.\n- **Respectful Scraping Practices**: Fetches and includes `robots.txt` content for each domain to adhere to website scraping policies.\n- **User-Friendly Interface**: Enhanced terminal outputs using `Rich` and interactive prompts with `InquirerPy` for an intuitive experience.\n\n## Prerequisites\n\n- **Python 3.7+**: Ensure you have Python installed. You can download it from [python.org](https://www.python.org/downloads/).\n\n## Installation\n\n1. **Clone the Repository**\n\n   ```bash\n   git clone https://github.com/jmitander/JMScraper.git\n   cd JMScraper\n   ```\n\n2. 
**Create a Virtual Environment (Optional but Recommended)**\n\n   ```bash\n   python3 -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install Dependencies**\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n   *If you don't have a `requirements.txt`, you can install the necessary packages directly:*\n\n   ```bash\n   pip install aiohttp beautifulsoup4 rich InquirerPy aiofiles\n   ```\n\n## Usage\n\n### Interactive Mode (Default)\n\nWhen you run the scraper without any command-line arguments, it launches an interactive terminal menu that guides you through the scraping process step-by-step.\n\n```bash\npython scraper.py\n```\n\n**Interactive Steps:**\n\n1. **Enter URL(s)**: Provide one or multiple URLs separated by commas.\n\n   ```\n   ? Enter the URL(s) you want to scrape (separated by commas): https://www.example.com, https://www.python.org\n   ```\n\n2. **Choose Scraping Method(s)**: Select from Metadata, Links, Images, or All Data.\n\n   ```\n   ? Choose scraping method(s): [Metadata, Links, Images, All Data]\n   [✔] Metadata\n   [✔] Links\n   ```\n\n3. **Configure Advanced Settings** (Optional):\n\n   ```\n   ? Do you want to edit advanced settings? Yes\n   ```\n\n   - **Proxy Server**: Enter a proxy URL or leave blank.\n   - **Concurrency**: Define the number of concurrent requests.\n   - **Delay**: Set the delay between requests in seconds.\n   - **Output File**: Specify the path for the output file.\n   - **Output Format**: Choose between JSON or CSV.\n   - **Extraction Method**: Select between BeautifulSoup and Regex.\n\n4. 
**Confirmation**: Review the configuration summary and confirm to proceed.\n\n   ```\n   Scraper Configuration\n   =====================\n   \n   ╭─────────────────────┬────────────────────────────╮\n   │ Parameter           │ Value                      │\n   ╞═════════════════════╪════════════════════════════╡\n   │ Input URLs          │ https://www.example.com,   │\n   │                     │ https://www.python.org     │\n   ├─────────────────────┼────────────────────────────┤\n   │ Output File         │ results.json               │\n   ├─────────────────────┼────────────────────────────┤\n   │ Output Format       │ json                       │\n   ├─────────────────────┼────────────────────────────┤\n   │ Scraping Mode       │ metadata, links            │\n   ├─────────────────────┼────────────────────────────┤\n   │ Extraction Method   │ beautifulsoup              │\n   ├─────────────────────┼────────────────────────────┤\n   │ Delay (s)           │ 1.0                        │\n   ├─────────────────────┼────────────────────────────┤\n   │ Concurrency         │ 5                          │\n   ├─────────────────────┼────────────────────────────┤\n   │ Proxy               │ None                       │\n   ╰─────────────────────┴────────────────────────────╯\n   \n   ? Proceed with the above configuration? Yes\n   ```\n\n### Advanced Mode with Command-Line Arguments\n\nAdvanced users can bypass the interactive menu by providing command-line arguments to customize scraping parameters directly.\n\n**Example Command:**\n\n```bash\npython scraper.py --input urls.txt --output results.csv --format csv --mode all --concurrency 10 --delay 0.5 --proxy http://proxy:port\n```\n\n**Arguments:**\n\n- `--input`, `-i`: Path to the input file containing URLs (one per line).\n- `--output`, `-o`: Path for the output file.\n- `--format`, `-f`: Output format (`json` or `csv`). Default is `json`.\n- `--delay`, `-d`: Delay between requests in seconds. 
Default is `1.0`.\n- `--proxy`, `-p`: Proxy server to use (e.g., `http://proxy:port`).\n- `--mode`, `-m`: Scraping mode (`metadata`, `links`, `images`, or `all`).\n- `--concurrency`, `-c`: Number of concurrent requests. Default is `5`.\n- `--alternative`, `-a`: Extraction method (`beautifulsoup` or `regex`).\n\n**Example Command with Partial Arguments:**\n\n```bash\npython scraper.py -i urls.txt -o results.csv -f csv -m links\n```\n\n## Configuration\n\n### Interactive Prompts\n\nThe interactive mode guides users through a series of prompts to configure:\n\n1. **URL(s)**: Enter one or multiple URLs separated by commas.\n2. **Scraping Method(s)**: Select one or more methods (Metadata, Links, Images, All Data).\n3. **Advanced Settings** (Optional):\n   - **Proxy Server**: Enter a proxy URL or leave blank.\n   - **Concurrency**: Number of concurrent requests.\n   - **Delay**: Delay between requests in seconds.\n   - **Output File**: Path for the output file.\n   - **Output Format**: Choose between JSON or CSV.\n   - **Extraction Method**: Select between BeautifulSoup and Regex.\n\n### Command-Line Arguments\n\nAdvanced users can specify configurations directly using command-line arguments for more control and automation.\n\n```bash\npython scraper.py --input urls.txt --output results.csv --format csv --mode all --concurrency 10 --delay 0.5 --proxy http://proxy:port\n```\n\n## Logging\n\nThe scraper maintains detailed logs both in the terminal and in a log file named `scraper.log` using the `Rich` library. 
This aids in monitoring the scraping process and debugging if necessary.\n\n## Media Downloading\n\nJMScraper automatically downloads and organizes media files (images, videos, documents) into structured directories:\n\n- **Images**: Saved in `media/images/`\n- **Videos**: Saved in `media/videos/`\n- **Documents**: Saved in `media/documents/`\n\nEach media file is saved with a unique, safe filename generated using a hash of the URL and appropriate file extensions based on content type.\n\n## Respectful Scraping\n\nJMScraper adheres to ethical scraping practices by:\n\n- **Fetching `robots.txt`**: Retrieves and includes the contents of `robots.txt` for each domain to respect website scraping policies.\n- **User-Agent Rotation**: Rotates through a list of common User-Agent strings to mimic different browsers and reduce the risk of being blocked.\n- **Rate Limiting**: Implements delay and concurrency controls to avoid overwhelming target servers.\n\n**Important**: Always ensure you have permission to scrape the websites you target and comply with their `robots.txt` and terms of service.\n\n## Contributing\n\nContributions are welcome! If you'd like to contribute to JMScraper, please follow these steps:\n\n1. **Fork the Repository**\n\n2. **Create a New Branch**\n\n   ```bash\n   git checkout -b feature/YourFeatureName\n   ```\n\n3. **Make Your Changes**\n\n4. **Commit Your Changes**\n\n   ```bash\n   git commit -m \"Add your message here\"\n   ```\n\n5. **Push to the Branch**\n\n   ```bash\n   git push origin feature/YourFeatureName\n   ```\n\n6. 
**Open a Pull Request**\n\nPlease ensure your contributions adhere to the existing code style and include appropriate documentation and tests where necessary.\n\n## License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n## Contact\n\nFor any questions or support, please open an issue in the [GitHub repository](https://github.com/jmitander/JMScraper/issues).\n\n---\n\n**Happy Scraping!**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjmitander%2Fjmscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjmitander%2Fjmscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjmitander%2Fjmscraper/lists"}