{"id":24953370,"url":"https://github.com/iamfarrokhnejad/murkmaw","last_synced_at":"2025-03-28T19:46:03.377Z","repository":{"id":240784434,"uuid":"803440024","full_name":"IAmFarrokhnejad/Murkmaw","owner":"IAmFarrokhnejad","description":"A web crawler using Rust.","archived":false,"fork":false,"pushed_at":"2024-11-22T13:47:55.000Z","size":70998,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T03:34:58.500Z","etag":null,"topics":["functional","functional-programming","rust","rust-lang","web-crawler","web-crawling","webcrawler","webcrawling"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IAmFarrokhnejad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-20T18:18:16.000Z","updated_at":"2025-01-02T01:34:05.000Z","dependencies_parsed_at":"2024-10-25T02:52:56.251Z","dependency_job_id":"1dfdee20-3bda-4f97-9757-18798a92303c","html_url":"https://github.com/IAmFarrokhnejad/Murkmaw","commit_stats":null,"previous_names":["iamfarrokhnejad/murkmaw"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IAmFarrokhnejad%2FMurkmaw","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IAmFarrokhnejad%2FMurkmaw/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IAmFarrokhnejad%2FMurkmaw/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IAmFarr
okhnejad%2FMurkmaw/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IAmFarrokhnejad","download_url":"https://codeload.github.com/IAmFarrokhnejad/Murkmaw/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246093097,"owners_count":20722395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["functional","functional-programming","rust","rust-lang","web-crawler","web-crawling","webcrawler","webcrawling"],"created_at":"2025-02-03T03:35:28.664Z","updated_at":"2025-03-28T19:46:03.354Z","avatar_url":"https://github.com/IAmFarrokhnejad.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Murkmaw\n\nMurkmaw is a Rust-based multithreaded web crawler designed for efficient link graph construction, image extraction, and customizable logging. 
It features a modular architecture that supports future enhancements and customization.\n\n---\n\n## Features\n\n### Multithreaded Web Crawler\n- **Parallel Crawling:** Utilizes multithreading for faster page scraping with configurable worker threads.\n- **Link Graph Construction:** Maintains a graph structure (`LinkGraph`) tracking parent-child associations and link references.\n- **Data Extraction:** Retrieves links, images, and titles from web pages.\n- **Customizable Crawling:** Specify the maximum number of links and images to process.\n\n### Enhanced Logging\n- **Progress Bars:** Displays link discovery progress with a real-time progress bar.\n- **Spinners:** Visual feedback for different stages of image processing and serialization.\n- **Customizable Output:** Built using the `indicatif` and `console` crates.\n\n### Image Utilities\n- **Metadata Handling:** Converts extracted links into image metadata, including alt text and source URL.\n- **Image Downloading:** Saves images locally in a user-defined directory.\n- **Image Database:** Serializes image metadata into a JSON database.\n\n\n## Getting Started\n### Prerequisites\n- Rust (latest stable version)\n- Crates used in the project:\n  - tokio (for asynchronous operations)\n  - reqwest (for HTTP requests)\n  - serde and serde_json (for serialization and JSON handling)\n  - rayon (for multithreading)\n  - indicatif and console (for logging and UI enhancements)\n  - anyhow (for error handling)\n\n## Installation\nClone the repository:\n\n```bash\ngit clone https://github.com/IAmFarrokhnejad/Murkmaw.git\ncd Murkmaw\n```\nBuild the project (Cargo fetches the dependencies automatically):\n```bash\ncargo build\n```\n## Usage\nRun the application with the following command:\n```bash\ncargo run --release -- --starting_url \u003cURL\u003e --max_links \u003cN\u003e --max_images \u003cN\u003e --n_worker_threads \u003cN\u003e --log_status \u003ctrue/false\u003e --img_save_dir \u003cdirectory\u003e --links_json \u003cfilename\u003e\n```\n\n## Command-Line 
Options\n- starting_url: The initial URL to crawl (required).\n- max_links: The maximum number of links to process (default: 100).\n- max_images: The maximum number of images to extract (default: 50).\n- n_worker_threads: Number of worker threads for parallel crawling (default: 4).\n- log_status: Whether to enable logging (default: true).\n- img_save_dir: Directory to save downloaded images (default: ./images).\n- links_json: Filename for the JSON file storing the link graph (default: links.json).\n\n\n## Contribution Guidelines\nContributions are welcome! Please follow these steps:\n1. Fork the repository.\n2. Create a new branch for your feature or bug fix.\n3. Submit a pull request with a clear description of your changes.\n\n\n## License\nThis project is licensed under the MIT License - see the LICENSE file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamfarrokhnejad%2Fmurkmaw","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiamfarrokhnejad%2Fmurkmaw","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamfarrokhnejad%2Fmurkmaw/lists"}