{"id":24998903,"url":"https://github.com/thiswillbeyourgithub/mediasizeorhashmatcher","last_synced_at":"2025-03-29T18:11:56.098Z","repository":{"id":275445565,"uuid":"926103027","full_name":"thiswillbeyourgithub/MediaSizeOrHashMatcher","owner":"thiswillbeyourgithub","description":"MediaSizeOrHashMatcher: A Python tool for matching files between directories based on size, hash, or video content with parallel processing support.","archived":false,"fork":false,"pushed_at":"2025-02-02T15:08:48.000Z","size":19,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T16:20:09.338Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thiswillbeyourgithub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-02T15:06:10.000Z","updated_at":"2025-02-02T15:08:52.000Z","dependencies_parsed_at":"2025-02-02T16:20:16.510Z","dependency_job_id":"71a31685-2398-4fe4-b672-36446b58141c","html_url":"https://github.com/thiswillbeyourgithub/MediaSizeOrHashMatcher","commit_stats":null,"previous_names":["thiswillbeyourgithub/mediasizeorhashmatcher"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thiswillbeyourgithub%2FMediaSizeOrHashMatcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thiswillbeyourgithub%2FMediaSizeOrHashMatcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/thiswillbeyourgithub%2FMediaSizeOrHashMatcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thiswillbeyourgithub%2FMediaSizeOrHashMatcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thiswillbeyourgithub","download_url":"https://codeload.github.com/thiswillbeyourgithub/MediaSizeOrHashMatcher/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246223331,"owners_count":20743167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-04T18:52:10.263Z","updated_at":"2025-03-29T18:11:56.081Z","avatar_url":"https://github.com/thiswillbeyourgithub.png","language":"Python","readme":"# MediaSizeOrHashMatcher\n\nMediaSizeOrHashMatcher is a Python tool for finding matching files between two directories based on file size and content hash. It supports both regular files and video files with specialized video hash comparison.\n\n## Features\n\n- **File Size Matching**: Finds files with identical sizes between directories\n- **Approximate Size Matching**: Option to match files with sizes within 1% tolerance\n- **Hash Comparison**: \n  - MD5 hash for regular files\n  - VideoHash for video files (MP4, AVI, MOV, MKV)\n- **Parallel Processing**: Utilizes multiple CPU cores for faster hash comparison\n- **Comprehensive Output**: Detailed summary of matching files\n\n## Installation\n\n1. 
Clone the repository:\n```bash\ngit clone https://github.com/thiswillbeyourgithub/MediaSizeOrHashMatcher.git\ncd MediaSizeOrHashMatcher\n```\n\n2. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\n```bash\npython MediaSizeOrHashMatcher.py REFERENCE_DIR CANDIDATES_DIR [OPTIONS]\n```\n\n### Arguments\n\n- `REFERENCE_DIR`: Directory containing reference files\n- `CANDIDATES_DIR`: Directory containing candidate files to match\n\n### Options\n\n- `--approximate`: Enable approximate size matching with 1% tolerance\n- `--videos`: Use video hash comparison for video files\n- `--n_jobs`: Number of parallel jobs for hash matching (default: 3)\n\n### Example Commands\n\n1. Basic usage:\n```bash\npython MediaSizeOrHashMatcher.py /path/to/reference /path/to/candidates\n```\n\n2. With approximate size matching:\n```bash\npython MediaSizeOrHashMatcher.py /path/to/reference /path/to/candidates --approximate\n```\n\n3. With video hash comparison:\n```bash\npython MediaSizeOrHashMatcher.py /path/to/reference /path/to/candidates --videos\n```\n\n4. Using 8 parallel jobs:\n```bash\npython MediaSizeOrHashMatcher.py /path/to/reference /path/to/candidates --n_jobs 8\n```\n\n## Requirements\n\n- Python 3.6+\n- Required packages:\n  - tqdm\n  - joblib\n  - videohash\n\n## How It Works\n\n1. **File Size Matching**:\n   - Scans both directories and creates a list of files with their sizes\n   - Matches files with identical sizes (or within 1% tolerance if --approximate is used)\n\n2. **Hash Comparison**:\n   - For files with matching sizes:\n     - Regular files: Computes MD5 hash\n     - Video files: Computes VideoHash (if --videos is enabled)\n   - Compares hashes to find exact matches\n\n3. 
**Parallel Processing**:\n   - Uses joblib to parallelize hash computation and comparison\n   - Number of parallel jobs can be controlled with --n_jobs\n\n## Output\n\nThe program provides a detailed summary of matching files, including:\n- Reference file path\n- Number of matches found\n- Paths of matching files\n\nExample output:\n```\nMatching Files Summary:\n-----------------------\n\nReference file: /path/to/reference/video.mp4\nFound one match: /path/to/candidates/video_copy.mp4\n\nReference file: /path/to/reference/document.pdf\nFound 2 matches:\n  - /path/to/candidates/doc1.pdf\n  - /path/to/candidates/doc2.pdf\n```\n\n## Contributing\n\nContributions are welcome! Please follow these steps:\n\n1. Fork the repository\n2. Create a new branch for your feature/bugfix\n3. Commit your changes\n4. Submit a pull request\n\n## License\n\nThis project is licensed under the GPL-3.0 License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- tqdm for progress bars\n- joblib for parallel processing\n- videohash for video comparison\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthiswillbeyourgithub%2Fmediasizeorhashmatcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthiswillbeyourgithub%2Fmediasizeorhashmatcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthiswillbeyourgithub%2Fmediasizeorhashmatcher/lists"}