{"id":20121802,"url":"https://github.com/parseword/massfetcher","last_synced_at":"2025-11-27T17:02:26.820Z","repository":{"id":57036000,"uuid":"168644185","full_name":"parseword/massfetcher","owner":"parseword","description":"Perform a GET request against a whole bunch of different websites using lots of concurrent threads","archived":false,"fork":false,"pushed_at":"2019-06-26T03:36:02.000Z","size":21,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-13T07:26:21.991Z","etag":null,"topics":["http","http-requests","multithreaded","php7","pthreads","spider"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/parseword.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-01T04:52:49.000Z","updated_at":"2021-03-25T16:52:58.000Z","dependencies_parsed_at":"2022-08-24T14:10:23.820Z","dependency_job_id":null,"html_url":"https://github.com/parseword/massfetcher","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parseword%2Fmassfetcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parseword%2Fmassfetcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parseword%2Fmassfetcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parseword%2Fmassfetcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/parseword","download_url":"https://codeload.github.c
om/parseword/massfetcher/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241565547,"owners_count":19983142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["http","http-requests","multithreaded","php7","pthreads","spider"],"created_at":"2024-11-13T19:32:44.968Z","updated_at":"2025-11-27T17:02:26.744Z","avatar_url":"https://github.com/parseword.png","language":"PHP","readme":"# MassFetcher\r\n\r\nMassFetcher is a multithreaded HTTP GET request utility. Give it a path to \r\nrequest, and a giant list of domains to request it from. Retrieved files are \r\nsaved to disk (subject to configuration parameters). You may find MassFetcher \r\nuseful if you want to perform various types of web analysis:\r\n\r\n* Gauge the average size of web index pages\r\n\r\n* Determine the popularity of specific code libraries, meta tags, etc. \r\n\r\n* Inspect lots of `ads.txt` files looking for new ad networks to block\r\n\r\n* Find out how quickly (or not) a proposal like `/.well-known/security.txt` is \r\nbeing implemented\r\n\r\nMassFetcher will go get the data; doing something with it is up to you.\r\n\r\n## Requirements\r\n\r\n* PHP \u003e= 7.1, with\r\n\r\n* The `pthreads` extension, either compiled-in or enabled as a module, and\r\n\r\n* The `curl` extension, either compiled-in or enabled as a module\r\n\r\n* Composer\r\n\r\n## Installation\r\n\r\nClone this repository to a new directory and then run `composer install`. 
This \r\nwill pull in the dependency (a logger) and set up the autoloader.\r\n\r\nCopy `config.php-dist` to `config.php`.\r\n\r\n## Usage\r\n\r\nConfigure your settings inside `config.php`. Here you can set the target URI \r\npath you want to request, along with a bunch of options to modify MassFetcher's \r\nbehavior. The options are explained in the comments.\r\n\r\nSupply your list of target hosts in a file called `domains.txt`. The \r\n[Alexa Top 1M list](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip) may \r\ncome in handy, but do some small test runs first!\r\n\r\nRun `php fetcher.php` to execute MassFetcher.\r\n\r\nRetrieved files will be saved to a directory (defaults to `data`) in a series of \r\nhierarchical subdirectories.\r\n\r\nThe repository ships with a sample `domains.txt` containing 100 hostnames, a \r\nconfig that will request `/ads.txt` from all of them, and the logger set to \r\ndebug level. You should probably run once using these defaults, then examine \r\nthe `output.log` file to see what's going on under the hood.\r\n\r\n## Resources and Performance\r\n\r\nPerformance will vary depending upon your hardware, internet connection, and \r\nconfiguration settings. Broadly speaking, with 64 threads I've averaged around \r\n1,000 requests per minute from various commodity cloud instances.\r\n\r\nMassFetcher may use significantly more bandwidth and disk space than you expect. \r\nDue to error pages, redirects, and oddly-configured servers, you're going to get \r\nplenty of junk data. \r\n\r\nFor instance, suppose you request `/ads.txt`:\r\n\r\n* telegram.org replies with \"200 OK\" but sends their index page instead.\r\n\r\n* booking.com properly sends a 404 response, but it weighs in at a hefty 300KB.\r\n\r\n* whatsapp.com redirects to its 600KB index page.\r\n\r\nSome of MassFetcher's settings can help mitigate junk data. 
In particular, the \r\nstrict filename matching option will only write a fetched file to disk if the \r\nfinal destination URI, after all redirects, has the same base filename that you \r\nrequested.\r\n\r\nYou should do some small test runs whenever you change configuration, before \r\nlaunching into an enormous fetch job. \r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparseword%2Fmassfetcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparseword%2Fmassfetcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparseword%2Fmassfetcher/lists"}