{"id":21036028,"url":"https://github.com/archiveteam/urls-sources","last_synced_at":"2025-06-24T08:38:19.116Z","repository":{"id":44738262,"uuid":"361027677","full_name":"ArchiveTeam/urls-sources","owner":"ArchiveTeam","description":"Sources for urls-grab.","archived":false,"fork":false,"pushed_at":"2024-10-23T22:56:09.000Z","size":152597,"stargazers_count":7,"open_issues_count":17,"forks_count":8,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-10-24T12:05:29.537Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArchiveTeam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-23T23:23:01.000Z","updated_at":"2024-10-23T22:56:21.000Z","dependencies_parsed_at":"2022-08-26T04:42:57.229Z","dependency_job_id":"1d7bac11-8049-4be1-bb40-dbc1ee67ebcc","html_url":"https://github.com/ArchiveTeam/urls-sources","commit_stats":{"total_commits":77,"total_committers":7,"mean_commits":11.0,"dds":0.2597402597402597,"last_synced_commit":"9b0f7d652aaae2d07f9e2b1110c955bd0ce04730"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Furls-sources","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Furls-sources/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArchiveTeam%2Furls-sources/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ar
chiveTeam%2Furls-sources/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArchiveTeam","download_url":"https://codeload.github.com/ArchiveTeam/urls-sources/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225357189,"owners_count":17461615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T13:17:34.578Z","updated_at":"2024-11-19T13:17:35.697Z","avatar_url":"https://github.com/ArchiveTeam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# URLs Sources\nThe URLs-grab project at https://github.com/ArchiveTeam/urls-grab allows URLs to be archived, alongside their page requisites and, optionally, other found pages. This repository contains the lists of URLs to be periodically queued and instructions on how to structure the items.\n\nThere are two different types of items. The first type consists of the items in the txt files in this repository. These items are read and processed into items that can be queued to the tracker, which are named 'Tracker items'. The main difference between the two types is that 'Tracker items' use percent encoding, while the items in the txt files do not. 
This is done for simplicity.\n\n***warning***: The URLs-grab project can easily overload websites if too many URLs are queued at once.\n\n## Items\nThe repository contains txt files whose filenames follow the pattern `[0-9]+_STRING.txt`, where `STRING` is some string to identify the contents of the txt file, and `[0-9]+` is the interval for how often the items in the txt file should be queued to the tracker. Multiple files with equal intervals and different names can be created. Lines can be added to and removed from the txt files.\n\nEach txt file contains a list of parameters joined with `;`, where the URLs are not percent encoded for simplicity. See the next section for the supported parameters. A special case is the `random` parameter. If this parameter is specified with value `RANDOM` (as in our example file `3600_EXAMPLE.txt`), a random value will be assigned automatically every time the custom item is queued.\n\n### Parameters\nCustom URL items contain the URL to be archived and a number of parameters showing how to extract and queue subsequent URLs. These parameters are:\n\n * `url`: The URL to be archived. This should be the _last_ parameter.\n * `random`: A random string. Items queued to URLs-grab are deduplicated against previously queued items through a bloom filter. This `random` parameter allows for URLs to be requeued.\n * `keep_random`: The depth up to which the `random` string shall be preserved. If `keep_random` is larger than 0, any discovered URLs to be queued will be queued with parameter `keep_random=keep_random-1`, and have the `random` parameter copied over.\n * `all`: Whether all extracted URLs from the same domains should be queued, or only the page requisites.\n * `keep_all`: Similar to `keep_random`, but for `all`.\n * `depth`: The depth up to which to queue `custom` items. 
If `depth` is larger than 0, any URLs found will be queued as `custom` items, else as regular URL items.\n * `deep_extract`: If set to 1, patterns will be used to extract hardcoded URLs that are not extracted by Wget-Lua itself, for example from any scripts. This parameter is only kept on the initially queued URL, not on any subsequently queued URLs. This should be used on, for example, RSS feeds.\n * `any_domain`: Whether URLs from any domains should be queued, or only the current domain. `all` needs to be set in order for this to work.\n\n### Examples\nUsing the above instructions, a few example items are:\n\n * `all=1;deep_extract=1;url=https://example.com/`\n\n   This will archive https://example.com/, and queue all URLs (not limited to page requisites) that can be extracted from the webpage using both Wget-Lua extraction and patterns to extract hardcoded URLs. If this item was already queued before, it will be ignored now. Parameter `depth` is not specified, effectively setting it to 0.\n\n * `all=1;deep_extract=1;random=RANDOM;depth=2;keep_random=1;keep_all=2;url=https://example.com/`\n\n   This includes the `random` string, thus making sure it is queued even if a similar item was queued before. Before queuing to the tracker, `RANDOM` is replaced by a random string. `depth` is set to 2, so `custom` items will be queued for the found URLs, which will all have parameter `all`, effectively allowing a recursive crawl up to depth 3. `keep_random` has value 1, so only the next queued `custom` items will have the `random` value copied over, and subsequently queued `custom` items will not. `deep_extract` is only kept for the very first item. 
`keep_all` is set to 2, which is equal to `depth`, so the `all=1` parameter will be copied over for all depths.\n\n   Any found URLs will be queued as `all=1;random=RANDOM;depth=1;keep_random=0;keep_all=1;url=URL`. Note that parameter `deep_extract` is removed, `depth`, `keep_random`, and `keep_all` are reduced by 1, and `random` is copied over.\n\n## Tracker items\nTracker items are different from the items in the txt files in this repository. These items use the same parameters as the items in the txt files, but they are structured differently. They are formatted as `custom:PARAMS`, where `PARAMS` is a URL-encoded set of parameters.\n\n### Examples\nThe previous examples can be formatted as items that go into the tracker. They give, respectively, the following items:\n\n * `custom:url=https%3A%2F%2Fexample.com%2F\u0026all=1\u0026deep_extract=1` decodes to `{'url': 'https://example.com/', 'all': 1, 'deep_extract': 1}`.\n\n * `custom:url=https%3A%2F%2Fexample.com%2F\u0026all=1\u0026deep_extract=1\u0026random=sa7ff8pjss\u0026depth=2\u0026keep_random=1\u0026keep_all=2` decodes to `{'url': 'https://example.com/', 'all': 1, 'deep_extract': 1, 'random': 'sa7ff8pjss', 'depth': 2, 'keep_random': 1, 'keep_all': 2}`.\n\n    Here, `RANDOM` is replaced by `sa7ff8pjss` as the new random string. The previous example noted that this random string `sa7ff8pjss` will also be copied over to any new items queued from this item. These new items are found and queued directly from the warrior.\n   \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Furls-sources","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farchiveteam%2Furls-sources","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farchiveteam%2Furls-sources/lists"}