{"id":19617236,"url":"https://github.com/keul/allanon","last_synced_at":"2025-04-28T02:31:59.260Z","repository":{"id":6152082,"uuid":"7381386","full_name":"keul/Allanon","owner":"keul","description":"A Web crawler that visit a predictable set of URLs, and automatically download resources you want from them","archived":false,"fork":false,"pushed_at":"2022-12-08T13:15:32.000Z","size":68,"stargazers_count":7,"open_issues_count":2,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-23T10:52:55.952Z","etag":null,"topics":["crawler","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/keul.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-12-30T23:18:05.000Z","updated_at":"2021-12-25T17:13:25.000Z","dependencies_parsed_at":"2023-01-13T13:51:30.338Z","dependency_job_id":null,"html_url":"https://github.com/keul/Allanon","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keul%2FAllanon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keul%2FAllanon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keul%2FAllanon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keul%2FAllanon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/keul","download_url":"https://codeload.github.com/keul/Allanon/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count
":251238191,"owners_count":21557415,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","python"],"created_at":"2024-11-11T11:02:51.801Z","updated_at":"2025-04-28T02:31:59.029Z","avatar_url":"https://github.com/keul.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. contents::\n\nIntroduction\n============\n\nLet's say that you want to access a slow streaming site to see something (obviously: something not\nprotected by copyright).\n\nThe streaming site use URLs in that format:\n\n    http://legal-streaming-site.org/program-name/season5/episode4/\n\nEvery page contains some HTML code like the following::\n\n    ....\n        \u003cdiv id=\"video-container\"\u003e\n           ...\n           \u003cembed src=\"http://someotherurl.org/qwerty.flv\" ... 
\n           ...\n        \u003c/div\u003e\n    ...\n\nLet's say this is the URL for episode 4 of the fifth season of your program.\nYou know that this program has 6 seasons with 22 episodes each.\n\nAs said before, this site is very slow, so you prefer to download episodes in the background\nand watch them later.\n\nTo download them you need to inspect the HTML of each page and extract some resources\n(commonly an FLV file).\nIt would be best to download *all* episodes in a single (long-running) operation instead of doing it\nmanually.\n\n**Allanon** helps with exactly this kind of task.\nYou simply need to provide:\n\n* a simple URL or a *dynamic URL pattern*\n* a *query selector* for resources inside the page\n\nQuick example (you can keep it on a single line)::\n\n    $ allanon --search=\"#video-container embed\" \\\n    \u003e \"http://legal-streaming-site.org/program-name/season{1:6}/episode{1:22}\"\n\nDocumentation\n=============\n\nInstallation\n------------\n\nYou can use `distribute`__ or `pip`__ to install the utility in your Python environment.\n\n__ http://pypi.python.org/pypi/distribute\n__ http://pypi.python.org/pypi/pip\n\n::\n\n    $ easy_install Allanon\n\nor alternatively::\n\n    $ pip install Allanon\n\nInvocation\n----------\n\nAfter installing, you will be able to run the ``allanon`` script from the command line.\nFor example, run the following to access the utility's help::\n\n    $ allanon --help\n\nBasic usage (you probably don't need Allanon at all for this)\n-------------------------------------------------------------\n\nThe ``allanon`` script accepts a URL (or a list of URLs) to download::\n\n    $ allanon http://myhost/folder/image1.jpg http://myhost/folder/image2.jpg ...\n\nEvery URL given to Allanon on the command line can be a simple URL or a *URL model* like the following::\n\n    $ allanon \"http://myhost/folder/image{1:50}.jpg\"\n\nThis will crawl 50 different URLs automatically (``image1.jpg`` through ``image50.jpg``). 
\n\nMain usage (things become interesting now)\n------------------------------------------\n\nThe ``allanon`` script takes an additional ``--search`` parameter (see the first example given\nabove).\nWhen you provide it, you are saying:\n\n    \"*I don't want to download those URLs directly, but those URLs contain links to\n    files that I really want*\".\n\nThe search parameter must be a CSS 3 compatible selector, like those supported by the famous\n`jQuery library`__; it is handled by the `pyquery`__ library.\nSee its documentation for more details about what you can look for.\n\n__ http://api.jquery.com/category/selectors/\n__ http://packages.python.org/pyquery/\n\nExtreme usage\n-------------\n\nThe ``--search`` parameter can be provided multiple times::\n\n    $ allanon --search=\"ul.image-repos a\" \\\n    \u003e --search=\"div.image-containers img\" \\\n    \u003e \"http://image-repository-sites.org/category{1:30}.html\"\n\nWhen you provide (for example) two different search parameters, you are saying:\n\n    \"*I don't want to download resources at the given URLs. 
Those URLs contain links to secondary pages,\n    and inside those pages there are links to resources I want to download*\"\n\nFilters are applied in the given order, so:\n\n* Allanon will search inside 30 pages named *category1.html*, *category2.html*, ...\n* inside those pages, Allanon will look for all links inside ``ul`` tags with CSS class\n  *image-repos* and recursively crawl them.\n* inside those linked pages, Allanon will look for images inside ``div`` tags with CSS class *image-containers*.\n* images will be downloaded.\n\nPotentially you can continue this way, providing a third level of filters, and so on.\n\nNaming and storing downloaded resources\n---------------------------------------\n\nBy default Allanon downloads all files to the current directory, so filename conflicts\nare possible.\nYou can control how and where files are saved: change the filename dynamically with the\n``--filename`` option and/or change the directory where files are stored with the\n``--directory`` option.\n\nAn example::\n\n    $ allanon --filename=\"%HOST-%INDEX-section%1-version%3-%FULLNAME\" \\\n    \u003e \"http://foo.org/pdf-repo-{1:10}/file{1:50}.pdf?version={0:3}\"\n\nAs you can see, ``--filename`` accepts some *markers* that can be used to better organize\nresources:\n\n``%HOST``\n    Will be replaced with the hostname used in the URL.\n``%INDEX``\n    A progressive number, from 1 to the number of downloaded resources.\n``%X``\n    When using dynamic URL models, refers to the current number of a URL\n    section (*X* is the position of that section in the URL).\n    \n    In this case \"%1\" is the current \"pdf-repo-*x*\" number and \"%3\" is the \"version\"\n    parameter value.\n``%FULLNAME``\n    The original filename (the one used if ``--filename`` is not provided).\n    \n    You can also use ``%NAME`` and ``%EXTENSION`` to get only the name of the file\n    (without extension) or only the extension.\n\nThe ``--directory`` option can be a simple directory name or a directory path (in Unix-like\nformat, for example 
\"``foo/bar/baz``\").\n\nAn example::\n\n    $ allanon --directory=\"/home/keul/%HOST/%1\" \\\n    \u003e \"http://foo.org/pdf-repo-{1:10}/file{1:50}.pdf\" \\\n    \u003e \"http://baz.net/pdf-repo-{1:10}/file{1:50}.pdf\"\n\nThe ``--directory`` option also supports some of the markers: you can use ``%HOST``, ``%INDEX`` and ``%X``\nwith the same meaning as given above.\n\nTODO\n====\n\nThis utility is in alpha stage: a lot of things can go wrong when downloading, and many features\nare missing:\n\n* verbosity controls\n* bandwidth control\n* multi-threading (let's look at `grequests`__)\n* Python 3 support\n\n__ https://github.com/kennethreitz/grequests\n\nIf you find bugs or want to ask for missing features, use the `product's issue tracker`__.\n\n__ https://github.com/keul/Allanon/issues\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeul%2Fallanon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkeul%2Fallanon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeul%2Fallanon/lists"}