{"id":16513096,"url":"https://github.com/albertz/randomftpgrabber","last_synced_at":"2025-10-07T08:26:13.698Z","repository":{"id":25138468,"uuid":"28560624","full_name":"albertz/RandomFtpGrabber","owner":"albertz","description":"Random FTP grabber - downloads all the interesting stuff","archived":false,"fork":false,"pushed_at":"2019-12-30T12:22:36.000Z","size":86,"stargazers_count":61,"open_issues_count":1,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-14T03:07:17.486Z","etag":null,"topics":["ccc","ftp","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/albertz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-28T11:48:43.000Z","updated_at":"2025-05-11T01:25:14.000Z","dependencies_parsed_at":"2022-08-23T19:51:00.062Z","dependency_job_id":null,"html_url":"https://github.com/albertz/RandomFtpGrabber","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/albertz/RandomFtpGrabber","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/albertz%2FRandomFtpGrabber","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/albertz%2FRandomFtpGrabber/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/albertz%2FRandomFtpGrabber/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/albertz%2FRandomFtpGrabber/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/albertz","download_url":"https://codeload.github.com/albertz/RandomFtpGrabber/t
ar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/albertz%2FRandomFtpGrabber/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278742383,"owners_count":26037808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ccc","ftp","python"],"created_at":"2024-10-11T16:07:26.858Z","updated_at":"2025-10-07T08:26:13.648Z","avatar_url":"https://github.com/albertz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Random FTP grabber\n\nSituation:\nYou have various file servers with interesting stuff,\ntoo much which you can possibly download,\nand most of the stuff you never heard about so you\ncannot tell how much it is of interest,\nbut you still want to download a good set of files.\n\n(A common such situation is if you are on a\nHacker Conference like the Chaos Communication Congress/Camp.)\n\nA totally random sampling might already be a good enough\nrepresentation, but we might be able to improve slightly.\n\nA bit tricky is if there are multiple-parts\nwhich belong together - they should be grabbed together.\n\n\n## Usage\n\nGo into the directory where you want to download to.\n\n    echo \"ftp://bla/blub1\" \u003e\u003e sources.txt\n    echo \"ftp://blub/bla2\" \u003e\u003e 
sources.txt\n    mkdir downloads\n    RandomFtpGrabber/main.py\n\nIt will create some `*.db` files, e.g. `index.db`, where\nit saves its current state, so when you kill it and restart it,\nit should resume everything, all running downloads and the lazy\nindexing.\n\n\n## Details\n\n* Python 3.\n* Downloads via `wget`.\n* Provide a list of source URLs in the file `./sources.txt`.\n* Lazy random sampled indexing of the files.\nIt doesn't build a full index in the beginning, it rather randomly\nbrowses through the given sources and randomly selects files for download.\nSee [`RandomFileQueue`](https://github.com/albertz/RandomFtpGrabber/blob/master/RandomFileQueue.py)\nfor details on the random walking algorithm.\nIf you run it long enough, it still will end up with a full file index, though.\n* FTP indexing via Python `ftplib`. HTTP via `urllib3` and `BeautifulSoup`.\n* Resumes later on temporary problems (connection timeout, FTP error 4xx),\nskips dirs/files with unrecoverable problems (file not found anymore or so, FTP error 5xx).\n* Multiple worker threads and a task system with a work queue.\nSee [`TaskSystem`](https://github.com/albertz/RandomFtpGrabber/blob/master/TaskSystem.py)\nfor details on the implementation.\n* Serializes current state (as readable Python expressions)\nand will recover it on restart, thus it will resume all current actions such as downloads.\nSee [`Persistence`](https://github.com/albertz/RandomFtpGrabber/blob/master/Persistence.py)\nfor details on the implementation.\n\n\n## Plan\n\nFor found files, it should run some detection whether it should be downloaded\n(or how to prioritize certain files more than others).\n\nVia the [Python module `guessit`](https://pypi.python.org/pypi/guessit),\nwe can extract useful information just from\nthe filename - works well for movies, episodes or music.\n\nWe can then use IMDb to get some more information for movies.\nThe [Python module `IMDbPY`](http://imdbpy.sourceforge.net/)\nmight be useful for 
this case\n(although it doesn't support Python 3 yet - see\n[here](https://github.com/alberanid/imdbpy/issues/17)).\n[This](http://stackoverflow.com/questions/5342329/can-i-retrieve-imdbs-movie-recommendations-for-a-given-movie-using-imdbpy) is also relevant.\n\nA movie recommendation engine could then be useful.\n\nThere could also be a movie blacklist: I don't want to download\nmovies which I have already seen.\n\nThere could be other filters.\n\nScraping and web crawling might work better via [Scrapy](http://scrapy.org/).\n\n\n## Contribute\n\nDo you want to hack on it?\nYou are very welcome!\n\nIf you want to work on the plans above, just contact me so we can brainstorm.\n\nWant to support some new protocol?\nModify [`FileSysIntf`](https://github.com/albertz/RandomFtpGrabber/blob/master/FileSysIntf.py)\nfor the indexing\nand [`Downloader`](https://github.com/albertz/RandomFtpGrabber/blob/master/Downloader.py)\nfor the download logic, although this might already work because it\njust uses `wget` for everything.\n\n\n## Author\n\nAlbert Zeyer, [albzey@gmail.com](mailto:albzey@gmail.com).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falbertz%2Frandomftpgrabber","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falbertz%2Frandomftpgrabber","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falbertz%2Frandomftpgrabber/lists"}