{"id":13563525,"url":"https://github.com/TheoCoombes/crawlingathome","last_synced_at":"2025-04-03T20:31:03.629Z","repository":{"id":47497712,"uuid":"357339178","full_name":"TheoCoombes/crawlingathome","owner":"TheoCoombes","description":"A client library for LAION's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.","archived":false,"fork":false,"pushed_at":"2023-03-21T11:48:13.000Z","size":141,"stargazers_count":31,"open_issues_count":4,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-08-01T13:29:22.996Z","etag":null,"topics":["clip","dall-e","dataset","dataset-generation","image-text","machine-learning"],"latest_commit_sha":null,"homepage":"http://crawling.at","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TheoCoombes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-12T21:05:56.000Z","updated_at":"2024-03-03T22:16:39.000Z","dependencies_parsed_at":"2024-08-01T13:19:30.909Z","dependency_job_id":"22bae649-f8cb-47a5-83b4-474775aa401d","html_url":"https://github.com/TheoCoombes/crawlingathome","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheoCoombes%2Fcrawlingathome","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheoCoombes%2Fcrawlingathome/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheoCoombes%2Fcrawlingathome/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheoCoombes%2Fcrawlingathome/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TheoCoombes","download_url":"https://codeload.github.com/TheoCoombes/crawlingathome/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223030583,"owners_count":17076457,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","dall-e","dataset","dataset-generation","image-text","machine-learning"],"created_at":"2024-08-01T13:01:20.256Z","updated_at":"2024-11-04T16:30:47.796Z","avatar_url":"https://github.com/TheoCoombes.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"NOTE: This repo has now been rewritten into a general purpose distributed compute job manager, see below:\r\n* DistCompute Client: [TheoCoombes/distcompute-client](https://github.com/TheoCoombes/distcompute-client)\r\n* DistCompute Tracker Server: [TheoCoombes/distcompute-tracker](https://github.com/TheoCoombes/distcompute-tracker)\r\n\r\n# Crawling@Home Client\r\n[![Discord Chat](https://img.shields.io/discord/823813159592001537?color=5865F2\u0026logo=discord\u0026logoColor=white)](https://discord.gg/dall-e)\r\n\r\nA client library for Crawling@Home's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.\r\n* Server Repo: [TheoCoombes/crawlingathome-server](https://github.com/TheoCoombes/crawlingathome-server)\r\n* Worker Repo: [ARKSeal/crawlingathome-worker](https://github.com/ARKSeal/crawlingathome-worker)\r\n* Live 
Dashboard: http://crawlingathome.duckdns.org/\r\n\r\n# Prerequisites\r\n* Python \u003e= 3.7\r\n\r\n# Installation\r\nAs this module will only be used for creating the dataset (short-term), it has not been added to `pip`. However, installing from source is fairly simple:\r\n```\r\ngit clone https://github.com/TheoCoombes/crawlingathome\r\npip install -r crawlingathome/requirements.txt\r\n```\r\nNow, from the current directory, you can import the module:\r\n```py\r\nimport crawlingathome as cah\r\n```\r\n\r\n# Methods\r\n\r\n## crawlingathome.init(url=\"http://crawlingathome.duckdns.org/\", nickname=None, type=\"HYBRID\") -\u003e Client\r\nCreates and returns a new client instance.\r\n* `url`: the Crawling@Home server URL\r\n* `nickname`: the user's nickname (for the leaderboard)\r\n* `type`: the type of worker from \"HYBRID\", \"CPU\" \u0026 \"GPU\"\r\n    - You can also use the classes instead of a string, e.g. `crawlingathome.core.CPUClient` instead of `\"CPU\"`\r\n\r\n## crawlingathome.dump(client) -\u003e dict\r\nDumps a client into a dictionary, so that it can be loaded externally. (see below)\r\n\r\n## crawlingathome.load(**kwargs) -\u003e Client\r\nLoads an existing client using dumped data passed as kwargs, returning a client instance. (see above)\r\n\r\n# HybridClient Reference\r\n```py\r\nimport crawlingathome as cah\r\n\r\nclient = cah.init(\r\n    url=\"https://example.com\",\r\n    nickname=\"TheoCoombes\",\r\n    type=\"HYBRID\"\r\n)\r\n\r\nwhile client.jobCount() \u003e 0 and client.isAlive():\r\n    client.newJob()\r\n    client.downloadShard()\r\n    \r\n    # Saved shard at ./shard.wat\r\n    \r\n    while processing_shard:\r\n        # ... 
process data\r\n\r\n        client.log(\"Completed x / y images\") # Updates the client's progress to the server\r\n\r\n    client.completeJob(num_pairs_found)\r\n\r\nclient.bye()\r\n```\r\n\r\n\r\n## HybridClient.jobCount() -\u003e int\r\nReturns the number of available Hybrid/CPU jobs on the server as an integer.\r\n\r\n## HybridClient.newJob()\r\nSends a request to the server for a new job.\r\n\r\n## HybridClient.downloadShard()\r\nDownloads the current job's shard to the current directory (`./shard.wat`).\r\n\r\n## HybridClient.completeJob(total_scraped: int)\r\nMarks the current job as done on the server and submits the total number of alt-text pairs scraped. (`_markjobasdone()` will be removed in future clients; use this method instead.)\r\n* `total_scraped` (required): the number of alt-text pairs scraped for the current job\r\n\r\n## HybridClient.log(progress: str)\r\nLogs the string `progress` to the server.\r\n* `progress` (required): The string detailing the progress, e.g. `\"12 / 100 (12%)\"`\r\n\r\n## HybridClient.isAlive() -\u003e bool\r\nReturns `True` if this client is still connected to the server, otherwise returns `False`.\r\n\r\n## HybridClient.dump()\r\nClient-side wrapper for `crawlingathome.dump(client)`.\r\n\r\n## HybridClient.bye()\r\nRemoves the node instance from the server, ending all current jobs.\r\n\r\n## Client Variables\r\n\r\n### HybridClient.shard\r\nThe URL to the current shard.\r\n\r\n### HybridClient.start_id\r\nThe starting ID. Type: `np.int64`.\r\n\r\n### HybridClient.end_id\r\nThe ending ID, 1 million more than the starting ID. 
Type: `np.int64`.\r\n\r\n### HybridClient.shard_piece\r\nThe 'shard' of the chunk, either 0 (first 50%) or 1 (last 50%).\r\n\r\n# CPUClient Reference\r\nThe CPU client is programmatically similar to `HybridClient`, differing only in its upload function:\r\n\r\n## CPUClient.completeJob(download_url: str)\r\nMarks the current job as done on the server and sends the download URL from which GPU workers pull the generated .tar file.\r\n* `download_url` (required): the URL to download the shards\r\n    - As this is a string, it could theoretically be anything, for example an IP address to pull directly from the worker, or a Google Drive link.\r\n\r\n# GPUClient Reference\r\nLike the CPU client, the GPU client is programmatically similar to `HybridClient`, but with a different `downloadShard()` function and `shard` variable, and a new `invalidURL` method:\r\n\r\n## GPUClient.downloadShard(path=\"./images\")\r\nExtracts the .tar file received from CPU workers into the path `path`, creating the directory if necessary.\r\n\r\n## GPUClient.invalidURL()\r\nFlags a GPU job's URL as invalid to the server, moving the job back into open jobs.\r\n\r\n## GPUClient.shard\r\nInstead of a CommonCrawl URL as before, this is the string the CPU client uploaded in `CPUClient.completeJob(...)`.\r\n\r\n## GPUClient Note:\r\nGPUClient jobs are created dynamically, so CPU clients must first generate them. Because of this, there may be periods of time when your worker(s) don't have any jobs to fulfil. 
You can prepare for this by checking `GPUClient.jobCount()` and wrapping the `newJob()` call in a try/except.\r\n* `GPUClient.newJob()` raises a `crawlingathome.errors.ZeroJobError` when there are no jobs to fulfil.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTheoCoombes%2Fcrawlingathome","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTheoCoombes%2Fcrawlingathome","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTheoCoombes%2Fcrawlingathome/lists"}