{"id":17458160,"url":"https://github.com/pocesar/actor-spawn-workers","last_synced_at":"2025-04-02T21:29:46.627Z","repository":{"id":106841813,"uuid":"259778403","full_name":"pocesar/actor-spawn-workers","owner":"pocesar","description":"Split work through multiple actors","archived":false,"fork":false,"pushed_at":"2021-04-05T02:20:33.000Z","size":177,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-18T06:28:49.634Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pocesar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-28T23:50:36.000Z","updated_at":"2023-03-04T05:53:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"45a68a1f-c026-4408-b90e-5af17928d986","html_url":"https://github.com/pocesar/actor-spawn-workers","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Factor-spawn-workers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Factor-spawn-workers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Factor-spawn-workers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Factor-spawn-workers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pocesar","download_url":"ht
tps://codeload.github.com/pocesar/actor-spawn-workers/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246895486,"owners_count":20851279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-18T03:55:45.532Z","updated_at":"2025-04-02T21:29:46.612Z","avatar_url":"https://github.com/pocesar.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spawn workers\n\nThis actor lets you spawn tasks or other actors in parallel on the Apify platform that share a common output dataset, splitting a RequestQueue-like dataset containing request URLs.\n\n## Usage\n\n```js\nconst Apify = require(\"apify\");\n\nApify.main(async () =\u003e {\n    const input = await Apify.getInput();\n\n    const {\n        limit, // every worker receives a \"batch\"\n        offset, // that changes depending on how many were spawned\n        inputDatasetId,\n        outputDatasetId,\n        parentRunId,\n        isWorker,\n        emptyDataset, // means the inputDatasetId is empty, and you should use another source, like the Key Value store\n        ...rest // any other configuration you passed through workerInput\n    } = input;\n\n    // don't mix requestList with requestQueue\n    // when in worker mode\n    const requestList = new Apify.RequestList({\n        persistRequestsKey: 'START-URLS',\n        sourcesFunction: async () =\u003e {\n            if (!isWorker) {\n                return [\n                    {\n                        \"url\": \"https://start-url...\"\n        
            }\n                ]\n            }\n\n            const requestDataset = await Apify.openDataset(inputDatasetId);\n\n            const { items } = await requestDataset.getData({\n                offset,\n                limit,\n            });\n\n            return items;\n        }\n    });\n\n    await requestList.initialize();\n\n    const requestQueue = isWorker ? undefined : await Apify.openRequestQueue();\n    const outputDataset = isWorker ? await Apify.openDataset(outputDatasetId) : undefined;\n\n    const crawler = new Apify.CheerioCrawler({\n        requestList,\n        requestQueue,\n        handlePageFunction: async ({ $, request }) =\u003e {\n            if (isWorker) {\n                // scrape details here\n                await outputDataset.pushData({ ...data });\n            } else {\n                // instead of requestQueue.addRequest, you push the URLs to the dataset\n                await Apify.pushData({\n                    url: $(\"select stuff\").attr(\"href\"),\n                    userData: {\n                        label: $(\"select other stuff\").data(\"rest\"),\n                    },\n                });\n            }\n        },\n    });\n\n    await crawler.run();\n\n    if (!isWorker) {\n        const { output } = await Apify.call(\"pocesar/spawn-workers\", {\n            // if you omit this, the default dataset on the spawn-workers actor will hold all items\n            outputDatasetId: \"some-named-dataset\",\n            // use this actor default dataset as input for the workers requests, usually should be this own dataset ID\n            inputUrlsDatasetId: Apify.getEnv().defaultDatasetId,\n            // the name or ID of your worker actor (the one below)\n            workerActorId: Apify.getEnv().actorId,\n            // you can use a task instead\n            workerTaskId: Apify.getEnv().actorTaskId,\n            // Optionally pass input to the actors / tasks\n            workerInput: {\n                
maxConcurrency: 20,\n                mode: 1,\n                some: \"config\",\n            },\n            // Optional worker options\n            workerOptions: {\n                memoryMbytes: 256,\n            },\n            // Number of workers\n            workerCount: 2,\n            // Parent run ID, so you can persist things related to this actor call in a centralized manner\n            parentRunId: Apify.getEnv().actorRunId,\n        });\n    }\n});\n```\n\n## Motivation\n\nRequestQueue is the best way to process requests across actors, but it doesn't offer a way to limit or offset reads: you can only iterate over its contents or add new requests.\n\nBy using a dataset, you get the same functionality (minus URL deduplication) in a form that can be safely shared and partitioned across many actors at once. Each worker deals with its own subset of URLs, with no overlap.\n\n## Limitations\n\nDon't use the following keys in `workerInput`, as they will be overwritten:\n\n-   offset: number\n-   limit: number\n-   inputDatasetId: string\n-   outputDatasetId: string\n-   workerId: number\n-   parentRunId: string\n-   isWorker: boolean\n-   emptyDataset: boolean\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpocesar%2Factor-spawn-workers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpocesar%2Factor-spawn-workers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpocesar%2Factor-spawn-workers/lists"}