[![doc](https://img.shields.io/badge/-Documentation-blue)](https://advestis.github.io/adparallelengine)
[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

#### Status
[![pytests](https://github.com/Advestis/adparallelengine/actions/workflows/pull-request.yml/badge.svg)](https://github.com/Advestis/adparallelengine/actions/workflows/pull-request.yml)
[![push-pypi](https://github.com/Advestis/adparallelengine/actions/workflows/push-pypi.yml/badge.svg)](https://github.com/Advestis/adparallelengine/actions/workflows/push-pypi.yml)
[![push-doc](https://github.com/Advestis/adparallelengine/actions/workflows/push-doc.yml/badge.svg)](https://github.com/Advestis/adparallelengine/actions/workflows/push-doc.yml)

![maintained](https://img.shields.io/badge/Maintained%3F-yes-green.svg)
[![issues](https://img.shields.io/github/issues/Advestis/adparallelengine.svg)](https://github.com/Advestis/adparallelengine/issues)
[![pr](https://img.shields.io/github/issues-pr/Advestis/adparallelengine.svg)](https://github.com/Advestis/adparallelengine/pulls)

#### Compatibilities
![ubuntu](https://img.shields.io/badge/Ubuntu-supported--tested-success)
![unix](https://img.shields.io/badge/Other%20Unix-supported--untested-yellow)

![python](https://img.shields.io/pypi/pyversions/adparallelengine)

##### Contact
[![linkedin](https://img.shields.io/badge/LinkedIn-Advestis-blue)](https://www.linkedin.com/company/advestis/)
[![website](https://img.shields.io/badge/website-Advestis.com-blue)](https://www.advestis.com/)
[![mail](https://img.shields.io/badge/mail-maintainers-blue)](mailto:pythondev@advestis.com)

# adparallelengine

A wrapper around several ways of performing parallel map operations in Python. One can use:
* Dask
* concurrent.futures
* mpi4py.futures

The underlying engine is also available in a serial mode, for debugging purposes.

## Installation

```
pip install adparallelengine[all,mpi,dask,support_shared,k8s]
```

## Usage

### Basic use

Creating the engine is done this way:

```python
from adparallelengine import Engine
from transparentpath import Path

if __name__ == "__main__":
    which = "multiproc"  # Can also be "serial", "dask", "mpi" or "k8s"
    engine = Engine(kind=which, path_shared=Path("tests") / "data" / "shared")
```

Then using the engine is done this way:

```python
from adparallelengine import Engine
import pandas as pd

def method(df):
    return 2 * df, 3 * df

if __name__ == "__main__":
    which = "multiproc"  # Can also be "serial", "dask", "mpi" or "k8s"
    engine = Engine(
        kind=which,
        # max_workers=10  # One can limit the number of workers. By default, os.cpu_count() or MPI.COMM_WORLD.size is used.
    )
    results = engine(
        method,  # The method to use...
        [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])]  # ...on each element of this iterable
    )
```

Note that adparallelengine **supports generators** if the *length* argument is given:

```python
from adparallelengine import Engine

def dummy_prod(xx):
    return 2 * xx

def fib(limit):
    """Fibonacci generator"""
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

x = fib(25)  # will yield 9 elements: 0, 1, 1, 2, 3, 5, 8, 13, 21

if __name__ == "__main__":
    which = "multiproc"  # Can also be "serial", "dask", "mpi" or "k8s"
    engine = Engine(kind=which)
    results = engine(
        dummy_prod,
        x,
        length=9,
        batched=4
    )
```

The engine never casts the generator to a list; instead, a custom iterator class properly batches the generator and loops through it only once, when the computation actually happens.

### Gathering

Results will be a list of tuples, each containing two dataframes, because `method` returns a tuple of two dataframes. One can use the keyword "gather" to flatten this list inside the engine:

```python
    results = engine(method, [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])], gather=True)
```
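As an illustration of what gathering amounts to (a plain-Python sketch of the behavior, not the engine's actual implementation), flattening the list of result tuples is equivalent to chaining them:

```python
from itertools import chain

import pandas as pd

def method(df):
    # Same toy method as above: returns a tuple of two dataframes
    return 2 * df, 3 * df

elements = [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])]

# Without gathering: one 2-tuple of dataframes per input element
results = [method(df) for df in elements]

# gather=True corresponds to flattening the tuples into one list of dataframes
gathered = list(chain.from_iterable(results))
```

So with three inputs and a method returning 2-tuples, `results` has 3 entries while `gathered` has 6.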
### Batching

By default, one process executes `method` on a single element of the iterable. This can result in significant overhead if your iterable is much bigger than the number of workers, in which case the keyword "batched" can be used:

```python
    results = engine(method, [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])], batched=True)
```

In that case, sublists of elements are given to each process, so that there are exactly as many batches as workers (unless the iterable is too small, of course).

Batching can have its own problem, namely load imbalance: some processes may finish much more quickly than others. One can optionally use more batches than the number of workers by giving an integer instead of a boolean to the "batched" keyword:

```python
    # Using 16 batches
    results = engine(method, [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])], batched=16)
```

### Other keyword arguments

The `method` can accept other keyword arguments, for example:

```python
def method(df, s):
    return 2 * df * s, 3 * df * s
```

Those can be given when calling the engine and will be passed to each process. For example:

```python
from adparallelengine import Engine
import pandas as pd
from transparentpath import Path

def method(df, s):
    return 2 * df * s, 3 * df * s

if __name__ == "__main__":
    which = "multiproc"  # Can also be "serial", "dask", "mpi" or "k8s"
    engine = Engine(kind=which, path_shared=Path("tests") / "data" / "shared")
    some_series = pd.Series([10, 20])
    results = engine(method, [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])], s=some_series)
```
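Conceptually (a sketch of the behavior, not the engine's internals), passing `s=some_series` this way is like mapping a `functools.partial` that binds the extra keyword argument on every call:

```python
from functools import partial

import pandas as pd

def method(df, s):
    return 2 * df * s, 3 * df * s

some_series = pd.Series([10, 20])
elements = [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]])]

# Each worker effectively calls method(element, s=some_series);
# evaluated serially, that is:
bound = partial(method, s=some_series)
results = list(map(bound, elements))
```

Note that `df * s` aligns the series index against the dataframe columns, so column 0 is multiplied by 10 and column 1 by 20.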
#### Large objects given to keyword arguments

If `method` is given large objects as keyword arguments, passing them to the workers can cost a significant amount of time. I have observed that working out-of-core can sometimes be quicker, despite the I/O it implies; it can even save a bit of memory. You can use this through the "share" keyword argument:

```python
    results = engine(method, [pd.DataFrame([[1, 2]]), pd.DataFrame([[3, 4]]), pd.DataFrame([[5, 6]])], share={"s": some_series})
```

Here, `some_series` will be written to disk by the engine, and only a path will be given to each process, which will then read it when starting. For now, only pandas dataframes and series and numpy arrays are supported for sharing. By default, shared objects are written to the local temp directory, but one can specify another location by giving the "path_shared" keyword argument when creating the engine (NOT when calling it!).
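A rough sketch of the idea behind sharing (illustrative only; the engine's actual on-disk format and API are its own): the parent writes the object once, and each worker receives only the cheap-to-pickle path and loads the data itself.

```python
import tempfile
from pathlib import Path

import numpy as np

# Parent process: write the large object to disk once
shared_dir = Path(tempfile.mkdtemp())
big_array = np.arange(1_000_000, dtype=np.float64)
shared_path = shared_dir / "shared_array.npy"
np.save(shared_path, big_array)

# Worker: receives only the path, reads the object when starting
def worker(element, shared_file):
    data = np.load(shared_file)
    return element * data[:3]

result = worker(2.0, shared_path)
```

Compared with pickling `big_array` into every worker, only one serialization happens, at the cost of each worker doing one read.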
### Method to run in each process

When using multiprocessing with numpy, one has to use the "spawn" multiprocessing context to avoid the GIL. By doing so, however, any environment variable or class attribute defined in the main process is forgotten in the child processes, since the code is imported from scratch there. So, one might need to re-load some variables and re-set some class attributes inside each process. This can be done through an additional method given to the engine. The complete example below shows how it is done.

### Complete example

The code below shows a full example of how to use the engine. Here `method` accepts two extra arguments: one that can be a float, a pandas dataframe or series, or a numpy array, and one that is expected to be a float. It returns a tuple of two dataframes.

If the parallelization is done using Python's native multiprocessing, do not forget the `if __name__ == "__main__"` guard, as in the example!

```python
import sys
from typing import Union
import pandas as pd
import numpy as np
from transparentpath import Path

from adparallelengine import Engine


class Dummy:
    some_attr = 0


def method_in_processes(a):
    Dummy.some_attr = a


def method(
    element: pd.DataFrame,
    some_other_stuff: Union[float, pd.DataFrame, pd.Series, np.ndarray],
    some_float: float,
):
    return (
        element * some_other_stuff + some_float + Dummy.some_attr,
        3 * (element * some_other_stuff + some_float + Dummy.some_attr),
    )


if __name__ == "__main__":

    Dummy.some_attr = 1

    dfs = [
        pd.DataFrame([[0, 1], [2, 3]]),
        pd.DataFrame([[4, 5], [6, 7]]),
        pd.DataFrame([[8, 9], [10, 11]]),
        pd.DataFrame([[12, 13], [14, 15]]),
        pd.DataFrame([[16, 17], [18, 19]]),
        pd.DataFrame([[21, 22], [23, 24]]),
    ]
    s = pd.Series([2, 3])
    f = 5.0

    which = sys.argv[1]
    gather = sys.argv[2] == "True"
    batched = True if sys.argv[3] == "True" else False if sys.argv[3] == "False" else int(sys.argv[3])
    share = sys.argv[4] == "True"

    if share:
        share_kwargs = {"share": {"some_other_stuff": s}}
    else:
        share_kwargs = {"some_other_stuff": s}
    engine = Engine(kind=which, path_shared=Path("tests") / "data" / "shared")
    res = engine(
        method,
        dfs,
        init_method={"method": method_in_processes, "kwargs": {"a": 1}},
        some_float=f,
        gather=gather,
        batched=batched,
        **share_kwargs
    )
```
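For reference, here is what each worker computes in the example above, evaluated serially on the first element of `dfs` (assuming the engine simply maps `method` over the iterable with the given keywords, and that `method_in_processes(a=1)` has set `Dummy.some_attr` to 1 in the worker):

```python
import pandas as pd

class Dummy:
    some_attr = 1  # what method_in_processes(a=1) would set in each worker

def method(element, some_other_stuff, some_float):
    return (
        element * some_other_stuff + some_float + Dummy.some_attr,
        3 * (element * some_other_stuff + some_float + Dummy.some_attr),
    )

s = pd.Series([2, 3])
f = 5.0
df = pd.DataFrame([[0, 1], [2, 3]])  # first element of dfs

# element * s multiplies column 0 by 2 and column 1 by 3, then 5.0 + 1 is added
first, second = method(df, s, f)
```

Here `first` is `[[6, 9], [10, 15]]` and `second` is three times that, `[[18, 27], [30, 45]]`.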