{"id":20956554,"url":"https://github.com/bluebrain/bluepyparallel","last_synced_at":"2025-05-14T05:32:06.072Z","repository":{"id":225201764,"uuid":"762289789","full_name":"BlueBrain/BluePyParallel","owner":"BlueBrain","description":"Provides an embarrassingly parallel tool with sql backend","archived":false,"fork":false,"pushed_at":"2024-11-06T09:29:56.000Z","size":123,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-06T10:17:32.694Z","etag":null,"topics":["parallel","parallel-computing","parallel-programming","parallelization","python","python3"],"latest_commit_sha":null,"homepage":"https://bluepyparallel.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BlueBrain.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-23T13:23:58.000Z","updated_at":"2024-11-06T09:29:56.000Z","dependencies_parsed_at":"2024-11-06T10:17:35.603Z","dependency_job_id":"21689e6b-eadb-4140-a6e9-a7c7a50aa841","html_url":"https://github.com/BlueBrain/BluePyParallel","commit_stats":null,"previous_names":["bluebrain/bluepyparallel"],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2FBluePyParallel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2FBluePyParallel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2FBluePyParallel/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2FBluePyParallel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BlueBrain","download_url":"https://codeload.github.com/BlueBrain/BluePyParallel/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225277137,"owners_count":17448627,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parallel","parallel-computing","parallel-programming","parallelization","python","python3"],"created_at":"2024-11-19T01:26:41.581Z","updated_at":"2024-11-19T01:26:42.208Z","avatar_url":"https://github.com/BlueBrain.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BluePyParallel: Bluebrain Python Embarrassingly Parallel library\n\nProvides an embarrassingly parallel tool with sql backend.\n\n## Introduction\n\nProvides an embarrassingly parallel tool with sql backend, inspired by [BluePyMM](https://github.com/BlueBrain/BluePyMM) of @wvangeit.\n\n\n## Installation\n\nThis package should be installed using pip:\n\n```bash\npip install bluepyparallel\n```\n\n\n## Usage\n\n### General computation\n\n```python\n\nfactory_name = \"multiprocessing\"  # Can also be None, dask or ipyparallel\nbatch_size = 10  # This value is used to split the data into batches before processing them\nchunk_size = 1000  # This value is used to gather the elements to process before sending them to the workers\n\n# Setup the parallel factory\nparallel_factory = init_parallel_factory(\n    factory_name,\n    
batch_size=batch_size,\n    chunk_size=chunk_size,\n    processes=4,  # This parameter is specific to the multiprocessing factory\n)\n\n# Get the mapper from the factory\nmapper = parallel_factory.get_mapper()\n\n# Use the mapper to map the given function to each element of mapped_data and gather the results\nresult = sorted(mapper(function, mapped_data, *function_args, **function_kwargs))\n```\n\n### Working with Pandas\n\nThis library provides a specific function for working with large :class:`pandas.DataFrame`: :func:`bluepyparallel.evaluator.evaluate`.\nThis function converts the DataFrame into a list of dicts (one per row), then maps a given function to each element, and finally gathers the results.\n\nExample:\n\n```python\nimport pandas as pd\nfrom bluepyparallel.evaluator import evaluate\n\ninput_df = pd.DataFrame(index=[1, 2], columns=['data'], data=[100, 200])\n\ndef evaluation_function(row):\n    # compute_something is a placeholder for the actual computation\n    result_1, result_2 = compute_something(row['data'])\n    return {'new_column_1': result_1, 'new_column_2': result_2}\n\n# Map the given function to each row of the DataFrame\nresult_df = evaluate(\n    input_df,  # This is the DataFrame to process\n    evaluation_function,  # This is the function that should be applied to each row of the DataFrame\n    parallel_factory=\"multiprocessing\",  # This could also be a Factory previously defined\n    new_columns=[['new_column_1', 0], ['new_column_2', None]],  # This defines the default values of the new columns\n)\nassert list(result_df.columns) == ['data', 'new_column_1', 'new_column_2']\n```\nIt is, in a way, a generalisation of the pandas `.apply` method.\n\n\n### Working with an SQL backend\n\nAs the library targets time-consuming functions, it also provides a checkpoint-and-resume mechanism using an SQL backend.\nThe SQL backend uses the [SQLAlchemy](https://docs.sqlalchemy.org) library, so it can work with a large variety of database types (like SQLite, PostgreSQL, MySQL, ...).\nTo activate this feature, just pass a [URL that can be processed by 
SQLAlchemy](https://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=url#database-urls) to the ``db_url`` parameter of :func:`bluepyparallel.evaluator.evaluate`.\n\n.. note:: A specific driver might have to be installed to access the database (for example `psycopg2 \u003chttps://www.psycopg.org/docs/\u003e`_ for PostgreSQL).\n\nExample:\n\n```python\n# Map the given function to each row of the DataFrame\nresult_df = evaluate(\n    input_df,  # This is the DataFrame to process\n    evaluation_function,  # This is the function that should be applied to each row of the DataFrame\n    parallel_factory=\"multiprocessing\",  # This could also be a Factory previously defined\n    db_url=\"sqlite:///db.sql\",  # This could also just be \"db.sql\" and would be automatically turned into an SQLite URL\n)\n```\n\nNow, if the computation crashes for any reason, the partial results remain stored in the ``db.sql`` file.\nIf the crash was due to an external cause (so that executing the code again should work), it is possible to resume the\ncomputation from the last computed element. 
Thus, only the missing elements are computed, which can save a lot of time.\n\n\n## Running with distributed Dask MPI on HPC systems\n\nThis is an example of a [sbatch](https://slurm.schedmd.com/sbatch.html) script that can be\nadapted to execute the script using multiple nodes and workers with distributed dask and MPI.\nIn this example, the code called by the ``run.py`` should be parallelized using BluePyParallel.\n\nDask variables are not strictly required, but highly recommended, and they can be fine tuned.\n\n\n```bash\n#!/bin/bash -l\n\n# Dask configuration\nexport DASK_DISTRIBUTED__LOGGING__DISTRIBUTED=\"info\"\nexport DASK_DISTRIBUTED__WORKER__USE_FILE_LOCKING=False\nexport DASK_DISTRIBUTED__WORKER__MEMORY__TARGET=False  # don't spill to disk\nexport DASK_DISTRIBUTED__WORKER__MEMORY__SPILL=False  # don't spill to disk\nexport DASK_DISTRIBUTED__WORKER__MEMORY__PAUSE=0.80  # pause execution at 80% memory use\nexport DASK_DISTRIBUTED__WORKER__MEMORY__TERMINATE=0.95  # restart the worker at 95% use\nexport DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=spawn\nexport DASK_DISTRIBUTED__WORKER__DAEMON=True\n# Reduce dask profile memory usage/leak (see https://github.com/dask/distributed/issues/4091)\nexport DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms  # Time between statistical profiling queries\nexport DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms  # Time between starting new profile\n\n# Split tasks to avoid some dask errors (e.g. Event loop was unresponsive in Worker)\nexport PARALLEL_BATCH_SIZE=1000\n\nsrun -v run.py\n```\n\nTo ensure only the `evaluate` function is run with parallel dask, one has to initialise the parallel factory\nbefore anything else is done in the code. 
For example, ``run.py`` could look like:\n\n```python\nif __name__ == \"__main__\":\n    parallel_factory = init_parallel_factory('dask_dataframe')\n    df = pd.read_csv(\"input_data.csv\")\n    df = some_preprocessing(df)\n    df = evaluate(df, function_to_evaluate, parallel_factory=parallel_factory)\n    df.to_csv(\"output_data.csv\")\n```\n\nThis is because everything before `init_parallel_factory` will be run in parallel by every process, as MPI is not initialized yet.\n\n.. note:: We recommend using `dask_dataframe` instead of `dask`, as it is in practice more stable for large computations.\n\n## Funding \u0026 Acknowledgment\n\nThe development of this software was supported by funding to the Blue Brain Project, a research\ncenter of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH\nBoard of the Swiss Federal Institutes of Technology.\n\nFor license and authors, see `LICENSE.txt` and `AUTHORS.md` respectively.\n\nCopyright © 2023-2024 Blue Brain Project/EPFL\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluebrain%2Fbluepyparallel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbluebrain%2Fbluepyparallel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluebrain%2Fbluepyparallel/lists"}