{"id":20807052,"url":"https://github.com/doubleml/doubleml-serverless","last_synced_at":"2025-05-07T05:06:57.272Z","repository":{"id":37867529,"uuid":"323852857","full_name":"DoubleML/doubleml-serverless","owner":"DoubleML","description":"DoubleML-Serverless - Distributed Double Machine Learning with a Serverless Architecture","archived":false,"fork":false,"pushed_at":"2023-12-25T13:29:41.000Z","size":70,"stargazers_count":12,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-07T05:06:48.941Z","etag":null,"topics":["aws-lambda","causal-inference","data-science","double-machine-learning","econometrics","machine-learning","python","scikit-learn","serverless","statistics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DoubleML.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-23T09:00:18.000Z","updated_at":"2023-11-07T11:24:32.000Z","dependencies_parsed_at":"2024-11-18T08:15:54.352Z","dependency_job_id":null,"html_url":"https://github.com/DoubleML/doubleml-serverless","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DoubleML%2Fdoubleml-serverless","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DoubleML%2Fdoubleml-serverless/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DoubleML%2Fdoubleml-serverless/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DoubleML%2Fdoubleml-serverless/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DoubleML","download_url":"https://codeload.github.com/DoubleML/doubleml-serverless/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252817010,"owners_count":21808705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-lambda","causal-inference","data-science","double-machine-learning","econometrics","machine-learning","python","scikit-learn","serverless","statistics"],"created_at":"2024-11-17T19:30:23.119Z","updated_at":"2025-05-07T05:06:57.256Z","avatar_url":"https://github.com/DoubleML.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DoubleML-Serverless - Distributed Double Machine Learning with a Serverless Architecture \u003ca href=\"https://docs.doubleml.org\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/DoubleML/doubleml-for-py/main/doc/logo.png\" align=\"right\" width = \"120\" /\u003e\u003c/a\u003e\n\nThis repo contains a prototype implementation **DoubleML-Serverless** of distributed double machine learning with a serverless infrastructure\nusing [AWS Lambda](https://aws.amazon.com/lambda).\nA detailed discussion of this prototype can be found in the paper [\"Distributed Double Machine Learning with a Serverless Architecture\" (Kurz, 2021)](https://doi.org/10.1145/3447545.3451181).\nDoubleML-Serverless is an extension for serverless cloud computing of the Python package 
**DoubleML**.\nDoubleML is available via PyPI [https://pypi.org/project/DoubleML](https://pypi.org/project/DoubleML) and on GitHub [https://github.com/DoubleML/doubleml-for-py](https://github.com/DoubleML/doubleml-for-py).\nThe Python package DoubleML was introduced in\n\"DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python\"\n([Bach et al., 2022](https://www.jmlr.org/papers/v23/21-0862.html))\nand detailed documentation \\\u0026 user guide for the package is available at\n[https://docs.doubleml.org](https://docs.doubleml.org).\n\n## Getting Started\n\n### Installation of DoubleML-Serverless\n\nTo install, download the latest source code from GitHub via\n```\ngit clone git@github.com:DoubleML/doubleml-serverless.git\ncd doubleml-serverless\n```\n\nThen build the package from source using pip in editable mode.\n\n```\npip install --editable .\n```\n\nAs an alternative to installation from source, released versions of the DoubleML-Serverless package in the form of\n.whl files can be obtained from [GitHub Releases](https://github.com/DoubleML/doubleml-serverless/releases).\nAfter downloading the wheel, the package can be installed with pip (replace `XXX` with the downloaded package version).\n```\npip install -U DoubleML-Serverless-XXX-py3-none-any.whl\n```\n\n### Deploy the Corresponding Serverless App to AWS Lambda using AWS SAM\n\nTo use AWS Lambda for estimating double machine learning models, a deployment in your AWS account is necessary.\nThe corresponding serverless application consists of the following components:\n\n* An AWS Lambda function called `LambdaCVPredict` (the source code is taken from this repository [https://github.com/DoubleML/doubleml-serverless/blob/main/aws_lambda_app/lambda_functions/cv_predict.py](https://github.com/DoubleML/doubleml-serverless/blob/main/aws_lambda_app/lambda_functions/cv_predict.py)).\n* A layer providing the Python libraries `scikit-learn`, `pandas` and `numpy` together with their 
dependencies.\n* An S3 bucket for the data transfer (a new bucket can optionally be created, or an existing bucket can be used).\n* An execution role for the Lambda function `LambdaCVPredict`, which consists of the AWS-managed `AWSLambdaBasicExecutionRole` policy plus read access to the S3 bucket for data transfer.\n\n\nThere are two options for deployment:\n\n1. A version of DoubleML-Serverless is available in the AWS Serverless Application Repository: [https://serverlessrepo.aws.amazon.com/applications/eu-central-1/839779594349/doubleml-serverless](https://serverlessrepo.aws.amazon.com/applications/eu-central-1/839779594349/doubleml-serverless). It can be deployed by clicking on the `Deploy` button.\n\n2. The second option for deployment is based on the AWS Serverless Application Model (AWS SAM).\n\n    2.1 Set up the AWS SAM CLI as described here: [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html)\n\n    2.2 To deploy the application, use the following commands (for more information see [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html))\n    ```\n    cd aws_lambda_app\n    sam build\n    sam deploy --guided\n    ```\n\n### Estimating a Partially Linear Regression Model with Double Machine Learning and Serverless Scaling Using AWS Lambda\n\nTo demonstrate the functionality of DoubleML-Serverless, we revisit the Pennsylvania Reemployment Bonus experiment\nand estimate the effect of providing a cash bonus on the unemployment duration as studied in [Chernozhukov et al. 
(2018)](https://doi.org/10.1111/ectj.12097).\nThis example is also discussed in the paper accompanying the DoubleML-Serverless package ([Kurz, 2021](https://doi.org/10.1145/3447545.3451181)).\n\nWe first load the data using functionality from the DoubleML package.\n```python\nfrom doubleml.datasets import fetch_bonus\ndf_bonus = fetch_bonus('DataFrame')\n```\n\nThe class `DoubleMLDataS3` serves as the data backend for DoubleML-Serverless model classes.\nIt inherits from the `DoubleML` class `DoubleMLData`.\nWe initialize a `DoubleMLDataS3` object for the bonus data and upload it to the S3 bucket `doubleml-serverless-data` used for the data transfer to AWS Lambda.\n```python\nfrom doubleml_serverless import DoubleMLDataS3\n\ndml_data_bonus = DoubleMLDataS3(\n    'doubleml-serverless-data', 'bonus_data.csv',\n    df_bonus,\n    y_col='inuidur1',\n    d_cols='tg',\n    x_cols=['female', 'black', 'othrace',\n            'dep1', 'dep2', 'q2', 'q3',\n            'q4', 'q5', 'q6', 'agelt35',\n            'agegt54', 'durable', 'lusd', 'husd'])\ndml_data_bonus.store_and_upload_to_s3()\n```\n\nTo estimate the nuisance functions, we use a random forest regressor which averages over 500 trees.\nWe further apply repeated cross-fitting with 5 folds and 100 repetitions/splits.\n```python\nfrom doubleml_serverless import DoubleMLPLRServerless\nfrom sklearn.base import clone\nfrom sklearn.ensemble import RandomForestRegressor\n\nml = RandomForestRegressor(n_estimators=500)\nml_g = clone(ml)\nml_m = clone(ml)\ndml_lambda_plr_bonus = DoubleMLPLRServerless(\n    'LambdaCVPredict', 'eu-central-1',\n    dml_data_bonus, ml_g, ml_m,\n    n_folds=5, n_rep=100)\n```\n\nTo estimate the model locally, we can call `dml_lambda_plr_bonus.fit()`.\nEstimation on AWS Lambda is achieved via `dml_lambda_plr_bonus.fit_aws_lambda()`.\nNote that you will be charged for all resources used in the AWS account you deployed the serverless application 
to.\n```python\ndml_lambda_plr_bonus.fit_aws_lambda()\n```\n\nA summary of the estimation result is available via the property `dml_lambda_plr_bonus.summary`.\nSome metrics about the estimation on AWS Lambda can be obtained via the property `dml_lambda_plr_bonus.aws_lambda_metrics`.\n\n## Citation\n\nIf you use the DoubleML-Serverless package, a citation is highly appreciated:\n\nKurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture.\nIn Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21).\nAssociation for Computing Machinery, New York, NY, USA, 27–33.\ndoi:[10.1145/3447545.3451181](https://doi.org/10.1145/3447545.3451181).\n\nBibTeX entry:\n\n```\n@inproceedings{kurz2021DoublemlServerless,\n   author = {Kurz, Malte S.},\n   title = {Distributed Double Machine Learning with a Serverless Architecture},\n   year = {2021},\n   isbn = {9781450383318},\n   publisher = {Association for Computing Machinery},\n   address = {New York, NY, USA},\n   url = {https://doi.org/10.1145/3447545.3451181},\n   doi = {10.1145/3447545.3451181},\n   abstract = {This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It allows to get fast on-demand estimations without additional cloud maintenance effort. 
We provide a prototype Python implementation DoubleML-Serverless for the estimation of double machine learning models with the serverless computing platform AWS Lambda and demonstrate its utility with a case study analyzing estimation times and costs.},\n   booktitle = {Companion of the ACM/SPEC International Conference on Performance Engineering},\n   pages = {27--33},\n   numpages = {7},\n   keywords = {machine learning, causal machine learning, serverless computing, distributed computing, AWS Lambda, function-as-a-service (FAAS)},\n   location = {Virtual Event, France},\n   series = {ICPE '21}\n}\n```\n\n## Acknowledgements\n\nFunding by the Deutsche Forschungsgemeinschaft (DFG, German Research\nFoundation) is acknowledged – Project Number 431701914.\n\n## References\n\nBach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2022), DoubleML - An\nObject-Oriented Implementation of Double Machine Learning in Python,\nJournal of Machine Learning Research, 23(53): 1-6,\n[https://www.jmlr.org/papers/v23/21-0862.html](https://www.jmlr.org/papers/v23/21-0862.html).\n\nChernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018).\nDouble/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.\ndoi:[10.1111/ectj.12097](https://doi.org/10.1111/ectj.12097).\n\nKurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture.\nIn Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21).\nAssociation for Computing Machinery, New York, NY, USA, 27–33.\ndoi:[10.1145/3447545.3451181](https://doi.org/10.1145/3447545.3451181).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoubleml%2Fdoubleml-serverless","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdoubleml%2Fdoubleml-serverless","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoubleml%2Fdoubleml-serverless/lists"}