{"id":24956975,"url":"https://github.com/prodo-dev/plz","last_synced_at":"2025-04-10T19:05:11.196Z","repository":{"id":50208274,"uuid":"117868411","full_name":"prodo-dev/plz","owner":"prodo-dev","description":"Say the magic word 😸","archived":false,"fork":false,"pushed_at":"2022-12-08T04:53:28.000Z","size":5782,"stargazers_count":30,"open_issues_count":11,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-24T16:46:19.034Z","etag":null,"topics":["automation","aws","aws-infrastructure","examples","experiments","machine-learning","ml","ml-engine","pytorch","reproducibility"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prodo-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-01-17T17:20:46.000Z","updated_at":"2024-01-13T23:57:47.000Z","dependencies_parsed_at":"2023-01-24T11:15:38.688Z","dependency_job_id":null,"html_url":"https://github.com/prodo-dev/plz","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prodo-dev%2Fplz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prodo-dev%2Fplz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prodo-dev%2Fplz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prodo-dev%2Fplz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prodo-dev","download_url":"https://codeload.github.com/prodo-dev/plz/tar.gz/refs/heads/master","host":{"name":"GitHub","url
":"https://github.com","kind":"github","repositories_count":248279195,"owners_count":21077406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","aws","aws-infrastructure","examples","experiments","machine-learning","ml","ml-engine","pytorch","reproducibility"],"created_at":"2025-02-03T06:41:49.183Z","updated_at":"2025-04-10T19:05:11.174Z","avatar_url":"https://github.com/prodo-dev.png","language":"Python","readme":"# Plz 😸\n\n_Say the magic word._\n\n_Plz_ (pronounced \"please\") runs your jobs storing code, input, outputs and\nresults so that they can be queried programmatically. That way, it helps with\ntraceability and reproducibility. In case you want to run your jobs in the\ncloud, it makes the process frictionless compared to running them locally. 
Jump [here](#plz-in-action) to see it in action.

At Prodo.AI, we use Plz to train our PyTorch-based machine learning models.

_Plz is an experimental product and is not guaranteed to be stable across
versions._

## Contents:

- [Plz in action](#plz-in-action)
- [How does it work?](#how-does-it-work)
- [Installation instructions](#installation-instructions)
- [Examples](#examples)
- [Plz principles](#plz-principles)
- [Future work](#future-work)

## Highlights

- simple command line interface
- cloud-agnostic architecture (on top of Docker), allowing you to run jobs
  locally, on bare metal, or in the cloud
  - Plz currently supports Amazon Web Services (AWS), but will most likely
    support other cloud providers in the future
  - full control of the type of cloud instance, allowing you to use whatever
    machine fits your job (and budget)
  - full support for NVIDIA GPUs, allowing you to run deep learning experiments
- common tooling support, with the following straight out of the box:
  - Python
  - Anaconda
  - PyTorch
- data-based workflow, so that you don't accidentally compute your model with
  the wrong input
- parameter awareness, so that you can run the same experiment with multiple
  sets of parameters
- full history, so that you can review your experiments over time
- useful examples provided (see the [Examples](#examples) section)
- MIT license, allowing modification, distribution, and private or commercial use
  (see [LICENSE](LICENSE) for more details)
- open for contributions, plz

## Plz in action

We offer more details [below](#installation-instructions) on how to set up Plz
and run your jobs, but let's start with an overview of what Plz does.

Plz offers a command-line interface. You start by adding a `plz.config.json`
file to the directory where you have your source code. This file contains, among
other things, the command you run to put your program to work (for instance,
`python3 main.py`).
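To make this concrete, a `plz.config.json` could look something like the sketch below. This is illustrative only: apart from `input`, `instance_market_type`, `max_bid_price_in_dollars_per_hour` and `instance_max_uptime_in_minutes`, which appear later in this README, the key names (notably `command`) are assumptions; the configuration files in the `examples/` directory are the authoritative reference.

```json
{
  "command": ["python3", "main.py"],
  "input": "file://../data/mnist",
  "instance_market_type": "spot",
  "max_bid_price_in_dollars_per_hour": 0.5,
  "instance_max_uptime_in_minutes": 60
}
```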
Then you can use Plz to run your program with `plz run`. The
following example (provided in this repository) demonstrates this:

```
sergio@spaceship:~/plz/examples/pytorch$ plz run
👌 Capturing the files in /home/sergio/plz/examples/pytorch
👌 Building the program snapshot
Step 1/4 : FROM prodoai/plz_ml-pytorch
# Executing 3 build triggers
 ---> Using cache
[...]
---> 9c39e889659d
Successfully built 9c39e889659d
Successfully tagged 024444204267.dkr.ecr.eu-west-1.amazonaws.com/plz/builds:some-person-trying-pytorch-mnist-example-1541436382135
👌 Capturing the input
👌 983663 input bytes to upload
👌 Sending request to start execution
Instance status: querying availability
Instance status: requesting new instance
Instance status: pending
[...]
Instance status: starting container
Instance status: running
👌 Execution ID is: 55b66652-e11a-11e8-a36a-233ad251f4c1
👌 Streaming logs...
Using device: cuda
Epoch: 1. Training loss: 2.146302
Evaluation accuracy: 47.90 (max 0.00)
Best model found at epoch 1, with accuracy 47.90
Epoch: 2. Training loss: 0.660179
Evaluation accuracy: 83.30 (max 47.90)
Best model found at epoch 2, with accuracy 83.30
Epoch: 3. Training loss: 0.251717
Evaluation accuracy: 87.80 (max 83.30)
Best model found at epoch 3, with accuracy 87.80
[...]
Epoch: 30. Training loss: 0.010750
Evaluation accuracy: 97.50 (max 98.10)
👌 Harvesting the output...
👌 Retrieving summary of measures (if present)...
{
  "max_accuracy": 98.1,
  "training_loss_at_max": 0.008485347032546997,
  "epoch_at_max": 25,
  "training_time": 43.3006055355072
}
👌 Execution succeeded.
👌 Retrieving the output...
le_net.pth
👌 Done and dusted.
```

From the above output, you'll see Plz do the following:

- Plz captures the files in your current directory.
  A snapshot of your code is
  built and stored in your infrastructure, so that you can retrieve the code
  used to run your job in the future (yes, you can specify files to be ignored,
  and you do so in `plz.config.json`).
- It captures input data (as specified in the config) and uploads it. If you run
  another execution with the same input data, it will avoid uploading the data
  a second time (based on timestamps and hashes).
- It starts an AWS instance and waits until it's ready (or just runs the
  execution locally, depending on the configuration).
- It streams the logs, just as if you were running your program directly.
- It shows metrics you collected during the run, such as _accuracy_ and _loss_
  (you can query those later).
- Finally, it downloads any output files you might have created.
- (The AWS instance will be shut down in the background.)

You can be patient and wait until it finishes, or you can hit `Ctrl+C` and stop
the program early:

```
Epoch: 9 Training loss: 0.330538
^C
👌 Your program is still running. To stream the logs, type:

        plz logs ad96b586-89e5-11e8-a7c5-8142e2563487
```

Plz runs your commands in a Docker container, either in your AWS infrastructure
or on your local machine, so your actions in the terminal don't affect the
execution. If this is your only running execution, you can just type `plz logs`
and logs will be streamed from the current moment (unless you specify
`--since=start`, which tells it to stream from the start of the execution).

The big hexadecimal number you see in the output, next to `plz logs`, is the
execution ID you can use to refer to this execution. Plz remembers the last
execution that was _started_, and if you want to refer to that one you don't
need to include it in your command (you can just type `plz logs`).
But if you need to specify the execution ID, you can do `plz logs <execution-id>`.

Once your program has finished (or you've stopped it with `plz stop`), you can
run `plz output`, and it will download the files that your program has written.
To use this functionality, you need to tell your program to write to a
specific directory, which is provided to your program as an environment
variable. The files are saved under `output/<execution-id>` by default, but you
can specify the location with the `-p` option.

The instance is kept around for some time (specified in `plz.config.json`)
in case you're running things interactively (so that you don't need to wait
while the instance goes through the startup process again).

You can use `plz describe` to print metadata about an execution in JSON format.
It's useful for telling one execution from another if you have several running at
the same time.

You can use `plz run --parameters a_json_file.json` to pass parameters to your
program. Passing parameters this way has two advantages:

- the parameters are stored in the metadata and can be queried (see the
  description of `plz history` below)
- you can use `plz rerun --override-parameters some_json_file.json` to run
  exactly the same execution but with different parameters, which helps you run
  experiments in a systematic fashion.

There's also `plz history`, which returns a JSON mapping from execution IDs to
metadata. If you write JSON files to a specific directory (see
`test/end-to-end/measures/simple`), they will be available in the metadata. You
can store things you've measured during your experiment (for instance, training
loss).
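Since `plz history` emits plain JSON, you can post-process it from any language, not just the shell. Below is a minimal Python sketch; `summarize_history` is our own hypothetical helper (not part of Plz), and the shape of the metadata is assumed from the examples in this README:

```python
def summarize_history(history: dict) -> list:
    """Flatten the object returned by `plz history` into one row per
    execution, keeping the parameters and the summary measures."""
    return [
        {
            'execution_id': execution_id,
            'parameters': metadata.get('parameters', {}),
            'summary': metadata.get('measures', {}).get('summary', {}),
        }
        for execution_id, metadata in history.items()
    ]

# Against a running controller you would fetch the real thing, e.g.:
#   import json, subprocess
#   history = json.loads(subprocess.check_output(['plz', 'history']))
# Here we use a hand-written sample with the same assumed shape:
sample_history = {
    'dafcb478-e11e-11e8-9f2c-87dc520968d5': {
        'parameters': {'learning_rate': 0.01},
        'measures': {'summary': {'max_accuracy': 98}},
    },
}
for row in summarize_history(sample_history):
    print(row['execution_id'], row['parameters'], row['summary'])
```

The same flattening is what the `jq` one-liner below does on the command line.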
Parameters will be in the metadata as well, so you can transform the
metadata using, for instance, [`jq`](https://stedolan.github.io/jq/), and find
out how your training loss changed as you changed your parameters.

```
sergio@spaceship:~/plz/examples/pytorch$ plz history | \
    jq 'to_entries[] | { "execution_id": .key,
                         "learning_rate": .value.parameters.learning_rate,
                         "accuracy": .value.measures.summary.max_accuracy }'
{
  "execution_id": "dafcb478-e11e-11e8-9f2c-87dc520968d5",
  "learning_rate": 0.01,
  "accuracy": 98
}
{
  "execution_id": "9cfd3f1a-e1cf-11e8-9449-b1cc03bcdb5f",
  "learning_rate": 0.1,
  "accuracy": 98.5
}
{
  "execution_id": "c0d65d66-e1cf-11e8-8ed8-0d6f99ec4bc3",
  "learning_rate": 0.5,
  "accuracy": 13
}
```

In this example, you can see that increasing the learning rate from `0.01` to
`0.1` improves accuracy from 98% to 98.5%, but increasing it further
leads to a disastrous drop to 13%.

You can run `plz list` to list the running executions, as well as any running
instances on AWS. It also shows the instance IDs. You can kill instances with
`plz kill -i <instance-id>`.

The command `plz last` is useful, particularly when writing shell commands, to
get the last execution _started_.

We also make it easy to manage dependencies for projects using Anaconda.
Projects using the image `prodoai/plz_ml-pytorch` need to have an
`environment.yml` file, like the one produced by `conda env export` (see
[the one in the PyTorch example](examples/pytorch/environment.yml)).
This file is applied on top of
[the environment in the image](base-images/ml-pytorch/environment.yml).
Installation of dependencies is cached, so dependencies are only reinstalled
the first time you run after changing the environment file.

## How does it work?

Plz consists of a _controller_ service and a _command-line interface_ (CLI) that
issues requests to the controller. The CLI is a Python executable, `plz`, which
takes instructions (such as `plz run ...`) as described above.

There are two configurations of the controller ready for you to use: in
one of them your jobs are run locally, while in the other an AWS instance is
started for each job. (Note: the controller itself can be deployed to the cloud,
and in a production environment that's the recommended way to use it,
but we suggest you try the examples with a controller that runs locally first.)

When you have a directory with source code, you can just add a `plz.config.json`
file including information such as:

- the location of your Plz server,
- the command you want to run,
- the location of your input data,
- whether you want to request an on-demand instance at a fixed price, or bid for
  spot instances with a price ceiling,
- and much more.

Then, just typing `plz run` will run the job for you, either locally or on AWS,
depending on the controller you've started.

## Installation instructions

Chances are that you have most of the supporting tools already installed, as
these are broadly used tools.

1. Install Git and Python 3.
   1. On Ubuntu, you can run
      `sudo apt install -y git python3 python3-pip python-pip`.
   2. On macOS, install [Homebrew](https://brew.sh/), then run
      `brew install git python`.
   3. For all other operating systems, you're going to have to Google it.
2. Install [Docker](https://docs.docker.com/install/).
   1. On Ubuntu, you can run:
      ```
      sudo apt install -y curl
      curl -fsSL https://get.docker.com -o get-docker.sh
      sudo sh get-docker.sh
      sudo usermod -aG docker "$USER"
      ```
      then start a new shell with `sudo su - "$USER"` so that it picks up the
      membership of the `docker` group.
   2. On macOS, you can use Homebrew to install Docker with
      `brew cask install docker`.
3. Install Docker Compose (`pip install docker-compose`). You might want to make
   sure that `pip` installs the `docker-compose` command somewhere in your
   `PATH`. On Ubuntu with the default Python installation, this is typically
   `$HOME/.local/bin` (so you need the command
   `export PATH="${HOME}/.local/bin:${PATH}"`).
4. If you're planning on running code with CUDA on your machine, install the
   [NVIDIA Container Runtime for Docker](https://github.com/NVIDIA/nvidia-docker)
   (not needed for using CUDA on AWS).
5. `git clone https://github.com/prodo-ai/plz`, then `cd plz`.
6. Install the CLI by running `./install_cli`, which calls `pip3`. As with
   `docker-compose`, you might want to check that the `plz` command is in your
   path.
7. Run the controller
   ([keep reading](#running-the-controller-for-local-executions)).

The first time you run the controller, it will take some time, as it downloads a
"standard" environment which includes Anaconda and PyTorch. When it's ready, the
logs will show `Harvesting complete. You can run plz commands now`.

The controller runs in the foreground and can be killed with _Ctrl+C_. If you'd
like to run it in the background, append `-d` to the command to run it in
"detached" mode.

If you've run the controller in the background, or if you lose your terminal, it
will carry on running.
You can stop it with `./stop`.

### Running the controller for local executions

Once you've set up your system as above, run:

```
./start/local-prebuilt
```

The controller can be stopped at any time with:

```
./stop
```

### Running the controller for AWS executions

If you want to run the examples on AWS instances, be aware that this has
a cost. By default, Plz uses _t2.micro_ on-demand instances. You can find out
how much these cost on the
[AWS EC2 Pricing](https://aws.amazon.com/ec2/pricing/on-demand/) page.

To start a controller that talks to AWS, you'll first need to set up the AWS
CLI:

1. Install the AWS CLI: `pip install awscli`
2. Configure it with your access key: `aws configure`
3. Verify you can connect to AWS by running `aws iam get-user` and checking that
   your username is correct.

If you usually use AWS in a particular region, please edit
`aws_config/config.json` and set your region there. The default file sets the
region to _eu-west-1_ (Ireland).

Then run:

```
./start/aws-prebuilt
```

Unless you add `"instance_max_uptime_in_minutes": null,` to your
`plz.config.json`, all AWS instances you start terminate after 60 minutes.
That's on purpose, in case you're just trying the tool out and something doesn't
go well (for example, there's a power cut). You can always use `plz list` and
`plz kill` before leaving your computer, to make sure that there are no
instances left running. For maximum assurance, we recommend checking the state
of your instances in the AWS console.

By default, Plz uses on-demand instances.
In order to use spot instances,
specify the following in your _plz.config.json_ file:

```json
{
    ...
    "instance_market_type": "spot",
    "max_bid_price_in_dollars_per_hour": <price>
}
```

The values in the example configuration files range from \$0.5/hour to \$2/hour
(for GPU-powered machines).

## Examples

### Python

In the directory `examples/python`, there is a minimal example showing how to
run a program with Plz that handles input and output. Once you
[have a working controller](#installation-instructions), running `plz run`
inside the directory will start the job.

### PyTorch

In the directory `examples/pytorch`, there's a full-fledged example for the task
of digit recognition, using the classic approach of LeNets and a subset of the
well-known MNIST dataset.

Everything related to Plz is in `main.py`. In fact, the most relevant lines are
the following ones:

```python
def get_from_plz_config(key: str, non_plz_value: T) -> T:
    configuration_file = os.environ.get('CONFIGURATION_FILE', None)
    if configuration_file is not None:
        with open(configuration_file) as c:
            config = json.load(c)
        return config[key]
    else:
        return non_plz_value
[...]
    input_directory = get_from_plz_config(
        'input_directory', os.path.join('..', 'data'))
    output_directory = get_from_plz_config('output_directory', 'models')
    parameters = get_from_plz_config('parameters', DEFAULT_PARAMETERS)
    measures_directory = get_from_plz_config('measures_directory', 'measures')
    summary_measures_path = get_from_plz_config(
        'summary_measures_path',
        os.path.join('measures', 'summary'))
```

This shows how to get the input data and parameters that Plz uploads for you.
There's a configuration file whose name comes in the environment variable
`CONFIGURATION_FILE`.
If that variable is present, you're running with Plz, and
you can read and parse the file as a JSON object. The object has the following
keys:

- `input_directory` is a directory where you'll find your input data. If you
  have `"input": "file://../data/mnist",` in your `plz.config.json` file, the
  directory `config['input_directory']` will have the same contents that
  `../data/mnist` has locally.
- `output_directory` is a directory where you can write files. These are
  retrieved via `plz output`, or downloaded if you keep the CLI running until
  the end of the job.
- `parameters` is the JSON object that you passed with
  `plz run --parameters a_json_file.json`, if you did so. Otherwise it's an
  empty object.
- `measures_directory` is a directory in which you can write measures. You can
  query these with `plz measures`. Each file is interpreted as a property in a
  JSON object, using the file name as the key, and the file contents,
  interpreted as JSON, as the value. By writing the code:

  ```python
      with open(os.path.join(measures_directory, f'epoch_{epoch}'), 'w') as f:
          json.dump({'training_loss': training_loss, 'accuracy': accuracy}, f)
  ```

  You can then run:

  ```
  sergio@spaceship:~/plz/examples/pytorch$ plz measures
  {
    "epoch_1": {
      "training_loss": 2.1326301097869873,
      "accuracy": 45.4
    },
    "epoch_2": {
      [...]
    }
  }
  ```

- `summary_measures_path` is a path to a file in which you can write a JSON
  object with a summary of the results you obtained in your run (best accuracy,
  total training time, etc.).
The summary is available via `plz measures -s`,
  and it is also printed by the CLI if you wait until the job finishes.

If you want to use CUDA for this example, we have provided an example
configuration file for this purpose:

```
plz -c plz.cuda.config.json run
```

This tells Docker to use the
[CUDA runtime](https://github.com/NVIDIA/nvidia-docker).

## Plz principles

We built Plz following these principles:

- Code and data must be stored for future reference.
- Whatever part of the running environment can be captured by Plz, we capture
  it, so as to make jobs repeatable.
- Functionality is based on standard mechanisms like files and environment
  variables. You don't need to add extra dependencies to your code or learn how
  to read/write your data in specific ways.
- The tool must be flexible enough that no unnecessary restrictions are
  imposed by the architecture. You should be able to do with Plz whatever you
  can do by running a program manually. It was surprising to find out how many
  issues, mostly around running jobs in the cloud, could be solved just by
  tweaking the configuration, without requiring any changes to the code.

Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
taking days to run on the most powerful instances available.
We trust it to
start and terminate these instances as needed, and to manage our spot instances,
getting us a much better price than if we were using on-demand instances
all the time.

## Future work

In the future, Plz is intended to:

- add support for named inputs and outputs, and function as a sort of "build
  system" in the cloud, particularly suitable for build pipelines,
- add support for visualisations, such as
  [Tensorboard](https://www.tensorflow.org/guide/summaries_and_tensorboard),
- manage epochs, so as to capture intermediate metrics and results, and
  terminate runs early,
- and whatever else sounds like fun.
  ([Please, tell us!](https://github.com/prodo-ai/plz/issues))

## Instructions for developers

### Installing dependencies

1. Run `pip install pipenv` to install [`pipenv`](https://docs.pipenv.org/).
2. Run `make environment` to create the virtual environments and install the
   dependencies.
3. Run `make check` to run the tests.

For more information, take a look at
[the `pipenv` documentation](https://docs.pipenv.org/).

### Using the CLI

See the CLI's
[_README.rst_](https://github.com/prodo-ai/plz/blob/master/cli/README.rst).

### Deploying a test environment

1. Clone this repository.
2. Install [direnv](https://direnv.net/).
3. Create a _.envrc_ file in the root of this repository:
   ```
   export SECRETS_DIR="${PWD}/secrets"
   ```
4. Create a configuration file named _secrets/config.json_ based on
   _example.config.json_.
5. Run `make deploy`.

### Deploying a production environment

Do just as above, but put your secrets directory somewhere else (for example, in
another repository, this one private and encrypted).