{"id":19913389,"url":"https://github.com/morphl-ai/morphl-orchestrator","last_synced_at":"2025-05-03T05:30:24.429Z","repository":{"id":81784510,"uuid":"141196453","full_name":"Morphl-AI/MorphL-Orchestrator","owner":"Morphl-AI","description":"Backbone for the MorphL-Community-Edition platform.","archived":false,"fork":false,"pushed_at":"2019-11-26T18:03:45.000Z","size":380,"stargazers_count":5,"open_issues_count":9,"forks_count":4,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-03-14T17:44:20.849Z","etag":null,"topics":["cassandra-database","hdfs","keras-tensorflow","kubernetes","machine-learning","morphl-platform","pipeline","pyspark","ubuntu1604"],"latest_commit_sha":null,"homepage":"https://morphl.io/","language":"Shell","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Morphl-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-07-16T21:27:23.000Z","updated_at":"2022-11-01T04:14:13.000Z","dependencies_parsed_at":null,"dependency_job_id":"fc05a129-8485-42da-99c3-2de950a027fa","html_url":"https://github.com/Morphl-AI/MorphL-Orchestrator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Morphl-AI%2FMorphL-Orchestrator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Morphl-AI%2FMorphL-Orchestrator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Morphl-AI%2FMorphL-Orchestrator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Morph
l-AI%2FMorphL-Orchestrator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Morphl-AI","download_url":"https://codeload.github.com/Morphl-AI/MorphL-Orchestrator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224352926,"owners_count":17297171,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra-database","hdfs","keras-tensorflow","kubernetes","machine-learning","morphl-platform","pipeline","pyspark","ubuntu1604"],"created_at":"2024-11-12T21:32:55.213Z","updated_at":"2024-11-12T21:32:55.731Z","avatar_url":"https://github.com/Morphl-AI.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MorphL Platform Orchestrator\n\nThe MorphL Orchestrator is the backbone of the MorphL platform. It sets up the infrastructure and software needed to run the platform. It consists of three pipelines:\n\n- **Ingestion Pipeline** - Runs a series of connectors that gather data from various APIs (Google Analytics, Mixpanel, Google Cloud Storage, etc.) and save it into Cassandra tables.\n\n- **Training Pipeline** - Consists of pre-processors (responsible for cleaning, formatting, deduplicating, normalizing and transforming data) and model training.\n\n- **Prediction Pipeline** - Generates predictions based on the trained model. It is triggered at the final step of the ingestion pipeline through a preflight check.\n\nThe pipelines are set up using [Apache Airflow](https://github.com/apache/incubator-airflow). 
Below you can see a diagram of the platform's architecture:\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/Morphl-AI/MorphL-Architecture/master/churn-prediction/Churn.Final.Architecture.png\" style=\"width:1000px; height: auto;\" /\u003e\n\u003c/div\u003e\n\n### Prerequisites\n\n#### 1) Virtual Instance\n\nThe orchestrator can be installed on a virtual instance on a cloud platform of your choice (Google Cloud Platform, Amazon Web Services, etc.).\n\nWe recommend using a clean Ubuntu 16.04 machine, minimum 2 vCPUs, 16GB of RAM, 50GB storage.\n\n#### 2) API subdomain\n\nModel predictions will be exposed through a secure API, for easy integration within a web or mobile app. The API needs an associated domain or subdomain name.\n\n##### A record\n\nIn your DNS zone, add an A record with your subdomain and external IP address of the orchestrator instance:\n\n`A api.yourdomain.com ???.???.???.???`\n\nwhere `???.???.???.???` is the IP address of the Ubuntu machine. 
You should be able to get this IP address from your cloud management interface or by running the following command on that machine:\n\n`dig +short myip.opendns.com @resolver1.opendns.com`\n\n- **Make sure you're using a static IP address that doesn't change when the instance is rebooted.**\n\n- **Also, allow both HTTP and HTTPS traffic to your VM**.\n\n##### Settings file\n\nAdd your subdomain name in a text file on your machine:\n\n```\ncat \u003e /opt/settings/apidomain.txt \u003c\u003c EOF\napi.yourdomain.com\nEOF\n```\n\nSSL certificates for the API subdomain will be automatically generated and renewed using [Let's Encrypt](https://letsencrypt.org/).\n\n## Quick Start Guide\n\n### Step 1) Installing the platform\n\nThis step is required for setting up the environment and downloading the required software on your instance.\n\nBootstrap the installation by running the following commands as root:\n\n```\nWHERE_THE_ORCHESTRATOR_IS='https://github.com/Morphl-AI/MorphL-Orchestrator'\nWHERE_AUTH_IS='https://github.com/Morphl-AI/MorphL-Auth-API.git'\nWHERE_GA_CHP_IS='https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users'\nWHERE_GA_CHP_BQ_IS='https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users-BigQuery'\nWHERE_USI_CSV_IS='https://github.com/Morphl-AI/MorphL-Model-User-Search-Intent'\n\napt update -qq \u0026\u0026 apt -y install git ca-certificates\n\ngit clone ${WHERE_THE_ORCHESTRATOR_IS} /opt/orchestrator\ngit clone ${WHERE_AUTH_IS} /opt/auth\ngit clone ${WHERE_GA_CHP_IS} /opt/ga_chp\ngit clone ${WHERE_GA_CHP_BQ_IS} /opt/ga_chp_bq\ngit clone ${WHERE_USI_CSV_IS} /opt/usi_csv\n\nbash /opt/orchestrator/bootstrap/runasroot/rootbootstrap.sh\n```\n\nThe installation process is fully automated and takes 25-35 minutes to complete. The `rootbootstrap.sh` script installs Docker, Docker Registry, Kubernetes, PostgreSQL and various utility libraries. 
A second script (`airflowbootstrap.sh`) then runs and installs Anaconda, Airflow, JDK, Cassandra, Spark and Hadoop.\n\nOnce the installation is done, check the bottom of the output to see whether the status `The installation has completed successfully.` has been reported.\n\nAt this point a few more setup steps are necessary.\n\n### Step 2) Provide connectors credentials\n\nThe next step is creating a series of files that store credentials for connecting to various data source APIs.\n\nFrom the root prompt, log into `airflow`:\n\n```\nsu - airflow\n```\n\n**Add credentials depending on your data source API**:\n\n- Churning users based on Google Analytics data (_GA_CHP_ model) - see docs [here](https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users#orchestrator-setup).\n\n- Churning users based on Google Analytics 360 with BigQuery integration (_GA_CHP_BQ_ model) - see docs [here](https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users-BigQuery/tree/master/bq_extractor#orchestrator-setup).\n\nLog out of `airflow` and back in again, and verify that your key file and view ID have been configured correctly:\n\n```\ncat /opt/secrets/keyfile.json\n\nenv | grep KEY_FILE_LOCATION\n```\n\nIf the output of `env | grep KEY_FILE_LOCATION` is empty, like this:\n\n```\nKEY_FILE_LOCATION=\n```\n\nit means you have forgotten to log out of `airflow` and back in again.\n\nUnless specified otherwise, all commands referred to below should be run as user `airflow`.\n\n### Step 3) Loading historical data\n\nTo train the models, you'll need to bring in historical data. If you don't have historical data, you can let the ingestion pipeline gather it. 
However, in most cases, you'll have data that was already gathered and can be immediately downloaded.\n\nRun the command:\n\n```\n# Load historical data for churning users with Google Analytics\nload_ga_chp_historical_data.sh\n\n# OR load historical data for churning users with Big Query\nload_ga_chp_bq_historical_data.sh\n```\n\nYou will be presented with a prompt that lets you select the time interval for loading the data:\n\n```\nHow much historical data should be loaded?\n\n1) 2018-08-04 - present time (5 days worth of data)\n2) 2018-07-30 - present time (10 days worth of data)\n3) 2018-07-10 - present time (30 days worth of data)\n4) 2018-06-10 - present time (60 days worth of data)\n5) 2018-04-11 - present time (120 days worth of data)\n6) 2018-02-10 - present time (180 days worth of data)\n7) 2017-11-12 - present time (270 days worth of data)\n8) 2017-08-09 - present time (365 days worth of data)\n\nSelect one of the numerical options 1 thru 8:\n```\n\nOnce you select an option, you should see an output like this:\n\n```\nEmptying the relevant Cassandra tables ...\n\nInitiating the data load ...\n\nThe data load has been initiated.\n```\n\nOpen [http://???.???.???.???:8181/admin/](http://???.???.???.???:8181/admin/) in a browser.  \n`???.???.???.???` is the Internet-facing IP address of the Ubuntu machine.  \nYou should be able to get this IP address from your cloud management interface or by running:\n\n```\ndig +short myip.opendns.com @resolver1.opendns.com\n```\n\nTo visualize the pipelines' status, logs, etc. 
you can log into Airflow's web UI.\n\nUse username `airflow` and the password found with:\n\n```\nenv | grep AIRFLOW_WEB_UI_PASSWORD\n```\n\nKeep refreshing the UI page until all the data for the number of days you specified previously has been loaded into Cassandra.\n\n### Step 4) Scheduling the remaining parts of the pipeline\n\nOnce all the raw data has been loaded, there is one more thing to do for the ML pipeline to be fully operational:\n\n```\n# Trigger pipeline for churning users with Google Analytics\nairflow trigger_dag ga_chp_training_pipeline\n\n# OR trigger pipeline for churning users with BigQuery\nairflow trigger_dag ga_chp_bq_training_pipeline\n```\n\nThe command above triggers the training pipeline. After running it, you should see output similar to this:\n\n```\n[...] {__init__.py:45} INFO - Using executor LocalExecutor\n[...] {models.py:189} INFO - Filling up the DagBag from /home/airflow/airflow/dags\n[...] {cli.py:203} INFO - Created \u003cDagRun ga_chp_training_pipeline, externally triggered: True\u003e\n```\n\nSince we have already loaded historical data (step 3), we can start running the pre-processors and training the models. If you do not manually trigger the training pipeline as described above, it will automatically start at its scheduled date (it runs on a weekly basis).\n\nThe step above only needs to be performed once, immediately following the installation.\n\nFrom this point forward, **the platform is on auto-pilot** and will regularly collect new data and generate fresh ML models automatically.\n\n### Using Predictions\n\nOnce a model has been trained, the prediction pipeline also needs to be triggered. 
You can wait until it is automatically triggered by the preflight check at the end of the ingestion pipeline (which runs daily) or you can trigger it yourself with the following command:\n\n```\n# Trigger pipeline for churning users with Google Analytics\nairflow trigger_dag ga_chp_prediction_pipeline\n\n# OR trigger pipeline for churning users with Big Query\nairflow trigger_dag ga_chp_bq_prediction_pipeline\n\n```\n\nAfter the pipeline is triggered, the API can be accessed using the following command:\n\n```\n# Authorize API\ncurl -s http://${AUTH_KUBERNETES_CLUSTER_IP_ADDRESS}\n\n# Churning users API\ncurl -s http://${GA_CHP_KUBERNETES_CLUSTER_IP_ADDRESS}/churning\n\n# Churning users with BigQuery API\ncurl -s http://${GA_CHP_BQ_KUBERNETES_CLUSTER_IP_ADDRESS}/churning-bq\n\n# User search intent with CSVs\ncurl -s http://${USI_CSV_KUBERNETES_CLUSTER_IP_ADDRESS}/search-intent\n```\n\nSee [GA_CHP Wiki](https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users/wiki/Public-API-Endpoints) or [GA_CHP_BQ wiki](https://github.com/Morphl-AI/MorphL-Model-Publishers-Churning-Users-BigQuery/wiki/Public-API-Endpoints) for examples on how to access predictions.\n\n### Troubleshooting\n\nShould you need the connection details for Cassandra, the user name is `morphl` and you can find the password with:\n\n```\nenv | grep MORPHL_CASSANDRA_PASSWORD\n```\n\n### (Optional) PySpark development\n\nSince running PySpark on your local machine can be challenging, we recommend using the MorphL Orchestrator.\n\nTo start developing PySpark applications, you need to run the Jupyter Notebook with a very specific configuration.  
\nA script that sets up that environment is provided:\n\n```\nrun_pyspark_notebook.sh\n```\n\nLook for these messages in the output:\n\n```\n[I 14:01:20.091 NotebookApp] The Jupyter Notebook is running at:\n[I 14:01:20.091 NotebookApp] http://???.???.???.???:8282/?token=2501b8f79e8f128a01e83a457311514e021f0e33c70690cb\n```\n\nWe recommend placing this snippet at the top of every PySpark notebook:\n\n```\nfrom os import getenv\n\nfrom pyspark.sql import SparkSession\n\nMASTER_URL = 'local[*]'\nAPPLICATION_NAME = 'preprocessor'\n\n# Connection details are read from the environment of the `airflow` user\nMORPHL_SERVER_IP_ADDRESS = getenv('MORPHL_SERVER_IP_ADDRESS')\nMORPHL_CASSANDRA_USERNAME = getenv('MORPHL_CASSANDRA_USERNAME')\nMORPHL_CASSANDRA_PASSWORD = getenv('MORPHL_CASSANDRA_PASSWORD')\nMORPHL_CASSANDRA_KEYSPACE = getenv('MORPHL_CASSANDRA_KEYSPACE')\n\n# Stop the existing `spark` session (expected to be pre-created by the notebook environment)\nspark.stop()\n\n# Build a new session configured to talk to Cassandra\nspark_session = (\n    SparkSession.builder\n                .appName(APPLICATION_NAME)\n                .master(MASTER_URL)\n                .config('spark.cassandra.connection.host', MORPHL_SERVER_IP_ADDRESS)\n                .config('spark.cassandra.auth.username', MORPHL_CASSANDRA_USERNAME)\n                .config('spark.cassandra.auth.password', MORPHL_CASSANDRA_PASSWORD)\n                .config('spark.sql.shuffle.partitions', 16)\n                .getOrCreate())\n\n# Only log errors, to keep the notebook output readable\nlog4j = spark_session.sparkContext._jvm.org.apache.log4j\nlog4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorphl-ai%2Fmorphl-orchestrator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmorphl-ai%2Fmorphl-orchestrator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorphl-ai%2Fmorphl-orchestrator/lists"}