{"id":18725573,"url":"https://github.com/reljicd/ml-airflow","last_synced_at":"2025-04-12T16:12:40.574Z","repository":{"id":30616905,"uuid":"125552030","full_name":"reljicd/ml-airflow","owner":"reljicd","description":"Generalized project for running Airflow DAGs, with possibility of skipping tasks already done for some set of input parameters.","archived":false,"fork":false,"pushed_at":"2022-11-16T05:41:24.000Z","size":60,"stargazers_count":15,"open_issues_count":6,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-26T10:36:28.592Z","etag":null,"topics":["airflow","bash","docker","docker-compose","mysql","pytest","python","python3","rest","rest-api","script","sql","sqlalchemy","sqlite","sqlite3"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/reljicd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-16T17:59:36.000Z","updated_at":"2024-09-20T09:25:05.000Z","dependencies_parsed_at":"2023-01-14T17:19:52.129Z","dependency_job_id":null,"html_url":"https://github.com/reljicd/ml-airflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reljicd%2Fml-airflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reljicd%2Fml-airflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reljicd%2Fml-airflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/reljicd%2Fml-airflow/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/reljicd","download_url":"https://codeload.github.com/reljicd/ml-airflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248594140,"owners_count":21130313,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","bash","docker","docker-compose","mysql","pytest","python","python3","rest","rest-api","script","sql","sqlalchemy","sqlite","sqlite3"],"created_at":"2024-11-07T14:10:50.801Z","updated_at":"2025-04-12T16:12:40.548Z","avatar_url":"https://github.com/reljicd.png","language":"Python","readme":"# ML Airflow\n\n## About\n\nThis is generalized project for running Airflow DAGs, with possibility of skipping tasks already done for some set of input parameters.\n\nMetadata about DAGs (i.e. parameters set: parameter_1...)\nand tasks (datetime_started, datetime_finished, output_path...) \nis saved in **ml_airflow** DB schema. 
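The skip-already-done behaviour mentioned above can be sketched roughly as follows. This is a minimal illustrative sketch, not the project's actual code: the column names `datetime_started` and `datetime_finished` come from this README, while the in-memory `task_runs` dict and the `run_task` helper are hypothetical stand-ins for the real DB-backed logic.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for a row in a task's metadata table.
# In the real project this metadata lives in the ml_airflow DB schema.
task_runs = {}  # maps a parameter tuple -> {"datetime_started": ..., "datetime_finished": ...}

def run_task(params, bash_callable):
    """Skip the task if it already finished for these parameters, else run it."""
    row = task_runs.get(params)
    if row and row.get("datetime_finished") is not None:
        return "skipped"  # already done for this parameter set
    # Start the task: record datetime_started, call the Bash step, then finish.
    task_runs[params] = {"datetime_started": datetime.now(), "datetime_finished": None}
    bash_callable(params)  # stands in for the real Bash invocation
    task_runs[params]["datetime_finished"] = datetime.now()
    return "ran"
```

Running the same parameter set a second time then skips the task, while a new parameter set runs it from scratch.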
SQL files (DDL and a test data set) for MySQL and SQLite are in the ***sql*** folder.

Based on this metadata from the DB, tasks that have already been done (for a given set of input parameters) will be skipped.

When running a DAG (ml_training or ml_testing), the Python logic for each task first does some pre-processing (for example, writing metadata to the DB), checks whether the task is already done (based on datetime_finished in the task's table), and skips it if it is, or runs it if it is not. If the task is run (or rerun), the Python logic starts it by writing datetime_started, calls Bash with the proper parameters for that task, finishes it by writing datetime_finished, and does any additional post-processing if needed.

Customization of the Bash parameters for each task, as well as customization of the output paths, is explained later in this document.

Airflow is run by starting the Airflow web server and scheduler inside a Docker container.

![Screenshot 1](https://user-images.githubusercontent.com/7095876/37589803-920655c8-2b66-11e8-9377-ec3980d54df9.png)

![Screenshot 2](https://user-images.githubusercontent.com/7095876/37589750-7704e29e-2b66-11e8-8067-b7590068e851.png)

## Customization

Each task can have its own Python class, a subclass of **dags.subdags.base_subdag.MLTaskSubDag**:

* Common Task 1 - **dags.subdags.common_task_1_subdag.CommonTask1SubDag**

**dags.subdags.base_subdag.MLTaskSubDag** has one abstract method:

* **_parameters_provider** - overridden in each task subclass; it should provide the Bash parameters for that task

By customizing this method in each task subclass, it is possible to construct parameters programmatically using metadata from the DB or other computed information.

## Airflow configuration

The configuration file is found in **config/airflow.cfg**.

This file is copied to the Docker image during build, and is modified and copied to the **airflow_home** folder for a local installation.

### Configuration parameters

Configuration parameters are passed through environment variables:

* **DB_HOST** - DB host

* **DB_LOGIN** - DB login username

* **DB_PASSWORD** - DB password

* **DB_SCHEMA** - DB schema

* **DB_CONN_ID** - DB connection ID

## Prerequisites

\[Optional\] Install a virtual environment:

```bash
$ python -m virtualenv venv
```

\[Optional\] Activate the virtual environment:

On macOS and Linux:
```bash
$ source venv/bin/activate
```

On Windows:
```bash
$ .\venv\Scripts\activate
```

Install dependencies:
```bash
$ pip install -r requirements.txt
```

## How to run

### Docker

It is possible to run Airflow using Docker.

Build the Docker image:
```bash
$ docker build -t reljicd/ml-airflow -f docker/Dockerfile .
```

Run the Docker container:
```bash
$ docker run --rm -i -p 8080:8080 reljicd/ml-airflow
```

#### Helper script

It is possible to run all of the above with a helper script:

```bash
$ chmod +x scripts/run_docker.sh
$ scripts/run_docker.sh
```

### Docker Compose

The Docker Compose file **docker/docker-compose.yml** facilitates running both a properly initialized test MySQL DB and Airflow inside Docker containers.

#### Helper script

It is possible to run all of the above with a helper script:

```bash
$ chmod +x scripts/run_docker_compose.sh
$ scripts/run_docker_compose.sh
```

## Airflow CLI in Docker

Since the Docker container is named **"ml-airflow"** in the **run_docker.sh** script, any [Airflow CLI command](https://airflow.apache.org/cli.html) can be run inside the Docker container as:

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow [COMMAND]
```

For example:

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow list_dags
```

## Triggering DAG Runs

### CLI in Docker

Example:

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow trigger_dag -c CONF dag_id
```

Where CONF is a JSON string that gets pickled into the DagRun's conf attribute, e.g. '{"foo":"bar"}'.

#### ML Testing DAG

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow trigger_dag -c '{"parameter_1":"parameter_1","parameter_3":"parameter_3"}' ml_testing_dag
```

#### ML Training DAG

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow trigger_dag -c '{"parameter_1":"parameter_1","parameter_2":"parameter_2"}' ml_training_dag
```

### REST API

It is possible to trigger DAGs using the GET HTTP method and the proper URL (basically, pasting the URL into the browser's address bar and calling it like any other URL):

```
http://{AIRFLOW_HOST}:{AIRFLOW_PORT}/admin/rest_api/api?api=trigger_dag&dag_id=value&conf=value
```

Where conf is a JSON string that gets pickled into the DagRun's conf attribute, e.g. {"foo":"bar"}.

#### ML Testing DAG

```
http://{AIRFLOW_HOST}:{AIRFLOW_PORT}/admin/rest_api/api?api=trigger_dag&dag_id=ml_testing_dag&conf={"parameter_1":"parameter_1","parameter_3":"parameter_3"}
```

#### ML Training DAG

```
http://{AIRFLOW_HOST}:{AIRFLOW_PORT}/admin/rest_api/api?api=trigger_dag&dag_id=ml_training_dag&conf={"parameter_1":"parameter_1","parameter_2":"parameter_2"}
```

## Testing DAG Tasks

It is possible to test DAG task instances locally with the **airflow test** command. This command outputs the task's log to stdout (on screen), doesn't bother with dependencies, and doesn't communicate state (running, success, failed, …) to the database.
It simply allows testing a single task instance.

For example:

```bash
$ docker exec $(docker ps -aqf "name=ml-airflow") airflow test dag_id task_id 2015-06-01
```

## Airflow Web UI

After starting the Docker container, Airflow's Web UI is available at [http://localhost:8080/admin/](http://localhost:8080/admin/).

## Airflow REST API

It is possible to use all of Airflow's CLI commands through a REST API.

A [plugin](https://github.com/teamclairvoyant/airflow-rest-api-plugin) for Apache Airflow exposes REST endpoints for the Command Line Interfaces listed in the Airflow documentation:

http://airflow.incubator.apache.org/cli.html

The plugin also includes other custom REST APIs.

Once you deploy the plugin and restart the web server, you can start to use the REST API. Below you will see the endpoints that are supported. In addition, you can also interact with the REST API from the Airflow Web Server. When you reload the page, you will see a link under the Admin tab called "REST API". Clicking on the link will navigate you to the following URL:

```
http://{AIRFLOW_HOST}:{AIRFLOW_PORT}/admin/rest_api/
```

This web page shows the supported endpoints and provides a form for submitting test requests to them.

## Docker

The **docker** folder contains the Docker files:

* **docker/docker-compose.yml** - Docker Compose file. Instructions for running the **ml-airflow** Docker container as well as the **MySQL** Docker container, with proper mounting of the SQL files from the **sql** folder, so that MySQL is initialized with the ml_airflow schema properly.

* **docker/Dockerfile** - Docker build file for ml-airflow.

* **docker/initialize_airflow.sh** - Airflow initialization script (making the DB connection and unpausing ml_training_dag and ml_testing_dag). This script is run in **ml-airflow** during the start of the Airflow scheduler.

* **docker/supervisord.conf** - Supervisor's config file.

## Util Scripts

* **scripts/init_airflow.sh** - util script for initializing a local Airflow installation.

* **scripts/run_docker.sh** - util script for building the Docker image and running the Docker container with the proper env variables passed to it.

* **scripts/run_docker_compose.sh** - util script for running Docker Compose with export of the proper env variables.

## Tests

Tests can be run by executing the following command from the root of the project (the project's dependencies need to be installed, of course):

```bash
$ python -m pytest
```
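To make the Customization section above concrete, here is a hypothetical sketch of a task subclass overriding **_parameters_provider**. The names MLTaskSubDag and CommonTask1SubDag come from this README; the base class below is a simplified stand-in, not the project's actual **dags.subdags.base_subdag.MLTaskSubDag**, and the `bash_command` helper, `task.sh` script name, and `/data/...` output path are invented for illustration.

```python
from abc import ABC, abstractmethod

class MLTaskSubDag(ABC):
    """Simplified stand-in for dags.subdags.base_subdag.MLTaskSubDag."""

    @abstractmethod
    def _parameters_provider(self, dag_run_conf):
        """Return the list of Bash parameters for this task."""

    def bash_command(self, dag_run_conf):
        # A base class like this could build the Bash invocation from the
        # task-specific parameters; here we simply join them.
        return " ".join(["task.sh"] + self._parameters_provider(dag_run_conf))

class CommonTask1SubDag(MLTaskSubDag):
    def _parameters_provider(self, dag_run_conf):
        # Parameters can be constructed programmatically, e.g. from the
        # DagRun conf or from metadata stored in the ml_airflow schema.
        return ["--parameter_1", dag_run_conf["parameter_1"],
                "--output", f"/data/{dag_run_conf['parameter_1']}/out"]
```

Each task subclass then only has to describe how its own parameters are derived, while the shared start/skip/finish bookkeeping stays in the base class.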