{"id":26263708,"url":"https://github.com/fblettner/liberty-airflow-plugins","last_synced_at":"2025-03-14T01:19:23.245Z","repository":{"id":281335125,"uuid":"944963579","full_name":"fblettner/liberty-airflow-plugins","owner":"fblettner","description":null,"archived":false,"fork":false,"pushed_at":"2025-03-08T11:00:37.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-08T11:29:19.473Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fblettner.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"fblettner"}},"created_at":"2025-03-08T10:36:58.000Z","updated_at":"2025-03-08T11:00:33.000Z","dependencies_parsed_at":"2025-03-08T11:39:55.289Z","dependency_job_id":null,"html_url":"https://github.com/fblettner/liberty-airflow-plugins","commit_stats":null,"previous_names":["fblettner/liberty-airflow-plugins"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fblettner%2Fliberty-airflow-plugins","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fblettner%2Fliberty-airflow-plugins/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fblettner%2Fliberty-airflow-plugins/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fblettner%2Fliberty-airflow-plugins/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fblettner","download_url":"https://codeload.github.com/fblettner/liberty-airflow-plugins/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243505896,"owners_count":20301619,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-14T01:19:22.575Z","updated_at":"2025-03-14T01:19:23.238Z","avatar_url":"https://github.com/fblettner.png","language":"Python","funding_links":["https://github.com/sponsors/fblettner"],"categories":[],"sub_categories":[],"readme":"# liberty-plugins\nPlugins for Liberty Framework\n\n## ✅ DAGs\n\n### **Daily DAGs**\n- **`airflow-purge-daily-1`**: Purges old Airflow logs and metadata on a daily schedule (`@daily`).\n- **`database-backup-daily-1`**: Backs up databases every day at 01:00 AM (`00 1 * * *`).\n\n### **Weekly DAGs**\n- **`database-purge-weekly-1`**: Performs database cleanup and purging on a weekly schedule (`@weekly`).\n\n### **Unscheduled DAGs**\n- **`airflow-sync-1`**: Synchronizes repositories as needed (manually triggered).\n\n\n## ✅ Airflow Purge\n\n### **Overview**\nThese functions automate the purge of old Airflow DAG runs, task instances, jobs, and logs.\n\n\n### **Purge Functions**\n- **`dag_runs`**: Deletes old DAG run records from the Airflow database based on the retention period set in the `airflow_retention_days` variable.\n- **`task_instances`**: Removes outdated task instance records from the database, ensuring only recent executions are retained.\n- **`jobs`**: Cleans up old job entries by deleting records where the job's 
end date is beyond the retention threshold.\n- **`logs_in_db`**: Deletes historical log entries stored in the Airflow database to free up space and improve query performance.\n- **`logs_on_disk`**: Scans and removes log files from the Airflow logs directory that exceed the configured retention period, keeping disk usage in check.\n\n### **Configuration**\n- The retention period is controlled by the **Airflow Variable**: `airflow_retention_days`.\n- If the variable is not set, the default retention period is **30 days**.\n- Logs stored in the database and on disk are purged accordingly.\n\n### **Usage**\n- These functions should be executed periodically to prevent excessive log and metadata buildup.\n- They can be added to an **Airflow DAG** that runs on a **daily schedule (`@daily`)**.\n\n## ✅ Airflow PostgreSQL Backup\n\n### **Overview**\nThis function automates the backup of PostgreSQL databases using `pg_dump`, managed as an Airflow task.\n\n### **Function**\n- **`pg_dump(dag, database_name, conn_id=\"postgres_conn\")`**: Creates a BashOperator task to back up a PostgreSQL database.\n  - **`dag`**: The DAG to which this task belongs.\n  - **`database_name`**: The name of the database to be backed up.\n  - **`conn_id`**: The Airflow connection ID for PostgreSQL (default: `postgres_conn`).\n  - **Returns**: A BashOperator task that executes the backup.\n\n### **Configuration**\n- The **connection details** are fetched dynamically from Airflow using `get_connection`.\n- The **backup location** is inside `$AIRFLOW_HOME/tmp/`.\n- **Environment variables** are used to avoid storing credentials in plaintext.\n\n### **Usage**\n- This function should be used within an Airflow DAG.\n- It ensures database backups are automated and stored securely.\n- Can be scheduled to run at desired intervals using DAG scheduling.\n\n## ✅ Database Utils\n\n### **Overview**\nThis module provides utility functions for retrieving database connection details and column metadata for Oracle 
and PostgreSQL.\n\n### **Functions**\n- **`get_column_lengths(spark, table, schema, conn_source)`**: Retrieves column lengths based on the database type.\n- **`get_column_types(data_df, column_lengths, conn_source)`**: Generates a SQL column type definition string for Oracle or PostgreSQL.\n- **`get_connection(conn_id, schema=None)`**: Dynamically retrieves database connection details from Airflow based on the connection type.\n\n### **Configuration**\n- Supports both Oracle (`oracle.jdbc.driver.OracleDriver`) and PostgreSQL (`org.postgresql.Driver`).\n- Uses Airflow's `BaseHook` to dynamically retrieve connection details.\n- Calls appropriate helper functions based on the detected database type.\n\n### **Usage**\n- Can be integrated into Airflow DAGs for database connection handling.\n- Useful for schema extraction, table metadata analysis, and type mapping.\n- Ensures compatibility with both Oracle and PostgreSQL environments.\n\n## ✅ Postgres Utils\n\n### **Overview**\nThis module provides utility functions for working with Apache Spark and PostgreSQL within an Airflow environment.\n\n### **Functions**\n- **`create_spark_session()`**: Initializes and returns a Spark session with predefined configurations.\n- **`get_all_tables(spark, conn)`**: Retrieves all tables from the database, categorizing them based on foreign key dependencies.\n- **`get_foreign_key_dependencies(spark, conn)`**: Fetches foreign key relationships between tables.\n- **`topological_sort(dependencies)`**: Performs a topological sort on a given dependency graph.\n- **`get_primary_key_for_table(spark, table_name, conn)`**: Retrieves the primary key columns for a specified table.\n- **`delete_existing_rows(target_conn, primary_keys, table, rows_to_update)`**: Deletes rows from the target database that need to be updated.\n- **`merge_all_tables(conn_source, conn_target, source_schema, target_schema)`**: Manages table synchronization by processing tables without foreign keys first and then handling 
dependent tables.\n- **`merge_single_table(spark, table, source_conn, target_conn)`**: Handles the data merging process for a single table, identifying rows to insert or update in the target database.\n\n### **Configuration**\n- Spark session is configured with JDBC drivers for PostgreSQL and Oracle.\n- PostgreSQL connections are retrieved dynamically using `get_connection`.\n- Foreign key dependencies are processed using topological sorting.\n\n### **Usage**\n- Used for efficient data synchronization and migration between databases.\n- Can be integrated into Airflow DAGs for automated execution.\n- Supports large datasets by leveraging Spark’s distributed processing capabilities.\n\n## ✅ Data Transfer Utils\n\n### **Overview**\nThis module provides utility functions for reading, processing, and writing data between databases using Apache Spark.\n\n### **Functions**\n- **`read_data_from_db(spark, table, source_conn, source_schema)`**: Reads data from a source database using JDBC.\n- **`lowercase_columns(df)`**: Converts all column names in a DataFrame to lowercase.\n- **`write_data_to_db(spark, data_df, table, source_conn, source_schema, target_conn, target_schema)`**: Writes data to a target database, ensuring proper column types and formatting.\n- **`create_spark_session()`**: Initializes a Spark session with necessary configurations.\n- **`copy_table(conn_source, conn_target, table_name, source_schema, target_schema)`**: Copies a table from a source schema to a target schema, handling data transformation and loading.\n\n### **Configuration**\n- Uses JDBC for database interactions.\n- Handles column data type conversion to ensure compatibility.\n- Utilizes `get_column_lengths`, `get_column_types`, and `get_connection` for metadata extraction and connection handling.\n\n### **Usage**\n- Facilitates ETL operations between databases.\n- Ensures clean and structured data processing.\n- Can be integrated into Airflow DAGs for automated data migration 
workflows.\n\n\n## ✅ Git Backup Utils\n\n### **Overview**\nThis module provides utility functions for managing backups in Git, including pulling, pushing, and purging old backups.\n\n### **Functions**\n- **`pull_repository(local_path, repo_name, conn_id=\"git_conn\")`**: Pulls the latest changes from a Git repository.\n- **`push_backup(local_path, repo_name, databases, conn_id=\"git_conn\")`**: Pushes database backups to a Git repository.\n- **`purge_old_backups(local_path, repo_name, conn_id=\"git_conn\")`**: Deletes backups older than the configured retention period and commits the changes.\n\n### **Configuration**\n- Uses Airflow’s `BaseHook` to retrieve Git connection details dynamically.\n- Retrieves the backup retention period from the `backup_retention_days` Airflow variable (default: 30 days).\n- Automatically configures Git user details for committing changes.\n\n### **Usage**\n- Automates backup management by storing database dumps in Git.\n- Ensures that outdated backups are removed efficiently.\n- Can be scheduled within an Airflow DAG for periodic execution.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffblettner%2Fliberty-airflow-plugins","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffblettner%2Fliberty-airflow-plugins","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffblettner%2Fliberty-airflow-plugins/lists"}