{"id":20629657,"url":"https://github.com/agile-lab-dev/uss-transformer","last_synced_at":"2025-03-08T17:23:19.162Z","repository":{"id":262645128,"uuid":"887911342","full_name":"agile-lab-dev/uss-transformer","owner":"agile-lab-dev","description":null,"archived":false,"fork":false,"pushed_at":"2024-11-14T14:49:22.000Z","size":84,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-17T07:05:51.408Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agile-lab-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T13:52:03.000Z","updated_at":"2024-11-14T14:49:25.000Z","dependencies_parsed_at":"2024-11-13T14:49:13.972Z","dependency_job_id":"629ecc05-1649-4089-8490-d74c83e01343","html_url":"https://github.com/agile-lab-dev/uss-transformer","commit_stats":null,"previous_names":["agile-lab-dev/uss-transformer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fuss-transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fuss-transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fuss-transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fuss-transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/agile-lab-dev","download_url":"https://codeload.github.com/agile-lab-dev/uss-transformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242582984,"owners_count":20153360,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T14:05:34.843Z","updated_at":"2025-03-08T17:23:19.132Z","avatar_url":"https://github.com/agile-lab-dev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# USS Transformer\n\nThis automated tool, written in Python, runs on an architecture that combines MinIO as the storage \nlayer with Trino as the computational layer. Following the medallion architecture approach, the raw data is stored as \nParquet files in the bronze layer. The purpose of this tool is to extract the schema from the raw data, perform the USS \ntransformation, and store the transformed data in the silver layer.\n\n## Tool Operations\n\n### Setup (Optional)\n\nThis operation is optional and allows Trino to be set up from a schema stored in PostgreSQL. \n\nGiven the name of a specific schema, the tool executes a dump command to obtain the SQL statements to completely \nrecreate the schema. Afterward, the tool parses them to retrieve all the important information about the schema. Primary \nkeys and foreign keys are ignored because they are not required to create the schema in Trino. \n\nThe tool uses a feature of the DuckDB Python API to extract table data as Parquet files. These files are uploaded to MinIO via API. 
The \npaths indicate that these files belong to the bronze layer. An example path is \n\"s3://bronze/\\\u003cSchemaName\u003e/\\\u003cTableName\u003e/\". \n\nUsing SQLGlot, an SQL transpiler for Python, the tool converts the data types used in PostgreSQL to the ones \nused in Trino. Finally, the tool creates and runs SQL statements for Trino to recreate the schema tables and associate \nthem with the Parquet files.\n\n### Schema Extraction\n\nGiven the name of a specific schema to be transformed into USS, the tool connects to Trino via API and runs several \nqueries to retrieve all useful information about this schema. The first query retrieves all the table names of the \nspecified schema, using the following SQL statement.\n\n\u003e SHOW TABLES [ FROM schema_name ]\n\nFurther queries then discover the names and the data types of the columns in each table. The query shown \nbelow retrieves the SQL statement which creates the specified table.\n\n\u003e SHOW CREATE TABLE table_name\n \nThe general form of the CREATE TABLE statement returned by SHOW CREATE TABLE is shown below.\n\n\u003e CREATE [ OR REPLACE ] TABLE [ IF NOT EXISTS ]  \n\u003e table_name (  \n\u003e \u0026nbsp; \u0026nbsp; { column_name data_type [ NOT NULL ]  \n\u003e \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; [ COMMENT comment ]  \n\u003e \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; [ WITH ( property_name = expression [, ...] ) ]  \n\u003e \u0026nbsp; \u0026nbsp; | LIKE existing_table_name  \n\u003e \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; [ { INCLUDING | EXCLUDING } PROPERTIES ]  \n\u003e \u0026nbsp; \u0026nbsp; }  \n\u003e \u0026nbsp; \u0026nbsp; [, ...]  \n\u003e )  \n\u003e [ COMMENT table_comment ]  \n\u003e [ WITH ( property_name = expression [, ...] 
) ]\n\nThe tool parses the CREATE TABLE statements of all the tables in the schema and stores the important information, \nsuch as table names, column names and data types, in an organized way for the USS transformation.\n\n### Annotation Required\n\nSince Trino does not maintain primary keys and foreign keys, the tool requires a special file that declares the keys \nof the schema to be transformed. These annotations, written by users, must follow a \nsimplified SQL syntax in order to be read by the tool. \n\n\u003e PRIMARY KEY table_name(column_name [, ...]);\n\nThe annotation above declares which columns compose the primary key of the specified table. The annotation \nbelow declares which columns compose a foreign key of the specified table and the corresponding columns in the referenced \ntable.\n\n\u003e FOREIGN KEY table_name(column_name [, ...]) REFERENCES referenced_table(column_name [, ...]);\n\n### USS Transformation\n\nOnce the tool has stored all the information about the schema, such as table names, column names and data \ntypes, primary and foreign keys, the USS transformation can begin. \n\nFirst, the tool creates a list named \"links\" for each table in the schema. Given a specific \ntable, its list initially contains only the names of the tables referenced by its foreign keys. The tool then \niterates over this list to check whether the listed tables have foreign keys which refer to tables not yet \npresent in it. If so, the new referenced tables are added to the list and another iteration is \nexecuted; otherwise, no further tables are reachable through FKs and the list is complete. Consequently, if a table has no \nforeign keys, its list is empty. 
\n\nFor each table, a technical column named \"_Key_\\\u003cTableName\u003e\", used as the unique PK, is added to the data structure that holds the \ninformation about the schema. If the original primary key is a single column, the data type of the new column is the same \nas the original data type. If the table has no PK or has a multi-column one, the data type is defined as binary. These \nnew columns are also added as columns of the new table \"bridge\" along with the column \"stage\", whose data type is \ndefined as variable-length character string. Once the USS structure is defined, the schema tables can be created and \npopulated from the raw data using the SQL statement explained below.\n\n### CREATE TABLE AS SELECT (CTAS)\n\nThe SQL statement used in Trino to create a new table containing the result of a SELECT query is shown below.\n\n\u003e CREATE [ OR REPLACE ] TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]  \n\u003e [ COMMENT table_comment ]  \n\u003e [ WITH ( property_name = expression [, ...] ) ]  \n\u003e AS query  \n\u003e [ WITH [ NO ] DATA ]\n\nThe bridge table is created and populated from the data of the other tables, so the other tables are created \nfirst. How to create and populate a specific table using CTAS is described below.\n\nThe \"table_name\" in the CREATE TABLE clause of CTAS is replaced by \"silver_\\\u003cOriginalSchemaName\u003e.\n\\\u003cOriginalTableName\u003e\", to indicate that the table belongs to the silver layer. The parentheses contain only \ncolumn names, because their data types are taken directly from the columns listed in the SELECT \nquery in the AS clause.\n\nThe first WITH clause can be used to set properties of the new table. 
Since the tool works on Parquet files in the \nS3-compatible object storage, the property \"format\" can be set to the value \"parquet\" and the property \n\"external_location\" can be set to the path where the Parquet file, which contains the table data, is stored in the \nobject storage. The path must also indicate that the table belongs to the silver layer. An example path is\n\"s3://silver/\\\u003cSchemaName\u003e/\\\u003cTableName\u003e/\".\n\nThe AS clause is followed by a SELECT query, which retrieves all data from the original table belonging to the bronze \nlayer. To populate the technical column \"_Key_\\\u003cTableName\u003e\", the original PK is reselected if it consists of a single \ncolumn. If the original table has no PK, the Trino function \"uuid()\" is applied to associate a universally unique \nidentifier to each row.\n\nIf the original PK is multi-column, a slightly more complex step is required. Each column composing the \noriginal PK is cast to a variable-length character string using the Trino function \"cast()\". Then, \nthese columns are concatenated using the function \"array_join()\", and the result is converted to \nbinary data. Finally, the function \"sha256()\" is applied to obtain a hash value to be used as a technical PK.\n\nAfter executing the CTAS query for each table, the bridge can be created and populated. The columns listed in the CREATE \nTABLE clause are the column \"stage\" and the columns which refer to the technical PKs of each table.\n\nIn the AS clause, UNION operations are performed between several SELECT queries. Each of these queries retrieves data to \npopulate the stage associated with a specific table. These queries must select the columns in the same order.\n\nStarting from a specific table, left join operations are performed to connect all the tables contained in the list \n\"links\" of that table. 
At this point, all technical PKs of joined tables can be retrieved in addition to \nthe technical PK of the specific table. The columns which refer to the technical PKs of non-joined tables are instead \nset to null. The table name is inserted for each row in the column \"stage\". \n\nOnce the CTAS statement for the bridge table \nhas run, the USS transformation is complete and the schema in the silver layer is ready to be transformed into \ndata overviews for business analysis, to be stored in the gold layer.\n\n### Test\n\nThe test to verify the correctness of the USS transformation performed by the tool consists of checking that the bridge \ntable produced is the same as the one expected by the user. The test passes if the number of rows in the produced \nbridge table is equal to the number of rows in the user-supplied one, and if every row provided by the user is found in \nthe produced bridge table. The test is performed using SQL queries via Trino. Since UUID generation is not \ndeterministic, the primary key for tables without one is created through the Trino function \"row_number()\", which assigns \nthe row position based on an ordering of a specific column chosen by the tool. Sorting distributed data is not \nefficient, but this way the user can provide an expected result for PKs in the bridge \ntable, which is not possible using UUIDs.\n\n## Example of Tool Execution\n\nOpen a terminal in the parent folder of the project \"uss-transformer\" and execute the following instructions:\n\n1. Install python3, pip and the Python packages listed in \"requirements.txt\" \n\n\u003e sudo apt install python3 python3-pip  \n\u003e pip install -r requirements.txt\n\n2. Install PostgreSQL 14\n\n\u003e sudo apt install postgresql-14\n\n3. Open the PostgreSQL configuration file and set the port to 5433 instead of 5432 (at line 64)\n\n\u003e sudo gedit /etc/postgresql/14/main/postgresql.conf\n\n4. 
Restart the postgres service\n\n\u003e sudo service postgresql restart\n\n5. Set up the password equal to \"postgres\" for the user \"postgres\"\n\n\u003e sudo -u postgres psql\n\n6. Once \"postgres=#\" is visible at the beginning of the line, run the following command to create the password\n\n\u003e \\password postgres\n\nAfter entering the password twice, exit:\n\n\u003e \\q\n\n7. Run these commands to create two sample schemas on postgres\n\n\u003e sudo -u postgres psql postgres \u003c samples/loops_dump.sql -q  \n\u003e sudo -u postgres psql postgres \u003c samples/northwind_dump.sql -q\n\n8. Run the main\n\n\u003e python3 main.py\n\n9. Run the test\n\n\u003e python3 test_loops.py\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fuss-transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagile-lab-dev%2Fuss-transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fuss-transformer/lists"}