{"id":16598217,"url":"https://github.com/benhuds/db2w-ml-examples","last_synced_at":"2026-05-28T03:31:47.074Z","repository":{"id":80419073,"uuid":"170748178","full_name":"benhuds/db2w-ml-examples","owner":"benhuds","description":"Tutorial: in-database machine learning with Db2 Warehouse on Cloud","archived":false,"fork":false,"pushed_at":"2020-01-09T17:17:09.000Z","size":258,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-12-25T14:56:29.475Z","etag":null,"topics":["data-warehouse","db2","machine-learning","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/benhuds.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-14T19:44:27.000Z","updated_at":"2020-01-09T17:17:11.000Z","dependencies_parsed_at":"2023-06-07T17:56:14.366Z","dependency_job_id":null,"html_url":"https://github.com/benhuds/db2w-ml-examples","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/benhuds/db2w-ml-examples","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benhuds%2Fdb2w-ml-examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benhuds%2Fdb2w-ml-examples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benhuds%2Fdb2w-ml-examples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benhuds%2Fdb2w-ml-examples/manifests","owner_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/owners/benhuds","download_url":"https://codeload.github.com/benhuds/db2w-ml-examples/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benhuds%2Fdb2w-ml-examples/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33593400,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-warehouse","db2","machine-learning","sql"],"created_at":"2024-10-12T00:08:03.376Z","updated_at":"2026-05-28T03:31:47.027Z","avatar_url":"https://github.com/benhuds.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# In-database machine learning with Db2 Warehouse on Cloud\n\n**[IBM Db2 Warehouse on Cloud](https://www.ibm.com/cloud/db2-warehouse-on-cloud)\nis IBM's elastic cloud data warehouse service**.  This is a simple tutorial that\nwalks through how to use Db2 Warehouse on Cloud's built-in machine learning\nfunctions to train and run models on data residing in Db2 Warehouse on Cloud.\nOperations are performed by the Db2 Warehouse engine itself - no data movement\nrequired.\n\n### What this tutorial covers\n\n1. How to provision an IBM Db2 Warehouse on Cloud system\n2. How to load some data into your Db2 Warehouse on Cloud system\n3. 
How to train and run a simple ML model on data residing in a Db2 Warehouse on\nCloud system\n\n### Prerequisites\n\nFor this tutorial, you'll need a Db2 Warehouse on Cloud system:\n \n1. Register for an IBM Cloud account\n[here](https://cloud.ibm.com/registration?target=%2Fcatalog%2Fservices%2Fdb2-warehouse).\nYou'll get $200 in IBM Cloud credit, which you can use towards Db2 Warehouse on\nCloud.\n2. Navigate to the [Db2 Warehouse on Cloud\npage](https://cloud.ibm.com/catalog/services/db2-warehouse).  Select the  _Flex\nOne_ plan, and click _Create_ to create your Flex One system.\n\nFlex One is a paid Db2 Warehouse on Cloud plan, but the amount of IBM Cloud\ncredits incurred during this tutorial is minimal.\n\n### Loading data\n\nIn this repository, you'll find two CSV files: `Aircraft.csv` and\n`BOS-SFO-12months.csv`.  They contain manufacturing information (e.g. make,\nmodel) on commercial aircraft and flight on-time performance respectively.  For\nthe purposes of this tutorial, the dataset focuses on direct flights from Boston\nto San Francisco.\n\nYou can easily drag and drop these files from your desktop and into Db2\nWarehouse on Cloud through the \"Load\" page on the Db2 Warehouse on Cloud\nconsole.  A quick video on how to do this can be found\n[here](https://www.youtube.com/watch?v=AwiHZNaGkoA).  Load each file into\na separate table.\n\n### Simple analysis\n\nOnce you've loaded the data into Db2 Warehouse on Cloud, you can explore it\nthrough the \"Explore\" page on the Db2 Warehouse on Cloud console, or use the\n\"Run SQL\" page to query your data directly.  
Here's a simple query you can run\nto calculate average arrival delay for each carrier in the past 12 months:\n\n```sql\nSELECT \"OP_CARRIER\", AVG(\"ARR_DELAY\") AS \"AVERAGE ARRIVAL DELAY\"\nFROM FLIGHTDATA\nGROUP BY \"OP_CARRIER\"\nORDER BY \"AVERAGE ARRIVAL DELAY\" ASC;\n```\n\n`FLIGHTDATA` is what I've named my table containing on-time performance\ninformation, but you can replace that with whatever name you go with for your\nown database.\n\n### One step further: machine learning\n\nThis information can be easily visualized through any standard BI/dashboarding\ntool, but let's take this a step further.  Let's use this information to try and\npredict how late a flight will be in future.  When booking flights, I'd like to\nmake sure my flight arrives at my destination on time so I don't miss any\nconnections or meetings on the other end, so a tool to help me predict delays\nwill definitely help me make smarter flight choices.\n\nThe problem we want to solve?  Simply put, given a choice of carrier, month/date/time\nof departure, and manufacturer, I'd like to know if my flight will have a\ndelayed arrival.  We can use a [linear\nregression](https://en.wikipedia.org/wiki/Linear_regression) model to help us do\nthis.\n\nFirst, let's enrich our on-time performance data so it contains the aircraft\nmanufacturer, and store it as a view.  As with our on-time data, I've named my\ntable containing manufacturing data `TAILNUMS` because it uses aircraft tail\nnumbers as the primary key.  
Feel free to replace that name with whatever name\nyou decide to go with.\n\n```sql\nCREATE VIEW FLIGHTDATA_VIEW\n(ID, OP_CARRIER, TAIL_NUM, MANUFACTURER, MODEL,\nORIGIN, ORIGIN_CITY_NAME, DEST, DEST_CITY_NAME,\nMONTH, DAY_OF_WEEK, DEP_TIME_BLK, ARR_DELAY)\nAS\n(SELECT\nFD.ID, FD.OP_CARRIER, FD.TAIL_NUM, TD.MANUFACTURER, TD.MODEL,\nFD.ORIGIN, FD.ORIGIN_CITY_NAME, FD.DEST, FD.DEST_CITY_NAME,\nFD.MONTH, FD.DAY_OF_WEEK, FD.DEP_TIME_BLK, FD.ARR_DELAY\nFROM\nFLIGHTDATA AS FD\nINNER JOIN\nTAILNUMS AS TD\nON\nFD.TAIL_NUM = TD.TAIL_NUMBER);\n```\n\nAfter all, manufacturer may be a factor in lateness, but it's up to our ML model\nto determine how much impact it really has.\n\nNow that we've enriched our data, let's split our overall dataset into training\nand test datasets.  We'll use the training data to train our ML model, then\nevaluate its performance/accuracy later by running the ML model on our test\ndata.\n\n```sql\nCALL IDAX.SPLIT_DATA('intable=FLIGHTDATA_VIEW,\ntraintable=FLIGHTDATA_TRAIN,\ntesttable=FLIGHTDATA_TEST,\nid=ID,\nfraction=0.35');\n```\n\nThe training and test data will be stored in the `FLIGHTDATA_TRAIN` and `FLIGHTDATA_TEST`\ntables respectively, and this function will use the `ID` column to randomly\nsplit the data for each.\n\nNow, let's train our linear regression model using the training data in the\n`FLIGHTDATA_TRAIN` table, and call the resulting model `FLIGHTDATA_MODEL`.  We would like to\npredict arrival delay (`ARR_DELAY`), given carrier, manufacturer, model, day of\nweek, month, and scheduled departure time (maybe time of day affects delays\ntoo, who knows).\n\n```sql\nCALL IDAX.LINEAR_REGRESSION\n('model=FLIGHTDATA_MODEL,\nintable=FLIGHTDATA_TRAIN,\nid=ID,\ntarget=ARR_DELAY,\nincolumn=OP_CARRIER; MANUFACTURER; MODEL; DAY_OF_WEEK; MONTH; DEP_TIME_BLK,\ncoldefrole=ignore, calculatediagnostics=false');\n```\n\nThat was easy! 
No new frameworks to learn, no need to move the data to an\nexternal execution engine - just one stored procedure to run.  Tell the database\nwhat you're trying to predict, and the variables you think affect its value.\n\nIt's even easier to test your model against the test data:\n\n```sql\nCALL IDAX.PREDICT_LINEAR_REGRESSION\n('model=FLIGHTDATA_MODEL,\nintable=FLIGHTDATA_TEST,\nouttable=FLIGHTDATA_MODEL_OUT,\nid=ID');\n```\n\nEssentially: \"use the `FLIGHTDATA_MODEL` model to predict arrival delays in\n`FLIGHTDATA_TEST` and store the output in `FLIGHTDATA_MODEL_OUT`.\"\n\nOnce you've run the model on your test data, let's compare the actual delays to\nthe delays we predicted.  Keep in mind that for this data, a negative delay means\nthe flight landed early.\n\n```sql\nSELECT IN.ID,\nIN.ARR_DELAY AS SOURCE_DELAY, OUT.ARR_DELAY AS SOURCE_PREDICT\nFROM FLIGHTDATA_TEST AS IN, FLIGHTDATA_MODEL_OUT AS OUT WHERE IN.ID=OUT.ID;\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenhuds%2Fdb2w-ml-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenhuds%2Fdb2w-ml-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenhuds%2Fdb2w-ml-examples/lists"}