{"id":18400702,"url":"https://github.com/databricks/tpch-dbgen","last_synced_at":"2025-04-07T06:33:43.783Z","repository":{"id":66099413,"uuid":"108048554","full_name":"databricks/tpch-dbgen","owner":"databricks","description":"Patched version of dbgen","archived":false,"fork":false,"pushed_at":"2024-02-25T15:28:16.000Z","size":384,"stargazers_count":28,"open_issues_count":5,"forks_count":29,"subscribers_count":299,"default_branch":"master","last_synced_at":"2025-04-03T00:59:00.665Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README","changelog":"HISTORY","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-23T22:44:49.000Z","updated_at":"2024-11-19T16:24:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"62d8a403-a797-4bd1-a23b-fc453b7d0837","html_url":"https://github.com/databricks/tpch-dbgen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftpch-dbgen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftpch-dbgen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftpch-dbgen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Ftpch-dbgen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/tpch-dbgen/tar.gz/refs/
heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247607782,"owners_count":20965945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T02:36:12.670Z","updated_at":"2025-04-07T06:33:43.767Z","avatar_url":"https://github.com/databricks.png","language":"C","readme":"# @(#)README\t2.4.0\n\nTable of Contents\n===================\n 0. What is this document?\n 1. What is DBGEN?\n 2. What will DBGEN create?\n 3. How is DBGEN built?\n 4. Command Line Options for DBGEN\n 5. Building Large Data Sets with DBGEN\n 6. DBGEN limitations and compliant usage\n 7. Sample DBGEN executions\n 8. What is QGEN?\n 9. What will QGEN create?\n10. How is QGEN built?\n11. Command Line Options for QGEN\n12. Query Template Syntax\n13. Sample QGEN executions and Query Templates\n14. Environment variable\n15. Version Numbering in DBGEN and QGEN\n16. Validated Platforms\n\n0. What is this document?\n\nThis is the general README file for DBGEN and QGEN, the data-\nbase population and executable query text generation programs \nused in the TPC-H benchmark. It covers the proper use \nof DBGEN and QGEN. For information on porting the utility to your \nparticular platform see Porting.Notes.\n\n1. What is DBGEN?\n\nDBGEN is a database population program for use with the TPC-H benchmark.  \nIt is written in ANSI 'C' for portability, and has \nbeen successfully ported to over a dozen different systems. 
While the \nTPC-H specification allows an implementor to use any utility \nto populate the benchmark database, the resultant population must exactly \nmatch the output of DBGEN. The source code has been provided to make the \nprocess of building a compliant database population as simple as possible.\n\n2. What will DBGEN create?\n\nWithout any command line options, DBGEN will generate 8 separate ASCII\nfiles. Each file will contain pipe-delimited load data for one of the\ntables defined in the TPC-H database schema. The default tables \nwill contain the load data required for a scale factor 1 database. By \ndefault each file will be created in the current directory and be \nnamed \u003ctable\u003e.tbl. As an example, customer.tbl will contain the \nload data for the customer table.\n\nWhen invoked with the '-U' flag, DBGEN will create the data sets to be \nused in the update functions and the SQL syntax required to delete the \ndata sets. The update files will be created in the same directory as \nthe load data files and will be named \"u_\u003ctable\u003e.set\". The delete \nsyntax will be written to \"delete.set\". For instance, the data set to \nbe used in the third query set to update the lineitem table will be \nnamed \"u_lineitem.tbl.3\", and the SQL to remove those rows will be \nfound in \"delete.3\". The size of the update files can be controlled \nwith the '-r' flag.\n\n3. How is DBGEN built?\n\nCreate an appropriate makefile, using makefile.suite as a basis, \nand type make.  Refer to Porting.Notes for more details and for \nsuggested compile time options.\n\n4. Command Line Options for DBGEN\n\nDBGEN's output is controlled by a combination of command line options\nand environment variables. Command line options are assumed to be single\nletter flags preceded by a minus sign. 
They may be followed by an\noptional argument.\n\noption  argument    default     action\n------  --------    -------     ------\n-h                              Display a usage summary\n\n-f      none                    Force. Existing data files will be\n                                overwritten.\n\n-F      none        yes         Flat file output.\n\n-D      none                    Direct database load. ld_XXXX() routines\n                                must be defined in load_stub.c\n\n-s      \u003cscale\u003e     1           Scale of the database population. Scale\n                                1.0 represents ~1 GB of data\n\n-T      \u003ctable\u003e                 Generate the data for a particular table\n                                ONLY. Arguments: p -- part/partsupp, \n                                c -- customer, s -- supplier, \n                                o -- orders/lineitem, n -- nation, r -- region,\n                                l -- code (same as n and r),\n                                O -- orders, L -- lineitem, P -- part, \n                                S -- partsupp\n\n-O      d                       Generate SQL for delete function \n                                instead of key ranges\n\n-O      f                       Allow over-ride of default output file \n                                names\n\n-O      h                       Generate headers in flat ASCII files.\n                                hd_XXX routines must be defined in \n                                load_stub.c\n\n-O      m                       Flat files generate fixed length records\n\n-O      r                       Generate key ranges for the UF2 update \n                                function\n\n-O      v                       Verify data set without generating it.\n\n-r      \u003cpercentage\u003e     10     Scale each update file to the given \n                                percentage (expressed in basis points)\n                                of the data set\n\n-v      none                    Verbose. Progress messages are \n                                displayed as data is generated.\n\n-n      \u003cname\u003e                  Use database \u003cname\u003e for in-line load\n\n-C      \u003cchildren\u003e              Use \u003cchildren\u003e separate processes to \n                                generate data\n\n-S      \u003cn\u003e                     Generate the \u003cn\u003eth part of a multi-part load\n                                or update set\n\n-U      \u003cupdates\u003e               Create a specified number of data sets\n                                in flat files for the update/delete \n                                functions\n\n-i      \u003cn\u003e                     Split the inserted rows in a refresh pair \n                                between \u003cn\u003e files\n\n-d      \u003cn\u003e                     Split the deleted rows in a refresh pair\n                                between \u003cn\u003e files\n\n5. DBGEN limitations and compliant usage\n\nDBGEN is meant to be a robust population generator for use with the \nTPC-H benchmark. It is hoped that DBGEN will make it easier \nto experiment with and become proficient in the execution of TPC decision \nsupport benchmarks.  As a result, it includes a number of command line \noptions which are not, strictly speaking, necessary to generate a compliant \ndata set for a TPC-D run. In addition, some command line options will accept \narguments which result in the generation of NON-COMPLIANT data sets. Options \nwhich should be used with care include:\n\n-s -- scale factor. TPC-H runs are only compliant when run against SF's \n      of 1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000\n-r -- refresh percentage. TPC-H runs are only compliant when run with \n      -r 10, the default.\n\n6. 
Sample DBGEN executions\n\nDBGEN has been built to allow as much flexibility as possible, but is\nfundamentally intended to generate two things: a database population \nagainst which the queries in TPC-H can be run, and the updates \nthat are used during the update functions in TPC-H. Here are \nsome sample uses of DBGEN.\n\n  1. To generate the database population for the qualification database:\n\tdbgen -s 1\n  2. To generate the lineitem table only, for a scale factor 10 database,\n     and over-write any existing flat files:\n\tdbgen -s 10 -f -T L\n  3. To generate a 100GB data set in 1GB pieces, generate only the part and \n     partsupp tables, and include some progress reports along the way:\n\tdbgen -s 100 -S 1 -C 100 -T p -v (to generate the first 1GB file)\n\tdbgen -s 100 -S 2 -C 100 -T p -v (to generate the second 1GB file)\n        (and so on, incrementing the argument to -S each time)\n  4. To generate the update files needed for a 4 stream run of the throughput\n     test at 100 GB, using an existing set of seed files from an 8 process \n     load:\n\tdbgen -s 100 -U 4 -C 8\n     \n\n7. What is QGEN?\n\nQGEN is a query generation program for use with the TPC-H benchmark.\nIt is written in ANSI 'C' for portability, and has been successfully\nported to over a dozen different systems. While the benchmark specifications\nallow an implementor to use any utility to create the benchmark query\nsets, QGEN has been provided to make the process of building\na benchmark implementation as simple as possible.\n\n8. What will QGEN create?\n\nQGEN is a filter, triggered by :'s. It does line-at-a-time reads of its\ninput (more on that later), scanning for :foo, where foo determines the\nsubstitution that occurs. 
The recognized tags include:\n\n:\u003cint\u003e          replace with the appropriate value for parameter \u003cint\u003e\n:b              replace with START_TRAN (from tpcd.h)\n:c              replace with SET_DBASE (from tpcd.h)\n:n\u003cint\u003e         replace with SET_ROWCOUNT(\u003cint\u003e) (from tpcd.h)\n:o              replace with SET_OUTPUT (from tpcd.h)\n:q              replace with query number\n:s              replace with stream number\n:x              replace with GEN_QUERY_PLAN (from tpcd.h)\n\nQGEN takes an assortment of command line options, controlling which of these\nsubstitutions should be active during the translation from template to EQT, and a\nlist of query \"names\". It then translates the template found in\n$DSS_QUERY/\u003cname\u003e.sql and writes the result to stdout.\n\nHere is a sample query template:\n\n{  Sccsid:     @(#)1.sql        9.1.1.1     1/25/95  10:51:56  }\n:n 0\n:o\nselect\n l_returnflag,\n l_linestatus,\n sum(l_quantity) as sum_qty,\n sum(l_extendedprice) as sum_base_price,\n sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n avg(l_quantity) as avg_qty,\n avg(l_extendedprice) as avg_price,\n avg(l_discount) as avg_disc,\n count(*) as count_order\nfrom lineitem\nwhere l_shipdate \u003c= date '1998-12-01' - interval :1 day\ngroup by l_returnflag, l_linestatus\norder by l_returnflag, l_linestatus;\n\nAnd here is what is generated:\n$ qgen -d 1\n\n{return 0 rows}\n\nselect\n l_returnflag,\n l_linestatus,\n sum(l_quantity) as sum_qty,\n sum(l_extendedprice) as sum_base_price,\n sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n avg(l_quantity) as avg_qty,\n avg(l_extendedprice) as avg_price,\n avg(l_discount) as avg_disc,\n count(*) as count_order\nfrom lineitem\nwhere l_shipdate \u003c= date('1998-12-01') - interval (90)  day to day\ngroup by l_returnflag, l_linestatus\norder by 
l_returnflag, l_linestatus;\n\nSee \"Query Template Syntax\" below for more detail on converting your preferred query\nphrasing for use with QGEN.\n\n9. How is QGEN built?\n\nQGEN is built by the same makefile that creates DBGEN. If the makefile\nis successfully creating DBGEN, no further compilation modifications\nshould be necessary. You may need to modify some of the options which\nallow QGEN to integrate with your preferred query tool. Refer to\nPorting.Notes for more detail.\n\n10. Command Line Options for QGEN\n\nLike DBGEN, QGEN is controlled by a combination of command line options\nand environment variables (see \"Environment Variables\" below for more\ndetail).  Command line options are assumed to be single\nletter flags preceded by a minus sign. They may be followed by an\noptional argument.\n\noption  argument    default     action\n------  --------    -------     ------\n-c      none                    Retain comments in translation of template to\n                                EQT\n\n-d      none                    Default. Use the parameter substitutions\n                                required for query validation\n\n-h                              Display a usage summary\n\n-i      \u003cfile\u003e                  Use contents of \u003cfile\u003e to init a query stream\n\n-l      \u003cfile\u003e                  Save query parameters to \u003cfile\u003e\n\n-n      \u003cname\u003e                  Use database \u003cname\u003e for queries\n\n-N                              Always use default rowcount, and ignore :n directives\n\n-o      \u003cpath\u003e                  Save query n's output in \u003cpath\u003e/n.\u003cstream\u003e\n                                Uses -p option, and uses :o tag\n\n-p      \u003cstream\u003e                Use the query permutation defined for\n                                stream \u003cstream\u003e. 
If this option is\n                                omitted, EQT will be generated for the\n                                queries named on the command line.\n\n-r      \u003cn\u003e                     Seed the random number generator with \u003cn\u003e\n\n-s      \u003cn\u003e                     Set scale to \u003cn\u003e for parameter \n                                substitutions.\n\n-t      \u003cfile\u003e                  Use contents of \u003cfile\u003e to complete a query \n                                stream\n\n-T      none                    Use time table format for date substitution\n\n-v      none                    Verbose. Progress messages are \n                                displayed as data is generated.\n\n-x      none                    Generate a query plan as part of query\n                                execution.\n\n11. Query Template Syntax\n\nQGEN is a simple ASCII text filter, meant to translate generalized\nquery syntax (\"query template\") into the executable query text (EQT)\nrequired by the benchmarks. It provides a number of shorthands and syntactic \nextensions that allow the automatic generation of query parameters and some \ncontrol over the operation of the benchmark implementation.\n\nQGEN first strips all comments from the query template, recognizing both\n{comment} and --comment styles. Next it traverses the query template\none line at a time, locating required substitution points, called\nparameter tags. The values substituted for a given tag are summarized\nbelow.  QGEN does not support nested substitutions. 
That is, if\nthe text substituted for a tag itself contains a valid tag, the second tag\nwill not be expanded.\n\nTag             Converted To            Based on\n===             ============            ========\n:c              database \u003cdbname\u003e;(1)   -n from the command line\n:x              set explain on;(1)      -x from the command line\n:\u003cnumber\u003e       parameter \u003cnumber\u003e\n:s              stream number\n:o              output to outpath/qnum.stream;(1)\n                                        -o from command line, -s from \n                                        command line\n:b              BEGIN WORK;(1)          -a from command line\n:e              COMMIT WORK(1)          -a from command line\n:q              query number\n:n \u003cnumber\u003e                             sets rowcount to be returned \n                                        to \u003cnumber\u003e, unless -N appears on the command line\n\nNotes:\n   (1)  This is Informix-specific syntax. Refer to Porting.Notes for\n   tailoring the generated text to your database environment.\n   \n12. Sample QGEN executions and Query Templates\n\nQGEN translates generic query templates into valid SQL. In addition, it \nallows conditional inclusion of the commands necessary to connect to a \ndatabase, produce diagnostic output, etc. Here are some samples of QGEN\nusage, and the way that command line parameters and the query templates \ninteract to produce valid SQL.\n\n  Template, in $DSS_QUERY/1.sql:\n            :c\n            :o\n            select count(*) from foo;\n            :x\n            select count(*) from lineitem\n              where l_orderdate \u003c ':1';\n\n  1. \"qgen 1\", would produce:\n      select count(*) from foo;\n      select count(*) from lineitem \n        where l_orderdate \u003c '1997-01-01'; \n   Assuming that 1 January 1997 was a valid substitution for parameter 1.\n\n  2. 
\"qgen -d -c dss1 1\", would produce:\n      database dss1;\n      select count(*) from foo;\n      select count(*) from lineitem \n        where l_orderdate \u003c '1995-07-18'; \n   Assuming that 18 July 1995 was the default substitution for parameter 1,\n    and using Informix syntax.\n\n  3. \"qgen -d -c dss1 -x -o somepath 1\", would produce:\n      database dss1;\n      output to \"somepath/1.0\"\n      select count(*) from foo;\n      set explain on;\n      select count(*) from lineitem \n        where l_orderdate \u003c '1995-07-18'; \n   Assuming that 18 July 1995 was the default substitution for parameter 1,\n    and using Informix syntax.\n \n\n13. Environment Variables\n\nEnvironment variables are used to control features of DBGEN and QGEN \nwhich are unlikely to change from one execution to another.\n\nVariable    Default     Action\n-------     -------     ------\nDSS_PATH    .           Directory in which to build flat files\nDSS_CONFIG  .           Directory in which to find configuration files\nDSS_DIST    dists.dss   Name of distribution definition file\nDSS_QUERY   .           Directory in which to find query templates\n\n14. Version Numbering in DBGEN and QGEN\n\nDBGEN and QGEN use a common version numbering algorithm. Each executable\nis stamped with a version number which is displayed in the usage messages\navailable with the '-h' option. 
A version number is of the form:\n\n   V.R.P.M\n   | | | |\n   | | | |\n   | | | |\n   | | |  -- modification: alphabetic, incremented for any trivial changes \n   | | |                   to the source (e.g, porting ifdef's)\n   | |  ---- patch level:  numeric, incremented for any minor bug fix\n   | |                     (e.g, qgen parameter range)\n   | ------- release:      numeric, incremented for each minor revision of the\n   |                       specification\n   |-------- version:      numeric, incremented for each major revision of the \n                           specification\n\nAn implementation of TPC-H is valid only if it conforms to the \nfollowing version usage rules:\n\n  -- The Version of DBGEN and QGEN must match the integer portion of the \n     current specification revision\n\n15. The current revisions are:\n    DBGEN: 2.4.0\n    QGEN:  2.4.0\n\n16. Validated Platforms\n    The following platforms have been validated to produce the reference \n    data set for TPC-H 2.4.0\n    Processor\tOperating System (version)  Compiler (version)\t\tCompiler Flags\n    ----------------------------------------------------------------------------\n    POWER5 \t\tAIX 64-bit (5.3)\t\t\t  C for AIX Compiler, v7  -q64 (no -g)\n    IA-64\t\tHPUX 64-bit ()\t\ticc \t\t\n    Linux 32-bit ()\tgcc\t\t\t\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Ftpch-dbgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Ftpch-dbgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Ftpch-dbgen/lists"}