{"id":13471180,"url":"https://github.com/zakimjz/IBMGenerator","last_synced_at":"2025-03-26T13:30:53.138Z","repository":{"id":37706486,"uuid":"70114443","full_name":"zakimjz/IBMGenerator","owner":"zakimjz","description":"IBM Synthetic Data Generator for Itemsets and Sequences","archived":false,"fork":false,"pushed_at":"2019-12-22T16:59:01.000Z","size":25,"stargazers_count":27,"open_issues_count":0,"forks_count":9,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-10-30T02:58:47.675Z","etag":null,"topics":["itemset-mining","sequence-datasets","sequence-mining","synthetic-dataset-generation"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zakimjz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-10-06T01:49:50.000Z","updated_at":"2024-09-14T17:24:17.000Z","dependencies_parsed_at":"2022-09-10T02:10:45.798Z","dependency_job_id":null,"html_url":"https://github.com/zakimjz/IBMGenerator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zakimjz%2FIBMGenerator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zakimjz%2FIBMGenerator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zakimjz%2FIBMGenerator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zakimjz%2FIBMGenerator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zakimjz","download_url":"https://codeload.github.com/zakimjz/IBMGenerator/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245662760,"owners_count":20652077,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["itemset-mining","sequence-datasets","sequence-mining","synthetic-dataset-generation"],"created_at":"2024-07-31T16:00:41.152Z","updated_at":"2025-03-26T13:30:52.743Z","avatar_url":"https://github.com/zakimjz.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# IBMGenerator\nIBM Synthetic Data Generator for Itemsets and Sequences\n\nType make, which will create the executable file 'gen'\n\ntype ./gen -help for general help\n\nFor itemsets, type ./gen lit -help\nFor sequences, type ./gen seq -help\n\n\n## Itemset Datasets\n\nThese datasets mimic the transactions in a retailing\nenvironment, where people tend to buy sets of items\ntogether, the so called potential maximal frequent set. The\nsize of the maximal elements is clustered around a mean\nwith a few long itemsets. A transaction may contain one or\nmore of such frequent sets. The transaction size is also\nclustered around a mean, but a few of them may contain\nmany items.\nLet *D* denote the number of transactions, *T* the average\ntransaction size, *I* the size of a maximal potentially frequent\nitemset, *L* the number of maximal potentially frequent\nitemsets, and *N* the number of items. The data is generated\nusing the following procedure. We first generate *L* maximal\nitemsets of average size *I* by choosing from the *N* items. 
We\nnext generate *D* transactions of average size *T* by choosing\nfrom the *L* maximal itemsets. \n\nType: ./gen lit -help \n\nfor all the parameters to generate itemset datasets:\n\n\nCommand Line Options:\n  \n  -ncust number_of_customers (in 1000's) (default: 100)\n  \n  -slen avg_trans_per_customer (default: 10)\n  \n  -tlen avg_items_per_transaction (default: 2.5)\n  \n  -nitems number_of_different_items (in '000s) (default: 10000)\n  \n  -rept repetition-level (default: 0)\n\n  -seq.npats number_of_seq_patterns (default: 5000)\n\n  -seq.patlen avg_length_of_maximal_pattern (default: 4)\n  \n  -seq.corr correlation_between_patterns (default: 0.25)\n  \n  -seq.conf avg_confidence_in_a_rule (default: 0.75)\n\n  -lit.npats number_of_patterns (default: 25000)\n  \n  -lit.patlen avg_length_of_maximal_pattern (default: 1.25)\n  \n  -lit.corr correlation_between_patterns (default: 0.25)\n  \n  -lit.conf avg_confidence_in_a_rule (default: 0.75)\n\n  -fname \u003cfilename\u003e (write to filename.data and filename.pat)\n  \n  -ascii (Write data in ASCII format; default: False)\n  \n  -version (to print out version info)\n\n\nAn example run can be: \n\n./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii\n\nThis will generate a datafile named \"T10I4D100K.data\".\nIn fact, it generates three files:\n\n[fname].data -- the actual data file\n\n[fname].conf -- configuration info\n\n[fname].pat -- the embedded patterns\n\n\n### Data Format\nThe generated file has the following format. Each line contains:\n\nTID TID NITEMS ITEMSET\n\nwhere TID is a transaction identifier, NITEMS is the number of items in\nthat transaction, and ITEMSET is the set of items making up that\ntransaction. All ITEMSETS are sorted lexicographically. 
Note that TID is\nrepeated for consistency with the sequence generator.\n\n\n## Sequence Datasets\n\nThe generator produces sequence datasets that \nmimic real-world transactions, where people buy a\nsequence of sets of items. Some customers may buy only some items from\nthe sequences, or they may buy items from multiple sequences. The\ninput-sequence size and event size are clustered around a mean, and a few\nof them may have many elements. \n\nThe datasets are generated using the\nfollowing process. First, *NI* maximal events of average size *I* are\ngenerated by choosing from *N* items. Then *NS* maximal sequences of average\nsize *S* are created by assigning events from *NI* to each sequence. Next, a\ncustomer (or input-sequence) of average *C* transactions (or events) is\ncreated, and sequences in *NS* are assigned to different customer\nelements, respecting the average transaction size of *T*. The generation\nstops when *D* input-sequences have been generated. Default values are \n*NS* = 5000, *NI* = 25000 and *N* = 10000. 
\n\nType: ./gen seq -help \n\nfor all the parameters to generate sequence datasets:\n\nCommand Line Options:\n\n  -ncust number_of_customers (in 1000's) (default: 100)\n\n  -slen avg_trans_per_customer (default: 10)\n  \n  -tlen avg_items_per_transaction (default: 2.5)\n  \n  -nitems number_of_different_items (in '000s) (default: 10000)\n  \n  -rept repetition-level (default: 0)\n\n  -seq.npats number_of_seq_patterns (default: 5000)\n  \n  -seq.patlen avg_length_of_maximal_pattern (default: 4)\n  \n  -seq.corr correlation_between_patterns (default: 0.25)\n  \n  -seq.conf avg_confidence_in_a_rule (default: 0.75)\n\n  -lit.npats number_of_patterns (default: 25000)\n\n  -lit.patlen avg_length_of_maximal_pattern (default: 1.25)\n  \n  -lit.corr correlation_between_patterns (default: 0.25)\n  \n  -lit.conf avg_confidence_in_a_rule (default: 0.75)\n\n  -fname \u003cfilename\u003e (write to filename.data and filename.pat)\n  \n  -ascii (Write data in ASCII format; default: False)\n  \n  -version (to print out version info)\n\nAn example run can be:\n\n./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii\n\nThis will generate a datafile named \"C10T2.5S4I1.25D200K.data\"\nIn fact, it generates four files:\n\n[fname].data -- the actual data file\n\n[fname].conf -- configuration info\n\n[fname].pat -- the embedded patterns\n\n[fname].ntpc -- info on number of trans per customer (ignore this file)\n\n\n### Data Format\nThe generated file has the following format. Each line contains:\n\nSID TID NITEMS ITEMSET\n\nwhere SID is the sequence identifier, TID is a transaction/event identifier, NITEMS is the number of items in\nthat transaction, and ITEMSET is the set of items making up that\ntransaction. The TIDs for an SID are listed in temporal order, i.e.,\nTIDs are event ids within that sequence. 
All ITEMSETS are also sorted\nlexicographically.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzakimjz%2FIBMGenerator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzakimjz%2FIBMGenerator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzakimjz%2FIBMGenerator/lists"}