IBM Synthetic Data Generator for Itemsets and Sequences
https://github.com/zakimjz/IBMGenerator
- Host: GitHub
- URL: https://github.com/zakimjz/IBMGenerator
- Owner: zakimjz
- Created: 2016-10-06T01:49:50.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2019-12-22T16:59:01.000Z (almost 5 years ago)
- Last Synced: 2024-06-01T13:39:28.271Z (6 months ago)
- Topics: itemset-mining, sequence-datasets, sequence-mining, synthetic-dataset-generation
- Language: C++
- Size: 24.4 KB
- Stars: 25
- Watchers: 4
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
# IBMGenerator
IBM Synthetic Data Generator for Itemsets and Sequences.

Type `make`, which will create the executable file `gen`.
Type `./gen -help` for general help.
For itemsets, type `./gen lit -help`.
For sequences, type `./gen seq -help`.

## Itemset Datasets
These datasets mimic the transactions in a retail
environment, where people tend to buy sets of items
together, the so-called maximal potentially frequent itemsets. The
size of these maximal itemsets is clustered around a mean,
with a few long itemsets. A transaction may contain one or
more such frequent sets. The transaction size is also
clustered around a mean, but a few transactions may contain
many items.
Let *D* denote the number of transactions, *T* the average
transaction size, *I* the size of a maximal potentially frequent
itemset, *L* the number of maximal potentially frequent
itemsets, and *N* the number of items. The data is generated
using the following procedure. We first generate *L* maximal
itemsets of average size *I* by choosing from the *N* items. We
next generate *D* transactions of average size *T* by choosing
from the *L* maximal itemsets.
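As a rough illustration of this two-stage procedure, here is a short Python sketch (our own illustration, not the repository's C++ code; the size distributions are simplified stand-ins for the ones the generator actually uses):

```python
import random

def generate_itemset_data(D, T, I, L, N, seed=0):
    """Illustrative two-stage generation: L maximal itemsets over N items,
    then D transactions assembled by sampling from those itemsets."""
    rng = random.Random(seed)

    # Stage 1: L maximal potentially frequent itemsets of average size I.
    max_itemsets = []
    for _ in range(L):
        size = max(1, min(N, int(rng.gauss(I, 1))))
        max_itemsets.append(rng.sample(range(N), size))

    # Stage 2: D transactions of average size T, each drawing its items from
    # one or more of the maximal itemsets (truncated to the drawn size).
    transactions = []
    for _ in range(D):
        size = max(1, min(N, int(rng.gauss(T, 2))))
        items = set()
        while len(items) < size:
            items.update(rng.choice(max_itemsets))
        transactions.append(sorted(items)[:size])
    return transactions
```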
Type `./gen lit -help` for all the parameters to generate itemset datasets:
Command Line Options:
-ncust number_of_customers (in 1000's) (default: 100)
-slen avg_trans_per_customer (default: 10)
-tlen avg_items_per_transaction (default: 2.5)
-nitems number_of_different_items (in '000s) (default: 10000)
-rept repetition-level (default: 0)
-seq.npats number_of_seq_patterns (default: 5000)
-seq.patlen avg_length_of_maximal_pattern (default: 4)
-seq.corr correlation_between_patterns (default: 0.25)
-seq.conf avg_confidence_in_a_rule (default: 0.75)
-lit.npats number_of_patterns (default: 25000)
-lit.patlen avg_length_of_maximal_pattern (default: 1.25)
-lit.corr correlation_between_patterns (default: 0.25)
-lit.conf avg_confidence_in_a_rule (default: 0.75)
-fname (write to filename.data and filename.pat)
-ascii (Write data in ASCII format; default: False)
-version (to print out version info)

An example run can be:
./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii
This will generate a datafile named "T10I4D100K.data".
In fact it generates three files:

[fname].data -- the actual data file
[fname].conf -- configuration info
[fname].pat -- the embedded patterns
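For scripting, the example run above can also be launched from Python via `subprocess`. This is only a convenience sketch; the per-flag comments are inferred from the dataset name T10I4D100K, not from the `./gen lit -help` output, and it assumes `gen` has already been built with `make` in the current directory.

```python
import subprocess

# Reproduce the example itemset run shown above.
cmd = [
    "./gen", "lit",
    "-ntrans", "100",        # transactions, in 1000's (D = 100K)
    "-tlen", "10",           # average transaction size (T = 10)
    "-nitems", "1",          # items, in 1000's (N = 1000)
    "-npats", "1000",        # number of patterns (L)
    "-patlen", "4",          # average pattern length (I = 4)
    "-fname", "T10I4D100K",  # output filename prefix
    "-ascii",                # write ASCII output
]
subprocess.run(cmd, check=True)  # produces T10I4D100K.data, .conf and .pat
```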
### Data Format
The generated file has the following format. Each line contains:

TID TID NITEMS ITEMSET

where TID is a transaction identifier, NITEMS is the number of items in
that transaction, and ITEMSET is the set of items making up that
transaction. All ITEMSETS are sorted lexicographically. Note that TID is
repeated for consistency with the sequence generator.
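The following is a minimal reader sketch for this layout (our own illustration, assuming the file was written with `-ascii` so each line is whitespace-separated integers):

```python
def read_transactions(path):
    """Parse an ASCII itemset file whose lines read 'TID TID NITEMS ITEMSET'."""
    transactions = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            tid, nitems = int(fields[0]), int(fields[2])
            items = [int(x) for x in fields[3:3 + nitems]]
            transactions[tid] = items
    return transactions

# e.g. data = read_transactions("T10I4D100K.data")
```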
## Sequence Datasets

The generator generates sequence datasets that
mimic real-world transactions, where people buy a
sequence of sets of items. Some customers may buy only some items from
the sequences, or they may buy items from multiple sequences. The
input-sequence size and event size are clustered around a mean and a few
of them may have many elements.

The datasets are generated using the
following process. First *NI* maximal events of average size *I* are
generated by choosing from *N* items. Then *NS* maximal sequences of average
size *S* are created by assigning events from *NI* to each sequence. Next a
customer (or input-sequence) of average *C* transactions (or events) is
created, and sequences in *NS* are assigned to different customer
elements, respecting the average transaction size of *T*. The generation
stops when *D* input-sequences have been generated. Default values are
*NS* = 5000, *NI* = 25000 and *N* = 10000.
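A similarly rough sketch of this pipeline (again our own simplified Python illustration, not the generator's actual code):

```python
import random

def generate_sequence_data(D, C, T, S, I, NS=5000, NI=25000, N=10000, seed=0):
    """Illustrative pipeline: NI maximal events -> NS maximal sequences ->
    D customer input-sequences of about C events of about T items each."""
    rng = random.Random(seed)

    # NI maximal events of average size I, drawn from N items.
    events = [rng.sample(range(N), max(1, min(N, int(rng.gauss(I, 1)))))
              for _ in range(NI)]

    # NS maximal sequences of average length S, built from those events.
    sequences = [[rng.choice(events) for _ in range(max(1, int(rng.gauss(S, 1))))]
                 for _ in range(NS)]

    # D customers (input-sequences), each with about C events of about T items,
    # filled by walking through randomly chosen maximal sequences.
    customers = []
    for _ in range(D):
        n_events = max(1, int(rng.gauss(C, 1)))
        cust = []
        while len(cust) < n_events:
            for event in rng.choice(sequences):
                cust.append(sorted(event)[:max(1, int(rng.gauss(T, 1)))])
                if len(cust) == n_events:
                    break
        customers.append(cust)
    return customers
```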
Type `./gen seq -help` for all the parameters to generate sequence datasets:
Command Line Options:
-ncust number_of_customers (in 1000's) (default: 100)
-slen avg_trans_per_customer (default: 10)
-tlen avg_items_per_transaction (default: 2.5)
-nitems number_of_different_items (in '000s) (default: 10000)
-rept repetition-level (default: 0)
-seq.npats number_of_seq_patterns (default: 5000)
-seq.patlen avg_length_of_maximal_pattern (default: 4)
-seq.corr correlation_between_patterns (default: 0.25)
-seq.conf avg_confidence_in_a_rule (default: 0.75)
-lit.npats number_of_patterns (default: 25000)
-lit.patlen avg_length_of_maximal_pattern (default: 1.25)
-lit.corr correlation_between_patterns (default: 0.25)
-lit.conf avg_confidence_in_a_rule (default: 0.75)
-fname (write to filename.data and filename.pat)
-ascii (Write data in ASCII format; default: False)
-version (to print out version info)

An example run can be:
./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii
This will generate a datafile named "C10T2.5S4I1.25D200K.data".
In fact, it generates four files:

[fname].data -- the actual data file
[fname].conf -- configuration info
[fname].pat -- the embedded patterns
[fname].ntpc -- info on number of trans per customer (ignore this file)
### Data Format
The generated file has the following format. Each line contains:

SID TID NITEMS ITEMSET

where SID is the sequence identifier, TID is a transaction/event identifier,
NITEMS is the number of items in that transaction, and ITEMSET is the set of
items making up that transaction. The TIDs for an SID are listed in temporal
order, i.e., TIDs are event ids within that sequence. All ITEMSETS are also
sorted lexicographically.
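A minimal reader sketch for this layout (our own illustration, again assuming `-ascii` output), grouping each sequence's events by SID in the order they appear:

```python
from collections import defaultdict

def read_sequences(path):
    """Parse an ASCII sequence file whose lines read 'SID TID NITEMS ITEMSET'."""
    sequences = defaultdict(list)   # SID -> [(TID, items), ...] in temporal order
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            sid, tid, nitems = int(fields[0]), int(fields[1]), int(fields[2])
            items = [int(x) for x in fields[3:3 + nitems]]
            sequences[sid].append((tid, items))
    return sequences

# e.g. seqs = read_sequences("C10T2.5S4I1.25D200K.data")
```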