https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
- Host: GitHub
- URL: https://github.com/anicolaspp/mapr-data-gen
- Owner: anicolaspp
- License: apache-2.0
- Created: 2019-04-22T19:44:13.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-03T16:13:24.000Z (about 4 years ago)
- Last Synced: 2023-03-01T20:25:59.393Z (over 1 year ago)
- Topics: data, mapr, mapr-db, mapr-es, mapr-streams, maprdb, parquet, scala, spark
- Language: Scala
- Size: 44.9 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MapR Data Generator
Data generator for MapR Data Platform

## How to build
```bash
mvn clean compile assembly:single
```

This should give you `/mapr-data-gen-1.0.jar` in your `target` folder.
## How to run
```bash
./bin/spark-submit --master yarn \
  --class com.github.anicolaspp.Generator \
  mapr-data-gen-1.0.jar [OPTIONS]
```

Current options are:
```bash
usage: parquet-generator
-C,--compress compression type, valid values are:
uncompressed, snappy, gzip,
lzo (default: uncompressed)
-f,--format output format type (e.g., parquet, maprdb (default), etc.)
-o,--output the output file name (default: /ParqGenOutput.parquet)
-O,--options key,value strings that will be passed to the data source of spark in
writing. E.g., for parquet you may want to re-consider parquet.block.size. The
default is 128MB (the HDFS block size).
-p,--partitions number of output partitions (default: 1)
-r,--rows total number of rows (default: 10)
-R,--rangeInt maximum int value, value for any Int column will be generated between
[0,rangeInt), (default: 2147483647)
-s,--size any variable payload size, string or payload in IntPayload (default: 100)
-S,--show show number of rows (default: 0, zero means do not show)
-t,--tasks number of tasks to generate this data (default: 1)
```

An example run would be:
```bash
./bin/spark-submit --master yarn \
--class com.github.anicolaspp.Generator mapr-data-gen-1.0.jar \
-o /user/mapr/tables/test_gen -r 84 -s 42 -p 12 -f maprdb
```

This will create `1008 (= 12 * 84)` rows of `case class Data` as
`[String, Int, Array[Byte], Double, Float, Long, String]`, each with a 42-byte byte array and a 42-character String, and save them
as a MapR-DB table in `/user/mapr/tables/test_gen`.

We can generate parquet data in the following way.
```bash
./bin/spark-submit --master yarn \
--class com.github.anicolaspp.Generator mapr-data-gen-1.0.jar \
-o /user/mapr/data/parquet -r 84 -s 42 -p 12 -f parquet
```

In this case the same data is generated and saved in `/user/mapr/data/parquet`.
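The row type mentioned above can be sketched in Scala. The field names below are illustrative assumptions (the README only states the types), and the random-value logic mirrors the `-s` (payload size) and `-R` (int range) semantics from the usage text:

```scala
import scala.util.Random

// Sketch of the row type implied by the README: seven fields typed
// [String, Int, Array[Byte], Double, Float, Long, String].
// Field names are assumptions, not taken from the source.
case class Data(
  name: String,
  intValue: Int,
  payload: Array[Byte],
  doubleValue: Double,
  floatValue: Float,
  longValue: Long,
  text: String
)

object DataGenSketch {
  // Mirrors -s (variable payload/string size) and -R (maximum int value).
  def randomRow(size: Int, rangeInt: Int, rng: Random): Data =
    Data(
      name = rng.alphanumeric.take(size).mkString,
      intValue = rng.nextInt(rangeInt),
      payload = Array.fill(size)(rng.nextInt(256).toByte),
      doubleValue = rng.nextDouble(),
      floatValue = rng.nextFloat(),
      longValue = rng.nextLong(),
      text = rng.alphanumeric.take(size).mkString
    )

  def main(args: Array[String]): Unit = {
    // -s 42 gives a 42-byte array and 42-character strings per row.
    val row = randomRow(size = 42, rangeInt = Int.MaxValue, rng = new Random(1))
    println(s"payload = ${row.payload.length} bytes, text = ${row.text.length} chars")
  }
}
```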
We can generate stream data in the following way.
```bash
./spark-submit --master yarn \
--deploy-mode client \
--num-executors 12 \
--executor-cores 2 \
--executor-memory 5G \
--class com.github.anicolaspp.Generator \
~/mapr-data-gen-1.0.jar -o /user/mapr/streams/random_data:t1 -r 1000000 -s 1024 -p 24 -f mapres -c 50 -t 20
```

1. Notice that `-o` points to a MapR Stream and includes the topic (`t1` in our case).
2. We are generating 1,000,000 rows.
3. We use `-t 20` to indicate that we use 20 tasks to generate the data.
4. We use `-p 24` to indicate that we use 24 partitions to write to the stream.
5. We use `-c 50` to indicate that we use 50 threads on each partition to write to MapR-ES.
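As a back-of-envelope sanity check on the run above (this calculation is ours, not from the README): 1,000,000 rows with a 1,024-byte payload each amount to roughly 1 GB of payload data written to the stream:

```scala
object StreamVolume {
  def main(args: Array[String]): Unit = {
    val rows = 1000000L      // -r 1000000
    val payloadBytes = 1024L // -s 1024
    val totalBytes = rows * payloadBytes
    // 1,024,000,000 bytes, i.e. about 0.95 GiB of payload alone,
    // before any per-record overhead added by the stream.
    println(f"total payload: $totalBytes bytes (${totalBytes / math.pow(1024, 3)}%.2f GiB)")
  }
}
```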
Finally, if we need to run the stream data generator for a period of time (for example, to test a consumer while writing data), we can add the `-m` argument to indicate how long we want it to run (in minutes).

```bash
./spark-submit --master yarn \
--deploy-mode client \
--num-executors 12 \
--executor-cores 2 \
--executor-memory 5G \
--class com.github.anicolaspp.Generator \
~/mapr-data-gen-1.0.jar -o /user/mapr/streams/random_data:t1 -r 10000 -s 1024 -p 5 -f mapres -c 50 -t 100 -m 5
```
Notice the same arguments as before, but we have added `-m 5` to indicate that we want to write the generated data for 5 minutes. The application will stop after that.
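The `-m` behavior can be sketched as a simple wall-clock deadline loop. This is an illustrative assumption about the mechanism; `runFor` and its signature are not from the source:

```scala
object TimedRunSketch {
  // Keep producing batches until `minutes` of wall-clock time have passed,
  // then stop, as the -m option describes. Returns the batch count.
  def runFor(minutes: Long)(produceBatch: () => Unit): Int = {
    val deadline = System.nanoTime() + minutes * 60L * 1000000000L
    var batches = 0
    while (System.nanoTime() < deadline) {
      produceBatch()
      batches += 1
    }
    batches
  }

  def main(args: Array[String]): Unit = {
    // With a 0-minute budget the deadline is already reached, so nothing runs.
    val n = runFor(0)(() => ())
    println(s"batches produced: $n")
  }
}
```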