https://github.com/anicolaspp/mapr-data-gen
Data generator for MapR Data Platform
- Host: GitHub
- URL: https://github.com/anicolaspp/mapr-data-gen
- Owner: anicolaspp
- License: apache-2.0
- Created: 2019-04-22T19:44:13.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-03T16:13:24.000Z (about 4 years ago)
- Last Synced: 2023-03-01T20:25:59.393Z (over 1 year ago)
- Topics: data, mapr, mapr-db, mapr-es, mapr-streams, maprdb, parquet, scala, spark
- Language: Scala
- Size: 44.9 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MapR Data Generator
Data generator for MapR Data Platform

## How to build
```bash
mvn clean compile assembly:single
```

This should give you `/mapr-data-gen-1.0.jar` in your `target` folder.
## How to run
```bash
./bin/spark-submit --master yarn \
  --class com.github.anicolaspp.Generator \
  mapr-data-gen-1.0.jar [OPTIONS]
```

Current options are:
```bash
usage: parquet-generator
-C,--compress compression type, valid values are:
uncompressed, snappy, gzip,
lzo (default: uncompressed)
-f,--format output format type (e.g., parquet, maprdb (default), etc.)
-o,--output the output file name (default: /ParqGenOutput.parquet)
-O,--options key,value strings that will be passed to the data source of spark in
writing. E.g., for parquet you may want to re-consider parquet.block.size. The
default is 128MB (the HDFS block size).
-p,--partitions number of output partitions (default: 1)
-r,--rows total number of rows (default: 10)
-R,--rangeInt maximum int value, value for any Int column will be generated between
[0,rangeInt), (default: 2147483647)
-s,--size any variable payload size, string or payload in IntPayload (default: 100)
-S,--show show number of rows (default: 0, zero means do not show)
-t,--tasks number of tasks to generate this data (default: 1)
```

An example run would be:
```bash
./bin/spark-submit --master yarn \
--class com.github.anicolaspp.Generator mapr-data-gen-1.0.jar \
-o /user/mapr/tables/test_gen -r 84 -s 42 -p 12 -f maprdb
```

This will create `1008 (= 12 * 84)` rows of `case class Data` as
`[String, Int, Array[Byte], Double, Float, Long, String]`, each with a 42-byte byte array and a 42-character String, and save them
as a MapR-DB table in `/user/mapr/tables/test_gen`.

We can generate parquet data in the following way.
```bash
./bin/spark-submit --master yarn \
--class com.github.anicolaspp.Generator mapr-data-gen-1.0.jar \
-o /user/mapr/data/parquet -r 84 -s 42 -p 12 -f parquet
```

In this case the same data is generated and saved in `/user/mapr/data/parquet`.
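The row type mentioned above can be sketched in Scala. The field names below are illustrative assumptions (the README only states the types), and the random-value logic mirrors the `-s` (payload size) and `-R` (int range) semantics from the usage text:

```scala
import scala.util.Random

// Sketch of the row type implied by the README: seven fields typed
// [String, Int, Array[Byte], Double, Float, Long, String].
// Field names are assumptions, not taken from the source.
case class Data(
  name: String,
  intValue: Int,
  payload: Array[Byte],
  doubleValue: Double,
  floatValue: Float,
  longValue: Long,
  text: String
)

object DataGenSketch {
  // Mirrors -s (variable payload/string size) and -R (maximum int value).
  def randomRow(size: Int, rangeInt: Int, rng: Random): Data =
    Data(
      name = rng.alphanumeric.take(size).mkString,
      intValue = rng.nextInt(rangeInt),
      payload = Array.fill(size)(rng.nextInt(256).toByte),
      doubleValue = rng.nextDouble(),
      floatValue = rng.nextFloat(),
      longValue = rng.nextLong(),
      text = rng.alphanumeric.take(size).mkString
    )

  def main(args: Array[String]): Unit = {
    // -s 42 gives a 42-byte array and 42-character strings per row.
    val row = randomRow(size = 42, rangeInt = Int.MaxValue, rng = new Random(1))
    println(s"payload = ${row.payload.length} bytes, text = ${row.text.length} chars")
  }
}
```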
We can generate stream data in the following way.
```bash
./spark-submit --master yarn \
--deploy-mode client \
--num-executors 12 \
--executor-cores 2 \
--executor-memory 5G \
--class com.github.anicolaspp.Generator \
~/mapr-data-gen-1.0.jar -o /user/mapr/streams/random_data:t1 -r 1000000 -s 1024 -p 24 -f mapres -c 50 -t 20
```

1. Notice that `-o` points to a MapR Stream and includes the topic (`t1` in our case).
2. We are generating 1,000,000 rows.
3. We use `-t 20` to indicate that we use 20 tasks to generate the data.
4. We use `-p 24` to indicate that we use 24 partitions to write to the stream.
5. We use `-c 50` to indicate that we use 50 threads on each partition to write to MapR-ES.
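As a back-of-envelope sanity check on the run above (this calculation is ours, not from the README): 1,000,000 rows with a 1,024-byte payload each amount to roughly 1 GB of payload data written to the stream:

```scala
object StreamVolume {
  def main(args: Array[String]): Unit = {
    val rows = 1000000L      // -r 1000000
    val payloadBytes = 1024L // -s 1024
    val totalBytes = rows * payloadBytes
    // 1,024,000,000 bytes, i.e. about 0.95 GiB of payload alone,
    // before any per-record overhead added by the stream.
    println(f"total payload: $totalBytes bytes (${totalBytes / math.pow(1024, 3)}%.2f GiB)")
  }
}
```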
Finally, if we need to run the stream data generator for a period of time (for example, to test a consumer while writing data), we can add the `-m` argument to indicate how long we want it to run (in minutes).

```bash
./spark-submit --master yarn \
--deploy-mode client \
--num-executors 12 \
--executor-cores 2 \
--executor-memory 5G \
--class com.github.anicolaspp.Generator \
~/mapr-data-gen-1.0.jar -o /user/mapr/streams/random_data:t1 -r 10000 -s 1024 -p 5 -f mapres -c 50 -t 100 -m 5
```
Notice the same arguments as before, but we have added `-m 5` to indicate that we want to write the generated data for 5 minutes. The application will stop after that.
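The `-m` behavior can be sketched as a simple wall-clock deadline loop. This is an illustrative assumption about the mechanism; `runFor` and its signature are not from the source:

```scala
object TimedRunSketch {
  // Keep producing batches until `minutes` of wall-clock time have passed,
  // then stop, as the -m option describes. Returns the batch count.
  def runFor(minutes: Long)(produceBatch: () => Unit): Int = {
    val deadline = System.nanoTime() + minutes * 60L * 1000000000L
    var batches = 0
    while (System.nanoTime() < deadline) {
      produceBatch()
      batches += 1
    }
    batches
  }

  def main(args: Array[String]): Unit = {
    // With a 0-minute budget the deadline is already reached, so nothing runs.
    val n = runFor(0)(() => ())
    println(s"batches produced: $n")
  }
}
```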