https://github.com/chezou/audience_generator


# Audience Data Generator

A script to create dummy data for Audience Studio on Treasure Data.

The script creates a database (if it does not already exist) containing the following tables:

- `users`
- `cities`
- `behavior_1`
- `behavior_2`
- `attribute_1`
- `attribute_2`

## How to use

### Set up

Prerequisites:

- Python 3.8+

We recommend installing into a virtual environment, as follows:

```shell script
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv)$ pip install git+https://github.com/chezou/audience_generator
```

Alternatively, install with the `--user` option.

```shell script
$ pip install --user git+https://github.com/chezou/audience_generator
```

### Usage

Set the environment variable `TD_API_KEY` to your master API key and `TD_API_SERVER` to your API endpoint, then run the command with the target database name.

```bash
$ export TD_API_KEY="1234/XXXXXXXXXX"
$ export TD_API_SERVER="api.treasuredata.com"
$ audience_generator my_db
```

Run the command with the `--help` option to see all available options.

```shell script
Usage: audience_generator [OPTIONS] DATABASE

  Create dummy data for Audience Studio in a database.

  Target tables are: users, cities, behavior_1, and behavior_2.

  Target database will be created automatically if not exists.

Options:
  -s, --api-server TEXT    Treasure Data API Endpoint
  -n, --user-size INTEGER  Target order of generated users. Must be between 11
                           to 100000000

  -o, --overwrite          Recreate target tables
  -d, --dry-run            Check query with dry run. Set -vv to show query.
  -v, --verbose
  --help                   Show this message and exit.
```
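
For scripted runs, the command line can be assembled from Python. The following is a minimal sketch and not part of this package: `build_argv` is a hypothetical helper, it uses only the flags listed in the help output above, and it assumes the documented `--user-size` limits are inclusive.

```python
from typing import List, Optional

# Limits taken from the --help text; treating both bounds as inclusive is an assumption.
MIN_USER_SIZE = 11
MAX_USER_SIZE = 100_000_000


def build_argv(
    database: str,
    user_size: Optional[int] = None,
    overwrite: bool = False,
    dry_run: bool = False,
    verbosity: int = 0,
) -> List[str]:
    """Assemble an audience_generator command line (hypothetical helper)."""
    argv = ["audience_generator"]
    if user_size is not None:
        if not MIN_USER_SIZE <= user_size <= MAX_USER_SIZE:
            raise ValueError(
                f"user_size must be between {MIN_USER_SIZE} and {MAX_USER_SIZE}"
            )
        argv += ["--user-size", str(user_size)]
    if overwrite:
        argv.append("--overwrite")
    if dry_run:
        argv.append("--dry-run")
    # The help text suggests -vv to show the query during a dry run.
    if verbosity > 0:
        argv.append("-" + "v" * verbosity)
    argv.append(database)
    return argv


print(build_argv("my_db", user_size=1000, dry_run=True, verbosity=2))
```

The resulting list can be passed to `subprocess.run` once `TD_API_KEY` and `TD_API_SERVER` are set in the environment.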

## The dummy data examples

### users table

This table consists of `time`, `td_client_id`, `email`, and `country` columns.

- The `email` column is randomly one of `a`, `b`, or `c`.
- The `country` column is one of `japan`, `usa`, or `canada`, in the ratio japan:usa:canada = 2:3:1.

The example table looks like this:

|time|td_client_id|email|country|
|:---|:---|:----|:---|
|1000010|8fc00148-4309-4337-8b7f-89472cf9a6e5|c|japan|
|1000009|a2a61a1a-1ba7-4195-d96c-de92bded5648|b|japan|
|1000007|2239f4b6-c5b8-40af-9838-48bedb6e08e7|a|canada|
|1000006|e69b4154-c45f-4ac3-8529-975bf93a51dc|c|usa|
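
The column rules above can be mimicked locally for testing. This is a rough sketch under stated assumptions, not the generator's own implementation: `make_users`, its `seed` and `start_time` parameters, and the choice of consecutive `time` values are all hypothetical.

```python
import random
import uuid

EMAILS = ["a", "b", "c"]
COUNTRIES = ["japan", "usa", "canada"]
COUNTRY_WEIGHTS = [2, 3, 1]  # japan:usa:canada = 2:3:1, as documented above


def make_users(n, seed=0, start_time=1_000_000):
    """Return n dict rows shaped like the users table (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "time": start_time + i,  # assumption: any increasing integer will do
            # Build a UUID string from seeded random bits so runs are repeatable.
            "td_client_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "email": rng.choice(EMAILS),
            "country": rng.choices(COUNTRIES, weights=COUNTRY_WEIGHTS, k=1)[0],
        })
    return rows


for row in make_users(3):
    print(row)
```

With a large enough `n`, the country counts should approach the 2:3:1 ratio.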

### cities table

This table consists of `name`, `cn`, `fanoutn`, and `time`.

The example table looks like this:

|name|cn|fanoutn|time|
|:---|:---|:---|:---|
|Abidjan1|Abidjan|1|1585294664|
|Abilene1|Abilene|1|1585294664|
|Rabat9|Rabat|9|1585294664|
|Raleigh9|Raleigh|9|1585294664|
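
Judging from the example rows, `name` appears to be `cn` with `fanoutn` appended (for example `Rabat` and `9` give `Rabat9`). That reading is an inference from the sample, not something stated here, and `city_rows` with its `max_fanout` parameter is a hypothetical helper used only to illustrate it.

```python
def city_rows(city_names, max_fanout, time):
    """Build rows shaped like the cities table (illustrative only)."""
    return [
        # Assumption: name is the city name followed by the fan-out number.
        {"name": f"{cn}{n}", "cn": cn, "fanoutn": n, "time": time}
        for n in range(1, max_fanout + 1)
        for cn in city_names
    ]


for row in city_rows(["Abidjan", "Rabat"], max_fanout=2, time=1585294664):
    print(row)
```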

### behavior_1 table

This table consists of `time`, `test_city_name`, and `td_client_id`.

- Join with the `users` table on `td_client_id`.
- Join with the `cities` table by matching `behavior_1.test_city_name` to `cities.name`.

The example table looks like this:

|time|test_city_name|td_client_id|
|:---|:---|:---|
|8146810|Ulaanbaatar7|3e590038-7f2b-4634-dc32-ea7fa82436cc|
|8119605|Adamstown10|3dd6aff9-0f25-49c9-ecd5-fa6b72cf0bea|
|7974002|Luxembourg6|8e25a0db-97ca-46a8-ae9c-e0b03ab54cab|
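
The two joins described above can be tried locally. The sketch below uses an in-memory SQLite database purely as a stand-in for the Treasure Data query engine; the rows and the `client-N` identifiers are toy values made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (time INTEGER, td_client_id TEXT, email TEXT, country TEXT);
CREATE TABLE cities (name TEXT, cn TEXT, fanoutn INTEGER, time INTEGER);
CREATE TABLE behavior_1 (time INTEGER, test_city_name TEXT, td_client_id TEXT);
""")

# Toy rows (hypothetical values, not output of the generator).
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    (1000010, "client-1", "c", "japan"),
    (1000009, "client-2", "b", "usa"),
])
conn.executemany("INSERT INTO cities VALUES (?, ?, ?, ?)", [
    ("Abidjan1", "Abidjan", 1, 1585294664),
    ("Rabat9", "Rabat", 9, 1585294664),
])
conn.executemany("INSERT INTO behavior_1 VALUES (?, ?, ?)", [
    (8000001, "Rabat9", "client-1"),
    (8000002, "Abidjan1", "client-2"),
    (8000003, "Rabat9", "client-3"),  # no matching user: dropped by the inner join
])

# Join behavior_1 to users on td_client_id and to cities on the city name.
query = """
SELECT b.time, u.country, c.cn, c.fanoutn
FROM behavior_1 AS b
JOIN users AS u ON u.td_client_id = b.td_client_id
JOIN cities AS c ON c.name = b.test_city_name
ORDER BY b.time
"""
joined = conn.execute(query).fetchall()
print(joined)
# [(8000001, 'japan', 'Rabat', 9), (8000002, 'usa', 'Abidjan', 1)]
```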

### behavior_2 table

This table consists of `time`, `test_city_name`, `td_client_id`, and `opts`.

- `opts` is an integer from `0` to `2`.
- Join with the `users` table on `td_client_id`.
- Join with the `cities` table by matching `behavior_2.test_city_name` to `cities.name`.

The example table looks like this:

|time|test_city_name|td_client_id|opts|
|:---|:---|:---|:---|
|8376800|Beirut9|dddc1ac8-a68d-4c11-d477-2fb58908b23f|1|
|8266808|Prague5|7ad27e8e-adce-4537-ecf2-d43e4f3ed5bd|0|
|8234002|Palikir9|621c1386-5f7e-44ea-af4d-acd3020349eb|0|
|8204000|Rio Rancho4|dddc1ac8-a68d-4c11-d477-2fb58908b23f|2|

### attribute_1 table

This table consists of `time`, `td_client_id`, `country`, `td_os`, and `td_language`.

- `country` is the same field as in the `users` table.
- `td_os` contains `Linux`, `Windows`, `macOS`, `iOS`, and `Android`. The ratio is 1:1:1:2:2.
- `td_language` contains `ja_JP`, `en_GB`, and `en_US`. The ratio is 2:1:1.

|time|country|td_client_id|td_os|td_language|
|:---|:---|:---|:---|:---|
|8376800|japan|dddc1ac8-a68d-4c11-d477-2fb58908b23f|Windows|en_US|
|8266808|canada|7ad27e8e-adce-4537-ecf2-d43e4f3ed5bd|Android|ja_JP|
|8234002|usa|621c1386-5f7e-44ea-af4d-acd3020349eb|iOS|ja_JP|
|8204000|usa|dddc1ac8-a68d-4c11-d477-2fb58908b23f|macOS|en_GB|
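
When checking generated data against the ratios above, it helps to convert them into expected shares. A small sketch, where `ratio_to_shares` is a hypothetical helper name:

```python
from fractions import Fraction


def ratio_to_shares(ratio):
    """Convert a ratio such as 1:1:1:2:2 into exact expected fractions."""
    total = sum(ratio.values())
    return {key: Fraction(weight, total) for key, weight in ratio.items()}


td_os = ratio_to_shares(
    {"Linux": 1, "Windows": 1, "macOS": 1, "iOS": 2, "Android": 2}
)
td_language = ratio_to_shares({"ja_JP": 2, "en_GB": 1, "en_US": 1})

print(td_os["iOS"], td_language["ja_JP"])  # 2/7 1/2
```

So roughly two in seven rows should have `td_os` set to `iOS`, and about half should have `td_language` set to `ja_JP`.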

### attribute_2 table

This table consists of `td_client_id`, `age`, `item_count`, `ctr`, and `time`.

- `td_client_id` is nullable. The ratio of non-null values is determined by the `non_null_rate` option.
- `age` is a random number from 0 to 60.
- `item_count` is a random number from 0 to 5.
- `ctr` is a random number from 0.0 to 1.0.

|time|td_client_id|age|item_count|ctr|
|:---|:---|:---|:---|:---|
|8376800|dddc1ac8-a68d-4c11-d477-2fb58908b23f|36|1|0.994594137517313|
|8266808|7ad27e8e-adce-4537-ecf2-d43e4f3ed5bd|54|2|0.5873015212226044|
|8234002|621c1386-5f7e-44ea-af4d-acd3020349eb|17|1|0.6011213596629439|
|8204000|dddc1ac8-a68d-4c11-d477-2fb58908b23f|6|2|0.7007648809644941|
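
The column rules for this table can be sketched locally as follows. This is not the generator's own code: `make_attribute_2` is a hypothetical helper, `non_null_rate` is assumed to be a probability between 0 and 1, `age` and `item_count` are assumed to be integers with inclusive bounds, and `random.random()` yields values in the half-open range [0.0, 1.0).

```python
import random


def make_attribute_2(client_ids, non_null_rate, seed=0, time=0):
    """Return rows shaped like the attribute_2 table (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for client_id in client_ids:
        rows.append({
            "time": time,
            # Keep the id with probability non_null_rate, otherwise store None.
            "td_client_id": client_id if rng.random() < non_null_rate else None,
            "age": rng.randint(0, 60),        # assumption: inclusive integer range
            "item_count": rng.randint(0, 5),  # assumption: inclusive integer range
            "ctr": rng.random(),
        })
    return rows


for row in make_attribute_2(["client-1", "client-2", "client-3"], non_null_rate=0.5):
    print(row)
```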