{"id":18830662,"url":"https://github.com/materializeinc/datagen","last_synced_at":"2025-04-04T10:02:41.521Z","repository":{"id":94931092,"uuid":"581198418","full_name":"MaterializeInc/datagen","owner":"MaterializeInc","description":"Generate authentic looking mock data based on a SQL, JSON or Avro schema and produce to Kafka in JSON or Avro format.","archived":false,"fork":false,"pushed_at":"2024-10-01T09:07:04.000Z","size":547,"stargazers_count":142,"open_issues_count":10,"forks_count":13,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-10-16T06:32:26.120Z","etag":null,"topics":["avro","kafka","sql"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaterializeInc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-22T14:33:05.000Z","updated_at":"2024-10-14T18:03:19.000Z","dependencies_parsed_at":"2024-05-21T20:46:26.629Z","dependency_job_id":"99af5ed1-0e49-4d76-9495-a7316c981b57","html_url":"https://github.com/MaterializeInc/datagen","commit_stats":{"total_commits":87,"total_committers":7,"mean_commits":"12.428571428571429","dds":"0.33333333333333337","last_synced_commit":"e5f8fe00649a2f1cb8b5d92618a08a2f10d440b1"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaterializeInc%2Fdatagen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaterializeInc%2Fdatagen/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaterializeInc%2Fdatagen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaterializeInc%2Fdatagen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaterializeInc","download_url":"https://codeload.github.com/MaterializeInc/datagen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247157275,"owners_count":20893220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","kafka","sql"],"created_at":"2024-11-08T01:49:51.901Z","updated_at":"2025-04-04T10:02:41.477Z","avatar_url":"https://github.com/MaterializeInc.png","language":"TypeScript","readme":"# Datagen CLI\n\nThis command line interface application allows you to take schemas defined in JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`) and produce believable fake data to Kafka in JSON or Avro format or to Postgres.\n\nThe benefits of using this datagen tool are:\n- You can specify what values are generated using the expansive [FakerJS API](https://fakerjs.dev/api/) to craft data that more faithfully imitates your use case. This allows you to more easily apply business logic downstream.\n- This is a relatively simple CLI tool compared to other Kafka data generators that require Kafka Connect.\n- When using the `avro` output format, datagen connects to Schema Registry. 
This allows you to take advantage of the [benefits](https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/) of using Schema Registry.\n- Often when you generate random data, your downstream join results won't make sense because it's unlikely a randomly generated field in one dataset will match a randomly generated field in another. With this datagen tool, you can specify relationships between your datasets so that related columns will match up, resulting in meaningful joins downstream. Jump to the [end-to-end ecommerce tutorial](./examples/ecommerce) for a full example.\n\n\u003e :construction: Specifying relationships between datasets currently requires using JSON for the input schema.\n\n\u003e :construction: The `postgres` output format currently does not support specifying relationships between datasets.\n\n## Installation\n\n### npm\n\n```\nnpm install -g @materializeinc/datagen\n```\n\n### Docker\n\n```\ndocker pull materialize/datagen\n```\n### From Source\n\n\n```bash\ngit clone https://github.com/MaterializeInc/datagen.git\ncd datagen\nnpm install\nnpm run build\nnpm link\n```\n\n## Setup\n\nCreate a file called `.env` with the following environment variables\n\n```bash\n# Kafka Brokers\nexport KAFKA_BROKERS=\n\n# For Kafka SASL Authentication:\nexport SASL_USERNAME=\nexport SASL_PASSWORD=\nexport SASL_MECHANISM=\n\n# For Kafka SSL Authentication:\nexport SSL_CA_LOCATION=\nexport SSL_CERT_LOCATION=\nexport SSL_KEY_LOCATION=\n\n# Connect to Schema Registry if using '--format avro'\nexport SCHEMA_REGISTRY_URL=\nexport SCHEMA_REGISTRY_USERNAME=\nexport SCHEMA_REGISTRY_PASSWORD=\n\n# Postgres\nexport POSTGRES_HOST=\nexport POSTGRES_PORT=\nexport POSTGRES_DB=\nexport POSTGRES_USER=\nexport POSTGRES_PASSWORD=\n\n# MySQL\nexport MYSQL_HOST=\nexport MYSQL_PORT=\nexport MYSQL_DB=\nexport MYSQL_USER=\nexport MYSQL_PASSWORD=\n```\n\nThe `datagen` program will read the environment variables from `.env` in the current 
working directory.\nIf you are running `datagen` from a different directory, you can first `source /path/to/your/.env` before running the command.\n\n\n## Usage\n\n```bash\ndatagen -h\n```\n\n```\nUsage: datagen [options]\n\nFake Data Generator\n\nOptions:\n  -V, --version             output the version number\n  -s, --schema \u003cchar\u003e       Schema file to use\n  -f, --format \u003cchar\u003e       The format of the produced data (choices: \"json\", \"avro\", \"postgres\", \"webhook\", \"mysql\", default: \"json\")\n  -n, --number \u003cchar\u003e       Number of records to generate. For infinite records, use -1 (default: \"10\")\n  -c, --clean               Clean (delete) Kafka topics and schema subjects previously created\n  -dr, --dry-run            Dry run (no data will be produced to Kafka)\n  -d, --debug               Output extra debugging information\n  -w, --wait \u003cint\u003e          Wait time in ms between record production\n  -rs, --record-size \u003cint\u003e  Record size in bytes, eg. 1048576 for 1MB\n  -p, --prefix \u003cchar\u003e       Kafka topic and schema registry prefix\n  -h, --help                display help for command\n```\n\n\n## Quick Examples\n\nSee example input schema files in [examples](./examples) and [tests](/tests) folders.\n\n### Quickstart\n\n1. Iterate through a schema defined in SQL 10 times, but don't actually interact with Kafka or Schema Registry (\"dry run\"). Also, see extra output with debug mode.\n    ```bash\n    datagen \\\n      --schema tests/products.sql \\\n      --format avro \\\n      --dry-run \\\n      --debug\n    ```\n\n1. Same as above, but actually create the schema subjects and Kafka topics, and actually produce the data. There is less output because debug mode is off.\n    ```bash\n    datagen \\\n        --schema tests/products.sql \\\n        --format avro\n    ```\n\n1. Same as above, but produce to Kafka continuously. 
Press `Ctrl+C` to quit.\n    ```bash\n    datagen \\\n        -s tests/products.sql \\\n        -f avro \\\n        -n -1\n    ```\n\n1. If you want to generate a larger payload, you can use the `--record-size` option to specify the number of bytes of junk data to add to each record. Here, we generate a 1MB record. So if you want to generate 1GB of data, you can run the command with the following options:\n    ```bash\n    datagen \\\n        -s tests/products.sql \\\n        -f avro \\\n        -n 1000 \\\n        --record-size 1048576\n    ```\n    This will add a `recordSizePayload` field to the record with the specified size and will send the record to Kafka.\n\n    \u003e :notebook: The 'Max Message Size' of your Kafka cluster needs to be set to a higher value than 1MB for this to work.\n\n1. Clean (delete) the topics and schema subjects created above.\n    ```bash\n    datagen \\\n        --schema tests/products.sql \\\n        --format avro \\\n        --clean\n    ```\n\n### Generate records with sequence numbers\n\nTo simulate auto-incrementing primary keys, you can use the `iteration.index` variable in the schema.\n\nThis is particularly useful when you want to generate a small set of records with a sequence of IDs, for example 1000 records with IDs from 1 to 1000:\n\n```json\n[\n    {\n        \"_meta\": {\n            \"topic\": \"mz_datagen_users\"\n        },\n        \"id\": \"iteration.index\",\n        \"name\": \"faker.internet.userName()\"\n    }\n]\n```\n\nExample:\n\n```\ndatagen \\\n    -s tests/iterationIndex.json \\\n    -f json \\\n    -n 1000 \\\n    --dry-run\n```\n\n### Docker\n\nCall the Docker container like you would call the CLI locally, except:\n- include `--rm` to remove the container when it exits\n- include `-it` (interactive teletype) to see the output as you would locally (e.g. colors)\n- mount `.env` and schema files into the container\n- note that the working directory in the container is `/app`\n\n```\ndocker run \\\n  --rm -it \\\n  -v ${PWD}/.env:/app/.env \\\n  -v ${PWD}/tests/schema.json:/app/blah.json \\\n      materialize/datagen -s blah.json -n 1 --dry-run\n```\n\n## Input Schemas\n\nYou can define input schemas using JSON (`.json`), Avro (`.avsc`), or SQL (`.sql`). Within those schemas, you use the [FakerJS API](https://fakerjs.dev/api/) to define the data that is generated for each field.\n\nYou can pass arguments to `faker` methods by escaping quotes. For example, here is [faker.datatype.number](https://fakerjs.dev/api/datatype.html#number) with `min` and `max` arguments:\n\n```\n\"faker.datatype.number({min: 100, max: 1000})\"\n```\n\n\u003e :construction: Right now, JSON is the only kind of input schema that supports generating relational data.\n\n\u003e :warning: Please inspect your input schema file since `faker` methods can contain arbitrary JavaScript functions that `datagen` will execute.\n\n### JSON Schema\n\nHere is the general syntax for a JSON input schema:\n\n```json\n[\n  {\n    \"_meta\": {\n      \"topic\": \"\u003cmy kafka topic\u003e\",\n      \"key\": \"\u003cfield to be used for kafka record key\u003e\",\n      \"relationships\": [\n        {\n          \"topic\": \"\u003ctopic for dependent dataset\u003e\",\n          \"parent_field\": \"\u003cfield in this dataset\u003e\",\n          \"child_field\": \"\u003cmatching field in dependent dataset\u003e\",\n          \"records_per\": \u003cnumber of records in dependent dataset per record in this dataset\u003e\n        },\n        ...\n      ]\n    },\n    \"\u003cmy first field\u003e\": \"\u003cmethod from the faker API\u003e\",\n    \"\u003cmy second field\u003e\": \"\u003canother method from the faker API\u003e\",\n    ...\n  },\n  {\n    ...\n  },\n  ...\n]\n```\n\nGo to the [end-to-end ecommerce tutorial](./examples/ecommerce) to walk through an example 
that uses a JSON input schema with relational data.\n\n\n### SQL Schema\n\nThe SQL schema option allows you to use a `CREATE TABLE` statement to define what data is generated. You specify the [FakerJS API](https://fakerjs.dev/api/) method using a `COMMENT` on the column. Here is an example:\n\n```sql\nCREATE TABLE \"ecommerce\".\"products\" (\n  \"id\" int PRIMARY KEY,\n  \"name\" varchar COMMENT 'faker.internet.userName()',\n  \"merchant_id\" int NOT NULL COMMENT 'faker.datatype.number()',\n  \"price\" int COMMENT 'faker.datatype.number()',\n  \"status\" int COMMENT 'faker.datatype.boolean()',\n  \"created_at\" timestamp DEFAULT (now())\n);\n```\n\nThis will produce the desired mock data to the topic `ecommerce.products`.\n\n#### Producing to Postgres\n\nYou can also produce the data to a Postgres database. To do this, you need to specify the `-f postgres` option and provide Postgres connection information in the `.env` file. Here is an example `.env` file:\n\n```\n# Postgres\nexport POSTGRES_HOST=\nexport POSTGRES_PORT=\nexport POSTGRES_DB=\nexport POSTGRES_USER=\nexport POSTGRES_PASSWORD=\n```\n\nThen, you can run the following command to produce the data to Postgres:\n\n```bash\ndatagen \\\n    -s tests/products.sql \\\n    -f postgres \\\n    -n 1000\n```\n\n\u003e :warning: You can only produce to Postgres with a SQL schema.\n\n#### Producing to MySQL\n\nYou can also produce the data to a MySQL database. To do this, you need to specify the `-f mysql` option and provide MySQL connection information in the `.env` file. 
Here is an example `.env` file:\n\n```\n# MySQL\nexport MYSQL_HOST=\nexport MYSQL_PORT=\nexport MYSQL_DB=\nexport MYSQL_USER=\nexport MYSQL_PASSWORD=\n```\n\nThen, you can run the following command to produce the data to MySQL:\n\n```bash\ndatagen \\\n    -s tests/products.sql \\\n    -f mysql \\\n    -n 1000\n```\n\n\u003e :warning: You can only produce to MySQL with a SQL schema.\n\n#### Producing to Webhook\n\nYou can also produce the data to a Webhook. To do this, you need to specify the `-f webhook` option and provide Webhook connection information in the `.env` file. Here is an example `.env` file:\n\n```\n# Webhook\nexport WEBHOOK_URL=\nexport WEBHOOK_SECRET=\n```\n\nThen, you can run the following command to produce the data to Webhook:\n\n```bash\ndatagen \\\n    -s tests/products.sql \\\n    -f webhook \\\n    -n 1000\n```\n\n\u003e :warning: You can only produce to Webhook with basic authentication.\n\n### Avro Schema\n\n\u003e :construction: Avro input schema currently does not support arbitrary FakerJS methods. 
Instead, data is randomly generated based on the type.\n\nHere is an example Avro input schema from `tests/schema.avsc` that will produce data to a topic called `products`:\n\n```json\n{\n  \"type\": \"record\",\n  \"name\": \"products\",\n  \"namespace\": \"exp.products.v1\",\n  \"fields\": [\n    { \"name\": \"id\", \"type\": \"string\" },\n    { \"name\": \"productId\", \"type\": [\"null\", \"string\"] },\n    { \"name\": \"title\", \"type\": \"string\" },\n    { \"name\": \"price\", \"type\": \"int\" },\n    { \"name\": \"isLimited\", \"type\": \"boolean\" },\n    { \"name\": \"sizes\", \"type\": [\"null\", \"string\"], \"default\": null },\n    { \"name\": \"ownerIds\", \"type\": { \"type\": \"array\", \"items\": \"string\" } }\n  ]\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaterializeinc%2Fdatagen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaterializeinc%2Fdatagen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaterializeinc%2Fdatagen/lists"}