{"id":13906342,"url":"https://github.com/spotify/ratatool","last_synced_at":"2025-05-15T11:00:34.630Z","repository":{"id":10316994,"uuid":"64687812","full_name":"spotify/ratatool","owner":"spotify","description":"A tool for data sampling, data generation, and data diffing","archived":false,"fork":false,"pushed_at":"2025-04-30T20:43:12.000Z","size":1348,"stargazers_count":342,"open_issues_count":26,"forks_count":54,"subscribers_count":27,"default_branch":"master","last_synced_at":"2025-05-11T10:57:02.143Z","etag":null,"topics":["avro","bigquery","parquet","protobuf","scala","scalacheck"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spotify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-08-01T17:33:25.000Z","updated_at":"2025-04-30T20:43:16.000Z","dependencies_parsed_at":"2024-01-19T10:22:45.842Z","dependency_job_id":"ba21eae6-c588-42c3-b6d8-92e6e4bbf9fa","html_url":"https://github.com/spotify/ratatool","commit_stats":{"total_commits":648,"total_committers":42,"mean_commits":"15.428571428571429","dds":0.7561728395061729,"last_synced_commit":"feaa3aec37207f3d0791df7780ec478457bd6e00"},"previous_names":[],"tags_count":61,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fratatool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fratatool/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fratatool/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Fratatool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spotify","download_url":"https://codeload.github.com/spotify/ratatool/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254328384,"owners_count":22052632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","bigquery","parquet","protobuf","scala","scalacheck"],"created_at":"2024-08-06T23:01:33.874Z","updated_at":"2025-05-15T11:00:34.601Z","avatar_url":"https://github.com/spotify.png","language":"Scala","readme":"Ratatool\n========\n\n[![CircleCI](https://circleci.com/gh/spotify/ratatool/tree/master.svg?style=svg)](https://circleci.com/gh/spotify/ratatool/tree/master)\n[![codecov.io](https://codecov.io/github/spotify/ratatool/coverage.svg?branch=master)](https://codecov.io/github/spotify/ratatool?branch=master)\n[![GitHub license](https://img.shields.io/github/license/spotify/ratatool.svg)](./LICENSE)\n[![Maven Central](https://img.shields.io/maven-central/v/com.spotify/ratatool-common_2.12.svg)](https://maven-badges.herokuapp.com/maven-central/com.spotify/ratatool-common_2.12)\n[![Scala Steward 
badge](https://img.shields.io/badge/Scala_Steward-helping-blue.svg?style=flat\u0026logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAQCAMAAAARSr4IAAAAVFBMVEUAAACHjojlOy5NWlrKzcYRKjGFjIbp293YycuLa3pYY2LSqql4f3pCUFTgSjNodYRmcXUsPD/NTTbjRS+2jomhgnzNc223cGvZS0HaSD0XLjbaSjElhIr+AAAAAXRSTlMAQObYZgAAAHlJREFUCNdNyosOwyAIhWHAQS1Vt7a77/3fcxxdmv0xwmckutAR1nkm4ggbyEcg/wWmlGLDAA3oL50xi6fk5ffZ3E2E3QfZDCcCN2YtbEWZt+Drc6u6rlqv7Uk0LdKqqr5rk2UCRXOk0vmQKGfc94nOJyQjouF9H/wCc9gECEYfONoAAAAASUVORK5CYII=)](https://scala-steward.org)\n\nA tool for random data sampling and generation\n\n# Features\n\n- [ScalaCheck Generators](https://github.com/spotify/ratatool/tree/master/ratatool-scalacheck) - [ScalaCheck](http://scalacheck.org/) generators (`Gen[T]`) for property-based testing for scala case classes, [Avro](https://avro.apache.org/), [Protocol Buffers](https://developers.google.com/protocol-buffers/), [BigQuery](https://cloud.google.com/bigquery/) [TableRow](https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableRow.html)\n- [IO](https://github.com/spotify/ratatool/tree/master/ratatool-sampling/src/main/scala/com/spotify/ratatool/io) - utilities for reading and writing records in Avro, [Parquet](http://parquet.apache.org/) (via Avro GenericRecord), BigQuery and TableRow JSON files. Local file system, HDFS and [Google Cloud Storage](https://cloud.google.com/storage/) are supported.\n- [Samplers](https://github.com/spotify/ratatool/tree/master/ratatool-sampling) - random data samplers for Avro, BigQuery and Parquet. 
True random sampling is supported for Avro only while head mode (sampling from the start) is supported for all sources.\n- [Diffy](https://github.com/spotify/ratatool/tree/master/ratatool-diffy) - field-level record diff tool for Avro, Protobuf and BigQuery TableRow.\n- [BigDiffy](https://github.com/spotify/ratatool/blob/master/ratatool-diffy) - [Scio](https://github.com/spotify/scio) library for pairwise field-level statistical diff of data sets. See [slides](http://www.lyh.me/slides/bigdiffy.html) for more.\n- [Command line tool](https://github.com/spotify/ratatool/tree/master/ratatool-cli/src/main/scala/com/spotify/ratatool/tool) - command line tool for local sampler, or executing BigDiffy and BigSampler.\n- [Shapeless](https://github.com/spotify/ratatool/tree/master/ratatool-shapeless) - An extension for Case Class Diffing via Shapeless.\n\nFor more information or documentation, project level READMEs are provided.\n\n# Usage\n\nIf you use [sbt](http://www.scala-sbt.org/) add the following dependency to your build file:\n```scala\nlibraryDependencies += \"com.spotify\" %% \"ratatool-scalacheck\" % \"0.3.10\" % \"test\"\n```\n\nIf needed, the following other libraries are published:\n* `ratatool-diffy`\n* `ratatool-sampling`\n\nOr install via our [Homebrew tap](https://github.com/spotify/homebrew-public) if you're on a Mac:\n\n```\nbrew tap spotify/public\nbrew install ratatool\nratatool\n```\n\nOr download the [release](https://github.com/spotify/ratatool/releases) jar and run it.\n\n```bash\nwget https://github.com/spotify/ratatool/releases/download/v0.3.10/ratatool-cli-0.3.10.tar.gz\nbin/ratatool directSampler\n```\n\nThe command line tool can be used to sample from local file system or Google Cloud Storage directly if [Google Cloud SDK](https://cloud.google.com/sdk/) is installed and authenticated.\n\n```bash\nbin/ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro\nbin/ratatool bigSampler parquet --head -n 1000 --in 
gs://path/to/dataset --out out.parquet\n\n# write output to both JSON file and BigQuery table\nbin/ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \\\n    --out out.json --tableOut project_id:dataset_id.table_id\n```\n\nIt can also be used to sample from HDFS if `core-site.xml` and `hdfs-site.xml` are available.\n\n```bash\nbin/ratatool bigSampler avro \\\n    --head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro\n```\n\nOr execute BigDiffy directly:\n\n```bash\nbin/ratatool bigDiffy \\\n    --input-mode=avro \\\n    --key=record.key \\\n    --lhs=gs://path/to/left \\\n    --rhs=gs://path/to/right \\\n    --output=gs://path/to/output \\\n    --runner=DataflowRunner ....\n```\n\n\n\n# Development\n## Testing local changes to the CLI before releasing\n\nTo test local changes before release:\n```\n$ sbt\n\u003e project ratatoolCli\n\u003e packArchive\n```\nand then find the built CLI at `ratatool-cli/target/ratatool-cli-{version}.tar.gz`\n\n# License\n\nCopyright 2016-2018 Spotify AB.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","funding_links":[],"categories":["Scala"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fratatool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspotify%2Fratatool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Fratatool/lists"}