{"id":30048322,"url":"https://github.com/segmentio/data-digger","last_synced_at":"2025-08-07T10:09:35.189Z","repository":{"id":44849780,"uuid":"304452498","full_name":"segmentio/data-digger","owner":"segmentio","description":"Dig through structured messages in Kafka, S3, or local files","archived":false,"fork":false,"pushed_at":"2025-07-25T20:52:42.000Z","size":74,"stargazers_count":41,"open_issues_count":1,"forks_count":6,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-07-26T03:45:57.935Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/segmentio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-10-15T21:30:51.000Z","updated_at":"2025-07-25T20:48:33.000Z","dependencies_parsed_at":"2024-06-19T01:37:00.276Z","dependency_job_id":"943c9709-0c80-44bc-b555-98b950883c65","html_url":"https://github.com/segmentio/data-digger","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/segmentio/data-digger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/segmentio%2Fdata-digger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/segmentio%2Fdata-digger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/segmentio%2Fdata-digger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/segmentio%2Fdata-digger/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/segmentio","download_url":"https://codeload.github.com/segmentio/data-digger/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/segmentio%2Fdata-digger/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269238177,"owners_count":24383485,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-07T02:00:09.698Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-07T10:09:30.980Z","updated_at":"2025-08-07T10:09:35.177Z","avatar_url":"https://github.com/segmentio.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Circle CI](https://circleci.com/gh/segmentio/data-digger.svg?style=svg\u0026circle-token=567fa1e91d2191b3158e7e0d2145cb3cda9d6f83)](https://circleci.com/gh/segmentio/data-digger)\n[![Go Report Card](https://goreportcard.com/badge/github.com/segmentio/data-digger)](https://goreportcard.com/report/github.com/segmentio/data-digger)\n# data-digger\n\nThe data-digger is a simple tool for \"digging\" through JSON or protobuf-formatted\nstreams and outputting the approximate\n[top K](https://link.springer.com/chapter/10.1007/978-3-642-00887-0_74) values for one or more\nmessage fields. 
In the process of doing this analysis, it can also output the raw message\nbodies (i.e., \"tail\" style).\n\nCurrently, the tool supports reading data in Kafka, S3, or local files. Kafka-sourced messages can\nbe in either JSON or protobuf format. S3 and local file sources support\n[newline-delimited JSON](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) only.\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"1000\" alt=\"digger_screenshot2\" src=\"https://user-images.githubusercontent.com/54862872/96078675-e026ed80-0e67-11eb-9cf3-96b6e0da5556.png\"\u003e\n\u003c/p\u003e\n\n## Motivation\n\nMany software systems generate and/or consume streams of structured messages; these messages might\nrepresent customer events (as at [Segment](https://segment.com)) or system logs,\nfor example, and can be stored in local files, Kafka, S3, or other destinations.\n\nIt's sometimes useful to scan these streams, apply some filtering, and then either print the\nmessages out or generate basic summary stats about them. At Segment, for instance, we frequently\ndo this to debug issues in production, e.g. if a single event type or customer source is\noverloading our data pipeline.\n\nThe data-digger was developed to support these kinds of use cases in an easy, lightweight way.\nIt's not as powerful as frameworks like [Presto](https://prestodb.io/), but it's a lot easier\nto run and, in conjunction with other tools like [`jq`](https://github.com/stedolan/jq), is often\nsufficient for answering basic questions about data streams.\n\n## Installation\n\nEither:\n\n1. Run `GO111MODULE=\"on\" go get github.com/segmentio/data-digger/cmd/digger` *or*\n2. 
Clone this repo and run `make install` in the repo root\n\nThe `digger` binary will be placed in `$GOPATH/bin`.\n\n## Quick tour\n\nFirst, clone this repo and install the `digger` binary as described above.\n\nThen, generate some sample data (requires [Python3](https://www.python.org/downloads/)):\n\n```\n./scripts/generate_sample_data.py\n```\n\nBy default, the script will dump 20 files, each with around 65k newline-delimited JSON messages, into\nthe `test_inputs` subdirectory (run with `--help` to see the configuration options).\n\nEach message, in turn, is a generic, Segment-like event that represents a logged\ncustomer interaction:\n\n```\n{\n  \"app\": [name of the app where event occurred],\n  \"context\": {\n    \"os\": [user os],\n    \"version\": [user os version]\n  },\n  \"latency\": [latency observed by user, in ms],\n  \"messageId\": [id of the message],\n  \"timestamp\": [when the event occurred],\n  \"type\": [interaction type]\n}\n```\n\nWe're now ready to do some digging! Here are some examples to try:\n\n1. Get the top K values for the `app` field:\n\n```\ndigger file --file-paths=test_inputs --paths='app'\n```\n\n2. Get the top K values for the combination of the `app`, `type`, and `context.os` fields:\n\n```\ndigger file --file-paths=test_inputs --paths='app;type;context.os'\n```\n\n3. Show the number of events by day:\n\n```\ndigger file --file-paths=test_inputs --paths='timestamp|@trim:10' --sort-by-name\n```\n\n4. Pretty-print all messages that contain the string \"oreo\" (also requires [`jq`](https://stedolan.github.io/jq/)):\n\n```\ndigger file --file-paths=test_inputs --filter=oreo --raw | jq\n```\n\n5. Get basic stats on the `latency` values by `type`:\n\n```\ndigger file --file-paths=test_inputs --paths='type;latency' --numeric\n```\n\n\n## Usage\n\n```\ndigger [source type] [options]\n```\n\nCurrently, three source types are supported:\n\n1. `kafka`: Read JSON or proto-formatted messages in a Kafka topic.\n2. 
`s3`: Read newline-delimited, JSON-formatted messages from the objects in one or more S3\n  prefixes.\n3. `file`: Read newline-delimited, JSON-formatted messages from one or more local file paths.\n\nThe common options include:\n\n```\n    --debug               turn on debug logging (default: false)\n-f, --filter string       filter regexp to apply before generating stats\n-k, --num-categories int  number of top values to show (default: 25)\n    --numeric             treat values as numbers instead of strings (default: false)\n    --paths string        comma-separated list of paths to generate stats for\n    --plugins string      comma-separated list of golang plugins to load at start\n    --print-missing       print out messages that missing all paths (default: false)\n    --raw                 show raw messages that pass filters (default: false)\n    --raw-extended        show extended info about messages that pass filters (default: false)\n    --sort-by-name        sort top k values by their category/key names (default: false)\n```\n\nEach source also has source-specific options, described in the sections below.\n\n#### Kafka source\n\nThe `kafka` subcommand exposes a number of options to configure the underlying Kafka reader:\n\n```\n-a, --address string      kafka address\n-o, --offset int64        kafka offset (default: -1)\n-p, --partitions string   comma-separated list of partitions\n    --since string        time to start at; can be either RFC3339 timestamp or duration relative to now\n-t, --topic string        kafka topic\n    --until string        time to end at; can be either RFC3339 timestamp or duration relative to now\n```\n\nThe `address` and `topic` options are required; the others are optional and will default to\nreasonable values if omitted (i.e., all partitions starting from the latest message).\n\n#### S3 source\n\nThe `s3` source is configured with a bucket, list of prefixes, and (optional) number of workers:\n\n```\n-b, --bucket string   
    s3 bucket\n    --num-workers int     number of objects to read in parallel (default: 4)\n-p, --prefixes string     comma-separated list of prefixes\n```\n\nThe objects under each prefix can be compressed provided that the `ContentEncoding` is set\nto the appropriate value (e.g., `gzip`).\n\n#### Local file(s) source\n\nThe `file` source is configured with a list of paths:\n\n```\n  --file-paths string   comma-separated list of file paths\n  --recursive           scan directories recursively\n```\n\nEach path can be either a file or directory. If `--recursive` is set, then each directory\nwill be scanned recursively; otherwise, only the top-level files will be processed.\n\nFiles with names ending in `.gz` will be assumed to be gzip-compressed. All other files\nwill be processed as-is.\n\n### Paths syntax\n\nThe optional `paths` flag is used to pull out the values that will be used for the top K\nstats. All arguments should be in\n[gjson syntax](https://github.com/tidwall/gjson/blob/master/SYNTAX.md).\n\nIf desired, multiple paths can be combined with either commas or semicolons.\nIn the comma case, the components will be treated as independent paths and the\n*union* of all values will be counted. When using semicolons, the values for each\npath or path group will be *intersected* and treated as a single value. If both\ncommas and semicolons are used, the union takes precedence over the intersection.\n\nIf the element at a path is an array, then each item in the array will be treated\nas a separate value. 
The tool doesn't currently support intersections involving array\npaths; if a path query would result in more than one intersected value for a single\nmessage, then only the first combination will be counted and the remaining ones will be\ndropped.\n\nIf `paths` is empty, all messages will be assigned to an `__all__` bucket.\n\n#### Extra gjson modifiers\n\nIn addition to the standard `gjson` functionality, the `digger` includes a few\n[custom modifiers](https://github.com/tidwall/gjson#modifiers-and-path-chaining) that we've\nfound helpful for processing data inside Segment:\n\n1. `base64d`: Do a base64 decode on the input\n2. `trim`: Trim the input to the argument length\n\n### Outputs\n\nThe tool output is determined by the flags it's run with. The most common modes include:\n\n1. No output flags set (default): Only show summary stats while running, then print out a top K\n  summary table after an interrupt is detected.\n2. `--raw`: Print out raw values of messages after any filtering and/or decoding. Can be piped to\n  a downstream tool that expects JSON, like `jq`.\n3. `--raw-extended`: Like `--raw`, but wraps each message value in a JSON struct that also includes\n  message context like the partition (kafka case) or key (s3 case) and offset. Can be piped to\n  a downstream tool that expects JSON, like `jq`.\n4. `--print-missing`: Prints out summary stats plus bodies of any messages that don't match\n  the argument paths. Useful for debugging path expressions.\n5. `--debug`: Prints out summary stats plus lots of debug messages, including the details of each\n  processed message. Intended primarily for tool developers.\n\n### Protocol buffer support\n\nThe `kafka` input mode supports processing protobuf types that are in the\n[`gogo`](https://github.com/gogo/protobuf) registry in the `digger` binary.\n\nTo add protobuf types to the registry, either:\n\n1. Clone this repo and import your protobuf types somewhere in the main package *or*\n2. 
Create a golang plugin that includes your protobuf type and run the `digger` with the `--plugins`\n  option\n\nOnce the types are included, you can use them by running the `kafka` subcommand with the\n`--proto-types` option. The values passed to that flag should match the names that your types\nare registered as; you can find these names by looking in the `init` function in the generated\ngo code for your protos.\n\nIn the future, we plan on adding support for protobufs registered via the\n[v2 API](https://blog.golang.org/protobuf-apiv2). The new API supports iterating over all\nregistered message types, which should make the `--proto-types` flag unnecessary in most cases.\n\n## Local development\n\n#### Build binary\n\nRun `make digger`, which will place a binary in `build/digger`.\n\n#### Run unit tests\n\nFirst, run `docker-compose up -d` to start up local Kafka and S3 endpoints. Then,\nrun the tests with `make test`.\n\nWhen you're done running the tests, you can stop the Kafka and S3 containers by running\n`docker-compose down`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsegmentio%2Fdata-digger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsegmentio%2Fdata-digger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsegmentio%2Fdata-digger/lists"}