{"id":13657351,"url":"https://github.com/caltechlibrary/dataset","last_synced_at":"2026-03-12T02:09:13.185Z","repository":{"id":38949377,"uuid":"79394591","full_name":"caltechlibrary/dataset","owner":"caltechlibrary","description":"dataset is a command line tool, Go package, shared library and Python package for working with JSON objects as collections","archived":false,"fork":false,"pushed_at":"2026-02-19T22:57:53.000Z","size":16507,"stargazers_count":24,"open_issues_count":6,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-02-22T13:48:51.374Z","etag":null,"topics":["datasets","json"],"latest_commit_sha":null,"homepage":"https://caltechlibrary.github.io/dataset","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/caltechlibrary.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.html","contributing":"CONTRIBUTING.html","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.html","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2017-01-18T23:18:07.000Z","updated_at":"2026-02-19T22:57:51.000Z","dependencies_parsed_at":"2024-02-27T01:28:24.159Z","dependency_job_id":"5a42159b-0e3d-4533-940e-d78b3ee449f7","html_url":"https://github.com/caltechlibrary/dataset","commit_stats":{"total_commits":1340,"total_committers":5,"mean_commits":268.0,"dds":0.07985074626865674,"last_synced_commit":"accdaa513c01ce4fb72610f6cc541fc3700a2d7d"},"previous_names":[],"tags_count":134,"template":false,"template_full_name":null,"purl":"pkg:github/caltechlibrary/dataset","repository_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/caltechlibrary%2Fdataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/caltechlibrary%2Fdataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/caltechlibrary%2Fdataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/caltechlibrary%2Fdataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/caltechlibrary","download_url":"https://codeload.github.com/caltechlibrary/dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/caltechlibrary%2Fdataset/sbom","scorecard":{"id":263211,"data":{"date":"2025-08-11","repo":{"name":"github.com/caltechlibrary/dataset","commit":"590085568df7c025bd7deaf28e6fa7ba24888b5e"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.8,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":10,"reason":"30 commit(s) and 29 issue activity found in the last 90 days -- score normalized to 10","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Signed-Releases","score":0,"reason":"Project has not signed or included provenance with any 
releases.","details":["Warn: release artifact v2.3.2 not signed: https://api.github.com/repos/caltechlibrary/dataset/releases/231911706","Warn: release artifact v2.3.1 not signed: https://api.github.com/repos/caltechlibrary/dataset/releases/231628768","Warn: release artifact v2.3.0 not signed: https://api.github.com/repos/caltechlibrary/dataset/releases/228408372","Warn: release artifact v2.2.8 not signed: https://api.github.com/repos/caltechlibrary/dataset/releases/228083892","Warn: release artifact v2.2.7 not signed: https://api.github.com/repos/caltechlibrary/dataset/releases/224412900","Warn: release artifact v2.3.2 does not have provenance: https://api.github.com/repos/caltechlibrary/dataset/releases/231911706","Warn: release artifact v2.3.1 does not have provenance: https://api.github.com/repos/caltechlibrary/dataset/releases/231628768","Warn: release artifact v2.3.0 does not have provenance: https://api.github.com/repos/caltechlibrary/dataset/releases/228408372","Warn: release artifact v2.2.8 does not have provenance: https://api.github.com/repos/caltechlibrary/dataset/releases/228083892","Warn: release artifact v2.2.7 does not have provenance: https://api.github.com/repos/caltechlibrary/dataset/releases/224412900"],"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-17T11:18:54.473Z","repository_id":38949377,"created_at":"2025-08-17T11:18:54.473Z","updated_at":"2025-08-17T11:18:54.473Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30412443,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T00:40:14.898Z","status":"online","status_checked_at":"2026-03-12T02:00:07.260Z","response_time":114,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","json"],"created_at":"2024-08-02T05:00:41.374Z","updated_at":"2026-03-12T02:09:13.167Z","avatar_url":"https://github.com/caltechlibrary.png","language":"Go","funding_links":[],"categories":["Go (134)","Development Tools"],"sub_categories":[],"readme":"Dataset Project\n===============\n[![DOI](https://data.caltech.edu/badge/79394591.svg)](https://data.caltech.edu/badge/latestdoi/79394591)\n\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\nThe Dataset Project provides tools for working with collections of JSON documents. It uses a simple key and object pair to organize JSON documents into a collection. 
It supports SQL querying of the objects stored in a collection.\n\nIt is suitable for temporary storage of JSON objects in data processing pipelines as well as a persistent storage mechanism for collections of JSON objects.\n\nThe Dataset Project provides a command line program and a web service for working with JSON objects as a collection or individual objects. As such it is well suited for data science projects as well as building web applications that work with metadata.\n\ndataset, a command line tool\n----------------------------\n\n[dataset](doc/dataset.md) is a command line tool for working with collections of [JSON](https://en.wikipedia.org/wiki/JSON) documents. Collections can be stored on the file system in a [pairtree](https://datatracker.ietf.org/doc/html/draft-kunze-pairtree-01) or stored in a SQL database that supports JSON columns like SQLite3, PostgreSQL or MySQL.\n\nThe __dataset__ command line tool supports common data management operations such as\n\n- initialization of a collection\n- dumping a collection to, and loading it from, JSON lines files\n- CRUD operations on a collection\n- querying a collection using SQL\n\nSee [Getting started with dataset](how-to/getting-started-with-dataset.md) for a tour and tutorial.\n\ndatasetd is dataset implemented as a web service\n------------------------------------------------\n\n[datasetd](docs/datasetd.md) is a JSON REST web service and static file host. It provides a JSON API supporting the main operations found in the __dataset__ command line program. This allows dataset collections to be integrated safely into web applications or be used concurrently by multiple processes.\n\nThe Dataset Web Service can host multiple collections each with their own custom query API defined in a simple YAML configuration file.\n\nDesign choices\n--------------\n\n__dataset__ and __datasetd__ are intended to be simple tools for managing collections of JSON object documents in a predictable structured way. 
The dataset web service allows multi-process or multi-user access to a dataset collection via HTTP.\n\n__dataset__ is guided by the idea that you should be able to work with JSON documents as easily as you can any plain text document on the Unix command line. __dataset__ is intended to be simple to use with minimal setup (e.g. `dataset init mycollection.ds` creates a new collection called 'mycollection.ds').\n\n- __dataset__ and __datasetd__ store JSON object documents in collections\n  - Storage of the JSON documents may be either in a pairtree on disk or in a SQL database using JSON columns (e.g. SQLite3 or MySQL 8)\n  - dataset collections are made up of a directory containing collection.json and codemeta.json files.\n  - collection.json is a metadata file describing the collection, e.g. storage type, name, description, and whether versioning is enabled\n  - codemeta.json is a [codemeta](https://codemeta.github.io) file describing the nature of the collection, e.g. authors, description, funding\n  - collection objects are accessed by their key, a unique identifier, made up of lower case alphanumeric characters\n  - collection names are usually lower case and usually have a `.ds` extension for easy identification\n\n__dataset__ collection storage options\n  - SQL store stores JSON documents in a JSON column\n    - SQLite3 (default), PostgreSQL \u003e= 12 and MySQL 8 are the currently supported SQL databases\n    - A \"DSN URI\" is used to identify and gain access to the SQL database\n    - The DSN URI may be passed through the environment\n  - [pairtree](https://datatracker.ietf.org/doc/html/draft-kunze-pairtree-01) (deprecated, will be removed in v3)\n    - the pairtree path is always lowercase\n    - non-JSON attachments can be associated with a JSON document and are found in directories organized by semver (semantic version number)\n    - versioned JSON documents are created alongside the current JSON document but are named using both their key and semver\n\n__datasetd__ is 
a web service\n  - it is intended as a back end web service run on localhost\n    - it runs on localhost and a designated port (port 8485 is the default)\n    - it supports multiple collections, each with its own configuration for global object permissions and supported SQL queries\n\nThe choice of plain UTF-8 is intended to help future-proof the reading of dataset collections. Care has been taken to keep _dataset_ simple enough and lightweight enough that it will run on a machine as small as a Raspberry Pi Zero while being equally comfortable on a more resource-rich server or desktop environment. _dataset_ can be re-implemented in any programming language supporting file input and output, common string operations, and JSON encoding and decoding functions. The current implementation is in the Go language.\n\nFeatures\n--------\n\n[dataset](docs/dataset.md) supports\n\n- Collection level\n  - [Initialize](docs/init.md) a new dataset collection\n  - Codemeta file support for describing the collection contents\n  - [Dump](docs/load.md) a collection to a JSON lines document\n  - [Load](docs/load.md) a collection from a JSON lines document\n  - Listing [Keys](docs/keys.md) in a collection\n- Object level actions\n  - [create](docs/create.md)\n  - [read](docs/read.md)\n  - [update](docs/update.md)\n  - [delete](docs/delete.md)\n  - [keys](docs/keys.md)\n  - [has-key](docs/haskey.md)\n  - Documents as attachments\n    - [attachments](docs/attachments.md) (list)\n    - [attach](docs/attach.md) (create/update)\n    - [retrieve](docs/retrieve.md) (read)\n    - [prune](docs/prune.md) (delete)\n\n[datasetd](docs/datasetd.md) supports\n\n- List [collections](docs/collections-endpoint.md) available from the\n  web service\n- List a [collection](collection-endpoint.md)'s metadata\n- List a collection's [Keys](docs/keys-endpoint.md)\n- Object level actions\n    - [create](docs/create-endpoint.md)\n    - [read](docs/read-endpoint.md)\n    - 
[update](docs/update-endpoint.md)\n    - [delete](docs/delete-endpoint.md)\n    - Documents as attachments\n        - [attach](docs/attach-endpoint.md)\n        - [retrieve](docs/retrieve-endpoint.md)\n        - [prune](docs/prune-endpoint.md)\n\n\nBoth __dataset__ and __datasetd__ may be useful for general data science applications needing JSON object management or in implementing repository systems in research libraries and archives.\n\n\nLimitations of __dataset__ and __datasetd__\n-------------------------------------------\n\n__dataset__ has many limitations; some are listed below\n\n- the pairtree implementation is not a multi-process, multi-user data store\n- it is not a general purpose database system\n- it stores all keys in lower case in order to deal with file systems that are not case sensitive\n- it stores collection names as lower case to deal with file systems that\n  are not case sensitive\n- **it should NOT be used for sensitive, confidential or secret information** because it lacks access controls and data encryption\n\n__datasetd__ is a simple web service intended to run on \"localhost:8485\".\n\n- it does not include support for authentication\n- it does not support access control for users or roles\n- it does not encrypt the data it stores\n- it does not support HTTPS\n- it does not provide auto key generation\n- it limits the size of JSON documents stored to the size supported by\n  the host SQL JSON columns\n- it limits the size of attached files to less than 250 MiB\n- it does not support partial JSON record updates or retrieval\n- it does not provide an interactive Web UI for working with dataset\n  collections\n- **it should NOT be used for sensitive, confidential or secret information** because it lacks access controls and data encryption\n\n\nRead next ...\n-------------\n\n- About the [dataset](docs/dataset.md) command\n- About the [datasetd](docs/datasetd.md) web service\n- [Installation](INSTALL.md)\n- [License](LICENSE)\n- [Contributing](CONTRIBUTING.md)\n- 
[Code of conduct](CODE_OF_CONDUCT.md)\n- Explore __dataset__ and __datasetd__\n    - [Getting Started with Dataset](how-to/getting-started-with-dataset.md \"Python examples as well as command line\")\n    - [How To](how-to/) guides\n    - [Reference Documentation](docs/)\n    - [Topics](docs/topics.md)\n\nAuthors and history\n-------------------\n\n- R. S. Doiel\n- Tommy Morrell\n\nReleases\n--------\n\nCompiled versions are provided for Linux (x86, aarch64), Mac OS X (x86 and M1), Windows 11 (x86, aarch64) and Raspberry Pi OS. \n\n[github.com/caltechlibrary/dataset/releases](https://github.com/caltechlibrary/dataset/releases)\n\nRelated projects\n----------------\n\nYou can use __dataset__ from Python via the [py_dataset](https://github.com/caltechlibrary/py_dataset) package. \n\nYou can use __dataset__ from Deno+TypeScript by running datasetd and accessing it with [ts_dataset](https://github.com/caltechlibraray/ts_dataset).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaltechlibrary%2Fdataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcaltechlibrary%2Fdataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaltechlibrary%2Fdataset/lists"}