{"id":25981296,"url":"https://github.com/stathissideris/spec-provider","last_synced_at":"2025-05-16T04:05:43.070Z","repository":{"id":39620085,"uuid":"60732356","full_name":"stathissideris/spec-provider","owner":"stathissideris","description":"Infer Clojure specs from sample data. Inspired by F#'s type providers.","archived":false,"fork":false,"pushed_at":"2020-05-24T21:14:45.000Z","size":174,"stargazers_count":515,"open_issues_count":11,"forks_count":22,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-05-14T02:33:22.727Z","etag":null,"topics":["clojure","clojure-spec"],"latest_commit_sha":null,"homepage":"","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stathissideris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-06-08T21:35:19.000Z","updated_at":"2025-05-01T17:41:45.000Z","dependencies_parsed_at":"2022-09-16T11:32:14.626Z","dependency_job_id":null,"html_url":"https://github.com/stathissideris/spec-provider","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stathissideris%2Fspec-provider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stathissideris%2Fspec-provider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stathissideris%2Fspec-provider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stathissideris%2Fspec-provider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stathissideris","download_url":"https://codeload.github.com/stath
issideris/spec-provider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254464895,"owners_count":22075570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","clojure-spec"],"created_at":"2025-03-05T08:25:14.842Z","updated_at":"2025-05-16T04:05:41.917Z","avatar_url":"https://github.com/stathissideris.png","language":"Clojure","funding_links":[],"categories":["Generators"],"sub_categories":[],"readme":"# spec-provider\n\n![](https://circleci.com/gh/stathissideris/spec-provider.svg?\u0026style=shield\u0026circle-token=8aed611e2ff989f042a00dcb5886803db7bbe34c)\n\nThis is a library that will produce a best-guess\n[Clojure spec](https://clojure.org/guides/spec) based on multiple\nexamples of in-memory data. The inferred spec is *not* meant to be\nused as is and without human intervention, it is rather a starting\npoint that can (and should) be refined.\n\nThe idea is analogous to F# type providers -- specifically the JSON\ntype provider, but the input in the case of spec-provider is any\nin-memory Clojure data structure.\n\nSince Clojure spec is still in alpha, this library should also be\nconsidered to be in alpha -- so, highly experimental, very likely to\nchange, possibly flawed.\n\nThis library works in both Clojure and ClojureScript.\n\nMaturity level: mature and useful. 
Has not reached full potential as\nsome ideas are still unexplored.\n\n\n## Usage\n\nTo use this library, add this dependency to your Leiningen `project.clj` file:\n\n```\n[spec-provider \"0.4.14\"]\n```\n\n[Version history](https://github.com/stathissideris/spec-provider/blob/master/doc/history.md)\n\n## Use cases\n\nThere are two main use cases for spec-provider:\n\n1. You have a lot of examples of raw data (maybe in a JSONB column of\n   a PostgreSQL table) and you'd like to:\n\n   * See a summary of what shape the data is. You can use\n     spec-provider as a way to explore new datasets.\n\n   * You already know what shape your data is, and you just want some\n     help getting started writing a spec for it because your data is\n     deeply nested, has a lot of corner cases, you're lazy etc.\n\n   * You *think* you know what shape your data is, but because it's\n     neither type checked nor contract checked, some exceptions have\n     sneaked into it. Instead of eyeballing 100,000 maps, you run\n     spec-provider on them and to your surprise you find that one of\n     the fields is `(s/or :integer integer? :string string?)` instead\n     of just string as you expected. You can use spec-provider as a\n     data debugging tool.\n\n2. You have an un-spec'ed function and you also have a good way to\n   exercise it (via unit tests, actual usage etc). 
You can instrument\n   the function with spec-provider, run it a few times with actual\n   data, and then ask spec-provider for the function spec based on the\n   data that flowed through the function.\n\n## Inferring the spec of raw data\n\nTo infer a spec of a bunch of data just pass the data to the\n`infer-specs` function:\n\n```clojure\n\u003e (require '[spec-provider.provider :as sp])\n\n\u003e (def inferred-specs\n    (sp/infer-specs\n     [{:a 8  :b \"foo\" :c :k}\n      {:a 10 :b \"bar\" :c \"k\"}\n      {:a 1  :b \"baz\" :c \"k\"}]\n     :toy/small-map))\n\n\u003e inferred-specs\n\n((clojure.spec.alpha/def :toy/c (clojure.spec/or :keyword keyword? :string string?))\n (clojure.spec.alpha/def :toy/b string?)\n (clojure.spec.alpha/def :toy/a integer?)\n (clojure.spec.alpha/def :toy/small-map (clojure.spec/keys :req-un [:toy/a :toy/b :toy/c])))\n```\n\nThe sequence of specs that you get out of `infer-specs` is technically\ncorrect, but not very useful for pasting into your code. Luckily, you\ncan do:\n\n```clojure\n\u003e (sp/pprint-specs inferred-specs 'toy 's)\n\n(s/def ::c (s/or :keyword keyword? :string string?))\n(s/def ::b string?)\n(s/def ::a integer?)\n(s/def ::small-map (s/keys :req-un [::a ::b ::c]))\n```\n\nPassing `'toy` to `pprint-specs` signals that we intend to paste this\ncode into the `toy` namespace, so spec names are printed using the\n`::` syntax.\n\nPassing `'s` signals that we are going to require clojure.spec as `s`,\nso the calls to `clojure.spec/def` become `s/def` etc.\n\n### Nested data structures\n\nspec-provider will walk nested data structures in your sample data and\nattempt to infer specs for everything.\n\nLet's use clojure.spec to generate a larger sample of data with nested\nstructures.\n\n```clojure\n(s/def ::id (s/or :numeric pos-int? :string string?))\n(s/def ::codes (s/coll-of keyword? 
:max-gen 5))\n(s/def ::first-name string?)\n(s/def ::surname string?)\n(s/def ::k (s/nilable keyword?))\n(s/def ::age (s/with-gen\n               (s/and integer? pos? #(\u003c= % 130))\n               #(gen/int 130)))\n(s/def :person/role #{:programmer :designer})\n(s/def ::phone-number string?)\n\n(s/def ::street string?)\n(s/def ::city string?)\n(s/def ::country string?)\n(s/def ::street-number pos-int?)\n\n(s/def ::address\n  (s/keys :req-un [::street ::city ::country]\n          :opt-un [::street-number]))\n\n(s/def ::person\n  (s/keys :req-un [::id ::first-name ::surname ::k ::age ::address]\n          :opt-un [::phone-number ::codes]\n          :req    [:person/role]))\n```\n\nThis spec can be used to generate a reasonably large random sample of\npersons:\n\n```clojure\n(def persons (gen/sample (s/gen ::person) 100))\n```\n\nWhich generates structures like:\n\n```clojure\n{:id \"d7FMcH52\",\n :first-name \"6\",\n :surname \"haFsA\",\n :k :a-*?DZ/a,\n :age 5,\n :person/role :designer,\n :address {:street \"Yrx963uDy\", :city \"b\", :country \"51w5NQ6\", :street-number 53},\n :codes\n [:*.?m_o-9_j?b.N?_!a+IgUE._coE.S4l4_8_.MhN!5_!x.axztfh.x-/?*\n  :*-DA?+zU-.T0u5R.evD8._r_D!*K0Q.WY-F4--.O*/**O+_Qg+\n  :Bh8-A?t-f]}\n```\n\nNow watch what happens when we infer the spec of `persons`:\n\n```clojure\n\u003e (sp/pprint-specs\n   (sp/infer-specs persons :person/person)\n   'person 's)\n\n(s/def ::codes (s/coll-of keyword?))\n(s/def ::phone-number string?)\n(s/def ::street-number integer?)\n(s/def ::country string?)\n(s/def ::city string?)\n(s/def ::street string?)\n(s/def\n ::address\n (s/keys :req-un [::street ::city ::country] :opt-un [::street-number]))\n(s/def ::age integer?)\n(s/def ::k (s/nilable keyword?))\n(s/def ::surname string?)\n(s/def ::first-name string?)\n(s/def ::id (s/or :string string? 
:integer integer?))\n(s/def ::role #{:programmer :designer})\n(s/def\n ::person\n (s/keys\n  :req [::role]\n  :req-un [::id ::first-name ::surname ::k ::age ::address]\n  :opt-un [::phone-number ::codes]))\n```\n\nWhich is very close to the original spec. We are going to break down\nthis result to bring attention to specific features in the following\nsections.\n\n#### Nilable\n\nIf the sample data contain any `nil` values, this is detected and\nreflected in the inferred spec:\n\n```clojure\n(s/def ::k (s/nilable keyword?))\n```\n\n#### Optional detection\n\nThings like `::street-number`, `::codes` and `::phone-number` did not\nappear consistently in the sampled data, so they are correctly\nidentified as optional in the inferred spec.\n\n```clojure\n(s/def\n ::address\n (s/keys :req-un [::street ::city ::country] :opt-un [::street-number]))\n```\n\n#### Qualified vs unqualified keys\n\nMost of the keys in the sample data are not qualified, and they are\ndetected as such in the inferred spec. The `:person/role` key is\nidentified as fully qualified.\n\n```clojure\n(s/def\n ::person\n (s/keys\n  :req [::role]\n  :req-un [::id ::first-name ::surname ::k ::age ::address]\n  :opt-un [::phone-number ::codes]))\n```\n\nNote that the `s/def` for role is pretty printed as `::role` because\nwhen calling `pprint-specs` we indicated that we are going to paste\nthis into the `person` namespace.\n\n```clojure\n\u003e (sp/pprint-specs\n   (sp/infer-specs persons :person/person)\n   'person 's)\n\n...\n\n(s/def ::role #{:programmer :designer})\n```\n\n#### Enumerations\n\nYou may have also noticed that role has been identified as an\nenumeration of `:programmer` and `:designer`. To see how it's decided\nwhether a field is an enumeration or not, we have to look under the\nhood. 
Let's generate a small sample of roles:\n\n```clojure\n\u003e (gen/sample (s/gen ::role) 5)\n\n(:designer :designer :designer :designer :programmer)\n```\n\nspec-provider collects statistics about all the sample data before\ndeciding on the spec:\n\n```clojure\n\u003e (require '[spec-provider.stats :as stats])\n\u003e (stats/collect-stats (gen/sample (s/gen ::role) 5) {})\n\n#:spec-provider.stats{:distinct-values #{:programmer :designer},\n                      :sample-count 5,\n                      :pred-map {#function[clojure.core/keyword?] #:spec-provider.stats{:sample-count 5}}}\n```\n\nThe stats include a set of distinct values observed (up to a certain\nlimit), the sample count for each field, and counts on each of the\npredicates that the field matches -- in this case just\n`keyword?`. Based on these statistics, the spec is inferred and a\ndecision is made on whether the value is an enumeration or not.\n\nIf the following statement is true, then the value is considered an\nenumeration:\n\n```clojure\n(\u003e= 0.1\n    (/ (count distinct-values)\n       sample-count))\n```\n\nIn other words, if the number of distinct values found is at most\n10% of the total recorded values, then the value is an\nenumeration. This threshold is configurable.\n\nLooking at the actual numbers can make this logic easier to\nunderstand. For the small sample above:\n\n```clojure\n\u003e (sp/infer-specs (gen/sample (s/gen ::role) 5) ::role)\n\n((clojure.spec/def :spec-provider.person-spec/role keyword?))\n```\n\nWe have 2 distinct values in a sample of 5, which is 40% of the values\nbeing distinct. Imagine this percentage in a larger sample, say\n800 distinct values in a sample of size 2000. 
That doesn't sound\nlikely to be an enumeration, so it's interpreted as a normal value.\n\nIf you increase the sample:\n\n```clojure\n\u003e (sp/infer-specs (gen/sample (s/gen ::role) 100) ::role)\n\n((clojure.spec/def :spec-provider.person-spec/role #{:programmer :designer}))\n```\n\nWe have 2 distinct values in a sample of 100, which is 2%, which means\nthat the same values appear again and again in the sample, so it must\nbe an enumeration.\n\n#### Merging\n\nspec-provider makes the same assumption as clojure.spec that keys that\nhave the same name also have the same data shape as their value, even when\nthey appear in different maps. This means that the specs from\ndifferent maps are merged by key.\n\nTo demonstrate this we need to \"spike\" the generated persons with an\nid field that's inconsistent with the existing\n`(s/or :numeric pos-int? :string string?)`:\n\n```clojure\n(defn add-inconsistent-id [person]\n  (if (:address person)\n    (assoc-in person [:address :id] (gen/generate (gen/keyword)))\n    person))\n\n(def persons-spiked (map add-inconsistent-id (gen/sample (s/gen ::person) 100)))\n```\n\nInferring the spec of `persons-spiked` yields a different result for\nids:\n\n```clojure\n\u003e (sp/pprint-specs\n   (sp/infer-specs persons-spiked :person/person)\n   'person 's)\n\n...\n(s/def ::id (s/or :string string? :integer integer? :keyword keyword?))\n...\n```\n\n#### Do I know you from somewhere?\n\nThis feature is not illustrated by the person example, but before\nreturning them, spec-provider will walk the inferred specs and look\nfor forms that already occur elsewhere and replace them with the name\nof the known spec. 
For example:\n\n```clojure\n\u003e (sp/pprint-specs\n    (sp/infer-specs [{:a [{:zz 1}] :b {:zz 2}}\n                     {:a [{:zz 1} {:zz 4} nil] :b nil}] ::foo) *ns* 's)\n\n(s/def ::zz integer?)\n(s/def ::b (s/nilable (s/keys :req-un [::zz])))\n(s/def ::a (s/coll-of ::b))\n(s/def ::foo (s/keys :req-un [::a ::b]))\n```\n\nIn this case, because maps like `{:zz 2}` appear under the key `:b`,\nspec-provider knows what to call them, so it uses that name for\n`(s/def ::a (s/coll-of ::b))`. This replacement is not performed if\nthe spec definition is a predicate from the `clojure.core` namespace.\n\n#### Inferring specs with numerical ranges\n\nspec-provider collects stats about the min/max values of numerical\nfields, but will not output them in the inferred spec by default. To\nget range predicates in your specs you have to pass the\n`:spec-provider.provider/range` option:\n\n```clojure\n\u003e (require '[spec-provider.provider :refer :all :as sp])\n\n\u003e (pprint-specs\n    (infer-specs [{:foo 3, :bar -400}\n                  {:foo 3, :bar 4}\n                  {:foo 10, :bar 400}] ::stuff {::sp/range true})\n    *ns* 's)\n\n(s/def ::bar (s/and integer? (fn [x] (\u003c= -400 x 400))))\n(s/def ::foo (s/and integer? (fn [x] (\u003c= 3 x 10))))\n(s/def ::stuff (s/keys :req-un [::bar ::foo]))\n```\n\nYou can also restrict range predicates to specific keys by passing a\nset of qualified keys that are the names of the specs that should get\na range predicate:\n\n```clojure\n\u003e (sp/pprint-specs\n    (sp/infer-specs [{:foo 3, :bar -400}\n                     {:foo 3, :bar 4}\n                     {:foo 10, :bar 400}] ::stuff {::sp/range #{::foo}})\n    *ns* 's)\n\n(s/def ::bar integer?)\n(s/def ::foo (s/and integer? 
(fn [x] (\u003c= 3 x 10))))\n(s/def ::stuff (s/keys :req-un [::bar ::foo]))\n```\n\n### How it's done\n\nInferring a spec from raw data is a two-step process: stats collection\nand then summarization of the stats into specs.\n\nFirst each data structure is visited recursively and statistics are\ncollected at each level about the types of values that appear, the\ndistinct values for each field (up to a limit), min and max values for\nnumbers, lengths for sequences etc.\n\nTwo important points about stats collection:\n\n* Spec-provider **will not** run out of memory even if you throw a lot\n  of data at it because it updates the same statistics data structure\n  with every new example datum it receives.\n\n* Collecting stats will (at least partly) realize lazy sequences.\n\nAfter stats collection, code from the `spec-provider.provider`\nnamespace goes through the stats and summarizes them as a collection\nof specs.\n\n### Alternative uses\n\nAs mentioned in the previous section, spec-provider first collects\nstatistics about the data that you pass to it and then it uses them to\ninfer specs for this data. The entry point for collecting stats is the\n`spec-provider.stats/collect` function. This can be used to explore\nyour data and give you insight about its structure, as was very\nnicely explained in\n[this blog post](https://akvo.org/blog/production-data-never-lies/) by\nDan Lebrero.\n\n### Options\n\nAssume this:\n\n```clojure\n(require '[spec-provider.provider :as sp]\n         '[spec-provider.stats :as stats])\n```\n\nThere is only one option that affects how the specs are inferred and\nit can be passed as a map in an extra parameter to `sp/infer-specs`:\n\n* `::sp/range` If true, all numerical specs include a range\n  predicate. If it's a set of spec names (qualified keywords), only\n  these specs will include range predicates. 
See section\n  [Inferring specs with numerical ranges](#inferring-specs-with-numerical-ranges)\n  for an example (default false).\n\nThere are a number of options that can affect how the sample stats are\ncollected (and consequently also affect what spec is inferred). These\noptions are passed to `stats/collect`, or as part of the options map\npassed to `sp/infer-specs`.\n\n* `::stats/distinct-limit` How many distinct values are collected for\n  collections (default 10).\n\n* `::stats/coll-limit` How many elements of the collection are used to\n  infer/collect data about the type of the contained element (default\n  101). This means that lazy sequences are at least partly realized.\n\n* `::stats/positional` Results in positional stats being collected for\n  sequences, so that `s/cat` can be inferred instead of `s/coll-of`\n  (default false).\n\n* `::stats/positional-limit` Bounds the positional stats length\n  (default 100).\n\n## Inferring the spec of functions\n\nUndocumented/under development: there is experimental support for\ninstrumenting functions for the purpose of inferring the spec of args\nand return values.\n\n## Limitations\n\n* There is no attempt to infer the regular expression of collections.\n* There is no attempt to infer tuples.\n* There is no attempt to infer `multi-spec`.\n* For functions, only the `:args` and `:ret` parts of the spec are\n  generated; the `:fn` part is up to you.\n* Spec-provider assumes that you want to follow the Clojure spec\n  convention that the same map keys identify the same \"entity\", so it\n  will merge stats that appear under identical keys but in\n  different parts of your tree structure. This may not be what you\n  want. For more details see the \"Merging\" section.\n\n## FAQ\n\n* Will I run out of memory if I pass a lot of examples of my data to\n  `infer-specs`?\n\n  No, stats collection works by updating the same data structure with\n  every example of data received. 
The data structure will initially\n  grow a bit and then maintain a constant size. That means that you\n  can use a lazy sequence to stream your huge table through it if you\n  feel that's necessary (not tested!).\n\n* Can I do this for Prismatic schema?\n\n  The hard part of inferring a spec is collecting the\n  statistics. Summarizing the stats as specs was relatively easy, so\n  plugging in a different \"summarizer\" that will output schemas from\n  the same stats should be possible. Look at the `provider` namespace,\n  write the schema equivalent and send me a pull request!\n\n## Developers\n\nRun Clojure unit tests with:\n\n```\nlein test\n```\n\nRun ClojureScript unit tests with (default setup uses node):\n\n```\nlein doo\n```\n\nRun self-hosted ClojureScript unit tests with:\n\n```\nlein tach lumo\n```\n\nand\n\n```\nlein tach planck\n```\n\n## Contributors\n\n* [Stathis Sideris](https://github.com/stathissideris) - original author\n* [Paulo Rafael Feodrippe](https://github.com/pfeodrippe)\n* [Dan Lebrero](https://github.com/dlebrero)\n* [Marco Molteni](https://github.com/marco-m)\n* [Allan Jiang](https://github.com/jiangts)\n* [Gibran Rosa](https://github.com/gibranrosa)\n* [Mike Fikes](https://github.com/mfikes)\n\n## License\n\nCopyright © 2016-2018 Stathis Sideris\n\nDistributed under the Eclipse Public License either version 1.0 or (at\nyour option) any later version.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstathissideris%2Fspec-provider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstathissideris%2Fspec-provider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstathissideris%2Fspec-provider/lists"}