{"id":20624921,"url":"https://github.com/ofnote/aos","last_synced_at":"2025-04-15T15:04:47.529Z","repository":{"id":57410757,"uuid":"235520843","full_name":"ofnote/aos","owner":"ofnote","description":"A regex-like shape language for arbitrary data","archived":false,"fork":false,"pushed_at":"2021-01-27T08:19:59.000Z","size":311,"stargazers_count":5,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T21:34:44.151Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ofnote.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-22T07:26:41.000Z","updated_at":"2023-04-24T18:41:32.000Z","dependencies_parsed_at":"2022-08-28T01:14:02.005Z","dependency_job_id":null,"html_url":"https://github.com/ofnote/aos","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofnote%2Faos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofnote%2Faos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofnote%2Faos/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ofnote%2Faos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ofnote","download_url":"https://codeload.github.com/ofnote/aos/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248606801,"owners_count":21132412,"icon_url":"https://github.com/git
hub.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T13:07:20.997Z","updated_at":"2025-04-15T15:04:47.508Z","avatar_url":"https://github.com/ofnote.png","language":"Python","readme":"![experimental](https://img.shields.io/badge/stability-experimental-orange.svg)\n\n# And-Or Shape (aos) Language\n\n\u003cimg src=\"./docs/aos-trees.png\" alt=\"Data as Trees\" width=\"500px\" /\u003e\n\nWriting data pipelines involves complex data transformations over *nested* data, e.g., list of dictionaries or dictionary of tensors. \n\n- The *shape* of nested data is not explicit in code and hence not accessible readily to the developer.\n- Leads to cognitive burden (guessing shapes), technical debt and inadvertent programming errors.\n- Data pipelines are very opaque to examination and comprehension.\n\n---\n\n`aos` is a compact, regex-like language for describing the *shapes* (schemas) of both *homogeneous* (tensors) and *heterogeneous* (dictionaries, tables) data, and combinations, independent of the specific data library. \n\n* Based on an intuitive **regex-like** algebra of data shapes.\n* [**Infer**](#Shape-Inference) `aos` shape from a data instance: `aos.infer.infer_aos`.\n* [**Validate**](#shapeschema-validation) data against `aos` shapes anywhere: `aos.checker.instanceof`.\n* [**Transform**](#transformations-with-aos) data using `aos` shapes, declaratively: `aos.tfm.do_tfm`.\n* Allows writing explicit data shapes, **inline** in code. 
In Python, use type annotations.\n* Write shapes for a variety of data conveniently -- Python native objects (`dict`, `list`, scalars), tensors (`numpy`, `pytorch`, `tf`), `pandas`, `hdf5`, `tiledb`, `xarray`, `struct-tensor`, etc.\n\nAn **article** on *aos* is [here](https://medium.com/@ekshakhs/aos-wrangle-nested-data-with-regular-exprs-5510a27bab13).\n\n\n\n### Installation\n\n```pip install aos```\n\n## Shape of Data?\n\nConsider a few quick examples.\n\n- The shape of scalar data is simply its type, e.g., `int`, `float`, `str`, ...\n- For nested data, e.g., a list of `int`s: `(int)*`\n- For a dictionary of the form `{'a': 3, 'b': 'hi'}`, the shape is `(a \u0026 int) | (b \u0026 str)`.\n\nNow, we can describe the shape of *arbitrary, nested* data with these `\u0026` (and) and `|` (or) expressions. Intuitively, a list is an `or`-structure, a dictionary is an `or` of `and`s, a tensor is an `and`-structure, and so on.\n\n* Why is a `list` an or-structure? Ask: how do we *access* any value `v` in the `list`? Choose **some** index of the list, corresponding to the value `v`.\n* Similarly, a `dictionary` is an or-and structure: we pick **one** of the *key*s, together (**and**) with its *value*.\n* In contrast, an n-dimensional `tensor` has an `and`-shape: we must choose indices from *all* the dimensions of the tensor to access a scalar value.\n* In general, for a data structure, we *ask*: what choices must we make to access a scalar value?\n\nThinking in terms of `and`-`or` shapes takes a bit of practice initially. Read more about the and-or expressions [here](docs/and-or-thinking.md).\n\n#### More complex `aos` examples\n\n* Lists over shape `s` are denoted as `(s)*`, shorthand for `(s|..|s)`.\n* Dictionary: `(k1 \u0026 v1) | (k2 \u0026 v2) | ... | (kn \u0026 vn)` where `ki` and `vi` are the `i`th key and value.\n* Pandas tables: `(n \u0026 ( (c1\u0026int)| (c2\u0026str) | ... 
| (cn\u0026str) ))` where `n` is the row dimension (the number of rows) and `c1,...,cn` are column names.\n\nThe `aos` expressions are very *compact*. For example, consider a highly nested Python object `X` of type\n\n `Sequence[Tuple[Tuple[str, int], Dict[str, str]]]`  \n\nThis is both verbose and hard to interpret. Instead, `X`'s `aos` is written compactly as\n `((str|int) | (str : str))* `.\n\n\u003e The full data shape may be irrelevant in many cases. To keep shapes brief, the language supports the wildcards `_` and `...` for writing partial shapes.\n\u003e\n\u003e So, we could write a dictionary's shape as `(k1 \u0026 ...)| ... | (kn \u0026 ...)`.\n\n\n\n## Shape Inference\n\nUnearthing the shape of opaque data instances, e.g., returned from a web request or passed into a function call, is a major pain.\n\n* Use `aos.infer.infer_aos` to obtain compact shapes of arbitrary data instances.\n* From the command line, run `aos-infer \u003cfilename.json\u003e`\n\n```python\nfrom aos.infer import infer_aos\n\ndef test_infer():\n\n  d = {\n      \"checked\": False,\n      \"dimensions\": { \"width\": 5, \"height\": 10},\n      \"id\": 1,\n      \"name\": \"A green door\",\n      \"price\": 12.5,\n      \"tags\": [\"home\",\"green\"]\n  }\n\n  # shape of a single dictionary\n  infer_aos(d)\n\n  # ((checked \u0026 bool) \n  # | (dimensions \u0026 ((width \u0026 int) | (height \u0026 int)))\n  # | (id \u0026 int) | (name \u0026 str) | (price \u0026 float) | (tags \u0026 (str *)))\n  \n  # shape of a list of 100 such dictionaries: same shape, followed by `*`\n  dlist = []\n  for i in range(100):\n      d['id'] = i\n      dlist.append(d.copy())\n      \n  infer_aos(dlist)\n\n  # ((checked \u0026 bool) \n  # | (dimensions \u0026 ((width \u0026 int) | (height \u0026 int)))\n  # | (id \u0026 int) | (name \u0026 str) | (price \u0026 float) | (tags \u0026 (str *)))*\n\n```\n\n\n\n## Shape/Schema Validation\n\nUsing `aos.checker.instanceof`, we can \n\n* write `aos` assertions to validate data shapes (schemas). 
\n* validate data structure partially using placeholders: `_` matches a scalar, `...` matches an arbitrary object (sub-tree).\n* work with Python objects, pandas, numpy, ...; extensible to other data types (libraries).\n\n```python\nimport numpy as np\nimport pandas as pd\nimport torch\n\nfrom aos.checker import instanceof\n\ndef test_pyobj():\n    d = {'city': 'New York', 'country': 'USA'}\n    t1 = ('Google', 2001)\n    t2 = (t1, d)\n\n    instanceof(t2, '(str | int) | (str \u0026 str)') # valid\n    instanceof(t2, '... | (str \u0026 _)') # valid\n    instanceof(t2, '(_ | _) | (str \u0026 int)') # error\n    \n    tlist = [('a', 1), ('b', 2)]\n    instanceof(tlist, '(str | int)*') # valid\n\ndef test_pandas():\n    d = {'id': 'CS2_056', 'cost': 2, 'name': 'Tap'}\n    # one row; the column names come from the dictionary keys\n    df = pd.DataFrame([d])\n\n    instanceof(df, '1 \u0026 (id | cost | name)')\n\ndef test_numpy():\n    arr = np.array([[1,2,3],[4,5,6]])\n    instanceof(arr, '2 \u0026 3')\n\ndef test_pytorch():\n    arr = torch.tensor([[1,2,3],[4,5,6]])\n    instanceof(arr, '2 \u0026 3')\n```\n\n\n\n## Transformations with AOS\n\nBecause `aos` expressions can both *match* and *specify* heterogeneous data shapes, we can write `aos` **rules** to **transform** data. \n\nThe rules are written as `lhs -\u003e rhs`, where both `lhs` and `rhs` are `aos` expressions:\n\n* `lhs` *matches* a part (sub-tree) of the input data instance *I*. 
\n* `query` variables in the `lhs` *capture* (bind with) parts of *I*.\n* `rhs` specifies the expected shape (aos) of the output data instance *O*.\n\nTo write rules, ask: which *parts* of *I* do we need to construct *O*?\n\n```python\nfrom aos.tfm import do_tfm\n\ndef tfm_example():\n    # input data\n    I = {'items': [{'k': 1}, {'k': 2}, {'k': 3}],\n        'names': ['A', 'B', 'C']}\n\n    # specify the transformation (left aos -\u003e right aos)\n    # using `query` variables `k` and `v`\n    \n    # here `k` binds with each of the keys in the list and\n    # `v` binds with the corresponding value\n    # the `lhs` automatically ignores parts of I that are irrelevant to O\n    \n    tfm = 'items \u0026 (k \u0026 v)* -\u003e values \u0026 (v)*'\n\n    O = do_tfm(I, tfm)\n    print(O) # {'values': [1, 2, 3]}\n```\n\n\n\nThe above example illustrates a simple JSON transformation using `aos` rules. Rules can be more complex, e.g., include *conditions* or *function* application on query variables. They work not only with JSON data, but also apply to heterogeneous nested objects.\n\nSee more examples [here](tests/test_tfm_json.py) and [here](tests/test_tfm_spark_json.py). \n\n\n\n## And-Or Shape Dimensions\n\nThe above examples use strings, type names (`str`), or integer values (`2`, `3`) in shape expressions. A more principled approach is to first declare **dimension names** and then define shapes over these names. \n\nData is defined over two kinds of dimensions:\n\n* **Continuous**. A range of values, e.g., a numpy array of shape (5, 200) is defined over two continuous dimensions, say `n` and `d`, where `n` ranges over values `0-4` and `d` ranges over `0-199`.\n* **Categorical**. A set of names, e.g., a dictionary `{'a': 4, 'b': 5}` is defined over *keys* (dim names) `['a', 'b']`. One can also view each key, e.g., `a` or `b`, as a **Singleton** dimension.\n\n\n\n**Programmatic API**. 
The library provides an API to declare both types of dimensions and `aos` expressions over these dimensions, e.g., declare `n` and `d` as two continuous dimensions and then define the shape `n \u0026 d`.\n\n\n\n## Status\n\n*The library is under active development. More documentation coming soon.*\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofnote%2Faos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fofnote%2Faos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofnote%2Faos/lists"}