{"id":29293927,"url":"https://github.com/galliaproject/gallia-core","last_synced_at":"2026-02-12T15:07:10.048Z","repository":{"id":57733277,"uuid":"336357768","full_name":"galliaproject/gallia-core","owner":"galliaproject","description":"A schema-aware Scala library for data transformation","archived":false,"fork":false,"pushed_at":"2024-02-23T15:49:10.000Z","size":2701,"stargazers_count":88,"open_issues_count":0,"forks_count":4,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-06T12:01:48.561Z","etag":null,"topics":["data-engineering","data-manipulation","data-science","data-transformation","etl","feature-engineering","json","nesting","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/galliaproject.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-05T18:36:52.000Z","updated_at":"2025-04-27T17:09:49.000Z","dependencies_parsed_at":"2024-01-30T22:36:00.411Z","dependency_job_id":"d1740bb7-cd6c-435b-823b-90a022b65597","html_url":"https://github.com/galliaproject/gallia-core","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/galliaproject/gallia-core","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galliaproject%2Fgallia-core","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galliaproject%2Fgallia-core/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galliaproject%2Fgallia
-core/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galliaproject%2Fgallia-core/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/galliaproject","download_url":"https://codeload.github.com/galliaproject/gallia-core/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/galliaproject%2Fgallia-core/sbom","scorecard":{"id":417679,"data":{"date":"2025-08-11","repo":{"name":"github.com/galliaproject/gallia-core","commit":"6b16e40b290aaee29595fad168685610a06116df"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has 
open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}}]},"last_synced_at":"2025-08-19T00:20:29.205Z","repository_id":57733277,"created_at":"2025-08-19T00:20:29.205Z","updated_at":"2025-08-19T00:20:29.205Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29369493,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-manipulation","data-science","data-transformation","etl","feature-engineering","json","nesting","scala","spark"],"created_at":"2025-07-06T12:00:52.444Z","updated_at":"2026-02-12T15:07:10.042Z","avatar_url":"https://github.com/galliaproject.png","language":"Scala","funding_links":[],"categories":["Table of Contents"],"sub_categories":["Big Data"],"readme":"\u003cp align=\"center\"\u003e\u003cimg src=\"./images/logo.png\" alt=\"icon\"\u003e\u003c/p\u003e\n\n# Introducing Gallia: a Scala library for data transformation\nby \u003ca href=\"http://anthonycros.com/\" target=\"_blank\"\u003eAnthony Cros\u003c/a\u003e (2021)\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"./images/trivial_example.png\" alt=\"trivial_example\"\u003e\u003c/p\u003e\n\n## Introduction\n\n\u003ca 
name=\"210121153145\"\u003e\u003c/a\u003e\n_Gallia_ is a Scala library for generic data transformation whose main goals are:\n\n1. \u003ca name=\"210127120327\"\u003e\u003c/a\u003e Practicality\n2. \u003ca name=\"210127120328\"\u003e\u003c/a\u003e Readability\n3. \u003ca name=\"210127120329\"\u003e\u003c/a\u003e Scalability (optionally)\n\nExecution happens in two phases, each traversing a dedicated execution [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph):\n1. An initial _meta_ phase which ignores the data entirely and ensures that transformation steps are consistent (schema-wise)\n2. A subsequent _data_ phase where the data is actually transformed\n\n\u003ca name=\"210121153146\"\u003e\u003c/a\u003e\n\u003ca name=\"210202173234\"\u003e\u003c/a\u003e\nSee introductory articles in *Towards Data Science*: [Introduction](https://towardsdatascience.com/gallia-a-library-for-data-transformation-3fafaaa2d8b9) and [Follow-up](https://towardsdatascience.com/data-transformations-in-scala-with-gallia-version-0-4-0-is-out-f0b8df3e48f3).\nThe rest of this README serves as temporary documentation.\nMore thorough discussions of design choices/limitations/direction will come as subsequent article(s).\n\n\u003ca name=\"210121153202\"\u003e\u003c/a\u003e\nPreliminary notes:\n- Some links lead to [documentation](http://github.com/galliaproject/gallia-docs) that is still to be written.\n- The examples use _JSON_ because of its ubiquity as a notation, and despite its [flaws](http://github.com/galliaproject/gallia-docs/blob/master/json.md)\n\n\u003ca name=\"210121153147\"\u003e\u003c/a\u003e\u003ca name=\"dependencies\"\u003e\u003c/a\u003e\n## Dependencies\n\nThe library is available for Scala 2.12, 2.13, and 3.3.1\n\n\u003ca name=\"sbt\"\u003e\u003c/a\u003e\u003ca name=\"210121153201\"\u003e\u003c/a\u003e\nInclude the following in your `build.sbt` file:\n```\nlibraryDependencies += \"io.github.galliaproject\" %% \"gallia-core\" % \"0.6.1\"\n```\n\n\u003ca 
name=\"210121153200\"\u003e\u003c/a\u003e\nThe client code then requires the following import:\n\n```scala\nimport gallia._\n```\n\n\u003ca name=\"210121153148\"\u003e\u003c/a\u003e\nOne can also optionally add the following import for general utilities:\n\n```scala\n// our open-source utilities library,\n//   see https://github.com/aptusproject/aptus-core\nimport aptus._\n```\n\n## Preliminary examples\n\n\u003ca name=\"shines\"\u003e\u003c/a\u003e\nWhile Gallia shines with (and makes most sense for) complex data processing such as [this one (dbNSFP)](https://github.com/galliaproject/gallia-dbnsfp#description),\nit can also cater to the more trivial cases such as the ones presented below as an introduction.\nThe same paradigm can therefore handle all (most) of your data manipulation needs.\n\n\n### Process individual entity\n```scala\n\"\"\"{\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\"\"\"\n  .read() // will infer schema if none is provided\n\n    // uppercase string value for field \"foo\" (\"hello\" -\u003e \"HELLO\")\n    .toUpperCase('foo)\n\n    // increment integer value for field \"bar\" (1 -\u003e 2)\n    .increment('bar)\n\n    // remove field \"qux\" (irrespective of field type)\n    .remove('qux)\n\n    // nest (boolean) field \"baz\" under (new) field \"parent\"\n    .nest('baz).under('parent)\n\n    // flip boolean value of field \"baz\" (now nested under \"parent\")\n    .flip('parent |\u003e 'baz)\n\n  .printJson()\n  // prints: {\"foo\": \"HELLO\", \"bar\": 2, \"parent\": { \"baz\": false }}\n```\n\n\u003ca name=\"210121153151\"\u003e\u003c/a\u003e\nIt is very important to note that the schema is maintained throughout operations, so you will get an error if you try for example to square a boolean:\n```scala\n\"\"\"{\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\"\"\"\n  .read()\n      .toUpperCase('foo)\n      .increment  ('bar)\n      .remove     ('qux)\n      .nest       ('baz).under('parent)\n      
.square     ('parent |\u003e 'baz ~\u003e 'BAZ) // instead of \"flip\" earlier\n    .printJson()\n    // ERROR: TypeMismatch (Boolean, expected Number): 'parent |\u003e 'baz\n```\n\n\u003ca name=\"210121153152\"\u003e\u003c/a\u003e\nNotes:\n* This error occurs *prior* to the actual data run, and no data is therefore processed (potential schema inference aside)\n* The error mechanism works at any level of nesting/multiplicity\n* Of course, some errors cannot be caught until the data is actually seen (e.g. IndexOutOfBounds types of checks)\n\n### Process collection of entities\n```scala\n// INPUT:\n//    {\"first\": \"John\", \"last\": \"Johnson\", \"DOB\": \"1986-02-04\", ...}\\n\n//    {\"first\": \"Kate\", ...\n\"/data/protopeople.jsonl.gz\"\n  .stream() // vs .read() for single entity\n\n    .generate('username).from(_.string('first), _.string('last))\n      .using { (f, l) =\u003e s\"${f.head}${l}\".toLowerCase } // -\u003e \"jjohnson\"\n    .toUpperCase('last)\n    .fuse('first, 'last).as('name).using(_ + \" \" + _)\n    .transformString('DOB ~\u003e 'age).using(\n        _.toLocalDateFromIso.getYear.pipe(2021 - _))\n\n  .write(\n    \"/tmp/people.jsonl.gz\")\n    // OUTPUT:\n    //  {\"username\": \"jjohnson\", \"name\": \"John JOHNSON\", \"age\": 35, ...}\\n\n    //  {\"username\": ...\n```\n\n\u003ca name=\"210121153154\"\u003e\u003c/a\u003e\nNotes:\n- \u003ca href=\"https://jsonlines.org/\" target=\"_blank\"\u003eJSONL\u003c/a\u003e = one JSON document per line\n- This example makes use of:\n  - `.pipe()` from `scala.util.chaining`\n  - `.toLocalDateFromIso()` from our `import aptus._` above (see [docs](http://github.com/galliaproject/gallia-docs/blob/master/aptus.md))\n\n### Process CSV/TSV files\n\n```scala\n\"/data/some.tsv.gz\"\n  .stream()\n    .retain('_id, 'age, 'gender)\n    .groupBy('age)\n  // ...\n```\n\n See more in [inputs](#210120155618) below.\n\n## Basics\n\n### Key referencing\nKeys can be referenced as Scala's `String`, `Enumeration`, 
and `enumeratum.Enum`\n```scala\n\"\"\"{\"foo\": 1}\"\"\"\n  .read().rename(\"foo\" ~\u003e 'FOO)\n  // OUTPUT: {\"FOO\":1}\n\n\"\"\"{\"Very Poor Key Choice  \":\n    \"please_stop_using_spaces_and_unnecessary_uppercasing_in_keys\"}\"\"\"\n  .read()\n    .rename(\"Very Poor Key Choice  \" ~\u003e 'much_better)\n    .transformString('much_better).using(_ =\u003e \"isn't it?\")\n  // OUTPUT: {\"much_better\": \"isn't it?\"}\n```\n\n### Target selection (keys/paths)\nApplicable for both `.read()` and `.stream()` (one vs multiple entities)\n```scala\n// INPUT: {\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\ndata.retain(_.firstKey) // {\"foo\": \"hello\"}\n\ndata.retain(_.allBut('qux))      //{\"foo\": \"hello\", \"bar\": 1, \"baz\": true}\ndata.retain(_.customKeys(_.tail))//{\"bar\": 1, \"baz\": true, \"qux\": \"world\"}\n```\n\n### Generalization of target selection\nLikewise applicable for both `.read()` and `.stream()`\n```scala\nval obj = \"\"\"{\"foo\": \"hi\", \"bar\": 1, \"baz\": true, \"qux\": \"you\"}\"\"\".read()\n\n// can't use \"then\" (reserved in scala)\nobj.forKey    ('foo)      .thn(_ toUpperCase _) // { \"foo\": \"HI\", ...\nobj.forEachKey('foo)      .thn(_ toUpperCase _)\nobj.forEachKey('foo, 'bar).thn(_ toUpperCase _)\n\nobj.forAllKeys((o, k) =\u003e o.rename(k).using(_.toUpperCase)) //{\"FOO\":\"hi\",..\n// ... 
likewise with forPath, forEachPath, forAllPaths, forLeafPaths, ...\n```\n\n### Nested data selection\n\nPaths can be referenced conveniently via the \"pipe+greater-than\" (`|\u003e`) [notation](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210127123739):\n\n```scala\n\"\"\"{\"parent\": {\"foo\": \"bar\"}}\"\"\".read()\n  .toUpperCase('parent |\u003e 'foo)\n  // OUTPUT: {\"parent\":{\"foo\":\"BAR\"}}\n```\n\nNotes:\n- A _key_ is just a trivial _path_.\n- _Gallia_ can generally apply transformations irrespective of multiplicity, as long as they still make sense:\n\n```scala\n\"\"\"{\"parent\": {\"foo\": [\"bar\", \"baz\"]}}\"\"\".read()\n  .toUpperCase('parent |\u003e 'foo)\n  // OUTPUT: {\"parent\":{\"foo\":[\"BAR\", \"BAZ\"]}}\n```\n\n### Renaming keys\n\nRenaming can be expressed conveniently via the \"tilde+greater-than\" (`~\u003e`) [notation](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210127123739):\n\n```scala\n           \"\"\"{\"foo\": \"bar\"}\"\"\" .read().rename           ('foo ~\u003e 'FOO)\n\"\"\"{\"parent\": {\"foo\": \"bar\"}}\"\"\".read().rename('parent |\u003e 'foo ~\u003e 'FOO)\n// OUTPUT: (respectively)\n//             {\"FOO\":\"bar\"}\n//   {\"parent\":{\"FOO\":\"bar\"}}\n```\n\n\u003ca name=\"210128094950\"\u003e\u003c/a\u003e\n A case could be made that `rekey` would be more appropriate than `rename`, but it feels rather unnatural.\n\n### Renaming keys _\"while-at-it\"_\n\n```scala\n\"\"\"{\"foo\": 1}\"\"\".read()\n  .increment('foo ~\u003e 'FOO)\n  // OUTPUT: {\"FOO\":2} - value is incremented and key is uppercased\n```\n\nNote that this is functionally equivalent to:\n```scala\n\"\"\"{\"foo\": 1}\"\"\".read()\n  .increment('foo)\n  .rename   ('foo ~\u003e 'FOO)\n```\n\n## Single vs Multiple entities\n\n _Gallia_ does not necessarily expect its elements (\"entities\") to come in multiples; it is also capable of processing them individually.\n\nExample of going from one to the other, then 
back:\n\n```scala\n\"\"\"{\"foo\": \"bar\"}\"\"\".read()\n    .convertToMultiple // now     [{\"foo\": \"bar\"}]\n    .head              // back to  {\"foo\": \"bar\"}\n```\n\nIn a nested context:\n\n```scala\n\"\"\"[{\"foo\": \"bar1\"}, {\"foo\": \"bar2\"}]\"\"\".stream()\n  .asArray1        //  {\"foo\":[\"bar1\",\"bar2\"]}\n  .flattenBy('foo) // [{\"foo\": \"bar1\"}, {\"foo\": \"bar2\"}] (original array)\n```\n\nThere are other ways to go back and forth between the two (e.g. [reducing](#210120142925) as shown below)\n\n\u003ca name=\"210121153206\"\u003e\u003c/a\u003e\nInternally, all entity-wise operations on \"streams\" are actually just implicit MAP-pings, so that the following two expressions are equivalent:\n```scala\n\"\"\"[{\"foo\": \"bar1\"}, {\"foo\": \"bar2\"}]\"\"\".stream()      .toUpperCase('foo)\n\"\"\"[{\"foo\": \"bar1\"}, {\"foo\": \"bar2\"}]\"\"\".stream().map(_.toUpperCase('foo))\n```\n\n## DAG Heads\n\nThe Head type models a leaf in the DAG(s) that underlie the execution plan.\n\nInternally, heads come in three flavors, each offering a different and relevant subset of operations:\n1. _HeadO_: For single __O__-bject manipulation\n2. _HeadS_: For multiple object-__S__ manipulation\n3. _HeadV[T]_: For _\"naked\"_ __V__-alues manipulation (_HeadV_ is rarely encountered explicitly in client code)\n\nNotes:\n- _\"Naked\"_ values are more conceptually relevant to nested subgraphs, not commonly manipulated by client code. They represent values that are not part of a structured entity, e.g. the string `\"foo\"` alone as opposed to the same string `\"foo\"` within an entity `{\"key1\": 1, \"key2\": \"foo\", ...}`.\n- The DAGs/heads concepts will be discussed in more detail in a future article dedicated to design.\n\n\u003ca name=\"201118133206\"\u003e\u003c/a\u003e\n## SQL-like querying\n\n```scala\npeople\n  // INPUT: [{\"name\": \"John\", \"age\": 20, \"city\": \"Toronto\"}, {...\n\n    /* 1. 
WHERE            */ .filterBy('age).matches(_ \u003c 25)\n    /* 2. SELECT           */ .retain('name, 'age)\n    /* 3. GROUP BY + COUNT */ .countBy('age)\n\n  // OUTPUT: [{\"age\": 21, \"_count\": 10}, {\"age\": 22, ...\n```\n\n\u003ca name=\"210121153208\"\u003e\u003c/a\u003e\n1. _WHERE_ clause: can alternatively be written as `filterBy(_.int('age)).matches(_ \u003c 25)` if you need more than the basic operations =, \u003c, \u003e, +, ... (see [types](#201118133133))\n2. _SELECT_ clause: this would actually be redundant since the subsequent GROUP BY step also retains those fields implicitly\n3. _GROUP BY_ + _COUNT_: if unspecified, uses default `_count` output field\n\n\u003ca name=\"210120142925\"\u003e\u003c/a\u003e\n## Reduction\n```scala\npeople.reduceWithMean('age)      // {\"age\":21.5}\npeople.reduce('age).wit(_.stdev) // {\"age\":1.118[...]}\n```\n\n\u003ca name=\"210121153209\"\u003e\u003c/a\u003e\n More powerfully:\n```scala\npeople\n  .reduce(\n      'age .aggregates(_.mean, _.stdev),\n      'city.count_distinct)\n  // OUTPUT: {\"age\":{\"_mean\":21.5,\"_stdev\":1.118[...]},\"city\":3}\n```\n\n## Aggregations\n\n```scala\npeople.group('name).by('city)\n\n// \"GROUP all keys but the last key BY that last key\"\npeople\n  .group(_.initKeys)\n    .by(_.lastKey)\n      .as('grouped) // would use '_group if unspecified\n  //OUTPUT: [\n  // [{\"gender\":\"male\",\"grouped\":[{\"name\":\"John\",\"age\":21,\"city\":\"Toronto\"},\n  //     ... 
]\n\n// other count types available:\n//   distinct, present, missing and distinct+present\npeople.count('name).by('city)\n\npeople.sum  ('age).by('city) // also sum, mean, stdev, ...\npeople.stats('age).by('city) // descriptive statistics (minimal for now)\n  // OUTPUT: [ {\"city\":\"Toronto\",\"_stats\":{\"mean\":21.0, ...\n```\n\n\u003ca name=\"210121153210\"\u003e\u003c/a\u003e\n A more \"custom\" aggregation (nonsensical):\n```scala\npeople\n  .groupBy('city)\n  .transformGroupEntitiesUsing {\n    _.squash(_.string('name), _.int('age))\n      // random nonsensical aggregation for demonstration purpose only\n      .using(_.map { case (n, a) =\u003e n.size + a }.sum) }\n  .rename(_group ~\u003e 'awesomeness)\n  // OUTPUT:\n  //  [{\"city\":\"Toronto\"     , \"awesomeness\":25},\n  //   {\"city\":\"Philadelphia\", \"awesomeness\":24},\n  //   {\"city\":\"Lyon\"        , \"awesomeness\":53}, ... ]\n```\n\n## Pivoting\n```scala\npeople\n  .pivot(_.int('age)).usingMean\n    .rows   ('city)\n    .column ('gender)\n      // having to provide those is an unfortunate consequence of\n      // maintaining a schema (these values are only known at runtime)\n      .asNewKeys('male, 'female)\n  // OUTPUT:\n  //  [ {\"city\":\"Toronto\",\"male\":21},\n  //    {\"city\":\"Toronto\",\"female\":20},\n  //    {\"city\":\"Lyon\",\"male\":22.5},     ...]\n```\n\nNote that [unpivoting](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210120171258) isn't available, but scheduled\n\n## Renesting Tables\n\nCommon prefixes can be leveraged for re-nesting, e.g. 
\"contact_\" below:\n\n```scala\n// INPUT: \"name\u003cTAB\u003econtact_phone\u003cTAB\u003econtact_address\u003cTAB\u003e...\"\n//                  ^^^^^^^           ^^^^^^^\ntable\n  .renest(_.allKeys)\n    .usingSeparator(\"_\")\n    // OUTPUT: \"{\"name\":\"John\", \"contact\":{\"phone\": 1234567, \"address\":..\n    //                           ^^^^^^^\n```\n\n\u003ca name=\"210128101227\"\u003e\u003c/a\u003e\nThis mechanism is not limited to a single level, it can transform keys:\n\n```foo_bar_baz1\u003cTAB\u003efoo_bar_baz2\u003cTAB\u003e...```\n\ninto\n\n```{\"foo\": {\"bar\": {\"baz1\": ..., \"baz2\": ...}}, ...}```\n\n\u003ca name=\"210121153211\"\u003e\u003c/a\u003e\nIn practice the renesting operation typically involves a lot more work,\n  e.g. if a value is like `\"foo1,foo2,foo3\"`, it may also need to be split and denormalized on a one-per-row basis.\nIt is also common to encounter values such as `\"John:32|Kate:33|Jean:34\"` or combinations of values such as `\"John|Kate|Jean\"` + `\"32|33|34\"`\n  (the latter two actually sharing the same cardinality of elements pipe-wise).\nThis alone would deserve its own article, but in the meantime the [DbNsfp](http://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfp.scala#L14) example highlights a number of interesting such cases.\n\nThe opposite operation (_flattening_ to table) is [scheduled](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210131110456) .\n\n## IO\n\n\u003ca name=\"210120155618\"\u003e\u003c/a\u003e\n### Input\n\n`.read()` (single entity) and `.stream()` (multiple entities) guess as much about the input format as they can from the input `String` provided:\n- JSON markers, e.g. `{`, `[`, ...\n- extensions, e.g. `.json`, `.tsv`, `.gz`, ...\n- URI schemes, e.g. 
`file://`, `http://`, `jdbc://`, ..\n- ...\n\nWe will see later an example of how to override the [default behavior](#210121135434) for reading and writing.\n\nHere are some examples of input consumption:\n\n\u003ca name=\"210121153212\"\u003e\u003c/a\u003e\n```scala\n// will infer schema (costly timewise)\n\"/some/local/file.json\" .read  ()\n\"/some/local/file.jsonl\".stream()\n\n// providing schema\n\"/some/local/file.json\" .read  [MyCaseClass]\n\"/some/local/file.jsonl\".stream[MyCaseClass]\n\n// equivalently\n\"/some/local/file.json\" .read  ('foo.string, 'baz.int)\n\"/some/local/file.jsonl\".stream('foo.string, 'baz.int)\n\n       \"/some/local/file.jsonl\".stream()\n\"file:///some/local/file.jsonl\".stream()\n\n \"http://someserver/test.jsonl\".stream()\n\"https://someserver/test.jsonl\".stream()\n\n\"ftp://someserver/pub/foo/bar.tsv\".stream()\n\n// must make corresponding JDBC driver jar available\n\"jdbc:myfavdb://localhost:1234/test?user=root\u0026password=root\"\n  .stream(_.allFrom(\"TABLE1\"))\n\n\"jdbc:myfavdb://localhost:1234/test?user=root\u0026password=root\"\n  .stream(_.query(\"SELECT * from TABLE1\"))\n\n(conn: java.sql.Connection)       .stream(_.sql(\"SELECT * from TABLE1\"))\n(ps:   java.sql.PreparedStatement).stream()\n\n// requires gallia-mongodb module and import gallia.mongodb._\n//   (see https://github.com/galliaproject/gallia-mongodb)\n\"mongodb://localhost:27017/test.coll1\".stream()\n\"mongodb://localhost:27017/test\"      .stream(_.query(\"\"\"{\"find\":\"coll1\"}\"\"\"))\n```\n\n#### Tables\n\nConsidering the following TSV file:\n```bash\n$ cat /data/some.tsv | column -nt\nf1  f2  f3   f4     f5     f6  f7     f8\nz   1   1.1  true   9,8,7  k   d,e,f  T\ny   2   2.2  false  6,5,4\n```\n\n\u003ca name=\"210121135434\"\u003e\u003c/a\u003e\n And the following call:\n```scala\n\"/data/some.tsv\".stream()\n\n// or its explicit equivalent\n\"/data/some.tsv\".stream(_.tsv.inferSchema)\n```\n\n\u003ca 
name=\"210121153213\"\u003e\u003c/a\u003e\n The following schema and data will be inferred and ingested:\n```scala\nval schema =\n  cls(\n      'f1.string,  'f2.int     , 'f3.double, 'f4.boolean, 'f5.ints,\n      'f6.string_, 'f7.strings_, 'f8.boolean_)\n\nval data =\n Seq(\n  obj('f1 -\u003e \"z\", 'f2 -\u003e 1, 'f3 -\u003e 1.1, 'f4 -\u003e true , 'f5 -\u003e Seq(9, 8, 7),\n        'f6 -\u003e \"k\", 'f7 -\u003e Seq(\"d\", \"e\", \"f\"), 'f8 -\u003e true),\n  obj('f1 -\u003e \"y\", 'f2 -\u003e 2, 'f3 -\u003e 2.2, 'f4 -\u003e false, 'f5 -\u003e Seq(6, 5, 4)))\n```\n\n\u003ca name=\"210121153214\"\u003e\u003c/a\u003e\nNote that `_` here stands for `?`, meaning optional. For instance `'f7.strings_` would be represented as `Option[Seq[String]]` in Scala.\n\n\u003ca name=\"avro\"\u003e\u003c/a\u003e\u003ca name=\"221014125512\"\u003e\u003c/a\u003e\n#### Apache Avro\n\nAvro read/write support was added in `0.4.0`, see [CHANGELOG.md#avro](https://github.com/galliaproject/gallia-core/blob/master/CHANGELOG.md#221014125247)\n\n\u003ca name=\"parquet\"\u003e\u003c/a\u003e\u003ca name=\"221014125513\"\u003e\u003c/a\u003e\n#### Apache Parquet\n\nLikewise, Parquet read/write support was added in `0.4.0`, see [CHANGELOG.md#parquet](https://github.com/galliaproject/gallia-core/blob/master/CHANGELOG.md#221014125248)\n\n\u003ca name=\"210121153215\"\u003e\u003c/a\u003e\n#### Additional sources/destinations\n Additional modules using a similar paradigm will be added in the future, e.g.:\n```scala\n// NEO4J\n\"neo4j+s://demo.neo4jlabs.com\".stream(\n    _.query(\"\"\"(:Person {name: string})\n        -[:ACTED_IN {roles: [string]}]\n          -\u003e(:Movie {title: string, released: number})\"\"\"))\n\n// Sparql\n\"http://www.disease-ontology.org?query=\".stream(\n    _.query(\"\"\"\n      SELECT DISTINCT *\n      WHERE {?s \u003chttp://www.w3.org/2000/01/rdf-schema#label\u003e \"common cold\"}\n      LIMIT 3\"\"\"))\n\n// GraphQL\n\"https://swapi.com/graphql\".stream(\n    
_.query(\n        \"\"\"{user (id: 1) { firstname } }\"\"\"))\n\n// Excel (if sheet contains a single table)\n\"/data/doc.xlsx\".stream(_.allFrom(\"Some Sheet Name\"))\n\n// XML\n\"/data/doc.xml\".stream() // Requires costly schema inferring first\n\n```\n\n Note: There are proof of concepts for the last two (XML and Excel).\n\n### Output\nOutput works in a similar fashion, relying on extensions/URI schemes as much as possible\n\n```scala\nmodifiedPeople.write(\"/tmp/output/result.tsv\")\nmodifiedPeople.write(\"/tmp/output/result.jsonl.bz2\")\n\n// these are not actually implemented for mongo yet (only reading is):\nmodifiedPeople.write(\"mongodb://localhost:27017/test.coll1\")\nmodifiedPeople.write(\n    uri       = \"mongodb://localhost:27017/test\",\n    container = \"coll1\")\n\nmodifiedPeople.write(\n  uri       = \"jdbc:myfavdb://localhost:1234/test?user=foo\u0026password=bar\",\n  container = \"SOME_RESULT_TABLE\")\n```\n\n\u003ca name=\"scaling\"\u003e\u003c/a\u003e\n## Scaling\n\n\u003ca name=\"spark\"\u003e\u003c/a\u003e\n### Spark RDDs\n\nSee Apache Spark's \u003ca href=\"https://spark.apache.org/docs/latest/rdd-programming-guide.html\"  target=\"_blank\"\u003eRDD documentation\u003c/a\u003e.\n\nThis module requires\n\n```\nlibraryDependencies += \"org.gallia\" %% \"gallia-spark\" % \"0.6.1\"\n```\n\nAnd the following import:\n\n```\nimport gallia.spark._\n```\n\n__Abstraction__:\n\n\u003ca name=\"top-level-multiplicity-abstraction\"\u003e\u003c/a\u003e\u003ca name=\"210224092156\"\u003e\u003c/a\u003e\nThe main abstraction in _Gallia_ for top-level multiplicity is [`data.multiple.streamer.Streamer[T]`](https://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/data/multiple/streamer/Streamer.scala#L12), which is then wrapped by the [`data.single.Obj`](https://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/data/single/Obj.scala#L8)-aware counterpart 
[`data.multiple.Objs`](https://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/data/multiple/Objs.scala#L8) (wraps a `Streamer[Obj]`). It currently comes in three flavors, all also under `data.multiple.streamer`:\n1. \u003ca name=\"210224092157\"\u003e\u003c/a\u003e[`ViewStreamer`](https://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/data/multiple/streamer/ViewStreamer.scala#L12): _default_\n2. \u003ca name=\"210224092158\"\u003e\u003c/a\u003e[`IteratorStreamer`](https://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/data/multiple/streamer/IteratorStreamer.scala#L10): enabled via `.stream(_.iteratorMode)`\n3. \u003ca name=\"210224092159\"\u003e\u003c/a\u003e[`RddStreamer`](https://github.com/galliaproject/gallia-spark/blob/master/src/main/scala/gallia/data/multiple/streamer/RddStreamer.scala#L9): enabled via usage of a `SparkContext` if `gallia.spark._` has been [imported](https://github.com/galliaproject/gallia-core/blob/master/README.md#spark-rdds)\n\n__Example__:\n\n\u003ca name=\"using-spark\"\u003e\u003c/a\u003e\u003ca name=\"210121153218\"\u003e\u003c/a\u003e\nSee Spark used in action in [this repo](https://github.com/galliaproject/gallia-genemania-spark/blob/master/README.md#description)\n\n__Bypassing abstraction__:\n\n\u003ca name=\"210121153221\"\u003e\u003c/a\u003e\nYou can modify the underlying RDD (think \u003ca href=\"https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/\" target=\"_blank\"\u003eLaw of Leaky Abstractions\u003c/a\u003e) via `.rdd()`, eg:\n\n```scala\ndata\n  // ...\n  // can by-pass abstraction when needed,\n  //   though schema is not allowed to change\n  //   (which cannot be enforced)\n  .rdd { _.coalesce(1).cache }\n  // ...\n```\n\n\u003ca name=\"spilling\"\u003e\u003c/a\u003e\u003ca name=\"poor-man-scaling\"\u003e\u003c/a\u003e\u003ca name=\"poorman-scaling\"\u003e\u003c/a\u003e\u003ca name=\"210303163034\"\u003e\u003c/a\u003e\n### Poor 
man's scaling (_\"spilling\"_)\n\nMay be useful to your average scientist who may have access to powerful machines (think `qsub`) but not to conveniently provisioned clusters.\nSadly this is a very common occurrence in research settings and the author cares deeply about this problem.\n\n```scala\n\"/data/huge.tsv.bz2\"\n  // uses a GNU sort-based approach to sorting/grouping/joining\n  .stream(_.iteratorMode)\n    .rename('gene).to('hugo_symbol)\n    .groupBy('mutation_id).as('genes)\n    // ...\n```\n\n\u003ca name=\"210121153224\"\u003e\u003c/a\u003e\nNotes:\n- \u003ca name=\"210304140445\"\u003e\u003c/a\u003eAll wide transformations can be written in terms of an external sort such as \u003ca href=\"https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html\" target=\"_blank\"\u003e_GNU sort_\u003c/a\u003e\n- \u003ca name=\"210304140446\"\u003e\u003c/a\u003eWe can combine such operations and leverage pipes to ensure the execution tree is executed lazily (forking however would benefit from a form of [checkpointing](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210121160956))\n- \u003ca name=\"210304140447\"\u003e\u003c/a\u003e_GNU sort_ is favored for now because replacing it would constitute a significant endeavour, and even then it would be extremely hard to beat performance-wise\n- \u003ca name=\"210304140448\"\u003e\u003c/a\u003eIdeally this would be an alternative run mode for _Spark_ itself\n- \u003ca name=\"210304140450\"\u003e\u003c/a\u003eThe current implementation can be seen in action in the [GeneMania processing](https://github.com/galliaproject/gallia-genemania/blob/master/src/main/scala/galliaexample/genemania/GeneMania.scala#L95) sub-project\n- \u003ca name=\"210304140449\"\u003e\u003c/a\u003eThis feature is only __partially__ [implemented](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210204111309). 
It's basically enabled via the `_.stream(_.iteratorMode.[...])` call, and follows this type of invocation paths:\n[`Streamer.groupByKey`](./src/main/scala/gallia/data/multiple/streamer/Streamer.scala#L62)\n  -\u003e [Iterator's](./src/main/scala/gallia/data/multiple/streamer/IteratorStreamer.scala#L57)\n  -\u003e [utility](./src/main/scala/gallia/data/multiple/streamer/IteratorStreamerUtils.scala#L48)\n  -\u003e [GNU sort wrapper](./src/main/scala/gallia/data/multiple/streamer/spilling/GnuSortByFirstFieldHack.scala#L15)\n\n\u003ca name=\"201118133133\"\u003e\u003c/a\u003e\n## Explicit types\n\nLet's revisit the [SQL-like](#201118133206) example. Note that the [`Whatever`](http://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/Whatever.scala#L15) type placeholder is being used\n(basically an `Any` wrapper that accepts very basic operations such as `+`, `\u003c`, etc.)\n\n```scala\n// the following two expressions are equivalent:\n//\n//          omitting type implies the use of Whatever here  and here\n//         v                 v                            v          v\nz.fuse(         'first ,          'last ).as('name).using(_  + \" \" + _)\nz.fuse(_.string('first), _.string('last)).as('name).using(_  + \" \" + _)\n//                                                        ^          ^\n//                                                         vs strings\n```\n\n A more disciplined and powerful approach than relying on `Whatever` is to be explicit, which gives access to all the corresponding type's operations\n```scala\nz.fuse(_.string('first), _.string('last)).as('name)\n   // .head and .toUpperCase require knowledge of the exact type (String here)\n   .using { (f, n) =\u003e s\"${f.head}${n.toUpperCase}\" }\n```\n\nMore types than the currently [supported](http://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/reflect/BasicType.scala#L16) `BasicTypes` will be added in the 
[future](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210121124808)\n\n## Schema (metadata)\n\n\u003ca name=\"schema-aware\"\u003e\u003c/a\u003e\u003ca name=\"210127201457\"\u003e\u003c/a\u003e\n_Gallia_ is \"schema-aware\", meaning it keeps track of schema changes for every step. This allows the library to detect many errors prior to seeing the actual data.\n\nAs we've seen before, there are multiple ways to explicitly provide the data's underlying schema.\nThis saves the library the task of looping over the data first to \"infer\" said schema.\n\n\u003ca name=\"210121153230\"\u003e\u003c/a\u003e\n\n1. By using a case class\n```scala\ncase class Foo(foo: String, bar: Int, baz: Boolean, qux: String)\n\n\"\"\"{\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\"\"\".read[Foo]\n```\n\n\u003ca name=\"210121153231\"\u003e\u003c/a\u003e\n2. By providing it \"manually\"\n```scala\n\"\"\"{\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\"\"\"\n  // underscore means optional (since can't conveniently use '?' in Scala)\n  .read('foo.string, 'bar.int, 'baz.boolean, 'qux.string, 'corge.string_)\n```\n\n\u003ca name=\"210121153232\"\u003e\u003c/a\u003e\n3. By providing an external resource that contains a JSON-serialized version of the schema\n```scala\n\"\"\"{\"foo\": \"hello\", \"bar\": 1, \"baz\": true, \"qux\": \"world\"}\"\"\"\n    .read(\"/meta/myschema.json\")\n```\n\nWhere \"/meta/myschema.json\" contains: `{\"fields\":[{\"key\":\"foo\",\"info\":...`\n\n\u003ca name=\"210121153233\"\u003e\u003c/a\u003e\nMore interactions with case classes are available (e.g. in transformations); they will be detailed in a future article.\n\n__Note__: Gallia schemas are mostly meant to be descriptive, but they can be prescriptive in the case of \nlooser formats such as JSON or {T,C}SV files. 
For instance a field defined as an `_Int` in a schema describing \na numerical JSON entry will be interpreted as an `_Int` instead of a `_Double` (as would be expected from the JSON specification).\n\n\u003ca name=\"macros\"\u003e\u003c/a\u003e\u003ca name=\"210326142045\"\u003e\u003c/a\u003e\n## Macros\n\nSee dedicated [repo](https://github.com/galliaproject/gallia-macros), which contains examples\n\n\n\u003ca name=\"full-blown\"\u003e\u003c/a\u003e\u003ca name=\"210121135252\"\u003e\u003c/a\u003e\n## Full blown example\n\nI am providing a [link](http://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfp.scala#L26) to one of the full-blown examples I've written using _Gallia_: turning the big\n\u003ca href=\"https://sites.google.com/site/jpopgen/dbNSFP\" target=\"_blank\"\u003edbNSFP\u003c/a\u003e tables into a corresponding nested structure more conducive to querying (_mongodb_, _elasticsearch_, ...).\nSee the example [input row](https://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfpDriver.scala#L19-L20) and example [output entity](https://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfpDriver.scala#L31-L348).\n\nIt is in no way complete or 100% correct in its current form, as it is primarily designed to showcase _Gallia_.\nI only tested it on a small subset of the data, and I expect unfortunate surprises would arise from processing the entire dataset.\n\n\u003ca name=\"210121153241\"\u003e\u003c/a\u003e\nIt showcases among other things how to turn a long `String` full of extractable information, e.g.:\n\n```json\n\"Loss of ubiquitination at K551 (P = 0.0092); Loss of methylation [...]\"\n```\n\n\u003ca name=\"210121153242\"\u003e\u003c/a\u003e\nInto a more parseable object:\n```\n[\n  {\"type\":\"loss\", \"change_type\":\"ubiquitination\",\n     \"location\":\"K551\", \"p_value\":0.0092 },\n  {\"type\":\"loss\", 
\"change_type\":\"methylation\",\n      ... },\n  ...\n]\n```\n\u003ca name=\"210121153243\"\u003e\u003c/a\u003e\nVia an intermediate Scala case class (which contains most of transformation logic):\n```scala\n// ...\n.transformString(top_5_features).using(MutPred.apply)\n// ...\n```\n\n\u003ca name=\"210121153244\"\u003e\u003c/a\u003e\nProcessing this kind of data is exactly why I designed the library in the first place.\nI believe a lot of useful knowledge can be unlocked by making this kind of resource more parseable (_DbNsfp_ itself is an incredibly useful resource in terms of content).\nThe field of bioinformatics in particular is laden with archaic technologies and practices, which in turns\nresults in tons of lost opportunities for impactful medical discoveries.\nI have never dealt with it personally but I imagine the likes of computational physics and other \"computational-driven\" disciplines probably suffer from similar problems.\n\n\u003ca name=\"examples\"\u003e\u003c/a\u003e\u003ca name=\"210223093237\"\u003e\u003c/a\u003e\n## List of concrete examples\n- \u003ca name=\"210223143334\"\u003e\u003c/a\u003e\u003ca name=\"trivial-examples\"\u003eTrivial examples:\n  - [Word Count](https://gist.github.com/anthony-cros/2ceba1be56bd99a8d4bafd2b9f52b9b3#file-wordcount-scala-L11) example, the \"hello world\" of big data\n  - [Count by word length](https://gist.github.com/anthony-cros/2ceba1be56bd99a8d4bafd2b9f52b9b3#file-wordcount-scala-L29) example\n- \u003ca name=\"210223143709\"\u003e\u003c/a\u003eSQL-like:\n  - \u003ca name=\"210224103624\"\u003e\u003c/a\u003eNorthwind queries: coming soon\n- \u003ca name=\"210429153511\"\u003e\u003c/a\u003e\u003ca name=\"web-app-server-logic\"\u003e\u003c/a\u003eWeb application server logic:\n  - \u003ca name=\"210429153512\"\u003e\u003c/a\u003e\u003ca name=\"cbio-studies-summary\"\u003e\u003c/a\u003ecbioportal's \"_studies summary_\" API call: reproducing response to obtain a summary of all studies for 
[cbioportal](https://www.cbioportal.org/), arguably the most commonly used web portal for cancer data.\n  This is the first API call made upon loading the portal's [main page](https://www.cbioportal.org/), and it is specified on their [swagger page](https://www.cbioportal.org/api/swagger-ui.html#/Studies/getAllStudiesUsingGET).\n    See [dedicated page](https://github.com/galliaproject/gallia-docs/blob/master/examples/cbioportal_cancer_studies.md) for the code.\n- \u003ca name=\"210223094317\"\u003e\u003c/a\u003e\u003ca name=\"articles-examples\"\u003eReproducing random examples encountered in articles on data manipulation:\n  - \u003ca name=\"220112153055\"\u003e\u003c/a\u003eNotebooks for Databricks articles \u003ca href=\"http://anthonycros.com/dais2022.html\"\u003e(Spark) Datasets tutorial\u003c/a\u003e and \u003ca href=\"http://anthonycros.com/dais2022.html\"\u003eComplex nested structures\u003c/a\u003e\n  - \u003ca name=\"210224103606\"\u003e\u003c/a\u003eTPC-DS Sales summary [example query](https://gist.github.com/anthony-cros/f6d82744523349a65bc86598c79cabdc) as discussed in Andrew Ray's Databricks post: _[\"Reshaping Data with Pivot in Apache Spark\"](https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html)_ (February 2016)\n  - \u003ca name=\"210224123645\"\u003e\u003c/a\u003eData [manipulation task](https://gist.github.com/anthony-cros/4b5d40014dd52d57bc7dd05c35029066) for the Cars93 dataset (R MASS package), as discussed in Darren Wilkinson's blog post: _[\"Data frames and tables in Scala\"](https://darrenjw.wordpress.com/2015/08/21/data-frames-and-tables-in-scala/)_ (August 2015)\n  - \u003ca name=\"210224103607\"\u003e\u003c/a\u003eEurostat census data [example queries](https://gist.github.com/anthony-cros/74811b85f9634f3e5646eed71ad7aa20) as discussed in Mathijs Vogelzang's Medium article: _[\"Doing cool data science in Java: how 3 DataFrame libraries stack 
up\"](https://medium.com/@thijser/doing-cool-data-science-in-java-how-3-dataframe-libraries-stack-up-5e6ccb7b437)_ (September 2018)\n  - \u003ca name=\"210224103608\"\u003e\u003c/a\u003eFootball premier league data [manipulations](https://gist.github.com/anthony-cros/4a3a9ae9a31881d9fd85c4c67b8a5559) as discussed in Chloe Connor's Towards Data Science article: _[\"Stop using Pandas and start using Spark with Scala\"](https://towardsdatascience.com/stop-using-pandas-and-start-using-spark-with-scala-f7364077c2e0)_ (June 2020)\n- \u003ca name=\"210223094353\"\u003e\u003ca name=\"bioinformatics-examples\"\u003e\u003c/a\u003eBioinformatics examples\n  - \u003ca name=\"clinvar-example\"\u003ere-processing [clinvar VCF file](https://github.com/galliaproject/gallia-clinvar)\n  - \u003ca name=\"snpeff-example\"\u003ere-processing [SnpEff output](https://github.com/galliaproject/gallia-snpeff)\n  - \u003ca name=\"dbnsfp-example\"\u003ere-processing [dbNSFP table](#210121135252) example from section just above\n  - \u003ca name=\"genemania-example\"\u003ere-processing [GeneMania TSV files](https://github.com/galliaproject/gallia-genemania#README.md); uses the [__poor man's scaling__](#poor-man-scaling) approach (_spilling_)\n  - \u003ca name=\"lovd-example\"\u003ere-processing [rare disease LOVD data](https://gist.github.com/anthony-cros/1416c544438ef39ca36ae723d02c3ce9) (from [EDS Variant Database](https://databases.lovd.nl/shared/genes/COL3A1))\n- \u003ca name=\"210223095346\"\u003e\u003c/a\u003e\u003ca name=\"physics-examples\"\u003ePhysics examples\n  - [ENSDF](https://www.nndc.bnl.gov/ensdf/) data (WIP)\n  - WIP (see [forum question](https://www.physicsforums.com/threads/looking-for-large-dataset-of-non-image-centric-physics-data.1000073/))\n- \u003ca name=\"210223094318\"\u003e\u003c/a\u003e\u003ca name=\"spark-examples\"\u003e\u003c/a\u003eSpark-powered:\n  - [GeneMania TSV files](https://github.com/galliaproject/gallia-genemania-spark#description) via Spark RDDs\n- 
(more coming soon)\n\n\n## Strengths\nGallia's main strengths can be summed up like so:\n* \u003ca name=\"211220163120\"\u003e\u003c/a\u003eOffers a one-stop shop paradigm for most or all data transformations needs within one's application.\n* \u003ca name=\"211220163121\"\u003e\u003c/a\u003eThe most common/useful data operations are provided, or at least [scheduled](https://github.com/galliaproject/gallia-docs/blob/master/tasks.md).\n* \u003ca name=\"211220163122\"\u003e\u003c/a\u003eReadable DSL that domain experts should be able to at least partially comprehend.\n* \u003ca name=\"211220163123\"\u003e\u003c/a\u003eScaling is not an afterthought and *Spark RDDs* can be leveraged when required.\n* \u003ca name=\"211220163124\"\u003e\u003c/a\u003eMeta-awareness, meaning inconsistent transformations are rejected whenever possible (for instance, cannot use a field that's been removed already).\n* \u003ca name=\"211220163125\"\u003e\u003c/a\u003eCan process individual entities, not just collections thereof; that is, there's no need to create \"dummy\" collections of one entity in order to operate on that entity.\n* \u003ca name=\"211220163126\"\u003e\u003c/a\u003eCan process nested entities of any multiplicity in a natural way.\n* \u003ca name=\"211220163127\"\u003e\u003c/a\u003eMacros are [available](https://github.com/galliaproject/gallia-macros) for a smooth integration with case class hierarchies.\n* \u003ca name=\"211220163128\"\u003e\u003c/a\u003eProvides flexible target selection - i.e. which field(s) to act on - which ranges from explicit reference to actual queries, including when nesting is involved.\n* \u003ca name=\"211220163129\"\u003e\u003c/a\u003eThe execution DAG is sufficiently abstracted that its optimization is a well-separated concern (e.g. 
predicate pushdowns, pruning, ...); note however, that few such optimizations are in place at the moment.\n\n\n## FAQ\n\n\u003ca name=\"210127134030\"\u003e\u003c/a\u003e\n### Is this ready for production?\nNot even remotely. There are known bugs, blatantly missing features, a lot of missing validation, and most importantly it performs rather slowly at the moment.\nThere is a lot planned in the way of addressing these issues, but it will require more resources than the author working alone. In particular, performance has\na prominent place in the task [list](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210121095401).\n\n\u003ca name=\"210127134032\"\u003e\u003c/a\u003e\n### How can I help?\nI'm already aware of many issues and have a long list of [tasks](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210121100050) meant to address them, as well as add the features that are critically missing.\nAs a result the most useful thing one can do to help at the moment is simply letting me know if this is an effort worth pursuing.\nOnce a definitive license is chosen, code contributions will be more than welcome.\n\n\u003ca name=\"210127134033\"\u003e\u003c/a\u003e\n### What are the biggest limitations by design?\n~At this point, a given field can only be of a given type. Ironically this prevents _Gallia_ from having its own metaschema specified in _Gallia_ terms.~ (see [metaschema](https://github.com/galliaproject/gallia-core/blob/master/CHANGELOG.md#221013105445), made possible by [(partial) union types](https://github.com/galliaproject/gallia-core/blob/master/CHANGELOG.md#221013103753)). \n~See problem in action in the [code](http://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/meta/MetaObj.scala#L35)~\nA more thorough discussion of design choices and trade-offs/limitations will come in a future article.\n\nAnother potential trick is that there can be only one meaning to a missing value. 
For instance, the entries in `[{\"foo\": null}, {\"foo\": []}, {}]` would all collapse to the same absence of a value: `{}`.\nNote that overloading the various `null`/`Nil` mechanisms with alternative meanings is probably not great data modeling practice in the first place.\n\n\u003ca name=\"210127134034\"\u003e\u003c/a\u003e\n### In what way is readability prioritized?\nWe aim to make the code as readable as possible (goal [#2](#210127120328)) whenever it doesn't affect practicality (goal [#1](#210127120327)).\nIn particular we want to make it possible for domain experts - who may not be programmers - to understand at least superficially what is happening in each step.\nIt is obviously not always [feasible](http://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfp.scala#L224) for the task at hand, but this is otherwise a major goal for the library.\n\n\u003ca name=\"210127134035\"\u003e\u003c/a\u003e\n### What are good use cases for the library?\nThe main use cases that come to mind at this point are batch ETL, querying, feature engineering, internal application logic, and data validation and evolution.\nOn the batch ETL front, it would be interesting to see how alternative libraries/languages handle examples such as the [dbNSFP](http://github.com/galliaproject/gallia-dbnsfp/blob/master/src/main/scala/galliaexample/dbnsfp/DbNsfp.scala#L14) one above.\nIn particular, how would the various thresholds (readability/practicality/scalability) be shifted by a different choice?\n\n\u003ca name=\"210127134036\"\u003e\u003c/a\u003e\n### What about features like streaming? EDA? [visualization](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210124092211)? linear algebra? graph queries? notebooks? metadata [semantics](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210124095616) ? 
squaring the circle?\nThere are lots of features that could be added in the future, but they all require a pretty sturdy base first.\n\n\u003ca name=\"210127140116\"\u003e\u003c/a\u003e\nNote that the most important part of the library at this point is its client code interface. The internals could be entirely scrapped in the future,\nthough it's more likely it would be replaced in phases short of a major design flaw.\n\n\u003ca name=\"why-macros\"\u003e\u003c/a\u003e\u003ca name=\"210127134037\"\u003e\u003c/a\u003e\n### Why not more macros-based features?\nI [prototyped](#macros) a lot with macros and they will play an important role in the future of _Gallia_.\n\nThey can also be tricky to deal with, and since they are scheduled for a major overhaul, I am reluctant to invest a lot of time on that front at the moment.\nI see them helping a lot in particular with boilerplate and some compile-time validation (e.g. key [validation](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210127134525)).\nThe very initial plan was to leverage [whitebox](https://docs.scala-lang.org/overviews/macros/blackbox-whitebox.html) macros for every step, but I gave up on the idea pretty early on. I'd like to re-investigate it for a subset of features/use cases at some point however,\nespecially since there seems to be some interesting projects (e.g. 
[quill](https://github.com/getquill/quill)) that already make interesting use of them.\n\n\u003ca name=\"210127134038\"\u003e\u003c/a\u003e\n### Where is the category theory?\nI'm quite impressed with the likes of \u003ca href=\"https://github.com/typelevel/cats\" target=\"_blank\"\u003e_cats_\u003c/a\u003e (-\u003e great [book](https://underscore.io/books/scala-with-cats/)) or\n\u003ca href=\"https://github.com/milessabin/shapeless\" target=\"_blank\"\u003e_shapeless_\u003c/a\u003e but while I find them intellectually fascinating,\nI do side with the \u003ca href=\"https://skillsmatter.com/skillscasts/6483-keynote-scaling-intelligence-moving-ideas-forward\" target=\"_blank\"\u003e\"blue sky\"\u003c/a\u003e perspective when it comes to prioritizing\n\u003ca href=\"https://gonfva.medium.com/another-scala-is-possible-99bcc6006c7c\" target=\"_blank\"\u003epracticality\u003c/a\u003e.\n\n\u003ca name=\"210127134039\"\u003e\u003c/a\u003e\n### What about other programming languages?\nInitially the idea was for this to be a language-agnostic DSL for data manipulation, with a reference implementation in Scala basically acting as the specification.\nIt may still become a reality but I'd rather focus on maturing a Scala version first.\n\n\u003ca name=\"210129170214\"\u003e\u003c/a\u003e\n### What is aptus?\n\"Aptus\" is Latin for suitable, appropriate, fitting. It is our utility library to help smooth certain pain points of the Java/Scala ecosystem.\nIt was originally included in _Gallia_ for convenience, but is now externalized in its [own repo](https://github.com/aptusproject/aptus-core) (Apache 2 licensed)\n\n\u003ca name=\"210127134040\"\u003e\u003c/a\u003e\u003ca name=\"tests\"\u003e\u003c/a\u003e\n### Where are the tests?\nThey live in a different [repo](https://github.com/galliaproject/gallia-testing#gallia-testing) and are being introduced incrementally (unpublished ones need a lot of cleaning up). 
They basically take the following form:\n\n\u003ca name=\"210121153250\"\u003e\u003c/a\u003e\n```scala\naobj( // the \"a\" in aobj stands for \"Annotated\"\n    cls('p   .cls_('f.string  , 'g.int ), 'z.boolean))(\n    obj('p -\u003e obj ('f -\u003e \"foo\", 'g -\u003e 1), 'z -\u003e true) )\n  .generate('h)\n    .from(_.entity('p))\n    .using {\n        _ .translate('f ~\u003e 'F).using(\"foo\" -\u003e \"oof\")\n          .remove('g) }\n  .check {\n    aobj(\n      cls('p   .cls_('f.string, 'g.int   ), 'z.boolean, 'h .cls_ ('F.string)))(\n      obj('p -\u003e obj ('f -\u003e \"foo\", 'g -\u003e 1), 'z -\u003e true, 'h -\u003e obj('F -\u003e \"oof\")) ) }\n```\n\nWhere `check` wraps an equality assertion. I have not settled on a definitive [testing library](https://github.com/galliaproject/gallia-docs/blob/master/tasks.md#testing-library) yet, though considering utest at this point.\n\n\u003ca name=\"210127134041\"\u003e\u003c/a\u003e\n### Why so few comments, especially scaladoc?\nI try to leverage the language constructs as much as possible, e.g. by naming variables and methods so they convey semantics as much as possible.\nI then add the occasional comment when I deem it necessary, but overall expect any contributor to be sufficiently familiar with Scala to understand what's going on.\nAs the project matures, proper scaladoc-friendly comments can hopefully be [added](http://github.com/galliaproject/gallia-docs/blob/master/quotes.md#mvp) as well.\n\n\u003ca name=\"210127134042\"\u003e\u003c/a\u003e\n### Why does the [terminology](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210124100007) sometimes sound funny or full-on neological?\nNaming things is hard. Sometimes I give up and favor an alternative until a better idea comes along. 
Sometimes a [temporary](http://github.com/galliaproject/gallia-core/blob/master/src/main/scala/gallia/heads/HeadU.scala#L8) name just sticks around, by way of organic growth.\nMore generally I'd like to create an OWL [ontology](http://github.com/galliaproject/gallia-docs/blob/master/tasks.md#t210127124029) to more formally define terms that may deserve it.\n\n\u003ca name=\"210127134043\"\u003e\u003c/a\u003e\n### What's with the IDs that look like timestamps and pop up everywhere (e.g. `210121162536`)?\nThey're my quick-and-dirty mechanism for ID-ing elements, and are generated by combining the `date` command along with `xautomation`, called via `xbindkeys` keyboard shortcuts.\nWhen they represent a task, it allows me to ID the task temporarily. Many small tasks will never see an actual issue tracking system ID assigned to them.\nNote that the timestamp itself is never guaranteed to be meaningful, as I occasionally hack them around (for consolidation purposes for instance).\n\n\u003ca name=\"210127134044\"\u003e\u003c/a\u003e\n### Where does the name \"Gallia\" come from?\n_Gallia_ is the name of a Romano-Gallic [goddess](https://en.wikipedia.org/wiki/Gallia_%28goddess%29). 
It is also the Latin name for [Gaul](https://en.wikipedia.org/wiki/Gaul), the area the author is originally from.\n\nRumor has it that the goddess Gallia appeared in 16 BCE to a group of data engineers gathered at a local tavern in Lugdunum (now [Lyon](https://www.google.com/maps/place/Lyon,+France)),\nand that she told them to keepeth their code (1) _practical_, (2) _readable_, and (3) _scalable_ (if needed), in that exact order.\n\n## Contact \u0026 Announcements\n\n- Contact: `contact.galliaproject at gmail.com`\n- [Blog](http://anthonycros.com/)\n- [LinkedIn](https://www.linkedin.com/in/anthony-cros-3587b063/)\n- [Twitter (@AnthonyCros)](https://twitter.com/anthony_cros) - for further announcements\n- Original announcement on the [Scala Users list](https://users.scala-lang.org/t/introducing-gallia-a-library-for-data-manipulation/7112)\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"./images/logo.png\" alt=\"icon\"\u003e\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgalliaproject%2Fgallia-core","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgalliaproject%2Fgallia-core","lists_url":"https://awesome.ecosys
te.ms/api/v1/projects/github.com%2Fgalliaproject%2Fgallia-core/lists"}