{"id":17635938,"url":"https://github.com/mohawk2/data-prepare","last_synced_at":"2025-10-12T15:17:46.533Z","repository":{"id":56833880,"uuid":"331953056","full_name":"mohawk2/data-prepare","owner":"mohawk2","description":"Module to prepare CSV (etc) data for automatic processing","archived":false,"fork":false,"pushed_at":"2021-03-25T17:04:06.000Z","size":174,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-05T06:14:22.485Z","etag":null,"topics":["data-cleaning","data-preparation","data-science","perl"],"latest_commit_sha":null,"homepage":"https://metacpan.org/pod/distribution/Data-Prepare/scripts/data-prepare","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mohawk2.png","metadata":{"files":{"readme":"README.md","changelog":"Changes","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-22T13:23:49.000Z","updated_at":"2021-03-25T17:04:09.000Z","dependencies_parsed_at":"2022-09-09T21:10:39.627Z","dependency_job_id":null,"html_url":"https://github.com/mohawk2/data-prepare","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohawk2%2Fdata-prepare","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohawk2%2Fdata-prepare/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohawk2%2Fdata-prepare/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohawk2%2Fdata-prepare/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mohawk2","download_url":"https://codeload.github
.com/mohawk2/data-prepare/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246273515,"owners_count":20750904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-preparation","data-science","perl"],"created_at":"2024-10-23T02:24:43.028Z","updated_at":"2025-10-12T15:17:46.413Z","avatar_url":"https://github.com/mohawk2.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NAME\n\nData::Prepare - prepare CSV (etc) data for automatic processing\n\n# SYNOPSIS\n\n    use Text::CSV qw(csv);\n    use Data::Prepare qw(\n      cols_non_empty non_unique_cols\n      chop_lines chop_cols header_merge\n    );\n    my $data = csv(in =\u003e 'unclean.csv', encoding =\u003e \"UTF-8\");\n    chop_cols([0, 2], $data);\n    header_merge($spec, $data);\n    chop_lines(\\@lines, $data); # mutates the data\n\n    # or:\n    my @non_empty_counts = cols_non_empty($data);\n    print Dumper(non_unique_cols($data));\n\n# DESCRIPTION\n\nA module with utility functions for turning spreadsheets published for\nhuman consumption into ones suitable for automatic processing. Intended\nto be used by the supplied [data-prepare](https://metacpan.org/pod/data-prepare) script. See that script's\ndocumentation for a suggested workflow.\n\nAll the functions are exportable, none are exported by default.\nAll the `$data` inputs are an array-ref-of-array-refs.\n\n# FUNCTIONS\n\n## chop\\_cols\n\n    chop_cols([0, 2], $data);\n\nUses `splice` to delete each zero-based column index. 
The example above\ndeletes the first and third columns.\n\n## chop\\_lines\n\n    chop_lines([ 0, (-1) x $n ], $data);\n\nUses `splice` to delete each zero-based line index, in the order\ngiven. The example above deletes the first, and last `$n`, lines.\n\n## header\\_merge\n\n    header_merge([\n      { line =\u003e 1, from =\u003e 'up', fromspec =\u003e 'lastnonblank', to =\u003e 'self', matchto =\u003e 'HH', do =\u003e [ 'overwrite' ] },\n      { line =\u003e 1, from =\u003e 'self', matchfrom =\u003e '.', to =\u003e 'down', do =\u003e [ 'prepend', ' ' ] },\n      { line =\u003e 2, from =\u003e 'self', fromspec =\u003e 'left', to =\u003e 'self', matchto =\u003e 'Year', do =\u003e [ 'prepend', '/' ] },\n      { line =\u003e 2, from =\u003e 'self', fromspec =\u003e 'literal:Country', to =\u003e 'self', tospec =\u003e 'index:0', do =\u003e [ 'overwrite' ] },\n    ], $data);\n    # Turns:\n    # [\n    #   [ '', 'Proportion of households with', '', '', '' ],\n    #   [ '', '(HH1)', 'Year', '(HH2)', 'Year' ],\n    #   [ '', 'Radio', 'of data', 'TV', 'of data' ],\n    # ]\n    # into (after a further chop_lines to remove the first two):\n    # [\n    #   [\n    #     'Country',\n    #     'Proportion of households with Radio', 'Proportion of households with Radio/Year of data',\n    #     'Proportion of households with TV', 'Proportion of households with TV/Year of data'\n    #   ]\n    # ]\n\nApplies the given transformations to the given data, so you can make the\ngiven data have the first row be your desired headers for the columns.\nAs shown in the above example, this does not delete lines so further\noperations may be needed.\n\nBroadly, each hash-ref specifies one operation, which acts on a single\n(specified) line-number. 
It scans along that line from left to right,\nunless `tospec` matches `index:\\d+` in which case only one operation\nis done.\n\nThe above merge operations in YAML format:\n\n    spec:\n      - do:\n          - overwrite\n        from: up\n        fromspec: lastnonblank\n        line: 2\n        matchto: HH\n        to: self\n      - do:\n          - prepend\n          - ' '\n        from: self\n        line: 2\n        matchfrom: .\n        to: down\n      - do:\n          - prepend\n          - /\n        from: self\n        fromspec: left\n        line: 3\n        matchto: Year\n        to: self\n      - do:\n          - overwrite\n        from: self\n        fromspec: literal:Country\n        line: 3\n        to: self\n        tospec: index:0\n\nThis turns the first three lines of data excerpted from the supplied example\ndata (shown in CSV with spaces inserted for alignment reasons only):\n\n          ,Proportion of households with,       ,     ,\n          ,(HH1)                        ,Year   ,(HH2),Year\n          ,Radio                        ,of data,TV   ,of data\n    Belize,58.7                         ,2019   ,78.7 ,2019\n\ninto the following. Note that the first two lines will still be present\n(not shown), possibly modified, so you will need your chop\\_lines to\nremove them. 
The columns of the third line are shown, one per line,\nfor readability:\n\n    Country,\n    Proportion of households with Radio,\n    Proportion of households with Radio/Year of data,\n    Proportion of households with TV,\n    Proportion of households with TV/Year of data\n\nThis achieves a single row of column-headings, with each column-heading\nbeing unique, and sufficiently meaningful.\n\n## pk\\_insert\n\n    pk_insert({\n      column_heading =\u003e 'ISO3CODE',\n      local_column =\u003e 'Country',\n      pk_column =\u003e 'official_name_en',\n    }, $data, $pk_map, $stopwords);\n\nIn YAML format, this is the same configuration:\n\n    pk_insert:\n      - files:\n          - examples/CoreHouseholdIndicators.csv\n        spec:\n          column_heading: ISO3CODE\n          local_column: Country\n          pk_column: official_name_en\n          use_fallback: true\n\nUsing the `$pk_map` made with [\"make\\_pk\\_map\"](#make_pk_map), inserts a new column\nheaded `column_heading` in front of the current zero-th column. Each value\nis the primary key found by looking up the row's `local_column` value\n(`Country` here) in the `pk_column` column of the `pk_spec` file. If\n`use_fallback` is true and no exact match is found, also tries\n[\"pk\\_match\"](#pk_match). In that case, `stopwords` must be specified in the\nconfiguration.\n\n## cols\\_non\\_empty\n\n    my @col_non_empty = cols_non_empty($data);\n\nIn the given data, iterates through all rows and returns a list of\nquantities of non-blank entries in each column. This can be useful to spot\ncolumns with only a couple of entries, which are good candidates for chopping.\n\n## non\\_unique\\_cols\n\n    my $col2count = non_unique_cols($data);\n\nTakes the first row of the given data, and returns a hash-ref mapping\nany non-unique column-names to the number of times they appear.\n\n## key\\_to\\_index\n\nGiven an array-ref (probably the first row of a CSV file, i.e. 
column\nheadings), returns a hash-ref mapping the cell values to their zero-based\nindex.\n\n## make\\_pk\\_map\n\n    my $altcol2value2pk = make_pk_map($data, $pk_colkey, \\@other_colkeys);\n\nGiven `$data`, the heading of the primary-key column, and an array-ref\nof headings of alternative key columns, returns a hash-ref mapping each\nof those alternative key columns (plus the `$pk_colkey`) to a map from\nthat column's value to the relevant row's primary-key value.\n\nThis is most conveniently represented in YAML format:\n\n    pk_spec:\n      file: examples/country-codes.csv\n      primary_key: ISO3166-1-Alpha-3\n      alt_keys:\n        - ISO3166-1-Alpha-2\n        - UNTERM English Short\n        - UNTERM English Formal\n        - official_name_en\n        - CLDR display name\n      stopwords:\n        - islands\n        - china\n        - northern\n\n## pk\\_col\\_counts\n\n    my ($colname2potential_key2count, $no_exact_match) = pk_col_counts($data, $pk_map);\n\nGiven `$data` and a primary-key (etc) map created by the above, returns\na tuple of a hash-ref mapping each column that gave any matches to a\nfurther hash-ref mapping each of the potential key columns given above\nto how many matches it gave, and an array-ref of rows that had no exact\nmatches.\n\n## pk\\_match\n\n    my ($best, $pk_cols_unique_best) = pk_match($value, $pk_map, $stopwords);\n\nGiven a value, `$pk_map`, and an array-ref of case-insensitive stopwords,\nreturns its best match for the right primary-key value, and an array-ref\nof which primary-key columns in the `$pk_map` matched the given value\nexactly once.\n\nThe latter is useful for analysis purposes to select which primary-key\ncolumn to use for this data-set.\n\nThe algorithm used for this best-match:\n\n- Splits the value into words (or where a word is two or more capital\nletters, letters). The search allows any, or no, text, to occur between\nthese entities. 
Each configured primary-key column's keys are searched\nfor matches.\n- If there is a separating `,` or `(` (as commonly used for\nabbreviations), splits the value into chunks, reverses them, and then\nreassembles the chunks as above for a similar search.\n- Only if there were no matches from the previous steps, splits the value\ninto words. Words that are shorter than three characters, or that occur in\nthe stopword list, are omitted. Then each word is searched for as above.\n- \"Votes\" on which primary-key value got the most matches. Tie-breaks on\nwhich primary-key value matched on the shortest key in the relevant\n`$pk_map` column, and then on the lexically lowest-valued primary-key\nvalue, to ensure stable return values.\n\n# SEE ALSO\n\n[Text::CSV](https://metacpan.org/pod/Text%3A%3ACSV)\n\n# LICENSE AND COPYRIGHT\n\nCopyright (C) Ed J\n\nThis library is free software; you can redistribute it and/or modify\nit under the same terms as Perl itself.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohawk2%2Fdata-prepare","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohawk2%2Fdata-prepare","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohawk2%2Fdata-prepare/lists"}