{"id":41319606,"url":"https://github.com/lyrasis/csv-data-tools","last_synced_at":"2026-01-23T05:49:27.368Z","repository":{"id":43810765,"uuid":"394480523","full_name":"lyrasis/csv-data-tools","owner":"lyrasis","description":"Tools for working with CSV data (or other basic tabular data formats)","archived":false,"fork":false,"pushed_at":"2025-11-19T00:27:08.000Z","size":125,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-11-19T02:23:10.795Z","etag":null,"topics":["csv","migrations","tabular-data"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyrasis.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-08-10T00:34:22.000Z","updated_at":"2025-11-07T22:21:39.000Z","dependencies_parsed_at":"2025-02-24T22:22:03.286Z","dependency_job_id":"479e62bc-d101-43ce-a671-951ed8548c2c","html_url":"https://github.com/lyrasis/csv-data-tools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lyrasis/csv-data-tools","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyrasis%2Fcsv-data-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyrasis%2Fcsv-data-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyrasis%2Fcsv-data-tools/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyrasis%2Fcsv-data-tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyrasis","download_url":"https://codeload.github.com/lyrasis/csv-data-tools/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyrasis%2Fcsv-data-tools/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28681344,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-23T05:48:07.525Z","status":"ssl_error","status_checked_at":"2026-01-23T05:48:07.129Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","migrations","tabular-data"],"created_at":"2026-01-23T05:49:26.873Z","updated_at":"2026-01-23T05:49:27.348Z","avatar_url":"https://github.com/lyrasis.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":":toc:\n:toc-placement!:\n:toclevels: 4\n\nifdef::env-github[]\n:tip-caption: :bulb:\n:note-caption: :information_source:\n:important-caption: :heavy_exclamation_mark:\n:caution-caption: :fire:\n:warning-caption: :warning:\nendif::[]\n\n= csv-data-tools\n\nTools for working with CSV data (or other basic tabular data formats)\n\ntoc::[]\n\n== chars_in_field.rb\n\nA silly utility script to output a list of unique characters used in specified field(s) of a CSV, or the whole CSV if no fields are 
specified.\n\nInitial use case: find all characters used in CollectionSpace media file names, so I could replace any characters unsafe for use in S3 key names with something else.\n\nDo `ruby chars_in_field.rb -h` for usage details.\n\n== `check_structure.rb`\nNOTE: Supports comma, tab, or pipe delimited files currently\n\n* Writes out a report on whether the rows of a CSV all have the same number of columns\n* If a file has any ragged rows, it tells you how many columns the weird rows have and gives you the row numbers of up to 3 rows having that number of columns.\n* Where the example field for a non-ok file is blank, that's the expected/correct number of columns based on the number of headers\n\n== `compact_csvs.rb`\n\nRewrites CSV files, omitting any columns having no values. See usage comments at top of script.\n\n== `compile_profiled.rb`\n\nWARNING: This initial version is very specific to the CSU project, but this might be generalizable for other uses.\n\nGenerates a multi-tabbed Excel worksheet compiling:\n\n* unique variable values used in non-id and non-date fields, with per-data-source and total occurrence counts for each value\n* unique id patterns used in id fields\n* unique date patterns used in date fields\n\nThe main value of this tool is found if:\n\n* you have *multiple clients/users/data sources* with data sets that use the same data input template(s) or metadata standard\n\nIt allows you to get an overview of data to answer questions like:\n\n* has this field been used in generally the same way across data sources?\n* where two data input templates have the same field name/column header, has that field been used consistently across the templates?\n\nSee https://3.basecamp.com/3410311/buckets/38281121/cloud_files/8346461511#__recording_8347588463[this locked-to-Lyrasis Basecamp file description] for details on what this produces and how you might use it.\n\nAssumes you have generated profile detail files for all data sources by running:\n\n`ruby 
profiler.rb -i ~/data/export/ -o ~/data/profiled -s tsv -c '{col_sep: \"\\t\"}' -d compiled`\n\nThe output path from the above will be the input path for this script.\n\n\n== `convert_encoding.rb`\n\nConverts files whose encoding is not UTF-8 or ascii to UTF-8.\n\nDo `ruby convert_encoding.rb -h` for usage info.\n\nYou need to create an output directory to write converted files into before running the script.\nIt does not overwrite original files.\nIf the converted files look good, you can copy them back to the original location, replacing originals.\n\n[NOTE]\n====\ncharlock_holmes' encoding detection is not the same as whatever the `file --mime-encoding` command uses.\nThis script usually looks like it converts way too many files.\nThey generally end up all being ascii/UTF-8, though, so no worries.\n====\n\n[IMPORTANT]\n====\nIf you have not installed or used the https://github.com/brianmario/charlock_holmes?tab=readme-ov-file[`charlock_holmes`] gem before, you need to:\n\n* Install icu4c (https://icu.unicode.org/home[C/C++ and Java libraries for Unicode and globalization]) if you do not have it:\n\n`brew install icu4c`\n\nCheck whether you have this installed by doing `brew info icu4c`.\n\n* Run the following command in your terminal before running this script:\n\n`bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c`\n====\n\n== `csv_to_xlsx.rb`\n\nConvert all CSVs in a directory to XLSX files where all data values are encoded as STRINGS.\nThis minimizes the chance that clients opening files directly in Excel will have data messed up by Excel being \"helpful\".\n\nCurrently only works on actual CSVs, not tab-delimited or other tabular formats.\n\n== `dbf2csv.pl`\n\nA Perl script Brian used to convert DBF files to CSV format. 
DBF files are from the FoxPro database management system, which is the backend for PastPerfect desktop editions.\n\nI have no idea if this still works, or if there are better options out there.\n\n== `empty_columns.rb`\n\nInitially, works on a single CSV file. Outputs to STDOUT a list of column headers where there are literally no data values in the column (i.e. column value is nil or an empty string).\n\nUse case: quick list for use in quotes/estimates.\n\nNOTE: Isn't smart enough to recognize stuff like \"This column value is always `0` so it's probably a default value automatically set by the source system and doesn't mean anything\"\n\n== `field_signatures.rb`\n\n\"Field signature\" refers to \"the fields in a row that are actually populated.\"\n\nGiven a path to a CSV file, this script generates a new CSV that lets you see what field signatures are used in the source file. Number of occurrences and example IDs of records having each field signature are output, along with one column per source file column. 
If the column has \"X\" then it's populated in the field signature.\n\nYou will need to customize the `ID_HEADERS` variable in the script to include the name of any column(s) in your file that should be treated as row IDs.\n\n== `file_info.rb`\n\nGiven path to a directory, a file suffix, and an optional minimum file size (in bytes)...\n\nPrints to screen the following info for all files with that suffix in the directory:\n\n* size\n* result of unix `file` command (file encoding, presence of \"very long lines,\" line terminators)\n\nDo `ruby file_info.rb -h` for parameters/options.\n\nExample output:\n\n----\ndbo_ZLocalUseAttribs.txt:\n  size: 814242\n  info: Little-endian UTF-16 Unicode text, with CRLF line terminators\n\n\ndbo_ZResultFields.txt:\n  size: 13274\n  info: Little-endian UTF-16 Unicode text, with CRLF line terminators\n\n\ndbo_dtproperties.txt:\n  size: 34462\n  info: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators\n----\n\n== `find_ctrl_chars.rb`\n\nWorks on one file at a time. Reports out any non-EOL characters in https://www.compart.com/en/unicode/category/Cc[the Unicode Control category], in context so the locations can be found in file.\n\nWARNING: output is weird for combined/composed characters, but currently works ok enough for identifying issues that I'm not putting more time into fixing it.\n\nDo `ruby find_ctrl_chars.rb -h` for parameters/options.\n\n== `obfuscator.rb`\n\nReplaces all field values in a CSV that do not match the SKIP_PATTERNS with an MD5 digest (translated to bubblebabble) of the\noriginal value.\nHeader values and fields matching SKIP_PATTERNS regular expressions are left as-is.\n\nCurrently only works on CSVs (not other tabular data formats)\n\nCurrently includes known SKIP_PATTERNS for PPWE.\n\n*See comments at top of script for more on how this works*\n\n== `populated_columns.rb`\n\nInitially, works on a single CSV file. 
Outputs to STDOUT a list of column headers where there is at least one non-empty/nil value in the column.\n\nUse case: quick list for use in quotes/estimates.\n\n== `profiler.rb`\n\nUsage example: Defaults:\n\n`ruby profiler.rb -i ~/data/export/ -o ~/data/profiled`\n\nAll `.csv` files in `~/data/export` directory are included, and the default details mode is `files`. Default options sent to Ruby standard library CSV parser are:\n\n[source,ruby]\n----\n{headers: true, header_converters: [:downcase], converters: [:stripplus],\n  skip_blanks: true, empty_value: nil}\n----\n\nUsage example: compiled details for tab-separated .tsv files:\n\n`ruby profiler.rb -i ~/data/export/ -o ~/data/profiled -s tsv -c '{col_sep: \"\\t\"}' -d compiled`\n\nAll `.tsv` files in `~/data/export` directory are included. The Ruby standard library CSV parser option `col_sep: \"\\t\"` is merged into the default option hash shown above.\n\n=== When `--details file`\n\nOne `.csv` file written to output directory per table column.\n\nFor example, if source file `addresses.csv` has a `:city` column, there is an `address_city.csv` file written.\n\nThe output CSV has one row per unique value found in the source column. The first column is the occurrence count of the value in the source column. 
The second column is the value.\n\n=== When `--details compile`\nGiven a directory containing CSV files, writes out two CSV reports:\n\n* summary - a row for each column in source CSVs, with the following columns:\n** table - source CSV name\n** column - column name\n** column index - for putting them in the order in which they appear in source document\n** uniq vals - count of unique values found in column\n** null vals - count of empty cells in column\n\n* details - a row for each unique value in each column in source CSVs, with the following columns:\n** table - source CSV name\n** column - column name\n** column index - for putting them in the order in which they appear in source document\n** value - a unique value found in column (puts \"NULL VALUE/EMPTY FIELD\" to represent that)\n** occurrences - number of times the value occurs in column\n\nWARNING: There's a known bug where not all apparently empty fields are getting counted as \"NULL VALUE/EMPTY FIELD\". The number that get left out is small and I didn't have time to chase this down now, but will try to the next time I need this thing.\n\n== `reformat_csv.pl`\n\n[[reformatcsv]]Reformats a list of CSVs, allowing you to change the separator and escape characters.  Output is to STDOUT.\n\n[TIP]\n====\nThis can handle parsing `\\n` inside quoted fields that contain unescaped quotes. We did not find a Ruby CSV parsing solution that handled this particular flavor of CSV horror. 
_(Thanks, potential client legacy system which shall not be named...)_\n\nTo run this script on all files in a directory, writing the reformatted files to another directory, see \u003c\u003creformattables,`reformat_tables.rb`\u003e\u003e.\n====\n\nUsage: `reformat_csv.pl [options] FILES`\n\nUsage example: `perl reformat_csv.pl --input_sep ';' --input_esc '#' ~/data/test.csv \u003e ~/data/test_fix.csv`\n\n.Options:\n- input_sep - Separator character in input CSVs (default: ,)\n- input_esc - Escape character in input CSVs (default: \")\n- output_sep - Separator character in output CSVs (default: ,)\n- output_esc - Escape character in output CSVs (default: \")\n\nTIP: To pass TAB as `input_sep` or `output_sep`, use the literal tab character by typing `Ctrl-v`, then `TAB` on the command line.\n\nWhile handy, this program primarily exists to take advantage of Text::CSV_XS's ability to deal with unescaped quotes in fields. To do this, set input_esc to anything other than '\"', for instance '#'.\n\n*Requires you have the Text::CSV_XS Perl module installed*\n\n== `reformat_tables.rb`\n\n[[reformattables]]This is a wrapper around `reformat_csv.pl`. *It requires you have Perl and the `Text::CSV_XS` module installed.*\n\nThe input/output sep and esc options are the same as described for \u003c\u003creformatcsv,`reformat_csv.pl`\u003e\u003e\n\nThe only required argument is `--input` (or `-i`), which specifies the directory containing the tabular data files you wish to reformat.\n\nIf no `--output`/`-o` value is given, a new directory called `reformatted` is created in your `--input` directory, and reformatted files are saved in new directory. Any other directory value can be provided. 
If the directory does not exist at run time, it will be created.\n\nFile suffix (`--suffix`/`-s`) defaults to `csv`.\n\nUsage example:\n\n`ruby reformat_tables.rb -i ~/data/lafayette/export --input_sep ';' --input_esc \"#\" --output_sep '    '`\n\nWrites semicolon delimited .csv files with unescaped quotes to tab-delimited.\n\n== `show_rows_with_column_ct.rb`\n\nMeant to be used to investigate specific files reported by `clean_convert_report.rb` as having bad structure (i.e. ragged columns: some rows having a different number of columns than other rows)\n\nGiven path to file, delimiter name, number of columns you want to see rows for, and optional number of rows you want to see...\n\nOutputs to screen rows with the given number of columns.\n\nThis is useful for coming up with the specific find/replace mess you are going to have to implement to keep rows from being broken up in a ragged way.\n\nGenerally I use this iteratively with edits made to a migration-specific copy of `clean_convert_report.rb` to eliminate or minimize the number of ragged-column files I end up having to manually fix for a migration.\n\n== `tables_and_columns.rb`\n\nUtility script for creating data review spreadsheet.\n\nGiven a directory containing tabular data files, writes an .xlsx file with 2 tabs:\n\n* tables\n** table\n** column ct\n** row ct\n\n* columns\n** table\n** column name\n\nDo `ruby tables_and_columns.rb -h` for parameters/options.\n\n== `table_preview.rb`\n\nUseful for initial data review work.\n\nReads all files with given file suffix in the given directory. For each, prints out the file/table name, headers, and the first X (set max num of rows when script is run) rows of data, nicely formatted, in one text file you can scroll/search through. 
You don't have to open a million files to get your head around the general shape and character of the data.\n\n*Requires `csvlook` from https://csvkit.readthedocs.io/en/latest/index.html[csvkit] to be installed and available in your PATH*\n\nDo `ruby table_preview.rb -h` for parameters/options.\n\n== `xlsx_to_csv.rb`\n\n*Requires `in2csv` from https://csvkit.readthedocs.io/en/latest/index.html[csvkit] to be installed and available in your PATH*\n\nConverts all .xlsx files in the given directory to .csv.\nEach sheet in an .xlsx file is written to a separate .csv file.\nThe script deletes .csvs written out for empty sheets.\n\nIf given an output directory, the script moves the output CSVs into that directory.\nIf not, it creates a new `csv` directory inside the given input directory and moves all CSV files generated into that subdirectory.\n\n.Tread with extreme caution using this script\n[WARNING]\n====\nIt does not detect the encoding of the .xlsx and pass that to https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html[in2csv].\nKeep an eye out for character encoding problems.\n\nFor some reason it sometimes detects empty columns as \"unnamed columns,\" warns about them, and adds placeholder header names to them.\nThis is true even when I pass in2csv the `--reset-dimensions` flag.\nThis seems to be sometimes caused by formatting having been applied to an entire row, rather than to just populated cells.\nIt may also be caused by a weird \"last cell\" assignment.\nDo `Find \u0026 Select \u003e Special \u003e Last cell` to jump to the assigned last cell.\nSee also https://support.microsoft.com/en-us/office/locate-and-reset-the-last-cell-on-a-worksheet-c9e468a8-0fc3-4f69-8038-b3c1d86e99e9[Microsoft documentation] that might help you fix this kind of thing.\n\nPeople format their Excel worksheets in all kinds of terrible ways to make them unusable as data.\nI have no idea what this does with stuff like:\n\n* blank lines inserted in the data\n* \"hanging\" data (when 
a value applying to many rows is only filled in on one row, and visually/mentally filled in downward, *_IF YOU ARE A HUMAN LOOKING AT IT_*)\n* merged cells\n\nin2csv is called with `--no-inference` and both `--date-format` and `--datetime-format` set to `-` (i.e. do not mess with this).\nThis _should_ have the effect of everything being written to CSV as text values, but I honestly don't know for sure how well this works.\n====\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyrasis%2Fcsv-data-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyrasis%2Fcsv-data-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyrasis%2Fcsv-data-tools/lists"}