https://github.com/ngrunwald/meta-csv
A Clojure smart reader for CSV files
https://github.com/ngrunwald/meta-csv
csv data-analysis
Last synced: 28 days ago
JSON representation
A Clojure smart reader for CSV files
- Host: GitHub
- URL: https://github.com/ngrunwald/meta-csv
- Owner: ngrunwald
- License: epl-2.0
- Created: 2019-06-19T10:24:30.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-09-03T13:29:08.000Z (over 3 years ago)
- Last Synced: 2025-11-22T21:08:06.879Z (4 months ago)
- Topics: csv, data-analysis
- Language: Clojure
- Size: 48.8 KB
- Stars: 28
- Watchers: 0
- Forks: 1
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# meta-csv
A smart reader for CSV files, spiritual successor to [ultra-csv](https://github.com/ngrunwald/ultra-csv).
## Features
- Smart statistical heuristics to guess pretty much anything about
your csv file, from delimiter to quotes and whether a header is present
- Handles for you the boring but dangerous stuff, like encoding detection
and bom skipping if present, but also embedded new lines and quote escaping
- Coerces the numerical values that have been recognised. The types and
coercions can be extended by the user to *dates*, *phone numbers*, etc.
- Designed to be both very easy to use in an exploratory way to get a quick
feel for the data, and then be put into production with almost the same code
## Installation
`meta-csv` is available as a Maven artifact from
[Clojars](http://clojars.org/meta-csv):
In your `project.clj` dependencies for *leiningen*:
[](http://clojars.org/meta-csv)
## Usage
The easiest way to use when hacking at the REPL is simply:
```clojure
(require '[meta-csv.core :as csv])
```
```clojure
(first (csv/read-csv "./dev-resources/samples/marine-economy-2007-18.csv"))
=> {:year 2007,
:category "Fisheries and aquaculture",
:variable "Cont. to ME Wage and salary earners",
:units "Proportion",
:magnitude "Actual",
:source "LEED",
:data_value 43.1,
:flag "R"}
```
If the file has a header, this returns a lazy seq of maps of field names to values.
If any field name would be problematic as keyword, then all field names will be
strings instead:
```clojure
(first (csv/read-csv "./dev-resources/samples/sales-records-sample.csv"))
=> {"Region" "Australia and Oceania",
"Country" "Tuvalu",
"Item Type" "Baby Food",
"Sales Channel" "Offline",
"Order Priority" "H",
"Order Date" "5/28/2010",
"Order ID" 669165933,
"Ship Date" "6/27/2010",
"Units Sold" 9925,
"Unit Price" 255.28,
"Unit Cost" 159.42,
"Total Revenue" 2533654.0,
"Total Cost" 1582243.5,
"Total Profit" 951410.5}
```
The maps are array-maps, which means the order of the keys is the same as the
order of the fields in the file.
If no header is present, the rows will be returned as a seq of vectors, in the same fashion as [clojure.data.csv/read-csv](https://clojure.github.io/data.csv/#clojure.data.csv/read-csv).
A lot of options are available, as an optional second argument spec. Check the
docstring for a more or less exhaustive description.
This spec can actually be created by another noteworthy function, `guess-spec`.
```clojure
(csv/guess-spec "./dev-resources/samples/marine-economy-2007-18.csv")
=> {:fields
[{:field :year, :type :long}
{:field :category, :type :string}
{:field :variable, :type :string}
{:field :units, :type :string}
{:field :magnitude, :type :string}
{:field :source, :type :string}
{:field :data_value, :type :double}
{:field :flag, :type :string}],
:delimiter \,,
:bom :none,
:encoding "ISO-8859-1",
:skip-analysis? true,
:header? true,
:quoted? false}
```
Then the `:fields` vector describing the processing on each field can be
customized to produce exactly the right format of data. This spec can be used
directly as the second argument to `read-csv`.
The useful functions are extensively documented in the docstrings of the [API
Documentation](https://cljdoc.org/d/meta-csv/meta-csv/CURRENT/doc/readme).
[](https://cljdoc.org/d/meta-csv/meta-csv/CURRENT)
The test file also contains interesting examples.
## Tips and tricks
Need to get out put as an array like clojure.data.csv but with type coercions?
The `:skip` param skips the first line and the false `:header?` returns arrays.
```clojure
(first (csv/read-csv "./dev-resources/samples/marine-economy-2007-18.csv" {:skip 1 :header? false}))
=> [2007
"Fisheries and aquaculture"
"Cont. to ME Wage and salary earners"
"Proportion"
"Actual"
"LEED"
43.1
"R"]
```
One of the differences with [ultra-csv](https://github.com/ngrunwald/ultra-csv)
is that meta-csv makes no attempt at validating output data. Validation is an
important concern but should not be handled by the file format parser, even a
smart one. I recommend however in production using something like
[spec-provider](https://github.com/stathissideris/spec-provider) to generate
specs and validating the data with them when they come from a manual source.
In the same spirit, I tend to use `read-csv` at the REPL when doing analysis
work, but when and if going to production, I generate a spec with `guess-spec`
and uses that with `read-csv`, to make the process more reliable if the input
file format presents problems at a future time.
## License
Copyright © 2019-2020 Nils Grunwald
This program and the accompanying materials are made available under the
terms of the Eclipse Public License 2.0 which is available at
http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary
Licenses when the conditions for such availability set forth in the Eclipse
Public License, v. 2.0 are satisfied: GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or (at your
option) any later version, with the GNU Classpath Exception which is available
at https://www.gnu.org/software/classpath/license.html.