https://github.com/sl-solution/dlmreader.jl

High-performance delimited-file reader and writer for Julia

Host: GitHub
URL: https://github.com/sl-solution/dlmreader.jl
Owner: sl-solution
License: mit
Created: 2021-08-18T07:58:54.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-06-09T19:07:20.000Z (over 1 year ago)
Last Synced: 2024-10-19T05:18:32.369Z (about 1 year ago)
Topics: csv, csv-files, csv-import, csv-parser, dataframes, dataset, delimited-files, delimiter, fwf, high-performance, informat, julia, julia-language, threaded
Language: Julia
Homepage:
Size: 800 KB
Stars: 28
Watchers: 1
Forks: 1
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

# DLMReader

[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://sl-solution.github.io/DLMReader.jl/stable) [![](https://img.shields.io/badge/docs-dev-blue.svg)](https://sl-solution.github.io/DLMReader.jl/dev)

An efficient multi-threaded package for reading and writing delimited files. It is designed as a file parser for `InMemoryDatasets.jl`.

> DLMReader reads and writes AbstractDatasets types only, i.e. other types must be converted to/from AbstractDatasets.

It works very well for huge files (long and/or wide).

> `DLMReader` does not guess the delimiter; if it is different from `,`, it must be passed via the `delimiter` keyword argument. By default, the `DLMReader` package assumes strings are not quoted; if they are quoted, the user must pass the quote character via the `quotechar` keyword argument.

## Documentation

* [Stable](https://sl-solution.github.io/DLMReader.jl/stable)

* [Dev](https://sl-solution.github.io/DLMReader.jl/dev)

## Features

`DLMReader.jl` has some interesting features which distinguish it from other packages for reading delimited files. A few of them are listed below:

* **`Informats`**: The `DLMReader` package uses `informats` to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extensible way to parse values with special patterns. For instance, the predefined informat `COMMA!` allows users to read a numeric column that contains a thousands separator and/or a dollar sign, e.g. with this informat the raw text "`$12,000.00`" is parsed as "`12000.00`". Moreover, `informat`s support function composition, e.g. `COMMA! ∘ ACC!` parses "`$(12,000.00)`" as "`-12000.00`", i.e. `ACC!` is applied first and `COMMA!` is then applied to its result.

  * Additionally, `informats` can be applied to a whole line before individual values are processed.

* **Fixed-width text**: If users pass the column locations via the `fixed` keyword argument, the package reads those columns as fixed-width fields. For instance, passing `fixed = Dict(1=>1:1, 2=>2:2)` parses "`10`" as "`[1,0]`". Mixing fixed-width and delimited formats is also allowed.

* **Multiple observations per line**: The package allows reading more than one observation per line. This can be done by passing the `multiple_obs = true` keyword argument. The multithreading feature (plus some other features) will be switched off if this option is set.

* **Fast file writer**: The `DLMReader` package exploits the `byrow` function from [`InMemoryDatasets.jl`](https://github.com/sl-solution/InMemoryDatasets.jl) to write delimited files to disk. This enables `DLMReader` to convert values to strings using multiple threads.

* **Alternative delimiters**: Users can pass a vector of delimiters to the function. In this case, `filereader` treats any of the passed delimiters as a field delimiter.

* **Multiple date formats**: Users can pass different date formats for different columns via the `dtformat` keyword argument.

* **Different integer base**: The `DLMReader` package allows users to pass the integer base when it is different from 10.

* **String as delimiter**: Users can pass a string as the delimiter between values. It must be passed via the `dlmstr` keyword argument.

* **Informative warnings/info**: If something goes wrong during the reading phase, the package provides detailed warnings/info to help the user investigate the issue.
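Three of the features above (fixed-width columns, alternative delimiters, and a non-decimal integer base) can be illustrated with plain Base Julia. The sketch below is only meant to show what each option means for a single line of text; it is not how `DLMReader` is implemented, and the helper names `read_fixed`, `split_any` and `parse_in_base` are made up for this illustration.

```julia
# Illustration only: these helpers are hypothetical and are NOT part of DLMReader.

# Fixed-width: cut each field out of the line by its character range, then parse it.
# Assumes ASCII text, so character positions and string indices coincide.
function read_fixed(line::AbstractString, cols::AbstractVector{<:AbstractUnitRange})
    return [parse(Int, strip(line[r])) for r in cols]
end

# Alternative delimiters: any character in `dlms` ends a field.
split_any(line::AbstractString, dlms::Vector{Char}) = split(line, dlms)

# Integer base: interpret the digits in a base other than 10.
parse_in_base(text::AbstractString, base::Integer) = parse(Int, text; base = base)

println(read_fixed("10", [1:1, 2:2]))               # [1, 0]
println(split_any("x1;x2:x3,x4", [';', ':', ',']))  # four fields: x1, x2, x3, x4
println(parse_in_base("ff", 16))                    # 255
```

The first call mirrors the `fixed = Dict(1=>1:1, 2=>2:2)` example from the list, and the second uses the header line of `ex5.csv` from the Examples section.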

## Benchmarks

See [here](https://discourse.julialang.org/t/ann-dlmreader-the-most-versatile-julia-package-for-reading-delimited-files-yet/81899) for some benchmarks.

## Examples

The following files are used in the examples. They are assumed to be located in the current working directory.

ex1.csv

```text
a, b, c
1,2,NA
2,3,2001-1-2
2,4,2020-4-2
1,2,2000-12-1
```

ex2.csv

```text
a::b::C::DD
12::1345::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::::15
12::13::15::15
12::13::15::157
```

ex3.csv

```text
1
2
4;5
6
8;9
1
4;
```

ex4.csv

```text
1   3,5
2   4,6
33  5,7
```

ex5.csv

```text
x1;x2:x3,x4
1;2;123;3
2;4,4,5
```

ex6.csv

```text
id1 $2,000,000 3
id2 $34,000 4
id3 $200,000 1
```

The following code reads them into Julia:

```julia
julia> using DLMReader

julia> filereader("ex1.csv", dtformat = Dict(3 => dateformat"y-m-d"))

julia> filereader("ex2.csv", dlmstr = "::")

julia> filereader("ex3.csv", types = [Int, Int, Int], header = false, linebreak = ';', delimiter = '\n')

julia> filereader("ex4.csv", fixed = Dict(1 => 1:4), header = false)

julia> filereader("ex5.csv", delimiter = [';', ':', ','])

julia> filereader("ex6.csv", delimiter = ' ', informat = Dict(2=>COMMA!), header = [:ID, :price, :quarter])
```

`COMMA!` is a built-in informat which removes commas from numbers. If a number contains a dollar or pound sterling sign, it removes that as well. The cleaned text is then sent to the parser to be converted to a number.
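To make the idea of an informat more concrete, here is a minimal sketch in plain Julia of the text clean-up described above, including the composition order mentioned under Features (`ACC!` first, then `COMMA!`). The functions `strip_currency` and `accounting_to_negative` are hypothetical stand-ins written for this illustration; they are not the package's implementations of `COMMA!` and `ACC!`, which may handle more cases.

```julia
# Illustration only: hypothetical stand-ins, NOT DLMReader's COMMA! and ACC!.

# Roughly what COMMA! is described as doing: drop commas and currency signs.
strip_currency(s::AbstractString) = replace(s, r"[$£,]" => "")

# Roughly what ACC! is described as doing: accounting-style "(x)" becomes "-x".
function accounting_to_negative(s::AbstractString)
    occursin('(', s) || return String(s)
    return replace(replace(s, "(" => "-"), ")" => "")
end

# `f ∘ g` applies g first and then f, matching the order described for COMMA! ∘ ACC!.
clean = strip_currency ∘ accounting_to_negative

# The cleaned text is what gets handed to the numeric parser.
parse_amount(s::AbstractString) = parse(Float64, clean(s))

println(parse_amount(raw"$12,000.00"))    # 12000.0
println(parse_amount(raw"$(12,000.00)"))  # -12000.0
```

The `raw"..."` string literals are used only so that `$` is not treated as string interpolation.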

## Extra examples

```julia
julia> filereader(IOBuffer("1,2,3,4,5\n6,7,8\n10\n"),
                  header = [:x1, :x2],
                  types = [Int, Int],
                  multiple_obs = true)
5×2 Dataset
 Row │ x1        x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         2
   2 │        3         4
   3 │        5         6
   4 │        7         8
   5 │       10   missing

julia> filereader(IOBuffer(""" name1 name2 avg1 avg2  y
              0   A   D   75   5    32
              1   A   D   75   5    32
              2   D   L   32   7    12
              3   F   C   99   8    42
              4   F   C   99   8    42
              5   C   A   43   6    39
              6   C   A   43   6    39
              7   L   R   53   3    11
              8   R   F   21   2    25
              9   R   F   21   2    25
              """), delimiter = ' ', ignorerepeated = true, emptycolname = true)
10×6 Dataset
 Row │ NONAME1   name1     name2     avg1      avg2      y
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    String?   String?   Int64?    Int64?    Int64?
─────┼────────────────────────────────────────────────────────────
   1 │        0  A         D               75         5        32
   2 │        1  A         D               75         5        32
   3 │        2  D         L               32         7        12
   4 │        3  F         C               99         8        42
   5 │        4  F         C               99         8        42
   6 │        5  C         A               43         6        39
   7 │        6  C         A               43         6        39
   8 │        7  L         R               53         3        11
   9 │        8  R         F               21         2        25
  10 │        9  R         F               21         2        25
```
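The output of the first example suggests that, with `multiple_obs = true`, values are consumed as one stream across line breaks: the `5` that ends the first line is paired with the `6` that starts the second. The sketch below reproduces those five rows in plain Julia under that reading. It is an assumption drawn from the printed output, not a description of the package's internals, and `chunk_observations` is a hypothetical helper.

```julia
# Illustration only: a hypothetical helper, NOT DLMReader code.
# Treat the input as one stream of values and cut it into observations
# of `ncols` values each, padding the last one with `missing`.
function chunk_observations(text::AbstractString, ncols::Int)
    # Both the delimiter and the line break separate values here.
    tokens = split(text, [',', '\n']; keepempty = false)
    vals = Union{Int, Missing}[parse(Int, t) for t in tokens]
    # Pad the final observation when the stream runs out early.
    while length(vals) % ncols != 0
        push!(vals, missing)
    end
    return [vals[i:i+ncols-1] for i in 1:ncols:length(vals)]
end

rows = chunk_observations("1,2,3,4,5\n6,7,8\n10\n", 2)
foreach(println, rows)
```

Each printed vector corresponds to one row of the 5×2 `Dataset` shown above, with `missing` filling the second column of the last row.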
