https://github.com/fogfish/csv

csv parser, optimized for performance

CSV-file parser
***************

Copyright (C) 2012, Dmitry Kolesnikov

This file is free documentation; unlimited permissions are given to copy,
distribute and modify the documentation.

This library is free software; you can redistribute it and/or modify
it under the terms of the 3-clause BSD License (the "License"),
as published by http://www.opensource.org/licenses/BSD-3-Clause.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                                                   !!!
!!!   WARNING                                                         !!!
!!!   The library is not supported.                                   !!!
!!!   Use CSV feature of https://github.com/fogfish/feta              !!!
!!!                                                                   !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Introduction
============

A simple CSV-file parser based on an event model. The parser generates an
event/callback for each parsed CSV line and supports both sequential and
parallel parsing. The major goal is the performance of the intake
procedure, with a parsing target of 3 - 4 microseconds per line on the
reference hardware.

                    Acc
                 +--------+
                 |        |
                 V        |
              +---------+ |
----Input---->| Parser  |--------> AccN
      +       +---------+
     Acc0          |
                   V
              Event Line

The parser takes as input a binary stream, an event handler function, and
an initial state/accumulator. The event function is evaluated against the
current accumulator and the parsed line of the CSV file. Note: the
accumulator allows application-specific state to be carried through the
event functions.
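
The event model described above can be sketched as a plain fold over
lines. This is only an illustration under stated assumptions, not the
library's implementation: the module `csv_fold_sketch' and the function
`fold_lines/3' are hypothetical names, quoted fields are ignored, and the
fields are passed to the event function in their natural order.

```erlang
-module(csv_fold_sketch).
-export([fold_lines/3, count_lines/1]).

%% Hypothetical helper (not part of the csv library).
%% Splits Bin into lines and fields, then calls Fun({line, Fields}, Acc)
%% once per line, threading the accumulator from Acc0 to the final value.
fold_lines(Bin, Fun, Acc0) ->
    Lines = binary:split(Bin, <<"\n">>, [global, trim_all]),
    lists:foldl(
        fun(Line, Acc) ->
            Fields = binary:split(Line, <<",">>, [global]),
            Fun({line, Fields}, Acc)
        end,
        Acc0,
        Lines).

%% Example event function: the accumulator is a simple line counter.
count_lines(Bin) ->
    fold_lines(Bin, fun({line, _Fields}, N) -> N + 1 end, 0).
```

The accumulator can be any term; a counter is used here only because it
is the smallest state that shows how each event receives the result of
the previous one.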

Compile and build
=================

The library source code is available from the git repository

git clone https://github.com/fogfish/csv.git

Briefly, the shell command `./configure; make; make install' should
configure, build, and assemble the distribution package. The following
instructions are specific to this package; see the `INSTALL' file for
instructions specific to the GNU build tools.

The `configure' shell script attempts to guess the dependencies and system
configuration required to build the library. The following build-time
option exists:

--with-erlang={prefix_to_otp}
      supplied to `./configure', binds the library to the chosen Erlang
      runtime if multiple Erlang environments are available on the build
      machine

The high-performance version of the library shall be built with native targets

make BUILD=native

Interface
=========

Briefly, the sequence of operations for data parsing/intake is as follows;
see the src/csv.erl file for the detailed interface specification and/or
the example parser at priv/csv_example.erl

%% Define an event function that takes two arguments: the line value
%% and the accumulator. The function shall return a new accumulator
%% state. The structure of the accumulator is application specific and
%% might vary from an integer to a complex record.
Fun = fun({line, L}, #my_record{count = C} = Acc0) ->
   do_my_intake_to_somewhere(lists:reverse(L)),
   Acc0#my_record{count = C + 1}
end

%%
%% A sequential parse: parses the whole data stream in the client process.
csv:parse(CSV, Fun, #my_record{})

%%
%% A parallel parse: splits the CSV into multiple chunks, spawns
%% multiple processes (one process per chunk), and aggregates the
%% results in the client process.
csv:parse(CSV, 20, Fun, #my_record{})
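
The chunk-per-process idea behind the parallel parse can be sketched as
follows. This is an assumption-laden illustration, not the library's
code: the module `csv_par_sketch' and all of its functions are
hypothetical, and the library's own chunking and aggregation may differ.
The sketch only counts lines, cuts chunks at newline boundaries so that
no line is split, and sums the partial results in the calling process.

```erlang
-module(csv_par_sketch).
-export([count_parallel/2]).

%% Hypothetical illustration: count the lines of Bin using up to N
%% worker processes, one per chunk.
count_parallel(Bin, N) when N > 0 ->
    Chunks = split_chunks(Bin, N),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, count_lines(Chunk)} end),
                Ref
            end || Chunk <- Chunks],
    %% Aggregate the partial results in the client process.
    lists:sum([receive {Ref, Count} -> Count end || Ref <- Refs]).

%% Cut Bin into at most N chunks; each cut is moved forward to the next
%% newline so that a line never spans two chunks.
split_chunks(Bin, N) ->
    Size = max(1, byte_size(Bin) div N),
    split_chunks(Bin, Size, []).

split_chunks(<<>>, _Size, Acc) ->
    lists:reverse(Acc);
split_chunks(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    lists:reverse([Bin | Acc]);
split_chunks(Bin, Size, Acc) ->
    Tail = binary:part(Bin, Size, byte_size(Bin) - Size),
    case binary:match(Tail, <<"\n">>) of
        nomatch ->
            lists:reverse([Bin | Acc]);
        {Pos, 1} ->
            Cut = Size + Pos + 1,
            <<Chunk:Cut/binary, Rest/binary>> = Bin,
            split_chunks(Rest, Size, [Chunk | Acc])
    end.

%% Count the non-empty lines of one chunk.
count_lines(Chunk) ->
    length(binary:split(Chunk, <<"\n">>, [global, trim_all])).
```

Summing works here because counting is order independent; an
application whose accumulator depends on line order would need a
different way to merge the per-chunk results.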

Performance
===========

Reference platform:
* MacMini, Lion Server
* 1x Intel Core i7 (2 GHz), 4 cores
* L2 Cache 256 KB per core
* L3 Cache 6 MB
* Memory 4 GB 1333 MHz DDR3
* Disk 750 GB 7200 rpm WDC WD7500BTKT-40MD3T0
* Erlang R15B + native build of the library

The data set has the following pattern: key, date, time, float numbers,
and a zz suffix
* key{1..300 000},2012-03-25,23:26:15.543,166.280,...,zz

The number of keys is 300 000, and the number of float fields is 8, 24,
or 40 in the reference data. The reference data set is generated by the
command

make example

or

perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt


version 0.0.1

E/Parse         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K, 8 flds        23.41      91.722       350.000            1.16
300K, 24 flds       50.42     489.303       697.739            2.33
300K, 40 flds       77.43     780.296       946.003            3.15


ET/hash         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K, 8 flds        23.41      91.722       384.598            1.28
300K, 24 flds       50.42     489.303       761.414            2.54
300K, 40 flds       77.43     780.296      1047.329            3.49


ET/tuple        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K, 8 flds        23.41      91.722       228.306            0.76
300K, 24 flds       50.42     489.303       601.025            2.00
300K, 40 flds       77.43     780.296       984.676            3.28

ETL/ets         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K, 8 flds        23.41      91.722      1489.543            4.50
300K, 24 flds       50.42     489.303      2249.689            7.50
300K, 40 flds       77.43     780.296      2519.401            8.39

ETL/pts         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K, 8 flds        23.41      91.722       592.886            1.98
300K, 24 flds       50.42     489.303      1190.745            3.97
300K, 40 flds       77.43     780.296      1734.898            5.78
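
In most rows above, the Per Line figure matches the Handle time divided
by the 300 000 lines of the data set (for example, 697.739 ms over
300 000 lines is about 2.33 us). A measurement of this kind might be
taken as sketched below. The module `csv_time_sketch' and its functions
are hypothetical and are not the benchmark code used to produce the
tables; `timer:tc/1' is the standard OTP call that returns the elapsed
time in microseconds.

```erlang
-module(csv_time_sketch).
-export([per_line_us/2, measure/1]).

%% Convert a total handling time in milliseconds into microseconds
%% per line.
per_line_us(HandleMs, Lines) when Lines > 0 ->
    HandleMs * 1000 / Lines.

%% Time a zero-arity fun that performs the intake and returns the
%% number of lines it handled (assumed to be greater than zero).
%% Returns the average cost in microseconds per line.
measure(Fun) ->
    {Micros, Lines} = timer:tc(Fun),
    Micros / Lines.
```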