An open API service indexing awesome lists of open source software.

https://github.com/ir33k/billion

My attempt of solving The One Billion Row Challenge in C
https://github.com/ir33k/billion

Last synced: 3 months ago
JSON representation

My attempt of solving The One Billion Row Challenge in C

Awesome Lists containing this project

README

        

One Billion Row Challenge
=========================

My attempt in solving The One Billion Row Challenge [1] without Hash
Map in C. Also I'm doing this on very old ThinkPad X220 laptop [2].

[1] https://github.com/gunnarmorling/1brc/
[2] CPU: Intel i5-2520M (4) @ 3.200GHz

Rules and limits that I follow, from original repo 1brc/README.md, that
are relevant to solutions not written in Java:

- No external library dependencies may be used.
- Implementations must be provided as a single source file.
- The computation must happen at application _runtime_, i.e. you cannot
process the measurements file at _build time_ (for instance, when using
GraalVM) and just bake the result into the binary.
- Input value ranges are as follows:
- Station name: non null UTF-8 string of min length 1 character and
max length 100 bytes, containing neither `;` nor `\n` characters.
- Temperature value: non null double between -99.9 (inclusive) and
99.9 (inclusive), always with one fractional digit.
- There is a maximum of 10,000 unique station names.
- Line endings in the file are `\n` characters on all platforms.
- Implementations must not rely on specifics of a given data set, e.g. any
valid station name as per the constraints above and any data distribution
(number of measurements per station) must be supported.
- The rounding of output values must be done using the semantics of IEEE
754 rounding-direction "roundTowardPositive".

Build:

$ ./build

Generate input files with varing number of lines:

$ ./gen 1000000 > data1m.tmp
$ ./gen 1000000000 > data1b.tmp

Run solution program with one of input files:

$ ./solve < data1m.tmp
$ ./solve < data1b.tmp

Devlog
======

2024-07-07 Sun 10:40
--------------------

I started by writing my own input data generator gen.c. Just because I
don't want to insetall Java to run the data generators from original repo.
They also have an Python script [1] but the code is very slopy. I don't want
to use it. My version is mostly a clone of CreateMeasurements.java [2]
from original repo.

Generation of 1B lines took around 8 minutes and file weights 13 GB.

[1] src/main/python/create_measurements.py
[2] src/main/java/dev/morling/onebrc/CreateMeasurements.java