https://github.com/ir33k/billion
My attempt at solving The One Billion Row Challenge in C
- Host: GitHub
- URL: https://github.com/ir33k/billion
- Owner: ir33k
- Created: 2024-07-07T09:20:10.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2024-07-07T21:18:45.000Z (11 months ago)
- Last Synced: 2025-01-08T16:30:08.870Z (5 months ago)
- Language: C
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
One Billion Row Challenge
=========================

My attempt at solving The One Billion Row Challenge [1] without a Hash
Map in C. Also, I'm doing this on a very old ThinkPad X220 laptop [2].

[1] https://github.com/gunnarmorling/1brc/
[2] CPU: Intel i5-2520M (4) @ 3.200GHz

Rules and limits that I follow, from the original repo's 1brc/README.md,
that are relevant to solutions not written in Java:

- No external library dependencies may be used.
- Implementations must be provided as a single source file.
- The computation must happen at application _runtime_, i.e. you cannot
  process the measurements file at _build time_ (for instance, when using
  GraalVM) and just bake the result into the binary.
- Input value ranges are as follows:
  - Station name: non null UTF-8 string of min length 1 character and
    max length 100 bytes, containing neither `;` nor `\n` characters.
  - Temperature value: non null double between -99.9 (inclusive) and
    99.9 (inclusive), always with one fractional digit.
- There is a maximum of 10,000 unique station names.
- Line endings in the file are `\n` characters on all platforms.
- Implementations must not rely on specifics of a given data set, e.g. any
  valid station name as per the constraints above and any data distribution
  (number of measurements per station) must be supported.
- The rounding of output values must be done using the semantics of IEEE
  754 rounding-direction "roundTowardPositive".

Build:
    $ ./build

Generate input files with varying numbers of lines:

    $ ./gen 1000000 > data1m.tmp
    $ ./gen 1000000000 > data1b.tmp

Run the solution program with one of the input files:

    $ ./solve < data1m.tmp
    $ ./solve < data1b.tmp

Devlog
======

2024-07-07 Sun 10:40
--------------------

I started by writing my own input data generator, gen.c, just because I
don't want to install Java to run the data generators from the original
repo. They also have a Python script [1], but the code is very sloppy and
I don't want to use it. My version is mostly a clone of
CreateMeasurements.java [2] from the original repo.

Generation of 1B lines took around 8 minutes and the file weighs 13 GB.

[1] src/main/python/create_measurements.py
[2] src/main/java/dev/morling/onebrc/CreateMeasurements.java
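
For readers who want to see the general shape of such a generator, here
is a minimal sketch. It is not the gen.c from this repository. The
station table, the function names (to_tenths, format_tenths,
approx_gaussian, gen) and the standard deviation of 10 are all
assumptions made for illustration. The noise is only an approximation of
a normal distribution, obtained by summing 12 uniform samples, which
keeps the sketch free of libm.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical station table; a real generator would use a longer list. */
struct station { const char *name; double mean; };

static const struct station stations[] = {
	{ "Hamburg",    9.7 },
	{ "Kraków",     9.3 },
	{ "Singapore", 27.0 },
};
#define NSTATIONS (sizeof stations / sizeof stations[0])

/* Clamp to the allowed range [-99.9, 99.9] and round to whole tenths. */
int to_tenths(double t)
{
	if (t > 99.9) t = 99.9;
	if (t < -99.9) t = -99.9;
	return (int)(t * 10.0 + (t < 0 ? -0.5 : 0.5));
}

/* Print tenths of a degree with exactly one fractional digit. */
void format_tenths(char *buf, size_t size, int tenths)
{
	int a = abs(tenths);
	snprintf(buf, size, "%s%d.%d", tenths < 0 ? "-" : "", a / 10, a % 10);
}

/* Roughly normal value: the sum of 12 uniform samples in [0, 1) has
 * mean 6 and variance 1, so (sum - 6) approximates a standard normal. */
double approx_gaussian(double mean, double stddev)
{
	double sum = 0.0;
	for (int i = 0; i < 12; i++)
		sum += rand() / ((double)RAND_MAX + 1.0);
	return mean + stddev * (sum - 6.0);
}

/* Write n lines of "name;temperature\n" to out. Returns 0 on success. */
int gen(FILE *out, unsigned long n, unsigned seed)
{
	char temp[16];

	srand(seed);
	while (n--) {
		const struct station *s = &stations[rand() % NSTATIONS];
		format_tenths(temp, sizeof temp,
		              to_tenths(approx_gaussian(s->mean, 10.0)));
		if (fprintf(out, "%s;%s\n", s->name, temp) < 0)
			return -1;
	}
	return 0;
}
```

Calling gen(stdout, 1000000, 1) from main would print one million lines
in the name;temperature format described by the rules above. Working in
whole tenths of a degree keeps the output at exactly one fractional
digit and avoids printing "-0.0".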