Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/davidmoten/big-sorter

Java library that sorts very large files of records by splitting into smaller sorted files and merging
https://github.com/davidmoten/big-sorter

big-data java sorting

Last synced: 17 days ago
JSON representation

Java library that sorts very large files of records by splitting into smaller sorted files and merging

Awesome Lists containing this project

README

        

# big-sorter


[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.davidmoten/big-sorter/badge.svg?style=flat)](https://maven-badges.herokuapp.com/maven-central/com.github.davidmoten/big-sorter)

[![codecov](https://codecov.io/gh/davidmoten/big-sorter/branch/master/graph/badge.svg)](https://codecov.io/gh/davidmoten/big-sorter)

Sorts very large files (or `InputStream`s) by splitting to many intermediate small sorted files and merging.

Status: *deployed to Maven Central*

Note that the merge step in the diagram above will happen repeatedly till one file remains.

## Features

* Easy to use builder
* Single threaded
* Sorts one billion integers from a file to a new file in 444s
* Serialization helpers for
* lines of strings
* Java IO Serialization
* DataInputStream base
* fixed length binary records
* CSV
* JSON arrays
* Serialization is customizable
* Functional style transforms of input data (`filter`, `map`, `flatMap`, `transform`, `transformStream`), includes java.util.Stream support
* Compare sorted files (`findSame`, `findDifferent`, `findComplement`)
* Runtime complexity is O(n log(n))
* 100% test coverage

## Algorithm

One or more large files or `InputStream`s of records are sorted to one output file by:
* splitting the whole files into smaller segments according to `maxItemsPerFile`
* each segment is sorted in memory and then written to a file
* the segment files are then merged in groups according to `maxFilesPerMerge`
* the merged files are repeatedly merged in groups until only one file remains (with all of the sorted entries)
* Note that, where possible, files are merged with similarly sized files to ensure that we don't start approaching insertion sort computational complexity (O(n2).
* the merge step uses a Min Heap (`PriorityQueue`) for efficiency

## Getting started
Add this dependency to your maven pom.xml:
```xml

com.github.davidmoten
big-sorter
VERSION_HERE

```
If you want to sort csv add this extra dependency:
```xml

org.apache.commons
commons-csv
1.9_OR_LATER

```

If you want to sort JSON arrays add this extra dependency:
```xml

com.fasterxml.jackson.core
jackson-databind
LATEST_VERSION_HERE

```
If you are new to Java or Maven, go to [big-sorter-example](https://github.com/davidmoten/big-sorter-example).

## Serialization
To read records from files or InputStreams and to write records to files we need to specify the *serialization* method to use.

### Example for sorting text lines

Make special note of the ability to do functional style transforms of the input data (`filter`, `map`).

```java
File in1 = ...
File in2 = ...
File out = ...
Sorter
// set both serializer and natural comparator
.linesUtf8()
.input(in1, in2)
.filter(line -> !line.isEmpty())
.filter(line -> !line.startsWith("#"))
.map(line -> line.toLowerCase())
.output(out)
.maxFilesPerMerge(100) // default is 100
.maxItemsPerFile(100000) // default is 100,000
.initialSortInParallel() // may want to use a large maxItemsPerFile for this to be effective
.bufferSize(8192) // default is 8192
.sort();
```

or for a different character set with "\r\n" line delimiters and in reverse order:

```java
Sorter
// set both serializer and natural comparator
.serializerLines(charset, LineDelimiter.CARRIAGE_RETURN_LINE_FEED)
.comparator(Comparator.reverseOrder())
.input(in)
.output(out)
.sort();
```

### Example for sorting integers from a text file
This approach will work but there is a lot of overhead from `Integer.parseInt` and writing int values as strings:

```java
Sorter
.serializerLinesUtf8()
.comparator((a, b) -> Integer.compare(Integer.parseInt(a), Integer.parseInt(b)))
.input(new File("src/test/resources/numbers.txt"))
.filter(line -> !line.isEmpty())
.output(new File("target/numbers-sorted.txt"))
.sort();
```
A more efficient approach (if you need better runtime) is to use an `inputMapper` (you can also use an `outputMapper` at the end):

```java
File textInts = new File("src/test/resources/numbers.txt");

// It's much more efficient to be dealing in 4 bytes of integer
// than strings
Serializer intSerializer = Serializer.dataSerializer( //
dis -> (Integer) dis.readInt(), //
(dos, v) -> dos.writeInt(v));

Sorter
.serializer(intSerializer)
.inputMapper(Serializer.linesUtf8(), line -> Integer.parseInt(line))
.naturalOrder()
.input(textInts)
.outputAsStream()
.sort()
.peek(System.out::println)
.count();
```

A test was made sorting 100m random integers in a text file (one per line).

* Using the first method the runtime was 456s
* With the second more efficient method the runtime was 81s

### Example for sorting CSV
Note that for sorting CSV you need to add the *commons-csv* dependency (see [Gettting started](#getting-started)).

Given the CSV file below, we will sort on the second column (the "number" column):
```
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
```

```java
Serializer serializer =
Serializer.csv(
CSVFormat
.DEFAULT
.withFirstRecordAsHeader()
.withRecordSeparator("\n"),
StandardCharsets.UTF_8);
Comparator comparator = (x, y) -> {
int a = Integer.parseInt(x.get("number"));
int b = Integer.parseInt(y.get("number"));
return Integer.compare(a, b);
};
Sorter
.serializer(serializer)
.comparator(comparator)
.input(inputFile)
.output(outputFile)
.sort();
```
The result is:
```
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
```
### Example for sorting fixed length binary
This example uses a comparator based on byte arrays of length 32. You can also use [`DataSerializer`](#example-using-the-dataserializer-helper) to do more fine grained extraction from the byte arrays (or to handle non-fixed length records).

```java
Serializer serializer = Serializer.fixedSizeRecord(32);
Sorter //
.serializer(serializer)
.comparator((x, y) -> compare(x, y))
.input(new File("input.bin"))
.output(new File("sorted.bin"))
.sort();
```
You would of course have to implement the `compare(byte[], byte[])` function yourself ( returns -1 if x < y, 1 if x > y, 0 if x == y).

### Example for sorting a JSON array
Note that for sorting JSON you need to add the *jackson-databind* dependency (see [Gettting started](#getting-started)).

Given a JSON array like:

```json
[
{ "name": "fred", "age": 23 },
{ "name": "anne", "age": 31 }
]
```
We can sort the elements by the "name" field like this:

```java
Sorter //
.serializer(Serializer.jsonArray())
.comparator((x, y) -> x.get("name").asText().compareTo(y.get("name").asText()))
.input(new File("input.json"))
.output(new File("sorted.json"))
.sort();
```

and we get:
```json
[
{ "name": "anne", "age": 31 },
{ "name": "fred", "age": 23 }
]
```
If your structure is more complex than this (for example the array might not be top-level) then copy and customize the class [JsonArraySerializer.java](src/main/java/com/github/davidmoten/bigsorter/JsonArraySerializer.java).

### Example using Java IO Serialization
If each record has been written to the input file using `ObjectOutputStream` then we specify the *java()* Serializer:

```java
Sorter
.serializer(Serializer.java())
.comparator(Comparator.naturalOrder())
.input(in)
.output(out)
.sort();

```
### Example using the DataSerializer helper
If you would like to serializer/deserialize your objects using `DataOutputStream`/`DataInputStream` then extend the `DataSerializer` class as below. This is a good option for many binary formats.

Let's use a binary format with a person's name and a height in cm. We'll keep it unrealistically simple with a short field for the length of the persons name, the bytes of the name, and an integer for the height in cm:

```java
public static final class Person {
final String name;
final int heightCm;
...
}

Serializer serializer = new DataSerializer() {

@Override
public Person read(DataInputStream dis) throws IOException {
short length;
// only check for EOF on first item. If it happens after then we have a corrupt file
// with incompletely written records
try {
length= dis.readShort();
} catch (EOFException e) {
return null;
}
byte[] bytes = new byte[length];
dis.readFully(bytes);
String name = new String(bytes, StandardCharsets.UTF_8);
int heightCm = dis.readInt();
return new Person(name, heightCm);
}

@Override
public void write(DataOutputStream dos, Person p) throws IOException {
dos.writeShort((short) p.name.length());
dos.write(p.name.getBytes(StandardCharsets.UTF_8));
dos.writeInt(p.heightCm);
}

};

Sorter
.serializer(serializer)
.comparator((x, y) -> Integer.compare(x.heightCm, y.heightCm))
.input(in)
.output(out)
.sort();
```
### But my binary file has a header record!

In that case make a type T that can be header or an item and have your serializer return that T object. In your comparator ensure that the header is always sorted to the top and you are done.

### Custom serialization
To fully do your own thing you need to implement the `Serializer` interface.

## Filtering, transforming input
Your large input file might have a lot of irrelevant stuff in it that you want to ignore or you might want to select only the columns from a csv line that you are interested in. You can use the java.util.Stream API to modify the input or use direct methods `filter`, `map`, `flatMap`:

```java
Sorter
// set both serializer and natural comparator
.linesUtf8()
.input(in)
.filter(line -> !line.isEmpty())
.filter(line -> !line.startsWith("#"))
.map(line -> line.toLowerCase())
.output(out)
.sort();
```
or

```java
Sorter
// set both serializer and natural comparator
.linesUtf8()
.input(in)
.transformStream(stream ->
stream.filter(line -> !line.isEmpty())
.filter(line -> !line.startsWith("#"))
.map(line -> line.toLowerCase()))
.output(out)
.sort();
```
### Converting input files
Bear in mind that the filter, map and transform methods don't change the Serializer and you'll notice that the map method only maps to the same type (so you can modify a string for instance but it stays a string). To transform input into a different format that is more efficient for sorting (like text integer to binary integer) you can use this utility method:

```java
import com.github.davidmoten.bigsorter.Util;

static void convert(File in, Serializer inSerializer, File out, Serializer outSerializer, Function super S, ? extends T> mapper);
```

For example, let's convert a file of integers, one per text line to binary integers (4 bytes each):

```java
Serializer intSerializer = Serializer.dataSerializer(
dis -> (Integer) dis.readInt(),
(dos, v) -> dos.writeInt(v));

// convert input from text integers to 4 byte binary integers
File textInts = new File("src/test/resources/numbers.txt");
File ints = new File("target/numbers-integers");
Util.convert(textInts, Serializer.linesUtf8(), ints, intSerializer, line -> Integer.parseInt(line));
```

Converting the input to a more efficient format can make a big difference to the sort runtime. Sorting 100m integers was 6 times faster when the input was converted first to 4 byte binary ints (and that includes the conversion time).

## How to read the output file
Having sorted to a file `f`, you can read from that file like so (`Reader` is `Iterable`):

```java
Reader reader = serializer.createReader(f);
for (T t: reader) {
System.out.println(t);
}
```
An example with String lines:

```java
Reader reader = Serializer.linesUtf8().createReader(f);
reader.forEach(System.out::println);
```

If you want to stream records from a file do this:

```java
// ensure reader is closed after handling stream
try (Reader reader = serializer.createReader(output)) {
Stream stream = reader.stream();
...
}
```

## Returning the result as a Stream

You might want to deal with the results of the sort immediately and be prepared to throw away the output file once read by a stream:

```java
try (Stream stream = Sorter
.linesUtf8()
.input(in)
.outputAsStream()
.sort()) {
stream.forEach(System.out::println);
}
```
The interaction is a little bit clumsy because you need the stream to be auto-closed by the try-catch-with-resources block.
Note especially that a terminal operation (like `.collect(...)` or `count()`) does **not** close the stream. When called,
the close action of the stream deletes the file used as output. If you don't close the stream then you will accumulate final output files in the temp directory and possibly run out of disk.

The fact that java.util.Stream has poor support for closing resources tempts the author to switch to a more appropriate functional library like [kool](https://github.com/davidmoten/kool). We'll see.

See [here](#how-to-read-the-output-file) to stream records from a file.

## Comparing sorted files
Once you've got multiple sorted files you may want to perform some comparisons. Common comparisons include:

* find common records (use `Util.findSame`)
* find different records (use `Util.findDifferent`)
* find records that are not present in the other file (use `Util.findComplement`)

Here's an example of using `Util.findSame` (`findDifferent` and `findComplement` use the same approach):

```java
// already sorted
File a = ...
// already sorted
File b = ...
// result of the operation
File out = ...
Util.findSame(a, b, Serializer.linesUtf8(), Comparator.naturalOrder(), out);
```

## Logging
If you want some insight into the progress of the sort then set a logger in the builder:

```java
Sorter
.linesUtf8()
.input(in)
.output(out)
.logger(x -> log.info(x))
.sort();
```
You can use the `.loggerStdOut()` method in the builder and you will get timestamped output written to the console:

```
2019-05-25 09:12:59.4+1000 starting sort
2019-05-25 09:13:03.2+1000 total=100000, sorted 100000 records to file big-sorter2118475291065234969 in 1.787s
2019-05-25 09:13:05.9+1000 total=200000, sorted 100000 records to file big-sorter2566930097970845433 in 2.240s
2019-05-25 09:13:08.9+1000 total=300000, sorted 100000 records to file big-sorter6018566301838556627 in 2.243s
2019-05-25 09:13:11.9+1000 total=400000, sorted 100000 records to file big-sorter4803313760593338955 in 0.975s
2019-05-25 09:13:14.3+1000 total=500000, sorted 100000 records to file big-sorter9199236260699264566 in 0.962s
2019-05-25 09:13:16.7+1000 total=600000, sorted 100000 records to file big-sorter2064358954108583653 in 0.989s
2019-05-25 09:13:19.1+1000 total=700000, sorted 100000 records to file big-sorter6934618230625335397 in 0.964s
2019-05-25 09:13:21.5+1000 total=800000, sorted 100000 records to file big-sorter5759615033643361667 in 0.975s
2019-05-25 09:13:24.1+1000 total=900000, sorted 100000 records to file big-sorter6808081723248409045 in 0.948s
2019-05-25 09:13:25.8+1000 total=1000000, sorted 100000 records to file big-sorter2456434677554311136 in 0.983s
2019-05-25 09:13:25.8+1000 completed inital split and sort, starting merge
2019-05-25 09:13:25.8+1000 merging 10 files
2019-05-25 09:13:36.8+1000 sort of 1000000 records completed in 37.456s
```

## Memory usage
Memory usage is directly linked to the value of the `maxItemsPerFile` parameter which you can set in the builder. Its default is 100000. If too much memory is being used reduce that number and test.

## Benchmarks

```
10^3 integers sorted in 0.004s
10^4 integers sorted in 0.013s
10^5 integers sorted in 0.064s
10^6 integers sorted in 0.605s
10^7 integers sorted in 3.166s
10^8 integers sorted in 35.978s
10^9 integers sorted in 444.549s
```