Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/davidmoten/big-sorter

Java library that sorts very large files of records by splitting into smaller sorted files and merging
https://github.com/davidmoten/big-sorter
big-data java sorting
Host: GitHub
URL: https://github.com/davidmoten/big-sorter
Owner: davidmoten
License: apache-2.0
Created: 2019-05-14T07:32:57.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-10-23T02:00:06.000Z (21 days ago)
Last Synced: 2024-10-24T14:30:02.685Z (20 days ago)
Topics: big-data, java, sorting
Language: Java
Homepage:
Size: 669 KB
Stars: 84
Watchers: 2
Forks: 18
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# big-sorter




[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.github.davidmoten/big-sorter/badge.svg?style=flat)](https://maven-badges.herokuapp.com/maven-central/com.github.davidmoten/big-sorter)


[![codecov](https://codecov.io/gh/davidmoten/big-sorter/branch/master/graph/badge.svg)](https://codecov.io/gh/davidmoten/big-sorter)


Sorts very large files (or `InputStream`s) by splitting to many intermediate small sorted files and merging.

Status: *deployed to Maven Central*



Note that the merge step in the diagram above will happen repeatedly till one file remains.

## Features

* Easy to use builder

* Single threaded

* Sorts one billion integers from a file to a new file in 444s 

* Serialization helpers for 

  * lines of strings

  * Java IO Serialization

  * DataInputStream base

  * fixed length binary records 

  * CSV

  * JSON arrays

* Serialization is customizable

* Functional style transforms of input data (`filter`, `map`, `flatMap`, `transform`, `transformStream`), including `java.util.stream.Stream` support

* Compare sorted files (`findSame`, `findDifferent`, `findComplement`)

* Runtime complexity is O(n log(n))

* 100% test coverage

## Algorithm

One or more large files or `InputStream`s of records are sorted to one output file by:

* splitting the whole files into smaller segments according to `maxItemsPerFile`

* each segment is sorted in memory and then written to a file

* the segment files are then merged in groups according to `maxFilesPerMerge`

* the merged files are repeatedly merged in groups until only one file remains (with all of the sorted entries)

* Note that, where possible, files are merged with similarly sized files to ensure that we don't start approaching insertion sort computational complexity (O(n²)).

* the merge step uses a Min Heap (`PriorityQueue`) for efficiency
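The merge step described above can be illustrated with a small k-way merge over in-memory lists. This is a minimal sketch of the general technique, not the library's implementation: the class `KWayMergeSketch`, the helper `Head` and the method `merge` are hypothetical names, and lists stand in for the sorted segment files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration: lists stand in for sorted segment files.
final class KWayMergeSketch {

    // Pairs the current head record of one sorted source with the rest of that source.
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Merges already-sorted sources into one sorted list using a min-heap.
    static <T> List<T> merge(List<List<T>> sortedSources, Comparator<? super T> comparator) {
        PriorityQueue<Head<T>> heap =
                new PriorityQueue<>((a, b) -> comparator.compare(a.value, b.value));
        // seed the heap with the first record of every non-empty source
        for (List<T> source : sortedSources) {
            Iterator<T> it = source.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // the smallest remaining record across all sources is at the top of the heap
            Head<T> head = heap.poll();
            merged.add(head.value);
            // refill the heap from the source that just supplied a record
            if (head.rest.hasNext()) {
                heap.add(new Head<>(head.rest.next(), head.rest));
            }
        }
        return merged;
    }
}
```

The heap holds at most one record per source, so each output record costs O(log k) comparisons for k sources.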

## Getting started

Add this dependency to your maven pom.xml:

```xml

<dependency>
    <groupId>com.github.davidmoten</groupId>
    <artifactId>big-sorter</artifactId>
    <version>VERSION_HERE</version>
</dependency>

```

If you want to sort CSV add this extra dependency:

```xml

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.9_OR_LATER</version>
</dependency>

```

If you want to sort JSON arrays add this extra dependency:

```xml

<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>LATEST_VERSION_HERE</version>
</dependency>

```

If you are new to Java or Maven, go to [big-sorter-example](https://github.com/davidmoten/big-sorter-example).

## Serialization

To read records from files or InputStreams and to write records to files we need to specify the *serialization* method to use.

### Example for sorting text lines

Make special note of the ability to do functional style transforms of the input data (`filter`, `map`).

```java

File in1 = ...

File in2 = ...

File out = ...

Sorter

  // set both serializer and natural comparator

  .linesUtf8()

  .input(in1, in2)

  .filter(line -> !line.isEmpty())

  .filter(line -> !line.startsWith("#"))

  .map(line -> line.toLowerCase())

  .output(out)

  .maxFilesPerMerge(100) // default is 100

  .maxItemsPerFile(100000) // default is 100,000

  .initialSortInParallel() // may want to use a large maxItemsPerFile for this to be effective

  .bufferSize(8192) // default is 8192

  .sort();

```

or for a different character set with "\r\n" line delimiters and in reverse order:

```java

Sorter

  // set the serializer, then a reverse-order comparator

  .serializerLines(charset, LineDelimiter.CARRIAGE_RETURN_LINE_FEED)

  .comparator(Comparator.reverseOrder())

  .input(in)

  .output(out)

  .sort();

```

### Example for sorting integers from a text file

This approach will work but there is a lot of overhead from `Integer.parseInt` and writing int values as strings:

```java

Sorter

  .serializerLinesUtf8()

  .comparator((a, b) -> Integer.compare(Integer.parseInt(a), Integer.parseInt(b)))

  .input(new File("src/test/resources/numbers.txt"))

  .filter(line -> !line.isEmpty())

  .output(new File("target/numbers-sorted.txt"))

  .sort();

```

A more efficient approach (if you need better runtime) is to use an `inputMapper` (you can also use an `outputMapper` at the end):

```java

File textInts = new File("src/test/resources/numbers.txt");

// It's much more efficient to be dealing in 4 bytes of integer

// than strings

Serializer<Integer> intSerializer = Serializer.dataSerializer( //

                dis -> (Integer) dis.readInt(), //

                (dos, v) -> dos.writeInt(v));

        

Sorter 

  .serializer(intSerializer)

  .inputMapper(Serializer.linesUtf8(), line -> Integer.parseInt(line))

  .naturalOrder()

  .input(textInts)

  .outputAsStream()

  .sort()        

  .peek(System.out::println)

  .count();

```

A test was made sorting 100m random integers in a text file (one per line). 

* Using the first method the runtime was 456s 

* With the second more efficient method the runtime was 81s

 

### Example for sorting CSV

Note that for sorting CSV you need to add the *commons-csv* dependency (see [Getting started](#getting-started)).

Given the CSV file below, we will sort on the second column (the "number" column):

```

name,number,cost

WIPER BLADE,35,12.55

ALLEN KEY 5MM,27,3.80

```

```java

Serializer<CSVRecord> serializer = 

  Serializer.csv(

    CSVFormat

      .DEFAULT

      .withFirstRecordAsHeader()

      .withRecordSeparator("\n"),

    StandardCharsets.UTF_8);

Comparator<CSVRecord> comparator = (x, y) -> {

    int a = Integer.parseInt(x.get("number"));

    int b = Integer.parseInt(y.get("number"));

    return Integer.compare(a, b);

};

Sorter 

  .serializer(serializer) 

  .comparator(comparator) 

  .input(inputFile) 

  .output(outputFile)

  .sort();

```

The result is:

```

name,number,cost

ALLEN KEY 5MM,27,3.80

WIPER BLADE,35,12.55

```

### Example for sorting fixed length binary

This example uses a comparator based on byte arrays of length 32. You can also use [`DataSerializer`](#example-using-the-dataserializer-helper) to do more fine grained extraction from the byte arrays (or to handle non-fixed length records).

```java

Serializer<byte[]> serializer = Serializer.fixedSizeRecord(32);

Sorter //

  .serializer(serializer) 

  .comparator((x, y) -> compare(x, y)) 

  .input(new File("input.bin")) 

  .output(new File("sorted.bin")) 

  .sort();

```

You would of course have to implement the `compare(byte[], byte[])` function yourself (returning -1 if x < y, 1 if x > y, 0 if x == y).
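One possible `compare` is a lexicographic comparison that treats each byte as unsigned. This is only a sketch under that assumption: the class name `ByteArrayCompareSketch` is hypothetical, and your record layout may call for a different ordering (for example comparing only a key field inside the 32 bytes).

```java
// Hypothetical helper: unsigned lexicographic ordering of byte arrays.
final class ByteArrayCompareSketch {

    // Returns -1 if x < y, 1 if x > y, 0 if x == y.
    // Bytes are compared as unsigned values (0..255); if one array is a
    // prefix of the other, the shorter array sorts first.
    static int compare(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int a = x[i] & 0xFF; // mask to drop the sign of the byte
            int b = y[i] & 0xFF;
            if (a != b) {
                return a < b ? -1 : 1;
            }
        }
        return Integer.signum(x.length - y.length);
    }
}
```

On Java 9 or later, `java.util.Arrays.compareUnsigned(byte[], byte[])` gives the same ordering, though it only guarantees the sign of the result rather than exactly -1 or 1.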

### Example for sorting a JSON array

Note that for sorting JSON you need to add the *jackson-databind* dependency (see [Getting started](#getting-started)).

Given a JSON array like:

```json

[

{ "name": "fred", "age": 23 },

{ "name": "anne", "age": 31 }

]

```

We can sort the elements by the "name" field like this:

```java

Sorter //

  .serializer(Serializer.jsonArray()) 

  .comparator((x, y) -> x.get("name").asText().compareTo(y.get("name").asText())) 

  .input(new File("input.json")) 

  .output(new File("sorted.json")) 

  .sort();

```

and we get:

```json

[

{ "name": "anne", "age": 31 },

{ "name": "fred", "age": 23 }

]

```

If your structure is more complex than this (for example the array might not be top-level) then copy and customize the class [JsonArraySerializer.java](src/main/java/com/github/davidmoten/bigsorter/JsonArraySerializer.java).

### Example using Java IO Serialization

If each record has been written to the input file using `ObjectOutputStream` then we specify the *java()* Serializer:

```java

Sorter 

  .serializer(Serializer.java()) 

  .comparator(Comparator.naturalOrder()) 

  .input(in) 

  .output(out) 

  .sort();

```

### Example using the DataSerializer helper

If you would like to serialize/deserialize your objects using `DataOutputStream`/`DataInputStream` then extend the `DataSerializer` class as below. This is a good option for many binary formats. 

Let's use a binary format with a person's name and a height in cm. We'll keep it unrealistically simple with a short field for the length of the person's name, the bytes of the name, and an integer for the height in cm:

```java

public static final class Person {

    final String name;

    final int heightCm;

    ...

}

Serializer<Person> serializer = new DataSerializer<Person>() {

    @Override

    public Person read(DataInputStream dis) throws IOException {

        short length;

        // only check for EOF on first item. If it happens after then we have a corrupt file 

        // with incompletely written records

        try {

            length = dis.readShort();

        } catch (EOFException e) {

            return null;

        }

        byte[] bytes = new byte[length];

        dis.readFully(bytes);

        String name = new String(bytes, StandardCharsets.UTF_8);

        int heightCm = dis.readInt();

        return new Person(name, heightCm);

    }

    @Override

    public void write(DataOutputStream dos, Person p) throws IOException {

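        byte[] bytes = p.name.getBytes(StandardCharsets.UTF_8);

        // write the UTF-8 byte count (not the char count) so that read() consumes the right number of bytes

        dos.writeShort((short) bytes.length);

        dos.write(bytes);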

        dos.writeInt(p.heightCm);

    }

};

      

Sorter 

  .serializer(serializer) 

  .comparator((x, y) -> Integer.compare(x.heightCm, y.heightCm)) 

  .input(in) 

  .output(out) 

  .sort();

``` 
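To see the record layout on its own, here is a sketch that round-trips the same format (short length, name bytes, int height) through an in-memory buffer using only the JDK. The class `PersonFormatSketch` and its `roundTrip` method are hypothetical and do not use big-sorter. The sketch writes the UTF-8 byte length of the name, which differs from the character count for non-ASCII names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the record layout; not part of big-sorter.
final class PersonFormatSketch {

    static final class Person {
        final String name;
        final int heightCm;

        Person(String name, int heightCm) {
            this.name = name;
            this.heightCm = heightCm;
        }
    }

    static void write(DataOutputStream dos, Person p) throws IOException {
        byte[] bytes = p.name.getBytes(StandardCharsets.UTF_8);
        dos.writeShort(bytes.length); // byte count, not char count
        dos.write(bytes);
        dos.writeInt(p.heightCm);
    }

    // Returns null at a clean end of stream (EOF before the first field).
    static Person read(DataInputStream dis) throws IOException {
        int length;
        try {
            length = dis.readUnsignedShort();
        } catch (EOFException e) {
            return null;
        }
        byte[] bytes = new byte[length];
        dis.readFully(bytes);
        return new Person(new String(bytes, StandardCharsets.UTF_8), dis.readInt());
    }

    // Writes all records to a buffer and reads them back.
    static List<Person> roundTrip(List<Person> people) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream dos = new DataOutputStream(buffer)) {
                for (Person p : people) {
                    write(dos, p);
                }
            }
            List<Person> result = new ArrayList<>();
            try (DataInputStream dis =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                Person p;
                while ((p = read(dis)) != null) {
                    result.add(p);
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```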

### But my binary file has a header record!

In that case make a type T that can be either a header or an item and have your serializer return that T object. In your comparator ensure that the header is always sorted to the top and you are done.
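A sketch of that idea, assuming a hypothetical `Entry` type with an `int` sort key (the class, field and factory names below are illustrative only). The comparator puts the header before every item and otherwise orders items by value:

```java
import java.util.Comparator;

// Hypothetical illustration of a "header or item" record type.
final class HeaderFirstSketch {

    static final class Entry {
        final boolean header;
        final String text; // header content, null for items
        final int value;   // sort key, unused for the header

        private Entry(boolean header, String text, int value) {
            this.header = header;
            this.text = text;
            this.value = value;
        }

        static Entry header(String text) {
            return new Entry(true, text, 0);
        }

        static Entry item(int value) {
            return new Entry(false, null, value);
        }
    }

    // The header sorts before every item; items sort by value.
    static final Comparator<Entry> HEADER_FIRST = (a, b) -> {
        if (a.header || b.header) {
            // Boolean.compare orders false before true, so the arguments
            // are swapped to place headers first
            return Boolean.compare(b.header, a.header);
        }
        return Integer.compare(a.value, b.value);
    };
}
```

Your serializer's read method would then return `Entry.header(...)` for the first record and `Entry.item(...)` for the rest.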

### Custom serialization

To fully do your own thing you need to implement the `Serializer` interface.

## Filtering, transforming input

Your large input file might have a lot of irrelevant stuff in it that you want to ignore, or you might want to select only the columns from a CSV line that you are interested in. You can use the `java.util.stream.Stream` API to modify the input or use the direct methods `filter`, `map`, `flatMap`:

```java

Sorter

  // set both serializer and natural comparator

  .linesUtf8()

  .input(in)

  .filter(line -> !line.isEmpty())

  .filter(line -> !line.startsWith("#"))

  .map(line -> line.toLowerCase())

  .output(out)

  .sort();

```

or

```java

Sorter

  // set both serializer and natural comparator

  .linesUtf8()

  .input(in)

  .transformStream(stream -> 

      stream.filter(line -> !line.isEmpty())

            .filter(line -> !line.startsWith("#"))

            .map(line -> line.toLowerCase()))

  .output(out)

  .sort();

```  

### Converting input files

Bear in mind that the filter, map and transform methods don't change the Serializer and you'll notice that the map method only maps to the same type (so you can modify a string for instance but it stays a string). To transform input into a different format that is more efficient for sorting (like text integer to binary integer) you can use this utility method:

```java

import com.github.davidmoten.bigsorter.Util;

static <S, T> void convert(File in, Serializer<S> inSerializer, File out, Serializer<T> outSerializer, Function<? super S, ? extends T> mapper);

```

For example, let's convert a file of integers, one per text line to binary integers (4 bytes each):

```java

Serializer<Integer> intSerializer = Serializer.dataSerializer(

   dis -> (Integer) dis.readInt(),

   (dos, v) -> dos.writeInt(v));

        

// convert input from text integers to 4 byte binary integers

File textInts = new File("src/test/resources/numbers.txt");

File ints = new File("target/numbers-integers");

Util.convert(textInts, Serializer.linesUtf8(), ints, intSerializer, line -> Integer.parseInt(line));

```

Converting the input to a more efficient format can make a big difference to the sort runtime. Sorting 100m integers was 6 times faster when the input was converted first to 4 byte binary ints (and that includes the conversion time).
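To make the two encodings concrete, the sketch below converts text integers to the 4-byte big-endian form written by `DataOutputStream.writeInt` and reads them back, using only the JDK. The class `IntEncodingSketch` and its methods are hypothetical and work on in-memory data rather than files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of text-to-binary integer conversion.
final class IntEncodingSketch {

    // Parses one integer per line and encodes each as exactly 4 bytes.
    static byte[] toBinary(List<String> lines) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(buffer)) {
            for (String line : lines) {
                dos.writeInt(Integer.parseInt(line.trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decodes fixed-width records; the record count is simply length / 4.
    static List<Integer> fromBinary(byte[] bytes) {
        List<Integer> values = new ArrayList<>();
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < bytes.length / 4; i++) {
                values.add(dis.readInt());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

Once converted, every read and comparison during the sort works on an `int` directly instead of parsing a string each time.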

## How to read the output file

Having sorted to a file `f`, you can read from that file like so (`Reader<T>` is `Iterable<T>`):

```java

Reader<T> reader = serializer.createReader(f);

for (T t: reader) {

    System.out.println(t);

}

```

An example with String lines:

```java

Reader<String> reader = Serializer.linesUtf8().createReader(f);

reader.forEach(System.out::println);

```

If you want to stream records from a file do this:

```java

// ensure reader is closed after handling stream

try (Reader<T> reader = serializer.createReader(output)) {

    Stream<T> stream = reader.stream();

    ...

}

```

## Returning the result as a Stream

You might want to deal with the results of the sort immediately and be prepared to throw away the output file once read by a stream:

```java

try (Stream<String> stream = Sorter

  .linesUtf8()

  .input(in)

  .outputAsStream()

  .sort()) {

    stream.forEach(System.out::println);

}

``` 

The interaction is a little bit clumsy because you need the stream to be auto-closed by the try-with-resources block.

Note especially that a terminal operation (like `.collect(...)` or `count()`) does **not** close the stream. The stream's close action is what deletes the file used as output, so if you don't close the stream you will accumulate final output files in the temp directory and may run out of disk space.
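The difference is easy to check with a plain JDK stream and an `onClose` handler. The sketch below uses no big-sorter classes; `StreamCloseSketch` and its methods are hypothetical, and the handler stands in for the action that deletes the output file.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// Hypothetical illustration of when a stream's close handler runs.
final class StreamCloseSketch {

    // Runs a terminal operation without closing the stream and reports
    // whether the close handler ran.
    static boolean closedAfterTerminalOp() {
        AtomicBoolean closed = new AtomicBoolean(false);
        Stream<String> stream = Stream.of("a", "b").onClose(() -> closed.set(true));
        stream.count(); // terminal operation; close() is never called
        return closed.get();
    }

    // Runs the same terminal operation inside try-with-resources and
    // reports whether the close handler ran.
    static boolean closedAfterTryWithResources() {
        AtomicBoolean closed = new AtomicBoolean(false);
        try (Stream<String> stream = Stream.of("a", "b").onClose(() -> closed.set(true))) {
            stream.count();
        }
        return closed.get();
    }
}
```

Only the second method runs the handler, which is why the sorted stream should be consumed inside a try-with-resources block.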

The fact that `java.util.stream.Stream` has poor support for closing resources tempts the author to switch to a more appropriate functional library like [kool](https://github.com/davidmoten/kool). We'll see.

See [here](#how-to-read-the-output-file) to stream records from a file.

## Comparing sorted files

Once you've got multiple sorted files you may want to perform some comparisons. Common comparisons include:

* find common records (use `Util.findSame`)

* find different records (use `Util.findDifferent`)

* find records that are not present in the other file (use `Util.findComplement`)

Here's an example of using `Util.findSame` (`findDifferent` and `findComplement` use the same approach):

```java

// already sorted

File a = ...

// already sorted

File b = ...

// result of the operation

File out = ...

Util.findSame(a, b, Serializer.linesUtf8(), Comparator.naturalOrder(), out);

```
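Because both inputs are already sorted, common records can be found in a single linear pass. The sketch below shows that general technique on in-memory lists; it is not the library's code. The class `SortedIntersectionSketch` and method `commonRecords` are hypothetical names, and its handling of duplicates (one output per matched pair) is an assumption that may differ from `Util.findSame`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: lists stand in for sorted files.
final class SortedIntersectionSketch {

    // Walks both sorted inputs in step and collects records present in both.
    static <T> List<T> commonRecords(List<T> a, List<T> b, Comparator<? super T> comparator) {
        List<T> same = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int c = comparator.compare(a.get(i), b.get(j));
            if (c < 0) {
                i++; // a's record is smaller, so it cannot appear later in b
            } else if (c > 0) {
                j++; // b's record is smaller, so it cannot appear later in a
            } else {
                same.add(a.get(i)); // match: emit once and advance both sides
                i++;
                j++;
            }
        }
        return same;
    }
}
```

Each record is visited at most once, so the pass is O(n + m) for inputs of n and m records.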

## Logging

If you want some insight into the progress of the sort then set a logger in the builder:

```java

Sorter

  .linesUtf8()

  .input(in)

  .output(out)

  .logger(x -> log.info(x))

  .sort();

```

You can use the `.loggerStdOut()` method in the builder and you will get timestamped output written to the console:

```

2019-05-25 09:12:59.4+1000 starting sort

2019-05-25 09:13:03.2+1000 total=100000, sorted 100000 records to file big-sorter2118475291065234969 in 1.787s

2019-05-25 09:13:05.9+1000 total=200000, sorted 100000 records to file big-sorter2566930097970845433 in 2.240s

2019-05-25 09:13:08.9+1000 total=300000, sorted 100000 records to file big-sorter6018566301838556627 in 2.243s

2019-05-25 09:13:11.9+1000 total=400000, sorted 100000 records to file big-sorter4803313760593338955 in 0.975s

2019-05-25 09:13:14.3+1000 total=500000, sorted 100000 records to file big-sorter9199236260699264566 in 0.962s

2019-05-25 09:13:16.7+1000 total=600000, sorted 100000 records to file big-sorter2064358954108583653 in 0.989s

2019-05-25 09:13:19.1+1000 total=700000, sorted 100000 records to file big-sorter6934618230625335397 in 0.964s

2019-05-25 09:13:21.5+1000 total=800000, sorted 100000 records to file big-sorter5759615033643361667 in 0.975s

2019-05-25 09:13:24.1+1000 total=900000, sorted 100000 records to file big-sorter6808081723248409045 in 0.948s

2019-05-25 09:13:25.8+1000 total=1000000, sorted 100000 records to file big-sorter2456434677554311136 in 0.983s

2019-05-25 09:13:25.8+1000 completed inital split and sort, starting merge

2019-05-25 09:13:25.8+1000 merging 10 files

2019-05-25 09:13:36.8+1000 sort of 1000000 records completed in 37.456s

```

## Memory usage

Memory usage is directly linked to the value of the `maxItemsPerFile` parameter, which you can set in the builder. Its default is 100000. If too much memory is being used, reduce that number and test.
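Lowering `maxItemsPerFile` reduces the memory needed per segment but produces more segment files to merge. The following sketch gives a rough estimate of that trade-off. `SortPlanSketch` and its methods are hypothetical, and the pass count assumes every merge round combines full groups of `maxFilesPerMerge` files, which may differ from how the library actually groups similarly sized files.

```java
// Hypothetical back-of-envelope estimator; not part of big-sorter.
final class SortPlanSketch {

    // Number of segment files produced by the initial split (ceiling division).
    static long segmentFiles(long records, int maxItemsPerFile) {
        if (maxItemsPerFile < 1) {
            throw new IllegalArgumentException("maxItemsPerFile must be >= 1");
        }
        return (records + maxItemsPerFile - 1) / maxItemsPerFile;
    }

    // Number of merge rounds until one file remains, assuming each round
    // merges the files in full groups of maxFilesPerMerge.
    static int mergePasses(long files, int maxFilesPerMerge) {
        if (maxFilesPerMerge < 2) {
            throw new IllegalArgumentException("maxFilesPerMerge must be >= 2");
        }
        int passes = 0;
        while (files > 1) {
            files = (files + maxFilesPerMerge - 1) / maxFilesPerMerge;
            passes++;
        }
        return passes;
    }
}
```

With the defaults, one million records give 10 segment files and a single merge round, which matches the "merging 10 files" line in the log output above. Under the same assumptions, one billion records give 10,000 segment files and two rounds.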

## Benchmarks

```

10^3 integers sorted in 0.004s

10^4 integers sorted in 0.013s

10^5 integers sorted in 0.064s

10^6 integers sorted in 0.605s

10^7 integers sorted in 3.166s

10^8 integers sorted in 35.978s

10^9 integers sorted in 444.549s

```