https://github.com/wlun001/file-reader
Try to read large file with limited memory
- Host: GitHub
- URL: https://github.com/wlun001/file-reader
- Owner: WLun001
- Created: 2020-10-02T13:14:44.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-15T12:06:57.000Z (almost 5 years ago)
- Last Synced: 2025-04-04T02:14:13.370Z (6 months ago)
- Language: Go
- Homepage:
- Size: 875 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# Large file reader
- [Large file reader](#large-file-reader)
  * [Goal](#goal)
  * [Demo](#demo)
  * [Solution](#solution)
    + [Find example data](#find-example-data)
    + [First attempt](#first-attempt)
        * [concept](#concept)
        * [Result](#result)
    + [Second attempt](#second-attempt)
        * [Result](#result-1)
      - [Create larger text file](#create-larger-text-file)
      - [create cloud resources](#create-cloud-resources)
      - [Build container and deploy](#build-container-and-deploy)
      - [Get the IP address](#get-the-ip-address)
      - [Test the API](#test-the-api)
      - [Result](#result-2)
    + [Third attempt](#third-attempt)
      - [Result](#result-3)
  * [Conclusion](#conclusion)
  * [Clean up](#clean-up)

Table of contents generated with markdown-toc
## Goal
- read 100 GB text file
- memory cap at 16 GB
- only scan text file once
- reduce IO operations
- find first unique word

## Demo
- it will take about 1 minute
- the IP address and bucket URL will be removed afterwards
```bash
curl 'http://34.87.25.50:3000/word?file=https://storage.googleapis.com/temp-read-large-file-bucket/big10.txt' | jq
```
## Solution
### Find example data

`emoji.txt` was combined from this [Kaggle dataset](https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp):
```bash
$ cat *.txt > emoji.txt # about 2.1 MB
```

### First attempt
With goroutines and `bufio.NewReader`
##### concept

1. Determine the file size and split the work across goroutines, based on a constant `limitInBytes`
```go
goroutines := 1
limitInBytes := int64(*limit * kb)
if f.Size() > limitInBytes {
	goroutines = int(f.Size() / limitInBytes)
}
```

2. Define a channel to receive words
```go
channel := make(chan string)
dict := make(map[string]int)
done := make(chan bool, 1)
words := 0
go func() {
	for s := range channel {
		words++
		dict[s]++
	}
	done <- true
}()
```
3. Read the file from a specific offset
```go
// offset is the last reading position
file.Seek(offset, 0)
reader := bufio.NewReader(file)

cummulativeSize := int64(0)
for {
	// read word by word (space-delimited)
	b, err := reader.ReadBytes(' ')
	if err != nil {
		break
	}
	cummulativeSize += int64(len(b))
	// if the cumulative size read so far is larger than the limit
	// this goroutine is supposed to read, stop
	if cummulativeSize > limit {
		break
	}
	// send the word to the word channel
	channel <- strings.TrimSpace(string(b))
}
```
4. Read the file in each goroutine
```go
for i := 0; i < goroutines; i++ {
	wg.Add(1)
	// pass the offset as a parameter so each goroutine reads its own
	// chunk instead of racing on the shared `current` variable
	go func(offset int64) {
		// file reading
		read(offset, limitInBytes, *file, channel)
		wg.Done()
	}(current)
	current += limitInBytes + 1
}
```

5. Wait for the goroutines to finish
```go
wg.Wait()
// closing the channel makes the collector goroutine's range loop exit
close(channel)
<-done
close(done)
```
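The `Alloc = … TotalAlloc = … Sys = … NumGC = …` line in the results presumably comes from Go's runtime memory statistics (`runtime.ReadMemStats`). A minimal sketch of such a helper, assuming that is how the numbers are produced (the actual helper in the repository may differ):

```go
package main

import (
	"fmt"
	"runtime"
)

// printMemUsage prints the current heap allocation, cumulative allocation,
// memory obtained from the OS, and the number of completed GC cycles.
func printMemUsage() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Alloc = %v MiB\tTotalAlloc = %v MiB\tSys = %v MiB\tNumGC = %v\n",
		m.Alloc/1024/1024, m.TotalAlloc/1024/1024, m.Sys/1024/1024, m.NumGC)
}

func main() {
	printMemUsage()
}
```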
##### Result
Faster with lower memory usage, but not accurate (could be a logic error; splitting at raw byte offsets, for example, can cut words in half at chunk boundaries)
```bash
$ go run cmd/reader.go
4 goroutine has been completed
4 goroutine has been completed
4 goroutine has been completed
4 goroutine has been completed
Alloc = 2 MiB TotalAlloc = 2 MiB Sys = 69 MiB NumGC = 0
emoji.txt is 2069616 bytes
uniqueWords: 10929, wordCount: 101011
time taken: 52.895164ms
top 5 words
i, 4124
feel, 3859
and, 3292
to, 3097
the, 2848
a, 2163
```

### Second attempt
With `bufio.NewScanner` and `Scan()`
```go
func readFile(f *os.File) (map[string]int, int) {
	dict := make(map[string]int)
	words := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// split each line into words and count them
		for _, w := range strings.Fields(scanner.Text()) {
			dict[w]++
			words++
		}
	}
	return dict, words
}
```
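One caveat that the snippet above does not handle: `bufio.Scanner` refuses tokens larger than its buffer (64 KiB by default, `bufio.MaxScanTokenSize`) and stops with `bufio.ErrTooLong`, so a file with extremely long lines would need a larger buffer. A small sketch, reusing `f` from the function above:

```go
scanner := bufio.NewScanner(f)
// allow lines of up to 1 MiB instead of the default 64 KiB token limit
buf := make([]byte, 0, 1024*1024)
scanner.Buffer(buf, 1024*1024)
```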
##### Result
- slower, higher memory usage, accurate
- Not sure whether this can handle large files without hitting an out-of-memory error; this is tested in the section below.
```bash
$ go run cmd/scanner.go
Alloc = 4 MiB TotalAlloc = 10 MiB Sys = 71 MiB NumGC = 3
emoji.txt is 2069616 bytes
uniqueWords: 23929, wordCount: 382701
time taken: 35.120529ms
top 5 words
i, 32221
feel, 13938
and, 11983
to, 11151
the, 10454
a, 7732
```

Looks like the second approach is better. I am not sure whether it will hit an out-of-memory error, so let's improve it further. To simulate a low-memory environment, we containerise it, wrap it in an HTTP server, and run it on Kubernetes.
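The registered handler, `readFileHandler`, is not shown in this README; below is a hedged sketch of what such a handler might look like. Note it is an assumption: the API response later reports a generated local file name, which suggests the repository first downloads the file to disk, whereas this sketch simply streams the HTTP response body through the same scanner logic.

```go
import (
	"bufio"
	"encoding/json"
	"net/http"
	"strings"
)

// readFileHandler fetches the file given by the `file` query parameter and
// counts words with the scanner approach from the second attempt.
func readFileHandler(w http.ResponseWriter, r *http.Request) {
	fileURL := r.URL.Query().Get("file")
	resp, err := http.Get(fileURL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	dict := make(map[string]int)
	words := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		for _, word := range strings.Fields(scanner.Text()) {
			dict[word]++
			words++
		}
	}
	if err := scanner.Err(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	json.NewEncoder(w).Encode(map[string]int{
		"uniqueWords": len(dict),
		"wordCount":   words,
	})
}
```

The server itself then only needs to register the handler and listen on port 3000: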
```go
http.HandleFunc("/word", readFileHandler)
http.ListenAndServe(":3000", nil)
```

Run locally:
```bash
go run cmd/server.go
```

The Web API will look like this:
```text
http://IP_ADDRESS/word?file=file-url.txt
```

Assume we need to read a 100 GB file with at most 16 GB of memory. We can then simulate this by reading a 1 GB file with at most 0.16 GB (160 MB) of memory.

| File size (GB) | Max memory (GB) |
| :------------- | :-------------: |
| 100            | 16              |
| 1              | 0.16            |

If the usage exceeds 160 MB, the pod will be killed, which we enforce by setting `spec.containers[].resources.limits.memory`:
```yaml
resources:
requests:
memory: "32Mi"
cpu: "100m"
limits:
memory: "160Mi"
cpu: "500m"
```

#### Create larger text file
```bash
$ cat /usr/share/dict/words | sort -R | head -100000 > file.txt
$ cat *.txt > big.txt # repeat 10 times until you get a 1.6 GB text file
```

#### create cloud resources
- create public bucket
- upload `big10.txt` to bucket
- create GKE cluster
> make sure to download a Service Account file with appropriate permissions from the cloud console
```bash
$ cd terraform
$ terraform init
$ terraform apply
```

#### Build container and deploy
> make sure you enable Cloud Build access to GKE in the [settings](https://console.cloud.google.com/cloud-build/settings/service-account)
```bash
$ gcloud builds submit . \
--substitutions SHORT_SHA=$(git rev-parse --short HEAD)
```

#### Get the IP address
> it can be obtained from the cloud console or the CLI
```bash
$ gcloud container clusters get-credentials CLUSTER_NAME --zone ZONE --project PROJECT_ID
$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
file-reader-service LoadBalancer 10.3.254.7 34.87.25.50 3000:31465/TCP 24m
```

#### Test the API
```bash
$ curl 'http://IP_ADDRESS/word?file=https://big10.txt' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   402  100   402    0     0      4      0  0:01:40  0:01:36  0:00:04   104
{
"file": {
"MiB": 1526,
"name": "file-1601703943.txt",
"size": 1600501760,
"url": "https://big10.txt"
},
"mem": {
"alloc": "22 MiB",
"numGC": "506",
"sys": "71 MiB",
"totalAlloc": "5562 MiB"
},
"timeTaken": "1m15.587011379s",
"words": {
"firstUniqueWord": "yaguaza",
"top5": [
"i, 16497152",
"feel, 7136768",
"and, 6135296",
"to, 5709312",
"the, 5352960",
"a, 3958784"
],
"uniqueWords": 119960,
"wordCount": 247142912
}
}
```
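For reference, the JSON above could be modelled in Go with a struct along these lines. The field names are inferred from the response; the actual types in the repository may differ.

```go
// Response mirrors the shape of the API output shown above.
type Response struct {
	File struct {
		MiB  int64  `json:"MiB"`
		Name string `json:"name"`
		Size int64  `json:"size"`
		URL  string `json:"url"`
	} `json:"file"`
	Mem struct {
		Alloc      string `json:"alloc"`
		NumGC      string `json:"numGC"`
		Sys        string `json:"sys"`
		TotalAlloc string `json:"totalAlloc"`
	} `json:"mem"`
	TimeTaken string `json:"timeTaken"`
	Words     struct {
		FirstUniqueWord string   `json:"firstUniqueWord"`
		Top5            []string `json:"top5"`
		UniqueWords     int      `json:"uniqueWords"`
		WordCount       int      `json:"wordCount"`
	} `json:"words"`
}
```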
Pod usage metrics
> After hitting the API 3 times, with a rest in between to show the cool-down period

> Memory: max usage 44.4 MiB during file processing

#### Result
Based on the result, it can read a 1.5 to 1.6 GB text file with about 40 MB of memory, and the pod did not get killed with an `OOMKilled` error. We can therefore assume that it can handle a 100 GB text file with memory capped at 16 GB. It also found the first unique word, along with the other additional information, while reading the file only once.
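For illustration, here is a minimal sketch (not the repository's actual code) of how the first unique word can be found in a single pass: count occurrences while remembering the order in which words first appear, then take the first word whose count is 1.

```go
// firstUniqueWord returns the first word that occurs exactly once,
// using one pass over the input plus one pass over the distinct words.
func firstUniqueWord(words []string) string {
	counts := make(map[string]int)
	var order []string
	for _, w := range words {
		if counts[w] == 0 {
			order = append(order, w) // remember first-appearance order
		}
		counts[w]++
	}
	for _, w := range order {
		if counts[w] == 1 {
			return w
		}
	}
	return "" // no unique word found
}
```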
### Third attempt
In this attempt, I try to mimic the Hadoop MapReduce concept. Instead of HDFS, we simply write to local file storage.

1. Set a `lineLimit` and read the file line by line; whenever the number of lines read exceeds `lineLimit`, send a Mapper job
2. The Mapper processes its chunk, counts unique words, and writes the result to a file
3. Once all Mapper jobs are complete, the Reducer is triggered
4. The Reducer reads all files written by the Mappers and combines the results (a rough sketch of possible `mapper` and `reducer` implementations follows the result below)

```go
// read line by line
for scanner.Scan() {
	accLines += fmt.Sprintf("\n%s", scanner.Text())
	counter++
	if counter > lineLimit {
		wg.Add(1)
		// trigger a Mapper job for the accumulated chunk
		go mapper(&wg, accLines, dirPath)
		// reset the accumulator
		accLines = ""
		counter = 0
	}
}
// (any lines left in accLines after the loop would also need one final Mapper call)

// wait for all Mapper jobs to complete
wg.Wait()

// trigger the Reducer
res, reducerFile := reducer(dirPath)
```

#### Result
```bash
$ go run cmd/fake-hadoop.go
Result written to tmp/reducer-1602762922444809000.json
First unique word: interdum
top 5 words
sed, 20
non, 12
sit, 12
et, 12
ipsum, 10
purus, 9

$ cat tmp/reducer-1602762922444809000.json
{
"Aenean": 6,
"Aliquam": 2,
"Cras": 3,
"Curabitur": 1,
"Donec": 6,
"Duis": 3,
"Etiam": 1,
"Fusce": 2,
.....
}
```
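For reference, here is a rough sketch of what the `mapper` and `reducer` steps could look like. It is an assumption based on the driver loop and the JSON output above (partial counts written as JSON files, names derived from `UnixNano`); the repository's actual implementation may differ.

```go
import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
	"sync"
	"time"
)

// mapper counts the words in one chunk of lines and writes the partial
// counts to a mapper-*.json file inside dirPath.
func mapper(wg *sync.WaitGroup, lines string, dirPath string) {
	defer wg.Done()
	counts := make(map[string]int)
	for _, w := range strings.Fields(lines) {
		counts[w]++
	}
	name := filepath.Join(dirPath, fmt.Sprintf("mapper-%d.json", time.Now().UnixNano()))
	f, err := os.Create(name)
	if err != nil {
		log.Println(err)
		return
	}
	defer f.Close()
	json.NewEncoder(f).Encode(counts)
}

// reducer merges all partial counts written by the mappers and writes the
// combined result to a single reducer-*.json file.
func reducer(dirPath string) (map[string]int, string) {
	total := make(map[string]int)
	files, _ := filepath.Glob(filepath.Join(dirPath, "mapper-*.json"))
	for _, name := range files {
		f, err := os.Open(name)
		if err != nil {
			continue
		}
		part := make(map[string]int)
		json.NewDecoder(f).Decode(&part)
		f.Close()
		for w, c := range part {
			total[w] += c
		}
	}
	out := filepath.Join(dirPath, fmt.Sprintf("reducer-%d.json", time.Now().UnixNano()))
	f, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	json.NewEncoder(f).Encode(total)
	return total, out
}
```

Note that this sketch loses word order inside each chunk, so tracking the first unique word would additionally require preserving the order of first appearance, which the real implementation presumably does.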
This approach does lower memory usage, but at the cost of many more IO operations.

## Conclusion
I am not entirely sure that the [second approach](#second-attempt) will never run `out of memory`, so I tested it by simulating a low-memory environment with the ratio below.

| File size (GB) | Max memory (GB) |
| :------------- | :-------------: |
| 100            | 16              |
| 1              | 0.16            |

Based on the test results, I believe it will not run `out of memory`. If you have any suggestions or other approaches, please let me know! I am keen to learn about them.
As for the [third approach](#third-attempt), I personally do not think it is a good choice: IO operations are slow, which is why Spark RDD was created.
> Spark has been found to run 100 times faster in-memory, and 10 times faster on disk. It’s also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines.
Let me know what you think; I would be happy to hear it.
Thanks!
## Clean up
```bash
cd terraform
terraform destroy
```