An open API service indexing awesome lists of open source software.

https://github.com/faradayio/pachyderm_large_file_test

Code to test Pachyderm's large file support
https://github.com/faradayio/pachyderm_large_file_test

Last synced: about 2 months ago
JSON representation

Code to test Pachyderm's large file support

Awesome Lists containing this project

README

          

# Testing Pachyderm with very large files in `FILE` mode

This is a test case for [Pachyderm][] focusing on large files and commits,
and avoiding the use of `BLOCK` mode. It uses data from the [OPUS][]
project, which is a collection of multi-lingual texts that have been
"aligned" at the sentence level. Total data size roughly 60 GB at moment,
including a single 33 GB input file.

For more information on where the data comes from and how to transfer it to
S3, see [data/README.md](./data/README.md). For the moment, we provide a
version of the data in `us-east-1`.

OPUS is a great data set for all sorts of linguistics and translation
tasks, though you usually need to massage into a more useful format for
your specific application first.

[Pachyderm]: https://www.pachyderm.io/
[OPUS]: http://opus.lingfil.uu.se/index.php

## Cluster configuration

Kubernetes master and two minions, each with the following specs:

- Docker: 1.12.3
- OS: RancherOS v0.7.1 (4.4.24)
- CPU: 2x2.49 GHz
- RAM: 7.3 GiB
- Disk: 469 GiB

This cluster was created using Rancher 1.3.2, with the fix for
[rancher/rancher#7370][] applied. The servers were created using the Rancher
REST API using the following options:

[rancher/rancher#7370]: https://github.com/rancher/rancher/issues/7370

```typescript
const config = {
amazonec2Config: {
accessKey: process.env['RANCHER_AWS_ACCESS_KEY_ID'],
ami: 'ami-dfdff3c8',
deviceName: '/dev/sda1',
iamInstanceProfile: 'kubernetes',
instanceType: 'm3.large',
// Allocate a public address so the servers can easily
// access outside resources (like Docker Hub).
privateAddressOnly: false,
region: 'us-east-1',
retries: '5',
rootSize: '500',
secretKey: process.env['RANCHER_AWS_SECRET_ACCESS_KEY'],
securityGroup: ['rancher-machine'],
spotPrice: '0.50',
sshUser: 'rancher',
subnetId: '...',
// Never use public addresses to communicate with servers,
// because the security group will block most of it.
usePrivateAddress: true,
volumeType: 'gp2',
vpcId: '...',
zone: 'a'
},
```

Hosts can also be added manually using the UI and these options. For
testing purposes, you can set up the instance profile `kubernetes` with EBS
attach/detach permissions:

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:AttachVolume",
"ec2:DetachVolume"
],
"Resource": "arn:aws:ec2:us-east-1:YOUR-AWS-ID-HERE:instance/*"
},
{
"Effect": "Allow",
"Action": [
"ec2:AttachVolume",
"ec2:DetachVolume"
],
"Resource": "arn:aws:ec2:us-east-1:YOUR-AWS-ID-HERE:volume/*"
}
]
}
```

...and you can allow Rancher to create the `rancher-machine` security group
when creating a server through the UI.

## Running all tests automatically

Make sure `pachctl` can see your cluster and `~/pfs` is mounted. Then run:

```sh
./test.sh
```

## Test 1: Commiting a large file via S3 URL

This is a 33GB tarball stored on S3.

```sh
pachctl create-repo eubookshop_s3
pachctl put-file eubookshop_s3 master EUbookshop0.2.tar.gz -c \
-f s3://fdy-pachyderm-public-test-data/opus/EUbookshop0.2.tar.gz
```

**Result in local test:** The ingestion hung for a while and returned:

```
read tcp 10.42.131.61:55858->52.216.226.112:443: read: connection reset by peer
```

It looks like `pachd` and `rethinkdb` may have broken:

```
$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/etcd-4h30v 1/1 Running 0 1d
po/pachd-dn1b6 0/1 Error 4 2m
po/pachd-k8v2f 1/1 Unknown 7 1d
po/rethink-11szv 0/1 ContainerCreating 0 2m
po/rethink-mfjvn 1/1 Unknown 0 1d

NAME DESIRED CURRENT READY AGE
rc/etcd 1 1 1 1d
rc/pachd 1 1 0 1d
rc/rethink 1 1 0 1d

NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.43.8.144 2379/TCP,2380/TCP 1d
svc/kubernetes 10.43.0.1 443/TCP 1d
svc/pachd 10.43.37.253 650:30650/TCP,651:30651/TCP 1d
svc/rethink 10.43.23.240 8080:32080/TCP,28015:32081/TCP,29015:30438/TCP 1d

NAME DESIRED SUCCESSFUL AGE
jobs/pachd-init 1 1 1d
```

Neither of the minions appears to be particularly low on disk space:

```
root@5a7c7b3ebed6:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 469G 65G 384G 15% /
tmpfs 3.7G 0 3.7G 0% /dev
tmpfs 3.7G 0 3.7G 0% /sys/fs/cgroup
/dev/xvda1 469G 65G 384G 15% /.r
shm 64M 0 64M 0% /dev/shm
```

```
root@b8363e1a9fc6:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 469G 4.9G 444G 2% /
tmpfs 3.7G 0 3.7G 0% /dev
tmpfs 3.7G 0 3.7G 0% /sys/fs/cgroup
/dev/xvda1 469G 4.9G 444G 2% /.r
shm 64M 0 64M 0% /dev/shm
```

## Test 2: Commiting a large file via HTTP URL

This is similar to the above, except we use a explicit URL. In practice,
this URL might be signed using `aws s3 presign`.

## Test 3: Adding all files

This is intended to approximate a test of adding 50GB to 200GB of CSV data
from a single source, spread out across varying numbers of multi-GB files.
We'll just try it with 60 GB for now.

```sh
pachctl create-repo opus_tars
pachctl put-file opus_tars master -c -i URLS.txt
```

**Result in local test:** Hung for about 5 minutes, then `pachd version`
started erroring, suggesting that the backend fell over again.

## Test 4: Copy-through of all files

(Tested successfully with a smaller input.)

We create a pipeline that copies everyting in `/pfs/opus_tars` to
`/pfs/out` unchanged, using `FILE` mode and multiple worker containers.
This has been tested using a single, smaller `tar.gz` input file.

## Test 5: Unpack tarballs, sort by language, repack

(Tested successfully with a smaller input.)

We use two pipelines:

1. `opus_unpack`: This unpacks all the tarballs in `opus_tars` and
organizes the raw data by language in `/pfs/out/en/`, `/pfs/out/es/`, etc.
2. `opus_repack`: This takes `/pfs/opus_unpack/$LANG/` and repacks it as
`/pfs/out/$LANG.tar`. We set `"constant": 2` but we only want to
produce a single, valid `en.tar` file! This is a test of how `FILE`
reduction works with multiple parallel workers and files in
subdirectories, and it assumes the semantics I would natually imagine,
which may not be correct.