Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
HBase RDD example project
https://github.com/hbase-rdd/hbase-rdd-examples
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/hbase-rdd/hbase-rdd-examples
- Owner: hbase-rdd
- License: apache-2.0
- Created: 2015-03-13T14:21:27.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2021-01-22T20:24:12.000Z (almost 4 years ago)
- Last Synced: 2024-03-15T23:54:00.313Z (10 months ago)
- Language: Scala
- Size: 29.3 KB
- Stars: 19
- Watchers: 17
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
HBase RDD examples
==================

![logo](https://raw.githubusercontent.com/hbase-rdd/hbase-rdd/master/docs/logo.png)
This is an example project for [HBase RDD](https://github.com/hbase-rdd/hbase-rdd). It currently targets CDH 5.5, although it should run on other versions of CDH with minor modifications.
Running
-------

First, build the project with

    sbt assembly
This will generate `target/scala-2.12/hbase-rdd-examples-assembly-0.9.1.jar`.
You can then copy this file, together with the files in the `scripts` directory, to a gateway machine of the cluster, and run the scripts to launch the jobs. Of course, you may have to adapt some parameters in the scripts.
Jobs
----

We assume the existence of a file `test-input` in the HDFS home directory of the user running the job. Each line contains four fields, each a random printable string of length 10. Let us call the four fields `k`, `col1`, `col2` and `col3`.
**WriteSingleCf** copies the contents of this file inside `test-table`, using `k` as rowkey, putting `col1` and `col2` under the column family `cf1` and discarding `col3`.
**WriteMultiCf** copies the contents of this file inside `test-table`, using `k` as rowkey, putting `col1` and `col2` under the column family `cf1` and `col3` under `cf2`.
**WriteBulk** does the same as **WriteSingleCf** (on table `test-table-bulk`), by writing to HFiles on HDFS and then submitting these HFiles to the HBase servers.
All the write jobs create the table if it does not exist already - **WriteBulk** also takes care of computing splits appropriate for the file contents and a desired region size (128M in the example).
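To give an idea of the shapes involved, here is a rough sketch of a single-column-family write in the spirit of **WriteSingleCf**. The `unicredit.spark.hbase` import and the `HBaseConfig()`/`toHBase` calls follow the upstream hbase-rdd README and are assumptions here; the actual job sources in this repository are the authoritative version.

    // Sketch only: parse test-input into (rowkey, column -> value) pairs
    // and write them under a single column family.
    import org.apache.spark.{SparkConf, SparkContext}
    import unicredit.spark.hbase._

    object WriteSingleCfSketch extends App {
      val sc = new SparkContext(new SparkConf().setAppName("WriteSingleCfSketch"))
      implicit val config = HBaseConfig() // expects hbase-site.xml on the classpath

      val rows = sc.textFile("test-input") map { line =>
        val Array(k, col1, col2, _) = line split "\t" // col3 is discarded
        k -> Map("col1" -> col1, "col2" -> col2)      // rowkey -> (column -> value)
      }

      rows.toHBase("test-table", "cf1") // everything goes under column family cf1

      // For a multi-family write in the spirit of WriteMultiCf, the values would
      // instead be Map("cf1" -> Map(...), "cf2" -> Map(...)) and the table name
      // alone would be passed to toHBase.
    }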
**Read** reads the contents of `test-table` and reassembles a TSV output on HDFS under `test-output`, in the same format as the original. It does this by specifying both the column families and the columns to read, as a `Map[String, Set[String]]`.
**ReadTS** is the same as above but also reads timestamps.

**ReadCf** does the same as **Read** but only specifies the column families, as a `Set[String]`. The whole column families are read - this is useful when the set of columns in a family is not known a priori, e.g. when a column family is used as a set (with a dummy marker value for all columns).
**ReadTSCf** is the same as above but also reads timestamps.
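For the reading side, a sketch in the spirit of **Read** might look as follows. The `sc.hbase` call and its return shape (an `RDD[(String, Map[String, Map[String, String]])]`, keyed by column family and then column) are assumed from the upstream hbase-rdd README, so check the actual Read job for the exact API.

    // Sketch only: read cf1:{col1,col2} and cf2:{col3} from test-table
    // and reassemble each row as a tab-separated line.
    import org.apache.spark.{SparkConf, SparkContext}
    import unicredit.spark.hbase._

    object ReadSketch extends App {
      val sc = new SparkContext(new SparkConf().setAppName("ReadSketch"))
      implicit val config = HBaseConfig()

      val columns = Map(
        "cf1" -> Set("col1", "col2"),
        "cf2" -> Set("col3")
      )

      val tsv = sc.hbase[String]("test-table", columns) map { case (k, families) =>
        val cf1 = families("cf1")
        val cf2 = families("cf2")
        Seq(k, cf1("col1"), cf1("col2"), cf2("col3")) mkString "\t"
      }

      tsv.saveAsTextFile("test-output")
    }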
**DeleteSingleCf** deletes contents from `test-table`, using `k` as rowkey, deleting `col1` and `col2` under the column family `cf1`.

**DeleteMultiCf** deletes contents from `test-table`, using `k` as rowkey, deleting `col1` and `col2` under the column family `cf1` and `col3` under `cf2`.
**DeleteRows** deletes rows from `test-table`, using `k` as rowkey.
In all jobs we use `String` values for the cells, but HBaseRDD is not limited to this. Any other type `A` is supported, provided there is an implicit `Reads[A]` or `Writes[A]` in scope. These are traits defined by HBaseRDD that essentially wrap conversions from `Array[Byte]` to `A` and vice versa.
By default, HBaseRDD ships converters for `Array[Byte]` (duh!), `String` and `JValue` from [Json4s](http://json4s.org/), but you can write your own implicit conversions as necessary.
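As a sketch of what providing your own converters could look like, here is a hypothetical pair of instances for `Long` cells. The single-method shape of `Writes`/`Reads` used below is an assumption based on the description above (a plain wrapper around the `Array[Byte]` conversions); check the hbase-rdd sources for the actual trait definitions.

    // Sketch only: hypothetical Reads/Writes instances for Long values,
    // built on HBase's own Bytes utility.
    import org.apache.hadoop.hbase.util.Bytes
    import unicredit.spark.hbase._

    implicit val longWrites: Writes[Long] = new Writes[Long] {
      def write(value: Long): Array[Byte] = Bytes.toBytes(value)
    }

    implicit val longReads: Reads[Long] = new Reads[Long] {
      def read(bytes: Array[Byte]): Long = Bytes.toLong(bytes)
    }

    // With these in scope you could, for example, read Long-valued cells:
    // sc.hbase[Long]("test-table", Map("cf1" -> Set("counter")))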
Test file
---------

You can generate the test file as you prefer. A quick way is to save the following as a Scala script and run it with the output path as its first argument:
    import java.io._
    import scala.util.Random

    def printToFile(path: String)(op: PrintWriter => Unit) {
      val p = new PrintWriter(new File(path))
      try { op(p) } finally { p.close() }
    }

    def nextString = (1 to 10) map (_ => Random.nextPrintableChar) mkString

    def nextLine = (1 to 4) map (_ => nextString) mkString "\t"

    printToFile(args(0)) { p =>
      for (_ <- 1 to 100000000) {
        p.println(nextLine)
      }
    }
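If you save the snippet above as, say, `gen.scala` (a name chosen here just for illustration), you can run it with `scala gen.scala test-input`, so that `args(0)` is the output path, and then copy the result to your HDFS home directory with `hdfs dfs -put test-input`, which is where the jobs expect to find it.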