# Eel

[![Join the chat at https://gitter.im/eel-sdk/Lobby](https://badges.gitter.im/eel-sdk/Lobby.svg)](https://gitter.im/eel-sdk/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Build Status](https://travis-ci.org/51zero/eel-sdk.svg?branch=master)](https://travis-ci.org/51zero/eel-sdk)
[![Issues](https://img.shields.io/github/issues/51zero/eel-sdk/bug.svg)](https://github.com/51zero/eel-sdk/issues?q=is%3Aissue+is%3Aopen+label%3A"bug")
[<img src="https://img.shields.io/maven-central/v/io.eels/eel-core_2.11.svg?label=latest%20release%20for%202.11"/>](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22eel-core_2.11%22)
[<img src="https://img.shields.io/maven-central/v/io.eels/eel-core_2.12.svg?label=latest%20release%20for%202.12"/>](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22eel-core_2.12%22)

Eel is a toolkit for manipulating data in the Hadoop ecosystem. By Hadoop ecosystem we mean file formats common to the big-data world, such as Parquet, ORC and CSV, in locations such as HDFS or Hive tables.
In contrast to distributed batch or streaming engines such as [Spark](http://spark.apache.org/) or [Flink](https://flink.apache.org/), Eel is an SDK intended to be used directly in process. Eel is a lower-level API than engines like Spark and is aimed at use cases where you want something like a file API.

![eel logo](https://raw.githubusercontent.com/eel-sdk/eel/master/eel-core/src/main/graphics/eel_small.png)

### Example Use Cases

* Importing from one source such as JDBC into another source such as Hive/HDFS
* Coalescing multiple files, such as the output from Spark, into a single file
* Querying, streaming or reading into memory (relatively) small datasets directly from your process without reaching out to YARN or similar
* Moving or altering partitions in Hive
* Retrieving statistics on existing tables or datasets
* Reading or generating schemas for existing datasets

## Comparisons

Here are some of our notes comparing Eel to other tools that offer similar functionality.

## Comparison with Sqoop

*Sqoop* is a popular Hadoop ETL tool and API used for loading foreign data (e.g. JDBC) into Hive/Hadoop.

Sqoop executes a configurable number of Hadoop mapper jobs in parallel. Each mapper job makes a separate JDBC connection and adapts its query to retrieve part of the data.
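The way each mapper's query is adapted can be pictured as a range split over a numeric key column. The sketch below is illustrative only - it is not Sqoop's actual code, and the table name, column and bounds are hypothetical:

```scala
// Illustrative sketch of the "split by" technique (not Sqoop's implementation):
// divide a numeric key range into one bounded query per mapper.
def splitQueries(table: String, splitBy: String, min: Long, max: Long, mappers: Int): Seq[String] = {
  val step = math.max(1L, (max - min + 1) / mappers)
  (0 until mappers).map { i =>
    val lo = min + i * step
    // The last mapper absorbs any remainder of the range.
    val hi = if (i == mappers - 1) max else lo + step - 1
    s"SELECT * FROM $table WHERE $splitBy BETWEEN $lo AND $hi"
  }
}
```

Each generated query would then be run by one mapper over its own JDBC connection.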
To support the parallelism of mapper jobs you must specify a **split by** column key and, if applicable, Hive partitioning key columns.

- With this approach you can end up with several small part files (one for each mapper task) in HDFS, which is not an optimal way of storing data in Hadoop.
- To reduce the number of part files you must reduce the number of mappers, hence reducing the parallelism.
- At the time of this writing, Oracle **Number** and **Timestamp** types aren't properly supported from Oracle to Hive with a Parquet dialect.
- **Sqoop** depends on **YARN** to allocate resources for each mapper task.
- Both the **Sqoop** CLI and API have a steep learning curve.

## Comparison with Flume

*Flume* supports streaming data from a plethora of out-of-the-box sources and sinks.

Flume supports the notion of a channel, which is like a persistent queue that glues together sources and sinks.

The channel is an attractive feature as it can buffer up transactions/events under heavy load conditions – channel types can be File, Kafka or JDBC.

- The Flume Hive sink is *limited* to streaming events containing delimited text or JSON data directly into a Hive table or partition - with EEL it's possible to write a custom source and sink and thereby support all source/sink types such as Parquet, Orc, Hive, etc.
- Flume requires additional maintenance of a Flume agent topology - separate processes.

## Comparison with Kite

The Kite API and CLI are very similar in functionality to EEL, but there are some subtle differences:

- Datasets in *Kite* require AVRO schemas.
- A dataset is essentially a Hive table - in the upcoming *EEL 1.2* release you will be able to create Hive tables from the CLI - at the moment it's possible to generate the Hive DDL with the EEL API using *io.eels.component.hive.HiveDDL$#showDDL*.
- For writing directly to **AVRO** or **Parquet** storage formats you must provide an **AVRO** schema – EEL dynamically infers a schema from the underlying source, for example a JDBC query or CSV headers.
- Support for ingesting from storage formats (other than **AVRO** and **Parquet**) is achieved by *transforming* each record/row with another module named **Kite Morphlines** - it uses another intermediate record format and is another **API** to learn.
- EEL supports transformations using regular Scala functions by invoking the *map* method on the source's underlying *DataStream*, e.g. *source.toDataStream.map(f: (Row) => Row)* – the *map* function returns a new row object.
- Kite has direct support for *HBase* but EEL doesn't – this will arrive with the upcoming *EEL 1.2* release.
- Kite currently **doesn't** support Kudu – EEL does.
- Kite stores additional metadata on disk (**HDFS**) for a directory to be deemed a valid Kite dataset – if you change the schema outside of Kite, e.g. through *DDL*, the dataset can become *out of sync* and potentially *malfunction* - EEL functions normally in this scenario as no additional metadata is required.
- Kite handles Hive partitioning by specifying partition strategies – there are a few *out-of-the-box* strategies derived from the current payload – with **EEL** this works automatically by virtue of providing the same column on the source row; alternatively you can add a partition key column on the fly with **addField** on the source's DataStream, or use the **map** transformation function.

## Introduction to the API

The core data structure in Eel is the `DataStream`. A DataStream consists of a `Schema`, and zero or more `Row`s which contain values for each field in the schema.
A DataStream is conceptually similar to a table in a relational database, a dataframe in Spark, or a dataset in Flink.
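As a conceptual analogy in plain Scala (these are not eel's actual classes), a DataStream can be pictured as a schema plus a collection of rows whose values line up with the schema's fields:

```scala
// Conceptual analogy only - not the eel API.
case class Schema(fields: Vector[String])
case class Row(schema: Schema, values: Vector[Any]) {
  // Look a value up by field name via its position in the schema.
  def get(field: String): Any = values(schema.fields.indexOf(field))
}

val schema = Schema(Vector("name", "location"))
val rows = Vector(
  Row(schema, Vector("Sam", "London")),
  Row(schema, Vector("Alice", "Paris"))
)
// Collection-style operations, much like those offered on a DataStream:
val londoners = rows.filter(_.get("location") == "London")
```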
DataStreams can be read from a `Source` such as Hive tables, JDBC databases, or even programmatically from Scala or Java collections.
DataStreams can be written out to a `Sink` such as a Hive table or Parquet file.

The current set of sources and sinks includes: *Apache Avro*, *Apache Parquet*, *Apache Orc*, *CSV*, *Kafka* (sink only), *HDFS*, *Kudu*, *JDBC*, *Hive*, *JSON files*.

Once you have a reference to a DataStream, the DataStream can be manipulated in a similar way to regular Scala collections - many of the methods
share the same name, such as `map`, `filter`, `take`, `drop`, etc. All operations on a DataStream are lazy - they will only be executed
once an _action_ takes place such as `collect`, `count`, or `save`.

For example, you could load data from a CSV file, drop rows that don't match a predicate, and then save the data back out to a Parquet file,
all in a couple of lines of code.

```scala
val source = CsvSource(new Path("input.csv"))
val sink = ParquetSink(new Path("output.pq"))
source.toDataStream().filter(_.get("location") == "London").to(sink)
```

### Types Supported

|Eel Datatype|JVM Types|
|-----|-------|
|BigInteger|BigInt|
|Binary|Array of Bytes|
|Byte|Byte|
|DateTime|java.sql.Date|
|Decimal(precision,scale)|BigDecimal|
|Double|Double|
|Float|Float|
|Int|Int|
|Long|Long|
|Short|Short|
|String|String|
|TimestampMillis|java.sql.Timestamp|
|Array|Array, Java collection or Scala Seq|
|Map|Java or Scala Map|

# Sources and Sinks Usage Patterns

The following examples describe going from a **JdbcSource** to a specific **Sink**, so we first need to set up some test **JDBC** data using an **H2** in-memory database with the following code snippet:

```scala
  def executeBatchSql(dataSource: DataSource, sqlCmds: Seq[String]): Unit = {
    val connection = dataSource.getConnection()
    connection.clearWarnings()
    sqlCmds.foreach { ddl =>
      val statement =
        connection.createStatement()
      statement.execute(ddl)
      statement.close()
    }
    connection.close()
  }
  // Set up JDBC data in an H2 in-memory database
  val dataSource = new BasicDataSource()
  dataSource.setDriverClassName("org.h2.Driver")
  dataSource.setUrl("jdbc:h2:mem:eel_test_data")
  dataSource.setPoolPreparedStatements(false)
  dataSource.setInitialSize(5)
  val sql = Seq(
    "CREATE TABLE IF NOT EXISTS PERSON(NAME VARCHAR(30), AGE INT, SALARY NUMBER(38,5), CREATION_TIME TIMESTAMP)",
    "INSERT INTO PERSON VALUES ('Fred', 50, 50000.99, CURRENT_TIMESTAMP())",
    "INSERT INTO PERSON VALUES ('Gary', 50, 20000.34, CURRENT_TIMESTAMP())",
    "INSERT INTO PERSON VALUES ('Alice', 50, 99999.98, CURRENT_TIMESTAMP())"
  )
  executeBatchSql(dataSource, sql)
```

## JdbcSource To HiveSink with Parquet Dialect

First let's create a Hive table named **person** in the database **eel_test**, partitioned by *title*.

_Note the following Hive DDL creates the table in *Parquet* format._

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.person` (
   `NAME` string,
   `AGE` int,
   `SALARY` decimal(38,5),
   `CREATION_TIME` timestamp)
PARTITIONED BY (`title` string)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_test/persons';
```
**Example Create Table**
```sql
hive> CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.person` (
    >    `NAME` string,
    >    `AGE` int,
    >    `SALARY` decimal(38,5),
    >    `CREATION_TIME` timestamp)
    > PARTITIONED BY (`title` string)
    > ROW FORMAT SERDE
    >    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    > STORED AS INPUTFORMAT
    >    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    > OUTPUTFORMAT
    >    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    > LOCATION '/client/eel_test/persons';
OK
Time taken: 1.474 seconds
```

### Using the HiveSink

```scala
    // Write to a HiveSink from a JdbcSource
    val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
    implicit val hadoopFileSystem = FileSystem.get(new Configuration())
    implicit val hiveMetaStoreClient = new HiveMetaStoreClient(new HiveConf())
    JdbcSource(() => dataSource.getConnection, query)
      .withFetchSize(10)
      .toDataStream
      .withLowerCaseSchema
      // Transformation - add title to row
      .map { row =>
        if (row.get("name").toString == "Alice") row.add("title", "Mrs") else row.add("title", "Mr")
      }
      .to(HiveSink("eel_test", "person").withIOThreads(1).withInheritPermission(true))
```

1. The JdbcSource takes a connection function and a SQL query - it will execute the SQL and derive the EEL schema from it. Also note **withFetchSize**, which sets the number of rows retrieved per fetch, reducing the number of RPC calls to the database server.
2. *hadoopFileSystem* is a *Hadoop FileSystem* object, a Scala implicit required by the HiveSink.
3. *hiveMetaStoreClient* is a *Hive metastore client* object, a Scala implicit required by the HiveSink.
4. *withLowerCaseSchema* lowercases all the field names of the *JdbcSource* schema - internally Hive lowercases table and column names, so the source schema should match.
5. The *map* function performs a *transformation* - it simply adds a new column called **title**, deciding whether the value should be **Mr** or **Mrs** - *title* is defined as a partition column key on the Hive table.
6. The *HiveSink* passed to the *to* method specifies the target Hive *database* and *table* respectively.
7. *withIOThreads* on the *HiveSink* specifies the number of worker threads, where each thread writes to its own file - the default is 4. This is set to 1 because we don't want to end up with too many files given that the source only has 3 rows.
8. *withInheritPermission* on the *HiveSink* means that when the sink creates new files it should inherit the HDFS permissions from the parent folder - typically this is negated by the default **UMASK** policy set in the Hadoop site files.

- Note the **HiveSink** takes care of automatically updating the *Hive metastore* when new partitions are added.

### Results shown in Hive
```sql
hive> select * from eel_test.person;
OK
Fred    50      50000.99000     2017-01-24 14:40:50.664 Mr
Gary    50      20000.34000     2017-01-24 14:40:50.664 Mr
Alice   50      99999.98000     2017-01-24 14:40:50.664 Mrs
Time taken: 2.59 seconds, Fetched: 3 row(s)
hive>
```
### Partition layout on HDFS

There should be 2 files created by the *HiveSink*: one in the partition for title **Mr** and one in **Mrs**.

Here are the partitions using the **hadoop fs -ls** shell command:

```shell
$ hadoop fs -ls /client/eel_test/persons
Found 2 items
drwxrwxrwx   - eeluser supergroup          0 2017-01-24 14:40 /client/eel_test/persons/title=Mr
drwxrwxrwx   - eeluser supergroup          0 2017-01-24 14:40 /client/eel_test/persons/title=Mrs
```

Now let's see if a file was created for the **Mr** partition:
```shell
$ hadoop fs -ls /client/eel_test/persons/title=Mr
Found 1 items
-rw-r--r--   3 eeluser supergroup        752 2017-01-24 14:40 /client/eel_test/persons/title=Mr/eel_2985827854647169_0
```

Now let's see if a file was created for the **Mrs** partition:
```shell
$ hadoop fs -ls /client/eel_test/persons/title=Mrs
Found 1 items
-rw-r--r--   3 eeluser supergroup        723 2017-01-24 14:40 /client/eel_test/persons/title=Mrs/eel_2985828912259519_0
```
### HiveSource Optimizations

The 1.2 release of
the **HiveSource** with **Parquet** and **Orc** storage formats exploits the following optimizations supported by those formats:

1. **Column pruning** or **schema projection** means providing a read schema - the reader is interested in only certain fields, not all fields written by the writer. The *Parquet* and *Orc* columnar formats do this efficiently without reading the entire row, i.e. only reading the bytes required for those fields.
2. **Predicate push-down** means that filter expressions can be applied to the read without reading the entire row - only reading the bytes required for the filter expressions.
3. In addition, partition pruning is supported - if a table is organised by partitions then full table scans can be avoided by providing the partition key values.

#### Reading back the data via HiveSource and printing to the console

```scala
    implicit val hadoopFileSystem = FileSystem.get(new Configuration())
    implicit val hiveMetaStoreClient = new HiveMetaStoreClient(new HiveConf())
    HiveSource("eel_test", "person")
      .toDataStream()
      .collect
      .foreach(row => println(row))
```
1. *hadoopFileSystem* is a *Hadoop FileSystem* object, a Scala implicit required by the HiveSource.
2. *hiveMetaStoreClient* is a *Hive metastore client* object, a Scala implicit required by the HiveSource.
3. *HiveSource* takes arguments for the Hive *database* and *table* respectively.
4. To get the collection of rows you need to perform the action **collect** on the source's underlying **DataStream**: *toDataStream().collect()*, then iterate over each row and print it out using *foreach(row => println(row))*.

Here are the results of the read:
```
[name = Fred,age = 50,salary = 50000.99000,creation_time = 2017-01-24 13:40:50.664,title = Mr]
[name = Gary,age = 50,salary = 20000.34000,creation_time = 2017-01-24 13:40:50.664,title = Mr]
[name = Alice,age = 50,salary = 99999.98000,creation_time = 2017-01-24 13:40:50.664,title = Mrs]
```

### Using a predicate with the HiveSource

You can query data via the **HiveSource** using simple **and**/**or** predicates with relational operators such as **equals**, **gt**, **ge**, **lt**, **le**, etc.

```scala
    implicit val hadoopFileSystem = FileSystem.get(new Configuration())
    implicit val hiveMetaStoreClient = new HiveMetaStoreClient(new HiveConf())
    HiveSource("eel_test", "person")
      .withPredicate(Predicate.or(Predicate.equals("name", "Alice"), Predicate.equals("name", "Gary")))
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
The above **HiveSource** predicate is equivalent to the SQL:
```sql
select * from eel_test.person
where name = 'Alice' or name = 'Gary'
```
The result is as follows:
```
[name = Gary,age = 50,salary = 20000.34000,creation_time = 2017-01-24 13:40:50.664,title = Mr]
[name = Alice,age = 50,salary = 99999.98000,creation_time = 2017-01-24 13:40:50.664,title = Mrs]
```
#### Using a partition key and predicate with the HiveSource

Specifying a partition key on the **HiveSource** with the method **withPartitionConstraint** restricts the *predicate* to a specific *partition*. This significantly speeds up the query because it
avoids an expensive full table scan.

If you have simple filtering requirements on relatively small datasets then this approach may be considerably faster than using query engines such as *Hive*, *Spark* or *Impala*. Here's an example:

```scala
    implicit val hadoopFileSystem = FileSystem.get(new Configuration())
    implicit val hiveMetaStoreClient = new HiveMetaStoreClient(new HiveConf())
    HiveSource("eel_test", "person")
      .withPredicate(Predicate.or(Predicate.equals("name", "Alice"), Predicate.equals("name", "Gary")))
      .withPartitionConstraint(PartitionConstraint.equals("title", "Mr"))
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
The **withPartitionConstraint** method homes in on the **title** partition whose value is **Mr** and performs filtering on it using **withPredicate**.

The equivalent SQL would be:
```sql
select * from eel_test.person
where title = 'Mr'
and (name = 'Alice' or name = 'Gary')
```

The result is as follows:
```
[name = Gary,age = 50,salary = 20000.34000,creation_time = 2017-01-24 13:40:50.664,title = Mr]
```

## JdbcSource To ParquetSink

```scala
  val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
  val parquetFilePath = new Path("hdfs://nameservice1/client/eel/person.parquet")
  implicit val hadoopFileSystem = FileSystem.get(new Configuration()) // This is required
  JdbcSource(() => dataSource.getConnection, query).withFetchSize(10)
    .toDataStream.to(ParquetSink(parquetFilePath))
```
1. The **JdbcSource** takes a connection function and a SQL query - it will execute the SQL and derive the EEL schema from it. Also note **withFetchSize**, which sets the number of rows retrieved per fetch, reducing the number of RPC calls to the database server.
2. **parquetFilePath** is the **ParquetSink** file path pointing to an **HDFS** path - alternatively this could be a local file path if you qualify it with the *file:* scheme.
3. **hadoopFileSystem** is a Scala implicit required by the **ParquetSink**.
4. If you have **parquet-tools** installed on your system you can look at the native schema like so:
```shell
$ parquet-tools schema person.parquet
message row {
  optional binary NAME (UTF8);
  optional int32 AGE;
  optional fixed_len_byte_array(16) SALARY (DECIMAL(38,5));
  optional int96 CREATION_TIME;
}
```
- Parquet encodes **Decimal** as a *fixed byte array* and *Timestamp* as an *int96*.
5. Reading back the data via **ParquetSource** and printing to the console:
```scala
    val parquetFilePath = new Path("hdfs://nameservice1/client/eel/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    ParquetSource(parquetFilePath)
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```

1. **parquetFilePath** is the **ParquetSource** file path pointing to an **HDFS** path - alternatively this could be a local file path if you qualify it with the *file:* scheme.
2. **hadoopConfiguration** and **hadoopFileSystem** are Scala implicits required by the **ParquetSource**.
3. To get the collection of rows you need to perform the action **collect** on the source's underlying **DataStream**: *toDataStream().collect()*, then iterate over each row and print it out using *foreach(row => println(row))*.
4. Here are the results of the read:
```
[NAME = Fred,AGE = 50,SALARY = 50000.99000,CREATION_TIME = 2017-01-23 14:53:51.862]
[NAME = Gary,AGE = 50,SALARY = 20000.34000,CREATION_TIME = 2017-01-23 14:53:51.876]
[NAME = Alice,AGE = 50,SALARY = 99999.98000,CREATION_TIME = 2017-01-23 14:53:51.876]
```

### Predicate push-down

You can query data via the **ParquetSource** using simple and/or predicates with relational operators such as **equals**, **gt**, **ge**, **lt**, **le**, etc.

*Predicate push-down* means that filter expressions can be applied to the read without reading the entire row (a feature of the **Parquet** and **Orc** columnar formats), i.e. it only reads the bytes required for the filter expressions, e.g.:
```scala
    val parquetFilePath = new Path("hdfs://nameservice1/client/eel/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    ParquetSource(parquetFilePath)
      .withPredicate(Predicate.or(Predicate.equals("NAME", "Alice"), Predicate.equals("NAME", "Gary")))
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
The above **ParquetSource** predicate (**withPredicate**) is equivalent to the SQL predicate:
```sql
where name = 'Alice' or name = 'Gary'
```
The result is as follows:
```
[NAME = Gary,AGE = 50,SALARY = 20000.34000,CREATION_TIME = 2017-01-23 14:53:51.876]
[NAME = Alice,AGE = 50,SALARY = 99999.98000,CREATION_TIME = 2017-01-23 14:53:51.876]
```
### Schema projection

*Column pruning* or *schema projection* means providing a read schema - the reader is interested in only certain fields, not all fields written by the writer. The *Parquet* and *Orc* columnar formats do this efficiently without reading the entire row, i.e.
only reading the bytes required for those fields, e.g.:
```scala
    val parquetFilePath = new Path("hdfs://nameservice1/client/eel/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    ParquetSource(parquetFilePath)
      .withProjection("NAME", "SALARY")
      .withPredicate(Predicate.or(Predicate.equals("NAME", "Alice"), Predicate.equals("NAME", "Gary")))
      .toDataStream()
      .collect
      .foreach(row => println(row))
```
The above **ParquetSource** projection (**withProjection**) is equivalent to the SQL select:
```sql
select NAME, SALARY
```
The result is as follows:
```
[NAME = Gary,SALARY = 20000.34000]
[NAME = Alice,SALARY = 99999.98000]
```

## JdbcSource To OrcSink

1. The OrcSink works almost identically to the Parquet sink (see above):
```scala
    // Write to an OrcSink from a JdbcSource
    val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
    val orcFilePath = new Path("hdfs://nameservice1/client/eel/person.orc")
    implicit val hadoopConfiguration = new Configuration()
    JdbcSource(() => dataSource.getConnection, query).withFetchSize(10)
      .toDataStream
      .to(OrcSink(orcFilePath))
```
2. Reading back the data via **OrcSource** and printing to the console:
```scala
    val orcFilePath = new Path("hdfs://nameservice1/client/eel/person.orc")
    implicit val hadoopConfiguration = new Configuration()
    OrcSource(orcFilePath)
      .toDataStream.collect().foreach(row => println(row))
```
## JdbcSource To KuduSink

**TBD**

## JdbcSource To AvroSink

```scala
    // Write to an AvroSink from a JdbcSource
    val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
    val avroFilePath = Paths.get(s"${sys.props("user.home")}/person.avro")
    JdbcSource(() => dataSource.getConnection, query)
      .withFetchSize(10)
      .toDataStream
      .replaceFieldType(DecimalType.Wildcard, DoubleType)
      .replaceFieldType(TimestampMillisType, StringType)
      .to(AvroSink(avroFilePath))
```
1. The **JdbcSource** takes a connection function and a SQL query - it will execute the SQL and derive the EEL schema from it. Also note **withFetchSize**, which sets the number of rows retrieved per fetch, reducing the number of RPC calls to the database server.
2. **avroFilePath** is the **AvroSink** file path pointing to a path on the local file system.
3. The 2 **replaceFieldType** method calls map **DecimalType** to **DoubleType** and **TimestampMillisType** to **StringType**, as **Decimals** and **Timestamps** are not supported in an *Avro schema*.
4. If you have **avro-tools** installed on your system you can look at the native schema like so - alternatively use the **AvroSource** to read it back in (see below).
```shell
$ avro-tools getschema person.avro
{
  "type" : "record",
  "name" : "row",
  "namespace" : "namespace",
  "fields" : [ {
    "name" : "NAME",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "AGE",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "SALARY",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "CREATION_TIME",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
```
5. Reading back the data via **AvroSource** and printing to the console:
```scala
    val avroFilePath = Paths.get(s"${sys.props("user.home")}/person.avro")
    AvroSource(avroFilePath)
      .toDataStream()
      .collect
      .foreach(row => println(row))
```

1. **avroFilePath** is the **AvroSource** file path pointing to a path on the local file system.
2. To get the collection of rows you need to perform the action **collect** on the source's underlying **DataStream**: *toDataStream().collect*, then iterate over each row and print it out using *foreach(row => println(row))*.
3. Here are the results of the read:
```
[NAME = Fred,AGE = 50,SALARY = 50000.99,CREATION_TIME = 2017-01-24 16:13:07.524]
[NAME = Gary,AGE = 50,SALARY = 20000.34,CREATION_TIME = 2017-01-24 16:13:07.532]
[NAME = Alice,AGE = 50,SALARY = 99999.98,CREATION_TIME = 2017-01-24 16:13:07.532]
```

## JdbcSource To CsvSink
1. The CsvSink works almost identically to the Parquet sink (see above):
```scala
    // Write to a CsvSink from a JdbcSource
    val query = "SELECT NAME, AGE, SALARY, CREATION_TIME FROM PERSON"
    val csvFilePath = new Path("hdfs://nameservice1/client/eel/person.csv")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    JdbcSource(() => dataSource.getConnection, query).withFetchSize(10)
      .toDataStream
      .to(CsvSink(csvFilePath))
```
2. Reading back the data via **CsvSource** and printing to the console:
```scala
    val csvFilePath = new Path("hdfs://nameservice1/client/eel/person.csv")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    CsvSource(csvFilePath)
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
Note that by default the **CsvSource** converts all types to strings - the following code prints out the fields in the schema:
```scala
    CsvSource(csvFilePath).toDataStream().schema.fields.foreach(f => println(f))
```
You can enforce the types on the **CsvSource** by supplying a *SchemaInferrer*:
```scala
    val csvFilePath = new Path("hdfs://nameservice1/client/eel/person.csv")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
    val schemaInferrer = SchemaInferrer(StringType,
      DataTypeRule("AGE", IntType.Signed),
      DataTypeRule("SALARY", DecimalType.Wildcard),
      DataTypeRule(".*\\_TIME", TimeMillisType))
    CsvSource(csvFilePath).withSchemaInferrer(schemaInferrer)
      .toDataStream()
      .collect()
      .foreach(row =>
println(row))
```
The above **schemaInferrer** object sets up rules that map the field named **AGE** to an **int**, **SALARY** to a **Decimal**, and any field name ending in **TIME** (matched via regex) to a **Timestamp**.

Note that the first parameter to **SchemaInferrer** is *StringType*, meaning that this is the default type for any field not matched by a rule.

## Working with Nested Types in Sources and Sinks

The storage formats *Parquet* and *Orc* support nested types such as *struct*, *map* and *list*.

### Structs in Parquet
The following example describes how to write rows containing a single struct column named *PERSON_DETAILS*:

```sql
struct PERSON_DETAILS {
    NAME String,
    AGE Int,
    SALARY DECIMAL(38,5),
    CREATION_TIME TIMESTAMP
}
```
#### Step 1:  Set up the HDFS path and Scala implicit objects
```scala
    val parquetFilePath = new Path("hdfs://nameservice1/client/eel_struct/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)
```
#### Step 2:  Create the schema containing a single column named *PERSON_DETAILS*, which is a *struct* type:
```scala
    val personDetailsStruct = Field.createStructField("PERSON_DETAILS",
      Seq(
        Field("NAME", StringType),
        Field("AGE", IntType.Signed),
        Field("SALARY", DecimalType(Precision(38), Scale(5))),
        Field("CREATION_TIME", TimestampMillisType)
      )
    )
    val schema = StructType(personDetailsStruct)
```
- A *struct* is encoded as a list of *Fields* with their corresponding *type* definitions.

#### Step 3:  Create 3 rows of *structs*
```scala
    val rows = Vector(
      Vector(Vector("Fred", 50, BigDecimal("50000.99000"), new Timestamp(System.currentTimeMillis()))),
      Vector(Vector("Gary", 50, BigDecimal("20000.34000"), new Timestamp(System.currentTimeMillis()))),
      Vector(Vector("Alice", 50, BigDecimal("99999.98000"), new 
Timestamp(System.currentTimeMillis())))
    )
```
- The outer *Vector*, i.e. **val rows = Vector(...)**, is the list of rows - 3 in this case.
- Each inner *Vector*, e.g. **Vector(...)**, is a single row of column values.
- Each column value in this case is another **Vector** representing the **struct**, e.g. **Vector("Alice", 50, BigDecimal("99999.98000"), new Timestamp(System.currentTimeMillis()))**.

#### Step 4:  Write the rows using the ParquetSink
```scala
    DataStream.fromValues(schema, rows)
      .to(ParquetSink(parquetFilePath))
```

If you have **parquet-tools** installed on your system you can look at the native schema like so:
```shell
$ parquet-tools schema person.parquet
message row {
  optional group PERSON_DETAILS {
    optional binary NAME (UTF8);
    optional int32 AGE;
    optional fixed_len_byte_array(16) SALARY (DECIMAL(38,5));
    optional int96 CREATION_TIME;
  }
}
```
- Notice that Parquet encodes the *struct* as a *group* of columns.
#### Step 5:  Read back the rows using the ParquetSource
```scala
    ParquetSource(parquetFilePath)
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
#### The results of Step 5
```
[PERSON_DETAILS = WrappedArray(Fred, 50, 50000.99000, 2017-01-25 15:56:06.212)]
[PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-01-25 15:56:06.212)]
[PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-01-25 15:56:06.212)]
```
#### Applying a predicate (filter) on the read - give me person details for the names Alice and Gary
```scala
    ParquetSource(parquetFilePath)
      .withPredicate(Predicate.or(Predicate.equals("PERSON_DETAILS.NAME", "Alice"), Predicate.equals("PERSON_DETAILS.NAME", "Gary")))
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
The above is equivalent to the following in SQL:
```sql
select PERSON_DETAILS
where PERSON_DETAILS.NAME = 'Alice' or PERSON_DETAILS.NAME = 
'Gary'
```
#### The results with the predicate filter
```
[PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-01-25 16:03:37.678)]
[PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-01-25 16:03:37.678)]
```

### Looking at the **Parquet** file through **Hive**

On the *Parquet* file just written we can create a **Hive external** table pointing at the *HDFS* location of the file.
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.struct_person`(
   PERSON_DETAILS STRUCT<NAME:String, AGE:Int, SALARY:decimal(38,5), CREATION_TIME:TIMESTAMP>
)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_struct';
```
- The location **/client/eel_struct** is the root directory where all the files live - in this case it's the root folder of the *Parquet* write in *Step 4*.

#### Here's a Hive session showing the select:
```sql
hive> select * from eel_test.struct_person;
OK
{"NAME":"Fred","AGE":50,"SALARY":50000.99,"CREATION_TIME":"2017-01-25 17:03:37.678"}
{"NAME":"Gary","AGE":50,"SALARY":20000.34,"CREATION_TIME":"2017-01-25 17:03:37.678"}
{"NAME":"Alice","AGE":50,"SALARY":99999.98,"CREATION_TIME":"2017-01-25 17:03:37.678"}
Time taken: 1.092 seconds, Fetched: 3 row(s)
hive>
```
#### Here's another Hive query asking for Alice's and Gary's ages:
```sql
hive> select person_details.name, person_details.age
    > from eel_test.struct_person
    > where person_details.name in ('Alice', 'Gary');
OK
Gary    50
Alice   50
Time taken: 0.067 seconds, Fetched: 2 row(s)
hive>
```
-  *HiveQL* has some nice features for cracking open nested types - the query returns scalar values for *name* and *age* from the *person_details* structure.
-  The 
same query is supported in *Spark* via *HiveContext*, or via *SparkSession* in versions *>= 2.x*.

### Arrays in Parquet

EEL supports *Parquet* **ARRAY** columns whose elements are any *primitive* type, or *structs*. The following example extends the previous one by adding another column called **PHONE_NUMBERS**, defined as an **ARRAY** of **Strings**.

#### Writing with an ARRAY of strings - PHONE_NUMBERS
```scala
    val parquetFilePath = new Path("hdfs://nameservice1/client/eel_array/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)

    // Create the schema with a STRUCT and an ARRAY
    val personDetailsStruct = Field.createStructField("PERSON_DETAILS",
      Seq(
        Field("NAME", StringType),
        Field("AGE", IntType.Signed),
        Field("SALARY", DecimalType(Precision(38), Scale(5))),
        Field("CREATION_TIME", TimestampMillisType)
      )
    )
    val schema = StructType(personDetailsStruct, Field("PHONE_NUMBERS", ArrayType.Strings))

    // Create 3 rows
    val rows = Vector(
      Vector(Vector("Fred", 50, BigDecimal("50000.99000"), new Timestamp(System.currentTimeMillis())), Vector("322", "987")),
      Vector(Vector("Gary", 50, BigDecimal("20000.34000"), new Timestamp(System.currentTimeMillis())), Vector("145", "082")),
      Vector(Vector("Alice", 50, BigDecimal("99999.98000"), new Timestamp(System.currentTimeMillis())), Vector("534", "129"))
    )

    // Write the rows
    DataStream.fromValues(schema, rows)
      .to(ParquetSink(parquetFilePath))
```
If you have **parquet-tools** installed on your system you can look at the native schema like so:
```shell
$ parquet-tools schema person.parquet
message row {
  optional group PERSON_DETAILS {
    optional binary NAME (UTF8);
    optional int32 AGE;
    optional fixed_len_byte_array(16) SALARY (DECIMAL(38,5));
    optional 
int96 CREATION_TIME;
  }
  repeated binary PHONE_NUMBERS (UTF8);
}
```
- Notice **PHONE_NUMBERS** is represented as a repeated UTF8 (string) in Parquet, i.e. an unbounded array.
#### Read back the rows via ParquetSource
```scala
    ParquetSource(parquetFilePath)
      .toDataStream()
      .collect()
      .foreach(row => println(row))
```
- The results:
```
[PERSON_DETAILS = WrappedArray(Fred, 50, 50000.99000, 2017-01-25 20:33:48.302),PHONE_NUMBERS = Vector(322, 987)]
[PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-01-25 20:33:48.302),PHONE_NUMBERS = Vector(145, 082)]
[PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-01-25 20:33:48.302),PHONE_NUMBERS = Vector(534, 129)]
```
### Looking at the **Parquet** file through **Hive**

On the *Parquet* file just written we can create a **Hive external** table pointing at the *HDFS* location of the file.
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.struct_person_phone`(
   PERSON_DETAILS STRUCT<NAME:String, AGE:Int, SALARY:decimal(38,5), CREATION_TIME:TIMESTAMP>,
   PHONE_NUMBERS Array<String>
)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_array';
```
- The location **/client/eel_array** is the root directory where all the files live - in this case it's the root folder of the *Parquet* write above.

#### Here's a Hive session showing the select:
```sql
hive> select * from eel_test.struct_person_phone;
OK
{"NAME":"Fred","AGE":50,"SALARY":50000.99,"CREATION_TIME":"2017-01-26 10:50:57.192"}    ["322","987"]
{"NAME":"Gary","AGE":50,"SALARY":20000.34,"CREATION_TIME":"2017-01-26 10:50:57.192"}    
["145","082"]
{"NAME":"Alice","AGE":50,"SALARY":99999.98,"CREATION_TIME":"2017-01-26 10:50:57.192"}   ["534","129"]
Time taken: 1.248 seconds, Fetched: 3 row(s)
hive>
```
#### Here's another Hive query asking for Alice's and Gary's ages and phone numbers:
```sql
hive> select person_details.name, person_details.age, phone_numbers
    > from eel_test.struct_person_phone
    > where person_details.name in ('Alice', 'Gary');
OK
Gary    50      ["145","082"]
Alice   50      ["534","129"]
Time taken: 0.181 seconds, Fetched: 2 row(s)
hive>
```
-  *HiveQL* has some nice features for cracking open nested types - the query returns scalar values for *name* and *age* from the *person_details* structure, plus phone numbers from the *phone_numbers* array.
-  The same query is supported in *Spark* via *HiveContext*, or via *SparkSession* in versions *>= 2.x*.

#### What if I want to look at the first phone number?
```sql
hive> select person_details.name, person_details.age, phone_numbers[0]
    > from eel_test.struct_person_phone;
OK
Fred    50      322
Gary    50      145
Alice   50      534
Time taken: 0.08 seconds, Fetched: 3 row(s)
hive>
```
- To retrieve a specific array element, **HiveQL** requires the array index, which is zero-based, e.g. 
**phone_numbers[0]**

#### Query to show *name*, *age* and *phone_number*, with a repeated row for each element of the phone_numbers array
```sql
hive> select person_details.name, person_details.age, phone_number
    > from eel_test.struct_person_phone
    > lateral view explode(phone_numbers) pns as phone_number;
OK
Fred    50      322
Fred    50      987
Gary    50      145
Gary    50      082
Alice   50      534
Alice   50      129
Time taken: 0.062 seconds, Fetched: 6 row(s)
hive>
```
- The **lateral view** clause is used in conjunction with the **explode** UDTF (user-defined table function) to generate one row per array element.


## Parquet Source
The Parquet source reads from one or more Parquet files. To use the source, create an instance of `ParquetSource`, specifying a file pattern or `Path` object. The Parquet source implementation is optimized to read native Parquet directly into eel row objects, without creating intermediate formats such as Avro.

Example reading from a single file: `ParquetSource(new Path("hdfs:///myfile"))`
Example reading from a wildcard pattern: `ParquetSource("hdfs:///user/warehouse/*")`

#### Predicates

Parquet as a file format supports predicates, which are row-level filter operations. Because Parquet is a columnar store, row-level filters can be extremely efficient. Whenever you are reading from Parquet files - either directly or through Hive - a row-level filter will nearly always be faster than reading the data and filtering afterwards. 
This is because Parquet is able to skip whole chunks of the file that do not match the predicate.

To use a predicate, simply add an instance of `Predicate` to the Parquet source:

```scala
val ds = ParquetSource(path).withPredicate(Predicate.equals("location", "westeros")).toDataStream()
```

Multiple predicates can be combined using `Predicate.or` and `Predicate.and`.

#### Projections

The Parquet source also allows you to specify a projection, which is a subset of the columns to return. Again, since Parquet is columnar, a column that is not needed can be skipped entirely in the file, making Parquet extremely fast at this kind of operation.

To use a projection, simply call `withProjection` on the Parquet source with the fields to keep.

```scala
val ds = ParquetSource(path).withProjection("amount", "type").toDataStream()
```


Hive Source
---
The [Hive](https://hive.apache.org/) source reads from a Hive table. To use this source, create an instance of `HiveSource`, specifying the database name, the table name, and any partitions to limit the read. The source also requires instances of the Hadoop [FileSystem](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html) object and a [HiveConf](https://hive.apache.org/javadocs/r0.13.1/api/common/org/apache/hadoop/hive/conf/HiveConf.html) object.

Reading all rows from a table is the simplest use case: `HiveSource("mydb", "mytable")`. We can also read rows from a table for a particular partition. For example, to read all rows which have the value '1975' for the partition column 'year': `HiveSource("mydb", "mytable").withPartition("year", "1975")`

The partition clause accepts an operator to perform more complicated querying, such as less than, greater than, etc. 
For example, to read all rows which have a *year* less than *1975* we can do: `HiveSource("mydb", "mytable").withPartition("year", "<", "1975")`.


Hive Sink
----
The [Hive](https://hive.apache.org/) sink writes data to Hive tables stored in any of the following formats: ORC (Optimized Row Columnar), Parquet, Avro, or delimited text.

To configure a Hive sink, you specify the Hive database, the table to write to, and the format to write in. The sink also requires instances of the Hadoop [FileSystem](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html) object and a [HiveConf](https://hive.apache.org/javadocs/r0.13.1/api/common/org/apache/hadoop/hive/conf/HiveConf.html) object.

**Properties**

|Parameter|Description|
|----------|------------------|
|IO Threads|The number of concurrent writes to the sink|
|Dynamic Partitioning|If set to true, any new values on partitioned fields will automatically be created as partitions in the metastore. If set to false, a new value will throw an error.|

**Example**

Simple example of writing to a Hive database: `ds.to(HiveSink("mydb", "mytable"))`

We can specify the number of concurrent writes by using the ioThreads parameter: `ds.to(HiveSink("mydb", "mytable").withIOThreads(4))`

Csv Source
----

If the schema you need is in the form of the CSV headers, then we can easily parse those to create the schema. But obviously CSV won't encode any type information. Therefore, we can specify an instance of a `SchemaInferrer`, which can be customized with rules to determine the correct schema type for each header. For example, you might say that "name" is a SchemaType.String, or that anything matching "*_id" is a SchemaType.Long. You can also specify nullability, scale, precision and signedness. 
A quick example:

```scala
val inferrer = SchemaInferrer(SchemaType.String, SchemaRule("qty", SchemaType.Int, false), SchemaRule(".*_id", SchemaType.Int))
CsvSource("myfile").withSchemaInferrer(inferrer)
```

### How to use

Eel is released to Maven Central, so it is easy to include in your project. Just find the latest version on [maven central](http://search.maven.org/#search|ga|1|io.eels) and add the dependency to your build.
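
Putting the pieces above together, here is a minimal, untested sketch of a CSV-to-Parquet copy followed by a read using a predicate and a projection. It reuses only the calls shown earlier in this document (`CsvSource`, `ParquetSink`, `ParquetSource`, `withPredicate`, `withProjection`); the `io.eels` import paths and the HDFS paths are assumptions, so check them against the eel version you are using:

```scala
// Assumed import locations - verify against your eel version
import io.eels.component.csv.CsvSource
import io.eels.component.parquet.{ParquetSink, ParquetSource}
import io.eels.Predicate
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop objects required implicitly by the HDFS-backed sources and sinks
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)

// Hypothetical paths, for illustration only
val csvFilePath = new Path("hdfs://nameservice1/client/eel/person.csv")
val parquetFilePath = new Path("hdfs://nameservice1/client/eel/person.parquet")

// CSV -> Parquet: without a SchemaInferrer the CsvSource treats every field as a string
CsvSource(csvFilePath)
  .toDataStream()
  .to(ParquetSink(parquetFilePath))

// Read back only NAME, with a row-level predicate pushed down to Parquet
ParquetSource(parquetFilePath)
  .withPredicate(Predicate.equals("NAME", "Alice"))
  .withProjection("NAME")
  .toDataStream()
  .collect()
  .foreach(row => println(row))
```

Because the predicate and projection are applied inside the Parquet reader, only the matching row groups and the single projected column are read from HDFS.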