
# fluent-plugin-webhdfs

[![Build Status](https://travis-ci.org/fluent/fluent-plugin-webhdfs.svg?branch=master)](https://travis-ci.org/fluent/fluent-plugin-webhdfs)

[Fluentd](http://fluentd.org/) output plugin to write data into Hadoop HDFS over WebHDFS/HttpFs.

"webhdfs" output plugin formats data into plain text, and store it as files on HDFS. This plugin supports:

* inject tag and time into record (and output plain text data) using `` section
* format events into plain text by format plugins using `` section
* control flushing using `` section

Paths on HDFS can be generated from the event timestamp, the tag, or any other field in the record, as the sketch below shows.
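
For example, here is a minimal sketch combining all three sections. The `access.**` match pattern and the `service` record field used as a path placeholder are assumptions for illustration:

```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  # %Y%m%d comes from the event time; ${service} comes from a record field (hypothetical)
  path /path/on/hdfs/${service}/access.log.%Y%m%d.log

  <inject>
    tag_key tag    # add the tag into each record
    time_key time  # add the event time into each record
    time_type string
  </inject>
  <format>
    @type json     # format each event as one JSON line
  </format>
  <buffer time,service>
    timekey 1d     # chunk by day, matching the %Y%m%d placeholder
  </buffer>
</match>
```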

## Requirements

| fluent-plugin-webhdfs | fluentd | ruby |
|-----------------------|------------|--------|
| >= 1.0.0 | >= v0.14.4 | >= 2.1 |
| < 1.0.0 | < v0.14.0 | >= 1.9 |

### Older versions

The `0.x.x` versions of this plugin are for older versions of Fluentd (v0.12.x). Old-style configuration parameters (`output_data_type`, `output_include_*` and others) are still supported, but deprecated.
Users should use the `<format>` section to control how events are formatted into plain text, as in the migration sketch below.
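
As a rough migration sketch (the `output_include_*` names below are illustrative instances of the deprecated parameters mentioned above; the exact mapping for your configuration may differ):

```
# Deprecated (0.x.x style):
#   output_data_type json
#   output_include_time true
#   output_include_tag true

# Preferred (1.x.x style):
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  <inject>
    tag_key tag
    time_key time
    time_type string
  </inject>
  <format>
    @type json
  </format>
</match>
```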

## Configuration

### WebHDFSOutput

To store data by time, tag, and JSON (same as `@type file`) over WebHDFS:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

If you want only the JSON object (without time or tag at the head of each line), use the `<format>` section to specify the `json` formatter:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  <format>
    @type json
  </format>
</match>
```

To specify the namenode host and port in a single parameter, `namenode` is also available:


```
<match access.**>
  @type webhdfs
  namenode master.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

To store data as JSON, including time and tag (using the `<inject>` section), over WebHDFS:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log

  timekey_zone -0700 # timezone used for "path" time placeholder formatting

  <inject>
    tag_key tag
    time_key time
    time_type string
    timezone -0700
  </inject>
  <format>
    @type json
  </format>
</match>
```

To store data as JSON, with time as unix time, using a path that includes the tag as a directory:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/${tag}/access.log.%Y%m%d_%H.log

  <buffer>
    @type file                   # using file buffer
    path /var/log/fluentd/buffer # buffer directory path
    timekey 3h                   # create a file per 3 hours
    timekey_use_utc true         # format time in path as UTC (default false means localtime)
  </buffer>
  <inject>
    time_key time
    time_type unixtime
  </inject>
  <format>
    @type json
  </format>
</match>
```

With a username for pseudo authentication:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  username hdfsuser
</match>
```


To store data over HttpFs (instead of WebHDFS):


```
<match access.**>
  @type webhdfs
  host httpfs.node.your.cluster.local
  port 14000
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  httpfs true
</match>
```

With SSL:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  ssl true
  ssl_ca_file /path/to/ca_file.pem # if needed
  ssl_verify_mode peer             # if needed (peer or none)
</match>
```

Here `ssl_verify_mode peer` means to verify the server's certificate.
You can turn it off by setting `ssl_verify_mode none`. The default is `peer`.
See [net/http](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/net/http/rdoc/Net/HTTP.html)
and [openssl](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/openssl/rdoc/OpenSSL.html) documentation for further details.

With Kerberos authentication:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  kerberos true
  kerberos_keytab /path/to/keytab      # if needed
  renew_kerberos_delegation_token true # if needed
</match>
```

NOTE: You need to install the `gssapi` gem for Kerberos. See https://github.com/kzk/webhdfs#for-kerberos-authentication

If you want to compress data before storing it:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H
  compress gzip # or 'bzip2', 'snappy', 'hadoop_snappy', 'lzo_command', 'zstd'
</match>
```

Note that if you set `compress gzip`, the suffix `.gz` will be added to the path (or `.bz2`, `.sz`, `.snappy`, `.lzo`, `.zst` respectively).
Note that you have to install an additional gem for several compression algorithms:

- snappy: install the snappy gem
- hadoop_snappy: install the snappy gem
- bzip2: install the bzip2-ffi gem
- zstd: install the zstandard gem

Note that zstd requires installation of the libzstd native library. See the [zstandard-ruby](https://github.com/msievers/zstandard-ruby#examples-for-installing-libzstd) repo for information on the required packages for your operating system.

You can also specify the compression block size (currently supported only for Snappy codecs):


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H
  compress hadoop_snappy
  block_size 32768
</match>
```

If you want to explicitly specify file extensions in HDFS (overriding the default compressor extensions):


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /path/on/hdfs/access.log.%Y%m%d_%H
  compress snappy
  extension ".snappy"
</match>
```

With this configuration, paths in HDFS will look like `/path/on/hdfs/access.log.20201003_12.snappy`.
This may be useful when, for example, you need to use the snappy codec but `.sz` files are not recognized as snappy files in HDFS.

### Namenode HA / Auto retry for WebHDFS known errors

`fluent-plugin-webhdfs` (v0.2.0 or later) accepts two namenodes for Namenode HA (active/standby). Use `standby_namenode` like this:


```
<match access.**>
  @type webhdfs
  namenode master1.your.cluster.local:50070
  standby_namenode master2.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
</match>
```

You can also make the plugin retry known HDFS errors (such as `LeaseExpiredException`) automatically. With this configuration, Fluentd doesn't log these errors if the retry succeeds.


```
<match access.**>
  @type webhdfs
  namenode master1.your.cluster.local:50070
  path /path/on/hdfs/access.log.%Y%m%d_%H.log
  retry_known_errors yes
  retry_times 1    # default 1
  retry_interval 1 # [sec] default 1
</match>
```

### Performance notes

Writing to a single file on HDFS from two or more Fluentd nodes creates many bad blocks on HDFS. If you want to run two or more Fluentd nodes with fluent-plugin-webhdfs, you should configure a distinct 'path' for each node.
To include the hostname, `#{Socket.gethostname}` is available in Fluentd configuration string literals as a Ruby expression (in `"..."` strings). This plugin also supports the `${uuid}` placeholder to include a random UUID in paths.

For hostname:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path "/log/access/%Y%m%d/#{Socket.gethostname}.log" # double quotes are needed to expand the Ruby expression
</match>
```

Or with a random filename (to avoid duplicate file names only):


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/${uuid}.log
</match>
```

With the configurations above, you can handle all files under `/log/access/20120820/*` as access logs for the specified time slice.

For heavily loaded cluster nodes, you can specify timeouts for HTTP requests:


```
<match access.**>
  @type webhdfs
  namenode master.your.cluster.local:50070
  path /log/access/%Y%m%d/${hostname}.log
  open_timeout 180 # [sec] default: 30
  read_timeout 180 # [sec] default: 60
</match>
```

### For unstable Namenodes

With the default configuration, fluent-plugin-webhdfs checks the HDFS filesystem status and raises an error for inactive NameNodes.

If you are using unstable NameNodes and want to ignore NameNode errors on startup of Fluentd, enable the `ignore_start_check_error` option like below:


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070
  path /log/access/%Y%m%d/${hostname}.log
  ignore_start_check_error true
</match>
```

### For unstable Datanodes

With unstable datanodes that frequently go down, appending over WebHDFS may produce broken files. In such cases, specify `append no` and use the `${chunk_id}` placeholder.


```
<match access.**>
  @type webhdfs
  host namenode.your.cluster.local
  port 50070

  append no
  path "/log/access/%Y%m%d/#{Socket.gethostname}.${chunk_id}.log"
</match>
```

With `append no`, `out_webhdfs` creates a new file on HDFS per Fluentd flush, named with the chunk id, so you don't need to worry about files broken by append operations.

## TODO

* patches welcome!

## Copyright

* Copyright (c) 2012- TAGOMORI Satoshi (tagomoris)
* License
  * Apache License, Version 2.0