{"id":28501938,"url":"https://github.com/fluent/fluent-plugin-webhdfs","last_synced_at":"2025-07-05T02:31:25.267Z","repository":{"id":3932980,"uuid":"5023560","full_name":"fluent/fluent-plugin-webhdfs","owner":"fluent","description":"Hadoop WebHDFS output plugin for Fluentd","archived":false,"fork":false,"pushed_at":"2025-02-12T04:26:00.000Z","size":194,"stargazers_count":60,"open_issues_count":12,"forks_count":37,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-06-08T16:08:25.118Z","etag":null,"topics":["fluentd","fluentd-plugin","hadoop","hdfs"],"latest_commit_sha":null,"homepage":"http://docs.fluentd.org/articles/out_webhdfs","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fluent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2012-07-13T18:47:31.000Z","updated_at":"2025-03-09T00:25:25.000Z","dependencies_parsed_at":"2024-03-19T09:49:32.194Z","dependency_job_id":"58ef43bf-8e1b-46a6-af37-6cd965bd280d","html_url":"https://github.com/fluent/fluent-plugin-webhdfs","commit_stats":{"total_commits":194,"total_committers":24,"mean_commits":8.083333333333334,"dds":0.7164948453608248,"last_synced_commit":"c9a68b5860890633a46978a9f3057afe69fc95d0"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"purl":"pkg:github/fluent/fluent-plugin-webhdfs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fluent%2Ffluent-plugin-webhdfs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fluent%2Ffluent-plugin-webhdfs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fluent%2Ffluent-plugin-webhdfs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fluent%2Ffluent-plugin-webhdfs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fluent","download_url":"https://codeload.github.com/fluent/fluent-plugin-webhdfs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fluent%2Ffluent-plugin-webhdfs/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263671743,"owners_count":23494025,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fluentd","fluentd-plugin","hadoop","hdfs"],"created_at":"2025-06-08T16:08:30.343Z","updated_at":"2025-07-05T02:31:25.261Z","avatar_url":"https://github.com/fluent.png","language":"Ruby","readme":"# fluent-plugin-webhdfs\n\n[![Build Status](https://travis-ci.org/fluent/fluent-plugin-webhdfs.svg?branch=master)](https://travis-ci.org/fluent/fluent-plugin-webhdfs)\n\n[Fluentd](http://fluentd.org/) output plugin to write data into Hadoop HDFS over 
## Requirements

| fluent-plugin-webhdfs | fluentd    | ruby   |
|-----------------------|------------|--------|
| >= 1.0.0              | >= v0.14.4 | >= 2.1 |
| <  1.0.0              | <  v0.14.0 | >= 1.9 |

### Older versions

The 0.x.x versions of this plugin are for older versions of Fluentd (v0.12.x). Old-style configuration parameters (`output_data_type`, `output_include_*` and others) are still supported, but deprecated.
Users should use the `<format>` section to control how events are formatted into plain text.
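As a rough sketch of the migration (the deprecated parameter names are the ones listed above; the exact values and defaults in your old configuration may differ):

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      # instead of the deprecated parameters
      #   output_data_type    json
      #   output_include_time true
      #   output_include_tag  true
      # use the <inject> and <format> sections:
      <inject>
        tag_key   tag
        time_key  time
        time_type string
      </inject>
      <format>
        @type json
      </format>
    </match>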
## Configuration

### WebHDFSOutput

To store data by time, tag, and JSON (the same as '@type file') over WebHDFS:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
    </match>

If you want the JSON object only (without time and/or tag at the head of each line), use the `<format>` section to specify the `json` formatter:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      <format>
        @type json
      </format>
    </match>

To specify the namenode, the `namenode` parameter is also available:

    <match access.**>
      @type     webhdfs
      namenode master.your.cluster.local:50070
      path     /path/on/hdfs/access.log.%Y%m%d_%H.log
    </match>

To store data as JSON, including time and tag (using `<inject>`), over WebHDFS:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      <buffer>
        timekey_zone -0700 # to specify timezone used for "path" time placeholder formatting
      </buffer>
      <inject>
        tag_key   tag
        time_key  time
        time_type string
        timezone  -0700
      </inject>
      <format>
        @type json
      </format>
    </match>

To store data as JSON, including time as unix time, using a path that includes the tag as a directory:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/${tag}/access.log.%Y%m%d_%H.log
      <buffer time,tag>
        @type   file                    # using file buffer
        path    /var/log/fluentd/buffer # buffer directory path
        timekey 3h           # create a file per 3h
        timekey_use_utc true # time in path are formatted in UTC (default false means localtime)
      </buffer>
      <inject>
        time_key  time
        time_type unixtime
      </inject>
      <format>
        @type json
      </format>
    </match>

With a username for pseudo authentication:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      username hdfsuser
    </match>

To store data over HttpFs (instead of WebHDFS):

    <match access.**>
      @type webhdfs
      host httpfs.node.your.cluster.local
      port 14000
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      httpfs true
    </match>

With SSL:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      ssl true
      ssl_ca_file /path/to/ca_file.pem   # if needed
      ssl_verify_mode peer               # if needed (peer or none)
    </match>

Here `ssl_verify_mode peer` means that the server's certificate will be verified.
You can turn this off by setting `ssl_verify_mode none`. The default is `peer`.
See the [net/http](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/net/http/rdoc/Net/HTTP.html)
and [openssl](http://www.ruby-doc.org/stdlib-2.1.3/libdoc/openssl/rdoc/OpenSSL.html) documentation for further details.

With Kerberos authentication:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H.log
      kerberos true
      kerberos_keytab /path/to/keytab # if needed
      renew_kerberos_delegation_token true # if needed
    </match>

NOTE: You need to install the `gssapi` gem for Kerberos. See https://github.com/kzk/webhdfs#for-kerberos-authentication

If you want to compress data before storing it:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H
      compress gzip  # or 'bzip2', 'snappy', 'hadoop_snappy', 'lzo_command', 'zstd'
    </match>

Note that if you set `compress gzip`, the suffix `.gz` will be added to the path (or `.bz2`, `.sz`, `.snappy`, `.lzo`, `.zst` respectively).
Note that you have to install an additional gem for several compression algorithms:

- snappy: install the snappy gem
- hadoop_snappy: install the snappy gem
- bzip2: install the bzip2-ffi gem
- zstd: install the zstandard gem

Note that zstd requires the libzstd native library.
See the [zstandard-ruby](https://github.com/msievers/zstandard-ruby#examples-for-installing-libzstd) repo for information on the required packages for your operating system.

You can also specify the compression block size (currently supported only for the Snappy codecs):

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H
      compress hadoop_snappy
      block_size 32768
    </match>

If you want to explicitly specify file extensions in HDFS (overriding the default compressor extensions):

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /path/on/hdfs/access.log.%Y%m%d_%H
      compress snappy
      extension ".snappy"
    </match>

With this configuration, paths in HDFS will look like `/path/on/hdfs/access.log.20201003_12.snappy`.
This may be useful when, for example, you need to use the snappy codec but `.sz` files are not recognized as snappy files in HDFS.

### Namenode HA / Auto retry for WebHDFS known errors

`fluent-plugin-webhdfs` (v0.2.0 or later) accepts 2 namenodes for Namenode HA (active/standby). Use `standby_namenode` like this:

    <match access.**>
      @type            webhdfs
      namenode         master1.your.cluster.local:50070
      standby_namenode master2.your.cluster.local:50070
      path             /path/on/hdfs/access.log.%Y%m%d_%H.log
    </match>

You can also enable automatic retries for known HDFS errors (such as `LeaseExpiredException`). With this configuration, fluentd doesn't log these errors if the retry succeeds.

    <match access.**>
      @type              webhdfs
      namenode           master1.your.cluster.local:50070
      path               /path/on/hdfs/access.log.%Y%m%d_%H.log
      retry_known_errors yes
      retry_times        1 # default 1
      retry_interval     1 # [sec] default 1
    </match>

### Performance notes

Writing to a single file on HDFS from two or more fluentd nodes creates many bad HDFS blocks. If you want to run two or more fluentd nodes with fluent-plugin-webhdfs, you should configure a distinct 'path' for each node.
To include the hostname, `#{Socket.gethostname}` is available as a Ruby expression in Fluentd configuration string literals (double-quoted `"..."` strings). This plugin also supports the `${uuid}` placeholder to include a random UUID in paths.

For the hostname:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path "/log/access/%Y%m%d/#{Socket.gethostname}.log" # double quotes needed to expand ruby expression in string
    </match>

Or with a random filename (only to avoid duplicate file names):

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /log/access/%Y%m%d/${uuid}.log
    </match>

With the configurations above, you can treat all files under `/log/access/20120820/*` as access logs for the specified timeslice.

For heavily loaded cluster nodes, you can specify timeouts for HTTP requests:

    <match access.**>
      @type webhdfs
      namenode master.your.cluster.local:50070
      path /log/access/%Y%m%d/${hostname}.log
      open_timeout 180 # [sec] default: 30
      read_timeout 180 # [sec] default: 60
    </match>

### For unstable Namenodes

With the default configuration, fluent-plugin-webhdfs checks the HDFS filesystem status and raises an error for inactive NameNodes.

If your NameNodes are unstable and you want to ignore NameNode errors on fluentd startup, enable the `ignore_start_check_error` option like below:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070
      path /log/access/%Y%m%d/${hostname}.log
      ignore_start_check_error true
    </match>

### For unstable Datanodes

With unstable datanodes that frequently go down, appending over WebHDFS may produce broken files. In such cases, specify `append no` and use the `${chunk_id}` placeholder:

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070

      append no
      path   "/log/access/%Y%m%d/#{Socket.gethostname}.${chunk_id}.log"
    </match>

`out_webhdfs` then creates a new file on HDFS per fluentd flush, named with the chunk id, so you don't need to worry about files broken by append operations.
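Because every flush creates a new file in this mode, flushing too often can produce many small files on HDFS. A minimal sketch of bounding file creation with standard Fluentd `<buffer>` parameters (the values here are illustrative, not recommendations):

    <match access.**>
      @type webhdfs
      host namenode.your.cluster.local
      port 50070

      append no
      path   "/log/access/%Y%m%d/#{Socket.gethostname}.${chunk_id}.log"
      <buffer time>
        timekey          3600  # each chunk (and thus each file) covers up to one hour
        chunk_limit_size 256m  # ...or is flushed earlier once it reaches this size
      </buffer>
    </match>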
## TODO

* patches welcome!

## Copyright

* Copyright (c) 2012- TAGOMORI Satoshi (tagomoris)
* License
  * Apache License, Version 2.0