https://github.com/prateek/ssh-spool-source
Prototype SshSpoolSource for Flume - think Spooling Directory Source over SSH
- Host: GitHub
- URL: https://github.com/prateek/ssh-spool-source
- Owner: prateek
- License: apache-2.0
- Created: 2013-10-28T01:58:38.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2015-09-02T06:38:03.000Z (almost 11 years ago)
- Last Synced: 2025-04-05T02:21:56.077Z (about 1 year ago)
- Language: Java
- Size: 129 KB
- Stars: 3
- Watchers: 3
- Forks: 4
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
=======
ssh-spool-source
================
Prototype SshSpoolSource for Flume - think Spooling Directory Source over SSH. Caveat Emptor: It is very much **pre-alpha**.
Semantics
---------
The SshSpoolSource mirrors many semantics from the SpoolingDirectorySource. Here's what it supports at the moment:
- SSH authentication using a username and password (SSH keys are not supported yet)
- A remote directory can be specified to monitor for newly added files
- Any new file added there is considered complete and is ingested
- Once a file has been processed, it is considered complete; later changes to it will not be picked up
- The source persists the state of processed files to disk, so it will not reprocess any files and can pick up where it left off after a restart
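
The last three points can be illustrated with a minimal sketch. This is not the code from this repository: the class name `ProcessedFileTracker`, its methods, and the one-name-per-line state file format are all hypothetical and chosen only for illustration. The remote directory listing is passed in as a plain list of file names, so the SSH side is left out entirely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical helper (not part of ssh-spool-source): remembers which
 * file names were already processed and survives restarts by appending
 * each name to a local state file, one name per line (assumed format).
 */
final class ProcessedFileTracker {
    private final Path stateFile;
    private final Set<String> processed = new LinkedHashSet<>();

    /** Loads previously recorded names, if the state file already exists. */
    ProcessedFileTracker(Path stateFile) throws IOException {
        this.stateFile = stateFile;
        if (Files.exists(stateFile)) {
            processed.addAll(Files.readAllLines(stateFile, StandardCharsets.UTF_8));
        }
    }

    /** Returns the names in the listing that have not been processed yet, in listing order. */
    List<String> selectNew(List<String> remoteListing) {
        List<String> fresh = new ArrayList<>();
        for (String name : remoteListing) {
            if (!processed.contains(name)) {
                fresh.add(name);
            }
        }
        return fresh;
    }

    /** Records a name as processed and appends it to the state file. */
    void markProcessed(String name) throws IOException {
        if (processed.add(name)) {
            Files.write(stateFile, Collections.singletonList(name), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

In this sketch, a caller would list the remote directory, pass the names to `selectNew`, ingest each returned file, and call `markProcessed` once the file has been fully handed off. Because only names are tracked, a file that changes after being processed is never picked up again, which matches the semantics listed above.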
Configuring Flume
------------------
1. **Build or Download the custom Flume Source**
The `flume-sources` directory contains a Maven project with a custom Flume source designed to connect to the specified SSH remote path and ingest the contents of the files there into HDFS.
To build the flume-sources JAR, run the following from the root of the git repository:

    $ cd flume-sources
    $ mvn package
    $ cd ..

This will generate a file called `flume-sources-1.0-SNAPSHOT.jar` in the `flume-sources/target` directory.
2. **Add the JAR to the Flume classpath**
Create `flume-env.sh` from the template:

    $ sudo cp /etc/flume-ng/conf/flume-env.sh.template /etc/flume-ng/conf/flume-env.sh

Then edit `flume-env.sh`, uncomment the `FLUME_CLASSPATH` line, and set it to the path of the JAR. To add multiple paths, separate them with colons.
3. **Set the Flume agent name to SshAgent in /etc/default/flume-ng-agent**
If you don't see the `/etc/default/flume-ng-agent` file, you most likely haven't installed the `flume-ng-agent` package. The file should contain the following line:

    FLUME_AGENT_NAME=SshAgent
4. **Modify the provided Flume configuration and copy it to /etc/flume-ng/conf**
The `flume.conf` file in the `flume-sources` directory needs some minor editing: five fields must be filled in with values. Once edited, copy it into place:

    $ sudo cp flume.conf /etc/flume-ng/conf
Starting the data pipeline
------------------------
1. **Start the Flume agent**
Create the HDFS directory hierarchy for the Flume sink:

    $ hadoop fs -mkdir /user/flume/ssh
    $ hadoop fs -chown -R flume:flume /user/flume/ssh
    $ hadoop fs -chmod -R 770 /user/flume

Then start the agent:

    $ sudo /etc/init.d/flume-ng-agent start