https://github.com/prateek/ssh-spool-source

Prototype SshSpoolSource for Flume - think Spooling Directory Source over SSH

Host: GitHub
URL: https://github.com/prateek/ssh-spool-source
Owner: prateek
License: apache-2.0
Created: 2013-10-28T01:58:38.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2015-09-02T06:38:03.000Z (almost 11 years ago)
Last Synced: 2025-04-05T02:21:56.077Z (about 1 year ago)
Language: Java
Size: 129 KB
Stars: 3
Watchers: 3
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

ssh-spool-source
================

Prototype SshSpoolSource for Flume - think Spooling Directory Source over SSH. Caveat Emptor: It is very much **pre-alpha**.

Semantics
---------
The SshSpoolSource mirrors many semantics from the SpoolingDirectorySource. Here's what it supports at the moment:
- SSH authentication using username/password (SSH keys are not supported yet)
- A remote directory can be specified and monitored for newly added files
- Any file that appears in that directory is assumed to be complete and is ingested
- Once a file has been processed it is treated as final; later changes to it are not picked up
- The source persists the state of processed files to disk, so it does not reprocess files and can pick up where it left off after a restart
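
The last two points can be illustrated with a small local sketch. It is not the source's actual Java implementation, and it scans a local directory rather than an SSH remote. The function name `process_new_files` and the state-file layout (one processed file name per line) are assumptions made purely for illustration.

```shell
#!/bin/sh
# Sketch of the "process each file once, remember it on disk" semantics.
# Hypothetical helper; not part of ssh-spool-source.

# process_new_files SPOOL_DIR STATE_FILE
# Prints a line for every regular file in SPOOL_DIR that is not yet listed
# in STATE_FILE, then records its name so later runs (or restarts) skip it.
process_new_files() {
    spool_dir=$1
    state_file=$2
    touch "$state_file"                    # make sure the state file exists
    for path in "$spool_dir"/*; do
        [ -f "$path" ] || continue         # skip subdirectories and an unmatched glob
        name=$(basename "$path")
        # -x: match the whole line, -F: fixed string, -q: no output
        if grep -qxF -- "$name" "$state_file"; then
            continue                       # already ingested; later changes are ignored
        fi
        echo "ingesting $name"             # a real source would emit Flume events here
        echo "$name" >> "$state_file"      # persist progress immediately
    done
}
```

Calling `process_new_files /some/spool /some/state` twice in a row reports each file only on the first call, because the second call finds every name already recorded in the state file.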

Configuring Flume
------------------

1. **Build or Download the custom Flume Source**

The `flume-sources` directory contains a Maven project with a custom Flume source designed to connect to the specified SSH remote path and ingest the contents of the files there into HDFS.

To build the flume-sources JAR, from the root of the git repository:

    $ cd flume-sources
    $ mvn package
    $ cd ..

This generates a file called `flume-sources-1.0-SNAPSHOT.jar` in the `flume-sources/target` directory.

2. **Add the JAR to the Flume classpath**

    $ sudo cp /etc/flume-ng/conf/flume-env.sh.template /etc/flume-ng/conf/flume-env.sh

Edit `flume-env.sh`, uncomment the `FLUME_CLASSPATH` line, and set it to the path of the JAR. To add multiple paths, separate them with colons.
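
For illustration, the edited part of `flume-env.sh` might look like the following. The `/opt/ssh-spool-source` checkout location and the second JAR are assumptions, not paths defined by this project; substitute the locations on your own system.

```shell
# In /etc/flume-ng/conf/flume-env.sh
# Example paths only; point them at your own build output.
SSH_SPOOL_JAR="/opt/ssh-spool-source/flume-sources/target/flume-sources-1.0-SNAPSHOT.jar"
EXTRA_JAR="/opt/extra/lib/other-plugin.jar"   # hypothetical second entry

# Multiple entries are separated with a colon.
FLUME_CLASSPATH="${SSH_SPOOL_JAR}:${EXTRA_JAR}"
```

If only the SshSpoolSource JAR is needed, setting `FLUME_CLASSPATH` directly to that single path is enough.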

3. **Set the Flume agent name to SshAgent in /etc/default/flume-ng-agent**

If the `/etc/default/flume-ng-agent` file does not exist, the `flume-ng-agent` package is probably not installed. The file should contain the following line:

    FLUME_AGENT_NAME=SshAgent

4. **Modify the provided Flume configuration and copy it to /etc/flume-ng/conf**

The `flume-sources` directory contains a file called `flume.conf`, which needs some minor editing: five fields must be filled in with values. After editing it, copy it into place from the `flume-sources` directory:

    $ sudo cp flume.conf /etc/flume-ng/conf

Starting the data pipeline
------------------------

1. **Start the Flume agent**

Create the HDFS directory hierarchy for the Flume sink, then start the agent:

    $ hadoop fs -mkdir /user/flume/ssh
    $ hadoop fs -chown -R flume:flume /user/flume/ssh
    $ hadoop fs -chmod -R 770 /user/flume
    $ sudo /etc/init.d/flume-ng-agent start
