{"id":25000618,"url":"https://github.com/gluster/glusterfs-hadoop","last_synced_at":"2025-04-12T08:52:29.618Z","repository":{"id":1050599,"uuid":"2232802","full_name":"gluster/glusterfs-hadoop","owner":"gluster","description":"GlusterFS plugin for Hadoop HCFS","archived":false,"fork":false,"pushed_at":"2022-04-12T21:53:57.000Z","size":410,"stargazers_count":69,"open_issues_count":35,"forks_count":38,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-12T08:52:22.804Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gluster.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-08-19T08:12:26.000Z","updated_at":"2024-03-11T09:12:14.000Z","dependencies_parsed_at":"2022-08-06T10:15:08.571Z","dependency_job_id":null,"html_url":"https://github.com/gluster/glusterfs-hadoop","commit_stats":null,"previous_names":[],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gluster%2Fglusterfs-hadoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gluster%2Fglusterfs-hadoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gluster%2Fglusterfs-hadoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gluster%2Fglusterfs-hadoop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gluster","download_url":"https://codeload.github.com/gluster/glusterfs-hadoop/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"gi
thub","repositories_count":248543883,"owners_count":21121838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-04T19:36:19.809Z","updated_at":"2025-04-12T08:52:29.598Z","avatar_url":"https://github.com/gluster.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"GlusterFS Hadoop Plugin\n=======================\n\nINTRODUCTION\n------------\n\nThis document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.\n\nThis plugin replaces the hadoop file system (typically, the Hadoop Distributed File System) with the \nGlusterFileSystem, which writes to a local directory which FUSE mounts a proxy to a gluster system.\n\nREQUIREMENTS\n------------\n\n* Supported OS is GNU/Linux\n* GlusterFS installed on all machines in the cluster\n* Java Runtime Environment (JRE)\n* Maven 3x (needed if you are building the plugin from source)\n* JDK 6+ (needed if you are building the plugin from source)\n\nNOTE: Plugin relies on two *nix command line utilities to function properly. 
They are:\n\n* mount: Used to mount GlusterFS volumes.\n* getfattr: Used to fetch Extended Attributes of a file\n\nMake sure they are installed on all hosts in the cluster and their locations are in the $PATH\nenvironment variable.\n\n\nINSTALLATION\n------------\n\n** NOTE: Example below is for Hadoop version 0.20.2 ($GLUSTER_HOME/hdfs/0.20.2) **\n\n* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required to build the plugin]\n\n  Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin.\n\n  # cd $GLUSTER_HOME/hdfs/0.20.2\n  # mvn package\n\n  On a successful build the plugin will be present in the `target` directory.\n  (NOTE: the version number will be a part of the plugin file name)\n\n  # ls target/\n  classes  glusterfs-0.20.2-0.1.jar  maven-archiver  surefire-reports  test-classes\n\n  Copy the plugin to the lib/ directory in your $HADOOP_HOME dir.\n\n  # cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib\n\n  Copy the sample configuration file that ships with this source (conf/core-site.xml) to the conf\n  directory in your $HADOOP_HOME dir.\n\n  # cp conf/core-site.xml $HADOOP_HOME/conf\n\n* Installing the plugin from RPM\n\n  See the plugin documentation for installing from RPM.\n\n\nCLUSTER INSTALLATION\n--------------------\n\n  If it is tedious to do the above step(s) on all hosts in the cluster, use the build-and-deploy.py script to\n  build the plugin in one place and deploy it (along with the configuration file) on all other hosts.\n\n  This should be run on the host that is the hadoop master [Job Tracker].\n\n* STEPS (You would have done Step 1 and 2 anyway while deploying Hadoop)\n\n  1. Edit the conf/slaves file in your hadoop distribution; one line for each slave.\n  2. Set up password-less ssh between the hadoop master and slave(s).\n  3. Edit conf/core-site.xml with all glusterfs related configurations (see CONFIGURATION)\n  4. 
Run the following\n     # cd $GLUSTER_HOME/hdfs/0.20.2/tools\n     # python ./build-and-deploy.py -b -d /path/to/hadoop/home -c\n\n     This will build the plugin and copy it (and the config file) to all slaves (mentioned in $HADOOP_HOME/conf/slaves).\n\n   Script options:\n     -b : build the plugin\n     -d : location of hadoop directory\n     -c : deploy core-site.xml\n     -m : deploy mapred-site.xml\n     -h : deploy hadoop-env.sh\n\n\nCONFIGURATION\n-------------\n\n  All plugin configuration is done in a single XML file (core-site.xml) with \u003cname\u003e\u003cvalue\u003e tags in each \u003cproperty\u003e\n  block.\n\n  A brief explanation of the tunables and the values they accept (change them wherever needed) is given below.\n\n  name:  fs.glusterfs.impl\n  value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem\n\n         The default FileSystem API to use (there is little reason to modify this).\n\n  name:  fs.default.name\n  value: glusterfs:///\n\n         The default name that hadoop uses to represent a file as a URI (typically a server:port tuple). Use any host\n         in the cluster as the server and any port number. This option has to be in server:port format for hadoop\n         to create a file URI, but it is not used by the plugin.\n\n  name:  fs.glusterfs.volname\n  value: volume-dist-rep\n\n         The volume to mount.\n\n\n  name:  fs.glusterfs.mount\n  value: /mnt/glusterfs\n\n         This is the directory where the gluster volume is mounted.\n\n  name:  fs.glusterfs.server\n  value: localhost\n\n         To mount a volume the plugin needs to know the hostname or the IP of a GlusterFS server in the cluster.\n         Mention it here.\n\nUSAGE\n-----\n\n  Once configured, start the Hadoop Map/Reduce daemons\n\n  # cd $HADOOP_HOME\n  # ./bin/start-mapred.sh\n\n  If the map/reduce job/task trackers are up, all I/O will be done to GlusterFS.\n\n\nFOR HACKERS\n-----------\n\n* Source Layout (./src/)\n\nFor the overall architecture, see the Architecture page linked below.  
Currently, we use the hadoop RawLocalFileSystem as \nthe basis - and wrap it with the GlusterVolume class.  That class is then used by the \nHadoop 1x (GlusterFileSystem) and Hadoop 2x (GlusterFs) adapters.\n\n https://forge.gluster.org/hadoop/pages/Architecture\n\n./tools/build-deploy-jar.py                                                  \u003c--- Build and Deployment Script\n./conf/core-site.xml                                                         \u003c--- Sample configuration file\n./pom.xml                                                                    \u003c--- build XML file (used by maven)\n\n./COPYING                                                                    \u003c--- License\n./README                                                                     \u003c--- This file\n\n\n\nJENKINS\n-------\n\n  #Method 1) Modify JENKINS_USER in /etc/sysconfig/jenkins\n  JENKINS_USER=root\n\n  #Method 2) Directly modify /etc/init.d/jenkins \n  #daemon --user \"$JENKINS_USER\" --pidfile \"$JENKINS_PID_FILE\" $JAVA_CMD $PARAMS \u003e /dev/null\n  echo \"WARNING: RUNNING AS ROOT\" \n  daemon --user root --pidfile \"$JENKINS_PID_FILE\" $JAVA_CMD $PARAMS \u003e /dev/null\n\n\nBUILDING \n--------\n\nBuilding requires a working gluster mount for unit tests. 
\nThe unit tests read test resources from glusterconfig.properties - a file which should be present \n\n1) Edit your .bashrc, or run the following at your terminal: \n\nexport GLUSTER_MOUNT=/mnt/glusterfs\nexport HCFS_FILE_SYSTEM_CONNECTOR=org.apache.hadoop.fs.test.connector.glusterfs.GlusterFileSystemTestConnector \nexport HCFS_CLASSNAME=org.apache.hadoop.fs.glusterfs.GlusterFileSystem\n\n(In Eclipse, see below, you will add these in the \"Run Configurations\" menu,\nin VM arguments, prefixed with -D, for example, \"-DGLUSTER_MOUNT=x -DHCFS_FILE_SYSTEM_CONNECTOR=y ...\")\n\n2) Run: \n   mvn clean package \n   \n3) The jar artifact will be in target/\n\nDEVELOPING\n----------\n\n0) Create a mock gluster mount: \n \n #Create raw disk and format it...\n truncate -s 1G /export/debugging_fun.brick\n sudo mkfs.xfs  /export/debugging_fun.brick\n\n #Mount it as loopback fs (the mount point must exist first)\n mkdir -p /mnt/mybrick\n mount -o loop /export/debugging_fun.brick /mnt/mybrick ;\n\n #Now make a directory for the brick, and also a mount point for gluster itself\n mkdir /mnt/mybrick/glusterbrick\n mkdir /mnt/glusterfs\n MNT=\"/mnt/glusterfs\"\n BRICK=\"/mnt/mybrick/glusterbrick\"\n \n #Create a gluster volume that writes to the brick (replace the IP with your own host's address)\n sudo gluster volume create HadoopVol 10.10.61.230:$BRICK \n\n #Start the volume; a gluster volume must be started before it can be mounted\n sudo gluster volume start HadoopVol\n\n #Mount the volume that is backed by the newly created brick\n mount -t glusterfs $(hostname):HadoopVol $MNT\n\n1) Run \"mvn eclipse:eclipse\", and import the project into Eclipse.\n\n2) Add the exported env variables above via Run Configurations as described in the above section.\n\n3) Develop and run unit tests as you would any other Java app. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgluster%2Fglusterfs-hadoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgluster%2Fglusterfs-hadoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgluster%2Fglusterfs-hadoop/lists"}