{"id":15096952,"url":"https://github.com/apache/giraph","last_synced_at":"2025-10-08T02:30:51.732Z","repository":{"id":830189,"uuid":"2282376","full_name":"apache/giraph","owner":"apache","description":"Mirror of Apache Giraph","archived":true,"fork":false,"pushed_at":"2023-04-14T18:34:58.000Z","size":26962,"stargazers_count":615,"open_issues_count":32,"forks_count":300,"subscribers_count":65,"default_branch":"trunk","last_synced_at":"2024-10-02T05:41:27.230Z","etag":null,"topics":["big-data","giraph","java"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2011-08-28T07:00:33.000Z","updated_at":"2024-08-17T13:49:17.000Z","dependencies_parsed_at":"2023-01-11T15:47:56.879Z","dependency_job_id":"b6a8a1f8-7e68-4a39-bb79-dc47c0a4bf45","html_url":"https://github.com/apache/giraph","commit_stats":{"total_commits":1132,"total_committers":58,"mean_commits":"19.517241379310345","dds":0.7879858657243817,"last_synced_commit":"14a74297378dc1584efbb698054f0e8bff4f90bc"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fgiraph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fgiraph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fgiraph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fgiraph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/apache","download_url":"https://codeload.github.com/apache/giraph/tar.gz/refs/heads/trunk","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235674061,"owners_count":19027515,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","giraph","java"],"created_at":"2024-09-25T16:02:50.528Z","updated_at":"2025-10-08T02:30:45.903Z","avatar_url":"https://github.com/apache.png","language":"Java","readme":"Giraph : Large-scale graph processing on Hadoop\n\nWeb and online social graphs have been rapidly growing in size and\nscale during the past decade.  In 2008, Google estimated that the\nnumber of web pages reached over a trillion.  Online social networking\nand email sites, including Yahoo!, Google, Microsoft, Facebook,\nLinkedIn, and Twitter, have hundreds of millions of users and are\nexpected to grow much more in the future.  Processing these graphs\nplays a big role in delivering relevant and personalized information to\nusers, such as results from a search engine or news in an online social\nnetworking site.\n\nGraph processing platforms to run large-scale algorithms (such as page\nrank, shared connections, personalization-based popularity, etc.) have\nbecome quite popular.  Some recent examples include Pregel and HaLoop.\nFor general-purpose big data computation, the map-reduce computing\nmodel has been widely adopted and the most widely deployed map-reduce\ninfrastructure is Apache Hadoop.  We have implemented a\ngraph-processing framework that is launched as a typical Hadoop job to\nleverage existing Hadoop infrastructure, such as Amazon’s EC2.  
Giraph\nbuilds upon the graph-oriented nature of Pregel but additionally adds\nfault-tolerance to the coordinator process with the use of ZooKeeper\nas its centralized coordination service.\n\nGiraph follows the bulk-synchronous parallel model applied to graphs,\nwhere vertices can send messages to other vertices during a given\nsuperstep.  Checkpoints are initiated by the Giraph infrastructure at\nuser-defined intervals and are used for automatic application restarts\nwhen any worker in the application fails.  Any worker in the\napplication can act as the application coordinator and one will\nautomatically take over if the current application coordinator fails.\n\n-------------------------------\n\nHadoop versions for use with Giraph:\n\nSecure Hadoop versions:\n\n- Apache Hadoop 1 (latest version: 1.2.1)\n\n  This is the default version used by Giraph: if you do not specify a\n  profile with the -P flag, maven will use this version. You may also\n  explicitly specify it with \"mvn -Phadoop_1 \u003cgoals\u003e\".\n\n- Apache Hadoop 2 (latest version: 2.5.1)\n\n  This is the latest version of Hadoop 2 (supporting YARN in addition\n  to MapReduce) that Giraph can use. 
You may tell maven to use this version\n  with \"mvn -Phadoop_2 \u003cgoals\u003e\".\n\n- Apache Hadoop Yarn with 2.2.0\n\n  You may tell maven to use this version with \"mvn -Phadoop_yarn -Dhadoop.version=2.2.0 \u003cgoals\u003e\".\n\n- Apache Hadoop 3.0.0-SNAPSHOT\n\n  You may tell maven to use this version with \"mvn -Phadoop_snapshot \u003cgoals\u003e\".\n\nUnsecure Hadoop versions:\n\n- Facebook Hadoop releases: https://github.com/facebook/hadoop-20, Master branch\n\n  You may tell maven to use this version with \"mvn -Phadoop_facebook \u003cgoals\u003e\".\n\n-- Other versions reported working include:\n---  Cloudera CDH3u0, CDH3u1\n\nWhile we provide support for unsecure and Facebook versions of Hadoop\nwith the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',\nrespectively, we have been primarily focusing on secure Hadoop releases\nat this time.\n\n-------------------------------\n\nBuilding and testing:\n\nYou will need the following:\n- Java 1.8\n- Maven 3 or higher. Giraph uses the munge plugin\n  (http://sonatype.github.com/munge-maven-plugin/),\n  which requires Maven 3, to support multiple versions of Hadoop. Also, the\n  web site plugin requires Maven 3.\n\nUse the maven commands with secure Hadoop to:\n- compile (i.e. mvn compile)\n- package (i.e. mvn package)\n- test (i.e. mvn test)\n\nFor the non-secure versions of Hadoop, run the maven commands with the\nadditional argument '-Phadoop_non_secure'.\nAn example compilation command is 'mvn -Phadoop_non_secure compile'.\n\nFor the Facebook Hadoop release, run the maven commands with the\nadditional argument '-Phadoop_facebook'.\nAn example compilation command is 'mvn -Phadoop_facebook compile'.\n\n-------------------------------\n\nDeveloping:\n\nGiraph is a multi-module maven project. The top level generates a POM that\ncarries information common to all the modules. Each module creates a jar with\nthe code contained in it.\n\nThe giraph/ module contains the main giraph code. 
If you only want to work on\nthe main code, you can do all your work inside this subdirectory.\nSpecifically, you would do something like:\n\n  giraph-root/giraph/ $ mvn verify            # build from current state\n  giraph-root/giraph/ $ mvn clean             # wipe out build files\n  giraph-root/giraph/ $ mvn clean verify      # build from fresh state\n  giraph-root/giraph/ $ mvn install           # install jar to local repository\n\nThe giraph-formats/ module contains hooks to read/write from various\nformats (e.g. Accumulo, HBase, Hive). It depends on the giraph module. This\nmeans that if you make local changes to the giraph codebase you will first need\nto install the giraph/ jar locally so that giraph-formats/ will pick it up.\nIn other words, something like this:\n\n  giraph-root/giraph/ $ mvn install\n  giraph-root/giraph-formats $ mvn verify\n\nTo build everything at once you can issue the maven commands at the top level.\nNote that we use the \"install\" target so that any local changes to giraph/\nthat giraph-formats/ needs are picked up, because the giraph/ jar is installed\nlocally first.\n\n  giraph-root/ $ mvn clean install\n\n-------------------------------\n\nScripting:\n\nGiraph has support for writing user logic in languages other than Java. A Giraph\njob involves at the very least a Computation and Input/Output Formats. There are\nother optional pieces as well, such as Aggregators and Combiners.\n\nAs of this writing, we support writing the Computation logic in Jython. The\nComputation class is at the core of the algorithm so it was a natural starting\npoint. Eventually it is our goal to allow users to write any / all components of\ntheir algorithms in any language they desire.\n\nTo use Jython with our job launcher, GiraphRunner, pass the path to the script\nas the Computation class argument. Additionally, you should set the -jythonClass\noption to let Giraph know the name of your Jython Computation class. 
Lastly, you\nwill need to set -typesHolder to a class that extends Giraph's TypesHolder so\nthat Giraph can infer the types you use. Look at page-rank.py as an example.\n\n-------------------------------\n\nHow to run the unittests on a local pseudo-distributed Hadoop instance:\n\nAs mentioned earlier, Giraph supports several versions of Hadoop.  In\nthis section, we describe how to run the Giraph unittests against a\nsingle-node instance of Apache Hadoop 0.20.203.\n\nDownload Apache Hadoop 0.20.203 (hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz)\nfrom a mirror picked at http://www.apache.org/dyn/closer.cgi/hadoop/common/\nand unpack it into a local directory.\n\nFollow the guide at\nhttp://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed\nto set up a pseudo-distributed single-node Hadoop cluster.\n\nGiraph’s code assumes that you can run at least 4 mappers at once, but\nunfortunately the default configuration allows only 2. Therefore you need\nto update conf/mapred-site.xml:\n\n\u003cproperty\u003e\n  \u003cname\u003emapred.tasktracker.map.tasks.maximum\u003c/name\u003e\n  \u003cvalue\u003e4\u003c/value\u003e\n\u003c/property\u003e\n\n\u003cproperty\u003e\n  \u003cname\u003emapred.map.tasks\u003c/name\u003e\n  \u003cvalue\u003e4\u003c/value\u003e\n\u003c/property\u003e\n\nAfter preparing the local filesystem with:\n\nrm -rf /tmp/hadoop-\u003cusername\u003e\n/path/to/hadoop/bin/hadoop namenode -format\n\nyou can start the local hadoop instance:\n\n/path/to/hadoop/bin/start-all.sh\n\nand finally run Giraph’s unittests:\n\nmvn clean test -Dprop.mapred.job.tracker=localhost:9001\n\nNow you can open a browser, point it to http://localhost:50030 and watch the\nGiraph jobs from the unittests running on your local Hadoop instance!\n\n\nNotes:\nCounter limit: From Hadoop 0.20.203.0 onwards, there is a limit on the number of\ncounters one can use, which is set to 120 by default. This limit restricts the\nnumber of iterations/supersteps possible in Giraph. 
This limit can be increased\nby setting the parameter \"mapreduce.job.counters.limit\" in the job tracker's\nconfig file mapred-site.xml.\n\n","funding_links":[],"categories":["Databases"],"sub_categories":["Spring Cloud Framework"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fgiraph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fgiraph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fgiraph/lists"}