https://github.com/zhaytam/pagerank

An implementation of the PageRank algorithm in Hadoop MapReduce
https://github.com/zhaytam/pagerank

hadoop java pagerank-algorithm

Last synced: 3 months ago
JSON representation

An implementation of the PageRank algorithm in Hadoop MapReduce

Host: GitHub
URL: https://github.com/zhaytam/pagerank
Owner: zHaytam
License: mit
Created: 2018-03-10T23:17:17.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-03-11T23:21:10.000Z (over 7 years ago)
Last Synced: 2025-02-12T15:53:13.531Z (4 months ago)
Topics: hadoop, java, pagerank-algorithm
Language: Java
Homepage:
Size: 141 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # PageRank

An implementation of the Page Rank algorithm using Hadoop (Java).

This was tested using the inputs (available above):

- pagerank_data.txt - A testing example

- hollins.data - A dataset that can be found [here](https://web2.qatar.cmu.edu/~gdicaro/15381/hw/hw4-files/hollins.dat).

## Input Format

The input used in this implementation (inputs) is as follows:

*Note: Each line is 2 values seperated by a space.*

 - First line: NodesCount EdgesCount *(e.g. "5 9")*

 - NC next lines: NodeID NodeURL *(e.g. "1 http://example.com")*

 - EC next lines: NodeID OutlinkToNodeID *(e.g. "1 2")*

## Step 1

- **Mapper:** Reads the initial input file, ignores the first line and the NC next lines and outputs the EC edges `` (e.g. `<1, 2>`).

- **Reducer:** Receives the EC edges (Each node and an Iterable of its outlinks) and outputs for each node a concatenated list of its outlinks prefixed with an initial page rank (separated by a tab) `` (e.g. `<1, "1.0 3,2">`). 

## Step 2

This is the most important step in the process. This is ran multiple times since the output of this step will be the same as in the first step.

- **Mapper:** Reads the last output generated (wheither from the first step or from a previous iteration) and outputs for each `[node, rank, "outlink1,outlink2,..."]`:

  - For each outlink it outputs the outlink and `rank / len(outlinks)`: `` (e.g. `<"3", "0.5">`.

  - The node and its outlinks prefixed with an open bracket:  (e.g. `<"1",  "[3,2">`).

-  **Reducer:** Receives for each node a list of values (the values can be either page ranks or the list of the outlinks), calculates the pagerank by using the formula `(1 - d) + (d * sum(pageRank))` and outputs the same structure as the first step:  `(e.g. <"1", "0.7875 3,2">)`.

This can be hard to understand using sentences, so here is a pseudo-code:

```

Map(Offset, Text):

	node, rank, outlinks = Parse(Text)

	

	if outlinks == None:

		return

	for (outlink in outlinks):

		Write(outlink, rank / len(outlinks))

	Write(node, '[' + outlinks)

Reduce(Node, Text[]):

	outlinks = []

	totalRank = 0

	for (text in Text):

		if text.startswith('['):

			outlinks = text[1:]

		else

			totalRank += text

	totalRank = (1 - 0.85) + (0.85 * totalRank)

	Write(Node, totalRank + '\t' + outlinks)

```

This step is ran X times (The X must be given in the arguments) or until a minimum score is met.

The score is calculated as follows: `∑i(|lastRanks[i] - newRanks[i]|)`.

## Step 3

This step basically only needs a mapper and the use of the shuffle&sort phase to print the rankings but can use a Reducer to, for example, print the top N ranked pages.

- **Setup:** Before mapping, we read the input data (the NC nodes lines) and fill a `HashMap` with each node and its corresponding url, this is simply to be able to print the url of the pages in the ranking instead of just the node ids.

- **Mapper:** Receives the last ranks output (node, rank, outlinks) and outputs for each: `` (e.g. `<1.6375, "http://www.hollins.edu/">`). We write the rank as the key so that the shuffle&sort phase (using a custom SortComparator) sorts our entries in a descending order (thus not needing a Reducer).

## Testing

 1. Download the sources.

 2. Compile to a .jar file.

 3. Create an `input` folder and put `pagerank_data.txt` in there.

 4. Use the following command `hadoop jar File.jar PageRank /input /output /input/pagerank_data.txt 0.85 5 true true`: 

    - **/input:** The input folder

    - **/output:** The output folder

    - **/input/pagerank_data.txt:** The data file (used to get the urls in step 3)

    - **0.85:** The damping factor.

    - **5:** The maximum number of iterations.

    - **0.01:** The criterion to break from the iterations (minimum difference).

    - **true:** Delete the output folder before starting (if found).

    - **true:** Show the results at the end (/ranking/part-r-00000 file).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/zhaytam/pagerank

Awesome Lists containing this project

README