Big Data 15 min read

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.

21CTO

Feb 14, 2016

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

1. What is PageRank

PageRank treats each web page as a node in a directed graph and assigns a rank based on the probability that a "random surfer" lands on that page. The surfer starts at a random page, follows outgoing links with equal probability, and occasionally jumps to a random page, modeling user navigation and estimating page importance.

2. Simplest PageRank Model

Consider a small graph with four pages (A, B, C, D). If the surfer is on page A, it moves to B, C, or D with probability 1/3 because A has three outgoing links. The transition matrix M is built such that M[i][j] = 1/k if page j has k outgoing links to page i, otherwise 0.

Starting with a uniform probability vector V0 (1/n for each of the n pages), successive iterations compute V1 = M·V0, V2 = M·V1, and so on until convergence.

3. Convergence (Termination) Issue

For convergence, the graph must be strongly connected; otherwise dangling pages (pages with no outgoing links) cause probability mass to disappear, leading to a vector of near‑zero values.

4. Trap Issue

Pages that only link to themselves (or form a closed sub‑graph) act as traps, pulling all probability mass into the trap and leaving other pages with zero rank.

5. Solving Termination and Trap Problems

Introduce a damping factor a (typically 0.8). At each step, with probability a the surfer follows a link, and with probability (1‑a) it jumps to a random page (1/n). The updated iteration becomes:

6. PageRank with MapReduce

Because the web graph is massive (billions of pages), direct matrix multiplication is infeasible. MapReduce distributes the computation. The process consists of:

Map phase: emit contributions of each page’s rank to its outgoing links.

Reduce phase: sum contributions for each target page and apply the damping factor.

Below are the essential Python scripts (kept unchanged for Hadoop streaming).

# SortMapper.py
#!/bin/python
'''Mapper for sort'''
import sys
for line in sys.stdin:
    print line.strip()

# SortReducer.py
#!/bin/python
'''Reducer for sort'''
import sys
for line in sys.stdin:
    print line.strip()

# PageRankMapper.py
'''mapper of pagerank algorithm'''
import sys
id1 = id2 = None
heros = value = None
count1 = count2 = 0
for line in sys.stdin:
    data = line.strip().split('\t')
    if len(data) == 3 and data[1] == 'a':  # pagerank value
        count1 += 1
        if count1 >= 2:
            print '%s\t%s' % (id1,0.0)
        id1 = data[0]
        value = float(data[2])
    else:  # link relation
        id2 = data[0]
        heros = data[1:]
    if id1 == id2 and id1:
        v = value / len(heros)
        for hero in heros:
            print '%s\t%s' % (hero,v)
        print '%s\t%s' % (id1,0.0)
        id1 = id2 = None
        count1 = 0

# PageRankReducer.py
'''reducer of pagerank algorithm'''
import sys
last = None
values = 0.0
alpha = 0.8
N = 4  # number of pages
for line in sys.stdin:
    data = line.strip().split('\t')
    hero, value = data[0], float(data[1])
    if data[0] != last:
        if last:
            values = alpha * values + (1 - alpha) / N
            print '%s\ta\t%s' % (last, values)
        last = data[0]
        values = value
    else:
        values += value
if last:
    values = alpha * values + (1 - alpha) / N
    print '%s\ta\t%s' % (last, values)

In a Linux environment, the MapReduce workflow can be simulated with shell scripts that repeatedly sort, map, and reduce the intermediate files, updating the rank vector each iteration.

Running the algorithm for 40 iterations with a damping factor of 0.8 yields a stable rank vector, e.g., after convergence the ranks are approximately:

A    0.101351351393
B    0.128378378439
C    0.641891891728
D    0.128378378439

The values demonstrate that the PageRank scores have stabilized and match the expected distribution for the example graph.

Conclusion

The article introduced PageRank theory, highlighted challenges such as dangling nodes and traps, and provided a concrete MapReduce implementation using Python, suitable for large‑scale web graph ranking.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

algorithm Big Data MapReduce PageRank graph theory

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.