How PageRank Works: From Random Surfer Theory to MapReduce Implementation
This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.
1. What is PageRank
PageRank treats each web page as a node in a directed graph and assigns a rank based on the probability that a "random surfer" lands on that page. The surfer starts at a random page, follows outgoing links with equal probability, and occasionally jumps to a random page, modeling user navigation and estimating page importance.
2. Simplest PageRank Model
Consider a small graph with four pages (A, B, C, D). If the surfer is on page A, it moves to B, C, or D with probability 1/3 because A has three outgoing links. The transition matrix M is built such that M[i][j] = 1/k if page j has k outgoing links to page i, otherwise 0.
Starting with a uniform probability vector V0 (1/n for each of the n pages), successive iterations compute V1 = M·V0, V2 = M·V1, and so on until convergence.
3. Convergence (Termination) Issue
For convergence, the graph must be strongly connected; otherwise dangling pages (pages with no outgoing links) cause probability mass to disappear, leading to a vector of near‑zero values.
4. Trap Issue
Pages that only link to themselves (or form a closed sub‑graph) act as traps, pulling all probability mass into the trap and leaving other pages with zero rank.
5. Solving Termination and Trap Problems
Introduce a damping factor a (typically 0.8). At each step, with probability a the surfer follows a link, and with probability (1‑a) it jumps to a random page (1/n). The updated iteration becomes:
6. PageRank with MapReduce
Because the web graph is massive (billions of pages), direct matrix multiplication is infeasible. MapReduce distributes the computation. The process consists of:
Map phase: emit contributions of each page’s rank to its outgoing links.
Reduce phase: sum contributions for each target page and apply the damping factor.
Below are the essential Python scripts (kept unchanged for Hadoop streaming).
# SortMapper.py
#!/bin/python
'''Mapper for sort'''
import sys
for line in sys.stdin:
print line.strip() # SortReducer.py
#!/bin/python
'''Reducer for sort'''
import sys
for line in sys.stdin:
print line.strip() # PageRankMapper.py
'''mapper of pagerank algorithm'''
import sys
id1 = id2 = None
heros = value = None
count1 = count2 = 0
for line in sys.stdin:
data = line.strip().split('\t')
if len(data) == 3 and data[1] == 'a': # pagerank value
count1 += 1
if count1 >= 2:
print '%s\t%s' % (id1,0.0)
id1 = data[0]
value = float(data[2])
else: # link relation
id2 = data[0]
heros = data[1:]
if id1 == id2 and id1:
v = value / len(heros)
for hero in heros:
print '%s\t%s' % (hero,v)
print '%s\t%s' % (id1,0.0)
id1 = id2 = None
count1 = 0 # PageRankReducer.py
'''reducer of pagerank algorithm'''
import sys
last = None
values = 0.0
alpha = 0.8
N = 4 # number of pages
for line in sys.stdin:
data = line.strip().split('\t')
hero, value = data[0], float(data[1])
if data[0] != last:
if last:
values = alpha * values + (1 - alpha) / N
print '%s\ta\t%s' % (last, values)
last = data[0]
values = value
else:
values += value
if last:
values = alpha * values + (1 - alpha) / N
print '%s\ta\t%s' % (last, values)In a Linux environment, the MapReduce workflow can be simulated with shell scripts that repeatedly sort, map, and reduce the intermediate files, updating the rank vector each iteration.
Running the algorithm for 40 iterations with a damping factor of 0.8 yields a stable rank vector, e.g., after convergence the ranks are approximately:
A 0.101351351393
B 0.128378378439
C 0.641891891728
D 0.128378378439The values demonstrate that the PageRank scores have stabilized and match the expected distribution for the example graph.
Conclusion
The article introduced PageRank theory, highlighted challenges such as dangling nodes and traps, and provided a concrete MapReduce implementation using Python, suitable for large‑scale web graph ranking.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
