Install these tools on every machine — both master and all workers. Do this once before running any setup script.
0.1 Install Python 3
The first prerequisite for all cluster nodes. Python runs the setup scripts and the PageRank job.
# macOS (Homebrew):
brew install python3

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Windows (winget, comes with Windows 10/11):
winget install Python.Python.3.12
# Or download from: https://python.org

0.2 Install Java
Spark and Hadoop run on the Java Virtual Machine. Install OpenJDK 11 or 17 — Java 8 and 21+ cause compatibility issues.
# macOS (Homebrew):
brew install openjdk@17
# Add Java to your PATH permanently:
echo 'export PATH="$(brew --prefix openjdk@17)/bin:$PATH"' >> ~/.zprofile
export PATH="$(brew --prefix openjdk@17)/bin:$PATH"

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y openjdk-17-jdk
# Set JAVA_HOME:
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Windows:
winget install EclipseAdoptium.Temurin.17.JDK --silent
# After install, set JAVA_HOME in PowerShell:
$env:JAVA_HOME = "C:\Program Files\Eclipse Adoptium\jdk-17.0.11.9-hotspot"
# Or download from: https://adoptium.net

0.3 Install Git
Git is used to clone the cluster repository onto each machine.
# macOS (Homebrew):
brew install git

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y git

# Windows:
winget install Git.Git --silent
# Or download from: https://git-scm.com

0.4 Enable SSH
Hadoop daemons communicate between nodes using SSH. Enable Remote Login on your machine:
# macOS
# Option 1: GUI — System Settings → General → Remote Login → ON
# Option 2: Command line (requires admin password):
sudo systemsetup -setremotelogin on
# Verify:
sudo systemsetup -getremotelogin

# Linux: most distros have SSH pre-installed. Verify it's running:
which ssh
sudo systemctl enable ssh
sudo systemctl start ssh

# Windows 10/11 has the OpenSSH client built in. Verify:
Get-WindowsCapability -Online | Where-Object Name -like 'OpenSSH*'
# If needed, install via Settings → Apps → Optional Features → Add OpenSSH Client

Run these steps on the laptop that will act as the cluster master. It coordinates all computation and hosts the API.
1.1 Find Your LAN IP
Your LAN IP is the address other machines on the same Wi-Fi will use to reach yours. It usually starts with 192.168. or 10..
# macOS:
ipconfig getifaddr en0
# Expected output example:
# 192.168.1.14

# Linux:
hostname -I | awk '{print $1}'
# Expected output example:
# 192.168.1.14

# Windows (PowerShell):
(Get-NetIPAddress -AddressFamily IPv4 | Where-Object { $_.IPAddress -like '192.168.*' }).IPAddress
# Expected output example:
# 192.168.1.14

1.2 Configure
Open setup/config.py. There is exactly one line you need to change — MASTER_IP. Everything else is auto-detected.
Before (the default placeholder):
Master_IP = "192.168.1.14" # set to master laptop's LAN IPAfter (your actual LAN IP from Section 1.1):
MASTER_IP = "192.168.1.42" # your actual LAN IP hereconfig.py. If MASTER_IP is wrong, Hadoop and Spark will start but workers will be unable to connect — and the error messages won't point here.Verify the config loads cleanly:
python3 setup/config.py

# Expected output:
OS          : Darwin
Local IP    : 192.168.1.42
MASTER_IP   : 192.168.1.42
JAVA_HOME   : /opt/homebrew/opt/openjdk@17
HADOOP_HOME : /opt/hadoop-3.3.6
SPARK_HOME  : /opt/spark-3.5.3-bin-hadoop3
[ok] Java: openjdk version "17.0.11" 2024-01-16
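For reference, the "auto-detected" local IP can be found with a short socket trick like the one below. This is a sketch of the general approach, not necessarily what setup/config.py does.

# Sketch of LAN-IP auto-detection (an assumed approach). Connecting a UDP
# socket sends no packets; it only selects the outbound interface, whose
# address is then read back.
import socket

def local_ip() -> str:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

print(local_ip())  # e.g. 192.168.1.42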
1.3 Run Master Setup
This single command installs Hadoop (a distributed filesystem) and Spark (a distributed computation engine), formats the filesystem, and starts all required services.
python3 setup/setup_node.py --role master

The script prints phases as it runs. Here is what each one means:
| Phase printed | What is happening |
|---|---|
| -- Java -- | Checks Java is present; installs if missing. |
| -- Python dependencies -- | Installs pyspark, flask, requests via pip. |
| -- Hadoop 3.3.6 (master) -- | Downloads ~650 MB Hadoop archive and extracts it. Writes XML config files pointing at your MASTER_IP. |
| -- Spark 3.5.3 (master) -- | Downloads ~300 MB Spark archive and extracts it. Writes spark-defaults.conf. |
| -- Formatting NameNode -- | Initialises the HDFS metadata directory. Only happens once; skipped on re-runs. |
| -- Starting services (master) -- | Launches NameNode, DataNode, SecondaryNameNode, and Spark Master as background daemons. |
| -- Verification -- | Runs jps to list running Java processes and confirms all four are present. |
After it finishes: check with jps
jps — the Java Virtual Machine Process Status tool — lists every running Java process. After master setup you should see exactly four:
jps

12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 Master
12349 Jps
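If you want to script this check instead of eyeballing the list, here is a minimal sketch (it assumes jps is on your PATH; this helper is not part of the repository):

# Optional helper (not part of the repo): verify all master daemons appear
# in jps output.
import subprocess

required = {"NameNode", "DataNode", "SecondaryNameNode", "Master"}
out = subprocess.run(["jps"], capture_output=True, text=True).stdout
running = {line.split()[-1] for line in out.splitlines() if line.strip()}
missing = required - running
print("All master daemons running" if not missing else f"Missing: {sorted(missing)}")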
If a process is missing, here is where to look:
| Missing process | What to check |
|---|---|
| NameNode | Check Hadoop logs in $HADOOP_HOME/logs/. Common cause: a NameNode is already running (stop it with stop-dfs.sh), or wrong JAVA_HOME in hadoop-env.sh. |
| DataNode | Usually starts after NameNode. If NameNode is up but DataNode isn't, check for port 9000 conflicts: lsof -i :9000. |
| SecondaryNameNode | Non-critical for this lab. Its absence won't block PageRank. |
| Master (Spark) | Check Spark logs in $SPARK_HOME/logs/. |
1.4 Verify in Browser
Two web UIs start automatically. Open them to confirm the cluster is healthy.
Spark UI — http://192.168.1.14:8080
You should see the Spark Master dashboard. Look for: Status: ALIVE and at least one worker listed under "Workers" (the master itself counts as a worker at this stage).
HDFS NameNode UI — http://192.168.1.14:9870
You should see the Hadoop NameNode overview. Look for: Live Nodes: 1 (or more once workers join) and Safe Mode: OFF.
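If you prefer to check from a terminal, the small snippet below probes both UIs over HTTP. It is a convenience sketch; substitute your own MASTER_IP for 192.168.1.14.

# Convenience check (not part of the repo): probe both web UIs over HTTP.
import urllib.request

for name, url in [("Spark UI", "http://192.168.1.14:8080"),
                  ("HDFS NameNode UI", "http://192.168.1.14:9870")]:
    try:
        code = urllib.request.urlopen(url, timeout=5).getcode()
        print(f"{name}: HTTP {code}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")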
If these pages do not load, the firewall is usually the cause. Allow the cluster ports (or temporarily disable the firewall on a trusted lab network):

# macOS — allow Java through the firewall (or disable it temporarily):
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
# Linux — open the required ports:
sudo ufw allow 8080/tcp
sudo ufw allow 9870/tcp
sudo ufw allow 9000/tcp
sudo ufw allow 7077/tcp
sudo ufw allow 5000/tcp
# Windows — open via PowerShell (as Administrator):
netsh advfirewall firewall add rule name="Hadoop Spark Cluster" dir=in action=allow protocol=TCP localport=8080,9870,9000,7077,5000
1.5 Load the Dataset
The dataset is the Stanford Web-Google graph — a snapshot of 875,713 web pages and 5,105,039 hyperlinks between them, collected by Google in 2002. HDFS (the Hadoop Distributed File System) stores data across all nodes so Spark can read it in parallel.
Step 1 — Download
python3 src/download_dataset.py

Downloading from https://snap.stanford.edu/data/web-Google.txt.gz ...
[####################] 100%
Extracting...
[ok] Ready: data/web-Google.txt
Nodes: 875,713
Edges: 5,105,039
Next: hdfs dfs -put data/web-Google.txt /pagerank/input/
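Before pushing to HDFS you can sanity-check the download locally. A minimal sketch (the download script already prints these counts, so this is optional):

# Optional local sanity check of the downloaded graph file. SNAP files use
# '#' for comment lines and one "src<TAB>dst" edge per line.
nodes, edges = set(), 0
with open("data/web-Google.txt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        src, dst = line.split()
        nodes.update((src, dst))
        edges += 1
print(f"Nodes: {len(nodes):,}  Edges: {edges:,}")  # expect 875,713 and 5,105,039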
Step 2 — Push to HDFS
hdfs dfs -put copies a local file into the distributed filesystem so all workers can access it.
hdfs dfs -put data/web-Google.txt /pagerank/input/

1.6 Run PageRank
This submits the PageRank job to Spark. spark-submit packages your Python script and sends it to the Spark Master, which distributes the computation across all connected workers.
spark-submit \
--master spark://192.168.1.14:7077 \
--executor-memory 2g \
--driver-memory 1g \
src/pagerank.py 10

The number 10 is the iteration count. During the run you will see lines like:
Iteration  1/10   12.3s
Iteration  2/10   11.8s
Iteration  3/10   11.5s
...
Iteration 10/10   11.2s
Collecting results...
=======================================================
 TOP 5 MOST INFLUENTIAL NODES
=======================================================
 #1  node 41909   ->  445.71778597
 #2  node 597621  ->  406.62836675
 #3  node 504140  ->  399.08930875
 #4  node 384666  ->  392.82584373
 #5  node 537039  ->  383.90912550
=======================================================
[ok] Done in 118.4s
Each iteration line shows the wall-clock time for one pass over the graph. Times should be roughly stable; a sudden spike usually means a worker disconnected.
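For context, the per-iteration update the job performs looks roughly like the sketch below. This is a simplified illustration of the standard PySpark PageRank pattern with damping factor 0.85, not a copy of the repository's src/pagerank.py.

# Simplified sketch of the PageRank loop in PySpark (illustrative only; the
# repository's src/pagerank.py may differ in structure and in how it handles
# sink nodes). Damping factor 0.85 and 10 iterations match the values
# reported by /stats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
sc = spark.sparkContext

# web-Google.txt: comment lines start with '#', edges are "src<TAB>dst".
edges = (sc.textFile("hdfs:///pagerank/input/web-Google.txt")
           .filter(lambda line: not line.startswith("#"))
           .map(lambda line: tuple(line.split()[:2])))

links = edges.groupByKey().cache()       # node -> its outgoing neighbors
ranks = links.mapValues(lambda _: 1.0)   # every node starts with rank 1.0

for _ in range(10):                      # iteration count from the CLI argument
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.takeOrdered(5, key=lambda kv: -kv[1]))
spark.stop()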
Where results are saved
| File | Contents |
|---|---|
| data/top5.json | Top 5 nodes — this is what the API serves. |
| data/results.json | Top 1000 nodes with scores, for /node/<id> and /top/<n> queries. |
| data/graph.json | Full adjacency list — every node and its outgoing neighbors. Used by /neighbors and /influencedby. |
| data/meta.json | Job metadata — dataset name, node/edge counts, iterations, damping factor, top node, completion timestamp. Used by /stats. |
| data/pagerank_output/part-00000 | All nodes, tab-separated: nodeId\tscore. |
1.7 Start the API
The REST API is a small Flask server that reads data/top5.json and serves it to other groups over HTTP.
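For orientation, a stripped-down sketch of what such a server looks like is shown below; the real src/api.py serves many more routes and richer metadata.

# Stripped-down sketch of the API's shape (assumed; the real src/api.py has
# more routes, error handling, and background-rerun logic).
import json
from flask import Flask, jsonify

app = Flask(__name__)

with open("data/top5.json") as f:
    TOP5 = json.load(f)

@app.route("/health")
def health():
    return jsonify({"status": "ok", "results_ready": bool(TOP5)})

@app.route("/top5")
def top5():
    return jsonify(TOP5)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)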
Verify locally first
python3 src/api.py

==================================================
 Group 03 - PageRank Portability API
 http://192.168.1.14:5000
   GET  /top5              -> top 5 influencers
   GET  /top/<n>           -> top N ranked nodes
   GET  /node/<id>         -> score for node id
   GET  /neighbors/<id>    -> outgoing edges
   GET  /influencedby/<id> -> incoming edges
   GET  /stats             -> job metadata + service info
   POST /rerun             -> trigger background rerun
   GET  /rerun/status      -> rerun job status
   GET  /health            -> status check
 [ok] Results loaded -- top node: 41909 (445.71778597)
==================================================
In a second terminal, confirm it responds:
curl http://localhost:5000/health

Run it in the background (so you can close the terminal)

# macOS / Linux:
nohup python3 src/api.py > /tmp/api.log 2>&1 &
echo "API PID: $!"
# To stop it later:
# kill $(lsof -ti :5000)

# PowerShell -- start as a background job
Start-Job -ScriptBlock { python3 src/api.py } | Out-Null
Write-Host 'API started in background'
# To stop: Get-Job | Stop-Job

Run these steps on every laptop that will join the cluster as a worker. The master must already be running (Part 1 complete) before you start here.
2.1 Before You Start
The master node must be fully running before you set up any worker. Specifically:
- Section 1.3 must be complete — jps on the master shows all four processes.
- You need the master's LAN IP address from Section 1.1.
- Both machines must be on the same Wi-Fi or wired LAN.

# macOS / Linux:
ping -c 3 192.168.1.14
# Windows:
ping -n 3 192.168.1.14
# All 3 packets should receive a reply.

2.2 Clone and Configure
Clone the repository on the worker machine

git clone https://github.com/munimx/pagerank-cluster
cd pagerank-cluster

Then set MASTER_IP in setup/config.py to the master's LAN IP, exactly as in Section 1.2.

2.3 Run Worker Setup
python3 setup/setup_node.py --role worker

This installs Java, Python dependencies, Hadoop, and Spark — same as the master, but starts only the DataNode and Spark Worker daemons (not the NameNode or Spark Master).
After it finishes: check with jps
jps

23456 DataNode
23457 Worker
23458 Jps
You should not see NameNode or Master on a worker — those only run on the master machine.
2.4 Register with Master
The master needs to know this worker exists. This is a two-step handshake: the worker reports its IP, then the master adds it to the cluster roster.
Step 1 — Find your own IP (on the worker machine)
# macOS / Linux:
hdfs dfsadmin -report | grep 'Live datanodes'
# Windows:
hdfs dfsadmin -report | findstr "Live datanodes"

2.5 Most Common Failures
Failure 1 — MASTER_IP mismatch
The worker starts but never appears in the Spark UI or HDFS report.
ERROR Worker: All masters are unresponsive! Giving up.
ERROR Worker: Connection to spark://192.168.1.99:7077 failed
Fix: Open setup/config.py on the worker machine and correct MASTER_IP to the master's actual LAN IP. Then re-run worker setup.
Failure 2 — Worker can't reach master (firewall)
Ping succeeds but Spark/HDFS connections time out.
# Option 1: Disable the firewall (simplest for a lab cluster):
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
# Option 2: Allow Java through the firewall:
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/bin/java

# Linux:
sudo ufw allow 7077/tcp
sudo ufw allow 9000/tcp
sudo ufw allow 9866/tcp
sudo ufw allow 9867/tcp
sudo ufw allow 8080/tcp
sudo ufw allow 9870/tcp

# Windows (run as Administrator):
netsh advfirewall firewall add rule name="Hadoop-Spark Cluster" dir=in action=allow protocol=TCP localport=7077,9000,9866,9867,8080,9870

These instructions are for the group that is testing Group 03's output. Follow them from any machine on the same LAN as the Group 03 master.
3.1 Check the Service Is Up
Before querying results, confirm the API process is alive.
Three ways to call GET /health
# curl
curl http://192.168.1.14:5000/health
# Python
python3 -c "import urllib.request, json; print(json.dumps(json.loads(urllib.request.urlopen('http://192.168.1.14:5000/health').read()), indent=2))"
# Browser
# http://192.168.1.14:5000/health

{
  "dataset": "Stanford Web-Google",
  "framework": "Apache Spark",
  "group": "03",
  "results_ready": true,
  "status": "ok",
  "task": "Network Graph PageRank"
}
If the connection is refused: the API process isn't running. Contact the Group 03 master operator to restart it (python3 src/api.py &).
If you get a timeout: check network connectivity with ping first, then check firewall rules (Section 2.5).
3.2 Query the Top 5 Results
GET /top5 returns the five highest-ranked nodes in the Web-Google graph.
# curl
curl http://192.168.1.14:5000/top5
# Python urllib (no extra libraries needed)
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/top5').read())
for n in data:
    print(n['rank'], n['nodeId'], n['pagerank'])
"
# Browser
# http://192.168.1.14:5000/top5

[
  {
    "rank": 1,
    "nodeId": "41909",
    "pagerank": 445.71778597
  },
  ...
]
3.3 Query a Specific Node
GET /node/<id> returns the score for one node. Use the top-ranked node's ID as your test case.
# curl -- query the top node
curl http://192.168.1.14:5000/node/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/node/41909').read())
print(json.dumps(data, indent=2))
"{
"nodeId": "41909",
"pagerank": 445.71778597
3.4 Top N Nodes
GET /top/<n> returns the top n ranked nodes (capped at 1000, the size of the stored result set).
# Get top 10 nodes
curl http://192.168.1.14:5000/top/10
# Get top 50 nodes
curl http://192.168.1.14:5000/top/50

{
  "requested": 10,
  "returned": 10,
  "nodes": [
    {"nodeId": "41909", "pagerank": 445.71778597},
    ...
  ]
}

Edge cases

- n < 1 → returns 400 with an error message.
- n > 1000 → returns all 1000 available nodes with a note field explaining the cap.
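A quick way to exercise both edge cases from Python (a sketch; it only checks the HTTP status codes described above, and assumes the example master IP):

# Exercise the /top/<n> edge cases; only HTTP status codes are checked.
import urllib.request, urllib.error

BASE = "http://192.168.1.14:5000"
for n in (0, 10, 5000):
    try:
        print(n, "->", urllib.request.urlopen(f"{BASE}/top/{n}").getcode())
    except urllib.error.HTTPError as e:
        print(n, "->", e.code)  # expect 400 for n < 1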
3.5 Outgoing Edges (Neighbors)
GET /neighbors/<node_id> returns every node that the given node links to (outgoing edges, i.e. the node's adjacency list).
# Get all outgoing neighbors of the top-ranked node
curl http://192.168.1.14:5000/neighbors/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/neighbors/41909').read())
print(json.dumps(data, indent=2))
"{
"nodeId": "41909",
"direction": "outgoing",
"count": 42,
"neighbors": ["123", "456", "789", ...]
Sink nodes
Nodes with no outgoing edges (sink nodes) will return 404 with the message: This may be a sink node (has no outgoing edges — it receives links but doesn't link to anything). This is expected behaviour — sink nodes exist in the graph but have no entries in the adjacency list.
3.6 Incoming Edges (Influencers)
GET /influencedby/<node_id> returns every node that links to the given node (incoming edges). This is computed on first request by building a reverse-index from data/graph.json, then cached in memory for all subsequent calls — it does not rebuild on every request.
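A minimal sketch of that reverse-index build is shown below. The exact layout of data/graph.json and of the real api.py may differ; this assumes the file is a JSON object mapping each node ID to its list of outgoing neighbors.

# Sketch of the one-time reverse-index build behind /influencedby (assumed;
# this treats data/graph.json as {"node": ["neighbor", ...], ...}).
import json
from collections import defaultdict

with open("data/graph.json") as f:
    graph = json.load(f)

incoming = defaultdict(list)            # built once, then kept in memory
for src, dsts in graph.items():
    for dst in dsts:
        incoming[dst].append(src)

print(len(incoming.get("41909", [])), "nodes link to 41909")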
# Get all nodes that link to the top-ranked node
curl http://192.168.1.14:5000/influencedby/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/influencedby/41909').read())
print(f"Node {data['nodeId']{
"nodeId": "41909",
"direction": "incoming",
"count": 18,
"sources": ["789", "101", "202", ...]
3.7 Job Statistics
GET /stats returns the job metadata recorded when pagerank.py finished, plus live service information.
curl http://192.168.1.14:5000/stats
# Python
python3 -c "
import urllib.request, json
print(json.dumps(json.loads(urllib.request.urlopen('http://192.168.1.14:5000/stats').read()), indent=2))
"{
"dataset": "web-Google.txt",
"total_nodes": 875713,
"total_edges": 5105039,
"iterations": 10,
"damping_factor": 0.85,
"top_node": "41909",
"completed_at": "2026-05-10T12:34:56+00:00",
"api_status": "ok",
"endpoints": ["/top5", "/top/", "/node/", "/neighbors/", "/influencedby/", "/stats", "/rerun", "/health"]
3.8 Background Rerun
POST /rerun triggers a PageRank rerun in the background. It accepts optional parameters to change the iteration count, damping factor, or swap to an entirely different dataset from SNAP.
# Rerun with more iterations
curl -X POST http://192.168.1.14:5000/rerun \
-H 'Content-Type: application/json' \
-d '{"iterations": 15{
"job_id": "rerun_1746861234",
"status": "queued",
"params": {"iterations": 15Check job status
curl http://192.168.1.14:5000/rerun/status// idle (no rerun ever triggered)
{"status": "idle"dataset_url is provided, the API downloads the file, places it in HDFS, updates the metadata, then runs pagerank.py. The URL must point to a gzipped SNAP graph file (e.g. from https://snap.stanford.edu/data/). The file is downloaded locally, extracted, then pushed to HDFS so it is available to all cluster nodes./influencedby reflects the new graph.3.9If Something Fails
3.9 If Something Fails

API unreachable (connection refused / timeout)
Step 1: confirm basic network connectivity first.
# macOS / Linux:
ping -c 4 192.168.1.14
# Windows:
ping -n 4 192.168.1.14