Install these tools on every machine — both master and all workers. Do this once before running any setup script.
0.1 Install Python 3
The first prerequisite for all cluster nodes. Python runs the setup scripts and the PageRank job.
# macOS (Homebrew):
brew install python3

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Windows (winget, comes with Windows 10/11):
winget install Python.Python.3.12
# Or download from: https://python.org

0.2 Install Java
Spark and Hadoop run on the Java Virtual Machine. Install OpenJDK 11 or 17 — Java 8 and 21+ cause compatibility issues.
# macOS (Homebrew):
brew install openjdk@17
# Add Java to your PATH permanently:
echo 'export PATH="$(brew --prefix openjdk@17)/bin:$PATH"' >> ~/.zprofile
export PATH="$(brew --prefix openjdk@17)/bin:$PATH"

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y openjdk-17-jdk
# Set JAVA_HOME:
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Windows:
winget install EclipseAdoptium.Temurin.17.JDK --silent
# After install, set JAVA_HOME in PowerShell:
$env:JAVA_HOME = "C:\Program Files\Eclipse Adoptium\jdk-17.0.11.9-hotspot"
# Or download from: https://adoptium.net

0.3 Install Git
Git is used to clone the cluster repository onto each machine.
# macOS (Homebrew):
brew install git

# Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y git

# Windows:
winget install Git.Git --silent
# Or download from: https://git-scm.com

0.4 Enable SSH
Hadoop daemons communicate between nodes using SSH. Enable Remote Login on your machine:
# macOS
# Option 1: GUI — System Settings → General → Remote Login → ON
# Option 2: Command line (requires admin password):
sudo systemsetup -setremotelogin on
# Verify:
sudo systemsetup -getremotelogin

# Linux: most distros have SSH pre-installed. Verify it's running:
which ssh
sudo systemctl enable ssh
sudo systemctl start ssh

# Windows 10/11 has the OpenSSH client built in. Verify:
Get-WindowsCapability -Online | Where-Object Name -like 'OpenSSH*'
# If needed, install via Settings → Apps → Optional Features → Add OpenSSH Client

Run these steps on the laptop that will act as the cluster master. It coordinates all computation and hosts the API.
1.1 Find Your LAN IP
Your LAN IP is the address other machines on the same Wi-Fi will use to reach yours. It usually starts with 192.168. or 10..
# macOS:
ipconfig getifaddr en0
# Expected output example:
# 192.168.1.14

# Linux:
hostname -I | awk '{print $1}'
# Expected output example:
# 192.168.1.14

# Windows (PowerShell):
(Get-NetIPAddress -AddressFamily IPv4 | Where-Object { $_.IPAddress -like '192.168.*' }).IPAddress
# Expected output example:
# 192.168.1.14

1.2 Configure
Open setup/config.py. There is exactly one line you need to change — MASTER_IP. Everything else is auto-detected.
Before (the default placeholder):
Master_IP = "192.168.1.14" # set to master laptop's LAN IPAfter (your actual LAN IP from Section 1.1):
MASTER_IP = "192.168.1.42" # your actual LAN IP hereconfig.py. If MASTER_IP is wrong, Hadoop and Spark will start but workers will be unable to connect — and the error messages won't point here.Verify the config loads cleanly:
python3 setup/config.py

# Expected output:
OS          : Darwin
Local IP    : 192.168.1.42
MASTER_IP   : 192.168.1.42
JAVA_HOME   : /opt/homebrew/opt/openjdk@17
HADOOP_HOME : /opt/hadoop-3.3.6
SPARK_HOME  : /opt/spark-3.5.3-bin-hadoop3
[ok] Java: openjdk version "17.0.11" 2024-01-16
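For reference, the "auto-detected" local IP can be found with a short socket trick like the one below. This is a sketch of the general approach, not necessarily what setup/config.py does.

# Sketch of LAN-IP auto-detection (an assumed approach). Connecting a UDP
# socket sends no packets; it only selects the outbound interface, whose
# address is then read back.
import socket

def local_ip() -> str:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

print(local_ip())  # e.g. 192.168.1.42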
1.3 Run Master Setup
This single command installs Hadoop (a distributed filesystem) and Spark (a distributed computation engine), formats the filesystem, and starts all required services.
python3 setup/setup_node.py --role master

The script prints phases as it runs. Here is what each one means:
| Phase printed | What is happening |
|---|---|
| -- Java -- | Checks Java is present; installs if missing. |
| -- Python dependencies -- | Installs pyspark, flask, requests via pip. |
| -- Hadoop 3.3.6 (master) -- | Downloads ~650 MB Hadoop archive and extracts it. Writes XML config files pointing at your MASTER_IP. |
| -- Spark 3.5.3 (master) -- | Downloads ~300 MB Spark archive and extracts it. Writes spark-defaults.conf. |
| -- Formatting NameNode -- | Initialises the HDFS metadata directory. Only happens once; skipped on re-runs. |
| -- Starting services (master) -- | Launches NameNode, DataNode, SecondaryNameNode, and Spark Master as background daemons. |
| -- Verification -- | Runs jps to list running Java processes and confirms all four are present. |
After it finishes: check with jps
jps — the Java Virtual Machine Process Status tool — lists every running Java process. After master setup you should see exactly four:
jps

12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 Master
12349 Jps
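If you want to script this check instead of eyeballing the list, here is a minimal sketch (it assumes jps is on your PATH; this helper is not part of the repository):

# Optional helper (not part of the repo): verify all master daemons appear
# in jps output.
import subprocess

required = {"NameNode", "DataNode", "SecondaryNameNode", "Master"}
out = subprocess.run(["jps"], capture_output=True, text=True).stdout
running = {line.split()[-1] for line in out.splitlines() if line.strip()}
missing = required - running
print("All master daemons running" if not missing else f"Missing: {sorted(missing)}")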
If a process is missing, here is where to look:
| Missing process | What to check |
|---|---|
| NameNode | Check Hadoop logs in $HADOOP_HOME/logs/. Common cause: a NameNode is already running (stop it with stop-dfs.sh), or wrong JAVA_HOME in hadoop-env.sh. |
| DataNode | Usually starts after NameNode. If NameNode is up but DataNode isn't, check for port 9000 conflicts: lsof -i :9000. |
| SecondaryNameNode | Non-critical for this lab. Its absence won't block PageRank. |
| Master (Spark) | Check Spark logs in $SPARK_HOME/logs/. |
1.4 Verify in Browser
Two web UIs start automatically. Open them to confirm the cluster is healthy.
Spark UI — http://192.168.1.14:8080
You should see the Spark Master dashboard. Look for: Status: ALIVE and at least one worker listed under "Workers" (the master itself counts as a worker at this stage).
HDFS NameNode UI — http://192.168.1.14:9870
You should see the Hadoop NameNode overview. Look for: Live Nodes: 1 (or more once workers join) and Safe Mode: OFF.
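If you prefer to check from a terminal, the small snippet below probes both UIs over HTTP. It is a convenience sketch; substitute your own MASTER_IP for 192.168.1.14.

# Convenience check (not part of the repo): probe both web UIs over HTTP.
import urllib.request

for name, url in [("Spark UI", "http://192.168.1.14:8080"),
                  ("HDFS NameNode UI", "http://192.168.1.14:9870")]:
    try:
        code = urllib.request.urlopen(url, timeout=5).getcode()
        print(f"{name}: HTTP {code}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")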
If these pages do not load, the firewall is usually the cause. Allow the cluster ports (or temporarily disable the firewall on a trusted lab network):

# macOS — allow Java through the firewall (or disable it temporarily):
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
# Linux — open the required ports:
sudo ufw allow 8080/tcp
sudo ufw allow 9870/tcp
sudo ufw allow 9000/tcp
sudo ufw allow 7077/tcp
sudo ufw allow 5000/tcp
# Windows — open via PowerShell (as Administrator):
netsh advfirewall firewall add rule name="Hadoop Spark Cluster" dir=in action=allow protocol=TCP localport=8080,9870,9000,7077,5000
1.5 Load the Dataset
The dataset is the Stanford Web-Google graph — a snapshot of 875,713 web pages and 5,105,039 hyperlinks between them, collected by Google in 2002. HDFS (the Hadoop Distributed File System) stores data across all nodes so Spark can read it in parallel.
Step 1 — Download
python3 src/download_dataset.py

Downloading from https://snap.stanford.edu/data/web-Google.txt.gz ...
[####################] 100%
Extracting...
[ok] Ready: data/web-Google.txt
Nodes: 875,713
Edges: 5,105,039
Next: hdfs dfs -put data/web-Google.txt /pagerank/input/
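Before pushing to HDFS you can sanity-check the download locally. A minimal sketch (the download script already prints these counts, so this is optional):

# Optional local sanity check of the downloaded graph file. SNAP files use
# '#' for comment lines and one "src<TAB>dst" edge per line.
nodes, edges = set(), 0
with open("data/web-Google.txt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        src, dst = line.split()
        nodes.update((src, dst))
        edges += 1
print(f"Nodes: {len(nodes):,}  Edges: {edges:,}")  # expect 875,713 and 5,105,039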
Step 2 — Push to HDFS
hdfs dfs -put copies a local file into the distributed filesystem so all workers can access it.
hdfs dfs -put data/web-Google.txt /pagerank/input/

1.6 Run PageRank
This submits the PageRank job to Spark. spark-submit packages your Python script and sends it to the Spark Master, which distributes the computation across all connected workers.
spark-submit \
--master spark://192.168.1.14:7077 \
--executor-memory 2g \
--driver-memory 1g \
src/pagerank.py 10

The number 10 is the iteration count. During the run you will see lines like:
Iteration  1/10   12.3s
Iteration  2/10   11.8s
Iteration  3/10   11.5s
...
Iteration 10/10   11.2s
Collecting results...
=======================================================
 TOP 5 MOST INFLUENTIAL NODES
=======================================================
 #1  node 41909   ->  445.71778597
 #2  node 597621  ->  406.62836675
 #3  node 504140  ->  399.08930875
 #4  node 384666  ->  392.82584373
 #5  node 537039  ->  383.90912550
=======================================================
[ok] Done in 118.4s
Each iteration line shows the wall-clock time for one pass over the graph. Times should be roughly stable; a sudden spike usually means a worker disconnected.
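For context, the per-iteration update the job performs looks roughly like the sketch below. This is a simplified illustration of the standard PySpark PageRank pattern with damping factor 0.85, not a copy of the repository's src/pagerank.py.

# Simplified sketch of the PageRank loop in PySpark (illustrative only; the
# repository's src/pagerank.py may differ in structure and in how it handles
# sink nodes). Damping factor 0.85 and 10 iterations match the values
# reported by /stats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
sc = spark.sparkContext

# web-Google.txt: comment lines start with '#', edges are "src<TAB>dst".
edges = (sc.textFile("hdfs:///pagerank/input/web-Google.txt")
           .filter(lambda line: not line.startswith("#"))
           .map(lambda line: tuple(line.split()[:2])))

links = edges.groupByKey().cache()       # node -> its outgoing neighbors
ranks = links.mapValues(lambda _: 1.0)   # every node starts with rank 1.0

for _ in range(10):                      # iteration count from the CLI argument
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.takeOrdered(5, key=lambda kv: -kv[1]))
spark.stop()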
Where results are saved
| File | Contents |
|---|---|
| data/top5.json | Top 5 nodes — this is what the API serves. |
| data/results.json | Top 1000 nodes with scores, for /node/<id> and /top/<n> queries. |
| data/graph.json | Full adjacency list — every node and its outgoing neighbors. Used by /neighbors and /influencedby. |
| data/meta.json | Job metadata — dataset name, node/edge counts, iterations, damping factor, top node, completion timestamp. Used by /stats. |
| data/pagerank_output/part-00000 | All nodes, tab-separated: nodeId\tscore. |
1.7 Start the API
The REST API is a small Flask server that reads data/top5.json and serves it to other groups over HTTP.
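For orientation, a stripped-down sketch of what such a server looks like is shown below; the real src/api.py serves many more routes and richer metadata.

# Stripped-down sketch of the API's shape (assumed; the real src/api.py has
# more routes, error handling, and background-rerun logic).
import json
from flask import Flask, jsonify

app = Flask(__name__)

with open("data/top5.json") as f:
    TOP5 = json.load(f)

@app.route("/health")
def health():
    return jsonify({"status": "ok", "results_ready": bool(TOP5)})

@app.route("/top5")
def top5():
    return jsonify(TOP5)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)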
Verify locally first
python3 src/api.py

==================================================
 Group 03 - PageRank Portability API
 http://192.168.1.14:5000
   GET  /top5              -> top 5 influencers
   GET  /top/<n>           -> top N ranked nodes
   GET  /node/<id>         -> score for node id
   GET  /neighbors/<id>    -> outgoing edges
   GET  /influencedby/<id> -> incoming edges
   GET  /stats             -> job metadata + service info
   POST /rerun             -> trigger background rerun
   GET  /rerun/status      -> rerun job status
   GET  /health            -> status check
 [ok] Results loaded -- top node: 41909 (445.71778597)
==================================================
In a second terminal, confirm it responds:
curl http://localhost:5000/health

Run it in the background (so you can close the terminal)

# macOS / Linux:
nohup python3 src/api.py > /tmp/api.log 2>&1 &
echo "API PID: $!"
# To stop it later:
# kill $(lsof -ti :5000)

# PowerShell -- start as a background job
Start-Job -ScriptBlock { python3 src/api.py } | Out-Null
Write-Host 'API started in background'
# To stop: Get-Job | Stop-Job

Run these steps on every laptop that will join the cluster as a worker. The master must already be running (Part 1 complete) before you start here.
2.1 Before You Start
The master node must be fully running before you set up any worker. Specifically:
- Section 1.3 must be complete — jps on the master shows all four processes.
- You need the master's LAN IP address from Section 1.1.
- Both machines must be on the same Wi-Fi or wired LAN.

# macOS / Linux:
ping -c 3 192.168.1.14
# Windows:
ping -n 3 192.168.1.14
# All 3 packets should receive a reply.

2.2 Clone and Configure
Clone the repository on the worker machine

git clone https://github.com/munimx/pagerank-cluster
cd pagerank-cluster

Then set MASTER_IP in setup/config.py to the master's LAN IP, exactly as in Section 1.2.

2.3 Run Worker Setup
python3 setup/setup_node.py --role worker

This installs Java, Python dependencies, Hadoop, and Spark — same as the master, but starts only the DataNode and Spark Worker daemons (not the NameNode or Spark Master).
After it finishes: check with jps
jps

23456 DataNode
23457 Worker
23458 Jps
You should not see NameNode or Master on a worker — those only run on the master machine.
2.4 Register with Master
The master needs to know this worker exists. This is a two-step handshake: the worker reports its IP, then the master adds it to the cluster roster.
Step 1 — Find your own IP (on the worker machine)
# macOS / Linux:
hdfs dfsadmin -report | grep 'Live datanodes'
# Windows:
hdfs dfsadmin -report | findstr "Live datanodes"

2.5 Most Common Failures
Failure 1 — MASTER_IP mismatch
The worker starts but never appears in the Spark UI or HDFS report.
ERROR Worker: All masters are unresponsive! Giving up.
ERROR Worker: Connection to spark://192.168.1.99:7077 failed
Fix: Open setup/config.py on the worker machine and correct MASTER_IP to the master's actual LAN IP. Then re-run worker setup.
Failure 2 — Worker can't reach master (firewall)
Ping succeeds but Spark/HDFS connections time out.
# Option 1: Disable the firewall (simplest for a lab cluster):
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off
# Option 2: Allow Java through the firewall:
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --add /usr/bin/java

# Linux:
sudo ufw allow 7077/tcp
sudo ufw allow 9000/tcp
sudo ufw allow 9866/tcp
sudo ufw allow 9867/tcp
sudo ufw allow 8080/tcp
sudo ufw allow 9870/tcp

# Windows (run as Administrator):
netsh advfirewall firewall add rule name="Hadoop-Spark Cluster" dir=in action=allow protocol=TCP localport=7077,9000,9866,9867,8080,9870

These instructions are for the group that is testing Group 03's output. Follow them from any machine on the same LAN as the Group 03 master.
3.1 Check the Service Is Up
Before querying results, confirm the API process is alive.
Three ways to call GET /health
# curl
curl http://192.168.1.14:5000/health
# Python
python3 -c "import urllib.request, json; print(json.dumps(json.loads(urllib.request.urlopen('http://192.168.1.14:5000/health').read()), indent=2))"
# Browser
# http://192.168.1.14:5000/health

{
  "dataset": "Stanford Web-Google",
  "framework": "Apache Spark",
  "group": "03",
  "results_ready": true,
  "status": "ok",
  "task": "Network Graph PageRank"
}
If the connection is refused: the API process isn't running. Contact the Group 03 master operator to restart it (python3 src/api.py &).
If you get a timeout: check network connectivity with ping first, then check firewall rules (Section 2.5).
3.2 Query the Top 5 Results
GET /top5 returns the five highest-ranked nodes in the Web-Google graph.
# curl
curl http://192.168.1.14:5000/top5
# Python urllib (no extra libraries needed)
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/top5').read())
for n in data:
    print(n['rank'], n['nodeId'], n['pagerank'])
"
# Browser
# http://192.168.1.14:5000/top5

[
  {
    "rank": 1,
    "nodeId": "41909",
    "pagerank": 445.71778597
  },
  ...
]
3.3 Query a Specific Node
GET /node/<id> returns the score for one node. Use the top-ranked node's ID as your test case.
# curl -- query the top node
curl http://192.168.1.14:5000/node/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/node/41909').read())
print(json.dumps(data, indent=2))
"{
"nodeId": "41909",
"pagerank": 445.71778597
3.4 Top N Nodes
GET /top/<n> returns the top n ranked nodes (capped at 1000, the size of the stored result set).
# Get top 10 nodes
curl http://192.168.1.14:5000/top/10
# Get top 50 nodes
curl http://192.168.1.14:5000/top/50

{
  "requested": 10,
  "returned": 10,
  "nodes": [
    {"nodeId": "41909", "pagerank": 445.71778597},
    ...
  ]
}

Edge cases

- n < 1 → returns 400 with an error message.
- n > 1000 → returns all 1000 available nodes with a note field explaining the cap.
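A quick way to exercise both edge cases from Python (a sketch; it only checks the HTTP status codes described above, and assumes the example master IP):

# Exercise the /top/<n> edge cases; only HTTP status codes are checked.
import urllib.request, urllib.error

BASE = "http://192.168.1.14:5000"
for n in (0, 10, 5000):
    try:
        print(n, "->", urllib.request.urlopen(f"{BASE}/top/{n}").getcode())
    except urllib.error.HTTPError as e:
        print(n, "->", e.code)  # expect 400 for n < 1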
3.5 Outgoing Edges (Neighbors)
GET /neighbors/<node_id> returns every node that the given node links to (outgoing edges, i.e. the node's adjacency list).
# Get all outgoing neighbors of the top-ranked node
curl http://192.168.1.14:5000/neighbors/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/neighbors/41909').read())
print(json.dumps(data, indent=2))
"{
"nodeId": "41909",
"direction": "outgoing",
"count": 42,
"neighbors": ["123", "456", "789", ...]
Sink nodes
Nodes with no outgoing edges (sink nodes) will return 404 with the message: This may be a sink node (has no outgoing edges — it receives links but doesn't link to anything). This is expected behaviour — sink nodes exist in the graph but have no entries in the adjacency list.
3.6 Incoming Edges (Influencers)
GET /influencedby/<node_id> returns every node that links to the given node (incoming edges). This is computed on first request by building a reverse-index from data/graph.json, then cached in memory for all subsequent calls — it does not rebuild on every request.
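A minimal sketch of that reverse-index build is shown below. The exact layout of data/graph.json and of the real api.py may differ; this assumes the file is a JSON object mapping each node ID to its list of outgoing neighbors.

# Sketch of the one-time reverse-index build behind /influencedby (assumed;
# this treats data/graph.json as {"node": ["neighbor", ...], ...}).
import json
from collections import defaultdict

with open("data/graph.json") as f:
    graph = json.load(f)

incoming = defaultdict(list)            # built once, then kept in memory
for src, dsts in graph.items():
    for dst in dsts:
        incoming[dst].append(src)

print(len(incoming.get("41909", [])), "nodes link to 41909")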
# Get all nodes that link to the top-ranked node
curl http://192.168.1.14:5000/influencedby/41909
# Python
python3 -c "
import urllib.request, json
data = json.loads(urllib.request.urlopen('http://192.168.1.14:5000/influencedby/41909').read())
print(f"Node {data['nodeId']{
"nodeId": "41909",
"direction": "incoming",
"count": 18,
"sources": ["789", "101", "202", ...]
3.7 Job Statistics
GET /stats returns the job metadata recorded when pagerank.py finished, plus live service information.
curl http://192.168.1.14:5000/stats
# Python
python3 -c "
import urllib.request, json
print(json.dumps(json.loads(urllib.request.urlopen('http://192.168.1.14:5000/stats').read()), indent=2))
"{
"dataset": "web-Google.txt",
"total_nodes": 875713,
"total_edges": 5105039,
"iterations": 10,
"damping_factor": 0.85,
"top_node": "41909",
"completed_at": "2026-05-10T12:34:56+00:00",
"api_status": "ok",
"endpoints": ["/top5", "/top/", "/node/", "/neighbors/", "/influencedby/", "/stats", "/rerun", "/health"]
3.8 Background Rerun
POST /rerun triggers a PageRank rerun in the background. It accepts optional parameters to change the iteration count, damping factor, or swap to an entirely different dataset from SNAP.
# Rerun with more iterations
curl -X POST http://192.168.1.14:5000/rerun \
-H 'Content-Type: application/json' \
-d '{"iterations": 15{
"job_id": "rerun_1746861234",
"status": "queued",
"params": {"iterations": 15Check job status
curl http://192.168.1.14:5000/rerun/status// idle (no rerun ever triggered)
{"status": "idle"dataset_url is provided, the API downloads the file, places it in HDFS, updates the metadata, then runs pagerank.py. The URL must point to a gzipped SNAP graph file (e.g. from https://snap.stanford.edu/data/). The file is downloaded locally, extracted, then pushed to HDFS so it is available to all cluster nodes./influencedby reflects the new graph.3.9If Something Fails
3.9 If Something Fails

API unreachable (connection refused / timeout)
Step 1: confirm basic network connectivity first.
# macOS / Linux:
ping -c 4 192.168.1.14
# Windows:
ping -n 4 192.168.1.14