Spark with Hadoop on Docker

A production-like Spark + Hadoop + Hive stack deployed on Docker, aligned with the open-source Spark versions used by Databricks Runtime. Install on any machine in just a few minutes with a single command.

Docker · Apache Spark · Hadoop HDFS · Apache Hive · Shell Script

Introduction

Spark with Hadoop Anywhere provides a production-like Spark + Hadoop + Hive stack in Docker containers that closely mirrors real-world Spark environments, and it comes up with one command so you can develop, test, and demo locally.

Each branch in this project corresponds to a specific Spark / Scala / Java combination (aligned with the Spark versions used by Databricks Runtime), giving you a portable environment for local development, testing, and demos.



User Challenges

Below are some real-world challenges users face:

Third-Party Library Compatibility

Some libraries work well with open-source Apache Spark but fail on Databricks. How can developers reproduce and debug these issues locally?

Behavioral Differences

How can we reproduce and compare behavioral differences between Databricks and on-premises Spark environments, especially when using Delta Lake or other extensions?

Feature Development and Testing

Developers often need to build new features or integrations (e.g., Delta, MongoDB, Redshift) on open-source Spark and run minimal tests before production deployment.

End-to-End ETL Pipeline Development

How can I implement a complete ETL pipeline (including Airflow, Kafka, and Spark) with minimal data locally for experimentation and testing?

Regression Testing Across Spark Versions

If an issue occurs in Spark 4.0.0 on-premises, how can I easily test it locally and verify whether it also exists in Spark 3.5.x?

Learning and Enablement

How can I get a cluster-like local environment to learn Spark concepts and experiment safely?


Motivation

Modern data platforms (Databricks, EMR, on-prem Hadoop) are complex, version-specific stacks in which Spark runs alongside distributed storage and a metastore.

Typical local setups (e.g., just spark-shell on a laptop or a single generic Spark image) lack that realism: no HDFS, no Hive Metastore, and no guarantee that the Spark / Scala / Java versions match your target platform.

Spark with Hadoop Anywhere bridges that gap by providing low-friction, version-accurate, reproducible stacks you can run anywhere Docker is installed.

Design Goals

The project is built around these core principles:

1. Version Fidelity

2. Minimal but Realistic

3. Quick Setup

4. Isolation

5. Extensibility


Architecture

Each branch provides two deployment modes with version-specific artifacts:

Spark with Hadoop Anywhere Architecture

Deployment Modes

Single-Node Mode (Default)

A single all-in-one container with Spark + HDFS + Hive for quick development and testing.

Containers:

    • spark - the all-in-one container running Spark, HDFS, and Hive
    • hive_metastore - the PostgreSQL-backed Hive Metastore

Use Cases:

    • Quick local development and testing
    • Learning Spark, HDFS, and Hive concepts
    • Building minimal reproducible examples

Multi-Node Mode

A distributed Spark Standalone cluster with master and worker nodes for realistic production-like scenarios.

Containers:

    • spark-master - the Spark Standalone master
    • spark-worker-1 and spark-worker-2 - Spark workers
    • hive_metastore - the PostgreSQL-backed Hive Metastore

Use Cases:

    • Production-like testing of distributed jobs
    • Experimenting with executor and parallelism settings
    • Comparing cluster behavior with single-node runs

Component Details

Spark

Hadoop (HDFS)

Hive


DBR underlying Spark OSS Compatible Branches

Branches are curated to align with Databricks Runtime (DBR) and the underlying OSS Spark versions. Click a branch name to jump to that branch in the repository.

DBR Version   Spark OSS Version   Scala Version   Java Version   Compatible Branch in the Repository
13.3          3.4.1               2.12            8              spark-3.4.1
14.3          3.5.0               2.12            8              spark-3.5.0
15.4          3.5.0               2.12            8              spark-3.5.0
16.4          3.5.2               2.12            17             spark-3.5.2-scala-2.12
16.4          3.5.2               2.13            17             spark-3.5.2-scala-2.13
17.x          4.0.0               2.13            17             spark-4.0.0

Tip: Use the closest match to your target DBR. For binary compatibility (especially for UDFs, UDAFs, and custom libs), ensure the Scala version also matches.


What makes this different from other Repos

There are many Spark Docker images, but this project specifically targets data platform engineers, SREs, support engineers, and developers working on real-world data analytics pipelines.

1. Complete Stack, Not Just Spark

You get a full analytics node: Spark + HDFS + Hive Metastore + CLI tooling, which is crucial for debugging issues that only show up when storage and the metastore are part of the picture.

2. Version-Driven Branches

3. Reproducibility-First Design

4. Pre-built Docker Images

All Spark/Hadoop/Hive combinations are pre-built and available on DockerHub: docker4ops/spark-with-hadoop


Some Use Cases

1. Reproducing OSS vs. DBR Behavior Locally

Problem

You hit a bug on DBR 16.4 and need a deterministic local environment in which to reproduce and debug it against the matching OSS Spark version.

Solution

  1. Check out the branch mapped to Spark 3.5.2 / Scala 2.12 or 2.13 / Java 17
  2. Spin up the stack with setup-spark.sh
  3. Load synthetic/anonymized data into HDFS/Hive
  4. Run the same job logic and compare behavior
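
A minimal sketch of that workflow for the DBR 16.4 case, assuming the container runs as root, that /opt/data exists in the image (it is used in the HDFS examples later), and that events.parquet, my-job.jar, and com.example.MyApp are hypothetical names for your own data and job artifact:

# 1. Check out the branch matching DBR 16.4 (Spark 3.5.2 / Scala 2.13 / Java 17)
git clone -b spark-3.5.2-scala-2.13 https://github.com/AnudeepKonaboina/spark-with-hadoop-anywhere.git
cd spark-with-hadoop-anywhere/

# 2. Spin up the single-node stack
sh setup-spark.sh --run

# 3. Load the synthetic/anonymized data into HDFS
docker cp events.parquet spark:/opt/data/events.parquet
docker exec spark hdfs dfs -mkdir -p /user/root
docker exec spark hdfs dfs -put /opt/data/events.parquet /user/root/

# 4. Run the same job logic and compare the behavior with DBR
docker cp my-job.jar spark:/opt/my-job.jar
docker exec spark spark-submit --class com.example.MyApp /opt/my-job.jar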

2. Validating Cross-Version Behavior

Problem

You are upgrading DBR (or plain OSS Spark) and need to understand how behavior, query plans, and performance change across versions.

Solution

  1. Run the same workload against multiple branches (e.g., spark-3.4.1 vs spark-3.5.2-scala-2.13)
  2. Compare:
    • Query plans
    • Logs and metrics
    • Output correctness
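
As a sketch, the plan comparison can be done with the spark-sql CLI using branch names from the table above; the query here is arbitrary (it only uses the built-in range() table function), so swap in your own workload and repeat the same idea for logs and output:

# Capture the plan on the first version
git checkout spark-3.4.1 && sh setup-spark.sh --run
docker exec spark spark-sql -e "EXPLAIN EXTENDED SELECT id % 3 AS k, count(*) FROM range(100) GROUP BY k" > plan-3.4.1.txt
sh setup-spark.sh --stop

# Capture the plan on the second version
git checkout spark-3.5.2-scala-2.13 && sh setup-spark.sh --run
docker exec spark spark-sql -e "EXPLAIN EXTENDED SELECT id % 3 AS k, count(*) FROM range(100) GROUP BY k" > plan-3.5.2.txt
sh setup-spark.sh --stop

# Diff the two plans
diff plan-3.4.1.txt plan-3.5.2.txt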

3. Minimal Reproducible Examples (MREs)

Problem

You want to open a GitHub issue or vendor ticket and must provide a minimal, reproducible example in a clearly defined environment.

Solution

  1. Use this repo + branch as the environment contract
  2. Share:
    • Branch name
    • setup-spark.sh invocation
    • A small dataset and job script
  3. Others can clone the same branch and reproduce the behavior exactly

4. Training and Onboarding

Problem

New team members need a safe environment in which to learn Spark, HDFS, and Hive and experiment without touching shared clusters.

Solution

  1. One command brings up a single-node analytics stack
  2. Nothing is shared; you can destroy and recreate at will
  3. Ideal for internal training or "Spark archeology" on older versions

Getting Started

Prerequisites

Make sure you have these tools installed by following the installation steps in the README, then verify the installation:

docker --version
docker-compose version
git --version

Step-1: Clone and Choose a Branch

Choose a branch based on the Spark version you want to install. Refer to the table in DBR underlying Spark OSS Compatible Branches and pick the branch that matches your Spark version.

git clone -b spark-3.5.0 https://github.com/AnudeepKonaboina/spark-with-hadoop-anywhere.git && cd spark-with-hadoop-anywhere/

Step-2: Configure Secrets (hive metastore password)

mkdir -p secrets
echo "<your_strong_password_here>" > secrets/postgres_password.txt

Step-3: Run the Setup Script

There are two ways of running the setup script, and you can optionally choose between single-node and multi-node deployment.

Cluster Mode Options:

Specify the mode with --node-type {single|multi} (defaults to single if omitted)

Option A: Use pre-built images

All images are pre-built and available on DockerHub: docker4ops/spark-with-hadoop

# Single-node (default)
sh setup-spark.sh --run

# Multi-node cluster
sh setup-spark.sh --run --node-type multi

This pulls pre-built images and starts the stack in seconds. Perfect for quick testing and development.

Option B: Build images locally

# Single-node (default)
sh setup-spark.sh --build --run

# Multi-node cluster
sh setup-spark.sh --build --run --node-type multi

Build from source if you want to customize the Dockerfile or add additional packages.

This builds the Docker images locally and then starts the stack.

Step-4: Verify Running Containers

Once the setup is completed, verify the running containers:

docker ps

Single-node deployment:

You should see 2 containers: spark and hive_metastore.

Example output:

CONTAINER ID   IMAGE                     COMMAND                  PORTS                              NAMES
1af5afd31789   spark-with-hadoop:local   "/usr/local/bin/star…"   0.0.0.0:4040-4041->4040-4041/tcp   spark
c8c3e725a73c   hive-metastore:local      "docker-entrypoint.s…"   5432/tcp                           hive_metastore

Multi-node deployment:

You should see 4 containers: spark-master, spark-worker-1, spark-worker-2, and hive_metastore.

Example output:

CONTAINER ID   IMAGE                        COMMAND                  PORTS                              NAMES
18bd26ade9ac   spark-with-hadoop:local      "bash -lc..."            0.0.0.0:7077->7077/tcp             spark-master
973ee17a76e8   spark-with-hadoop:local      "bash -lc..."            8081/tcp                           spark-worker-1
60e52fdc6bc5   spark-with-hadoop:local      "bash -lc..."            8081/tcp                           spark-worker-2
12fcb76b3af2   hive-metastore:local         "docker-entrypoint.s…"   5432/tcp                           hive_metastore

How to use

Connect to the Spark container

Single-node:

docker exec -it spark bash

Multi-node (connect to master):

docker exec -it spark-master bash

Spark

Single-node Mode

Start a Spark shell in local mode:

# Scala shell
spark-shell

# Python shell
pyspark

# Submit a spark job
spark-submit --class com.example.MyApp my-app.jar
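
If you just want a quick sanity check without your own jar, the examples bundled with the standard Spark binary distribution should be present under ${SPARK_HOME}; a minimal smoke test might look like this:

# Run the bundled SparkPi example locally as a smoke test
spark-submit --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_*.jar 10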

Multi-node Mode

Connect to the Spark cluster master:

# Scala shell connected to cluster
spark-shell --master spark://hadoop.spark:7077

# Python shell connected to cluster
pyspark --master spark://hadoop.spark:7077

# Submit a job to the cluster
spark-submit --master spark://hadoop.spark:7077 \
  --class com.example.MyApp \
  my-app.jar

# Control parallelism
spark-shell --master spark://hadoop.spark:7077 \
  --executor-cores 1 \
  --total-executor-cores 2

Access Web UIs

In the single-node deployment, ports 4040-4041 are published to the host, so the Spark UI of a running application is available at http://localhost:4040 (and 4041 for a second concurrent application).

HDFS

Use the HDFS CLI inside the container:

hdfs dfs -ls /
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -put /opt/data/sample.parquet /user/$(whoami)/
hdfs dfs -cat /user/$(whoami)/sample.parquet | head
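
Since cat-ing a Parquet file only shows raw bytes, a more useful check is to query the file straight from Spark SQL, which can read a Parquet path directly. The path below assumes you are running as root and loaded sample.parquet as above:

# Inspect the Parquet file on HDFS via Spark SQL instead of raw cat
spark-sql -e 'SELECT * FROM parquet.`/user/root/sample.parquet` LIMIT 5'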

Hive

Use Hive CLI or Beeline:

# Hive CLI
hive

# Beeline (JDBC)
beeline -u jdbc:hive2://localhost:10000/default

Example queries you can run:

-- Create a table
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  department STRING
) STORED AS PARQUET;

-- Query from Spark
SELECT * FROM employees LIMIT 10;
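
The "Query from Spark" comment above works because Spark in this stack points at the same Hive Metastore. Assuming you created the employees table through Hive, you can check that Spark sees it too:

# The table created via Hive is visible to Spark through the shared metastore
spark-sql -e 'SELECT * FROM employees LIMIT 10'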

Extending the Stack

Common extension patterns:

Add Custom Jars

# Extend the Dockerfile
FROM anudeepkonaboina/spark-hadoop-standalone:spark-3.5.2

COPY custom-jars/*.jar ${SPARK_HOME}/jars/
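
You would then build and tag the extended image yourself (the tag below is only an example) and point your compose file at it:

# Build the extended image from the directory containing this Dockerfile
docker build -t spark-with-hadoop:custom .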

Mount Host Directories

# docker-compose.yml
services:
  spark:
    volumes:
      - ./data:/opt/data
      - ./jars:/opt/jars

Add more services to extend the stack

You can add more data engineering services to the docker-compose file and build a complete end-to-end data engineering stack on Docker:


services:
  spark:
    volumes:
      - ./data:/opt/data
      - ./jars:/opt/jars
  kafka:
    # ...
  hbase:
    # ...
  airflow:
    # ...


Project Layout

spark-with-hadoop-anywhere/
├── docker-compose.yml          # Multi-node orchestration
├── docker-compose.single.yml   # Single-node orchestration
├── setup-spark.sh              # Entry script (supports --node-type)
├── spark-hadoop-standalone/
│   ├── Dockerfile              # Spark/Hadoop/Hive image
├── hive-metastore/
│   └── Dockerfile              # Hive Metastore image
├── configs/                    # Shared configuration files
├── scripts/
│   └── start-services.sh       # Service initialization script
├── secrets/                    # (Git-ignored) secret files

Limitations


Cleanup

When you’re done testing, you can stop and remove all containers with a single command:

sh setup-spark.sh --stop

This stops and removes all containers started by the stack, so you can bring it up fresh next time.


Author

Anudeep Konaboina


If this project helps you debug a tricky Spark/Hive/HDFS issue or reproduce a DBR bug, please star the repository!


© 2025 Anudeep Konaboina | Licensed under Apache 2.0