Databricks Spark Application Starter Kit is a starter codebase for building and deploying Apache Spark applications on Databricks with production-ready structure and best practices.
> **Note:** In short, this starter kit helps you develop your Spark application locally, deploy it to Databricks Jobs with a single command, and schedule it to run periodically, using Databricks Connect, Databricks Unity Catalog, and Databricks Jobs (Python Wheel Task).
Run a sample job locally (make sure you have followed the Development Setup steps and set up the `.env` file):

```shell
spark_app --job_name sample_simple_job
```

Deploy the sample job to Databricks Jobs:

```shell
databricks_deploy --job_name sample_simple_job
```

> **Note:** You can find the sample job code in `src/databricks_spark_app/jobs/sample_simple_job.py`. This only deploys the job to Databricks Jobs; you still need to schedule it in the Databricks UI under **Schedules & Triggers**.
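For orientation, the `--job_name` value maps one-to-one to a module under `databricks_spark_app.jobs`. A minimal sketch of how such dispatch could work (hypothetical; the actual `pipeline.py` may differ):

```python
import importlib


def job_module_path(job_name: str) -> str:
    # The --job_name flag matches a file name (without .py) under jobs/.
    return f"databricks_spark_app.jobs.{job_name}"


def run_job(job_name: str) -> None:
    # Import the job module and call its pipeline() entry point.
    module = importlib.import_module(job_module_path(job_name))
    module.pipeline()
```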
```
databricks-spark-app-starter/
├── src/databricks_spark_app/
│   ├── jobs/                      # the folder you will work in most of the time
│   │   ├── sample_simple_job.py   # example Spark job; the file name is used as the job name
│   │   └── ...
│   ├── io/
│   │   ├── dataframe.py           # ManagedDataFrame class for schema and comment management
│   │   └── writer.py              # insert_overwrite function for writing data to tables
│   ├── pipeline.py                # entry point for running Spark jobs
│   ├── deploy.py                  # script to deploy jobs to Databricks
│   ├── utils.py                   # utility functions
│   └── config.py                  # configuration management
└── ...                            # other project files
```

- Python 3.12+
- Databricks account with access to a workspace
  - You can sign up for a free edition: https://www.databricks.com/learn/free-edition
  - Generate a personal access token (PAT): https://docs.databricks.com/aws/en/dev-tools/auth#generate-a-token
- `uv` for Python package management: https://docs.astral.sh/uv/installation
tl;dr:
Clone the repo, install dependencies, set up environment variables, and start coding in src/databricks_spark_app/jobs/.
> **Important:** If you plan to rename the `databricks_spark_app` folder in the `src` directory, make sure every reference in the codebase is updated accordingly by searching for `databricks_spark_app` (e.g. Ctrl+Shift+F in VS Code).
- Install dependencies:

  ```shell
  uv sync --all-groups --all-extras
  ```

  After that, a virtual environment will be created in the `.venv` directory.

- Prepare environment variables:

  - Copy `.env.example` to `.env`:

    ```shell
    cp .env.example .env
    ```

  - Fill in your Databricks host and token in the `.env` file:

    ```shell
    DATABRICKS_HOST=your-databricks-host
    DATABRICKS_TOKEN=your-databricks-token
    ```

  - Source the `.env` file if needed:

    ```shell
    source .env
    ```
- Develop your Spark application in the `src/databricks_spark_app/jobs/*.py` files, with a `pipeline` function as the entry point. Example:

  ```python
  # src/databricks_spark_app/jobs/sample_simple_job.py
  import logging

  from pyspark.sql import SparkSession, Window
  from pyspark.sql import functions as f

  logger = logging.getLogger(__name__)


  def pipeline():
      # ... your Spark job logic here ...
      ...
  ```
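One optional pattern worth considering here (a hypothetical sketch, not prescribed by the starter kit): keep row-level business logic in plain functions so it can be unit-tested without a Spark session, and call it from `pipeline()`:

```python
import logging

logger = logging.getLogger(__name__)


def build_greeting(name: str) -> str:
    # Pure business logic: testable with plain pytest, no Spark needed.
    return f"Hello, {name}!"


def pipeline():
    # The Spark-dependent wiring stays thin; the pure function is reused here.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f

    spark = SparkSession.getActiveSession()
    df = spark.createDataFrame([("Databricks",)], ["name"])
    df = df.withColumn("message", f.lit(build_greeting("Databricks")))
    df.show(truncate=False)
    logger.info("Job completed.")
```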
- Test your Spark application locally by passing the job name as a command-line argument. Example:

  ```shell
  spark_app --job_name sample_simple_job
  ```

  ```text
  2025-09-15 20:35:28,415 - databricks_spark_app.utils - INFO - Logging is set up.
  2025-09-15 20:35:32,612 - databricks_spark_app.utils - INFO - Initialized Spark 4.0.0 session.
  2025-09-15 20:35:32,612 - databricks_spark_app.utils - INFO - Running job: sample_simple_job
  +------------------+
  |           message|
  +------------------+
  |Hello, Databricks!|
  +------------------+
  2025-09-15 20:35:33,505 - databricks_spark_app.jobs.sample_simple_job - INFO - Sample job completed successfully.
  ```

  > **Note:** The job name should match the file name in `src/databricks_spark_app/jobs/`.
- Deploy your Spark application to Databricks:

  ```shell
  databricks_deploy --job_name sample_simple_job
  ```

  This will package your application and deploy it to Databricks Jobs. You can verify the deployment in the Databricks UI under **Jobs & Pipelines**. The job will be deployed as a Python Wheel Task and will run on the latest Databricks Runtime with Spark 4.0+, which supports Python 3.12+ and treats the job like a DAG in Airflow.
- Schedule your job under **Schedules & Triggers** in the Databricks Jobs UI. You can also trigger the job manually from the UI or schedule it via a cron expression.
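The UI is the path this kit documents. As an alternative sketch, the Databricks Jobs API 2.1 (`POST /api/2.1/jobs/update`) also accepts a Quartz cron schedule; the payload below follows the public Jobs API field names, but verify them against the current docs before relying on this:

```python
def cron_schedule_payload(job_id: int, quartz_cron: str, timezone_id: str = "UTC") -> dict:
    # Request body for POST /api/2.1/jobs/update that attaches a cron schedule.
    # quartz_cron uses Quartz syntax, e.g. "0 0 2 * * ?" = every day at 02:00.
    return {
        "job_id": job_id,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": quartz_cron,
                "timezone_id": timezone_id,
                "pause_status": "UNPAUSED",
            }
        },
    }
```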
This starter kit provides two ways to manage writing data into Unity Catalog or Hive Metastore tables:
- Direct functional API with `insert_overwrite`
- Class-based abstraction with `ManagedDataFrame`
Choose whichever fits your workflow best.
When to use: if your jobs are simple and you just need to persist results.
If you already have a Spark DataFrame and want to write it to a managed table with schema enforcement and comments:
```python
import logging

from pyspark.sql import SparkSession
from pyspark.sql import types as t

from databricks_spark_app.io.writer import insert_overwrite

logger = logging.getLogger(__name__)


def pipeline():
    spark = SparkSession.getActiveSession()
    df = spark.sql("""
        SELECT
            'Hello, Databricks!' AS message,
            CAST(CURRENT_DATE() AS STRING) AS part_date
    """)
    df.show(truncate=False)

    force_schema = t.StructType([
        t.StructField("message", t.StringType(), nullable=False),
        t.StructField("part_date", t.StringType(), nullable=False),
    ])

    spark.sql("CREATE DATABASE IF NOT EXISTS temp_db")

    insert_overwrite(
        fqtn="temp_db.hello_table",
        spark_df=df,
        force_schema=force_schema,
        table_comment="Sample table for insert overwrite demonstration",
        column_comments={
            "message": "A greeting message",
            "part_date": "Partition date",
        },
        partition_by=["part_date"],
    )
    logger.info("Sample job completed successfully.")
```

When to use: if your jobs are complex and you want to encapsulate logic, schema, and comments in one place.
- You want strict schema management (`table_schema` is required).
- You want column-level documentation via `column_comments`.
- You want to follow a standardized job template.
You can extend `ManagedDataFrame` to define schema, comments, and transformation logic in one place.
This makes jobs self-describing, reusable, and testable.
```python
import logging

from pyspark.sql import SparkSession
from pyspark.sql import types as t

from databricks_spark_app.io.dataframe import ManagedDataFrame

logger = logging.getLogger(__name__)


class SampleHelloTable(ManagedDataFrame):
    table_comment = "Sample table for insert overwrite demonstration"
    column_comments = {
        "message": "A greeting message",
        "part_date": "Partition date",
    }
    table_schema = t.StructType([
        t.StructField("message", t.StringType(), nullable=False),
        t.StructField("part_date", t.StringType(), nullable=False),
    ])

    def process(self):
        spark = SparkSession.getActiveSession()
        return spark.sql("""
            SELECT
                'Hello, Databricks! From ManagedDataFrame' AS message,
                CAST(CURRENT_DATE() AS STRING) AS part_date
        """)


def pipeline():
    job = SampleHelloTable()
    job.insert_overwrite(fqtn="temp_db.hello_table", partition_by=["part_date"])
    logger.info("Sample job completed successfully.")
```

| Feature | `insert_overwrite` (function) | `ManagedDataFrame` (class) |
|---|---|---|
| Minimal boilerplate | Yes | No, requires a class definition |
| Schema enforcement | Manual | Automatic (table_schema) |
| Column-level documentation | Manual | Automatic (column_comments) |
| Best for | Simple jobs | Complex/production jobs |
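Both paths ultimately apply the same table and column comments. As a rough illustration of what that can look like in SQL (a hypothetical sketch; the actual `writer.py` implementation may differ), the comments map naturally to plain DDL statements:

```python
def comment_ddl(fqtn: str, table_comment: str, column_comments: dict) -> list:
    # Generate Spark/Databricks SQL DDL that applies a table comment and
    # per-column comments to an existing table (hypothetical helper).
    stmts = [f"COMMENT ON TABLE {fqtn} IS '{table_comment}'"]
    for col, comment in column_comments.items():
        stmts.append(f"ALTER TABLE {fqtn} ALTER COLUMN {col} COMMENT '{comment}'")
    return stmts
```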
You can define additional job parameters in `DatabricksAdditionalParams` in `config.py`. These parameters are automatically added as command-line arguments when running the job locally or in Databricks Jobs. You can then access them in your job code via Spark SQL variables with the syntax `` `params.<param_name>` ``. Note that the delimiter is a backtick (`` ` ``), not a single quote (').
Example:

```python
spark.sql("""
    SELECT *
    FROM some_table
    WHERE part_date = `params.run_date`
""")
```

When running locally:

```shell
spark_app --job_name sample_run_date_job --run_date "2025-09-01"
```

Because the Spark Connect session doesn't support setting or changing Spark configurations, the only way to pass parameters is Spark SQL variables (https://spark.apache.org/docs/4.0.0/sql-ref-syntax-ddl-declare-variable.html).
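To make the mechanism concrete: a CLI parameter can be turned into a session variable with `DECLARE OR REPLACE VARIABLE` before the job runs. The helper below is a hypothetical sketch of that translation (the actual `pipeline.py` may differ); the backticks make `params.run_date` a single quoted identifier rather than a schema-qualified name:

```python
def declare_param_sql(name: str, value: str) -> str:
    # Build the Spark SQL statement that exposes a CLI parameter as a
    # session variable readable as `params.<name>` (hypothetical helper).
    escaped = value.replace("'", "''")  # escape single quotes for the SQL string literal
    return f"DECLARE OR REPLACE VARIABLE `params.{name}` STRING DEFAULT '{escaped}'"
```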
In the src/databricks_spark_app/jobs folder, you will find several example jobs demonstrating different features:
```
├── jobs
│   ├── sample_insert_overwrite.py
│   ├── sample_managed_table.py
│   ├── sample_run_date_job.py
│   └── sample_simple_job.py
```
Just replace/delete these example jobs with your own job files as needed.
- Databricks Jobs: https://docs.databricks.com/aws/en/jobs
- Databricks Connect For Python: https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/
- PySpark Documentation: https://spark.apache.org/docs/4.0.0/api/python/index.html


