
Databricks Jobs

Recently I successfully deployed my Python wheel to a Databricks cluster. Here are some tips if you plan to deploy a PySpark application:

  • pyspark project
  • pytest

pyspark project

My previous Spark project was Scala based, and I used IntelliJ IDEA to compile and test it conveniently.

😄😄😄

The nice Databricks Jobs UI saves you time when creating a JAR job.

Here is the official guide: Databricks Wheel Job

What I did:

  1. Initialize a Python project

    # create a Python virtual environment
    python -m venv pyspark_venv
    
    # activate your venv
    source pyspark_venv/bin/activate
    
    # check your current python
    which python
    
    # install Python libraries ("build" is needed later to build the wheel)
    pip install uv ruff pyspark pytest wheel build
    
    ## if pip fails with a proxy error,
    ## add your proxy:
    ## --proxy http://proxy:port
    
    # create your project
    uv init --package <your package name>
    

    After the uv command completes, a nice Python project is created:

    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            └── __init__.py
    
  2. ❗ PySpark entry point

    • add a file __main__.py under src/pyspark_app
    • modify [project.scripts] in pyproject.toml; this is the entry point of the Databricks job (a minimal sketch follows after the tree below)

    Now the project looks like:

    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            ├── __init__.py
            └── __main__.py
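
    In pyproject.toml, the [project.scripts] entry could then be something like pyspark-app = "pyspark_app.__main__:main"; the script name pyspark-app and the function name main here are my assumptions, not taken from the original project. A minimal sketch of __main__.py under those assumptions:

    # src/pyspark_app/__main__.py -- illustrative sketch, not the original file
    from pyspark.sql import SparkSession


    def main() -> None:
        # On a Databricks cluster, getOrCreate() attaches to the session
        # the platform already provides for the job.
        spark = SparkSession.builder.appName("pyspark-app").getOrCreate()
        spark.range(10).show()


    if __name__ == "__main__":
        main()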
    

pytest

Make sure pytest is installed. Let's create a new test package:

pyspark-app
├── README.md
├── pyproject.toml
└── src
    └── pyspark_app
        ├── __init__.py
        ├── __main__.py
        └── test
            ├── __init__.py
            ├── conftest.py
            └── test_spark.py
test_spark.py:
def test_spark(init_spark):
    spark = init_spark
    df = spark.range(10)
    df.show()

""" output
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/01 20:59:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PASSED                                         [100%]
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
"""

Now you can work on your Spark application with tests.

wheel file

The final step is building the wheel file:

# 1. change your working directory to the folder containing pyproject.toml
# 2. run the command below (this needs the "build" package installed earlier)
python -m build --wheel

The project now looks like:

pyspark-app
├── README.md
├── build
│   ├── bdist.macosx-12.0-x86_64
│   └── lib
│       └── pyspark_app
│           ├── __init__.py
│           ├── __main__.py
│           └── test
│               ├── __init__.py
│               ├── conftest.py
│               └── test_spark.py
├── dist
│   └── pyspark_app-0.1.0-py3-none-any.whl
├── pyproject.toml
└── src
    ├── pyspark_app
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── test
    │       ├── __init__.py
    │       ├── conftest.py
    │       └── test_spark.py
    └── pyspark_app.egg-info
        ├── PKG-INFO
        ├── SOURCES.txt
        ├── dependency_links.txt
        ├── entry_points.txt
        └── top_level.txt
Your wheel file is in the dist folder: dist/pyspark_app-0.1.0-py3-none-any.whl
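
Optionally, you can sanity-check the wheel before uploading it to Databricks; a quick sketch (the path matches the tree above):

# list the wheel contents to confirm __main__.py and the entry-point
# metadata (entry_points.txt inside the .dist-info folder) were packaged
import zipfile

with zipfile.ZipFile("dist/pyspark_app-0.1.0-py3-none-any.whl") as whl:
    for name in whl.namelist():
        print(name)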

View the full project at Project template