Get started with Apache PySpark
Prerequisites
You need the following:
- Git
- Java 8 or 11. You can download the Java SE Development Kit from http://www.oracle.com/technetwork/java/javase/downloads/ or use the OpenJDK version from https://openjdk.java.net/install/index.html.
Project structure
The basic project structure is as follows:
root/
|-- tools/
|   |-- spark.py
|   |-- utils.py
|   |-- processing.py
|-- Makefile
|-- package.sh
|-- requirements.txt
|-- spark_submit.py
The main Python module that runs the job (and is sent to the Spark cluster) is spark_submit.py, which looks like this:
"""
spark_submit.py
"""
import logging
from tools.processing import run
logging.basicConfig(level=logging.DEBUG)
def main():
"""Main script definition.
:return: None
"""
logging.debug('Run Spark job!')
run()
# entry point for PySpark
if __name__ == "__main__":
main()
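The run function imported above lives in tools/processing.py, which is not shown here. A minimal sketch of what it might contain follows; the app name and the DataFrame logic are purely illustrative, not part of the original project:

```python
"""
tools/processing.py -- illustrative sketch only.
"""
import logging


def run():
    """Build a SparkSession and execute a trivial job."""
    # imported lazily so the module can be loaded without pyspark installed
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pyspark-examples")  # hypothetical app name
             .getOrCreate())
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    logging.debug("Row count: %d", df.count())
    spark.stop()
```

Note that pyspark itself is not imported at module level; the SparkSession is created inside run(), so spark_submit.py can be imported and inspected on a machine without Spark installed.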
Packaging dependencies
Create an archive containing your modules and the dependencies of your project (excluding Spark and Py4J, which the cluster already provides). Here is an example script:
#!/usr/bin/env bash
export PACKAGE_NAME=pyspark-examples
export MODULE=tools
echo "Packing the dependencies"
rm -rf ./$PACKAGE_NAME $PACKAGE_NAME.zip
grep -v '^ *#\|^pyspark\|^py4j' requirements.txt > .requirements-filtered.txt
pip install -t ./$PACKAGE_NAME -r .requirements-filtered.txt
rm .requirements-filtered.txt
# check to see if there are any external dependencies
# if not then create an empty file to seed zip with
if [ -z "$(ls -A $PACKAGE_NAME)" ]
then
    touch $PACKAGE_NAME/empty.txt
fi
cd $PACKAGE_NAME
zip -9mrv $PACKAGE_NAME.zip .
mv $PACKAGE_NAME.zip ..
cd ..
echo "Add all modules from local"
zip -ru9 $PACKAGE_NAME.zip $MODULE -x $MODULE/__pycache__/\*
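This packaging works because Python can import modules directly from a zip archive: Spark ships the archive to each executor and puts it on the PYTHONPATH, and imports then resolve against the zip's contents. The mechanism can be demonstrated locally; the archive path and the demo_tool module below are hypothetical stand-ins for the real package:

```python
import os
import sys
import tempfile
import zipfile

# build a tiny archive containing one module, mimicking the package.sh output
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "pyspark-examples.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_tool.py", "def run():\n    return 'Run Spark job!'\n")

# executors do the equivalent of this when the zip is shipped to them
sys.path.insert(0, archive)
import demo_tool

print(demo_tool.run())  # prints: Run Spark job!
```

To run the job on a cluster, pass the archive alongside the entry-point module, e.g. spark-submit --py-files pyspark-examples.zip spark_submit.py.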