Get started with Apache PySpark
Prerequisites
You need the following:
- Git
- Java 8 or 11. You can download the Java SE Development Kit from http://www.oracle.com/technetwork/java/javase/downloads/ or use the OpenJDK version from https://openjdk.java.net/install/index.html.
Project structure
The basic project structure is as follows:
root/
|-- tools/
|   |-- spark.py
|   |-- utils.py
|   |-- processing.py
|-- Makefile
|-- package.sh
|-- requirements.txt
|-- spark_submit.py
The main Python module that runs the job (and is sent to the Spark cluster) is spark_submit.py, which looks like this:
"""
spark_submit.py
"""
import logging
from tools.processing import run
logging.basicConfig(level=logging.DEBUG)
def main():
"""Main script definition.
:return: None
"""
logging.debug('Run Spark job!')
run()
# entry point for PySpark
if __name__ == "__main__":
main()
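The run function imported above lives in tools/processing.py, which is not shown here. A minimal sketch of what it might contain follows; the app name and the DataFrame logic are purely illustrative, not part of the original project:

```python
"""
tools/processing.py -- illustrative sketch only.
"""
import logging


def run():
    """Build a SparkSession and execute a trivial job."""
    # imported lazily so the module can be loaded without pyspark installed
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pyspark-examples")  # hypothetical app name
             .getOrCreate())
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    logging.debug("Row count: %d", df.count())
    spark.stop()
```

Note that pyspark itself is not imported at module level; the SparkSession is created inside run(), so spark_submit.py can be imported and inspected on a machine without Spark installed.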
Packaging dependencies
Create an archive containing your modules and the dependencies of your project (excluding Spark and Py4J, which the cluster already provides). Here is an example script:
#!/usr/bin/env bash
export PACKAGE_NAME=pyspark-examples
export MODULE=tools
echo "Packing the dependencies"
rm -rf ./$PACKAGE_NAME $PACKAGE_NAME.zip
grep -v '^ *#\|^pyspark\|^py4j' requirements.txt > .requirements-filtered.txt
pip install -t ./$PACKAGE_NAME -r .requirements-filtered.txt
rm .requirements-filtered.txt
# check to see if there are any external dependencies
# if not then create an empty file to seed zip with
if [ -z "$(ls -A $PACKAGE_NAME)" ]
then
    touch $PACKAGE_NAME/empty.txt
fi
cd $PACKAGE_NAME
zip -9mrv $PACKAGE_NAME.zip .
mv $PACKAGE_NAME.zip ..
cd ..
echo "Add all modules from local"
zip -ru9 $PACKAGE_NAME.zip $MODULE -x $MODULE/__pycache__/\*
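This packaging works because Python can import modules directly from a zip archive: Spark ships the archive to each executor and puts it on the PYTHONPATH, and imports then resolve against the zip's contents. The mechanism can be demonstrated locally; the archive path and the demo_tool module below are hypothetical stand-ins for the real package:

```python
import os
import sys
import tempfile
import zipfile

# build a tiny archive containing one module, mimicking the package.sh output
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "pyspark-examples.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_tool.py", "def run():\n    return 'Run Spark job!'\n")

# executors do the equivalent of this when the zip is shipped to them
sys.path.insert(0, archive)
import demo_tool

print(demo_tool.run())  # prints: Run Spark job!
```

To run the job on a cluster, pass the archive alongside the entry-point module, e.g. spark-submit --py-files pyspark-examples.zip spark_submit.py.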