Get started with Apache PySpark

Prerequisites

You need the following:

  • Git
  • Java 8 or 11. You can download the Java SE Development Kit from http://www.oracle.com/technetwork/java/javase/downloads/ or use the OpenJDK build from https://openjdk.java.net/install/index.html.

Project structure

The basic project structure is as follows:

root/
|-- tools/
|   |-- spark.py
|   |-- utils.py
|   |-- processing.py
|-- Makefile
|-- package.sh
|-- requirements.txt
|-- spark_submit.py
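
The requirements.txt file lists the job's third-party Python dependencies. Its exact contents are project-specific; a hypothetical example (the pinned packages below are placeholders) might look like:

# requirements.txt (hypothetical example)
pandas==1.5.3
requests==2.31.0
# pyspark may be listed for local development; the packaging script below strips it
pyspark==3.3.1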

The main Python module that runs the job (and is submitted to the Spark cluster) is spark_submit.py. It looks like this:

"""
spark_submit.py
"""
import logging

from tools.processing import run

logging.basicConfig(level=logging.DEBUG)


def main():
    """Main script definition.
    :return: None
    """
    logging.debug('Run Spark job!')
    run()


# entry point for PySpark
if __name__ == "__main__":
    main()
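
The run() function imported above lives in tools/processing.py, which this tutorial does not list. A minimal hypothetical sketch (the application name and the DataFrame logic are placeholders, not the tutorial's actual code) could look like:

"""
tools/processing.py (hypothetical sketch)
"""
import logging

from pyspark.sql import SparkSession


def run():
    """Build the SparkSession and execute the job logic.
    :return: None
    """
    spark = SparkSession.builder.appName("pyspark-examples").getOrCreate()

    # placeholder logic: build a tiny DataFrame and log its row count
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    logging.info("Row count: %s", df.count())

    spark.stop()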

Packaging dependencies

Create an archive containing your project's dependencies (excluding Spark and Py4J) together with your own modules. Here is an example script:

#!/usr/bin/env bash

export PACKAGE_NAME=pyspark-examples
export MODULE=tools

echo "Packing the dependencies"
rm -rf ./$PACKAGE_NAME $PACKAGE_NAME.zip
# drop comment lines and any pyspark/py4j entries; the cluster already provides Spark and Py4J
grep -v '^ *#\|^pyspark\|^py4j' requirements.txt > .requirements-filtered.txt
pip install -t ./$PACKAGE_NAME -r .requirements-filtered.txt
rm .requirements-filtered.txt

# check to see if there are any external dependencies
# if not then create an empty file to seed zip with
if [ -z "$(ls -A $PACKAGE_NAME)" ]
then
    touch $PACKAGE_NAME/empty.txt
fi

cd $PACKAGE_NAME
# -9: best compression, -m: move files into the archive, -r: recurse, -v: verbose
zip -9mrv $PACKAGE_NAME.zip .
mv $PACKAGE_NAME.zip ..
cd ..

echo "Add all modules from local"
zip -ru9 $PACKAGE_NAME.zip $MODULE -x $MODULE/__pycache__/\*
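
The resulting pyspark-examples.zip now contains both the filtered third-party dependencies and the tools package. On a typical Spark deployment this archive is shipped alongside spark_submit.py when the job is submitted (for example via spark-submit's --py-files option, or the equivalent artefact setting when you create the job on the platform), so that your modules can be imported on the driver and the executors.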