PySpark – SparkContext Example

Python PySpark – SparkContext

SparkContext is the entry point to any Spark application.

In this tutorial, we start with a basic example that creates a SparkContext, and then look at the class in more detail through its syntax and parameters.

Example – PySpark SparkContext

A simple example to create SparkContext with PySpark is:

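# import the SparkContext class
from pyspark import SparkContext

# create a SparkContext with a local master and an application name
sc = SparkContext("local", "My First Spark Application")

print("SparkContext :", sc)

# release the context's resources when done
sc.stop()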

First, we import the SparkContext class from the pyspark package. Then we create a Spark context with local as the master and My First Spark Application as the application name. If Spark is installed on your own computer and you are trying out this example there, you can keep the master as local. Otherwise, if the Spark master daemon is running on another computer in the cluster, provide the URL of that master (for a standalone cluster, a URL of the form spark://host:port).
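The choice of master described above can be sketched as a small helper. This is only an illustration: the function name master_url and its parameters are hypothetical, and the example assumes a standalone cluster listening on port 7077 (the usual default) when a host is given.

```python
def master_url(host=None, port=7077, threads=None):
    """Build a master string to pass as SparkContext's first argument.

    host    -- hypothetical hostname of a standalone Spark master; None means local mode
    port    -- port of the standalone master (7077 is assumed here)
    threads -- number of local worker threads, or "*" for one per core
    """
    if host is not None:
        # Remote standalone cluster: spark://host:port
        return f"spark://{host}:{port}"
    if threads is None:
        # Plain "local" runs Spark with a single worker thread
        return "local"
    # local[N] or local[*] runs Spark locally with several threads
    return f"local[{threads}]"
```

The result can be passed straight to the constructor, for example SparkContext(master_url(threads="*"), "My First Spark Application").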

To run the above application, save the file as pyspark_example.py and run the following command at the command prompt.

C:\workspace\python> spark-submit pyspark_example.py

You should not see any errors that stop the Spark driver. Among the verbose log messages, you should find the following line, which our Spark application prints to the console.

SparkContext : <SparkContext master=local appName=My First Spark Application>
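If the log output makes that line hard to find, one option is to lower the log level once the context exists. The following is a minimal sketch: set_quiet_logging is a hypothetical helper name, and it relies on the setLogLevel method of SparkContext. It only affects messages logged after it is called, so the startup messages still appear.

```python
# Log level names accepted by SparkContext.setLogLevel
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def set_quiet_logging(sc, level="WARN"):
    """Set the log level on an existing SparkContext and return the level used.

    sc    -- an already created SparkContext (or any object with setLogLevel)
    level -- one of VALID_LEVELS, case-insensitive
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"Unknown log level: {level}")
    sc.setLogLevel(level)
    return level
```

In the example program, this would be called right after the context is created, for example set_quiet_logging(sc, "ERROR").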

Syntax – Python SparkContext

The syntax of SparkContext Class is:

class pyspark.SparkContext (
	master = None,
	appName = None, 
	sparkHome = None, 
	pyFiles = None, 
	environment = None, 
	batchSize = 0, 
	serializer = PickleSerializer(), 
	conf = None, 
	gateway = None, 
	jsc = None, 
	profiler_cls = <class 'pyspark.profiler.BasicProfiler'>
)

where

  • master is the URL of the cluster it connects to.
  • appName is the application name by which you can identify the application in the job list of the Spark UI.
  • sparkHome is the path to the Spark installation directory.
  • pyFiles is the collection of .zip or .py files to send to the cluster and add to the PYTHONPATH.
  • environment is a dictionary of environment variables to set on the worker nodes.
  • batchSize is the number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size.
  • serializer is the RDD serializer which should be used for this Job.
  • conf is a SparkConf object used to set Spark properties.
  • gateway is an existing gateway and JVM to use; if it is not given, a new JVM is initialized.
  • jsc is the JavaSparkContext instance.
  • profiler_cls is a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler).

Depending on your requirements and environment, you can set any of the parameters accepted by pyspark.SparkContext().
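As an illustration of the conf parameter, the sketch below collects Spark properties as key-value pairs and passes them through a SparkConf object. The helper names build_conf_pairs and create_context are hypothetical; spark.master and spark.app.name are the property names that correspond to the master and appName parameters. The pyspark import is placed inside create_context so that the first helper can be used without a Spark installation.

```python
def build_conf_pairs(master, app_name, extra=None):
    """Return Spark properties as a list of (key, value) pairs.

    extra -- optional dict of additional Spark properties
    """
    pairs = [("spark.master", master), ("spark.app.name", app_name)]
    # Sort the extra properties so the result is deterministic
    pairs.extend(sorted((extra or {}).items()))
    return pairs


def create_context(master, app_name, extra=None):
    """Create a SparkContext from a SparkConf built with the pairs above."""
    # Imported here so the module loads even where PySpark is not installed
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAll(build_conf_pairs(master, app_name, extra))
    return SparkContext(conf=conf)
```

For example, create_context("local", "My First Spark Application", {"spark.executor.memory": "1g"}) would create a local context with one extra property set.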

Summary

In this tutorial of Python Examples, we learned how to get started with SparkContext in Python using the PySpark library.