Do I need to run findspark every time, or only once?


My way of using PySpark is to always run the code below in Jupyter first. Is this necessary every time?

import findspark
findspark.init()  # adds pyspark to sys.path, using SPARK_HOME if it is set
import pyspark
sc = pyspark.SparkContext()


If you want to remove the findspark dependency, make sure these variables are set in your .bashrc:

export SPARK_HOME='/opt/spark2.4'
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Change the directories and the Spark version to match your environment. Otherwise, findspark has to stay in your code so that the Python interpreter can find the Spark directory.

Once it works without findspark, you can run pip uninstall findspark.
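
Before uninstalling, it may help to confirm from inside the notebook that Python can actually see Spark. The following is only a sketch using the standard library; the helper name spark_env_report is made up for illustration and is not part of findspark or PySpark.

```python
import importlib.util
import os


def spark_env_report(environ=None):
    """Summarise whether Python can locate Spark without findspark."""
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME")
    return {
        "spark_home": spark_home,
        # SPARK_HOME should be set and point at an existing directory
        "spark_home_exists": bool(spark_home) and os.path.isdir(spark_home),
        # True when a pyspark module can be found on sys.path
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for key, value in spark_env_report().items():
        print(f"{key}: {value}")
```

If pyspark_importable comes back False in a fresh kernel, the kernel cannot import pyspark on its own yet, and findspark (or a manual sys.path entry) is still needed.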


Pure Python solution: add this code at the top of your Jupyter notebook (for example, in the first cell):

import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] +"/")
sys.path.insert(0, os.environ["PYLIB"] +"/")

Source: Anaconda docs
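
A variation on the snippet above, as a sketch only: Spark ships its Python libraries as archives in python/lib (pyspark.zip and a py4j-<version>-src.zip), and those archives are what need to be on sys.path. Globbing for them avoids hard-coding the py4j version. The function name add_spark_to_path is hypothetical, and "/opt/spark2.4" is just the example location used in this thread.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Prepend Spark's bundled Python archives to sys.path; return those added."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # pyspark.zip and py4j-<version>-src.zip live in python/lib; the glob
    # picks up whichever py4j version this Spark build ships with.
    archives = sorted(
        glob.glob(os.path.join(lib_dir, "pyspark.zip"))
        + glob.glob(os.path.join(lib_dir, "py4j-*-src.zip"))
    )
    added = []
    for archive in archives:
        if archive not in sys.path:
            sys.path.insert(0, archive)
            added.append(archive)
    return added


if __name__ == "__main__":
    # Fall back to the example location from this thread; adjust as needed.
    home = os.environ.get("SPARK_HOME", "/opt/spark2.4")
    print(add_spark_to_path(home))
```

Run it once in the first cell; an empty list means either no archives were found under that Spark directory or they were already on sys.path.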

