Do I need to run findspark every time, or just once?

Issue

My method of using pyspark is to always run the code below in Jupyter. Is this always necessary?

import findspark
findspark.init('/opt/spark2.4')  # point findspark at the Spark installation
import pyspark
sc = pyspark.SparkContext()      # build the context for this session

Solution

If you want to remove the findspark dependency, just make sure you have these variables in your .bashrc:

export SPARK_HOME='/opt/spark2.4'                 # Spark installation directory
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH  # make pyspark importable
export PYSPARK_DRIVER_PYTHON="jupyter"            # use Jupyter as the driver
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3                     # Python used by the workers
# Spark's launch scripts (pyspark, spark-submit) live under $SPARK_HOME/bin
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Change the directories according to your environment, and the Spark version as well. Otherwise, findspark will have to stay in your code so that your Python interpreter can find the Spark directory.

Once you get it working, you can run pip uninstall findspark.
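To confirm the variables took effect, open a new terminal (or run source ~/.bashrc) and start a notebook. A minimal check, assuming the paths above match your installation, is the asker's own snippet without the findspark lines:

import pyspark               # resolved via PYTHONPATH, no findspark needed
sc = pyspark.SparkContext()
print(sc.version)            # should report 2.4.x for /opt/spark2.4
sc.stop()

If the import fails, the PYTHONPATH export is the first thing to re-check.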

EDIT:

Pure Python solution: add this code at the top of your Jupyter notebook (e.g. in the first cell):

import os
import sys

# point PySpark at the desired Python interpreter and Spark installation
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# make the bundled py4j and pyspark zips importable; match the py4j
# version to the one actually shipped in $SPARK_HOME/python/lib
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
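The py4j zip name above is version-specific (Spark 2.4 actually ships a newer py4j than 0.9), so you may prefer a variant of the same cell that picks up whatever zips your install provides instead of hardcoding them. This is a sketch; the two paths are still assumptions about your environment:

import glob
import os
import sys

os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
pylib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")

# add pyspark.zip and whichever py4j-*-src.zip this Spark build ships
for zip_path in sorted(glob.glob(os.path.join(pylib, "*.zip"))):
    sys.path.insert(0, zip_path)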

Source: Anaconda docs

Answered By – pissall

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
