Issue
My method of using pyspark is to always run the code below in jupyter. Is this method always necessary ?
import findspark
findspark.init('/opt/spark2.4')
import pyspark
sc = pyspark.SparkContext()
Solution
If you want to reduce the findspark
dependency, you can just make sure you have these variables in your .bashrc
export SPARK_HOME='/opt/spark2.4'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
Change the directories according to your enviroment, and the spark version as well. Apart from that, findspark
will have to be in your code for your python interpreter to find the spark directory
If you get it working, you can run pip uninstall findspark
EDIT:
Pure python solution, add this code on top of your jupyter notebook (maybe in the first cell):
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Source : Anaconda docs
Answered By – pissall
This Answer collected from stackoverflow, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0