Setting up PySpark standalone on Mac
You have probably heard of Spark, a large-scale distributed data processing framework. The bottlenecks people hit when learning Spark often come from incorrect installations, or from not knowing the dependencies and the environment Spark runs in. In this guide I present quick and easy steps to get your Spark applications running in standalone mode on a Mac.
With python2 past its end of life (January 1, 2020), python3 is the way to go. So we will break down our guide into two steps:
- Python setup
- Spark setup
Python setup
Your Mac might have come with python2 preinstalled; you can check by running $python --version. To install python3, we will use Homebrew, a package manager for Mac, much as npm is for Node.js.
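Before installing anything, it can help to see which Python commands already resolve on your PATH. Here is a minimal sketch; report_python is a helper name I made up for illustration, not part of any tool. It redirects stderr because older python2 builds print their version there.

```shell
report_python() {
  # $1 is a command name to probe, for example python or python3.
  if command -v "$1" >/dev/null 2>&1; then
    # Merge stderr into stdout so the version is captured either way.
    echo "$1 -> $("$1" --version 2>&1)"
  else
    echo "$1 -> not found"
  fi
}

report_python python
report_python python3
```

If python reports a 2.x version and python3 is not found, the Homebrew steps below are what you need.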
Install Homebrew
shell$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
Add Homebrew's python path to your profile, so that python resolves to the Homebrew python3
shell$ vi ~/.bash_profile
export PATH="/usr/local/opt/python/libexec/bin:$PATH"
shell$ source ~/.bash_profile
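To confirm the profile change took effect, you can check whether the directory is really on your PATH. This is a small sketch, assuming the same directory as in the export line above; path_contains is a hypothetical helper name, and on setups where Homebrew lives under a different prefix the directory will differ.

```shell
path_contains() {
  # Succeeds if directory $1 is one of the colon-separated entries in $PATH.
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Directory taken from the export line in this guide (an assumption for your machine).
dir="/usr/local/opt/python/libexec/bin"
if path_contains "$dir"; then
  echo "$dir is on PATH"
else
  echo "$dir is not on PATH yet"
fi
```

Wrapping PATH in colons on both sides lets the match work for the first and last entries too, and avoids false hits on directories that merely share a prefix.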
Install python
shell$ brew install python
Yay, python3 is now neatly installed. On to Spark now.
Spark setup
Download the latest tarball from here
Install pyspark
shell$ pip install pyspark
Notes:
- pip should already be installed if you installed python from Homebrew as above.
- You also need a Java SE runtime. You generally have it installed already; if not, find it here.
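The notes above can be checked in one go. Below is a rough sketch that probes pip, Java, and the pyspark package by actually running each one; check_tool is a helper name of my own, and I am assuming python3 is the interpreter you installed pyspark into.

```shell
check_tool() {
  # $1: label to print; remaining arguments: the command to run as a probe.
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: ok"
  else
    echo "$label: missing"
  fi
}

check_tool pip python3 -m pip --version
check_tool java java -version
check_tool pyspark python3 -c 'import pyspark'
```

I run each command rather than just looking it up on the PATH, because a launcher can exist without a working runtime behind it.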
All good, the installations are now in place. Now just move to the bin folder in your Spark binaries and enter $pyspark to start the interactive shell.
Note that you want pyspark to use python3. If you observe that pyspark uses python2, you can change the PYSPARK_PYTHON variable to point to python3. You can do that by adding it to your profile as follows.
shell$ vi ~/.bash_profile
export PYSPARK_PYTHON=python3
shell$ source ~/.bash_profile
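After sourcing the profile, you may want to double-check what PYSPARK_PYTHON points to in your current shell. A minimal sketch follows; show_pyspark_python is a hypothetical helper name, and it only inspects the environment variable, not what Spark itself ends up launching.

```shell
show_pyspark_python() {
  # Report whether PYSPARK_PYTHON is set and whether it resolves on the PATH.
  if [ -z "${PYSPARK_PYTHON:-}" ]; then
    echo "PYSPARK_PYTHON is not set"
  elif command -v "$PYSPARK_PYTHON" >/dev/null 2>&1; then
    echo "PYSPARK_PYTHON=$PYSPARK_PYTHON -> $(command -v "$PYSPARK_PYTHON")"
  else
    echo "PYSPARK_PYTHON=$PYSPARK_PYTHON -> not found on PATH"
  fi
}

show_pyspark_python
```

If it reports that the variable is not set, open a new terminal or source the profile again before starting pyspark.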