How to Set up Spark Environment on Mac

DataCouch
8 min read · Aug 19, 2022

Overview

Setting up and configuring a Spark environment is not an easy task, and the complexity multiplies when Spark has to be integrated with Jupyter Notebooks, Hive, and HDFS.

This guide simplifies the process of setting up and configuring that environment on macOS.

System Prerequisites

  • Laptop or desktop with i3 quad-core processor or higher
  • Browser: Google Chrome (Preferred) or Safari
  • 8GB or higher RAM
  • Internet connection (minimum 10 MB/s)

Installing Anaconda

1. Go to the Anaconda website and download the Python 3.x graphical installer.

2. Locate the download and double-click it to run the installer.

3. Click Next when the installer's splash screen appears.

Using JupyterLab

Using shortcuts in both command and edit modes:

  • <Shift + Enter> to run the current cell and move to the next cell below it
  • <Ctrl + Enter> to run the selected cells
  • <Alt + Enter> to run the current cell and insert a new cell below it
  • <Ctrl + S> to save and create a checkpoint

Installing Brew

Open a terminal window and execute the following command:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

Check Java Version

java -version

Note: If you already have Java installed, skip the commands below.

If Java is not installed, proceed as follows.

Installing OpenJDK 8

Let’s install OpenJDK using brew:

brew cask install adoptopenjdk8

Or, with newer Homebrew versions:

brew install --cask adoptopenjdk8

Now that Java 8 is installed, we can check the installation with the following command.

java -version

Installing Apache Spark

Now that you have brew installed, you can use it to install Apache Spark (brew installs the latest version by default). If Spark is not already installed, open a terminal and execute the below command:

brew install apache-spark

To locate the installation path and other details of Spark, enter the following command:

brew info apache-spark

The output shows details such as the installed version and the installation path.

To make PySpark launch JupyterLab, add the following two entries to your ~/.bash_profile file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'

Now, run the following command after adding the above entries:

source ~/.bash_profile

Verify Spark With Python

Launch JupyterLab with PySpark from the terminal using the following command:

pyspark

Create a PySpark DataFrame from an RDD (Resilient Distributed Dataset) consisting of a list of tuples.

from datetime import date, datetime

rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df

Check the data and the schema of the DataFrame created above with the following Python commands:

df.show()
df.printSchema()
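To go one step beyond schema inspection, a quick transformation confirms that Spark can actually execute jobs. The following is a minimal sketch that reuses the df created above; the column names follow the schema defined earlier, and the alias sum_b is just an illustration.

from pyspark.sql import functions as F

# Filter rows on column 'a' and project a few columns; this runs a real Spark job.
df.filter(F.col('a') > 1).select('a', 'b', 'c').show()

# Aggregate column 'b' to verify that aggregations work as well.
df.agg(F.sum('b').alias('sum_b')).show()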

(Optional) Installing Scala Spark on Jupyter

Step 1: Install the package

conda install -c conda-forge spylon-kernel

Step 2: Create a kernel spec

This will allow us to select the Scala kernel in the notebook.

python -m spylon_kernel install

Step 3: Test the notebook

Let’s write some Scala code:

val x = 2
val y = 3
x+y

The output should show the values assigned to x and y along with their sum. Running the first cell also initiates the Spark components, so please make sure that you have SPARK_HOME set up.

Now we can use Spark. Let’s test it by creating a dataset:

val data = Seq((1,2,3), (4,5,6), (6,7,8), (9,19,10))
val ds = spark.createDataset(data)
ds.show()

This should output a simple dataframe:
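(Spark names the tuple columns _1, _2, and _3 automatically, so ds.show() should print something close to the following.)

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  6|  7|  8|
|  9| 19| 10|
+---+---+---+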

Installing MySQL

Open a terminal and execute the following command:

brew install mysql

You can now start MySQL server by executing the following command:

brew services start mysql

Now, we need to secure the MySQL server. By default the server comes without a root password, so we need to make sure it’s protected.

mysql_secure_installation

Since we used brew services start mysql to start MySQL, macOS will start it automatically after each reboot.

Now, you can connect to the MySQL server using the command:

mysql -u root -p

You will need to type the root user password after you run this command, and once you are done you should see a mysql prompt.

Setting up MySQL database

$ mysql -u root -p
mysql> CREATE DATABASE metastore DEFAULT CHARACTER SET utf8;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL ON metastore.* TO 'hiveuser'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

Installing Hive

Update brew and install Hive:

brew update
brew install hive

Note: You may get a “No bottle available!” error if you are running a MacBook Pro with an M1 chip. This is a known issue; please refer to this article to resolve the problem.

If the installation is successful, follow the steps below.

Edit the ~/.bash_profile file and add the following environment variables:

vi ~/.bash_profile

Press the <A> key to switch the vi editor into Insert mode.

export HADOOP_HOME=/usr/local/Cellar/hadoop/3.3.3/libexec
export HIVE_HOME=/usr/local/Cellar/hive/3.1.3/libexec

Key in <ESC> followed by <:wq!> to save the bash_profile file.

After saving the above file, execute the following command:

source ~/.bash_profile

Download JDBC connector

cd ~/Downloads
tar xzf mysql-connector-java-8.0.29.tar
sudo cp mysql-connector-java-8.0.29/mysql-connector-java-8.0.29.jar /usr/local/Cellar/hive/3.1.3/libexec/lib/
cd /usr/local/Cellar/hive/3.1.3/libexec/conf

Note: Download hive-site.xml file from here, and then run the following commands:

cd ~/Downloads
sudo cp hive-site.xml /usr/local/Cellar/hive/3.1.3/libexec/conf
chmod -R 777 /tmp/hive

Create a directory hive/warehouse

cd ~
mkdir -p hive/warehouse

Go inside hive/warehouse using the commands below and copy the path:

cd hive/warehouse
pwd

Edit the hive-site.xml file and update the path of hive.metastore.warehouse.dir:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/Users/<user>/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

Once you have updated the path in hive-site.xml, save the file using :wq.

You need to create the tables in MySQL that the metastore requires. Create them manually using the script available at the following location:

cd $HIVE_HOME/scripts/metastore/upgrade/mysql/

Logging in to MySQL

mysql> use metastore; 
mysql> source hive-schema-3.1.0.mysql.sql;
mysql> quit;

Exit MySQL and run the below command in the terminal:

schematool -dbType mysql -initSchema -dryRun

Running Hive

$ hive
hive > show tables;
hive > exit;

Integrating Hive with Spark

Copy Spark default configuration to Spark conf directory

cd /usr/local/Cellar/apache-spark/3.2.0/libexec/conf
sudo cp spark-defaults.conf.template spark-defaults.conf

Now, specify the Hive configuration in the Spark defaults file.

To edit spark-defaults.conf, run the following command:

vi spark-defaults.conf

and add the below line:

spark.sql.catalogImplementation=hive

To save the file and quit vi, enter :wq.

Now, copy the hive-site.xml file to the Spark conf directory.

sudo cp /usr/local/Cellar/hive/3.1.3/libexec/conf/hive-site.xml .

Then, copy the MySQL connector jar file to the Spark jars directory.

sudo cp /usr/local/Cellar/hive/3.1.3/libexec/lib/mysql-connector-java-8.0.29.jar /usr/local/Cellar/apache-spark/3.2.0/libexec/jars

Set the permissions

sudo chmod -R 777 /tmp/hive
sudo chmod -R 775 /usr/local/Cellar/hive
sudo chmod -R 775 /usr/local/Cellar/apache-spark
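With the catalog implementation set to hive and the connector jar in place, you can verify the integration from the pyspark shell. The sketch below assumes you restart pyspark so it picks up the new configuration; the table name demo_spark_hive is just an illustration.

# Confirm that Spark is using the Hive catalog rather than the default in-memory one.
print(spark.conf.get("spark.sql.catalogImplementation"))  # expected: hive

# Databases created earlier in Hive (including the default database) should be visible.
spark.sql("SHOW DATABASES").show()

# Create a managed table through Spark; it is registered in the Hive metastore.
spark.sql("CREATE TABLE IF NOT EXISTS demo_spark_hive (id INT, name STRING)")
spark.sql("SHOW TABLES").show()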

Configuring Hadoop

Configuring Hadoop with minimal settings will take a few steps. A more detailed version can be found in the Apache Hadoop documentation for setting up a single node cluster (be sure to follow along with the correct version installed on your machine).

The steps to configure Hadoop are as follows:

  • Updating the environment variable settings
  • Make changes to core-, hdfs-, mapred- and yarn-site.xml files
  • Remove password requirement (if present)
  • Format NameNode

Open the file containing the environment variable settings, i.e., hadoop-env.sh:

cd /usr/local/Cellar/hadoop/3.3.3/libexec/etc/hadoop

vi hadoop-env.sh

Make the following changes to the document, save and close.

1) Add the location for export JAVA_HOME:

export JAVA_HOME="/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

2) Replace information for export HADOOP_OPTS:

change
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
to
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Make the following changes to core-site.xml file

vi core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Make the following changes to hdfs-site.xml file

vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Make the following changes to mapred-site.xml file

vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

Make the following changes to yarn-site.xml file

vi yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <!-- Site specific YARN configuration properties -->
</configuration>

Remove password requirement (if necessary)

Check if you’re able to ssh without a password. This step is necessary before moving to the next step to prevent unexpected results when formatting the NameNode.

ssh localhost

Make sure to turn on Remote Login and File Sharing under System Preferences; that was enough to make it work on my machine.

If this does not return a last login time, use the following commands to remove the need to enter a password.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Format NameNode

To format the NameNode, use the following commands:

cd /usr/local/Cellar/hadoop/3.3.3/libexec/bin
hdfs namenode -format

A warning will tell you that a directory for logs is being created. You will be prompted to re-format the filesystem in Storage Directory for root. Say Y and press RETURN.

Run Hadoop

Now, you are ready to run Hadoop

cd /usr/local/Cellar/hadoop/3.3.3/libexec/sbin
./start-all.sh
jps

After running jps, you should have confirmation that all components of Hadoop have been installed and are running.
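The jps output lists each running daemon by process ID and class name; it should look something like the following, where the process IDs are illustrative and will differ on your machine:

12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 ResourceManager
12349 NodeManager
12350 Jps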

Open the following address in a web browser to see your configuration for the current session:

http://localhost:9870

Verify Hadoop

hdfs dfs -mkdir -p /user/$USER/
hdfs dfs -ls
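Because core-site.xml points fs.defaultFS at hdfs://localhost:9000, Spark can also read and write HDFS directly. The following minimal sketch runs in the pyspark shell and reuses the df created earlier; the output path /user/demo/df_parquet is just an illustration.

# Write the DataFrame to HDFS as Parquet, then read it back to confirm the round trip.
df.write.mode("overwrite").parquet("hdfs://localhost:9000/user/demo/df_parquet")
spark.read.parquet("hdfs://localhost:9000/user/demo/df_parquet").show()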

Stop Hadoop

Stop the Hadoop services when you are all done.

cd /usr/local/Cellar/hadoop/3.3.3/libexec/sbin
./stop-all.sh

I hope this step-by-step guide has helped you get over the hurdle of setting up Spark, Hive, and Hadoop on your macOS machine and that you are now all set!

For virtual instructor-led classes, please reach out to us at operations@datacouch.io
