Overview
Setting up and configuring a Spark environment is not an easy task, and the complexity multiplies when Spark has to be integrated with Jupyter Notebooks, Hive, and HDFS.
This guide simplifies the journey of setting up and configuring that environment on macOS.
System Prerequisites
- Laptop or desktop with an i3 quad-core processor or higher
- Browser: Google Chrome (preferred) or Safari
- 8 GB of RAM or more
- Internet connection (minimum 10 MB/s)
Installing Anaconda
1. Go to the Anaconda website and download the Python 3.x graphical installer.
2. Locate your download and double-click it to run the installer.
3. Click Next/Continue through the installer screens to complete the installation.
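Once the installer finishes, you can verify the installation from a terminal (the exact version numbers will vary):
conda --version
jupyter lab --version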
Using JupyterLab
Using shortcuts in both command and edit modes:
- <Shift + Enter> runs the current cell and selects the next cell below it
- <Ctrl + Enter> runs the selected cells
- <Alt + Enter> runs the current cell and inserts a new cell below it
- <Ctrl + S> saves and creates a checkpoint
Installing Brew
Open a terminal window and execute the following command:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
Check Java Version
java -version
Note: If you already have Java installed, skip the commands below.
If Java is not installed, proceed as follows.
Installing OpenJDK 8
Let’s install OpenJDK using brew:
brew cask install adoptopenjdk8
Or, on newer versions of Homebrew (where the cask syntax changed):
brew install --cask adoptopenjdk8
Now that Java 8 is installed, we can check the installation with the following command.
java -version
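If you need the JDK's installation path later (for example, when setting JAVA_HOME while configuring Hadoop below), macOS can report it:
/usr/libexec/java_home -v 1.8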
Installing Apache Spark
Now that you have brew installed, you can use it to install the latest version of Spark. Open a terminal and execute the below command -
brew install apache-spark
To locate the installation path and other details of Spark, enter the following command:
brew info apache-spark
The output shows the installed version, the installation path (by default under /usr/local/Cellar/apache-spark/), and dependency details.
To make PySpark launch JupyterLab, add the following two entries to your ~/.bash_profile file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
Now, run the following command after adding the above entries:
source ~/.bash_profile
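You can confirm that the variables are set in the current shell:
echo $PYSPARK_DRIVER_PYTHON        # should print: jupyter
echo $PYSPARK_DRIVER_PYTHON_OPTS   # should print: lab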
Verify Spark With Python
Launch JupyterLab via PySpark from the terminal using the following command:
pyspark
Create a PySpark DataFrame from an RDD (Resilient Distributed Dataset) consisting of a list of tuples. The date and datetime constructors come from Python's standard datetime module, so import them first:
from datetime import date, datetime

rdd = spark.sparkContext.parallelize([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
])
df = spark.createDataFrame(rdd, schema=['a', 'b', 'c', 'd', 'e'])
df
df
Check the data and the schema of the DataFrame created above with the following Python commands:
df.show()
df.printSchema()
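For the DataFrame above, printSchema() should print something like:
root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)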
(Optional) Installing Scala Spark on Jupyter
Step 1: Install the package
conda install -c conda-forge spylon-kernel
Step 2: Create a kernel spec
This will allow us to select the Scala kernel in the notebook.
python -m spylon_kernel install
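You can confirm that the kernel was registered with Jupyter:
jupyter kernelspec list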
Step 3: Test the notebook
Let’s write some Scala code:
val x = 2
val y = 3
x+y
The cell should output the computed result (ending with something like res0: Int = 5). Running the first cell also initiates the Spark components; for this, please make sure that you have SPARK_HOME set up.
Now we can use Spark. Let’s test it by creating a dataset:
val data = Seq((1,2,3), (4,5,6), (6,7,8), (9,19,10))
val ds = spark.createDataset(data)
ds.show()
This should output a simple dataframe:
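Something like this, since the tuple elements become columns _1, _2, and _3:
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  6|  7|  8|
|  9| 19| 10|
+---+---+---+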
Installing MySQL
Open a terminal and execute the following command:
brew install mysql
You can now start MySQL server by executing the following command:
brew services start mysql
Now, we need to secure the MySQL server. By default the server comes without a root password, so we need to make sure it’s protected.
mysql_secure_installation
Since we used the command brew services start mysql to start MySQL, macOS will start it automatically after each reboot.
Now, you can connect to the MySQL server using the command:
mysql -u root -p
You will need to type the root user password after you run this command, and once you are done you should see a mysql prompt.
Setting up MySQL database
$ mysql -u root -p
mysql> CREATE DATABASE metastore DEFAULT CHARACTER SET utf8;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL ON metastore.* TO 'hiveuser'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
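To confirm that the new user can reach the metastore database, log in as hiveuser (enter the password you chose above); the metastore database should appear in the output:
mysql -u hiveuser -p -e "SHOW DATABASES;"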
Installing Hive
Update Brew and Install Hive -
brew update
brew install hive
Note: You may get a "No bottle available!" error if you are running a MacBook Pro with an M1 chip. This is a known issue; please refer to this article to resolve the problem.
If the installation is successful, follow the steps below -
Edit the bash_profile file and add the following environment variables -
vi ~/.bash_profile
Press <A> to switch the vi editor to insert mode, then add:
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.3.3/libexec
export HIVE_HOME=/usr/local/Cellar/hive/3.1.3/libexec
Press <ESC> followed by <:wq!> to save the bash_profile file.
After saving the above file, execute the following command:
source ~/.bash_profile
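Verify that the variable is set and that Hive resolves (assuming Homebrew linked hive into your PATH; the version output will vary):
echo $HIVE_HOME
hive --version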
Download JDBC connector
Download the MySQL JDBC connector (mysql-connector-java-8.0.29) archive into ~/Downloads, then extract it and copy the jar into Hive's lib directory:
cd ~/Downloads
tar xzf mysql-connector-java-8.0.29.tar
sudo cp mysql-connector-java-8.0.29/mysql-connector-java-8.0.29.jar /usr/local/Cellar/hive/3.1.3/libexec/lib/
cd /usr/local/Cellar/hive/3.1.3/libexec/conf
Note: Download the hive-site.xml file from here, and then run the following commands:
cd ~/Downloads
sudo cp hive-site.xml /usr/local/Cellar/hive/3.1.3/libexec/conf
chmod -R 777 /tmp/hive
Create a directory hive/warehouse
cd ~
mkdir -p hive/warehouse
Go inside hive/warehouse using the below command and copy the path:
cd hive/warehouse
pwd
Edit the hive-site.xml file and update the path of hive.metastore.warehouse.dir
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/Users/<user>/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
Once you have updated the path in hive-site.xml, save the file using :wq.
Next, you need to create the tables in MySQL that the metastore requires. Create them manually using the script available at the following location:
cd $HIVE_HOME/scripts/metastore/upgrade/mysql/
Log in to MySQL and run the schema script:
mysql> use metastore;
mysql> source hive-schema-3.1.0.mysql.sql;
mysql> quit;
Exit MySQL and run the below command in the terminal:
schematool -dbType mysql -initSchema -dryRun
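If the dry run completes without errors, you can also ask schematool to report the schema version recorded in the metastore:
schematool -dbType mysql -info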
Running Hive
$ hive
hive> show tables;
hive> exit;
Integrating Hive with Spark
Copy Spark default configuration to Spark conf directory
cd /usr/local/Cellar/apache-spark/3.2.0/libexec/conf
sudo cp spark-defaults.conf.template spark-defaults.conf
Now, specify Hive configuration in Spark default conf file.
To edit spark-defaults.conf, give the following command:
vi spark-defaults.conf
and add the below line:
spark.sql.catalogImplementation=hive
To save the file and quit vi enter :wq.
Now, copy hive-site.xml file to the Spark conf directory.
sudo cp /usr/local/Cellar/hive/3.1.3/libexec/conf/hive-site.xml .
Then, copy the jar file with mysql connector to the Spark Jars directory.
sudo cp /usr/local/Cellar/hive/3.1.3/libexec/lib/mysql-connector-java-8.0.29.jar /usr/local/Cellar/apache-spark/3.2.0/libexec/jars
Set the permissions
sudo chmod -R 777 /tmp/hive
sudo chmod -R 775 /usr/local/Cellar/hive
sudo chmod -R 775 /usr/local/Cellar/apache-spark
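To verify the integration, query the metastore through Spark. A minimal check using the spark-sql shell (assuming Homebrew linked spark-sql into your PATH):
spark-sql -e "SHOW DATABASES;"
The default database should be listed, and any tables you create in Hive will now be visible from Spark as well.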
Configuring Hadoop
Configuring Hadoop with minimal settings will take a few steps. A more detailed version can be found in the Apache Hadoop documentation for setting up a single node cluster (be sure to follow along with the correct version installed on your machine).
The steps to configure Hadoop are as follows:
- Update the environment variable settings
- Make changes to the core-, hdfs-, mapred-, and yarn-site.xml files
- Remove the password requirement (if present)
- Format the NameNode
Open the file containing the environment variable settings, i.e., hadoop-env.sh:
cd /usr/local/Cellar/hadoop/3.3.3/libexec/etc/hadoop
vi hadoop-env.sh
Make the following changes to the document, save and close.
1) Add the location for export JAVA_HOME:
export JAVA_HOME="/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"
2) Replace the existing export HADOOP_OPTS line:
change
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
to
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Make the following changes to the core-site.xml file:
vi core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Make the following changes to the hdfs-site.xml file:
vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Make the following changes to the mapred-site.xml file:
vi mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
Make the following changes to the yarn-site.xml file:
vi yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Site specific YARN configuration properties -->
</configuration>
Remove password requirement (if necessary)
Check if you’re able to ssh without a password. This step is necessary before moving to the next step to prevent unexpected results when formatting the NameNode.
ssh localhost
Make sure Remote Login and File Sharing are turned on under System Preferences.
If this does not return a last login time, use the following commands to remove the need to enter a password.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
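Afterwards, re-run the check; it should now log you in without prompting for a password:
ssh localhost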
Format NameNode
To format the NameNode, use the following commands:
cd /usr/local/Cellar/hadoop/3.3.3/libexec/bin
hdfs namenode -format
A warning will tell you that a directory for logs is being created. You will be prompted to re-format the filesystem in the storage directory for root. Type Y and press RETURN.
Run Hadoop
Now, you are ready to run Hadoop
cd /usr/local/Cellar/hadoop/3.3.3/libexec/sbin
./start-all.sh
jps
After running jps, you should have confirmation that all components of Hadoop are running: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager should all be listed, along with the Jps process itself.
Open a web browser to see your configurations for the current session: the NameNode UI is at http://localhost:9870 and the YARN ResourceManager UI is at http://localhost:8088 (the default ports for Hadoop 3.x).
Verify Hadoop
hdfs dfs -mkdir -p /user/$USER/
hdfs dfs -ls
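As a quick smoke test (the file name test.txt is just an example), copy a local file into HDFS and list it:
echo "hello hdfs" > test.txt
hdfs dfs -put test.txt /user/$USER/
hdfs dfs -ls /user/$USER/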
Stop Hadoop
Stop Hadoop service when you are all done.
cd /usr/local/Cellar/hadoop/3.3.3/libexec/sbin
./stop-all.sh
I hope this step-by-step guide has helped you get over the hurdle of installing Spark, Hive, and Hadoop on your macOS machine and you are now all set!
For virtual instructor-led classes, please reach out to us at operations@datacouch.io.