Getting started with Sqoop on Google Cloud Platform

DataCouch
Dec 22, 2021


Learning Objectives:

  1. What is Sqoop?
  2. History of Sqoop
  3. How Sqoop works and its architecture
  4. Why use Sqoop?
  5. Installation of Sqoop on Google Cloud Platform (GCP)
  6. Basic working example of Sqoop

What is Sqoop?

It is a utility built to transfer bulk data between HDFS and relational databases (RDBMS, Relational DataBase Management Systems). The name Sqoop is derived from “SQL-to-Hadoop”, i.e. “SQ” from SQL and “OOP” from Hadoop. It can import data from relational databases such as MySQL, DB2, and Oracle into HDFS, and export data from an HDFS directory back to a relational database. It supports other sources/sinks as well, e.g. HBase.

History of Sqoop

Sqoop was initially developed and maintained by Cloudera, and was later incubated as an Apache project in 2011. In 2012, Sqoop was promoted to an Apache top-level project. Since then, all releases have been managed by Apache.

How Sqoop works and its architecture

Sqoop basically works in two modes, i.e. Sqoop Import and Sqoop Export.

Talking about Sqoop Import first: it is used to fetch data from other databases. To do so, Sqoop first connects to the database and analyses its metadata (the table definitions) in order to retrieve the data.

After that, it uses a map-only MapReduce job (there is no reduce phase, as no aggregation takes place) to transfer the data to HDFS (Hadoop Distributed File System), writing it into files under a directory corresponding to the retrieved table. Sqoop lets you specify the name of this directory in HDFS.
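
As a rough sketch (the connection string, table, and column names below are placeholders for illustration, not part of this tutorial), an import typically looks like the following; -m sets the number of parallel map tasks and --split-by tells Sqoop which column to use when dividing the table among them:

$ sqoop import --connect jdbc:mysql://localhost/demo --username root -P --table employees --split-by emp_id --target-dir /user/demo/employees -m 4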

Now talking about Sqoop Export: the target table must already exist in the RDBMS before the export can start. Sqoop executes this transfer by splitting the input data set across multiple map tasks and writing it to the database; each map task performs its share of the transfer over several transactions. Sqoop also offers a staging-table facility: the data set is first written to a staging table and only then moved to the target/destination table, which prevents partially exported data from reaching the target in case of a failure.
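
For example, a hedged sketch of an export that uses a staging table (the table names and HDFS path are illustrative assumptions): --staging-table names the intermediate table and --clear-staging-table empties it before the run. Note that the staging table must be created beforehand with the same structure as the target table.

$ sqoop export --connect jdbc:mysql://localhost/demo --username root -P --table employees --staging-table employees_stage --clear-staging-table --export-dir /user/demo/employees -m 4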

Why use Sqoop?

  1. Sqoop can execute data transfers in parallel
  2. A convenient command line interface allows even non-developers to transfer data
  3. It allows data transfer to and from multiple data stores such as MySQL, Oracle, DB2, etc.
  4. Sqoop allows us to offload processing in the EL (Extract and Load) process onto less expensive, fast, and effective Hadoop/Spark processes
  5. One can use incremental loads, so it is efficient to transfer only the delta data (see the sketch after this list)
  6. Various data compression strategies make it efficient in terms of network IO and disk IO
  7. Kerberos security integration makes it secure to use
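
As an illustration of points 5 and 6, here is a hedged sketch of an incremental, compressed import (the table, check column, and last value are assumptions for the example): --incremental append with --check-column and --last-value fetches only rows added since the previous run, while --compress and --compression-codec compress the files written to HDFS.

$ sqoop import --connect jdbc:mysql://localhost/demo --username root -P --table orders --target-dir /user/demo/orders --incremental append --check-column order_id --last-value 1000 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec -m 4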

Installation of Sqoop on Google Cloud Platform (GCP)

Before installing Sqoop, create a Dataproc cluster on GCP.
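
You can create the cluster from the Cloud Console, or use the gcloud CLI. A minimal sketch (the cluster name and region are placeholders; adjust them and any other options to your project):

$ gcloud dataproc clusters create sqoop-demo --region=us-central1 --single-node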

Sqoop 1.4.7 can be downloaded directly from https://sqoop.apache.org/, or you can follow our video tutorial, or you can install it manually in the system by following these steps.

For this Sqoop Installation tutorial, we are using version 1.4.7, that is,

sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

For other versions of Sqoop, you can check http://archive.apache.org/dist/sqoop/1.4.5/

In the terminal, use wget to download the Sqoop package.

$ wget http://apachemirror.wuchna.com/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Extract the Sqoop tarball using tar -xvf and move it to the /usr/lib/ directory.

$ tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
$ sudo mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop
$ sudo mv sqoop /usr/lib/

Append the following lines to the ~/.profile file using the sudo vi ~/.profile command:

export SQOOP_HOME=/usr/lib/sqoop

export PATH=$PATH:$SQOOP_HOME/bin

Execute the following command to reload the environment variables:

$ source ~/.profile

To configure Sqoop with Hadoop, edit the sqoop-env.sh file, which lives in the $SQOOP_HOME/conf directory, i.e. /usr/lib/sqoop/conf. First, change to the Sqoop config directory and copy the provided template file:

$ cd $SQOOP_HOME/conf
$ cp sqoop-env-template.sh sqoop-env.sh

Now, open sqoop-env.sh and edit the following lines in it.

export HADOOP_COMMON_HOME=/usr/lib/hadoop

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Find hive-site.xml in /usr/lib/hive/conf and the MySQL connector JAR in /usr/share/java/. Copy hive-site.xml into /usr/lib/sqoop/conf and create a soft link to the MySQL connector in /usr/lib/sqoop/lib:

$ sudo cp /usr/lib/hive/conf/hive-site.xml /usr/lib/sqoop/conf
$ sudo ln -s /usr/share/java/mysql-connector-java-5.1.42.jar /usr/lib/sqoop/lib

To verify the Sqoop version, use the following command:

$ cd /usr/lib/sqoop
$ ./bin/sqoop-version

Now, Sqoop is installed and ready for use in Dataproc on GCP.

Basic working example of Sqoop

Let’s take an example: we will import student data from MySQL into HDFS using Sqoop, and then from MySQL into Hive. Suppose there is a table named “student_details” in a database named “demo”.

To create the table in MySQL, first log in to MySQL by typing mysql -u root -p

Note: Type the password as password

If you don’t know the password and want to reset it: to change the root password, you have to shut down the database server first.

You can do that for MySQL with:

$ sudo systemctl stop mysql

After the database server is stopped, you’ll access it manually to reset the root password.

Restarting the Database Server Without Permission Checking

But before that, we need to create a directory named mysqld inside /var/run and give it the necessary permissions:

$ sudo mkdir -p /var/run/mysqld
$ sudo chown mysql:mysql /var/run/mysqld

Now start the database without loading the grant tables or enabling networking:

$ sudo mysqld_safe --skip-grant-tables &

Now open a new terminal tab and open MySQL by typing mysql -u root. No password is required here.

Changing the Root Password

Let’s tell the database server to reload the grant tables by issuing the FLUSH PRIVILEGES command.

mysql> FLUSH PRIVILEGES;

Now we can actually change the root password.

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'password';

Reload the grant tables after this.

mysql> FLUSH PRIVILEGES;

Now exit MySQL by typing exit;

Once you are done with this, you need to stop and restart the MySQL service. To stop MySQL, run the following commands:

$ sudo /etc/init.d/mysql stop
$ sudo killall -KILL mysql mysqld_safe mysqld

Now restart the mysql service:

$ sudo service mysql restart

Now open MySQL by typing mysql -u root -p

Note: Type the password as password

Create the database “demo” in MySQL:

mysql> create database demo;

mysql> use demo;

Create a table named “student_details” in MySQL:

mysql> create table student_details(rollno varchar(3) primary key, name varchar(20), age varchar(2));

Insert Data Into MySQL:

mysql> INSERT INTO student_details values ('1','Harshit','22');
mysql> INSERT INTO student_details values ('2','Akshat','23');
mysql> INSERT INTO student_details values ('3','Akash','21');

Exit MySQL using the command exit;

To import the data from MySQL into HDFS using Sqoop, execute the following command:

/usr/lib/sqoop $ ./bin/sqoop import --connect "jdbc:mysql://localhost/demo" --username root --password password --table student_details --target-dir="data" -m 3

If this command shows an error about splitting on a text column (the primary key rollno is a varchar), run the following command instead:

/usr/lib/sqoop $ ./bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect "jdbc:mysql://localhost/demo" --username root --password password --table student_details --target-dir="data" -m 3

Now check the data in the HDFS directory using the following commands:

$ hdfs dfs -ls data
$ hdfs dfs -cat data/part*

Now, let’s use the Hive import option to import the data from MySQL into Hive directly. To do so, execute the following command in your shell:

/usr/lib/sqoop $ ./bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect jdbc:mysql://localhost:3306/demo --username root --password password --table student_details --target-dir student_details --hive-import --create-hive-table --hive-table default.test -m 3

The above command creates a table in Hive named “test” (in the default database, as specified by --hive-table default.test), containing exactly the same data that was present in the “student_details” table in MySQL.

Note: If you face a Java security error about registering MBeans, do the following.

Add the line permission javax.management.MBeanTrustPermission "register"; inside the grant { }; block of /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/lib/security/java.policy.

Then change hive.execution.engine to "mr" inside /usr/lib/sqoop/conf/hive-site.xml and run the command again:

/usr/lib/sqoop $ ./bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect jdbc:mysql://localhost:3306/demo --username root --password password --table student_details --target-dir student_details --hive-import --create-hive-table --hive-table default.test -m 3

Now open the Hive shell by typing hive, then finally run select * from test;
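
That is, the session looks roughly like this:

$ hive
hive> select * from test;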

Conclusion

Sqoop is a data transfer tool that works in two modes, i.e. Import and Export. Incremental imports save us the effort of transferring the entire data set every time. Because it transfers data in parallel, Sqoop is fast and cost-effective.

𝐒𝐭𝐚𝐲 𝐜𝐨𝐧𝐧𝐞𝐜𝐭𝐞𝐝 𝐰𝐢𝐭𝐡 𝐮𝐬!

Subscribe to our YouTube channel for more such informative Content!

Follow us on: 𝐅𝐚𝐜𝐞𝐛𝐨𝐨𝐤, 𝐓𝐰𝐢𝐭𝐭𝐞𝐫, 𝐋𝐢𝐧𝐤𝐞𝐝𝐈𝐧, 𝐈𝐧𝐬𝐭𝐚𝐠𝐫𝐚𝐦

Join our strong 3700+ meetup members community where we impart our knowledge regularly on Data, ML, AI, and many more technologies.

For virtual instructor-led Class, please reach out to us at operations@datacouch.io

Enjoy Learning!

