Exercise 1: Ingest and query relational data
In this scenario, DataCo's business question is: What products do our customers like to buy? To answer this question, the first thought might be to look at the transaction data, which should indicate what customers actually buy, and presumably what they like to buy, right?
This is probably something you can do in your regular RDBMS environment, but a benefit of Apache Hadoop is that you can do it at greater scale and lower cost, on the same system that you may also use for many other types of analysis.
This exercise demonstrates how to do exactly the same thing you already know how to do, but in CDH. Seamless integration is important when evaluating any new infrastructure, so you need to be able to do what you normally do without breaking the regular BI reports or workloads that run against the dataset you plan to migrate.
To analyze the transaction data in the new platform, we need to ingest it into the Hadoop Distributed File System (HDFS). We need a tool that easily transfers structured data from an RDBMS to HDFS while preserving that structure, so that we can query the data without interfering with or breaking any regular workloads that depend on it.
Apache Sqoop, which is part of CDH, is that tool. The nice thing about Sqoop is that we can automatically load our relational data from MySQL into HDFS while preserving the structure. With a few additional configuration parameters, we can take this one step further and load this relational data directly into a form ready to be queried by Apache Impala (incubating), the MPP analytic database included with CDH, and by other workloads.
You should first log in to the Master Node of your cluster via a terminal. Then, launch the Sqoop job:
> sqoop import-all-tables \
    -m {{cluster_data.worker_node_hostname.length}} \
    --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import
This command may take a while to complete, but it is doing a lot. It launches MapReduce jobs to pull the data from our MySQL database and write it to HDFS in parallel, distributed across the cluster in Apache Parquet format. It also creates tables in Impala/Apache Hive that represent the HDFS files, with matching schemas.
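As a quick sanity check once the import finishes, you can confirm that those tables were registered in the Hive metastore. This is an illustrative check, assuming the hive client is available on the Master Node; it should list the categories table used in the verification step below, along with the other retail_db tables.
> hive -e "show tables"
> hive -e "describe categories"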
Parquet is a format designed for analytical applications on Hadoop. Instead of grouping your data into rows like typical data formats, it groups your data into columns. This is ideal for many analytical queries where, instead of retrieving data from specific records, you're analyzing relationships between specific variables across many records. Parquet is designed to optimize data storage and retrieval in these scenarios.
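If you want to see the column-oriented layout for yourself, the parquet-tools utility (bundled with many CDH releases) can print the schema of an imported file. The file name below is a placeholder, since Sqoop generates its own file names; substitute one of the names listed by the verification commands that follow.
> # <generated-file> is a placeholder; pick a real file name from 'hadoop fs -ls'
> parquet-tools schema /user/hive/warehouse/categories/<generated-file>.parquet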
Verification
When this command is complete, confirm that your data files exist in HDFS.
> hadoop fs -ls /user/hive/warehouse/
> hadoop fs -ls /user/hive/warehouse/categories/
These commands will show the directories and the files inside them that make up your tables: