Exercise 2: Correlate structured data with unstructured data
Since you are a pretty smart data person, you realize
another interesting business question would be: are the most viewed products
also the most sold? Since Hadoop can store unstructured and semi-structured
data alongside structured data without remodeling an entire database, you can
just as well ingest, store, and process web log events. Let's find out what
site visitors have actually viewed the most.
For this, you need the web clickstream data. The most
common way to ingest web clickstream data is with Apache Flume. Flume is a
scalable, real-time ingest framework that lets you route, filter,
aggregate, and perform "mini-operations" on data on its way
into the scalable processing platform.
In Exercise 4, later in this tutorial, you can explore a
Flume configuration example for real-time ingest and transformation of
our sample web clickstream data. However, in the interest of tutorial time, we
won't wait three days for data to be ingested in this step. Instead, we have
prepared a web clickstream data set (just pretend you fast-forwarded three
days) that you can bulk upload into HDFS directly.
Bulk Upload Data
For your convenience, we have pre-loaded some sample
access log data into /opt/examples/log_files/access.log.2.
Let's move this data from the local filesystem into HDFS.
> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
> sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
The copy command may take several minutes to complete.
Verify that your data is in HDFS by executing the
following command:
> hadoop fs -ls /user/hive/warehouse/original_access_logs
You should see the access log file listed in that directory.
Now you can build a table in Hive and query the data via
Apache Impala (incubating) and Hue. You'll build this table in 2 steps. First,
you'll take advantage of Hive's flexible SerDes (serializers / deserializers)
to parse the logs into individual fields using a regular expression. Second,
you'll transfer the data from this intermediate table to one that does not
require any special SerDe. Once the data is in this table, you can query it
much faster and more interactively using Impala.
We'll use the Hive Query Editor app in Hue to execute the
following queries:
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';
ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
The final query will take a minute to run. It uses a
MapReduce job, just like our Sqoop import did, to transfer the data from one
table to the other in parallel. You can follow the progress in the log below,
and you should see the message 'The operation has no results.' when it's done.
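If you want to confirm that the transfer populated the new table before switching tools, an optional spot check in the same Hive Query Editor along these lines should return a handful of parsed rows:
-- optional sanity check; not required for the rest of the exercise
SELECT * FROM tokenized_access_logs LIMIT 10;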
Again, we need to tell Impala that some tables have been
created through a different tool. Switch back to the Impala Query Editor app,
and enter the following command:
invalidate metadata;
Now, if you enter the 'show tables;' query or refresh the
table list in the left-hand column, you should see the two new external tables
in the default database. Paste the following query into the Query Editor:
select count(*), url
from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc;
You should see a list of product URLs with their view counts, with the most viewed products at the top.
Looking over the results, you quickly realize that this
list contains many of the products from the most-sold list in the previous
tutorial steps, but one product here did not show up in the previous result:
it seems to be viewed a lot, yet never purchased. Why?
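If you'd rather make this comparison with a query than by eye, you can pull the most-sold list back up from the structured order data you imported with Sqoop earlier. The sketch below assumes the order_items and products tables from that import, with the column names of the sample retail dataset (order_item_product_id, product_id, product_name); treat those names as assumptions and adjust them if your import used a different schema.
-- Sketch: most-sold products from the structured order data imported
-- earlier with Sqoop. Table and column names follow the sample retail
-- dataset and are assumptions if your schema differs.
select p.product_name, count(*) as times_sold
from order_items oi
join products p on oi.order_item_product_id = p.product_id
group by p.product_name
order by times_sold desc
limit 10;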
Well, in our example with DataCo, once these odd findings
are presented to your manager, they are immediately escalated. Eventually,
someone figures out that the product view page where most visitors stopped in
the sales path displayed a typo in the item's price. Once the typo was fixed
and the correct price displayed, sales for that SKU began to increase rapidly.
Conclusion:
Without an efficient, interactive tool for analyzing high-volume
semi-structured data, this loss of revenue could have gone unnoticed for a
long time. An organization that looks for answers in only part of its data
risks exactly this kind of loss. Correlating two data sets against the same
business question showed real value, and being able to do so within the same
platform made life easier for you and for the organization.
If you'd like to dive deeper into Hive, Impala, and other
tools for data analysis in Cloudera's platform, you may be interested in Data Analyst
Training.
For now, we'll explore some different techniques.