That's fine, but what if you want to find suppliers that meet specific criteria, such as those with more than 1,000
transactions? For this, you must use the HAVING clause:
SELECT supplier, COUNT(*)
FROM trade.rawtrans
GROUP BY supplier
HAVING COUNT(*) > 1000
ORDER BY supplier DESC;
Notice that the HAVING clause operates on the aggregate COUNT(*) value, using the greater-than ( > ) operator.
So, with just a few lines of SQL, it is possible to generate per-supplier transaction volumes by using COUNT(*)
and GROUP BY. The ORDER BY clause sorts the list in descending supplier-name order, and the HAVING clause restricts
it to the highest-volume suppliers. By combining these clauses in one statement, it is possible to extract some very
useful information from the raw data. The rows with more than 1,000 transactions are shown as follows:
UK Trade & Investment - Trade Development 1158
UK Trade & Investment - Sectors Group 1503
UK Trade & Investment - Regional Directorate 2134
UK Trade & Investment - International Group 1038
UK Trade & Investment - Defence and Security Organisation 2970
UK Trade & Investment - Business Group 1229
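Incidentally, the statement above sorts by supplier name; if you wanted the busiest suppliers listed first instead, you could sort on the aggregate itself. The following is a minimal variation, assuming the same trade.rawtrans table (the column alias volume is introduced here purely for readability):
-- same query, but ordered by transaction volume rather than supplier name
SELECT supplier, COUNT(*) AS volume
FROM trade.rawtrans
GROUP BY supplier
HAVING COUNT(*) > 1000
ORDER BY volume DESC;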
When run on engines such as Cloudera Impala and Apache Hive, simple SQL statements like these become even more powerful.
As you have seen, you can extend the functionality of Apache Hive by using UDFs, while Impala integrates with
Cloudera's enterprise data hub. For further information on Apache Hive QL, see the Apache language reference at
https://cwiki.apache.org/confluence/display/Hive/LanguageManual .
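To give a rough idea of how a UDF fits into Hive QL, the following sketch registers a custom function and applies it in a query. The JAR path, the function name clean_supplier, and the Java class are hypothetical placeholders; only the ADD JAR and CREATE TEMPORARY FUNCTION statements themselves are standard Hive QL:
-- register the (hypothetical) UDF jar and bind a function name to its Java class
ADD JAR /tmp/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION clean_supplier AS 'com.example.hive.CleanSupplierUDF';

-- the UDF can then be called like any built-in function
SELECT clean_supplier(supplier), COUNT(*)
FROM trade.rawtrans
GROUP BY clean_supplier(supplier);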
The power of these SQL-like languages comes partly from their functionality but also from the fact that they
are familiar to anyone who has had exposure to relational databases. Apache Hive and Cloudera Impala both operate
on data stored in Hadoop HDFS. In the next section, I briefly demonstrate Apache Spark data processing and use an
SQL statement in Spark to show that SQL can also be used to manipulate Spark-based data. (In Chapter 10, I present
Talend and Pentaho, which can also be used to perform ETL and to manipulate data via a visual, object-based approach.)
Apache Spark
Apache Spark is a cluster computing system that offers very fast, in-memory distributed processing. You can develop
applications in Java, Python, and Scala or use the built-in scripting shell for ad hoc script development. Able to
scale to very large clusters (on the order of 2,000 nodes), Spark can also cache data for memory-based analytics.
Spark can run in local mode or use cluster managers, such as Mesos, YARN, or Spark's own standalone cluster
manager. The cluster manager manages the executor processes on the worker nodes. The executors run applications on
worker nodes and process application data. Spark uses a resilient distributed data set (RDD) data model for data
processing.
So while a Hadoop cluster provides a distributed batch-processing system for handling very large volumes of data,
Spark supports low-latency, near real-time processing and is much faster. Like Hadoop, it offers a robust distributed
processing model; however, Hadoop MapReduce reads and writes its data on HDFS, while Spark processes data in memory.
Spark can also integrate with Hadoop, pulling data from and saving data to HDFS.