Large data regression model: biglm(formula = formula, data = data, ...)
Sample size = 5481303
               Coef    (95%     CI)     SE p
(Intercept) -0.3857 -0.4000 -0.3714 0.0072 0
DepDelay     0.9690  0.9686  0.9695 0.0002 0
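Output of this shape comes from calling summary on a biglm model that was fit incrementally. The following is a minimal sketch only: the response variable, the chunk objects, and the update pattern are assumptions (only DepDelay appears as a predictor in the output above).

library(biglm)

# Sketch: fit on an initial chunk of the airline data, then fold in
# more rows; variable and object names here are hypothetical
model <- biglm(ArrDelay ~ DepDelay, data = chunk1)
model <- update(model, chunk2)  # biglm's update() adds rows to the fit
summary(model)                  # prints a coefficient table like the one above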
RHadoop: Accessing Apache Hadoop from R
Solving large data challenges often requires a combination of disparate technologies working in concert. R is a well-supported tool for statistical analysis, and Hadoop is popular for distributed data processing tasks. Hadoop provides a framework for defining MapReduce jobs: tasks that split large data challenges into smaller pieces that can be processed on a number of machines. Both R and Hadoop have an established user base and are great choices for specific use cases. For more on using the Hadoop MapReduce framework for data processing, see Chapter 8, “Putting It Together: MapReduce Data Pipelines,” and Chapter 9, “Building Data Transformation Workflows with Pig and Cascading.”
A practical way to build a bridge between distributed computing systems, such as Hadoop, and R is to provide interfaces from the programming language to the MapReduce framework. The Hadoop Streaming API (see Chapter 8) is one example of this type of interface, allowing MapReduce jobs to be defined in languages other than Hadoop's native Java. A popular choice for connecting R with Hadoop is the aptly named RHadoop project. RHadoop comprises several packages; the key ones, rmr, rhdfs, and rhbase, enable R developers to interface with Hadoop using idiomatic R syntax.
The rmr package is a bridge between R and Hadoop's MapReduce functionality. It's not a direct interface to the Hadoop Streaming API but rather a way to define MapReduce jobs in a concise, R-friendly syntax. The rhdfs and rhbase packages provide interfaces to HDFS (the Hadoop Distributed File System) and the HBase database, respectively. These packages provide simple functions to read, write, and copy data via R. As with many other libraries used to access Hadoop, it's possible to debug scripts written with RHadoop locally, without touching a Hadoop backend, by setting the backend parameter to "local" via the rmr.options function.
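As a rough sketch of local-mode debugging, the following assumes the rmr2 package (the current name of the rmr package) is installed; keyval, to.dfs, from.dfs, and mapreduce are rmr2 functions.

library(rmr2)

# Run MapReduce jobs in-process, with no Hadoop cluster involved
rmr.options(backend = "local")

# Write a small vector to the (local) DFS stand-in, square each value
# in the map phase, and read the results back into R
ints <- to.dfs(1:10)
result <- mapreduce(input = ints,
                    map = function(k, v) keyval(v, v^2))
from.dfs(result)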
To build and install the RHadoop packages, several additional R packages, including Rcpp and RJSONIO, are required. To run the rmr package over an existing Hadoop installation, each node must have a copy of R as well as rmr installed. Finally, R requires that the HADOOP_CMD and HADOOP_STREAMING environment variables point to the Hadoop binary and to the Hadoop Streaming API JAR file, respectively.
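A minimal setup sketch follows; the tarball filename and paths below are assumptions that will vary by installation (the RHadoop packages are distributed as tarballs from the project's repository rather than on CRAN).

# Install the prerequisite CRAN packages
install.packages(c("Rcpp", "RJSONIO"))

# Install rmr from a downloaded tarball (filename is hypothetical)
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")

# Point R at the local Hadoop installation; these paths are examples
# only and depend on your Hadoop distribution
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop",
           HADOOP_STREAMING =
             "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")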
Let's take a look at an example of using R and Hadoop together with the rmr and
rhdfs packages. Listing 11.6 demonstrates how to use rmr to run a MapReduce job on
an underlying Hadoop cluster.
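The listing itself is not reproduced in this excerpt. Purely as an illustrative sketch (not the book's Listing 11.6), a word-count-style job combining rmr with an HDFS input might look like the following; the HDFS path is hypothetical.

library(rhdfs)
library(rmr2)

hdfs.init()                      # initialize the rhdfs connection
rmr.options(backend = "hadoop")  # run on the actual cluster

# Count word occurrences in a text file stored on HDFS;
# the input path here is an example only
counts <- mapreduce(
  input = "/user/hadoop/sample.txt",
  input.format = "text",
  map = function(k, lines)
    keyval(unlist(strsplit(lines, " ")), 1),
  reduce = function(word, ones)
    keyval(word, sum(ones)))

from.dfs(counts)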
 