Beyond MapReduce - Enterprise Data Workflows with Cascading

head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

# uncomment next line the first time
#install.packages("ggplot2")
library(ggplot2)

m <- ggplot(df, aes(x = hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth = 1, aes(y = ..density.., fill = ..count..)) + geom_density()

That R script first sets up a JDBC connection in Lingual. Then it runs the same query we used in the SQL command shell to list records for employees named Gina. Next, the script calculates age (in years) at time of hire for employees in the SQL result set. Then it calculates summary statistics and visualizes the age distribution, shown in Figure 6-5:

> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92

Figure 6-5. R data visualization
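The age-at-hire step can be tried without a Lingual connection. The following is a minimal sketch that assumes a small, made-up data frame standing in for the SQL result set; the rows and the `days_at_hire` helper variable are hypothetical, and only the `BIRTH_DATE`, `HIRE_DATE`, and `hire_age` column names come from the script above. It uses `difftime` with an explicit unit, one way to make the "days" assumption behind the division by 365.25 visible.

```r
# Hypothetical sample rows; the real `df` comes from the Lingual JDBC query.
df <- data.frame(
  FIRST_NAME = c("Gina", "Gina", "Gina"),
  BIRTH_DATE = c("1960-01-01", "1955-06-15", "1962-12-31"),
  HIRE_DATE  = c("1990-01-01", "1985-06-15", "1995-07-01"),
  stringsAsFactors = FALSE
)

# Difference between hire date and birth date, with the unit pinned to days
days_at_hire <- as.numeric(
  difftime(as.Date(df$HIRE_DATE), as.Date(df$BIRTH_DATE), units = "days")
)

# Convert days to approximate years; 365.25 averages in leap years
df$hire_age <- days_at_hire / 365.25

print(summary(df$hire_age))
```

With real query results, the same two assignments apply unchanged, since they depend only on the two date columns being parseable by `as.Date`.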

This shows how a very large data set could be queried to produce a sample, then analyzed, all based on R, JDBC, and SQL. Under the hood, Cascading and Apache Hadoop are doing the heavy lifting to run those queries at scale. Meanwhile, the users, analysts, and
