head(df)
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
# uncomment next line the first time
#install.packages("ggplot2")
library(ggplot2)
m <- ggplot(df, aes(x = hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth = 1, aes(y = ..density.., fill = ..count..)) + geom_density()
That R script first sets up a JDBC connection in Lingual. Then it runs the same query
we used in the SQL command shell to list records for employees named Gina. Next, the
script calculates age (in years) at time of hire for employees in the SQL result set. Then
it calculates summary statistics and visualizes the age distribution, shown in Figure 6-5:
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
Figure 6-5. R data visualization
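The age-at-hire step can be tried without a Lingual connection. The following is only a minimal sketch: the three rows are hypothetical stand-ins for the SQL result set, not records from the employee data, and the column names simply mirror those used in the script above. It uses difftime() with explicit units instead of as.integer(), which is a stylistic choice here rather than the book's approach.

```r
# Hypothetical rows standing in for the SQL result set (not real data).
df <- data.frame(
  FIRST_NAME = c("Gina", "Gina", "Gina"),
  BIRTH_DATE = c("1960-01-01", "1955-06-15", "1962-03-10"),
  HIRE_DATE  = c("1990-01-01", "1985-06-15", "1987-03-10"),
  stringsAsFactors = FALSE
)

# Elapsed time between birth and hire, with the unit stated explicitly.
days_to_hire <- as.numeric(
  difftime(as.Date(df$HIRE_DATE), as.Date(df$BIRTH_DATE), units = "days")
)

# Convert days to years; 365.25 approximates the average year length.
df$hire_age <- days_to_hire / 365.25

summary(df$hire_age)
```

Dividing by 365.25 is an approximation, so an employee hired exactly on a birthday can come out a fraction of a day above or below a whole number of years.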
This shows how a very large data set could be queried to produce a sample, which is then
analyzed, all based on R, JDBC, and SQL. Under the hood, Cascading and Apache Hadoop are
doing the heavy lifting to run those queries at scale. Meanwhile, the users, analysts, and
 