11
Using R with Large Datasets
The sexy job in the next 10 years will be statisticians.
—Hal Varian
The role of traditional systems administrators is changing. Before the growth of
cloud-computing platforms and distributed data frameworks, sys admins were primarily
concerned with maintaining server hardware. There is still a need for this type
of work, but thanks to hardware virtualization, companies are beginning to build
products that are managed and hosted in the cloud. Companies are purchasing compute
time on clusters of virtual machines. The underlying hardware that supports
these clusters is abstracted away from the customer. As the tooling around scalable
data analysis becomes more mature, applications that process large amounts of data
depend more on distributed-software expertise than on hardware-management skills.
This trend has focused a lot of attention on the role of a new type
of admin, known as a DevOps engineer: in other words, a systems administrator who
focuses on complicated distributed-software systems rather than hardware.
Cloud and data technologies are disrupting many other traditional IT tasks as well.
One job role that doesn't look to be in jeopardy anytime soon is that of the
statistician. In fact, there is a growing need for statistical skills across many job
functions related to data analysis. Because computer scientists are finding new ways of
processing ever-increasing amounts of data, making sense of this data is in greater
demand than ever before.
Solving statistical-analysis challenges can require an expressive language for defining
numeric workflows. The open-source software most commonly used by statisticians
is R. If you are new to the field of data analysis, you have likely heard about
R, as it has become a de facto tool for a wide variety of computational analyses.
Although programming languages such as Python and Julia are gaining in popularity
for numeric computations, R is the reigning champion of the open-source statistics
world. R has a huge user community and many available packages that cover the range
of numeric and visualization tasks. The sheer size of its user base
alone makes R a compelling choice for many organizations. Certainly
very few statisticians have been fired for using R.
On the other hand, R was originally designed to work in single-machine and
single-threaded environments with limited memory. It can be a challenge to use R for
 
 
 