Database Reference
In-Depth Information
Note that numerous firms are developing and offering various commercial extensions
for Hadoop, making the platform even more resilient, strong, slick, and fast.
So there, in a nutshell is Hadoop. Data Science simply can't be practiced without it any
more than the Kentucky Derby can be practiced without a racetrack.
Of equal interest, and nearly equal importance, is the open-source programming lan-
guage R.
A product of the R Project for Statistical Computing, the R programming language is a
profoundly robust tool for statistical computing and graphics - one that is rapidly becoming
indispensable for Data Scientists everywhere. The language easily compiles on platforms
ranging from UNIX to MacOS to Windows.
A “child” of the S language originally developed by John Chambers and his team at
the old Bell Laboratories (now Lucent Technologies), R includes many of the characterist-
ics of S, but with substantial enhancements. Some might even consider this a “dialect” of
the S language.
R delivers a range of elegant tools for classical statistical tests, linear and nonlinear
modelling, clustering, and time-series analysis, as well as robust graphics tools. This latter
attribute is most important. R allows Data Scientists to produce well-designed plots (in-
cluding both mathematical symbols and formulae) with ease while still maintaining com-
plete control of the process.
R's sophisticated, integrated suite of software facilities includes strong tools for data
manipulation and handling, powerful operators for array calculations in particularized
matrices, integrated coherent tools for data analysis, and an expansive programming lan-
guage environment which includes loops, conditionals, input/output facilities, and user-
defined recursive functions. Advanced programmers engaged in tasks which might be de-
scribed as “computationally-intensive” can create and link Fortran, C, or C++ code callable
at run time, or even C code designed to directly manipulate R objects.
Although such languages as Matlab/Octave and Mathematica/Sage are the choice of
some Data Scientists, the second most popular language for practice of the art (after R) is
Python. Although a relic of the 1990s, Python remains a vital high-level open-source solu-
tion offering support for object-oriented, imperative, functional, and procedural styles of
programming.
Why is Python particularly valuable to Data Scientists?
As programmer and teacher Foster Provost explains: “The practice of Data Science
involves many interrelated but different activities, including accessing data, manipulating
data, computing statistics about data, plotting/graphing/visualizing data, building predict-
ive and explanatory models from data, evaluating those models on yet more data, integ-
Search WWH ::




Custom Search