added wrinkle that complicates the decision process greatly: the frequent need to deal
with hardware.
Data scientists must be proficient with a number of different technologies to gain
value from data. We are still in the early days of data-science technology. Open-source
software has helped large-scale data technology become more accessible, but as is
the nature of this type of software, there are currently many overlapping projects in
various states of maturity. In a more mature market, such as the traditional relational
database industry, there is a huge number of commercial products to choose
from. However, the world of MapReduce frameworks and nonrelational databases
hasn't quite reached the same point yet.
Another characteristic of a more mature technology market is a pool of
customers who perceive little risk in adopting its technologies.
Currently in the data space, there are many early adopters who will try just about
anything, while others are simply trying to make sense of the hype. As a result, some
organizations take the plunge and build solutions with whichever open-source
software is available, while others wait on the sidelines as spectators.
The current state of data technologies mirrors the multifaceted skill sets that data
scientists are required to have. Some technologies require infrastructure skills, includ-
ing tasks such as hardware monitoring and log management. In other engineering
professions, proficiency with a particular type of software is an important skill. Some-
times, a strong theoretical background is the primary skill needed for success. In the
world of large-scale data analysis, often all three are expected.
Some of the technologies featured in this book are developer tools: they are
designed for building other software. Other technologies we cover are essentially
interfaces aimed at data analysts, not developers. Still others are a combination
of the two, requiring analysts to write scripts that define systems for processing data.
Data technology is currently in a state of flux, with different aspects of these three
pillars maturing at different rates. As a data scientist, it's perfectly reasonable to find
yourself in a situation in which there is no obvious solution, whether commercial or
not, for solving a data challenge.
Another consequence of the organic growth of data technology is that different
software projects can often address very similar use cases. A great example in this space
is the choice of R versus Python for scientific computing (technologies that we cover
in Chapter 11, “Using R with Large Datasets,” and Chapter 12, “Building Analytics
Workflows Using Python and Pandas”). R is an extremely popular programming
language for statistical and mathematical computing. Python is an extremely popular
language for general-purpose programming. Both R and Python can be used for sci-
entific and statistical computing, but currently R is more mature in this space and is
more likely to have a greater number of available modules and libraries for specific
tasks. Both can be used for general-purpose programming, but it's difficult to argue
that R would make a better choice than Python for this purpose. Choosing one over
the other depends on a range of factors, including your available personnel. In addi-
tion, within the statistical space, there are numerous commercial software packages
available, such as SAS or MATLAB, further complicating software decision making.
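To make the overlap concrete: the kind of descriptive statistics that R handles natively is also a one-liner in Python. A minimal sketch, using only the standard library and made-up sample values (larger workloads would reach for pandas or NumPy, as Chapter 12 discusses):

```python
# Toy illustration: basic descriptive statistics in Python's standard
# library, a task equally at home in R (mean(x) and sd(x)).
import statistics

# Hypothetical sample measurements, invented for this example.
measurements = [2.3, 4.1, 3.8, 5.0, 4.4, 3.2]

mean = statistics.mean(measurements)
stdev = statistics.stdev(measurements)  # sample standard deviation (n - 1)

print(f"mean={mean:.2f} stdev={stdev:.2f}")  # prints "mean=3.80 stdev=0.95"
```

For simple summaries like this the two languages are interchangeable; the choice tends to hinge on the surrounding workflow and the libraries a team already knows.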
 