added wrinkle that complicates the decision process greatly: the frequent need to deal
with hardware.
Data scientists must be proficient with a number of different technologies to gain
value from data. We are still in the early days of data-science technology. Open-source
software has helped large-scale data technology become more accessible, but as is
the nature of this type of software, there are currently many overlapping projects in
various states of maturity. In a more mature market, such as the traditional relational
database industry, there is a huge number of commercial products to choose
from. However, the world of MapReduce frameworks and nonrelational databases
hasn't quite reached the same point yet.
Another characteristic of a more mature technology market is a pool of
customers who perceive little risk in adopting its technologies.
Currently in the data space, there are many early adopters who will try just about
anything, while others are simply trying to make sense of the hype. As a result, some
organizations take the plunge and build solutions with whichever open-source
software is available, while others wait on the sidelines as spectators.
The current state of data technologies mirrors the multifaceted skill sets that data
scientists are required to have. Some technologies require infrastructure skills, includ-
ing tasks such as hardware monitoring and log management. In other engineering
professions, proficiency with a particular type of software is an important skill. Some-
times, a strong theoretical background is the primary skill needed for success. In the
world of large-scale data analysis, often all three are expected.
Some of the technologies featured in this book are developer tools: they are
designed for building other software. Other technologies we cover are essentially
interfaces aimed at data analysts, not developers. Still others are a combination
of the two, requiring analysts to write scripts that define systems for processing data.
Data technology is currently in a state of flux, with different aspects of these three
pillars maturing at different rates. As a data scientist, it's perfectly reasonable to find
yourself in a situation in which there is no obvious solution, whether commercial or
not, for solving a data challenge.
Another consequence of the organic growth of data technology is that different
software projects can often address very similar use cases. A great example in this space
is the choice of R versus Python for scientific computing (technologies that we cover
in Chapter 11, “Using R with Large Datasets,” and Chapter 12, “Building Analytics
Workflows Using Python and Pandas”). R is an extremely popular programming
language for statistical and mathematical computing. Python is an extremely popular
language for general-purpose programming. Both R and Python can be used for sci-
entific and statistical computing, but currently R is more mature in this space and is
more likely to have a greater number of available modules and libraries for specific
tasks. Both can be used for general-purpose programming, but it's difficult to argue
that R would make a better choice than Python for this purpose. Choosing one over
the other depends on a range of factors, including your available personnel. In addi-
tion, within the statistical space, there are numerous commercial software packages
available, such as SAS or MATLAB, further complicating software decision making.
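To make the overlap concrete: the kind of descriptive statistics that R handles natively is also a one-liner in Python. A minimal sketch, using only the standard library and made-up sample values (larger workloads would reach for pandas or NumPy, as Chapter 12 discusses):

```python
# Toy illustration: basic descriptive statistics in Python's standard
# library, a task equally at home in R (mean(x) and sd(x)).
import statistics

# Hypothetical sample measurements, invented for this example.
measurements = [2.3, 4.1, 3.8, 5.0, 4.4, 3.2]

mean = statistics.mean(measurements)
stdev = statistics.stdev(measurements)  # sample standard deviation (n - 1)

print(f"mean={mean:.2f} stdev={stdev:.2f}")  # prints "mean=3.80 stdev=0.95"
```

For simple summaries like this the two languages are interchangeable; the choice tends to hinge on the surrounding workflow and the libraries a team already knows.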
 