David claims we can't, because although the capacity of a given computer is growing exponentially, those same computers also make the data. The rate of new data is also growing exponentially. So there are actually two exponential curves, and they won't intersect any time soon.
Let's work through an example to show how hard this gets.
Word Frequency Problem
Say you're told to find the most frequent words in the following list:
red, green, bird, blue, green, red, red.
The easiest approach for this problem is inspection, of course. But now consider the problem for lists containing 10,000, or 100,000, or 10⁹ words.
The simplest approach is to list the words and then count their prevalence. Figure 14-1 shows an example code snippet from the language Go, which David loves and is helping to build (do you build a language?) and design at Google.
Figure 14-1. Example code snippet from the language Go
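The figure itself isn't reproduced here, so the following is only a minimal Go sketch of the approach it illustrates, not the figure's actual code: load every word into a slice, count occurrences in a map, and sort the distinct words by count. The input string and names are made up for illustration.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// countWords builds a map from each word to the number of times it appears.
func countWords(words []string) map[string]int {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}
	return counts
}

func main() {
	// The whole list sits in memory: once as the slice of words, and
	// again as the keys of the count map.
	input := "red, green, bird, blue, green, red, red"
	words := strings.Split(input, ", ")

	counts := countWords(words)

	// Sort the distinct words by descending count to find the most frequent.
	unique := make([]string, 0, len(counts))
	for w := range counts {
		unique = append(unique, w)
	}
	sort.Slice(unique, func(i, j int) bool {
		return counts[unique[i]] > counts[unique[j]]
	})

	for _, w := range unique {
		fmt.Println(w, counts[w])
	}
}

Running this on the list above prints red 3, then green 2, then the remaining words with a count of 1.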
Because counting and sorting are fast, this scales to ~100 million
words. The limit now is computer memory—if you think about it, you
need to get all the words into memory twice: once when you load in
the list of all words, and again when you build a way to associate a
count for each word.
You can modify it slightly so it doesn't have to have all words loaded
in memory—keep them on the disk and stream them in by using a
channel instead of a list. A channel is something like a stream: you
read in the first 100 items, then process them, then you read in the
next 100 items.
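As a rough sketch of that streaming idea (the filename and function name here are invented for illustration), a goroutine can read words off disk and send them down a channel, so the consumer only ever holds one word at a time rather than the whole list:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// streamWords reads one word per line from the file at path and sends each
// word down a channel, so the full list never has to be loaded as a slice.
func streamWords(path string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		f, err := os.Open(path)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			out <- scanner.Text()
		}
	}()
	return out
}

func main() {
	// "words.txt" is a stand-in filename for this example.
	counts := make(map[string]int)
	for w := range streamWords("words.txt") {
		counts[w]++
	}
	fmt.Println(len(counts), "distinct words")
}

The trade-off is that the count map still grows with the number of distinct words, which is exactly the remaining limit described next.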
But there's still a potential problem, because if every word is unique, and the list is super long, your program will still crash; it will still be constrained by memory.