David claims we can't, because although the capacity of a given computer is growing exponentially, those same computers also make the data. The rate of new data is also growing exponentially. So there are actually two exponential curves, and they won't intersect any time soon.
Let's work through an example to show how hard this gets.
Word Frequency Problem
Say you're told to find the most frequent words in the following list:
red, green, bird, blue, green, red, red.
The easiest approach for this problem is inspection, of course. But now consider the problem for lists containing 10,000, or 100,000, or 10⁹ words.
The simplest approach is to list the words and then count their prevalence. Figure 14-1 shows an example code snippet from the language Go, which David loves and is helping to build (do you build a language?) and design at Google.
Figure 14-1. Example code snippet from the language Go
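The figure itself isn't reproduced here, so the following is only a minimal Go sketch of the approach it illustrates, not the figure's actual code: load every word into a slice, count occurrences in a map, and sort the distinct words by count. The input string and names are made up for illustration.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// countWords builds a map from each word to the number of times it appears.
func countWords(words []string) map[string]int {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}
	return counts
}

func main() {
	// The whole list sits in memory: once as the slice of words, and
	// again as the keys of the count map.
	input := "red, green, bird, blue, green, red, red"
	words := strings.Split(input, ", ")

	counts := countWords(words)

	// Sort the distinct words by descending count to find the most frequent.
	unique := make([]string, 0, len(counts))
	for w := range counts {
		unique = append(unique, w)
	}
	sort.Slice(unique, func(i, j int) bool {
		return counts[unique[i]] > counts[unique[j]]
	})

	for _, w := range unique {
		fmt.Println(w, counts[w])
	}
}

Running this on the list above prints red 3, then green 2, then the remaining words with a count of 1.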
Because counting and sorting are fast, this scales to ~100 million
words. The limit now is computer memory—if you think about it, you
need to get all the words into memory twice: once when you load in
the list of all words, and again when you build a way to associate a
count for each word.
You can modify it slightly so it doesn't have to have all words loaded
in memory—keep them on the disk and stream them in by using a
channel instead of a list. A channel is something like a stream: you
read in the first 100 items, then process them, then you read in the
next 100 items.
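As a rough sketch of that streaming idea (the filename and function name here are invented for illustration), a goroutine can read words off disk and send them down a channel, so the consumer only ever holds one word at a time rather than the whole list:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

// streamWords reads one word per line from the file at path and sends each
// word down a channel, so the full list never has to be loaded as a slice.
func streamWords(path string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		f, err := os.Open(path)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			out <- scanner.Text()
		}
	}()
	return out
}

func main() {
	// "words.txt" is a stand-in filename for this example.
	counts := make(map[string]int)
	for w := range streamWords("words.txt") {
		counts[w]++
	}
	fmt.Println(len(counts), "distinct words")
}

The trade-off is that the count map still grows with the number of distinct words, which is exactly the remaining limit described next.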
But there's still a potential problem, because if every word is unique, and the list is super long, your program will still crash; it will still be constrained by memory.