source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor.
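For context, caching is enabled by calling cache (or persist) on an RDD before the first action. The following minimal sketch illustrates this; the application name, master URL, and file path are assumptions for illustration, not taken from the text above:
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup only: app name, master, and file path are assumptions.
val conf = new SparkConf().setAppName("RDDCachingExample").setMaster("local[2]")
val sc = new SparkContext(conf)

// Read a text file into an RDD and mark it for in-memory caching.
// cache() is lazy: nothing is read or stored until an action runs.
val rddFromTextFile = sc.textFile("data.txt")
rddFromTextFile.cache()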
If we now call an action such as count or sum on our cached RDD, we will see that the RDD is loaded into memory:
val aveLengthOfRecordChained =
  rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count
Indeed, in the following output, we see that the dataset was cached in memory on the first call, taking up approximately 62 KB and leaving us with around 297 MB of memory free:
...
14/01/30 06:59:27 INFO MemoryStore: ensureFreeSpace(63454) called with curMem=32960, maxMem=311387750
14/01/30 06:59:27 INFO MemoryStore: Block rdd_2_0 stored as values to memory (estimated size 62.0 KB, free 296.9 MB)
14/01/30 06:59:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_0 in memory on 10.0.0.3:55089 (size: 62.0 KB, free: 296.9 MB)
...
Now, we will call the same function again:
val aveLengthOfRecordChainedFromCached =
  rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count
We will see from the console output that the cached data is read directly from memory:
...
14/01/30 06:59:34 INFO BlockManager: Found block rdd_2_0 locally
...
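As a brief follow-up sketch (these calls are not part of the excerpt above), the storage level of the cached RDD can be inspected, and its blocks released once they are no longer needed:
// Check how the RDD is stored (cache() corresponds to MEMORY_ONLY).
println(rddFromTextFile.getStorageLevel)

// Drop the cached blocks from memory when they are no longer required.
rddFromTextFile.unpersist()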