"traffic_count", "traffic_index", "traffic_class",
"paving_length", "paving_width", "paving_area",
"surface_type")

# Top 20 surface types by frequency
t <- sort(table(d$surface_type), decreasing = TRUE)
roads <- head(as.data.frame.table(t), n = 20)
colnames(roads) <- c("surface_type", "count")
roads

# Top 20 traffic classes by frequency
summary(d$traffic_class)
t <- sort(table(d$traffic_class), decreasing = TRUE)
roads <- head(as.data.frame.table(t), n = 20)
colnames(roads) <- c("traffic_class", "count")
roads

# Distribution of traffic counts: summary statistics and empirical CDF
summary(d$traffic_count)
plot(ecdf(d$traffic_count))

# Histogram scaled to density and shaded by bin count, with a density curve overlaid
library(ggplot2)
m <- ggplot(d, aes(x = traffic_count))
m <- m + ggtitle("Traffic Count Density")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()
Spatial Indexing
Because we are working with GIS data, the attributes that tie together the tree data, road
data, and GPS tracks are the geo coordinates: latitude, longitude, and altitude.
Much of Palo Alto is relatively flat and not far above sea level because it is close to San
Francisco Bay, so to keep this code a bit simpler we can ignore altitude. However, we'll
still need to do large-scale joins and queries based on latitude and longitude. Those are
problematic: the coordinates are represented as decimal values, so matching nearby points
requires range queries rather than equality tests, and both make parallelization difficult
at scale. So we've used a geohash as an approximate location, a kind of bounding box: it
combines the decimal values for latitude and longitude into a single string. Points in the
same box share the same string, which makes joins and queries much simpler and
makes the app more reasonable to parallelize. Effectively, we cut the entire map of Palo
Alto into bounding boxes and then compute for each bounding box in parallel.
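To make the idea of combining latitude and longitude into a string concrete, here is a minimal sketch of the widely used base-32 geohash encoding, written in R. It is not the code used in the sample app; the function name `geohash_encode`, the default precision, and the sample coordinates are assumptions made for illustration.

```r
# Base-32 alphabet used by the common geohash encoding (omits a, i, l, o).
GEOHASH_ALPHABET <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]

# Encode a latitude/longitude pair as a geohash string of `precision` characters.
# (Hypothetical helper, not part of the sample app.)
geohash_encode <- function(lat, lon, precision = 6) {
  lat_range <- c(-90, 90)
  lon_range <- c(-180, 180)
  chars <- character(precision)
  use_lon <- TRUE  # bits alternate between longitude and latitude, longitude first
  for (i in seq_len(precision)) {
    index <- 0
    for (bit in 1:5) {  # five bits per base-32 character
      if (use_lon) {
        mid <- mean(lon_range)
        if (lon >= mid) {
          index <- index * 2 + 1
          lon_range[1] <- mid
        } else {
          index <- index * 2
          lon_range[2] <- mid
        }
      } else {
        mid <- mean(lat_range)
        if (lat >= mid) {
          index <- index * 2 + 1
          lat_range[1] <- mid
        } else {
          index <- index * 2
          lat_range[2] <- mid
        }
      }
      use_lon <- !use_lon
    }
    chars[i] <- GEOHASH_ALPHABET[index + 1]
  }
  paste(chars, collapse = "")
}

# Illustrative coordinates only, roughly in the Palo Alto area.
geohash_encode(37.4449, -122.1601, precision = 6)
```

Each additional character halves the cell several more times, so a shorter geohash is a prefix of a longer one for the same point and names a larger bounding box. Choosing the precision is therefore a trade-off between bucket size and the number of buckets.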
There can be problems with this approach. For instance, what if the center of a road
segment lies right on the boundary between two geohash squares? We might end up with joins
that reference only half the trees near that road segment. There are a number of more
interesting algorithms to use for spatial indexing; R-trees are one common approach. The
general idea would be to join a given road segment with the trees in its bounding box plus
the neighboring bounding boxes, and then apply a better algorithm within those
collections of data. The problem is still reasonably constrained and can be parallelized.
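The neighboring-boxes idea can be sketched in R as follows. To keep the example short, a fixed-size latitude/longitude grid stands in for geohash cells; the cell size, the sample coordinates, and the names `CELL` and `cell_of` are all hypothetical. The sketch compares a join on the road's own cell with a join on that cell plus its eight neighbors.

```r
# Cell size in degrees; an arbitrary value chosen for illustration.
CELL <- 0.001

# Map coordinates to integer (row, col) grid cells (a stand-in for geohash cells).
cell_of <- function(lat, lon) {
  data.frame(row = as.integer(floor(lat / CELL)),
             col = as.integer(floor(lon / CELL)))
}

# Hypothetical data: one road segment and three trees.
road <- data.frame(road_id = "r1", lat = 37.4449, lon = -122.1601,
                   stringsAsFactors = FALSE)
trees <- data.frame(tree_id = c("t1", "t2", "t3"),
                    lat = c(37.4448, 37.4451, 37.4600),
                    lon = c(-122.1602, -122.1599, -122.1300),
                    stringsAsFactors = FALSE)

road_cells <- cbind(road["road_id"], cell_of(road$lat, road$lon))
tree_cells <- cbind(trees["tree_id"], cell_of(trees$lat, trees$lon))

# Join on the road's own cell only: t2 sits just across a cell boundary and is missed.
same_cell_only <- merge(road_cells, tree_cells, by = c("row", "col"))

# Expand the road's cell to the 3x3 block of itself plus its eight neighbors.
offsets <- expand.grid(d_row = -1:1, d_col = -1:1)
expanded <- merge(road_cells, offsets, by = NULL)  # by = NULL gives a cross join
expanded$row <- expanded$row + expanded$d_row
expanded$col <- expanded$col + expanded$d_col

# Join on the expanded block: both nearby trees are found, the distant one is not.
with_neighbors <- merge(expanded[c("road_id", "row", "col")], tree_cells,
                        by = c("row", "col"))

same_cell_only$tree_id
with_neighbors$tree_id
```

The expanded join is still an equality join on cell keys, so it keeps the property that makes the bucket approach easy to parallelize. A more precise distance test could then be applied within each small candidate set.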
In this sample app, we simply consider each geohash value as a kind of “bucket.” Imagine
that all the data points that fall into the same bucket get evaluated together. Figure 8-6
shows how each block of a road is divided into road segments.
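The bucket idea can be illustrated with a small R sketch. The records and geohash strings below are made up for the example, and counting trees stands in for whatever per-bucket evaluation an app would actually perform.

```r
# Hypothetical records, each already tagged with a made-up geohash string.
trees <- data.frame(tree_id = c("t1", "t2", "t3", "t4"),
                    geohash = c("9q9jh0", "9q9jh0", "9q9jh2", "9q9jh2"),
                    stringsAsFactors = FALSE)
roads <- data.frame(road_id = c("r1", "r2"),
                    geohash = c("9q9jh0", "9q9jh2"),
                    stringsAsFactors = FALSE)

# Equality join on the geohash: each road segment meets the trees in its bucket.
joined <- merge(roads, trees, by = "geohash")

# Split the joined rows into one data frame per bucket.
buckets <- split(joined, joined$geohash)

# Evaluate each bucket independently; here the "evaluation" is only a row count.
tree_counts <- vapply(buckets, nrow, integer(1))
tree_counts
```

Because each bucket is evaluated without reference to any other, the per-bucket step is the natural unit of parallel work.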