Case Study: City of Palo Alto Open Data - Enterprise Data Workflows with Cascading


"traffic_count", "traffic_index", "traffic_class",
"paving_length", "paving_width", "paving_area",
"surface_type")

# Tabulate the most common surface types
t <- sort(table(d$surface_type), decreasing = TRUE)
roads <- head(as.data.frame.table(t), n = 20)
colnames(roads) <- c("surface_type", "count")
roads

# Tabulate the most common traffic classes
summary(d$traffic_class)
t <- sort(table(d$traffic_class), decreasing = TRUE)
roads <- head(as.data.frame.table(t), n = 20)
colnames(roads) <- c("traffic_class", "count")
roads

# Distribution of traffic counts: summary, ECDF, and density histogram
summary(d$traffic_count)
plot(ecdf(d$traffic_count))
m <- ggplot(d, aes(x = traffic_count))
m <- m + ggtitle("Traffic Count Density")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

Spatial Indexing

Because we are working with GIS data, the attributes that tie together the tree
data, the road data, and the GPS track are the geo coordinates: latitude,
longitude, and altitude. Much of Palo Alto is relatively flat and not far above
sea level because it is close to San Francisco Bay, so to keep this code a bit
simpler we can ignore altitude. However, we'll need to do large-scale joins and
queries based on latitude and longitude. Those are problematic at scale: the
coordinates are represented as decimal values, and range queries will be
required, both of which make parallelization difficult. So we've used a geohash
as an approximate location, a kind of bounding box: it combines the decimal
values for latitude and longitude into a single string. That makes joins and
queries much simpler and makes the app more reasonable to parallelize.
Effectively we cut the entire map of Palo Alto into bounding boxes and then
compute for each bounding box in parallel.
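To make the idea concrete, here is a minimal sketch of the standard base-32 geohash encoding in base R. The function name `geohash_encode` and the default precision of 6 characters are assumptions chosen for illustration; the sample app may compute its geohash differently.

```r
# Minimal sketch of the standard base-32 geohash encoding (base R only).
# `geohash_encode` and its default precision are illustrative assumptions.
geohash_encode <- function(lat, lng, precision = 6) {
  alphabet <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat_range <- c(-90, 90)
  lng_range <- c(-180, 180)
  hash <- character(precision)
  even <- TRUE  # bits alternate between longitude and latitude, longitude first
  for (i in seq_len(precision)) {
    idx <- 0
    for (bit in 1:5) {  # each output character packs 5 bits
      if (even) {
        mid <- mean(lng_range)
        if (lng >= mid) {
          idx <- idx * 2 + 1
          lng_range[1] <- mid
        } else {
          idx <- idx * 2
          lng_range[2] <- mid
        }
      } else {
        mid <- mean(lat_range)
        if (lat >= mid) {
          idx <- idx * 2 + 1
          lat_range[1] <- mid
        } else {
          idx <- idx * 2
          lat_range[2] <- mid
        }
      }
      even <- !even
    }
    hash[i] <- alphabet[idx + 1]  # R vectors are 1-indexed
  }
  paste(hash, collapse = "")
}

# A point in Palo Alto (coordinates approximate)
geohash_encode(37.4419, -122.1430)
```

Two points that produce the same string fall in the same cell, so an equality join on the string can stand in for a range query on two decimal columns.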

There can be problems with this approach. For instance, what if the center of a
road segment lies right on the boundary between two geohash squares? We might
end up with joins that reference only half the trees near that road segment.
There are a number of more interesting algorithms to use for spatial indexing;
R-trees are one common approach. The general idea would be to join a given road
segment with the trees in its bounding box plus the neighboring bounding boxes,
and then apply a better algorithm within those collections of data. The problem
is still reasonably constrained and can be parallelized.
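The neighbor-aware join described above can be sketched as follows. This is an illustration only: it substitutes a uniform latitude/longitude grid for geohash cells to keep neighbor lookup trivial, and the names `cell_size`, `cell_of`, `haversine_m`, `trees_near`, the 25-meter radius, and the sample coordinates are all hypothetical.

```r
# Sketch: gather candidates from a cell plus its 8 neighbors, then refine by
# exact distance. A uniform grid stands in for geohash cells (an assumption).
cell_size <- 0.001  # cell width in degrees, illustrative value

cell_of <- function(lat, lng) {
  data.frame(row = floor(lat / cell_size), col = floor(lng / cell_size))
}

# Great-circle distance in meters between two points
haversine_m <- function(lat1, lng1, lat2, lng2) {
  r <- 6371000
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

trees_near <- function(road_lat, road_lng, trees, max_m = 25) {
  rc <- cell_of(road_lat, road_lng)
  tc <- cell_of(trees$lat, trees$lng)
  # Coarse step: keep trees in the road's cell or any adjacent cell
  candidate <- abs(tc$row - rc$row) <= 1 & abs(tc$col - rc$col) <= 1
  cand <- trees[candidate, ]
  # Fine step: exact distance within the candidate set
  cand$dist_m <- haversine_m(road_lat, road_lng, cand$lat, cand$lng)
  cand[cand$dist_m <= max_m, ]
}

# Hypothetical data: the road point sits just above a cell boundary
trees <- data.frame(
  id  = c("A", "B", "C", "D"),
  lat = c(37.41995, 37.42010, 37.43000, 37.42001),
  lng = c(-122.14050, -122.14055, -122.14050, -122.13960),
  stringsAsFactors = FALSE
)
near <- trees_near(37.42001, -122.14050, trees)
near$id
```

In this made-up data, tree A lies a few meters from the road point but in the adjacent cell, so a join restricted to the road's own cell would miss it. The coarse step still touches only nine cells per road segment, which keeps the work bounded and easy to partition.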

In this sample app, we simply consider each geohash value as a kind of “bucket.”
Imagine that all the data points that fall into the same bucket get evaluated
together. Figure 8-6 shows how each block of a road is divided into road
segments.
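As a rough illustration of the bucket idea, the following base R sketch groups records by geohash and evaluates each group independently. The data frame, its column names, and the geohash values are hypothetical, not taken from the Palo Alto data.

```r
# Hypothetical records already tagged with a geohash (values are made up)
points <- data.frame(
  geohash = c("9q9jh0", "9q9jh0", "9q9jh1", "9q9jh0", "9q9jh1"),
  kind    = c("tree", "road", "tree", "tree", "road"),
  stringsAsFactors = FALSE
)

# One bucket per geohash value
buckets <- split(points, points$geohash)

# Evaluate each bucket on its own; here we just count the trees in it
tree_counts <- vapply(buckets, function(b) sum(b$kind == "tree"), integer(1))
tree_counts
```

Because no bucket depends on any other, the per-bucket function could in principle be handed to a parallel apply such as `parallel::mclapply` without changing its logic.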
