# keep only rows with valid coordinates
add_use <- add_use[!is.na(add_use$latitude), ]

# map coordinates: keep the columns needed for plotting and classification
# (columns 3 and 4 of the result are the location coordinates)
map_coords <- add_use[, c(2, 4, 26, 27)]
table(map_coords$neighborhood)
map_coords$neighborhood <- as.factor(map_coords$neighborhood)

# plot the points on a map, colored by neighborhood
geoPlot(map_coords, zoom = 12, color = map_coords$neighborhood)

## k-nearest neighbors -- there are more efficient ways of doing this,
## but oh well...
library(class)  # provides knn()

# encode the neighborhood labels as a numeric class column
map_coords$class <- as.numeric(map_coords$neighborhood)

# 80/20 train/test split
n_cases <- dim(map_coords)[1]
split <- 0.8
train_inds <- sample.int(n_cases, floor(split * n_cases))
test_inds <- (1:n_cases)[-train_inds]

# try k = 1, ..., k_max and record the test misclassification rate for each
k_max <- 10
knn_pred <- matrix(NA, ncol = k_max, nrow = length(test_inds))
knn_test_error <- rep(NA, times = k_max)
for (i in 1:k_max) {
  knn_pred[, i] <- knn(map_coords[train_inds, 3:4],
                       map_coords[test_inds, 3:4],
                       cl = map_coords[train_inds, 5], k = i)
  knn_test_error[i] <- sum(knn_pred[, i] !=
                             map_coords[test_inds, 5]) / length(test_inds)
}

# plot test error as a function of k
plot(1:k_max, knn_test_error)
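To read a reasonable choice of k off this error curve, the smallest test error can be picked out directly. A minimal addition, not part of the original listing:

# pick the k with the lowest test misclassification rate
best_k <- which.min(knn_test_error)
knn_test_error[best_k]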
Modeling and Algorithms at Scale
The data you've been dealing with so far in this chapter has been pretty
small on the Big Data spectrum. What happens to these models and
algorithms when you have to scale up to massive datasets?
In some cases, it's entirely appropriate to sample and work with a
smaller dataset, or to run the same model across multiple sharded
datasets. (Sharding is where the data is broken up into pieces and
divided among different machines; you then look at the empirical
distribution of the estimators across the per-shard models. A small
sketch of this idea follows below.) In other words, there are
statistical solutions to these engineering challenges.
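As a rough sketch of the sharding idea, the snippet below fits the same linear model on each of several random shards and looks at the empirical distribution of the slope estimates across shards. The simulated data, the number of shards, and the variable names are illustrative assumptions, not part of the original text.

# sketch: fit the same model on each shard, then compare the estimates
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 * x + rnorm(n)             # simulated data with true slope 2
dat <- data.frame(x = x, y = y)

n_shards <- 10                    # assumption: 10 shards for illustration
shard_id <- sample(rep(1:n_shards, length.out = n))

# one slope estimate per shard
shard_slopes <- sapply(1:n_shards, function(s) {
  coef(lm(y ~ x, data = dat[shard_id == s, ]))["x"]
})

# empirical distribution of the estimator across shards
summary(shard_slopes)
hist(shard_slopes)

With many machines, each shard's fit is cheap, and the spread of the per-shard estimates gives a sense of how stable the estimator is.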
However, in some cases we want to fit these models at scale, and the
challenge of scaling up models generally translates to the challenge of
 