# keep only rows with valid coordinates
add_use <- add_use[!is.na(add_use$latitude), ]

# map coordinates: keep the columns needed for plotting and classification
# (columns 3 and 4 of the result are the location coordinates)
map_coords <- add_use[, c(2, 4, 26, 27)]
table(map_coords$neighborhood)
map_coords$neighborhood <- as.factor(map_coords$neighborhood)

# plot the points on a map, colored by neighborhood
geoPlot(map_coords, zoom = 12, color = map_coords$neighborhood)

## k-nearest neighbors -- there are more efficient ways of doing this,
## but oh well...
library(class)  # provides knn()

# encode the neighborhood labels as a numeric class column
map_coords$class <- as.numeric(map_coords$neighborhood)

# 80/20 train/test split
n_cases <- dim(map_coords)[1]
split <- 0.8
train_inds <- sample.int(n_cases, floor(split * n_cases))
test_inds <- (1:n_cases)[-train_inds]

# try k = 1, ..., k_max and record the test misclassification rate for each
k_max <- 10
knn_pred <- matrix(NA, ncol = k_max, nrow = length(test_inds))
knn_test_error <- rep(NA, times = k_max)
for (i in 1:k_max) {
  knn_pred[, i] <- knn(map_coords[train_inds, 3:4],
                       map_coords[test_inds, 3:4],
                       cl = map_coords[train_inds, 5], k = i)
  knn_test_error[i] <- sum(knn_pred[, i] !=
                             map_coords[test_inds, 5]) / length(test_inds)
}

# plot test error as a function of k
plot(1:k_max, knn_test_error)
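To read a reasonable choice of k off this error curve, the smallest test error can be picked out directly. A minimal addition, not part of the original listing:

# pick the k with the lowest test misclassification rate
best_k <- which.min(knn_test_error)
knn_test_error[best_k]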
Modeling and Algorithms at Scale
The data you've been dealing with so far in this chapter has been pretty
small on the Big Data spectrum. What happens to these models and
algorithms when you have to scale up to massive datasets?
In some cases, it's entirely appropriate to sample and work with a
smaller dataset, or to run the same model across multiple sharded
datasets. (Sharding is where the data is broken up into pieces and
divided among different machines; you then look at the empirical
distribution of the estimators across the per-shard models. A small
sketch of this idea follows below.) In other words, there are
statistical solutions to these engineering challenges.
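As a rough sketch of the sharding idea, the snippet below fits the same linear model on each of several random shards and looks at the empirical distribution of the slope estimates across shards. The simulated data, the number of shards, and the variable names are illustrative assumptions, not part of the original text.

# sketch: fit the same model on each shard, then compare the estimates
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 * x + rnorm(n)             # simulated data with true slope 2
dat <- data.frame(x = x, y = y)

n_shards <- 10                    # assumption: 10 shards for illustration
shard_id <- sample(rep(1:n_shards, length.out = n))

# one slope estimate per shard
shard_slopes <- sapply(1:n_shards, function(s) {
  coef(lm(y ~ x, data = dat[shard_id == s, ]))["x"]
})

# empirical distribution of the estimator across shards
summary(shard_slopes)
hist(shard_slopes)

With many machines, each shard's fit is cheap, and the spread of the per-shard estimates gives a sense of how stable the estimator is.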
However, in some cases we want to fit these models at scale, and the
challenge of scaling up models generally translates to the challenge of
 