Assume for now we only modify with determinant 1 matrices G; i.e.,
we restrict ourselves to volume-preserving transformations. If we ig‐
nore for now the size of the entries of V and concentrate only on the
size of the entries of U, we are minimizing the surface area of a
d-dimensional parallelepiped in n-space (specifically, the one generated
by the columns of U) where the volume is fixed. This is achieved by
making the sides of the parallelepiped mutually orthogonal, which is
the same as saying the latent features will be uncorrelated.
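One way to make that step precise (our gloss, using the sum of squared entries as the measure of size) combines the AM-GM inequality with Hadamard's inequality. Writing u_1, ..., u_d for the columns of U,

$$\sum_{j=1}^{d} \lVert u_j \rVert^2 \;\ge\; d\Bigl(\prod_{j=1}^{d} \lVert u_j \rVert^2\Bigr)^{1/d} \;\ge\; d\,\det(U^{\top}U)^{1/d},$$

and det(U^T U) is the squared volume of the parallelepiped, which is being held fixed. The minimum is reached exactly when the columns are mutually orthogonal (equality in Hadamard) and of equal length (equality in AM-GM).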
But don't forget, we've ignored V! However, it turns out that V's rows
will also be mutually orthogonal when we force U's columns to be.
This is not hard to see if you keep in mind that X has its SVD as discussed
previously. In fact, the SVD and this form U · V have a lot in common,
and some people just call this an SVD algorithm, even though it's not
quite the same thing.
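As a quick illustration (our code, not from the book), you can build a rank-d factorization X ≈ U · V directly from the truncated SVD and check that U's columns and V's rows come out mutually orthogonal:

import numpy as np

# Split the singular values evenly between the two factors so that
# X is approximately U @ V, with U of shape (n, d) and V of shape (d, m).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))        # toy n x m data matrix
d = 5                                      # number of latent features

U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :d] * np.sqrt(s[:d])
V = np.sqrt(s[:d])[:, None] * Vt_full[:d, :]

# Both Gram matrices are diagonal (up to rounding): U's columns are
# mutually orthogonal, and so are V's rows.
print(np.round(U.T @ U, 6))
print(np.round(V @ V.T, 6))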
Now we allow modifications with nontrivial determinant; so, for ex‐
ample, let G be some scaled version of the identity matrix. Then, if we
do a bit of calculus, it turns out that the best choice of scalar (i.e., the
one minimizing the sum of the squares of the entries of U and of V)
equalizes those two sums, and the minimized average is in fact the
geometric mean of those two quantities, which is cool. In other
words, we're minimizing the arithmetic mean of them with a single
parameter (the scalar), and the answer is the geometric mean.
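To spell out that bit of calculus (our notation): write a for the sum of the squared entries of U and b for the sum of the squared entries of V, and let the scalar c rescale U to cU and V to V/c. Then we are minimizing

$$f(c) = \frac{c^2 a + b/c^2}{2}, \qquad f'(c) = ca - \frac{b}{c^3} = 0 \;\Rightarrow\; c^4 = \frac{b}{a},$$

at which point the two terms are equal and the minimum value is $\sqrt{ab}$, the geometric mean of a and b.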
So that's the proof. Believe us?
Alternating Least Squares
But how do you do this? How do you actually find U and V ? In reality,
as you will see next, you're not first minimizing the squared error and
then minimizing the size of the entries of the matrices U and V . You're
actually doing both at the same time.
So your goal is to find U and V by solving the optimization problem
described earlier. This optimization doesn't have a nice closed formula
like ordinary least squares with one set of coefficients. Instead, you
need an iterative algorithm like gradient descent. As long as your
problem is convex you'll converge OK (i.e., you won't find yourself at
a local, but not global, minimum), and you can force your problem to
be convex using regularization.
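For concreteness, one standard way to write such a regularized objective (our notation; not necessarily the exact form given earlier) is

$$\min_{U,V} \sum_{(i,j)\ \mathrm{observed}} \bigl(x_{i,j} - u_i \cdot v_j\bigr)^2 \;+\; \lambda\bigl(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\bigr),$$

where u_i is the ith row of U, v_j is the jth column of V, and λ > 0 controls how hard you penalize large entries.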
Here's the algorithm:
• Pick a random V .
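In outline, the remaining steps alternate: hold V fixed and solve a least squares problem for U, then hold U fixed and solve for V, and repeat until the factors stop changing much. A minimal NumPy sketch of that loop (our code, assuming a fully observed matrix X and a ridge penalty lam; real ratings data is sparse, so each update would use only the observed entries) might look like:

import numpy as np

def als(X, d, lam=0.1, n_iters=20, seed=0):
    """Toy alternating least squares for X ~ U @ V (fully observed X)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    V = rng.standard_normal((d, m))              # step 1: pick a random V
    U = np.zeros((n, d))
    for _ in range(n_iters):
        # Hold V fixed; every row of U is a small ridge-regression solve.
        U = np.linalg.solve(V @ V.T + lam * np.eye(d), V @ X.T).T
        # Hold U fixed; every column of V is a small ridge-regression solve.
        V = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ X)
    return U, V

# Tiny usage example on synthetic low-rank data.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 30))
U, V = als(X, d=5)
print(np.linalg.norm(X - U @ V) / np.linalg.norm(X))   # small relative error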