By defining the two factor matrices $L = US^{1/2}$ and $R = VS^{1/2}$, we easily observe that $\mathrm{trace}(W_1) + \mathrm{trace}(W_2) = \|L\|_F^2 + \|R\|_F^2$ and finally arrive at the optimization problem

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) \\
\text{subject to} \quad & P_\Omega(LR^T) = P_\Omega(A)
\end{aligned} \qquad (8.35)$$
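The identity behind this reformulation can be verified numerically. The following sketch (my own illustration, not from the text) builds $L$ and $R$ from the SVD of a random matrix and checks that $LR^T$ reconstructs $A$ while $\tfrac{1}{2}(\|L\|_F^2 + \|R\|_F^2)$ equals the sum of singular values, i.e., the nuclear norm of $A$:

```python
import numpy as np

# Numerical check of the factorization L = U S^(1/2), R = V S^(1/2):
# L @ R.T reconstructs A, and (1/2)(||L||_F^2 + ||R||_F^2) equals the
# sum of singular values (the nuclear norm of A).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
L = U * np.sqrt(s)          # columns of U scaled by sqrt of singular values
R = Vt.T * np.sqrt(s)

print(np.allclose(L @ R.T, A))                                   # → True
print(np.isclose(0.5 * (np.sum(L**2) + np.sum(R**2)), s.sum()))  # → True
```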
(Please consult [CR08] for a proof of the equivalence of (8.34) and (8.35).)
Since in most practical applications the data is noisy, we will allow some approximation error on the observed entries, replacing (8.35) by the less rigid formulation

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) \\
\text{subject to} \quad & \|P_\Omega(LR^T - A)\|_F^2 \le \sigma
\end{aligned} \qquad (8.36)$$

for a small positive $\sigma$. Thus, in Lagrangian form, we arrive at the formulation

$$\text{minimize} \quad \tfrac{1}{2}\,\|P_\Omega(LR^T - A)\|_F^2 + \lambda\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) \qquad (8.37)$$
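As a sanity check, the objective in (8.37) can be evaluated directly with NumPy. This is a minimal sketch; the names `mmmf_objective`, `omega`, and `lam` are my own, not from the text:

```python
import numpy as np

def mmmf_objective(L, R, A, omega, lam):
    """Value of (8.37): half the squared error on the observed entries
    plus the Frobenius-norm regularizer on both factors."""
    residual = np.where(omega, L @ R.T - A, 0.0)   # P_Omega(L R^T - A)
    return 0.5 * np.sum(residual**2) + lam * (np.sum(L**2) + np.sum(R**2))

# Hypothetical 1x1 example: prediction 1, observed value 2, no regularization.
val = mmmf_objective(np.array([[1.0]]), np.array([[1.0]]),
                     np.array([[2.0]]), np.array([[True]]), lam=0.0)
print(val)   # → 0.5
```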
where $\lambda$ is the regularization parameter that controls the tolerance to noise in the data. Formulation (8.37) is also called maximum-margin matrix factorization (MMMF) and goes back to Nathan Srebro in 2005 [RS05]. Related work was also done by the group of Trevor Hastie [MHT10].
Problem (8.37) can be solved, e.g., by using a simple gradient descent method. Interestingly, the first SVD solution submitted to Netflix (by Simon Funk [Fun06], in 2006) used exactly this approach and achieved considerable progress. The final prize-winning solution was a mixture of hundreds of models, with the SVD playing a crucial role.
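A Funk-style stochastic gradient scheme for (8.37) can be sketched as follows. This is an illustrative implementation under my own naming and hyperparameters (step size, epochs, initialization scale), not the exact values Funk used: for each observed entry, the corresponding rows of $L$ and $R$ are nudged against the local prediction error plus the regularization gradient.

```python
import numpy as np

def sgd_mmmf(A, omega, k=2, lam=0.05, lr=0.01, epochs=200, seed=0):
    """Stochastic gradient steps on (8.37): for each observed entry (i, j),
    update the factor rows against the local error plus regularization."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = 0.1 * rng.standard_normal((m, k))
    R = 0.1 * rng.standard_normal((n, k))
    obs = np.argwhere(omega)          # list of observed index pairs (i, j)
    for _ in range(epochs):
        rng.shuffle(obs)
        for i, j in obs:
            err = A[i, j] - L[i] @ R[j]
            li = L[i].copy()          # use the old L[i] in the R update
            L[i] += lr * (err * R[j] - lam * L[i])
            R[j] += lr * (err * li - lam * R[j])
    return L, R

# Toy usage: recover a rank-1 matrix from roughly 70% of its entries.
rng = np.random.default_rng(1)
A = np.outer([1., 2., 3.], [1., 1., 2.])
omega = rng.random(A.shape) < 0.7
L, R = sgd_mmmf(A, omega, k=1, lam=0.0, lr=0.05, epochs=1000)
print(np.abs((L @ R.T)[omega] - A[omega]).max())
```

With enough epochs, the residual on the observed entries shrinks toward zero, while the unobserved entries are filled in by the low-rank structure.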
After all, one may ask: what is the difference between the formulation (8.37) and the truncated SVD (8.14) that we considered before? This brings us back to the introductory discussion in Sect. 8.1 about unknown values. The answer is that in (8.37), we assume only the entries $A_{ij}$, $(i, j) \in \Omega$, to be known. In contrast, in our previous formulation, we considered all entries $A_{ij}$, $(i, j) \notin \Omega$, to be zero. The Netflix competition, where all entries of $A$ on a test set (of actual ratings of the users) had to be predicted, is obviously a classic matrix completion problem, and hence (8.37) is certainly the right approach.
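The distinction can be made concrete in a few lines of NumPy. In this hypothetical 2×2 example (my own, not from the text), entry (1, 1) is unobserved and the rank-1 factors reproduce all observed entries exactly: the masked objective of (8.37) sees no error, while zero-filling the unknown entry incurs a penalty.

```python
import numpy as np

A = np.array([[4., 2.], [2., 1.]])
omega = np.array([[True, True], [True, False]])   # (1, 1) is unobserved
L = np.array([[2.], [1.]])                        # rank-1 factors: L @ R.T == A
R = np.array([[2.], [1.]])

# (8.37) style: unobserved entries contribute nothing to the error.
masked = np.sum(np.where(omega, L @ R.T - A, 0.0) ** 2)
# Truncated-SVD style: the unobserved entry is treated as a zero rating.
zero_filled = np.sum((L @ R.T - np.where(omega, A, 0.0)) ** 2)

print(masked, zero_filled)   # → 0.0 1.0
```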
In the case of matrix factorization for an actual recommendation task, like that of a user or session matrix in Example 8.1, or even the probability matrix $P$, the discussion is more complex. In fact, we may view all non-visited entries of the matrix as zero, since the user didn't show any interest in them. This justifies our