Database Reference
In-Depth Information
Most Popular Website Items
A commerce website, such as Amazon.com or Etsy.com , usually has a
most popular items sidebar. These websites often have thousands of
products available in each department. To maintain a top-K list in real
time, the sites would have to maintain the current number of purchases
for each of those items in some accessible way. For Amazon.com , which
has so much hardware that it started a successful side business renting
it out, this may not be an issue, but it is very expensive to maintain for a
relatively small feature.
Using the top-K list approach in the last section, it is possible to cheaply
implement an approximate top-10 most popular item list using only a
few kilobytes of RAM. To improve the accuracy, top-K lists could be
maintained for every department on the website so that each page has a
unique top-K list.
Range and Quantile Queries
Many times, the data being stored has some sort of natural ordering. When
this happens it is natural to think of the data as having an empirical
distribution (or histogram depending on the discipline), for example, the
prices paid (in cents) for a particular advertising unit on a website or the
number of seconds a user spent on a given page before taking an action.
In these cases, the questions of interest are usually concerned one way or
another with a range query of some kind: What fraction of users spent
less than 10 seconds on this page? What is the median sale price of this
advertising unit? The naïve answer to the first question is to simply sum
the point estimates from 0 to 10 seconds and report the total frequency. To
answer the question of the median time, you would begin calculating the
frequencyofarangestartingfrom$0.01untilthesumofthepointestimates
was approximately 50 percent of the total number of elements seen so far.
This naïve approach has two problems. The first is obvious in the second
question: The number of operations required could be very large. If the
median price is $15 then 1,500 point estimates are required, all using k hash
calculationsandsoon.Lessobviousisthesecondproblem:Theseindividual
Search WWH ::




Custom Search