Gutierrez: What specific tools are you using?
Shellman: I'm writing a lot of Python these days; it's what all our recommendation
algorithms are written in. The Recommendo API is written in Node.js
and hosted on AWS. We use a lot of open source libraries in Python, like
scikit-learn and pandas. As someone who used to work almost exclusively in
R, I find pandas great because it's cheating in a way. It makes Python a lot like R,
so you get to code in Python but keep a lot of the conveniences that we've all
come to expect from R. Of course, you'll also make yourself insane trying to
remember whether it's “len” or “length,” and whether indexing starts at 0 or 1.
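A small sketch of that tradeoff, using invented data (the DataFrame and column names here are hypothetical, purely for illustration): pandas gives Python R-style data-frame filtering and aggregation, while the language itself keeps `len()` and 0-based indexing.

```python
import pandas as pd

# Hypothetical purchase data for illustration only.
df = pd.DataFrame({
    "brand": ["A", "A", "B"],
    "price": [10.0, 12.0, 8.0],
})

# R-style data-frame work: filter rows, group, and summarize.
summary = df[df["price"] > 9].groupby("brand")["price"].mean()

# The gotcha from above: Python is 0-indexed and uses len(),
# where R is 1-indexed and uses length().
first_row = df.iloc[0]
n_rows = len(df)
```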
Gutierrez: What is a specific project you have worked on recently?
Shellman: It's been a little while, but the beauty replenishment project was
a really fun one. The project started out as a tool for beauty stylists but
evolved into a personalized email campaign. Initially we thought it would be
helpful for stylists to know when their clients were running low on product
so they could give them a call and remind them to come into the store. After
early feedback from the stylists that they likely wouldn't use a tool like that,
we found a home for the beauty replenishment work in a personalized email.
I started by analyzing active beauty customers, going through their beauty
transaction histories to understand what they purchased and then estimate
when they would be ready to replenish.
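One simple way to make that kind of estimate (not necessarily the model she used, and the transactions below are invented): take each customer's median time between repeat purchases of a product and project forward from their last purchase.

```python
import pandas as pd

# Hypothetical transaction history: one row per beauty purchase.
tx = pd.DataFrame({
    "customer": ["c1", "c1", "c1", "c2", "c2"],
    "product":  ["lotion", "lotion", "lotion", "serum", "serum"],
    "date": pd.to_datetime(
        ["2013-01-01", "2013-03-01", "2013-05-01", "2013-02-01", "2013-05-01"]
    ),
})

tx = tx.sort_values("date")
# Days between consecutive purchases of the same product by the same customer.
tx["gap_days"] = tx.groupby(["customer", "product"])["date"].diff().dt.days

# The median gap is a crude usage-rate estimate; last purchase plus the
# median gap gives a predicted replenishment date.
est = (
    tx.groupby(["customer", "product"])
      .agg(last_purchase=("date", "max"), median_gap=("gap_days", "median"))
)
est["predicted_replenish"] = (
    est["last_purchase"] + pd.to_timedelta(est["median_gap"], unit="D")
)
```

A real model would also need to handle customers with a single purchase (no gaps to take a median of) and weight recent purchases more heavily.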
The biggest challenge was that beauty products have fast SKU [Stock Keeping
Unit] turnover. For example, say four months ago I bought lotion, and now
there's a new and improved formula. As a customer, when I replenish my
lotion, the new and improved formula is the same product that I bought four
months ago. However, from the manufacturer's perspective, it's a new SKU.
The issue is that if I didn't account for that SKU ancestry in my analysis I'd
miss a lot of replenishment purchases.
I used record linkage to solve the problem. Record linkage is a technique used to
find duplicates in things like census data and medical records. In survey data it's
typical to have typos and variations in name spellings and you want to link those
separate records into a single entry. I was doing the same thing—only instead of
names and addresses, I had brands, categories, and product descriptions. I forced
matching on things like product type and brand, and then used fuzzy string
matching to measure the similarity between product descriptions. For each
candidate pair, my output was a probability that the two items were the same “record.”
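A minimal sketch of that approach, with an invented catalog (the brands and descriptions are hypothetical): exact, “forced” matching on brand and product type prunes the candidate pairs, and a fuzzy similarity score on the descriptions ranks whatever survives. The standard library's `difflib.SequenceMatcher` stands in here for whichever string matcher she actually used.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical catalog entries: (brand, product_type, description).
# The first old SKU and the new SKU describe the same underlying product.
old_skus = [
    ("GlowCo", "lotion", "GlowCo hydrating body lotion 8oz"),
    ("GlowCo", "lotion", "GlowCo exfoliating scrub 6oz"),
]
new_skus = [
    ("GlowCo", "lotion", "GlowCo hydrating body lotion new formula 8oz"),
]

def similarity(a, b):
    """Fuzzy similarity of two product descriptions, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Force exact matching on brand and product type, then fuzzy-match the
# descriptions of the surviving candidate pairs.
candidates = []
for old, new in product(old_skus, new_skus):
    if old[:2] == new[:2]:  # same brand and product type
        candidates.append((old[2], new[2], similarity(old[2], new[2])))

# The highest-scoring pair is the most likely SKU ancestor.
best = max(candidates, key=lambda c: c[2])
```

Scoring only within a brand-and-type block keeps the number of comparisons manageable; fuzzy-matching every SKU against every other SKU would be quadratic in the size of the catalog.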
Going in I didn't know that SKU turnover would be such a large part of the
project. I was green and not familiar with the product catalog and how SKUs
evolve. That made the project fun and challenging.