Each one represents one (simulated) day's worth of ads shown and
clicks recorded on the New York Times home page in May 2012. Each
row represents a single user. There are five columns: age, gender
(0=female, 1=male), number of impressions, number of clicks, and
logged-in.
You'll be using R to handle these data. It's a programming language
designed specifically for data analysis, and it's pretty intuitive to start
using. You can download it here. Once you have it installed, you can
load a single file into R with this command:
data1 <- read.csv(url("http://stat.columbia.edu/~rachel/datasets/nyt1.csv"))
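Before any analysis, it helps to take a quick look at what was loaded. The following is a minimal sketch, not part of the original exercise: it builds a tiny hypothetical stand-in for data1 so it runs without a network connection, and the column names (Age, Gender, Impressions, Clicks, Signed_In) are assumptions that should be checked against the real file.

```r
# Hypothetical stand-in for data1 so the sketch runs offline.
# The column names are assumed, not confirmed by the text.
data1 <- data.frame(
  Age         = c(36, 73, 30, 0, 47),
  Gender      = c(0, 1, 0, 0, 1),
  Impressions = c(3, 3, 3, 5, 11),
  Clicks      = c(0, 0, 0, 1, 0),
  Signed_In   = c(1, 1, 1, 0, 1)
)

str(data1)      # column types and a preview of the values
head(data1)     # first few rows
summary(data1)  # per-column min, quartiles, mean, and max

dim_info <- dim(data1)  # number of rows, then number of columns
print(dim_info)
```

With the real file, the same four calls apply directly to the data1 object created by read.csv.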
Once you have the data loaded, it's time for some EDA:
1. Create a new variable, age_group, that categorizes users as "<18",
"18-24", "25-34", "35-44", "45-54", "55-64", and "65+".
2. For a single day:
• Plot the distributions of number of impressions and
click-through rate (CTR = # clicks / # impressions) for these seven
age categories.
• Define a new variable to segment or categorize users based on
their click behavior.
• Explore the data and make visual and quantitative comparisons
across user segments/demographics (<18-year-old males versus
<18-year-old females, or logged-in versus not, for example).
• Create metrics/measurements/statistics that summarize the data.
Examples of potential metrics include CTR, quantiles, mean,
median, variance, and max, and these can be calculated across
the various user segments. Be selective. Think about what will
be important to track over time—what will compress the data,
but still capture user behavior.
3. Now extend your analysis across days. Visualize some metrics and
distributions over time.
4. Describe and interpret any patterns you find.
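One way to approach steps 1 through 3 is sketched below. This is not the authors' solution: it runs on simulated data so it is self-contained, the column names (Age, Gender, Impressions, Clicks, Signed_In) are assumptions about the file layout, and simulate_day, add_features, and click_segment are hypothetical names introduced only for illustration.

```r
# Simulate one day of data as a stand-in for a real file.
# Column names are assumptions; check them against the actual CSV.
simulate_day <- function(n, seed) {
  set.seed(seed)
  impressions <- rpois(n, lambda = 5)
  data.frame(
    Age         = sample(10:80, n, replace = TRUE),
    Gender      = sample(0:1, n, replace = TRUE),
    Impressions = impressions,
    Clicks      = rbinom(n, size = impressions, prob = 0.05),
    Signed_In   = 1L
  )
}

age_labels <- c("<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+")

add_features <- function(d) {
  # Step 1: bucket ages. Intervals are (lower, upper], so 17 closes "<18".
  # If the real file codes unknown ages as 0, they would land in "<18".
  d$age_group <- cut(d$Age,
                     breaks = c(-Inf, 17, 24, 34, 44, 54, 64, Inf),
                     labels = age_labels)
  # CTR is undefined for users who saw no ads, so mark it NA.
  d$CTR <- ifelse(d$Impressions > 0, d$Clicks / d$Impressions, NA)
  # One possible click-behavior segment (hypothetical definition).
  d$click_segment <- factor(
    ifelse(d$Impressions == 0, "no impressions",
           ifelse(d$Clicks == 0, "no clicks", "clicked")),
    levels = c("no impressions", "no clicks", "clicked"))
  d
}

data1 <- add_features(simulate_day(500, seed = 1))

# Step 2: per-group totals, then a pooled CTR for each age group.
by_age <- aggregate(cbind(Impressions, Clicks) ~ age_group,
                    data = data1, FUN = sum)
by_age$CTR <- by_age$Clicks / by_age$Impressions
print(by_age)
print(table(data1$age_group, data1$click_segment))

# In an interactive session, a visual comparison could be:
# boxplot(Impressions ~ age_group, data = data1)

# Step 3: one summary number per day, tracked across several days.
days <- lapply(1:3, function(i) add_features(simulate_day(500, seed = i)))
daily_ctr <- sapply(days, function(d) sum(d$Clicks) / sum(d$Impressions))
print(daily_ctr)
```

The pooled CTR (total clicks over total impressions) is used per group instead of the mean of per-user CTRs, because the latter is undefined for users with zero impressions and gives heavy weight to users who saw only one or two ads. Either choice is defensible; the point is to pick one deliberately.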
Sample code
Here we'll give you the beginning of a sample solution for this exercise.
The reality is that we can't teach you about data science and teach you