You will see the following output:
Users: 943, genders: 2, occupations: 21, ZIP codes: 795
Next, we will create a histogram to analyze the distribution of user ages, using
matplotlib's hist function:
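import matplotlib.pyplot as plt
# Collect the age (second field of each record) to the driver as a list of ints
ages = user_fields.map(lambda x: int(x[1])).collect()
# density=True is the current name for normed=True, which newer matplotlib dropped
plt.hist(ages, bins=20, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)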
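We passed the ages list, together with the number of bins for our histogram (20 in
this case), to the hist function. Using the normed=True argument (called density=True
in newer matplotlib releases, which dropped normed), we also specified that we want
the histogram to be normalized. Note that this normalizes to a probability density: the
areas of the bars sum to 1, so the height of each bar is the fraction of the data that
falls into that bucket divided by the bucket width, not the percentage itself.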
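To make the difference between a density and a per-bucket fraction concrete, here is a minimal pure-Python sketch of the arithmetic behind a normalized histogram. The histogram helper and the sample_ages list are made up for illustration (they are not matplotlib's code and not the MovieLens data), and the equal-width binning is an assumption that only approximates what the library does.

```python
def histogram(values, bins):
    """Return (counts, bin_width) for equal-width bins spanning min..max.

    Hypothetical helper for illustration; not part of matplotlib.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, width


# Made-up ages, for illustration only.
sample_ages = [18, 21, 22, 25, 25, 29, 31, 34, 40, 52]
counts, width = histogram(sample_ages, bins=4)
n = len(sample_ages)

# Fraction of the data in each bucket: these heights sum to 1.
fractions = [c / n for c in counts]

# Density: fraction divided by bucket width, so the bar areas sum to 1.
densities = [c / (n * width) for c in counts]
total_area = sum(d * width for d in densities)
```

If you want bar heights that read directly as fractions of the data, one option is to keep the raw counts and divide by the number of users yourself, as the fractions list does here.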
You will see an image containing the histogram chart, which looks something like the one
shown here. As we can see, the ages of MovieLens users are somewhat skewed towards
younger viewers. A large number of users are between the ages of about 15 and 35.
Distribution of user ages
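A visual impression such as "most users are between 15 and 35" can also be checked numerically on the collected list. The following is a small sketch under assumptions: share_in_range is a hypothetical helper, and sample_ages is made-up data standing in for the ages list collected earlier.

```python
def share_in_range(ages, low, high):
    """Return the fraction of ages with low <= age <= high."""
    if not ages:
        raise ValueError("ages must not be empty")
    inside = sum(1 for age in ages if low <= age <= high)
    return inside / len(ages)


# Made-up ages, for illustration only (not the MovieLens data).
sample_ages = [18, 21, 22, 25, 25, 29, 31, 34, 40, 52]
share = share_in_range(sample_ages, 15, 35)
```

With the real data, you would pass the ages list returned by collect() in place of sample_ages.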