We might also want to explore the relative frequencies of the various occupations of our
users. We can do this using the following code snippet. First, we will use the MapReduce
approach introduced previously to count the occurrences of each occupation in the dataset.
Then, we will use matplotlib to display a bar chart of occupation counts, using the bar
function.
Since the occupations are textual descriptions rather than numbers, we will need to
manipulate the data a little to get it to work with the bar function:
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
Once we have collected the RDD of counts per occupation, we will convert it into two arrays
for the x axis (the occupations) and the y axis (the counts) of our chart. The collect
function returns the count data to us in no particular order. We need to sort the count
data so that our bar chart is ordered from the lowest to the highest count.
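To make the counting step concrete, here is a minimal plain-Python sketch of what the map and reduceByKey calls compute, without needing a Spark context. The sample rows are made up for illustration; the only assumption carried over from the snippet above is that the occupation sits at index 3 of each record.

```python
# Made-up sample records (illustrative only); the occupation is at index 3,
# matching the fields[3] lookup used in the Spark snippet.
user_fields = [
    ["1", "24", "M", "technician", "00001"],
    ["2", "53", "F", "writer", "00002"],
    ["3", "23", "M", "technician", "00003"],
    ["4", "33", "F", "student", "00004"],
    ["5", "29", "M", "student", "00005"],
    ["6", "41", "F", "technician", "00006"],
]

# "map" step: emit one (occupation, 1) pair per record.
pairs = [(fields[3], 1) for fields in user_fields]

# "reduceByKey" step: add up the values that share the same key.
counts = {}
for occupation, one in pairs:
    counts[occupation] = counts.get(occupation, 0) + one

# Equivalent of collect(): a local list of (occupation, count) tuples.
count_by_occupation = list(counts.items())

# Split the tuples into labels and counts, as done for the chart axes.
occupations = [c[0] for c in count_by_occupation]
totals = [c[1] for c in count_by_occupation]
```

In Spark, the same logic runs in parallel across partitions, which is why the order of the collected pairs should not be relied upon.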
We will achieve this by first creating two numpy arrays and then using the argsort function
of numpy to select the elements from each array, ordered by the count data in an ascending
fashion. Notice that here, we will sort both the x and y axis arrays by the y axis (that is,
by the counts):
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]
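The following small sketch shows how argsort behaves, using made-up occupations and counts rather than the real dataset. It computes the sort order once and reuses it for both arrays, which is equivalent to the two calls above.

```python
import numpy as np

# Made-up labels and counts for illustration only.
x_axis1 = np.array(["student", "doctor", "engineer", "writer"])
y_axis1 = np.array([196, 7, 67, 45])

# argsort returns the indices that would sort the counts in ascending order.
order = np.argsort(y_axis1)

# Indexing both arrays with the same order keeps labels and counts aligned.
x_axis = x_axis1[order]
y_axis = y_axis1[order]
```

Because both arrays are indexed by the same order array, each occupation stays paired with its own count after sorting.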
Once we have the x and y axis data for our chart, we will create the bar chart with the
occupations as labels on the x axis and the counts as the values on the y axis. We will also
add a few lines, such as the plt.xticks(rotation=30) code, to display a better-looking
chart:
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)