We might also want to explore the relative frequencies of the various occupations of our
users. We can do this using the following code snippet. First, we will use the MapReduce
approach introduced previously to count the occurrences of each occupation in the dataset.
Then, we will use matplotlib to display a bar chart of occupation counts, using the
bar function.
Since part of our data consists of textual occupation descriptions, we will need to
manipulate it a little to get it to work with the bar function:
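# Count users per occupation: emit (occupation, 1) pairs, then sum by key
count_by_occupation = (user_fields.map(lambda fields: (fields[3], 1))
                       .reduceByKey(lambda x, y: x + y).collect())
# Split the (occupation, count) pairs into separate label and count arrays
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])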
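To see what the map and reduceByKey steps compute without a Spark cluster, the same counting logic can be sketched in plain Python. This is only an illustration: the sample records and the names sample_user_fields and count_by_occupation_local are hypothetical, and the only thing assumed about the real data is that the occupation sits at index 3, as in fields[3] above.

```python
from collections import defaultdict

# Hypothetical sample records; only index 3 (the occupation) matters here
sample_user_fields = [
    ["1", "24", "M", "technician", "85711"],
    ["2", "53", "F", "other", "94043"],
    ["3", "23", "M", "writer", "32067"],
    ["4", "24", "M", "technician", "43537"],
    ["5", "33", "F", "other", "15213"],
    ["6", "42", "M", "technician", "98101"],
]

# "map" step: emit an (occupation, 1) pair for every record
pairs = [(fields[3], 1) for fields in sample_user_fields]

# "reduceByKey" step: add up the 1s that share the same occupation
totals = defaultdict(int)
for occupation, one in pairs:
    totals[occupation] += one

# Same shape as the collected result: a list of (occupation, count) tuples
count_by_occupation_local = list(totals.items())
print(count_by_occupation_local)
```

In Spark, the summation runs in parallel across partitions, but the result has the same shape: one (occupation, count) tuple per distinct occupation.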
Once we have collected the RDD of counts per occupation, we will convert it into two
arrays for the x axis (the occupations) and the y axis (the counts) of our chart. The
collect function returns the count data in no particular order, so we need to sort it
so that our bar chart is ordered from the lowest to the highest count.
We will achieve this by first creating two numpy arrays and then using the argsort
function of numpy, which returns the indices that would sort an array in ascending order,
to select the elements from each array. Notice that here, we will sort both the x and y
axis arrays by the y axis (that is, by the counts):
# Compute the ascending sort order once and apply it to both arrays
order = np.argsort(y_axis1)
x_axis = x_axis1[order]
y_axis = y_axis1[order]
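A small worked example may make the indexing trick clearer. The arrays below are made-up values, not results from the dataset; the point is that argsort returns positions rather than sorted values, so the same index array can reorder the labels and the counts together.

```python
import numpy as np

# Made-up labels and counts, deliberately out of order
labels = np.array(["writer", "technician", "other"])
counts = np.array([1, 3, 2])

# Indices that would sort counts in ascending order
order = np.argsort(counts)

# Applying the same index array keeps each label paired with its count
sorted_labels = labels[order]
sorted_counts = counts[order]

print(order)          # [0 2 1]
print(sorted_labels)  # ['writer' 'other' 'technician']
print(sorted_counts)  # [1 2 3]
```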
Once we have the x and y axis data for our chart, we will create the bar chart with the
occupations as labels on the x axis and the counts as the values on the y axis. We will
also add a few lines, such as the plt.xticks(rotation=30) code, to display a
better-looking chart:
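# One bar position per occupation: 0, 1, 2, ...
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
# Place each tick half a bar width to the right of its bar position
ax.set_xticks(pos + (width / 2))
# Label the ticks with the occupation names
ax.set_xticklabels(x_axis)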
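The listing above stops before the chart is drawn. As a rough, self-contained sketch of how the remaining steps might look, the example below uses made-up sorted data in place of the real counts. The bar color, the Agg backend, and the align="edge" argument are assumptions made for this illustration: with edge alignment each bar spans from pos to pos + width, so ticks at pos + width / 2 fall in the middle of the bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Made-up data, already sorted by count in ascending order
x_axis = np.array(["writer", "other", "technician"])
y_axis = np.array([1, 2, 3])

pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))  # tick in the middle of each edge-aligned bar
ax.set_xticklabels(x_axis)

# align="edge" makes each bar start at its position instead of centering on it
bars = plt.bar(pos, y_axis, width, align="edge", color="lightblue")

# Rotate the occupation labels so that long names do not overlap
plt.xticks(rotation=30)
```

Calling plt.show() in an interactive session, or plt.savefig with a file name of your choice, would then render the chart.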