Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

We might also want to explore the relative frequencies of our users' occupations. First, we
will use the MapReduce approach introduced previously to count the occurrences of each
occupation in the dataset. Then, we will use matplotlib's bar function to display a bar
chart of the occupation counts.

Since the occupations are textual labels rather than numbers, we will need to manipulate
the data a little to get it to work with the bar function:

count_by_occupation = user_fields.map(lambda fields:
    (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
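
The counting step follows the usual map-then-reduce pattern: emit an (occupation, 1) pair per user, then sum the ones for each key. As a rough sketch of what that computes, the plain-Python version below reproduces the same result without a Spark cluster. The sample records and the helper name reduce_by_key are hypothetical; only the position of the occupation field (index 3) is taken from the snippet above.

```python
# Hypothetical sample records laid out like user_fields:
# [user id, age, gender, occupation, zip code]
sample_user_fields = [
    ["1", "24", "M", "technician", "85711"],
    ["2", "53", "F", "other", "94043"],
    ["3", "23", "M", "writer", "32067"],
    ["4", "24", "M", "technician", "43537"],
    ["5", "33", "F", "other", "15213"],
    ["6", "42", "M", "technician", "98101"],
]


def reduce_by_key(pairs, func):
    """Hypothetical helper: merge the values of each key with func,
    mimicking what reduceByKey followed by collect would return."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())


# "Map" step: one (occupation, 1) pair per user record
pairs = [(fields[3], 1) for fields in sample_user_fields]

# "Reduce" step: add up the ones for each occupation
count_by_occupation = reduce_by_key(pairs, lambda x, y: x + y)

print(sorted(count_by_occupation))
# [('other', 2), ('technician', 3), ('writer', 1)]
```

The result is a list of (occupation, count) tuples, which is the shape the two list comprehensions above expect.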

Once we have collected the RDD of counts per occupation, we convert it into two numpy
arrays: one for the x axis (the occupations) and one for the y axis (the counts) of our
chart. The collect function returns the count data in no particular order, so we need to
sort it for our bar chart to run from the lowest to the highest count.

We will achieve this with numpy's argsort function, which returns the indices that would
sort an array in ascending order. Notice that here, we index both the x and y axis arrays
with the ordering of the y axis (that is, the counts), so that each occupation stays paired
with its count:

order = np.argsort(y_axis1)  # indices that sort the counts in ascending order
x_axis = x_axis1[order]
y_axis = y_axis1[order]
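
To make the effect of argsort concrete, here is a small self-contained sketch using hypothetical counts. The variable names labels, counts, and order are illustrative and not part of the original listing.

```python
import numpy as np

# Hypothetical counts, in the arbitrary order that collect might return them
count_by_occupation = [("technician", 3), ("writer", 1), ("other", 2)]

labels = np.array([c[0] for c in count_by_occupation])
counts = np.array([c[1] for c in count_by_occupation])

# argsort returns the positions of the elements in ascending order of value
order = np.argsort(counts)

# Indexing both arrays with the same ordering keeps labels and counts aligned
sorted_labels = labels[order]
sorted_counts = counts[order]

print(order.tolist())          # [1, 2, 0]
print(sorted_labels.tolist())  # ['writer', 'other', 'technician']
print(sorted_counts.tolist())  # [1, 2, 3]
```

Sorting the two arrays independently would break the pairing between an occupation and its count, which is why a single index array is applied to both.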

Once we have the x and y axis data for our chart, we will create the bar chart with the
occupations as labels on the x axis and the counts as the values on the y axis. We will also
add a few lines, such as the plt.xticks(rotation=30) code, to display a better-looking
chart:

pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
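
The listing in this excerpt stops before the chart is actually drawn. The following is a minimal sketch of how the remaining steps might look, not the original code: the data is hypothetical, the Agg backend and in-memory buffer are assumptions made so that the example runs without a display, and the color and figure size are illustrative choices. Current matplotlib releases center each bar on its x position by default, so this sketch places the ticks at pos rather than at pos + (width / 2).

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, already sorted by count in ascending order
x_axis = np.array(["writer", "other", "technician"])
y_axis = np.array([1, 2, 3])

pos = np.arange(len(x_axis))
width = 1.0

fig, ax = plt.subplots(figsize=(8, 5))

# Draw one bar per occupation; bars are centered on pos by default
bars = ax.bar(pos, y_axis, width, color="lightblue", edgecolor="black")

# Label each bar with its occupation and tilt the labels for readability
ax.set_xticks(pos)
ax.set_xticklabels(x_axis)
plt.xticks(rotation=30)

# Write the figure to an in-memory PNG instead of opening a window
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

In an interactive session, plt.show() could be used in place of the savefig call.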
