Managing Complexity with Concurrent Programming - Clojure Data Analysis

["|N00030973|" 9900]

["|N00005656|" 11598514])

This solution uses agents to handle the work, and it uses the STM to manage shared data

structures. The main function first assigns each input file to an agent. Each agent then reads

the input file and totals the amount of contributions for each candidate. It takes those totals

and uses the STM to update the shared counts.
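
The pattern described above can be illustrated with a minimal sketch. This is not the recipe's own code: the names total-by-candidate, sum-chunk, process-chunk, and chunks are hypothetical, and small in-memory vectors of [candidate-id amount] pairs stand in for the input files.

```clojure
;; Shared totals, managed by the STM.
(def total-by-candidate (ref {}))

(defn sum-chunk
  "Totals the amounts per candidate within one chunk of [id amount] pairs."
  [pairs]
  (reduce (fn [m [id amount]] (merge-with + m {id amount})) {} pairs))

(defn process-chunk
  "Agent action: totals this agent's chunk locally, then merges the
  result into the shared ref inside a transaction."
  [pairs]
  (let [totals (sum-chunk pairs)]
    (dosync
      (commute total-by-candidate (partial merge-with +) totals))
    totals))

;; Hypothetical stand-in data: one vector per "input file".
(def chunks
  [[["A" 100] ["B" 50]]
   [["A" 25]]
   [["B" 5] ["C" 1]]])

;; One agent per chunk; wait until every agent has finished.
(let [agents (mapv agent chunks)]
  (doseq [a agents]
    (send a process-chunk))
  (apply await agents))

@total-by-candidate
```

Because addition is commutative, commute is a reasonable choice here: the order in which the agents merge their totals doesn't affect the final map.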

Maintaining consistency with ensure

When we use the STM, we are trying to coordinate and maintain consistency between several

values, all of which keep changing. However, we'll sometimes also want to stay consistent

with references that our transaction only reads and never changes, which the STM won't otherwise protect.

We can signal that the STM should include these other references in the transaction by

using the ensure function.

This helps simplify the data processing system by ensuring that the data structures stay

synchronized and consistent. The ensure function allows us to have more control over

what gets managed by the STM.
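
As a rough illustration of the idea, and separate from the recipe's own code, the sketch below reads two refs inside a transaction without changing them. The names word-total, term-hits, ratio, record-words!, and update-ratio! are hypothetical. Calling ensure on a ref protects it from modification by other transactions until this one finishes, and returns its in-transaction value.

```clojure
;; Hypothetical refs for illustration only.
(def word-total (ref 0))   ; total number of words seen so far
(def term-hits  (ref 0))   ; occurrences of the term of interest
(def ratio      (ref nil)) ; derived value: term-hits / word-total

(defn record-words!
  "Writer transaction: changes both counts together."
  [n-words n-hits]
  (dosync
    (alter word-total + n-words)
    (alter term-hits + n-hits)))

(defn update-ratio!
  "Reads word-total and term-hits without changing them. ensure keeps
  other transactions from modifying them until this transaction ends,
  so the stored ratio agrees with both counts at commit time."
  []
  (dosync
    (let [total (ensure word-total)
          hits  (ensure term-hits)]
      (ref-set ratio (when (pos? total) (/ hits total))))))

(record-words! 1000 25)
(update-ratio!)
@ratio
```

With plain derefs instead of ensure, another transaction could commit new counts while update-ratio! is running, and the stored ratio might no longer match the counts it was computed from.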

For this recipe, we'll use a slightly contrived example. We'll process a set of text files

and compute the frequency of a term as well as the total number of words. We'll do this

concurrently, and we'll be able to watch the results get updated as we progress.

For the set of text files, we'll use the Brown corpus. Constructed in the 1960s, this was

one of the first digital collections of texts (or corpora) assembled for linguists to use to

study language. At that time, its size (one million words) was huge. Today, similar corpora

contain 100 million words or more.

Getting ready

We'll need to include the clojure.string library and have easy access to the File class:

(require '[clojure.string :as string])

(import '[java.io File])

We'll also need to download the Brown corpus. We can download it at

http://www.nltk.org/nltk_data/. Actually, you can use any large collection of texts, but the Brown corpus

has each word's part of speech listed in the file, so we'll need to parse it specially. If you use

a different corpus, you can just change the tokenize-brown function, as explained in the

next section, to work with your texts.
