Cleaning and Validating Data - Clojure Data Analysis

How to do it…

To define a parser, we just define functions that parse the different parts of the input and then combine them to parse larger structures:

1. It would be useful to have a way to parse two things and throw away the result of the second. This function will do that:

(defn <| [l r]
  (let [l-output (l)]
    (r)
    l-output))

2. We'll also define a parser for the end of a line. It matches either a carriage return or a newline:

(defn nl []
  (chr-in #{\newline \return}))

3. Let's start putting the pieces together. The first function parses the sequence definition line by accepting a > character, followed by anything up to the end of the line:

(defn defline []
  (chr \>)
  (<| #(read-to-re #"[\n\r]+") nl))

4. We parse a sequence of amino acid or nucleic acid codes by defining a parser for a single code and then building on that to create a parser for a line of codes:

(defn acid-code []
  (chr-in #{\A \B \C \D \E \F \G \H \I \K \L \M
            \N \P \Q \R \S \T \U \V \W \X \Y \Z
            \- \*}))

(defn acid-code-line []
  (<| #(multi+ acid-code) #(attempt nl)))

5. Next, we combine these parsers into one that parses an entire FASTA record and populates a map with our data. We also define a combinator that parses multiple FASTA records:

(defn fasta []
  (ws?)
  (let [dl (defline)
        gls (apply str (flatten
                         (multi+ acid-code-line)))]
    {:defline dl, :gene-seq gls}))

(defn multi-fasta []
  (<| #(multi+ fasta)
      ws?))
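To see what the <| helper from step 1 does in isolation, here is a minimal sketch that calls it with two ordinary zero-argument functions instead of real parsers. The names calls, left, and right are hypothetical stand-ins introduced for illustration; they are not part of the parsing library.

```clojure
;; Same definition as in step 1: run l, then r, and return only l's result.
(defn <| [l r]
  (let [l-output (l)]
    (r)
    l-output))

;; Hypothetical stand-ins for parsers. Each one records that it ran.
(def calls (atom []))

(defn left []
  (swap! calls conj :left)
  "kept")

(defn right []
  (swap! calls conj :right)
  "discarded")

;; Both functions run, in order, but only the left result survives.
(def result (<| left right))
```

This is why defline can read the text of the line with the left parser and still consume the line ending with the right one, without the line ending showing up in the result.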
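To make the shape of the map built in step 5 concrete, the following sketch produces the same kind of :defline and :gene-seq map using only clojure.string. It is a stand-in for illustration, not the combinator-based parser above: it performs no validation of the acid codes, the helper names parse-record and parse-fasta-text are hypothetical, and the sample input is invented.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample input, invented for illustration.
(def sample
  ">seq1 first example\nACDE\nFGHI\n>seq2 second example\nKLMN\n")

(defn parse-record
  "Turn one record (the text after the leading >) into a map.
  The first line is the definition line; the rest are sequence lines."
  [record]
  (let [[defline & seq-lines] (str/split-lines record)]
    {:defline defline
     :gene-seq (apply str (map str/trim seq-lines))}))

(defn parse-fasta-text
  "Split the text on each > that starts a line and parse every record."
  [text]
  (->> (str/split (str/trim text) #"(?m)^>")
       (remove str/blank?)
       (mapv parse-record)))

(def records (parse-fasta-text sample))
;; records is a vector of two maps:
;; [{:defline "seq1 first example", :gene-seq "ACDEFGHI"}
;;  {:defline "seq2 second example", :gene-seq "KLMN"}]
```

As in the fasta parser, the leading > is not part of :defline, and the sequence lines of a record are joined into a single string.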
