Importing Data for Analysis - Clojure Data Analysis

load-xml-data implements this process. It takes three parameters:

• The input filename

• A function that takes the root node of the parsed XML and returns the first data node

• A function that takes a data node and returns the next data node, or nil if there are no more nodes

First, the function parses the XML file and wraps it in a zipper (we'll talk more about zippers in the next section). Then, it uses the two functions that are passed in to extract all of the data nodes as a sequence. For each data node, the function retrieves that node's child nodes and converts them into a series of tag name / content pairs. The pairs for each data node are converted into a map, and the sequence of maps is converted into an Incanter dataset.
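The steps just described can be sketched roughly as follows. This is a minimal illustration, not the recipe's own listing: the name xml->maps, the sample XML, and the string->stream helper are assumptions made for the example. It stops at the sequence of maps; in the recipe, that sequence would then be handed to Incanter (for instance with incanter.core/to-dataset), which is left out here so that the sketch runs with only Clojure's standard library.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])

(defn xml->maps
  "Hypothetical helper: parses `source`, walks the data nodes using
  `first-data` and `next-data`, and returns one map per data node."
  [source first-data next-data]
  (let [pair (fn [node] [(:tag node) (first (:content node))])]
    (->> (xml/parse source)       ; parse into Clojure's XML maps
         zip/xml-zip              ; wrap the root node in a zipper
         first-data               ; move to the first data node
         (iterate next-data)      ; keep stepping to the next data node
         (take-while some?)       ; stop when next-data returns nil
         (map zip/children)       ; child nodes of each data node
         (map #(into {} (map pair %))))))  ; tag/content pairs -> map

;; Assumed sample input; any XML with one element per record would do.
(def sample-xml
  (str "<data>"
       "<person><name>Ada</name><age>36</age></person>"
       "<person><name>Alan</name><age>41</age></person>"
       "</data>"))

(defn string->stream
  "Hypothetical helper so that the example needs no file on disk."
  [^String s]
  (java.io.ByteArrayInputStream. (.getBytes s "UTF-8")))

(def people
  (xml->maps (string->stream sample-xml) zip/down zip/right))
```

Here zip/down plays the role of the second parameter (the data nodes are the children of the root) and zip/right plays the role of the third (it returns nil once there are no more siblings).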

There's more…

We used a couple of interesting data structures or constructs in this recipe. Both are common in functional programming or Lisp, but neither has made its way into more mainstream programming. We should spend a minute with them.

Navigating structures with zippers

The first thing that happens to the parsed XML is that it gets passed to clojure.zip/xml-zip. Zippers are standard data structures that encapsulate the data at a position in a tree structure, as well as the information necessary to navigate back out. The xml-zip function takes Clojure's native XML data structure and turns it into something that can be navigated quickly using commands such as clojure.zip/down and clojure.zip/right. Being a functional programming language, Clojure encourages you to use immutable data structures, and zippers provide an efficient, natural way to navigate and modify a tree-like structure, such as an XML document.
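To make the navigation concrete, here is a small sketch that uses only functions from clojure.zip. The tree literal is an assumed example written in the same map shape that clojure.xml/parse produces (:tag, :attrs, :content).

```clojure
(require '[clojure.zip :as zip])

;; Assumed sample tree in the shape produced by clojure.xml/parse.
(def tree
  {:tag :data
   :attrs nil
   :content [{:tag :a :attrs nil :content ["1"]}
             {:tag :b :attrs nil :content ["2"]}]})

;; Move down to the first child, then right to its sibling.
(def loc-b (-> tree zip/xml-zip zip/down zip/right))

;; zip/node returns the data at the current position.
(def tag-at-loc (:tag (zip/node loc-b)))

;; zip/edit replaces the node at this position, and zip/root rebuilds
;; the whole tree around the change. The original tree is untouched.
(def edited
  (-> loc-b
      (zip/edit assoc :content ["20"])
      zip/root))
```

After this runs, tag-at-loc is :b, edited holds a new tree in which the :b element contains "20", and tree still contains "2".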

Zippers are very useful and interesting, and understanding them can help you understand and work better with immutable data structures. For more information on zippers, the Clojure-doc page is helpful ( http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html ). However, if you would rather dive into the deep end, see Gerard Huet's paper, The Zipper ( http://www.st.cs.uni-saarland.de/edu/

Processing in a pipeline

We used the ->> macro to express our process as a pipeline. For deeply nested function calls, this macro lets you read the steps from left to right, in the order in which they are applied, and this makes the process's data flow and series of transformations much clearer.
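As a small illustration of the difference, the two definitions below compute the same value. The arithmetic is an arbitrary example chosen for this sketch, not something taken from the recipe.

```clojure
;; Nested form: the steps have to be read from the inside out.
(def nested
  (reduce + (map #(* % %) (filter odd? (range 10)))))

;; Threaded form: ->> passes each result as the last argument of the
;; next form, so the steps read in the order in which they happen.
(def threaded
  (->> (range 10)
       (filter odd?)
       (map #(* % %))
       (reduce +)))
;; both are 165
```

Because ->> is a macro, the threaded form is rewritten into the nested one before evaluation, so the choice between them is purely a matter of readability.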
