We can do this in Clojure because of its macro system. ->> simply rewrites the calls into
Clojure's native, nested format as the form is read. The first parameter of the macro is
inserted into the next expression as the last parameter. That structure is then inserted into
the third expression as the last parameter, and so on, until the end of the form. Let's trace
this through a few steps. Say we start off with the expression (->> x first (map length)
(apply +)). As Clojure builds the final expression, here's each intermediate step (at each
stage, the form just built becomes the last argument of the next expression):
1. (->> x first (map length) (apply +))
2. (->> (first x) (map length) (apply +))
3. (->> (map length (first x)) (apply +))
4. (apply + (map length (first x)))
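We can check this rewrite at the REPL, because ->> produces the fully nested form in a
single macro expansion. Here's a minimal sketch; the sample value of x and the definition
of length (assumed here to be an alias for clojure.core/count, since length isn't a core
function) are illustrative:

;; Assumed for illustration: length as an alias for count,
;; and x as a sequence of sequences.
(def length count)
(def x [["a" "bb" "ccc"] ["dd" "e"]])

;; The threaded form and its hand-nested equivalent agree.
(->> x first (map length) (apply +))
;; => 6
(apply + (map length (first x)))
;; => 6

;; macroexpand shows the step 1 form rewritten straight to step 4.
(macroexpand '(->> x first (map length) (apply +)))
;; => (apply + (map length (first x)))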
Comparing XML and JSON
XML and JSON (from the Reading JSON data into Incanter datasets recipe) are very similar.
Arguably, much of the popularity of JSON is driven by disillusionment with XML's verbosity.
When we're dealing with these formats in Clojure, the biggest difference is that JSON is
converted directly to native Clojure data structures that mirror the data, such as maps and
vectors. Meanwhile, XML is read into record types that reflect the structure of XML, not the
structure of the data.
In other words, the keys of the maps for JSON will come from the domain, first_name or
age, for instance. However, the keys of the maps for XML will come from the data format, such
as tag, attribute, or children, and the tag and attribute names will come from the domain.
This extra level of abstraction makes XML more unwieldy.
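The contrast is easy to see at the REPL. The following is a minimal sketch, assuming the
clojure.data.json library is on the classpath and using the built-in clojure.xml; the
sample document is made up for illustration:

(require '[clojure.data.json :as json]
         '[clojure.xml :as xml])

;; JSON: the map keys come directly from the domain.
(json/read-str "{\"first_name\": \"Ada\", \"age\": 36}")
;; => {"first_name" "Ada", "age" 36}

;; XML: the map keys come from the format (:tag, :attrs, :content),
;; and the domain names only appear as tag values inside the tree.
(xml/parse (java.io.ByteArrayInputStream.
            (.getBytes "<person><first_name>Ada</first_name></person>")))
;; => {:tag :person, :attrs nil,
;;     :content [{:tag :first_name, :attrs nil, :content ["Ada"]}]}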
Scraping data from tables in web pages
There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to reach. It's
buried in tables, articles, or deeply nested div tags. Web scraping (writing a program that
walks over a web page and extracts data from it) is brittle and laborious, but it's often the only
way to free this data so it can be used in our analyses. This recipe describes how to load a
web page and dig down into its contents so that you can pull the data out.
To do this, we're going to use the Enlive ( https://github.com/cgrand/enlive/wiki )
library. This uses a domain-specific language (DSL, a set of commands that make a small set
of tasks very easy and natural) based on CSS selectors to locate elements within a web page.
This library can also be used for templating. In this case, we'll just use it to get data back out
of a web page.
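As a taste of what the selectors look like, here's a minimal sketch of pulling the cell text
out of a page's table rows; the dependency, URL, and table layout are all assumptions for
illustration:

(require '[net.cgrand.enlive-html :as html])

(defn load-page
  "Fetch a web page and parse it into Enlive's node tree."
  [url]
  (html/html-resource (java.net.URL. url)))

(defn table-cells
  "Select every row of every table, then get the text of its cells."
  [page]
  (for [row (html/select page [:table :tr])]
    (map html/text (html/select row [:td]))))

;; Usage, against a hypothetical page:
;; (table-cells (load-page "http://example.com/data.html"))
;; => (("42" "Clojure") ("7" "Enlive") ...)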
 