Finally, convert everything to a dataset. incanter.core/dataset is a lower-level
constructor than incanter.core/to-dataset. It requires you to pass in the column
names and the data rows as separate sequences:
(i/dataset headers rows)))
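To make the difference between the two constructors concrete, here is a minimal sketch. It assumes Incanter is on the classpath and aliased as i, as in the recipe; the column names and values are made up purely for illustration.

```clojure
(require '[incanter.core :as i])

;; Hypothetical sample data: headers and rows are kept in separate sequences.
(def headers [:name :count])
(def rows [["alpha" 1]
           ["beta" 2]])

;; dataset takes the column names first, then the rows.
(def ds (i/dataset headers rows))

;; to-dataset infers the column names from the data itself,
;; here from the keys of each map.
(def ds2 (i/to-dataset [{:name "alpha" :count 1}
                        {:name "beta" :count 2}]))
```

Because the scraping code has already pulled the header cells and the body cells out of the page separately, passing them straight to dataset avoids building intermediate maps.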
It's important to realize that the code, as presented here, is the result of a lot of trial and error.
Screen scraping usually is. Generally, I download the page and save it, so I don't have to keep
requesting it from the web server. Next, I start the REPL and parse the web page there. Then,
I can take a look at the web page and HTML with the browser's view source function, and I can
examine the data from the web page interactively in the REPL. While working, I copy and paste
the code back and forth between the REPL and my text editor, as it's convenient. This workflow
and environment (sometimes called REPL-driven development) makes screen scraping
(a fiddly, difficult task at the best of times) almost enjoyable.
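The "download once, then work from the saved copy" step of this workflow can be sketched as follows. This is only an illustration, not code from the recipe: fetch-once and the file name are hypothetical, and it uses nothing beyond clojure.java.io.

```clojure
(require '[clojure.java.io :as io])

(defn fetch-once
  "Downloads url to local-path unless that file already exists,
  then returns the local file. A hypothetical helper for REPL work."
  [url local-path]
  (let [f (io/file local-path)]
    (when-not (.exists f)
      ;; slurp accepts a URL string; spit writes the text to disk.
      (spit f (slurp url)))
    f))

(comment
  ;; At the REPL: the first call hits the server, later calls reuse the file.
  (def page (fetch-once "http://example.com/some-page.html"
                        "some-page.html"))
  ;; Assuming Enlive is aliased as html, the saved file could then be
  ;; parsed with (html/html-resource page).
  )
```

Working from the local copy keeps the experiment loop fast and avoids hitting the web server repeatedly while the selectors are still being worked out.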
See also
- The next recipe, Scraping textual data from web pages, has a more involved example
  of data scraping on an HTML page.
- The Aggregating data from different formats recipe has a practical, real-life example
  of data scraping in a table.
Scraping textual data from web pages
Not all of the data on the Web is in tables, as in our last recipe. In general, the process
to access this nontabular data might be more complicated, depending on how the page
is structured.
Getting ready
First, we'll use the same dependencies and require statements as we did in the last
recipe, Scraping data from tables in web pages.
Next, we'll identify the file to scrape the data from. I've put up a file at
http://www.ericrochester.com/clj-data-analysis/data/small-sample-list.html.
This is a much more modern example of a web page. Instead of using tables, it marks up the
text with the section and article tags and other features from HTML5, which help convey
what the text means, not just how it should look.
 