Let's say you're visiting an interesting page on Wikipedia about coffee varieties at
http://en.wikipedia.org/wiki/List_of_coffee_varieties. You notice the list shown in Figure 10-27.
Figure 10-27
You'd like to extract this list for later use (perhaps to import into a database). You know you can copy
and paste tables from websites into Excel, but where's the fun in that? You've already seen how to make
HTTP requests to REST services and interact with them, so why wouldn't it be possible to make HTTP
requests to normal websites and parse the HTML page they give back? Sure, HTML contains a lot of
formatting and is not a structured format such as XML or JSON, but it definitely seems possible.
This is exactly what screen scraping (also called data scraping or web scraping) does—it extracts
data programmatically from output that's meant to be consumed by humans.
Depending on the type of data you want to extract, screen scraping can be more or less difficult to
pull off. Some websites are structured in a complex way, make HTTP requests even after the main
page has loaded (using JavaScript), or require certain cookies to be set (e.g., indicating that a user is
logged in) before you can access pages. This section takes a look at screen scraping web pages both
with and without cookie-based authentication. To screen scrape, you need yet another library
to help parse and search through the received HTML pages. You certainly don't want to do this
manually using Java's String manipulation methods.
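To see why manual String manipulation is a poor fit, consider the following minimal sketch. It is not code from this section: the class name NaiveCellExtractor and the sample markup are made up for illustration, and the rows are not copied from the Wikipedia page. It pulls out table cells with nothing but indexOf and substring.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCellExtractor {

    // Collects the text between every literal "<td>" and the next "</td>".
    // Deliberately naive: it skips cells that carry attributes
    // (<td class="x">), keeps nested markup, and leaves entities undecoded.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        int from = 0;
        while (true) {
            int open = html.indexOf("<td>", from);
            if (open < 0) {
                break;
            }
            int close = html.indexOf("</td>", open);
            if (close < 0) {
                break;
            }
            cells.add(html.substring(open + "<td>".length(), close).trim());
            from = close + "</td>".length();
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample rows (an assumption for illustration only).
        String html = "<table>"
                + "<tr><td>Arusha</td><td class=\"region\">Tanzania</td></tr>"
                + "<tr><td><a href=\"/wiki/Bourbon\">Bourbon</a></td>"
                + "<td>Rwanda &amp; Burundi</td></tr>"
                + "</table>";
        for (String cell : extractCells(html)) {
            System.out.println(cell);
        }
    }
}
```

On this sample, the "Tanzania" cell is silently skipped because its tag has an attribute, the "Bourbon" cell comes back with its link markup still attached, and "&amp;" is left undecoded. Each of these could be patched by hand, but real pages contain many more such cases, which is the argument for using a proper HTML parser.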
As always, start by setting up a new project in Eclipse, called
ScreenScrapingWithJava. The library you will be using to do the
HTML parsing is called jsoup and can be downloaded at
http://jsoup.org/download (version 1.7.3 is used here). Simply
download the core library JAR file and copy it (for example, by
dragging it) into a folder named jsoup within your Eclipse project.
Finally, right-click the JAR in Eclipse to add it to the build path.
See Figure 10-28.
Figure 10-28
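Once the JAR is on the build path, a quick way to check the setup is to parse a small HTML string. The following is a minimal sketch, assuming the jsoup JAR is available on the classpath. The class name JsoupCellExtractor, the selector "table td", and the sample rows are assumptions chosen for illustration and do not reflect the actual structure of the Wikipedia page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupCellExtractor {

    // Parses the HTML with jsoup and returns the visible text of every
    // table cell, with nested tags stripped and entities decoded.
    public static List<String> extractCells(String html) {
        Document doc = Jsoup.parse(html);
        List<String> cells = new ArrayList<>();
        for (Element cell : doc.select("table td")) {
            cells.add(cell.text());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample rows (an assumption for illustration only).
        String html = "<table>"
                + "<tr><td>Arusha</td><td class=\"region\">Tanzania</td></tr>"
                + "<tr><td><a href=\"/wiki/Bourbon\">Bourbon</a></td>"
                + "<td>Rwanda &amp; Burundi</td></tr>"
                + "</table>";
        for (String cell : extractCells(html)) {
            System.out.println(cell);
        }
    }
}
```

Here jsoup returns all four cells as plain text, regardless of attributes on the td tags or markup nested inside them. For a live page, Jsoup.connect(url).get() fetches and parses the document in one step. The selector needed for a real page depends on that page's markup and has to be worked out by inspecting its HTML.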