[attribute] : Finds elements with an attribute
[attribute=value] : Finds elements whose attribute equals a value
Selectors can be combined to build more complex queries. For example, table.wikitable selects all
table elements with the class wikitable, and table a finds all link (a) elements nested within a table.
You can read more about selectors on the jsoup site.
Running your code will produce the following result:
2 wikitables found
----- Arusha -----
Arabica: Arabica
Region(s): Mount Meru in Tanzania, and Papua New Guinea
Comments: either a Typica variety or a French Mission.
----- Bergendal, Sidikalang -----
Arabica: Arabica
Region(s): Indonesia
Comments: Both are Typica varieties which survived the Leaf Rust Outbreak
of the 1880s; most of the other Typica in Indonesia was destroyed.
** AND SO ON **
As you no doubt agree, screen scraping can often offer a straightforward and simple
method to extract data from web pages. However, screen scraping is generally regarded as an
inelegant technique, to be used only when no other mechanism for structured data exchange
(such as REST or SOAP) is available. The reasons for this are both technical and “ethical” in nature.
First, web pages may change without notice, causing your screen scraping programs to break and
require a great deal of maintenance. Second, it is not regarded as good form to “hammer” websites
with screen scraping programs. Such programs can make thousands of requests to extract all the
data from a website that was originally meant for human consumption only. Therefore, always take
care when screen scraping a website not to anger the website owner.
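One common way to avoid hammering a site is to enforce a minimum pause between consecutive requests. The following is only a minimal sketch of that idea using the standard library; the class name PoliteThrottle, its method names, and the 200-millisecond gap are illustrative assumptions, not something prescribed by jsoup or by any particular website.

```java
import java.time.Duration;

// Hypothetical helper: enforces a minimum gap between consecutive requests.
class PoliteThrottle {
    private final long minGapMillis;
    private boolean hasRequested = false; // becomes true after the first request
    private long lastRequestAt = 0L;      // timestamp of the most recent request

    PoliteThrottle(Duration minGap) {
        this.minGapMillis = minGap.toMillis();
    }

    // How long a caller should still wait at time nowMillis before the next request.
    long millisToWait(long nowMillis) {
        if (!hasRequested) {
            return 0L; // the first request never has to wait
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0L, minGapMillis - elapsed);
    }

    // Record that a request was sent at time nowMillis.
    void markRequest(long nowMillis) {
        hasRequested = true;
        lastRequestAt = nowMillis;
    }

    // Sleep until the minimum gap has passed, then record the new request time.
    void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteThrottle throttle = new PoliteThrottle(Duration.ofMillis(200));
        for (int page = 1; page <= 3; page++) {
            throttle.acquire(); // blocks if the previous request was too recent
            System.out.println("Would fetch page " + page + " now");
        }
    }
}
```

In a real scraper, you would call acquire() right before each page fetch. What counts as a reasonable gap depends on the site, so treat the value above as a placeholder.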
Note People sometimes use screen scraping techniques to build wrappers around
websites that do not offer an official REST API. The result is a so‐called
“Evil API” that programmers can use to access the website in a programmatic
manner. Such third‐party tools are denoted as “evil” because the website
owner does not intend or want programmers to access their resources in a
programmatic manner.
Screen Scraping with Cookies
Before showing you their content, websites frequently require your browser to send a set of cookies
(small tokens of information) that the website set earlier so that it can identify you again.
Recall that cookies are HTTP's way of working around its stateless nature. For instance, when you visit
Twitter or Facebook in your browser and no cookie is sent with the request, the site will ask you to
log in before continuing. Once you provide your username and password, the site will set a cookie
to remember you in subsequent requests.
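The round trip described above can be sketched with the JDK's own java.net.CookieManager, without any network access. This is only an illustration under stated assumptions: the class name CookieSessionSketch, the helper method names, the host www.example.com, and the cookie session=abc123 are all made up for the example, and a real site will set its own cookie names and attributes.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: remember a cookie from one response, replay it on the next request.
class CookieSessionSketch {

    // Store the cookie carried by a (simulated) Set-Cookie response header.
    static void rememberCookies(CookieManager jar, URI uri, String setCookieValue)
            throws IOException {
        jar.put(uri, Map.of("Set-Cookie", List.of(setCookieValue)));
    }

    // Build the value of the Cookie request header for the given URI
    // from whatever the cookie store currently holds for it.
    static String cookieHeaderFor(CookieManager jar, URI uri) {
        return jar.getCookieStore().get(uri).stream()
                .map(cookie -> cookie.getName() + "=" + cookie.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) throws IOException {
        CookieManager jar = new CookieManager();
        URI loginPage = URI.create("https://www.example.com/"); // assumed host

        // First visit: nothing stored yet, so no cookie would be sent.
        System.out.println("Before login: [" + cookieHeaderFor(jar, loginPage) + "]");

        // Simulated response after logging in: the site sets a session cookie.
        rememberCookies(jar, loginPage, "session=abc123; Path=/");

        // Subsequent request to another page on the same site sends the cookie back.
        URI homePage = URI.create("https://www.example.com/home");
        System.out.println("After login: [" + cookieHeaderFor(jar, homePage) + "]");
    }
}
```

A scraper needs to perform the same two steps: capture the cookies the site sets after login, and send them along with every later request.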
 