Scrubbing Data - Data Science at the Command Line

That seems to be in order. (Note that we're only showing the first 79 characters of each line so that output fits on the page.)

Using the developer tools of our browser, we were able to determine that the root HTML element that we're interested in is a <table> with the class wikitable. This allows us to use grep to look at just the part that we're interested in (the -A option below specifies the number of lines to print after the matching line):

$ < wiki.html grep wikitable -A 21
<tr>
<th>Country or territory</th>
<th>Total length of land borders (km)</th>
<th>Total surface area (km²)</th>
<th>Border/area ratio (km/km²)</th>
</tr>
<tr>
<td>Vatican City</td>
</tr>
<tr>
<td>Monaco</td>
</tr>
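
To see the effect of -A in isolation, here is a minimal sketch that runs grep against a tiny stand-in file. The file contents and the variable names (sample, matched) are made up for illustration and are not the real wiki.html; only the -A option itself comes from the text above.

```shell
# Build a tiny stand-in for wiki.html (hypothetical content, not the real page).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>Intro text</p>
<table class="wikitable">
<tr>
<th>Country or territory</th>
</tr>
</table>
<p>Footer text</p>
EOF

# Print the matching line plus the 2 lines that follow it.
matched=$(grep -A 2 wikitable "$sample")
printf '%s\n' "$matched"

rm -f "$sample"
```

With -A 2, grep prints the line containing wikitable and the two lines after it, so the intro paragraph and everything past the <th> line are left out. Raising the number simply extends the window further down the file.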

The next step is to extract the necessary elements from the HTML file. For this we use the scrape tool:

$ < wiki.html scrape -b -e 'table.wikitable > tr:not(:first-child)' \
> > table.html
$ head -n 21 table.html
<!DOCTYPE html>
<html>
<body>
<td>Vatican City</td>
</tr>
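
The selector 'table.wikitable > tr:not(:first-child)' asks for every <tr> directly inside the wikitable table except the first one, which is the header row. The sketch below imitates that idea with awk on a made-up sample file. It is not how scrape works: scrape parses the HTML properly, whereas this approximation assumes one tag per line and a single table, and the names sample, rows, row, and inside are hypothetical.

```shell
# Hypothetical, line-oriented sample; the real wiki.html is much larger.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table class="wikitable">
<tr>
<th>Country or territory</th>
</tr>
<tr>
<td>Vatican City</td>
</tr>
<tr>
<td>Monaco</td>
</tr>
</table>
EOF

# Count rows as they open, and print only rows after the first.
rows=$(awk '
  /<tr>/            { row++; inside = 1 }   # a new row starts
  inside && row > 1 { print }               # skip the first row, which holds the headers
  /<\/tr>/          { inside = 0 }          # the row ends
' "$sample")
printf '%s\n' "$rows"

rm -f "$sample"
```

On this sample the header row is dropped and the Vatican City and Monaco rows are kept. For real pages, where tags can share a line or nest, a proper HTML parser such as the one behind scrape is the safer choice.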
