Scrubbing Data - Data Science at the Command Line

Filtering Lines

The first scrubbing operation is filtering lines. This means that from the input data,

each line will be evaluated to determine whether it may be passed on as output.

Based on location

The most straightforward way to filter lines is based on their location. This may be

useful when you want to inspect, say, the top 10 lines of a file, or when you extract a

specific row from the output of another command-line tool. To illustrate how to filter

based on location, let's create a dummy file that contains 10 lines:

$ cd ~/book/ch05/data

$ seq -f "Line %g" 10 | tee lines

Line 1

Line 2

Line 3

Line 4

Line 5

Line 6

Line 7

Line 8

Line 9

Line 10

We can print the first three lines using head, sed, or awk:

$ < lines head -n 3

$ < lines sed -n '1,3p'

$ < lines awk 'NR<=3'

Line 1

Line 2

Line 3
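The same tools can also pick out a single row, which is the "specific row" case mentioned earlier. The following is a minimal sketch rather than part of the original text: it recreates the ten-line sample in a temporary file (the variable names and the choice of row 5 are assumptions made for illustration) and extracts the fifth line in three ways.

```shell
# Recreate the ten-line sample in a temporary file (illustrative only).
tmp=$(mktemp)
seq -f "Line %g" 10 > "$tmp"

# sed: suppress default output (-n) and print only line 5.
row_sed=$(sed -n '5p' < "$tmp")

# awk: a bare condition prints every line for which it is true.
row_awk=$(awk 'NR==5' < "$tmp")

# head and tail: keep the first five lines, then the last of those.
row_head_tail=$(head -n 5 < "$tmp" | tail -n 1)

echo "$row_sed"
echo "$row_awk"
echo "$row_head_tail"

rm -f "$tmp"
```

Each of the three echo commands prints Line 5. Replacing the redirected file with a pipe from another command-line tool works the same way, since all three tools read from standard input.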

Similarly, we can print the last three lines using tail (Rubin, MacKenzie, Taylor, &

Meyering, 2012):

$ < lines tail -n 3

Line 8

Line 9

Line 10
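For comparison, here is one way the same result might be obtained with sed and awk. These one-liners are an illustration and are not taken from the original text; the temporary file and variable names are assumptions. Both tools have to read the whole input from the start, so they keep a sliding window of the three most recent lines.

```shell
# Recreate the ten-line sample in a temporary file (illustrative only).
tmp=$(mktemp)
seq -f "Line %g" 10 > "$tmp"

# Reference result from tail.
last_tail=$(tail -n 3 < "$tmp")

# sed: append each new line to the pattern space (N); from line 4 onward,
# delete the oldest held line (D); on the last line, quit and print what is left.
last_sed=$(sed -e ':a' -e '$q;N;4,$D;ba' < "$tmp")

# awk: store lines in a three-slot ring buffer, then print it in order at the end.
last_awk=$(awk '{ buf[NR % 3] = $0 }
                END { for (i = NR - 2; i <= NR; i++) if (i > 0) print buf[i % 3] }' < "$tmp")

echo "$last_sed"
echo "$last_awk"

rm -f "$tmp"
```

Both echo commands print Line 8, Line 9, and Line 10, matching the output of tail.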

You can also use sed and awk for this, but tail is much faster. Removing the first

three lines goes as follows:

$ < lines tail -n +4

$ < lines sed '1,3d'

$ < lines sed -n '1,3!p'

Line 4

Line 5

Line 6

Line 7
