Scrubbing Data - Data Science at the Command Line

Note that cut can also split on character positions. This is useful when you want to extract (or remove) the same set of characters from every input line:

$ grep -i chapter alice.txt | cut -c 9-

I. Down the Rabbit-Hole

II. The Pool of Tears

III. A Caucus-Race and a Long Tale

IV. The Rabbit Sends in a Little Bill

V. Advice from a Caterpillar

VI. Pig and Pepper

VII. A Mad Tea-Party

VIII. The Queen's Croquet-Ground

IX. The Mock Turtle's Story

X. The Lobster Quadrille

XI. Who Stole the Tarts?

XII. Alice's Evidence
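To see the character-position behaviour without needing alice.txt, here is a minimal sketch on two hypothetical sample lines. The variable names (sample, titles, prefixes) are illustrative assumptions, not part of the original text. It relies on the prefix "CHAPTER " being exactly eight characters wide, so the title starts at position 9.

```shell
# Hypothetical sample lines; the "CHAPTER " prefix is 8 characters wide.
sample='CHAPTER I. Down the Rabbit-Hole
CHAPTER II. The Pool of Tears'

# Keep everything from the 9th character onward (drops the prefix).
titles=$(printf '%s\n' "$sample" | cut -c 9-)
printf '%s\n' "$titles"

# Keep only characters 1 through 7 of each line (just the word CHAPTER).
prefixes=$(printf '%s\n' "$sample" | cut -c 1-7)
printf '%s\n' "$prefixes"
```

Because cut -c counts positions rather than delimiters, this approach only fits input where the part to keep or drop has the same width on every line.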

grep has a great feature, the -o option, that outputs every match on a separate line:

$ < alice.txt grep -oE '\w{2,}' | head

Project

Gutenberg

Alice

Adventures

in

Wonderland

by

Lewis

Carroll

This
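The same idea can be tried on a single line of text. The sketch below is an illustration under stated assumptions: the sample sentence and the variable names are made up, and the POSIX character class [[:alpha:]] stands in for \w so that the pattern does not depend on a regex extension. The -o option itself is not required by POSIX, but both GNU grep and BSD grep provide it.

```shell
# Hypothetical sample sentence.
line='Alice was beginning to get very tired of it'

# -o prints each match on its own line; -E enables the {5,} repetition syntax.
# Only runs of five or more letters are matched.
words=$(printf '%s\n' "$line" | grep -oE '[[:alpha:]]{5,}')
printf '%s\n' "$words"
```

This turns one line of prose into one word per line, which is the shape that tools such as sort and uniq expect.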

But what if we wanted to create a data set of all the words that start with an “a” and end with an “e”? Well, of course there's a pipeline for that, too:

$ < alice.txt tr '[:upper:]' '[:lower:]' | grep -oE '\w{2,}' |
> grep -E '^a.*e$' | sort | uniq -c | sort -nr |
> awk '{print $2","$1}' | header -a word,count | head | csvlook

|-------------+--------|
|  word       | count  |
|-------------+--------|
|  alice      | 403    |
|  are        | 73     |
|  archive    | 13     |
|  agree      | 11     |
|  anyone     | 5      |
|  alone      | 5      |
|  age        | 4      |
|  applicable | 3      |
|  anywhere   | 3      |
|  alive      | 3      |
|-------------+--------|
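The counting part of this pipeline can be sketched with standard tools only. This is an illustration, not the pipeline from the text: the sample sentence and variable names are hypothetical, [[:alpha:]] replaces \w, a plain printf stands in for the header tool, csvlook is left out, and the sort gets a second key so that ties come out in a predictable order.

```shell
# Fix the locale so that sort and tr behave the same everywhere.
export LC_ALL=C

# Hypothetical sample text.
text='Alice gave an apple to Anne; alice ate the apple alone.'

counts=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' |        # normalise case
  grep -oE '[[:alpha:]]{2,}' |        # one word per line, two letters or more
  grep -E '^a.*e$' |                  # keep words that start with a and end with e
  sort | uniq -c |                    # count occurrences of each word
  sort -k1,1nr -k2,2 |                # highest count first, then alphabetical
  awk '{print $2 "," $1}')            # reorder to word,count

# Prepend a CSV header line (stand-in for the header tool used in the text).
csv=$(printf 'word,count\n%s\n' "$counts")
printf '%s\n' "$csv"
```

Each stage does one small job, so any of them can be inspected on its own by cutting the pipeline short at that point.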
