Scrubbing Data - Data Science at the Command Line


$ grep -i chapter alice.txt

CHAPTER I. Down the Rabbit-Hole

CHAPTER II. The Pool of Tears

CHAPTER III. A Caucus-Race and a Long Tale

CHAPTER IV. The Rabbit Sends in a Little Bill

CHAPTER V. Advice from a Caterpillar

CHAPTER VI. Pig and Pepper

CHAPTER VII. A Mad Tea-Party

CHAPTER VIII. The Queen's Croquet-Ground

CHAPTER IX. The Mock Turtle's Story

CHAPTER X. The Lobster Quadrille

CHAPTER XI. Who Stole the Tarts?

CHAPTER XII. Alice's Evidence

Here, -i means case-insensitive. We can also specify a regular expression. For example, to print only the headings that start with “The”:

$ grep -E '^CHAPTER (.*)\. The' alice.txt

CHAPTER II. The Pool of Tears

CHAPTER IV. The Rabbit Sends in a Little Bill

CHAPTER VIII. The Queen's Croquet-Ground

CHAPTER IX. The Mock Turtle's Story

CHAPTER X. The Lobster Quadrille

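Note that the -E option enables extended regular expressions. Without it, grep interprets the pattern as a basic regular expression, in which ( and ) match literal parentheses unless escaped, so this pattern would match nothing. To have grep treat a pattern as a literal string, use the -F option instead.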
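As a minimal sketch of how grep's matching modes differ, the following runs one pattern in extended and basic mode and a simpler one as a fixed string. The temporary file and the variable names are assumptions made for illustration; the file stands in for alice.txt and holds only four heading lines.

```shell
# Hypothetical stand-in for alice.txt: four heading lines only.
headings=$(mktemp)
printf '%s\n' \
  'CHAPTER I. Down the Rabbit-Hole' \
  'CHAPTER II. The Pool of Tears' \
  'CHAPTER III. A Caucus-Race and a Long Tale' \
  'CHAPTER IV. The Rabbit Sends in a Little Bill' > "$headings"

# Extended syntax (-E): ( and ) form a group, so chapters II and IV match.
ere_matches=$(grep -E '^CHAPTER (.*)\. The' "$headings")

# Basic syntax (default): ( and ) are literal parentheses, so nothing matches.
bre_matches=$(grep '^CHAPTER (.*)\. The' "$headings" || true)

# Fixed string (-F): every character is literal, including the dot.
fixed_matches=$(grep -F '. The' "$headings")

printf 'extended:\n%s\n' "$ere_matches"
printf 'basic:\n%s\n' "$bre_matches"
printf 'fixed:\n%s\n' "$fixed_matches"
rm -f "$headings"
```

On this small input the extended and fixed-string searches select the same two headings, while the basic search selects none because no heading contains a literal parenthesis.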

Based on randomness

When you're in the process of formulating your data pipeline and you have a lot of data, then debugging your pipeline can be cumbersome. In that case, sampling from the data might be useful. The main purpose of the command-line tool sample (Janssens, 2014) is to get a subset of the data by outputting only a certain percentage of the input on a line-by-line basis:

$ seq 1000 | sample -r 1% | jq -c '{line: .}'

{"line":53}

{"line":119}

{"line":141}

{"line":228}

{"line":464}

{"line":476}

{"line":523}

{"line":657}

{"line":675}

{"line":865}

{"line":948}

Here, every input line has a 1% chance of being forwarded to jq. This percentage could also have been specified as a fraction (1/100) or as a probability (0.01).
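If sample is not installed, the same line-by-line idea can be approximated with awk. This is only a sketch under stated assumptions: sample_lines is a hypothetical helper, not part of the sample tool, and it accepts the probability form only (0.01, not 1% or 1/100). The seed argument is likewise an assumption, added so that a run can be repeated.

```shell
# sample_lines is a hypothetical helper, not part of the sample tool.
# Usage: sample_lines PROBABILITY [SEED]
# Each input line is kept independently with the given probability.
sample_lines() {
  # rand() returns a value in [0, 1); srand(seed) makes the run repeatable.
  awk -v p="$1" -v seed="${2:-42}" 'BEGIN { srand(seed) } rand() < p'
}

# Keep roughly 1% of the numbers 1..1000.
seq 1000 | sample_lines 0.01
```

Because every line is decided independently, the number of lines kept varies around 1% of the input rather than being exactly 10. Passing a different seed gives a different subset.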
