$ printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -nr
2 foo
1 bar
$ printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -nr |
> awk '{print $2","$1}' | header -a value,count
value,count
foo,2
bar,1
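The header tool used above may not be available on every system. As a minimal sketch, assuming only the standard sort, uniq, and awk utilities, the same CSV output can be produced by letting awk print the header line itself. The function name count_values is hypothetical and chosen only for illustration.

```shell
# Count how often each input line occurs and emit the result as CSV.
# Assumption: values contain no whitespace, because awk splits on it
# and only the first word after the count ($2) is kept.
count_values() {
  sort | uniq -c | sort -nr |
    awk 'BEGIN { print "value,count" } { print $2 "," $1 }'
}

# Same sample input as in the pipeline above.
printf 'foo\nbar\nfoo\n' | count_values
```

The BEGIN block runs before any input is read, so the header line is printed exactly once, even when the input is empty.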
If your data requires more functionality than (a combination of) these
command-line tools offers, you can use csvsql. This command-line tool allows
you to perform SQL queries directly on CSV files. And remember, if after reading this
chapter you still need more flexibility, you're free to use R, Python, or whatever
programming language you prefer.
The command-line tools will be introduced on a need-to-use basis. You'll notice that
sometimes we can use the same command-line tool to perform multiple operations,
or vice versa, multiple command-line tools to perform the same operation. This
chapter is structured more like a cookbook, where the focus is on the problems or
recipes rather than on the command-line tools.
Overview
In this chapter, you'll learn how to:
• Convert data from one format to another
• Apply SQL queries to CSV
• Filter lines
• Extract and replace values
• Split, merge, and extract columns
Common Scrub Operations for Plain Text
In this section we describe common scrubbing operations for plain text. Formally,
plain text refers to a sequence of human-readable characters and, optionally, some
specific types of control characters (e.g., a tab or a newline; for more information, see
http://www.linfo.org/plain_text.html). Examples include ebooks, emails, logfiles, and
source code.
For the purpose of this topic, we assume that the plain text contains some data, and
that it has no clear tabular structure (like the CSV format) or nested structure (like
the JSON and HTML/XML formats). We discuss those formats later in this chapter.
Although these operations can also be applied to CSV, JSON, and HTML/XML formats,
keep in mind that the tools treat the data as plain text.
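To make this caveat concrete, here is a small sketch with made-up CSV data (the function name sample_csv and its contents are hypothetical). A plain-text tool such as grep matches characters anywhere on a line, so it cannot tell one column from another. A field-aware step is needed to restrict the match to a single column.

```shell
# Hypothetical sample data: a header line and three records.
sample_csv() {
  printf 'name,city\nParis,London\nBob,Paris\nAlice,Berlin\n'
}

# grep treats each line as plain text, so "Paris" matches in either column.
sample_csv | grep 'Paris'

# Splitting on commas with awk restricts the match to the second column.
# Assumption: no quoted fields containing commas, which this naive split
# would break apart.
sample_csv | awk -F, '$2 == "Paris"'
```

The grep command returns both the record for the person named Paris and the record for the person living in Paris, whereas the awk command returns only the latter.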