Is the data in your source file delimited by an odd character, such as a caret (^)?
Replace those pesky characters with commas:
sed 's/\^/,/g' original_file > new_file
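For instance, with a small caret-delimited file created inline (the field names here are made up for illustration), the substitution looks like this:

```shell
# Create a small caret-delimited sample file (hypothetical data)
printf 'id^name^city\n1^Alice^Boston\n2^Bob^Denver\n' > original_file

# Replace every caret with a comma; the caret is escaped because
# ^ is otherwise a regex anchor in sed's pattern syntax
sed 's/\^/,/g' original_file > new_file

cat new_file
# id,name,city
# 1,Alice,Boston
# 2,Bob,Denver
```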
When working with extremely large text files containing millions of lines, it's possible to
have a bad record on a particular line. But how can you display it? With sed, it's easy to
print a particular line, such as line number 3,451,234:
sed '3451234q;d' your_large_file.csv
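To see why this works on a small scale: `q` quits immediately after printing the addressed line, and `d` suppresses every other line. A sketch with a four-line throwaway file:

```shell
# Build a small test file
printf 'one\ntwo\nthree\nfour\n' > sample.txt

# '3q' prints line 3 and then quits, so sed never scans the rest
# of the file; 'd' deletes every earlier line without printing it
sed '3q;d' sample.txt
# three
```

Quitting early matters on a multi-gigabyte file: sed stops reading as soon as it reaches the target line instead of scanning to the end.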
Another great utility for dealing with very large files is split, which, as the name
implies, splits large files into smaller ones. For example, if you have a very large file
that you need to split into chunks of 500 MB at most while preserving the integrity of
line endings (avoiding broken lines), use this command:
split -C 500m your_large_file.csv
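Scaled down to a 32-byte limit for illustration (the chunk naming and `-C` behavior shown here assume GNU coreutils split), the same idea looks like this:

```shell
# A small demo file
printf 'line one\nline two\nline three\nline four\n' > demo.txt

# Split into chunks of at most 32 bytes each, never breaking a line;
# output files are named chunk_aa, chunk_ab, ... (GNU split)
split -C 32 demo.txt chunk_

# Concatenating the chunks in order restores the original exactly
cat chunk_* > rejoined.txt
cmp demo.txt rejoined.txt && echo "files match"
# files match
```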
There's a lot more that can be done with Unix command-line text-processing utilities.
Sometimes the best (and quickest) solution is the simplest.
File Transformations
We've only briefly touched upon the challenges of working with many files in a variety
of data formats. In practice, converting many files from one format to another can
be a very involved process. What if you've got a lot of data in a particular format, and
you need to convert it to another? This data transformation process can sometimes
be daunting.
Transforming a large collection one document at a time can take an enormous amount
of time. Fortunately, it is also a task well suited to distributed systems. When con-
fronted with hundreds or even thousands of separate documents, it's helpful to run
these tasks in parallel across a large number of distributed resources.
The most popular open-source distributed computing framework is Hadoop.
Hadoop is an open-source implementation of the MapReduce framework, which
allows data processing tasks to be split across a large number of separate machines.
Hadoop was originally inspired by Google's MapReduce research paper (but it is by no
means the only implementation of MapReduce). In Chapter 9, "Building Data Trans-
formation Workflows with Pig and Cascading," we'll take a look at how to build an
end-to-end data-transformation pipeline using the open-source Cascading workflow
framework.
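Although MapReduce is about distributing work across machines, the map/shuffle/reduce pattern itself has a rough single-machine analogy in a plain Unix pipeline. In this sketch of a word count, tr plays "map" (emit one word per line), sort plays "shuffle" (bring identical keys together), and uniq -c plays "reduce" (count each run):

```shell
# "map": emit one word per line
# "shuffle": group identical keys by sorting
# "reduce": count each run of identical words
printf 'to be or not to be\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
# 2 be
# 1 not
# 1 or
# 2 to
# (whitespace padding of the counts varies by platform)
```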
Data in Motion: Data Serialization Formats
XML, JSON, and even plain old CSV files can all be considered members of a class of
objects used in the process of converting data into bits that a computer can understand
and move from one place to another. This process is known as data serialization.
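As a tiny concrete illustration (assuming the jq utility is installed; the record contents are made up), a single CSV record can be re-serialized as JSON straight from the command line:

```shell
# Read a raw text line (-R), split it on commas, and emit a
# compact (-c) JSON array: one serialization format to another
printf 'alice,boston,42\n' | jq -Rc 'split(",")'
# ["alice","boston","42"]
```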
 
 