Is the data in your source file delimited by an odd character, such as a caret (^)?
Replace those pesky characters with commas:
sed 's/\^/,/g' original_file > new_file
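For instance, with a small caret-delimited file created inline (the field names here are made up for illustration), the substitution looks like this:

```shell
# Create a small caret-delimited sample file (hypothetical data)
printf 'id^name^city\n1^Alice^Boston\n2^Bob^Denver\n' > original_file

# Replace every caret with a comma; the caret is escaped because
# ^ is otherwise a regex anchor in sed's pattern syntax
sed 's/\^/,/g' original_file > new_file

cat new_file
# id,name,city
# 1,Alice,Boston
# 2,Bob,Denver
```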
When working with extremely large text files containing millions of lines, it's possible to
have a bad record on a particular line. But how can you display it? With sed, it's easy to
print a particular line, such as line number 3,451,234:
sed '3451234q;d' your_large_file.csv
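To see why this works on a small scale: `q` quits immediately after printing the addressed line, and `d` suppresses every other line. A sketch with a four-line throwaway file:

```shell
# Build a small test file
printf 'one\ntwo\nthree\nfour\n' > sample.txt

# '3q' prints line 3 and then quits, so sed never scans the rest
# of the file; 'd' deletes every earlier line without printing it
sed '3q;d' sample.txt
# three
```

Quitting early matters on a multi-gigabyte file: sed stops reading as soon as it reaches the target line instead of scanning to the end.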
Another great utility for dealing with very large files is split, which, as the name
implies, splits large files into smaller ones. For example, if you have a very large file
that you need to split into chunks of 500 MB at most while preserving the integrity of
line endings (avoiding broken lines), use this command:
split -C 500m your_large_file.csv
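Scaled down to a 32-byte limit for illustration (the chunk naming and `-C` behavior shown here assume GNU coreutils split), the same idea looks like this:

```shell
# A small demo file
printf 'line one\nline two\nline three\nline four\n' > demo.txt

# Split into chunks of at most 32 bytes each, never breaking a line;
# output files are named chunk_aa, chunk_ab, ... (GNU split)
split -C 32 demo.txt chunk_

# Concatenating the chunks in order restores the original exactly
cat chunk_* > rejoined.txt
cmp demo.txt rejoined.txt && echo "files match"
# files match
```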
There's a lot more that can be done with Unix command-line text-processing utilities.
Sometimes the best (and quickest) solution is the simplest.
File Transformations
We've only briefly touched upon the challenges of working with many files in a variety
of data formats. In practice, converting many files from one format to another can
be a very involved process. What if you've got a lot of data in a particular format, and
you need to convert it to another? This data transformation process can sometimes
be daunting.
Transforming a large collection one document at a time can take an enormous amount
of time. Fortunately, it is also a task well suited to distributed systems. When con-
fronted with hundreds or even thousands of separate documents, it's helpful to run
these tasks in parallel across a large number of distributed resources.
The most popular open-source distributed computing framework is Hadoop.
Hadoop is an open-source implementation of the MapReduce framework, which
allows data processing tasks to be split across a large number of separate machines.
Hadoop was originally inspired by Google's MapReduce research paper (but it is by no
means the only implementation of MapReduce). In Chapter 9, "Building Data Trans-
formation Workflows with Pig and Cascading," we'll take a look at how to build an
end-to-end data-transformation pipeline using the open-source Cascading workflow
framework.
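Although MapReduce is about distributing work across machines, the map/shuffle/reduce pattern itself has a rough single-machine analogy in a plain Unix pipeline. In this sketch of a word count, tr plays "map" (emit one word per line), sort plays "shuffle" (bring identical keys together), and uniq -c plays "reduce" (count each run):

```shell
# "map": emit one word per line
# "shuffle": group identical keys by sorting
# "reduce": count each run of identical words
printf 'to be or not to be\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
# 2 be
# 1 not
# 1 or
# 2 to
# (whitespace padding of the counts varies by platform)
```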
Data in Motion: Data Serialization Formats
XML, JSON, and even plain old CSV files can all be considered members of a class of
objects used in the process of converting data into bits that a computer can understand
and move from one place to another. This process is known as data serialization.
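As a tiny concrete illustration (assuming the jq utility is installed; the record contents are made up), a single CSV record can be re-serialized as JSON straight from the command line:

```shell
# Read a raw text line (-R), split it on commas, and emit a
# compact (-c) JSON array: one serialization format to another
printf 'alice,boston,42\n' | jq -Rc 'split(",")'
# ["alice","boston","42"]
```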
 
 