Years ago, due to limitations in hardware (and probably a linguistic bias among American geeks working on early computing systems), many computers represented textual characters in a format known as the American Standard Code for Information Interchange, or ASCII. ASCII worked great for representing characters from the English language. It worked so well that the Johnson administration even made ASCII a federal standard in 1968.[4]
What worked well in the United States didn't exactly translate for users who communicate in languages with characters that differ from those in the English alphabet. In fact, many countries ended up creating their own encoding standards that, although similar to ASCII, were incompatible with it in various ways. This caused all kinds of frustrating problems in the software world.
Fortunately, by the late 1980s, a group of very smart computer scientists began to come up with solutions to this alphabet soup (pardon the pun). Their solution was the Unicode Standard, which aims to define a set of standard encodings for all the characters in the world. Unicode is typically implemented using one of several standards, the most common being UTF-8 and UTF-16. The UTF-8 standard was famously first sketched on the back of a placemat and implemented in a matter of a few days.[5] Because most of the technologies that make up the Big Data movement grew up after its creation, Unicode is almost universally supported by the software featured in this chapter and others. Many of the technologies and tools in this book natively use one of these encodings.
Unfortunately, it's not uncommon to encounter situations in which enormous amounts of data are tied up in files encoded with some variant of ASCII or some other non-Unicode scheme. Sometimes non-Unicode data is created accidentally or unwittingly. Another source of non-Unicode data is legacy software, such as the decades-old reservation systems still used by some airlines. Some older desktop software might only be able to export data in obsolete encodings.
The message here is simple: If you've got lots of data lying around that is not
already encoded in UTF-8 or UTF-16, then convert it!
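In practice, that conversion can often be a single pass through iconv. Here is a minimal sketch; the file names are hypothetical, and it assumes the legacy data is ISO-8859-1 (Latin-1), which you'd need to confirm for your own files (iconv has to be told the source encoding):

```shell
# Create a small sample Latin-1 file ("café", with a raw 0xE9 byte).
printf 'caf\xe9\n' > legacy.txt

# Convert Latin-1 to UTF-8; iconv fails loudly on bytes it cannot map.
iconv -f ISO-8859-1 -t UTF-8 legacy.txt > legacy-utf8.txt
```

For a large directory tree, the same command can be wrapped in a find loop, and tools such as file can help guess the source encoding before you commit to a conversion.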
Working with text files? Try these classics.
When working with large amounts of text data, sometimes the quickest way to solve a problem is to use some of the original Unix command-line utilities, such as split, grep, and sed. After many decades, this software is still in widespread use; thanks to open-source Unix-like systems such as GNU/Linux, the utilities might even be used more than ever before!
Do you have annoying header information in your CSV file? Remove the first two lines
of a file and place the result in a new file:
sed '1,2d' file > newfile
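The other two classics named above work in the same one-liner spirit. This is a small sketch with a hypothetical log file: grep counts matching lines, and split chops a big file into fixed-size chunks:

```shell
# A small sample log to work on.
printf 'ok\nERROR disk full\nok\nERROR timeout\n' > app.log

# Count the lines that match a pattern.
grep -c 'ERROR' app.log        # prints 2

# Split the file into chunks of two lines each: part_aa, part_ab, ...
split -l 2 app.log part_
```

On a multi-gigabyte file, split's chunks can then be processed in parallel or loaded into other tools piece by piece.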
[4] www.presidency.ucsb.edu/ws/index.php?pid=28724
[5] www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt