Years ago, due to limitations in hardware (and probably a linguistic bias among American geeks working on early computing systems), many computers represented textual characters in a format known as the American Standard Code for Information Interchange, or ASCII. ASCII worked great for representing characters from the English language. It worked so well that the Johnson administration even made ASCII a federal standard in 1968.[4]
What worked well in the United States didn't exactly translate for users who communicate in languages with characters that differ from those in the English alphabet. In fact, many countries ended up creating their own encoding standards that, although similar to ASCII, were incompatible with it in various ways. This caused all kinds of frustrating problems in the software world.
Fortunately, by the late 1980s, a group of very smart computer scientists began to come up with solutions to this alphabet soup (pardon the pun). Their solution was the Unicode Standard, which aims to define a set of standard encodings for all the characters in the world. Unicode is typically implemented using one of several standards, the most common being UTF-8 and UTF-16. The UTF-8 standard was famously first sketched on the back of a placemat and implemented in a matter of a few days.[5] Because most of the technologies that make up the Big Data movement grew up after its creation, Unicode is almost universally supported by the software featured in this chapter and others. Many of the technologies and tools in this book natively use one of these encodings.
Unfortunately, it's not uncommon to encounter situations in which enormous amounts of data are tied up in files encoded with some variant of ASCII or some other non-Unicode scheme. Sometimes non-Unicode data is created accidentally or unwittingly. Another source of non-Unicode data is legacy software, such as the decades-old reservation systems still used by some airlines. Some older desktop software might only be able to export data in obsolete encodings.
The message here is simple: If you've got lots of data lying around that is not
already encoded in UTF-8 or UTF-16, then convert it!
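In practice, that conversion can often be a single pass through iconv. Here is a minimal sketch; the file names are hypothetical, and it assumes the legacy data is ISO-8859-1 (Latin-1), which you'd need to confirm for your own files (iconv has to be told the source encoding):

```shell
# Create a small sample Latin-1 file ("café", with a raw 0xE9 byte).
printf 'caf\xe9\n' > legacy.txt

# Convert Latin-1 to UTF-8; iconv fails loudly on bytes it cannot map.
iconv -f ISO-8859-1 -t UTF-8 legacy.txt > legacy-utf8.txt
```

For a large directory tree, the same command can be wrapped in a find loop, and tools such as file can help guess the source encoding before you commit to a conversion.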
Working with text files? Try these classics.
When working with large amounts of text data, sometimes the quickest way to solve a problem is to use some of the original Unix command-line utilities, such as split, grep, and sed. After many decades, this software is still in widespread use; thanks to open-source Unix-like systems such as GNU/Linux, the utilities might even be used more than ever before!
Do you have annoying header information in your CSV file? Remove the first two lines
of a file and place the result in a new file:
sed '1,2d' file > newfile
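The other two classics named above work in the same one-liner spirit. This is a small sketch with a hypothetical log file: grep counts matching lines, and split chops a big file into fixed-size chunks:

```shell
# A small sample log to work on.
printf 'ok\nERROR disk full\nok\nERROR timeout\n' > app.log

# Count the lines that match a pattern.
grep -c 'ERROR' app.log        # prints 2

# Split the file into chunks of two lines each: part_aa, part_ab, ...
split -l 2 app.log part_
```

On a multi-gigabyte file, split's chunks can then be processed in parallel or loaded into other tools piece by piece.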
[4] www.presidency.ucsb.edu/ws/index.php?pid=28724
[5] www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt