Hosting and Sharing Terabytes of Raw Data - Data Just Right: Introduction to Large-Scale Data and Analytics

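import csv
import sys

# Assumption: the writer targets standard output; any open text file works too
csv_writer = csv.writer(sys.stdout)

sample_row = ('Michael', 1234, 23.46, 'San Francisco, California')

# Write a CSV row
csv_writer.writerow(sample_row)

# Result: Michael,1234,23.46,"San Francisco, California"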

CSV is a great format for “flat” data, that is, data in which each record can be represented on a single line. Log data, such as that coming from Web servers and sensors, is well represented in this format. CSV can be fairly compact as text-based data goes: It's the data, pure and simple, with little markup or structure to get in the way. It is also easy for most people to use, as it can be imported into spreadsheets, ingested into databases, and easily parsed programmatically. For logs or records that don't require data modeling beyond flat rows, CSV can be extremely useful.
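
To illustrate how little code programmatic parsing takes, here is a minimal sketch that reads the sample row shown earlier back in with Python's standard csv module. The in-memory string and the helper name parse_rows are assumptions made for the example, not something from the original text.

```python
import csv
import io

# An in-memory stand-in for a CSV file (assumed for illustration).
csv_text = 'Michael,1234,23.46,"San Francisco, California"\n'

def parse_rows(text):
    """Parse CSV text into a list of rows, each a list of strings."""
    return list(csv.reader(io.StringIO(text)))

rows = parse_rows(csv_text)
print(rows[0])
# ['Michael', '1234', '23.46', 'San Francisco, California']
```

Note that every field comes back as a string; CSV itself carries no type information, so converting 1234 or 23.46 back to numbers is left to the reader of the file.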

Most importantly, CSV is an excellent format for sequential access of data. In other words, it's simple for a computer program to grab one, two, or 1,000 rows at a time from the middle of a file and just start processing. In a distributed processing system, this is helpful for breaking up large programming tasks into many smaller ones. Do you have a huge CSV file that is overwhelming the memory of your single machine? Just split it up and process the fragments.
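
One way to picture the "split it up" idea is to read a file a fixed number of rows at a time, so that only one fragment is in memory at once. The sketch below assumes a hypothetical helper named iter_chunks and uses made-up log rows in place of a real file.

```python
import csv
import io
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size parsed rows from an iterable of CSV lines."""
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical log data standing in for a large file.
log = io.StringIO("".join(f"host{i},200,{i * 10}\n" for i in range(5)))

chunk_sizes = [len(chunk) for chunk in iter_chunks(log, 2)]
print(chunk_sizes)  # [2, 2, 1]
```

Each chunk could be handed to a separate worker. Splitting the raw file itself at line boundaries is only safe when no quoted field contains a newline, which is one reason to let a CSV parser find the row boundaries.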

Although CSV has many positives going for it, there are cases in which it can be a pretty bad format for sharing large amounts of data. First of all, it lacks much in the way of standardization. Certainly there have been attempts at official CSV standards [3], but in practice, there is little regularity in how developers create CSV output. Unfortunately, this sometimes means that people will add a few header lines, use peculiar delimiters between fields, or escape strings in eccentric ways. CSV also doesn't provide a standard way of referring to information about the file itself; when working with collections of CSVs, any information about the type or date that the data represents is sometimes found only in the filename. In fact, CSV files basically lack any metadata at all, requiring those who use them for sharing data to provide additional information about the file somewhere else.
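
The practical consequence is that a reader often has to be told, out of band, how a particular file was written. The following sketch assumes a made-up export with two preamble lines, semicolon delimiters, and single-quote quoting; the helper read_nonstandard and its parameters are hypothetical and simply pass those details to Python's csv module.

```python
import csv
import io

# Hypothetical export: two preamble lines, then semicolon-delimited rows.
raw = (
    "Report generated by ExampleTool\n"
    "Date: 2013-01-01\n"
    "name;city;visits\n"
    "Michael;'San Francisco; California';12\n"
)

def read_nonstandard(text, skip_lines, delimiter, quotechar):
    """Skip preamble lines, then parse the rest as a header plus dict rows."""
    stream = io.StringIO(text)
    for _ in range(skip_lines):
        next(stream)
    return list(csv.DictReader(stream, delimiter=delimiter, quotechar=quotechar))

records = read_nonstandard(raw, skip_lines=2, delimiter=";", quotechar="'")
print(records[0]["city"])  # San Francisco; California
```

Nothing in the file itself announces the number of preamble lines, the delimiter, or the quote character; all three had to be supplied by the caller, which is the missing-metadata problem in miniature.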

CSV is very bad at describing data that does not fit well into discrete rows. In practice, real-world data often has many dimensions, and not all of these dimensions necessarily fit into the rigid structure of CSV's rectangular regularity. Take, for example, data about the number of people registered for political parties in the United States, organized state by state. All states have representatives from the two major parties and many of the much smaller ones. However, some states may have specific parties not found in other states. Therefore, the list of parties will be a different size for different states. Expressing this data in CSV format presents a data-modeling challenge. Does one create columns of data for every possible political party? Is the list of parties concatenated into a single string and stored in a field all by itself? These representations are not a natural fit for a fixed-size-row structure.
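
The two options just described can be sketched side by side. The state names, party names, and counts below are invented purely for illustration, and the helpers to_wide and to_packed are hypothetical names for the "one column per party" and "concatenated string" approaches.

```python
import csv
import io

# Made-up registration counts, for illustration only.
registrations = {
    "State A": {"Democratic": 100, "Republican": 90, "Green": 5},
    "State B": {"Democratic": 80, "Republican": 85, "Local Party": 3},
}

def to_wide(data):
    """One column per party seen anywhere; missing parties become empty cells."""
    parties = sorted({p for counts in data.values() for p in counts})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["state"] + parties)
    for state, counts in data.items():
        writer.writerow([state] + [counts.get(p, "") for p in parties])
    return out.getvalue()

def to_packed(data):
    """One 'parties' field per state holding a concatenated party=count list."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["state", "parties"])
    for state, counts in data.items():
        packed = "|".join(f"{p}={n}" for p, n in counts.items())
        writer.writerow([state, packed])
    return out.getvalue()

print(to_wide(registrations))
print(to_packed(registrations))
```

The wide form leaves empty cells and needs a new column whenever any state adds a party, while the packed form keeps rows a fixed size only by hiding a second, ad hoc format inside one field that every consumer must parse again.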
