Hosting and Sharing Terabytes of Raw Data - Data Just Right: Introduction to Large-Scale Data and Analytics


economically feasible way to share data at high availability without incurring exorbitant storage or bandwidth costs.

Using an IaaS storage solution fulfills several of the guiding principles we discussed in the first chapter. First, this solution allows us to plan for scale: if storage or bandwidth needs increase massively, our solution will be able to handle them. Also, this model helps us avoid building infrastructure by allowing us to focus on our data rather than on purchasing and maintaining our own hardware, hiring systems administrators, or thinking about backups or electricity.

The Network Is Slow

The network is really slow. The average global Internet data transfer speed in 2012 was 2.3 megabits per second (Mbps), with the United States clocking in at around 5.3 Mbps.[2] Imagine having to transfer your 25 gigabytes of data from one place to another at a consistent speed of 5.3 Mbps. At this rate, the transfer will take close to 11 hours. Projects like Google Fiber that aim to increase the average Internet connection toward the 1,000 Mbps range using optical fiber seem promising, but they may not be widespread in the United States for many years. The solutions to many of the issues we've raised inherently favor use of distributed, ubiquitous computing systems. However, network limitations such as bandwidth and latency will often rear their heads when it comes to big data challenges.
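The transfer-time estimate above can be checked with a few lines of Python. This is a minimal sketch, assuming decimal units (1 gigabyte = 8,000 megabits) and a perfectly steady connection; the function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_gigabytes, speed_mbps):
    """Estimate the hours needed to move size_gigabytes at speed_mbps."""
    # Assumption: decimal units, so 1 gigabyte = 8 * 1000 megabits.
    megabits = size_gigabytes * 8 * 1000
    seconds = megabits / speed_mbps
    return seconds / 3600


# 25 gigabytes at the 2012 U.S. average of 5.3 Mbps
print(f"{transfer_hours(25, 5.3):.1f} hours")  # prints "10.5 hours"

# The same data over a 1,000 Mbps fiber connection, in seconds
print(f"{transfer_hours(25, 1000) * 3600:.0f} seconds")  # prints "200 seconds"
```

With binary units (1 gigabyte = 8 * 1024 megabits) the first figure rises to about 10.7 hours, which is consistent with the rough "close to 11 hours" estimate.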

Let's consider a practical use case. A local government has just installed devices that track the position and speed of each bus in the transit system every minute. This data is used to determine how well these buses stick to their planned schedules. However, because this is a civic project, the city wants to make the raw data available to people who would like to run their own analyses. How should the city structure the data so that others can easily make use of it?

A common format for sharing data is comma-separated value (CSV) files. Each line of a CSV file holds one record, with the fields in the record separated by commas. Records are separated from one another by line breaks. While the “C” in CSV usually stands for “comma,” it's not uncommon to find formats that are delimited by other characters, such as tabs, spaces, or other, more esoteric symbols. Listing 1.1 shows CSV creation in Python.
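As a rough illustration of the point about other delimiters, the sketch below reads tab-delimited records with Python's csv module and rewrites them as comma-separated text. The sample rows (bus identifiers, coordinates, speeds) are invented for illustration and are not taken from any real transit data.

```python
import csv
import io

# Hypothetical sample: two tab-delimited records
# (bus id, latitude, longitude, speed).
tab_data = "bus_12\t37.77\t-122.41\t24\nbus_47\t37.80\t-122.27\t0\n"

# Tell the reader which delimiter separates the fields.
reader = csv.reader(io.StringIO(tab_data), delimiter="\t")
records = list(reader)

# Write the same records back out using the default comma delimiter.
comma_output = io.StringIO()
csv.writer(comma_output).writerows(records)

print(records[0])
print(comma_output.getvalue())
```

Every field comes back as a string, so numeric columns such as speed need an explicit conversion before analysis.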

Listing 1.1 Creating a CSV file using Python

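import csv

# Open the output file for writing
my_csv_file = open('/tmp/sample.csv', 'w')
# Wrap the file in a writer that formats each row as a CSV record
csv_writer = csv.writer(my_csv_file)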
