economically feasible way to share data at high availability without incurring
exorbitant storage or bandwidth costs.
Using an IaaS storage solution fulfills several of the guiding principles we
discussed in the first chapter. First, this solution allows us to plan for scale: If massive
increases in storage or bandwidth occur, our solution will be able to handle them. Also,
this model helps us avoid building infrastructure by allowing us to worry about
our data rather than purchasing and maintaining our own hardware, hiring systems
administrators, or thinking about backups or electricity.
The Network Is Slow
The network is really slow. The average global Internet data transfer speed in 2012 was
2.3 megabits per second (Mbps), with the United States clocking in at around 5.3 Mbps. 2
Imagine having to transfer your 25 gigabytes of data from one place to another at
a consistent speed of 5.3 Mbps. At this rate, the transfer will take close to 11 hours.
Projects like Google Fiber that aim to increase average Internet connection speeds toward
the 1,000 Mbps range using optical fiber seem promising, but they may not be
widespread in the United States for many years. The solutions to many of the issues we've
raised inherently favor use of distributed, ubiquitous computing systems. However,
network latency will often rear its head when it comes to big data challenges.
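The transfer-time arithmetic above can be checked with a short sketch. The function name transfer_hours and its parameters are illustrative assumptions, not something from the text. The sketch also assumes a perfectly steady link with no protocol overhead, and it shows that the estimate depends on whether a gigabyte is counted as 1000^3 or 1024^3 bytes.

```python
def transfer_hours(size_gigabytes, speed_mbps, binary_units=False):
    """Estimate the hours needed to move size_gigabytes at a steady speed_mbps.

    Assumes a constant transfer rate and ignores protocol overhead.
    """
    # A gigabyte is 1000**3 bytes in decimal units, 1024**3 in binary units.
    bytes_per_gb = 1024 ** 3 if binary_units else 1000 ** 3
    total_bits = size_gigabytes * bytes_per_gb * 8
    # One megabit per second is 1,000,000 bits per second.
    seconds = total_bits / (speed_mbps * 1_000_000)
    return seconds / 3600


# 25 gigabytes at the 5.3 Mbps figure quoted above.
print(round(transfer_hours(25, 5.3), 1))                     # 10.5
print(round(transfer_hours(25, 5.3, binary_units=True), 1))  # 11.3

# The same transfer at 1,000 Mbps, expressed in minutes.
print(round(transfer_hours(25, 1000) * 60, 1))               # 3.3
```

Both unit conventions land between 10 and 12 hours at 5.3 Mbps, which is consistent with the "close to 11 hours" estimate, while the same payload at 1,000 Mbps would take only a few minutes.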
Choosing the Right Data Format
Let's consider a practical use case. A local government has just installed devices that
track the position and speed of each bus in the transit system every minute. This data
is used to determine how well these buses stick to their planned schedules. However,
because this is a civic project, the city wants to make the raw data available to people
who would like to run their own analyses. How should the city structure the data so
that others are easily able to make use of it?
A common format for sharing data is the comma-separated values (CSV) file. Each line
of a CSV file holds one record, with the fields in that record separated by commas.
Records are separated from one another by line breaks. While the “C” in CSV often stands for “comma,”
it's not uncommon to find formats that are delimited by other characters, such as tabs,
spaces, or other, more esoteric symbols. Listing 1.1 shows CSV creation in Python.
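As a minimal sketch of the record-and-delimiter structure just described, the following writes the same records once with commas and once with tabs, using Python's standard csv module. The column names and values are hypothetical, invented only for illustration. The helper to_delimited is likewise an assumption, not part of the text.

```python
import csv
import io

# Hypothetical bus records: the field names and values are made up.
rows = [
    ["bus_id", "minute", "speed_kmh"],
    ["12", "480", "31.5"],
    ["12", "481", "29.0"],
]


def to_delimited(records, delimiter=","):
    """Serialize records to text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


comma_text = to_delimited(rows)                  # fields separated by commas
tab_text = to_delimited(rows, delimiter="\t")    # same records, tab-delimited

# Reading the text back requires telling the reader which delimiter was used.
parsed = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(comma_text)
print(parsed == rows)
```

The reader must be told which delimiter the writer used, which is one reason a published data set should document its delimiter.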
Listing 1.1 Creating a CSV file using Python
import csv

# Open the output file for writing and wrap it in a CSV writer object.
my_csv_file = open('/tmp/sample.csv', 'w')
csv_writer = csv.writer(my_csv_file)
2. www.theverge.com/2012/5/1/2990469/average-global-internet-speed-drop-us
 
 
 