and Simple Storage Service (S3), Azure Services Platform, DynDNS, Google
Compute Engine, and HP Cloud. Users can dynamically provision the
virtualized hardware resources and configure them. One advantage of
IaaS over the other models is that the cloud user has more control over
the logic for distributing the tasks needed to complete a single service
request or multiple requests. Servers can be dynamically provisioned at peak
request times and de-provisioned thereafter (see the sketch below). Similarly,
for long-running processes, servers can be provisioned as long as the task
scheduling and distribution logic is known. In addition to the EC2 and S3 cloud
platforms, Amazon also offers Amazon Elastic MapReduce (EMR), which
uses the Hadoop framework (Hadoop Distributed File System (HDFS),²³
MapReduce,²⁴ Pig,²⁵ Hive,²⁶ etc.) for big data storage and analytics.
To test the use of IaaS clouds for Geosciences applications, Huang
et al. (2010) deployed the Global Earth Observation System of Systems
(GEOSS) Clearinghouse metadata catalog service on Amazon EC2. Similarly,
Baranski et al. (2010) proposed a pay-per-use revenue model for geoprocessing
services in the cloud, in support of future business models for such services.
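As a concrete illustration of the provisioning and de-provisioning described above, the following sketch uses the AWS SDK for Python (boto3) to start extra EC2 worker servers for a peak period and terminate them afterwards. The AMI ID, instance type, tag values, and worker counts are illustrative assumptions, not values taken from the studies cited above.

import boto3

# Minimal sketch of elastic provisioning on an IaaS cloud (Amazon EC2 via the
# boto3 SDK). The AMI ID, instance type, and tag values are placeholder
# assumptions.
ec2 = boto3.client("ec2", region_name="us-east-1")

def scale_out(n_workers):
    """Provision n_workers worker servers for a peak in request load."""
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical machine image
        InstanceType="m5.large",           # hypothetical instance size
        MinCount=n_workers,
        MaxCount=n_workers,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "geoprocessing-worker"}],
        }],
    )
    return [i["InstanceId"] for i in response["Instances"]]

def scale_in(instance_ids):
    """De-provision the extra servers once the peak has passed."""
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)

# Add two workers for a peak period, dispatch tasks according to the known
# scheduling/distribution logic, then release the servers.
worker_ids = scale_out(2)
# ... distribute tasks to the new workers ...
scale_in(worker_ids)

In a real deployment this logic would more likely be delegated to an auto-scaling service rather than hand-written polling, but the underlying provision and de-provision calls are the same.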
In situ processing
In situ data processing denotes the ability to access data directly “in-place”,
without having to import it into a database beforehand (Alagiannis et al.
2012). Complex, ad hoc analytics can therefore be performed directly on
external data sources, avoiding the pre-loading step and all the overhead it
incurs. In certain situations in situ processing is preferable, and it is
sometimes the only practical way to work with the data, even though it can be
slower than in-database processing because the engine cannot adapt the data
layout to its I/O access patterns or apply its internal optimizations.
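A minimal sketch of what such in-place access can look like in practice is given below, using DuckDB's ability to run SQL directly over external files; the file path and column names are hypothetical, and comparable facilities (external or foreign tables) exist in other engines.

import duckdb

# Minimal sketch of in situ processing: an ad hoc aggregation runs directly
# against an external file, with no prior import into the database.
# The file path and column names are hypothetical.
result = duckdb.sql("""
    SELECT sensor_id,
           avg(temperature) AS mean_temp,
           count(*)         AS n_readings
    FROM 'archive/observations_2023.csv'   -- scanned in place, never loaded
    WHERE quality_flag = 0
    GROUP BY sensor_id
    ORDER BY mean_temp DESC
""").fetchall()

for sensor_id, mean_temp, n_readings in result:
    print(sensor_id, round(mean_temp, 2), n_readings)

Because the external file is parsed on every query, repeated analyses pay the scan cost each time; this is one source of the slowdown relative to in-database processing noted above, since the engine cannot reorganize the data to suit its preferred I/O access patterns.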
In situ processing is most useful when working with existing, legacy
data archives where the data is already stored in a certain structure and many
services are built to assume that structure. Modifying the data archive is
not an option, and importing it into a database leads to unnecessary data
duplication. In situ processing is non-invasive, so a database with such
23 Similar to the Google File System, HDFS is a scalable, fault-tolerant, distributed file system designed to run on commodity hardware.
24 MapReduce is a software framework and programming model for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
25 Pig is a platform for analyzing large data sets that consists of a high-level language for expressing analysis programs over data stored in Hadoop.
26 Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.