Big Data Analytics Methodology - Big Data Imperatives

privacy is effectively both a technical and a sociological problem, which must be

addressed jointly from both perspectives to realize the promise of big data.

Consider, for example, data gleaned from location-based services. These new

architectures require a user to share his/her location with the service provider, resulting

in obvious privacy concerns. Note that hiding the user's identity alone without hiding

the user's location would not properly address these privacy concerns. An attacker or a

(potentially malicious) location-based server can infer the identity of the query source

from its (subsequent) location information. For example, a user's location information

can be tracked through several stationary connection points (e.g., cell towers). After a

while, the user leaves “a trail of packet crumbs” that could be associated with a certain

residence or office location and thereby used to determine the user's identity. Several

other types of surprisingly private information such as health issues (e.g., presence in a

cancer treatment center) or religious preferences (e.g., presence in a church) can also be

revealed by just observing anonymous users' movement and usage patterns over time. In

general, hiding a user's location is much more challenging than hiding his/her identity.

This is because with location-based services, the location of the user is needed for a

successful data access or data collection, while the identity of the user is not necessary.
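
The inference described above can be illustrated with a minimal sketch. It assumes a hypothetical trace of anonymized pings (pseudonym, hour of day, cell tower ID); the data, the function name likely_anchor, and the hour ranges are illustrative assumptions, not taken from any real service. The most frequent nighttime tower suggests a residence and the most frequent daytime tower suggests an office, which is the kind of anchor an attacker could match against public records.

```python
from collections import Counter

# Hypothetical anonymized trace: (pseudonym, hour_of_day, cell_tower_id).
pings = [
    ("u17", 23, "T4"), ("u17", 2, "T4"), ("u17", 10, "T9"),
    ("u17", 14, "T9"), ("u17", 1, "T4"), ("u17", 20, "T2"),
]

def likely_anchor(pings, pseudonym, hours):
    """Return the tower seen most often for `pseudonym` during `hours`."""
    counts = Counter(t for p, h, t in pings if p == pseudonym and h in hours)
    return counts.most_common(1)[0][0] if counts else None

# Assumed hour ranges for "at home" and "at work"; real patterns vary.
NIGHT = set(range(22, 24)) | set(range(0, 6))
WORK = set(range(9, 17))

home = likely_anchor(pings, "u17", NIGHT)  # candidate residence tower
work = likely_anchor(pings, "u17", WORK)   # candidate office tower
```

Even though the trace never contains a name, the pair of home and work towers is often distinctive enough to narrow the pseudonym down to very few people.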

Human Collaboration: Despite the tremendous advances made in computational

analysis, there remain many patterns that humans can easily detect but that computer

algorithms have a hard time finding. Ideally, analytics for big data will not be all

computational; rather, they will be designed explicitly to have a human in the loop.

The new sub-field of visual analytics is attempting to do this, at least with respect to the

modeling and analysis phase in the pipeline. There is similar value to human input at all

stages of the analysis pipeline.

In today's complex world, it often takes multiple experts from different domains to

really understand what is going on. A big data analysis system must support input from

multiple human experts, and shared exploration of results. These multiple experts may be

separated in space and time when it is too expensive to assemble an entire team together

in one room. The data system has to accept this distributed expert input and support their

collaboration.

System Architecture: Business data is analyzed for many purposes: a company may

perform system log analytics and social media analytics for risk assessment, customer

retention, brand management, and so on. Typically, such varied tasks have been

handled by separate systems, even if each system includes common steps of information

extraction, data cleaning, relational-like processing (joins, group-by, aggregation),

statistical and predictive modeling, and appropriate exploration and visualization tools

as discussed in the methodology in this chapter.

With big data, the use of separate systems in this fashion becomes prohibitively

expensive given the large size of the data sets. The expense is due not only to the cost

of the systems themselves but also to the time to load the data into multiple systems.

Consequently, big data has made it necessary to run heterogeneous workloads on a single

infrastructure that is sufficiently flexible to handle all these workloads. The challenge here

is not to build a system that is ideally suited for all processing tasks. Instead, the need is

for the underlying system architecture to be flexible enough that the components built on

top of it for expressing the various kinds of processing tasks can tune it to efficiently run

these different workloads.
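
As a rough illustration of this idea, the sketch below loads a small data set once and runs two different workloads over it, rather than copying the data into a separate system per task. The records, field names, and the functions errors_per_customer and at_risk are hypothetical, and an in-memory list stands in for the shared infrastructure.

```python
from collections import defaultdict

# Hypothetical records, loaded once and shared by every workload.
events = [
    {"customer": "a", "source": "log", "errors": 2},
    {"customer": "a", "source": "social", "errors": 0},
    {"customer": "b", "source": "log", "errors": 5},
]

def errors_per_customer(rows):
    """Relational-style workload: group by customer and sum errors."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["customer"]] += r["errors"]
    return dict(totals)

def at_risk(rows, threshold=3):
    """Toy risk-assessment workload: customers whose total errors exceed threshold."""
    return sorted(c for c, n in errors_per_customer(rows).items() if n > threshold)

# Each workload is a component built on top of the same underlying data.
WORKLOADS = {"aggregate": errors_per_customer, "risk": at_risk}
results = {name: fn(events) for name, fn in WORKLOADS.items()}
```

The point of the design is that adding a workload means registering another function, not loading the data into another system.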
