privacy is effectively both a technical and a sociological problem, which must be
addressed jointly from both perspectives to realize the promise of big data.
Consider, for example, data gleaned from location-based services. These new
architectures require a user to share his/her location with the service provider, resulting
in obvious privacy concerns. Note that hiding the user's identity alone, without hiding
the user's location, would not properly address these privacy concerns. An attacker or a
(potentially malicious) location-based server can infer the identity of the query source
from its (subsequent) location information. For example, a user's location information
can be tracked through several stationary connection points (e.g., cell towers). After a
while, the user leaves “a trail of packet crumbs” that could be associated with a certain
residence or office location and thereby used to determine the user's identity. Several
other types of surprisingly private information such as health issues (e.g., presence in a
cancer treatment center) or religious preferences (e.g., presence in a church) can also be
revealed just by observing anonymous users' movement and usage patterns over time. In
general, hiding a user's location is much more challenging than hiding his/her identity.
This is because with location-based services, the location of the user is needed for a
successful data access or data collection, while the identity of the user is not necessary.
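
The inference described above can be illustrated with a minimal sketch. It assumes a hypothetical anonymized trace of (pseudonym, hour of day, tower ID) records and a simple heuristic that is not taken from the text: the tower a pseudonym contacts most often at night is probably near that user's residence. The data, the function name `likely_home_towers`, and the night-time window are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical anonymized trace: (pseudonym, hour_of_day, tower_id).
# No names appear anywhere, only pseudonyms and connection points.
pings = [
    ("u1", 23, "T7"), ("u1", 2, "T7"), ("u1", 10, "T3"),
    ("u1", 14, "T3"), ("u1", 1, "T7"), ("u1", 19, "T5"),
    ("u2", 0, "T2"), ("u2", 11, "T9"), ("u2", 3, "T2"),
]

def likely_home_towers(pings, night_start=22, night_end=6):
    """Return the most frequent night-time tower for each pseudonym.

    Assumed heuristic: pings between night_start and night_end
    cluster around the user's residence.
    """
    night_counts = defaultdict(Counter)
    for user, hour, tower in pings:
        if hour >= night_start or hour < night_end:
            night_counts[user][tower] += 1
    return {
        user: counts.most_common(1)[0][0]
        for user, counts in night_counts.items()
    }

homes = likely_home_towers(pings)
print(homes)  # {'u1': 'T7', 'u2': 'T2'}
```

Once a pseudonym is tied to a tower near a specific residence, joining that location against a public address directory may be enough to recover the identity, which is why removing names alone does not protect the user.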
Human Collaboration: Despite the tremendous advances made in computational
analysis, there remain many patterns that humans can easily detect but that computer
algorithms have a hard time finding. Ideally, analytics for big data will not be all
computational; rather, they will be designed explicitly to have a human in the loop.
The new sub-field of visual analytics is attempting to do this, at least with respect to the
modeling and analysis phase in the pipeline. There is similar value to human input at all
stages of the analysis pipeline.
In today's complex world, it often takes multiple experts from different domains to
really understand what is going on. A big data analysis system must support input from
multiple human experts, and shared exploration of results. These multiple experts may be
separated in space and time when it is too expensive to assemble an entire team together
in one room. The data system has to accept this distributed expert input and support the
experts' collaboration.
System Architecture: Business data is analyzed for many purposes: a company may
perform system log analytics and social media analytics for risk assessment, customer
retention, brand management, and so on. Typically, such varied tasks have been
handled by separate systems, even if each system includes common steps of information
extraction, data cleaning, relational-like processing (joins, group-by, aggregation),
statistical and predictive modeling, and appropriate exploration and visualization tools
as discussed in the methodology in this chapter.
With big data, the use of separate systems in this fashion becomes prohibitively
expensive given the large size of the data sets. The expense is due not only to the cost
of the systems themselves but also to the time to load the data into multiple systems.
Consequently, big data has made it necessary to run heterogeneous workloads on a single
infrastructure that is sufficiently flexible to handle all these workloads. The challenge here
is not to build a system that is ideally suited for all processing tasks. Instead, the need is
for the underlying system architecture to be flexible enough that the components built on
top of it for expressing the various kinds of processing tasks can tune it to efficiently run
these different workloads.
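
As a rough illustration of this idea, the sketch below loads a data set once and runs two different kinds of workload over the same store: a relational-style group-by and a simple statistical summary. It is a toy model under stated assumptions, not a description of any particular system; the record fields, the `workloads` registry, and the helper functions are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical shared store: loaded once, read by every workload.
events = [
    {"source": "syslog", "customer": "a", "value": 120},
    {"source": "syslog", "customer": "b", "value": 330},
    {"source": "social", "customer": "a", "value": 95},
    {"source": "syslog", "customer": "a", "value": 180},
]

def count_by(records, key):
    """Relational-style group-by with a count aggregate."""
    counts = defaultdict(int)
    for record in records:
        counts[record[key]] += 1
    return dict(counts)

def mean_value(records, source):
    """Simple statistical summary over a filtered subset."""
    return mean(r["value"] for r in records if r["source"] == source)

# Each workload is a component built on top of the shared store;
# none of them needs its own copy of the data.
workloads = {
    "events_per_customer": lambda recs: count_by(recs, "customer"),
    "syslog_mean_value": lambda recs: mean_value(recs, "syslog"),
}

results = {name: run(events) for name, run in workloads.items()}
print(results)
```

The point of the design is that the data is loaded a single time, while each workload expresses its own processing on top of it. A real system would additionally let each component tune storage layout, scheduling, and resource use for its workload.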