https://education.emc.com/guest/campaign/data_science.aspx,
that displays the page shown as (2) in Figure 1.6. Arriving at this site, the user may
decide to click to learn more about the process of becoming certified in data
science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL:
https://education.emc.com/guest/certification/framework/stf/data_science.aspx,
which is (3) in Figure 1.6.
Visiting these three pages adds three URLs to the log files monitoring the user's
computer or network use. These three URLs are:
https://www.google.com/#q=EMC+data+science
https://education.emc.com/guest/campaign/data_science.aspx
https://education.emc.com/guest/certification/framework/stf/data_science.aspx
This set of three URLs reflects the websites and actions taken to find Data Science
information related to EMC. Together, these URLs form a clickstream that can be
parsed and mined by data scientists to discover usage patterns and uncover
relationships among clicks and areas of interest on a website or group of sites.
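
As a rough illustration of what parsing such a clickstream might involve, the following Python sketch splits the three URLs above into host, path, and search terms, then counts moves between hosts. The function names parse_click and transitions are hypothetical, chosen only for this example, and a real log file would carry timestamps and user identifiers that are not shown here.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# The three URLs from the text, in the order they were visited.
CLICKSTREAM = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]


def parse_click(url):
    """Split one logged URL into fields an analyst might mine (hypothetical helper)."""
    parts = urlsplit(url)
    # In this example the search terms sit in the fragment ("#q=..."),
    # not in the query string, so the fragment is what gets decoded.
    terms = parse_qs(parts.fragment).get("q", [""])[0]
    return {"host": parts.netloc, "path": parts.path, "search_terms": terms}


def transitions(clicks):
    """Count host-to-host moves between consecutive clicks (hypothetical helper)."""
    hosts = [c["host"] for c in clicks]
    return Counter(zip(hosts, hosts[1:]))


clicks = [parse_click(u) for u in CLICKSTREAM]
host_moves = transitions(clicks)

for click in clicks:
    print(click)
print(host_moves)
```

Counting consecutive host pairs is only one simple way to summarize a clickstream; it shows one move from the search engine to the education site and one move within that site.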
The four data types described in this chapter are sometimes generalized into two
groups: structured and unstructured data. Big Data describes new kinds of data
with which most organizations may not be used to working. With this in mind, the
next section discusses common technology architectures from the standpoint of
someone wanting to analyze Big Data.
1.1.2 Analyst Perspective on Data Repositories
The introduction of spreadsheets enabled business users to create simple logic on
data structured in rows and columns and create their own analyses of business
problems. Database administrator training is not required to create spreadsheets:
They can be set up to do many things quickly and independently of information
technology (IT) groups. Spreadsheets are easy to share, and end users have control
over the logic involved. However, their proliferation can result in “many versions
of the truth.” In other words, it can be challenging to determine if a particular user
has the most relevant version of a spreadsheet, with the most current data and
logic in it. Moreover, if a laptop is lost or a file becomes corrupted, the data and
logic within the spreadsheet could be lost. This is an ongoing challenge because
spreadsheet programs such as Microsoft Excel still run on many computers