sort of gap analysis provides a framework for understanding which datasets the
team can take advantage of today and where the team needs to initiate projects for
data collection or access to new datasets currently unavailable. A component of this
subphase involves extracting data from the available sources and determining data
connections for raw data, online transaction processing (OLTP) databases, online
analytical processing (OLAP) cubes, or other data feeds.
An application programming interface (API) is an increasingly popular way to access
a data source [8]. Many websites and social network applications now provide APIs
that offer access to data to support a project or supplement the datasets with which
a team is working. For example, connecting to the Twitter API can enable a team
to download millions of tweets and perform sentiment analysis on a
product, a company, or an idea. Much of the Twitter data is publicly available and
can augment other datasets used on the project.
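
As a minimal sketch of API-based data access, the Python below downloads a JSON payload and filters records that mention a term. The endpoint URL, the `{"data": [...]}` response shape, the `text` field, and the function names are all assumptions made for illustration; they do not describe the Twitter API or any other real service. Real services typically also require authentication and enforce rate limits, which this sketch omits.

```python
import json
from typing import Callable, Dict, List, Optional
from urllib.request import urlopen


def _http_get(url: str) -> bytes:
    """Fetch the raw response body over HTTP(S)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_records(url: str,
                  read: Optional[Callable[[str], bytes]] = None) -> List[Dict]:
    """Return the list of records served by a (hypothetical) JSON API.

    Assumes the response body looks like {"data": [ {...}, {...} ]}.
    `read` can be swapped for a stub so the function runs without a network.
    """
    reader = read or _http_get
    payload = json.loads(reader(url).decode("utf-8"))
    return payload["data"]


def texts_mentioning(records: List[Dict], term: str) -> List[str]:
    """Keep the text of records that mention `term` (case-insensitive)."""
    term = term.lower()
    return [r["text"] for r in records if term in r.get("text", "").lower()]
```

Passing a stub for `read` keeps the parsing and filtering logic testable offline; the resulting texts could then feed a sentiment analysis step or be joined with other project datasets.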
2.3.3 Learning About the Data
A critical aspect of a data science project is to become familiar with the data
itself. Spending time to learn the nuances of the datasets provides context to
understand what constitutes a reasonable value and expected output versus what is
a surprising finding. In addition, it is important to catalog the data sources that the
team has access to and identify additional data sources that the team could leverage
but perhaps cannot access today. Some of the activities in this step may
overlap with the initial investigation of the datasets that occurs in the discovery
phase. This activity accomplishes several goals:
• Clarifies the data that the data science team has access to at the start of the
project
• Highlights gaps by identifying datasets within an organization that the
team may find useful but that may not be accessible to the team today. As a
consequence, this activity can trigger a project to begin building
relationships with the data owners and finding ways to share data in
appropriate ways. In addition, this activity may provide an impetus to
begin collecting new data that benefits the organization or a specific
long-term project.
• Identifies datasets outside the organization that may be useful to obtain,
through open APIs, data sharing, or purchasing data to supplement
already existing datasets
Table 2.1 demonstrates one way to organize this type of data inventory.
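
A data inventory of this kind can also be kept in code. The sketch below is an illustration only and does not reproduce Table 2.1: the four status categories are inferred from the goals listed above, and the class names, function names, and any dataset names supplied to them are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    """Assumed inventory categories, derived from the goals listed above."""
    ACCESSIBLE = "available and accessible"
    NOT_ACCESSIBLE = "available, but not accessible"
    TO_COLLECT = "to collect"
    EXTERNAL = "to obtain from outside the organization"


@dataclass(frozen=True)
class Dataset:
    name: str
    status: Status


def build_inventory(datasets: Iterable[Dataset]) -> Dict[Status, List[str]]:
    """Group dataset names by their access status."""
    inventory = defaultdict(list)
    for ds in datasets:
        inventory[ds.status].append(ds.name)
    return dict(inventory)


def gaps(inventory: Dict[Status, List[str]]) -> Dict[Status, List[str]]:
    """Return every category the team cannot use today."""
    return {status: names for status, names in inventory.items()
            if status is not Status.ACCESSIBLE}
```

Calling `gaps(build_inventory(...))` on the team's catalog lists the datasets that would require negotiating access, starting new collection, or acquiring data from outside the organization.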