purpose. Often, the mindset of the IT group is to provide the minimum amount
of data required to allow the team to achieve its objectives. Conversely, the data
science team wants access to everything. From its perspective, more data is better,
because data science projects are often a mixture of purpose-driven analyses and
experimental approaches to test a variety of ideas. In this context, it can be
challenging for a data science team if it has to request access to each and every
dataset and attribute one at a time. Because of these differing views on data access
and use, it is critical for the data science team to collaborate with IT, make clear
what it is trying to accomplish, and align goals.
During these discussions, the data science team needs to give IT a justification to
develop an analytics sandbox, which is separate from the traditional IT-governed
data warehouses within an organization. Successfully and amicably balancing the
needs of both the data science team and IT requires a positive working relationship
between multiple groups and data owners. The payoff is great: the analytics
sandbox enables organizations to undertake more ambitious data science projects
and to move beyond traditional data analysis and Business Intelligence toward
more robust and advanced predictive analytics.
Expect the sandbox to be large. It may contain raw data, aggregated data, and
other data types that are less commonly used in organizations. Sandbox size can
vary greatly depending on the project. A good rule of thumb is to plan for the
sandbox to be at least 5 to 10 times the size of the original datasets, partly
because the team may create copies of the data that serve as specific tables or
data stores for specific kinds of analysis in the project.
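The sizing rule above can be turned into a quick capacity estimate. The following is a minimal sketch, not a prescribed method: the function name, the dataset names, and the sizes in gigabytes are all hypothetical and chosen only for illustration. Only the 5x and 10x multipliers come from the rule of thumb.

```python
def sandbox_size_range(source_sizes_gb, low_factor=5, high_factor=10):
    """Return a (low, high) sandbox capacity estimate in GB.

    Applies the 5x to 10x rule of thumb to the combined size of the
    original datasets. The factors are parameters so a team can adjust
    them for its own project.
    """
    total_gb = sum(source_sizes_gb)
    return total_gb * low_factor, total_gb * high_factor


# Hypothetical source datasets and sizes (GB), for illustration only.
sources_gb = {"transactions": 200, "web_logs": 120, "crm_extract": 30}

low_gb, high_gb = sandbox_size_range(sources_gb.values())
print(f"Plan for roughly {low_gb} to {high_gb} GB of sandbox storage")
```

With 350 GB of original data, this suggests planning for 1,750 to 3,500 GB. The estimate is a floor for discussion with IT rather than a precise requirement, because the number of working copies depends on the analyses the project ends up running.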
Although the concept of an analytics sandbox is relatively new, companies are
making progress in this area and are finding ways to offer sandboxes and
workspaces where teams can access datasets and work in a way that is acceptable
to both the data science teams and the IT groups.
2.3.2 Performing ETLT
As the team looks to begin data transformations, make sure the analytics sandbox
has ample bandwidth and reliable network connections to the underlying data
sources to enable uninterrupted reads and writes. In ETL, users perform extract,
transform, load processes to extract data from a datastore, perform data
transformations, and load the data back into the datastore. However, the analytics
sandbox approach differs slightly; it advocates extract, load, and then transform
(ELT). In this case, the data is extracted in its raw form and loaded into the datastore, where
analysts can choose to transform the data into a new state or leave it in its original,