These are typical considerations that should be part of the thought process as the
team evaluates the datasets that are obtained for the project. Becoming deeply
knowledgeable about the data will be critical when it comes time to construct and
run models later in the process.
2.3.6 Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
• Hadoop [10] can perform massively parallel ingest and custom analysis
for web traffic parsing, GPS location analytics, genomic analysis, and
combining of massive unstructured data feeds from multiple sources.
• Alpine Miner [11] provides a graphical user interface (GUI) for creating
analytic workflows, including data manipulations and a series of analytic
events such as staged data-mining techniques (for example, first select the
top 100 customers, and then run descriptive statistics and clustering) on
PostgreSQL and other Big Data sources.
• OpenRefine (formerly called Google Refine) [12] is “a free, open source,
powerful tool for working with messy data.” It is a popular GUI-based tool
for performing data transformations, and it is one of the most robust free
tools currently available.
• Similar to OpenRefine, Data Wrangler [13] is an interactive tool for data
cleaning and transformation. Wrangler was developed at Stanford
University and can be used to perform many transformations on a given
dataset. In addition, the data transformations can be exported as Java or
Python code. The advantage of this feature is that a subset of the data can be
manipulated in Wrangler via its GUI, and then the same operations can be
written out as Java or Python code to be executed against the full, larger
dataset offline in a local analytic sandbox.
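The Wrangler workflow described above can be sketched in Python. The field names and cleaning rules below are illustrative assumptions, not actual output of the Wrangler tool; the point is that transformations worked out interactively on a sample become plain code that can later run against the full dataset in the analytic sandbox.

```python
def clean_record(record):
    """Apply the cleaning steps prototyped on a sample of the data.

    Hypothetical transformations: trim whitespace, normalize casing,
    and coerce a numeric column, treating blanks as missing.
    """
    cleaned = {}
    # Trim stray whitespace picked up during ingest.
    cleaned["name"] = record.get("name", "").strip()
    # Normalize inconsistent casing in the state column.
    cleaned["state"] = record.get("state", "").strip().upper()
    # Coerce revenue to a float; an empty string becomes None (missing).
    raw = record.get("revenue", "").strip()
    cleaned["revenue"] = float(raw) if raw else None
    return cleaned

# A small sample stands in for the subset explored in the GUI; the same
# function would be applied row by row to the full, larger dataset offline.
sample = [
    {"name": "  Acme Corp ", "state": "ca", "revenue": "1200.50"},
    {"name": "Globex", "state": " NY", "revenue": ""},
]
cleaned_rows = [clean_record(r) for r in sample]
```

Because the exported code is ordinary Python, it can be scheduled, version-controlled, and rerun as new data arrives, which is harder to do with purely interactive GUI sessions.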
For Phase 2, the team needs assistance from IT, DBAs, or whoever controls the
Enterprise Data Warehouse (EDW) for data sources the data science team would
like to use.