Historical Perspective: Bell Labs
Bell Labs is a research lab dating back to the 1920s that has produced
innovations in physics, computer science, statistics, and math, along
with languages like C++ and many Nobel Prize winners.
There was a very successful and productive statistics group there, and
among its many notable members was John Tukey, a mathematician
who worked on a lot of statistical problems. He is considered the
father of exploratory data analysis (EDA) and R (which started as the
S language at Bell Labs; R
is the open source version), and he was interested in trying to visualize
high-dimensional data.
We think of Bell Labs as one of the places where data science was
“born” because of the collaboration between disciplines, and the
massive amounts of complex data available to people working there. It
was a virtual playground for statisticians and computer scientists,
much like Google is today.
In fact, in 2001, Bill Cleveland wrote “Data Science: An Action Plan
for Expanding the Technical Areas of the Field of Statistics,” which
described multidisciplinary investigation, models, and methods for data
(traditional applied stats), computing with data (hardware, software,
algorithms, coding), pedagogy, tool evaluation (staying on top of
current trends in technology), and theory (the math behind the data).
You can read more about Bell Labs in the book The Idea Factory by
Jon Gertner (Penguin Books).
The basic tools of EDA are plots, graphs, and summary statistics.
Generally speaking, it's a method of systematically going through the
data, plotting distributions of all variables (using box plots),
plotting time series of data, transforming variables, looking at all
pairwise relationships between variables using scatterplot matrices,
and generating summary statistics for all of them. At the very least
that would mean computing their mean, minimum, maximum, the upper and
lower quartiles, and identifying outliers.
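
The minimum set of summary statistics listed above can be sketched in a
few lines of Python. This is only an illustration, not a method taken
from the text: the function name summarize, the sample data, and the
rule for flagging outliers (anything more than 1.5 times the
interquartile range beyond the quartiles) are all assumptions made for
the example.

```python
import statistics


def summarize(values):
    """Return basic EDA summary statistics for a list of numbers."""
    data = sorted(values)
    # With n=4, quantiles() returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: treat points beyond 1.5 * IQR from the quartiles as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "min": data[0],
        "max": data[-1],
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Made-up sample with one suspiciously large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 50]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Other quartile conventions and outlier rules exist and can give slightly
different numbers on small samples, so the cutoff here is best read as
one reasonable default rather than the definition.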
But as much as EDA is a set of tools, it's also a mindset. And that
mindset is about your relationship with the data. You want to
understand the data: gain intuition, understand the shape of it, and
try to connect your understanding of the process that generated the
data to
 