Statistical Inference, Exploratory Data Analysis, and the Data Science Process - Doing Data Science

New kinds of data

Gone are the days when data is just a bunch of numbers and categorical variables. A strong data scientist needs to be versatile and comfortable dealing with a variety of types of data, including:

• Traditional: numerical, categorical, or binary

• Text: emails, tweets, New York Times articles (see Chapter 4 or Chapter 7)

• Records: user-level data, timestamped event data, JSON-formatted log files (see Chapter 6 or Chapter 8)

• Geo-based location data: briefly touched on in this chapter with NYC housing data

• Network (see Chapter 10)

• Sensor data (not covered in this book)

• Images (not covered in this book)

These new kinds of data require us to think more carefully about what sampling means in these contexts.

For example, with the firehose of real-time streaming data, if you analyze a Facebook user-level dataset for a week of activity that you aggregated from timestamped event logs, will any conclusions you draw from this dataset be relevant next week or next year?
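
To make the aggregation step concrete, here is a minimal sketch of rolling timestamped events up to one row per user per week. The event log, user IDs, and the function name `weekly_user_counts` are all hypothetical, made up for illustration; a real log would be far larger and messier.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: (user_id, ISO timestamp, event type).
events = [
    ("u1", "2013-03-04T09:15:00", "like"),
    ("u1", "2013-03-05T18:40:00", "comment"),
    ("u2", "2013-03-06T12:00:00", "like"),
    ("u1", "2013-03-12T08:05:00", "like"),
    ("u3", "2013-03-13T21:30:00", "share"),
]

def weekly_user_counts(events):
    """Count events per (ISO year, ISO week, user_id)."""
    counts = Counter()
    for user_id, timestamp, _event_type in events:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        counts[(iso[0], iso[1], user_id)] += 1
    return counts

counts = weekly_user_counts(events)

# Users seen in each week: the "population" itself shifts between windows.
users_by_week = {}
for (year, week, user_id) in counts:
    users_by_week.setdefault((year, week), set()).add(user_id)
```

Even in this toy log, the set of active users in one week differs from the set in the next, which is exactly why a one-week snapshot may not generalize.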

How do you sample from a network and preserve the complex network structure?
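
As a rough illustration of why this is hard, the sketch below compares two simple ways of sampling nodes from a toy graph: picking nodes uniformly at random, and snowball (breadth-first) sampling outward from a seed node. The graph, the function names, and the choice of these two strategies are assumptions made for the example. Neither is offered as an answer to the question, and snowball sampling brings its own biases.

```python
import random
from collections import deque

# Hypothetical undirected graph as an adjacency dict: a ring of 12 nodes.
graph = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}

def induced_edges(graph, nodes):
    """Edges of the original graph with both endpoints in `nodes`."""
    nodes = set(nodes)
    return {(u, v) for u in nodes for v in graph[u] if v in nodes and u < v}

def random_node_sample(graph, k, rng):
    """Pick k nodes uniformly at random, ignoring structure."""
    return set(rng.sample(sorted(graph), k))

def snowball_sample(graph, seed, k):
    """Breadth-first expansion from `seed` until k nodes are collected."""
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < k:
        for neighbor in sorted(graph[queue.popleft()]):
            if neighbor not in seen and len(seen) < k:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

rng = random.Random(0)  # fixed seed so the run is repeatable
uniform = random_node_sample(graph, 6, rng)
snowball = snowball_sample(graph, 0, 6)

uniform_edges = induced_edges(graph, uniform)
snowball_edges = induced_edges(graph, snowball)
print(len(uniform_edges), len(snowball_edges))
```

On this ring, the snowball sample is a contiguous arc, so it retains every edge between its neighbors, while a uniform sample of the same size can retain no more edges than that and will typically keep fewer. The trade-off is that the snowball sample only ever sees one neighborhood of the graph.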

Many of these questions represent open research questions for the statistical and computer science communities. This is the frontier!

Given that some of these are open research problems, in practice data scientists do the best they can, and often invent novel methods as part of their jobs.
