New kinds of data
Gone are the days when data was just a bunch of numbers and categorical
variables. A strong data scientist needs to be versatile and
comfortable dealing with a variety of types of data, including:
• Traditional: numerical, categorical, or binary
• Text: emails, tweets, New York Times articles (see Chapter 4
or Chapter 7)
• Records: user-level data, timestamped event data, JSON-formatted
log files (see Chapter 6 or Chapter 8)
• Geo-based location data: briefly touched on in this chapter
with NYC housing data
• Network (see Chapter 10)
• Sensor data (not covered in this topic)
• Images (not covered in this topic)
These new kinds of data require us to think more carefully about what
sampling means in these contexts.
For example, with the firehose of real-time streaming data, if you
analyze a Facebook user-level dataset for a week of activity that you
aggregated from timestamped event logs, will any conclusions you draw
from this dataset be relevant next week or next year?
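To make the question concrete, here is a minimal sketch of rolling timestamped events up to user-level counts for one week and then for the following week. The event log, user IDs, dates, and the `weekly_user_counts` helper are all hypothetical and invented for illustration; they are not taken from any real dataset or API.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical timestamped event log: (user_id, ISO timestamp, event_type)
events = [
    ("u1", "2024-01-01T09:00:00", "like"),
    ("u1", "2024-01-03T12:30:00", "post"),
    ("u2", "2024-01-02T18:45:00", "like"),
    ("u2", "2024-01-09T08:15:00", "like"),
    ("u3", "2024-01-10T20:00:00", "post"),
]

def weekly_user_counts(events, week_start):
    """Count events per user in the 7-day window starting at week_start."""
    week_end = week_start + timedelta(days=7)
    counts = Counter()
    for user_id, ts, _event_type in events:
        # Keep only events that fall inside the half-open window [start, end)
        if week_start <= datetime.fromisoformat(ts) < week_end:
            counts[user_id] += 1
    return dict(counts)

week1 = weekly_user_counts(events, datetime(2024, 1, 1))
week2 = weekly_user_counts(events, datetime(2024, 1, 8))
print(week1)
print(week2)
```

In this toy log the set of active users differs between the two windows, so a conclusion drawn from the first week alone would already miss a user who only shows up in the second.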
How do you sample from a network and preserve the complex network
structure?
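One way to see why this is hard is to sample nodes uniformly at random and keep only the edges among them (an induced subgraph). The sketch below uses a small hypothetical network with one hub; the edge list and the `induced_subgraph` helper are assumptions made up for illustration, and uniform node sampling is just one simple strategy, not a recommended method.

```python
import random

# Hypothetical undirected network: hub "a" touches every node, plus one extra edge
edges = {("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")}
nodes = sorted({n for edge in edges for n in edge})

def induced_subgraph(edges, kept_nodes):
    """Keep only the edges whose two endpoints were both sampled."""
    kept = set(kept_nodes)
    return {(u, v) for u, v in edges if u in kept and v in kept}

# Uniform random sample of 3 of the 5 nodes (seeded for repeatability)
rng = random.Random(0)
sampled_nodes = rng.sample(nodes, 3)
sampled_edges = induced_subgraph(edges, sampled_nodes)
print(sampled_nodes, sampled_edges)

# A sample that happens to miss the hub loses most of the structure
without_hub = induced_subgraph(edges, ["b", "c", "d"])
print(without_hub)
```

Here a sample that excludes the hub keeps a single edge out of five, even though it keeps three of the five nodes, which is the kind of distortion the question above is pointing at.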
Many of these are open research questions for the statistical and
computer science communities. This is the frontier! In practice, data
scientists do the best they can, and often invent novel methods as
part of their jobs.