In-Sample, Out-of-Sample, and Causality
We need to establish a strict concept of in-sample and out-of-sample
data. Note that out-of-sample data is not meant as testing data; all
testing happens inside the in-sample data. Rather, out-of-sample data is
the data you use after finalizing your model so that you have some
idea how the model will perform in production.
We should even restrict the number of times one does out-of-sample
analysis on a given dataset because, like it or not, we learn stuff about
that data every time, and we will subconsciously overfit to it even in
different contexts, with different models.
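
One way to make that restriction concrete is to put the out-of-sample data behind a small guard that counts evaluations. The sketch below is only an illustration: the class name HoldoutBudget, the max_uses parameter, and the log attribute are all hypothetical, and the right budget is a judgment call.

```python
class HoldoutBudget:
    """Hypothetical guard that limits how often out-of-sample data is scored."""

    def __init__(self, holdout, max_uses=3):
        self._holdout = holdout
        self._remaining = max_uses
        self.log = []  # one description per evaluation, kept for auditing

    def evaluate(self, score_fn, description):
        """Score a finalized model on the holdout, spending one use."""
        if self._remaining <= 0:
            raise RuntimeError("out-of-sample budget exhausted")
        self._remaining -= 1
        self.log.append(description)
        return score_fn(self._holdout)
```

Every call spends one use whether or not the result is pleasing, and the log records how many looks the data has already had.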
Next, we need to be careful to always perform causal modeling (note
this differs from what statisticians mean by causality). Namely, never
use information in the future to predict something now. Or, put
differently, we only use information from the past up to and including
the present moment to predict the future. This is incredibly important in financial
modeling. Note it's not enough to use data about the present if it isn't
actually available and accessible at the present moment. So this means
we have to be very careful with timestamps of availability as well as
timestamps of reference. This is huge when we're talking about lagged
government data.
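
The distinction between the two timestamps can be sketched as a point-in-time filter. This is a minimal illustration, assuming each observation carries both dates; the Observation type, the known_as_of function, the series name, and the values are all made up for the example.

```python
from datetime import date
from typing import NamedTuple


class Observation(NamedTuple):
    series: str
    reference_date: date   # the period the number describes
    available_date: date   # when the number was actually published
    value: float


def known_as_of(observations, as_of):
    """Keep only observations that had been published by `as_of`."""
    return [o for o in observations if o.available_date <= as_of]


# Hypothetical lagged series: each monthly figure appears about six weeks
# after the month it describes. All values are invented.
observations = [
    Observation("hypothetical_index", date(2020, 1, 31), date(2020, 3, 13), 101.2),
    Observation("hypothetical_index", date(2020, 2, 29), date(2020, 4, 10), 100.7),
    Observation("hypothetical_index", date(2020, 3, 31), date(2020, 5, 15), 98.4),
]

# A model making a prediction on April 15 may only see the first two rows.
usable = known_as_of(observations, date(2020, 4, 15))
```

The March figure describes a period before April 15, but it is excluded because it had not been published yet. Filtering on reference_date instead would silently leak it into the model.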
Similarly, when we have a set of training data, we don't know the “best-
fit coefficients” for that training data until after the last timestamp on
all the data. As we move forward in time from the first timestamp to
the last, we expect to get different sets of coefficients as more events
happen.
One consequence of this is that, instead of getting one set of “best-fit”
coefficients, we actually get an evolution of each coefficient. This is
helpful because it gives us a sense of how stable those coefficients are.
In particular, if one coefficient has changed sign 10 times over the
training set, then a good estimate for it might well be zero,
not the so-called “best fit” at the end of the data. Of course, depending
on the variable, we might think of a legitimate reason for it to actually
change sign over time.
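
The evolution of a coefficient can be sketched by refitting on an expanding window and counting sign changes. This is a minimal illustration, assuming a single predictor and ordinary least squares; the function names, the min_obs parameter, and the simulated data are assumptions for the example, not a prescribed method.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x, with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def coefficient_path(xs, ys, min_obs=20):
    """Slope refit at each step t, using only the first t observations."""
    return [ols_slope(xs[:t], ys[:t]) for t in range(min_obs, len(xs) + 1)]


def count_sign_changes(path):
    """Number of times consecutive nonzero estimates differ in sign."""
    signs = [1 if b > 0 else -1 for b in path if b != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Simulated data: one response truly depends on x, the other is pure noise.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(300)]
stable_ys = [2.0 * x + random.gauss(0, 1) for x in xs]
noise_ys = [random.gauss(0, 1) for _ in xs]

stable_path = coefficient_path(xs, stable_ys)
noise_path = coefficient_path(xs, noise_ys)
print(count_sign_changes(stable_path), count_sign_changes(noise_path))
```

Each entry in a path uses only data up to that point, so the path itself is causal. A coefficient whose path keeps crossing zero is a candidate for shrinking toward zero, unless there is a substantive reason for the sign to change.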
The in-sample data should, generally speaking, come before the out-
of-sample data to avoid causality problems, as shown in Figure 6-7.
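
A chronological split along these lines might look like the sketch below. It assumes records are (timestamp, value) pairs and that a single cutoff date separates the two sets; the function name and the sample records are hypothetical.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Split (timestamp, value) records at `cutoff`.

    Everything strictly before the cutoff is in-sample; everything at or
    after it is out-of-sample. No shuffling is done.
    """
    ordered = sorted(records, key=lambda r: r[0])
    in_sample = [r for r in ordered if r[0] < cutoff]
    out_of_sample = [r for r in ordered if r[0] >= cutoff]
    return in_sample, out_of_sample


# Hypothetical records, deliberately out of order.
records = [
    (date(2021, 3, 1), 0.4),
    (date(2021, 1, 1), 0.1),
    (date(2021, 4, 1), 0.3),
    (date(2021, 2, 1), 0.2),
]
in_sample, out_of_sample = chronological_split(records, date(2021, 3, 1))
```

A random shuffle before splitting would mix later observations into the in-sample set, which is exactly the causality problem the ordering is meant to avoid.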