Time Stamps and Financial Modeling - Doing Data Science

In-Sample, Out-of-Sample, and Causality

We need to establish a strict concept of in-sample and out-of-sample

data. Note that out-of-sample data is not meant as testing data; all testing

happens inside the in-sample data. Rather, out-of-sample data is meant

to be the data you use after finalizing your model so that you have some

idea how the model will perform in production.

We should even restrict the number of times one does out-of-sample

analysis on a given dataset because, like it or not, we learn stuff about

that data every time, and we will subconsciously overfit to it even in

different contexts, with different models.

Next, we need to be careful to always perform causal modeling (note

this differs from what statisticians mean by causality). Namely, never

use information in the future to predict something now. Or, put differently,

we only use information from the past up to and including the present

moment to predict the future. This is incredibly important in financial

modeling. Note it's not enough to use data about the present if it isn't

actually available and accessible at the present moment. So this means

we have to be very careful with timestamps of availability as well as

timestamps of reference. This matters most for lagged government data,

which is published well after the period it describes.
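
The distinction between timestamps of reference and timestamps of availability can be sketched in code. The following is a minimal illustration, not a method from the text: the `Observation` class, the helper functions, and the monthly statistic with its roughly five-week publication lag are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    reference_date: date   # the period the value describes
    available_date: date   # when the value was actually published
    value: float


def visible_as_of(observations, as_of):
    """Return only the observations a model could legitimately see on `as_of`."""
    return [o for o in observations if o.available_date <= as_of]


def latest_known(observations, as_of):
    """Most recent reference period whose value was already published on `as_of`."""
    visible = visible_as_of(observations, as_of)
    if not visible:
        return None
    return max(visible, key=lambda o: o.reference_date)


# Hypothetical monthly statistic, published about five weeks after month end.
observations = [
    Observation(date(2023, 1, 31), date(2023, 3, 7), 1.2),
    Observation(date(2023, 2, 28), date(2023, 4, 4), 1.5),
    Observation(date(2023, 3, 31), date(2023, 5, 9), 0.9),
]

# On April 15 the March figure exists "in reality" but has not been published,
# so a causal model may only use the February figure.
known = latest_known(observations, date(2023, 4, 15))
```

Filtering on `reference_date` instead of `available_date` would let the March value leak into a prediction made in mid-April, which is exactly the kind of look-ahead the passage warns against.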

Similarly, when we have a set of training data, we don't know the

“best-fit coefficients” for that training data until after the last timestamp on

all the data. As we move forward in time from the first timestamp to

the last, we expect to get different sets of coefficients as more events

happen.

One consequence of this is that, instead of getting one set of “best-fit”

coefficients, we actually get an evolution of each coefficient. This is

helpful because it gives us a sense of how stable those coefficients are.

In particular, if one coefficient has changed sign 10 times over the

training set, then we might well conclude that a good estimate for it is zero,

not the so-called “best fit” at the end of the data. Of course, depending

on the variable, we might think of a legitimate reason for it to actually

change sign over time.
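
One way to look at the evolution of a coefficient is to refit the model at each point in time using only the data seen so far. The sketch below assumes a one-variable least-squares regression and an expanding window; the function names, the `min_obs` parameter, and the simulated noise data are illustrative assumptions rather than anything prescribed in the text.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x, fitted with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def coefficient_path(xs, ys, min_obs=10):
    """Slope estimated at each time t using only the first t observations."""
    return [ols_slope(xs[:t], ys[:t]) for t in range(min_obs, len(xs) + 1)]


def count_sign_changes(path):
    """Number of times the estimate flips sign along the path (zeros skipped)."""
    signs = [1 if c > 0 else -1 for c in path if c != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Simulated data in which y is unrelated to x, so the true slope is zero.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [rng.gauss(0, 1) for _ in range(200)]

path = coefficient_path(xs, ys)   # one estimate per time step
flips = count_sign_changes(path)  # many flips suggest an unstable coefficient
```

The last element of `path` is the usual full-sample "best fit"; the earlier elements show what we would have believed at each earlier moment, which is the information a single final fit hides.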

The in-sample data should, generally speaking, come before the

out-of-sample data to avoid causality problems, as shown in Figure 6-7.
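
A chronological split of this kind might look like the following. This is a minimal sketch under stated assumptions: rows are hypothetical `(timestamp, value)` pairs, and `chronological_split` and the cutoff date are names chosen for illustration.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Split (timestamp, value) rows so every in-sample row precedes the cutoff."""
    ordered = sorted(rows, key=lambda r: r[0])
    in_sample = [r for r in ordered if r[0] < cutoff]
    out_of_sample = [r for r in ordered if r[0] >= cutoff]
    return in_sample, out_of_sample


# Hypothetical observations, deliberately given out of order.
rows = [
    (date(2022, 3, 1), 0.4),
    (date(2021, 6, 1), -0.1),
    (date(2023, 1, 1), 0.7),
    (date(2022, 9, 1), 0.2),
]

in_sample, out_of_sample = chronological_split(rows, date(2022, 6, 1))
```

Unlike a random shuffle-and-split, this keeps every out-of-sample row strictly later than every in-sample row, so nothing learned in-sample depends on the future.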
