to 14); however, it generalizes very poorly on the testing set (periods 15+). The optimal complexity constant of 0.012154154, identified by the windowed cross-validation procedure described in the previous paragraphs, produces a forecast that reflects the level of pattern learning that generalizes best.
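As a rough illustration, the selection of the complexity constant can be sketched as a grid search over candidate values, each evaluated on held-out windowed examples. The Python sketch below is a minimal approximation only: it assumes scikit-learn's SVR, and the candidate grid, split point, and function names are illustrative rather than the exact procedure described above.

import numpy as np
from sklearn.svm import SVR

def windowed_examples(series, window):
    # Slide a fixed-size window over the series; each window of past
    # values predicts the value that immediately follows it.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

def select_complexity_constant(series, window, candidates):
    # Evaluate each candidate complexity constant C on the held-out
    # tail of the windowed examples and keep the best performer.
    X, y = windowed_examples(series, window)
    split = int(0.8 * len(X))          # assumed train/validation split
    best_c, best_err = None, float("inf")
    for c in candidates:
        model = SVR(C=c).fit(X[:split], y[:split])
        err = np.mean((model.predict(X[split:]) - y[split:]) ** 2)
        if err < best_err:
            best_c, best_err = c, err
    return best_c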
Super Wide Model
As indicated earlier by the observations-to-weights ratio, the available time series are very short, so there are not many examples for learning complex patterns. The separation of the data set into training, cross-validation and testing sets, together with the loss of periods due to windowing, further reduces the set of usable observations. Based on the assumption that several products of the same manufacturer probably have similar demand patterns, we introduced what we call a Super Wide Model. This method takes a wide selection of time series from the same problem domain and combines them into one large model, which effectively increases the number of training examples. This larger number of training examples permits an increase in input dimensionality (e.g., a larger window size) and in model complexity.
For example, in this experiment we consider 100 time series from each of the sources. With the Super Wide Model, we use the data from all 100 time series simultaneously to train the model. This provides a large number of training examples and permits us to greatly increase the window size, so that the models can look deep into the past data. Additionally, the approach could be used to look across other information sources that may be correlated with demand, such as category averages or the demand for complementary or substitute products.
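A minimal sketch of the pooling idea, in Python, assuming each product's demand history is available as a one-dimensional array (names are illustrative):

import numpy as np

def super_wide_training_set(all_series, window):
    # Pool sliding-window examples from every series into one data set:
    # each row of X is one window of past demand from some product, and
    # the corresponding y is the demand in the period that follows it.
    X_rows, y_rows = [], []
    for series in all_series:
        for i in range(len(series) - window):
            X_rows.append(series[i:i + window])
            y_rows.append(series[i + window])
    return np.array(X_rows), np.array(y_rows)

With 100 series of 38 training periods and a window of 19, this yields a single matrix of 1,900 examples rather than 100 separate sets of 19.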
For example, the chocolate factory data set contains 100 products and 47 periods of time series data. Once the training and testing sets are separated, 38 periods of data remain. For this type of model, we choose a window size of 50%, which balances modeling the demand behavior as a function of the past 50% of the data against using 50% of the data as examples. Using this large window size of 50% with a traditional time series model would provide a training set of only 19 examples for a window size of 19, which is not much data for identifying patterns that may recur in the future. With the Super Wide Model, however, we have 1,900 examples for a window size of 19, which is sufficient data to find the best forecasting patterns for the problem domain.
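The counts quoted above follow directly from the windowing arithmetic; a small check, in Python:

train_periods = 38                     # periods left after removing the test set
window = train_periods // 2            # 50% window size = 19
per_product = train_periods - window   # 19 examples per product
products = 100
print(per_product)                     # 19 with a traditional single-series model
print(per_product * products)          # 1900 when all products are pooled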
All of the models that learn from past demand, such as multiple linear regression, neural networks and support vector machines, will also be tested on the Super Wide Models. The only exception is the recurrent neural networks, because the necessary tools are not yet available. Although training a recurrent neural network on a Super Wide Model is feasible in principle, it would require resetting the recurrent connections for every product, because time-lagged signals between products would not make sense.
The neural network models were enlarged to 10 hidden-layer neurons, which, in combination with the very large window, results in large network sizes relative to the patterns to be detected. With a window size of 50% of the training data, we have a ratio of 1 input to 1 observation. We then multiply the number of observations by the 100 products (because of the Super Wide Model format) to calculate the observations-to-weights ratio for the chocolate manufacturer data set. The total number of weights can be calculated as:
Total Weights = p ⋅ w ⋅ h + b_h ⋅ h + h ⋅ o + b_o

where p is the number of input series presented per training example, w is the window size, h is the number of hidden neurons, o is the number of output neurons, and b_h and b_o denote the bias weights of the hidden and output layers.
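As a worked check of this formula in Python: with the bias terms taken as one weight per hidden and per output neuron, and under the reading that each pooled training example presents a single product's window (p = 1), the section's numbers give roughly a 9-to-1 observations-to-weights ratio. A p = 100 reading would instead correspond to feeding all product windows at once; the text does not fully pin this down, so the choice of p below is an assumption.

def total_weights(p, w, h, o):
    # p*w*h input-to-hidden weights, h hidden biases (b_h * h),
    # h*o hidden-to-output weights, and o output biases (b_o).
    return p * w * h + h + h * o + o

observations = 1900                             # pooled examples from this section
weights = total_weights(p=1, w=19, h=10, o=1)   # 211 weights
print(weights, observations / weights)          # ratio of roughly 9 to 1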