The input gate activation $i(t)$ at time $t$ is computed by applying the (non-linear)
input gate activation function $g_{ig}(\cdot)$ to its inputs:

$$i(t) = g_{ig}\big(W_{ix}\,x(t) + W_{ih}\,h(t-1) + W_{ic}\,c(t-1) + b_i\big), \tag{19.16}$$

where $W_{ix}$, $W_{ih}$, and $W_{ic}$ are the weight matrices which project the input
$x(t)$, all (hidden) memory block outputs $h(t-1)$, and the internal cell states $c(t-1)$
from the previous time step, respectively, to the input gate; $b_i$ denotes the input
gate bias. Usually, the input gate activation function $g_{ig}$ is chosen to be the sigmoid
function (19.4). The activation $i(t)$ of the input gate multiplies the input to all cells
in the memory block, and thus determines which activity patterns are stored (added)
into it. During training, the input gate learns to open ($i(t) \approx 1$) so as to store
relevant inputs in the memory block, and to close ($i(t) \approx 0$) so as to shield it from
irrelevant ones.
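The gate computation in (19.16) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `sigmoid`, `matvec`, and `input_gate`, the toy dimensions (two inputs, one memory cell), and all numeric values are assumptions made for the example, and the sigmoid is used for $g_{ig}$ as the text suggests.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the usual choice for g_ig."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def input_gate(x, h_prev, c_prev, W_ix, W_ih, W_ic, b_i):
    """Eq. (19.16): i(t) = g_ig(W_ix x(t) + W_ih h(t-1) + W_ic c(t-1) + b_i)."""
    from_x = matvec(W_ix, x)        # contribution of the current input
    from_h = matvec(W_ih, h_prev)   # contribution of previous block outputs
    from_c = matvec(W_ic, c_prev)   # contribution of previous cell states
    return [sigmoid(a + b + c + d)
            for a, b, c, d in zip(from_x, from_h, from_c, b_i)]

# Toy setup (hypothetical values): zero weights, so only the bias matters.
x, h_prev, c_prev = [1.0, -1.0], [0.0], [0.0]
W_ix, W_ih, W_ic = [[0.0, 0.0]], [[0.0]], [[0.0]]

gate_neutral = input_gate(x, h_prev, c_prev, W_ix, W_ih, W_ic, [0.0])
gate_open = input_gate(x, h_prev, c_prev, W_ix, W_ih, W_ic, [10.0])
gate_closed = input_gate(x, h_prev, c_prev, W_ix, W_ih, W_ic, [-10.0])
```

With all weights at zero the pre-activation equals the bias, so a large positive bias drives the gate towards 1 (open) and a large negative bias towards 0 (closed). In a trained network the weights, not only the bias, determine when the gate opens.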
Similarly, the activations of the forget gates $f(t)$ can be calculated as

$$f(t) = g_{fg}\big(W_{fx}\,x(t) + W_{fh}\,h(t-1) + W_{fc}\,c(t-1) + b_f\big), \tag{19.17}$$

where $g_{fg}$ is commonly chosen to be the sigmoid function, so that $f(t)$ lies between 0 and 1.
To determine the current state of a cell $c(t)$, we scale the previous state $c(t-1)$
by the activation of the forget gate $f(t)$ and the cell input activations $g_{ci}$ by the
activation of the input gate $i(t)$:

$$c(t) = f(t) \odot c(t-1) + i(t) \odot g_{ci}\big(W_{cx}\,x(t) + W_{ch}\,h(t-1) + b_c\big), \tag{19.18}$$

where $\odot$ denotes element-wise multiplication and $g_{ci}$ is a logistic sigmoid function
with range $[0, 1]$. At $t = 0$, the cell state of a memory cell is initialized to zero,
i.e. $c(0) = 0$. Subsequently, the cell accumulates a sum, discounted by the forget gate,
over its input. Hence, activity circulates in the cell $c(t)$ as long as the forget gate
remains open ($f(t) \approx 1$). Just as the input gate learns what to store in the memory
block, the forget gate learns for how long to retain the information and, once it is
outdated, to erase it by resetting the cell state to zero. This prevents the cell state
from growing without bound and enables the memory block to store new data without undue
interference from prior operations (Gers et al. 2002).
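The accumulate-then-reset behaviour of the cell state can be traced with a small scalar example. The sketch below applies the update rule (19.18) to a single cell with hand-picked gate values instead of learned ones. The function names `update_cell` and `run_cell` and all numbers are illustrative assumptions, and the cell input activation is passed in directly rather than computed from weights.

```python
def update_cell(c_prev, f, i, g):
    """Eq. (19.18) for one cell: c(t) = f(t) * c(t-1) + i(t) * g(t)."""
    return f * c_prev + i * g

def run_cell(cell_inputs, forget_gates, input_gates):
    """Iterate the update over a sequence, starting from c(0) = 0."""
    c = 0.0
    trace = []
    for g, f, i in zip(cell_inputs, forget_gates, input_gates):
        c = update_cell(c, f, i, g)
        trace.append(c)
    return trace

# Steps 1-3: both gates open (f = 1, i = 1), so the input 0.5 is summed up.
# Step 4: both gates closed (f = 0, i = 0), so the state is erased.
trace = run_cell(
    cell_inputs=[0.5, 0.5, 0.5, 0.5],
    forget_gates=[1.0, 1.0, 1.0, 0.0],
    input_gates=[1.0, 1.0, 1.0, 0.0],
)
print(trace)  # [0.5, 1.0, 1.5, 0.0]
```

While the forget gate stays at 1 the state grows by the gated input at every step; a single step with the forget gate at 0 discards everything accumulated so far.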
The computation of the output gate activations $o(t)$ follows the same principle
as the calculation of the input gate activation. However, in this case the current cell
states $c(t)$ are considered, rather than the states from the previous time step:

$$o(t) = g_{og}\big(W_{ox}\,x(t) + W_{oh}\,h(t-1) + W_{oc}\,c(t) + b_o\big). \tag{19.19}$$

Here, $g_{og}$ denotes the output gate activation function, which is typically chosen to
be the sigmoid function, as for the input gate.
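The order of evaluation implied by (19.16) to (19.19) can be made explicit in code: the input and forget gates read $c(t-1)$, the cell state is then updated, and only afterwards does the output gate read the new $c(t)$. The following is a sketch for a single cell with scalar input, assuming sigmoid activations for all gates and for $g_{ci}$ as stated in the text. The function name `gates_and_state` and the parameter dictionary `p` with its keys are hypothetical names chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gates_and_state(x, h_prev, c_prev, p):
    """Scalar, single-cell version of Eqs. (19.16)-(19.19).

    p is a dict of scalar weights and biases (illustrative names).
    """
    # Input and forget gates see the PREVIOUS cell state c(t-1).
    i = sigmoid(p["W_ix"] * x + p["W_ih"] * h_prev + p["W_ic"] * c_prev + p["b_i"])
    f = sigmoid(p["W_fx"] * x + p["W_fh"] * h_prev + p["W_fc"] * c_prev + p["b_f"])
    # Cell input activation and state update, Eq. (19.18).
    g = sigmoid(p["W_cx"] * x + p["W_ch"] * h_prev + p["b_c"])
    c = f * c_prev + i * g
    # Output gate sees the CURRENT cell state c(t), Eq. (19.19).
    o = sigmoid(p["W_ox"] * x + p["W_oh"] * h_prev + p["W_oc"] * c + p["b_o"])
    return i, f, c, o

# All parameters zero except the output gate's cell-state weight W_oc.
names = ["W_ix", "W_ih", "W_ic", "b_i", "W_fx", "W_fh", "W_fc", "b_f",
         "W_cx", "W_ch", "b_c", "W_ox", "W_oh", "W_oc", "b_o"]
params = {name: 0.0 for name in names}
params["W_oc"] = 1.0

i, f, c, o = gates_and_state(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
```

In this toy setting $i(t) = f(t) = 0.5$ and the cell input activation is 0.5, so $c(t) = 0.25$. Because the output gate reads $c(t)$, its activation is the sigmoid of 0.25 rather than the value 0.5 it would take if it read $c(t-1) = 0$.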