Modern RNN architectures
Recall that the key quantity in computing RNN gradients is \(\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mathsf{H}}_t}\), which contains terms \(\left(\boldsymbol{\mathsf{W}}^\top\right)^{\kappa}\) for paths of length \(\kappa\) from the current time step \(t,\) where \(\boldsymbol{\mathsf{W}}\) is the matrix that transforms the latent state vector. This term explodes or vanishes as \(\kappa\) grows, depending on whether the largest eigenvalue of \(\boldsymbol{\mathsf{W}}\) has magnitude greater than or less than 1. Hence, RNNs are said to have trouble modeling long-term dependencies between tokens. In practice, this limits the effective context size of RNNs.
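To see where this power of \(\boldsymbol{\mathsf{W}}\) comes from, here is a minimal sketch assuming the vanilla update \(\boldsymbol{\mathsf{H}}_t = \phi\!\left(\boldsymbol{\mathsf{X}}_t \boldsymbol{\mathsf{U}} + \boldsymbol{\mathsf{H}}_{t-1} \boldsymbol{\mathsf{W}}\right)\), where \(\boldsymbol{\mathsf{U}}\) is a placeholder name for the input-to-hidden matrix and the diagonal factors coming from the nonlinearity \(\phi\) are dropped for clarity:

\[
\frac{\partial \boldsymbol{\mathsf{H}}_t}{\partial \boldsymbol{\mathsf{H}}_{t-\kappa}}
= \prod_{j=0}^{\kappa-1} \frac{\partial \boldsymbol{\mathsf{H}}_{t-j}}{\partial \boldsymbol{\mathsf{H}}_{t-j-1}}
\approx \left(\boldsymbol{\mathsf{W}}^\top\right)^{\kappa}.
\]

By the chain rule, any gradient signal flowing from time step \(t\) back to \(\boldsymbol{\mathsf{H}}_{t-\kappa}\) is scaled by this factor, hence by powers of \(\boldsymbol{\mathsf{W}}^\top\).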
Exploding gradients can be mitigated in practice by gradient clipping or truncated BPTT. Vanishing gradients, on the other hand, require nontrivial architectural changes. We will consider the LSTM [HS97b] and GRU [CvMG+14], as well as deep and bidirectional RNNs [SP97], which increase model complexity. In the next chapter, we will apply these architectures to sequence-to-sequence tasks.
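As a quick illustration of gradient clipping, here is a minimal PyTorch training step; the model, loss, and hyperparameters are placeholders chosen only for the sketch:

```python
import torch

# Placeholder model and optimizer; any recurrent model with parameters works here.
model = torch.nn.RNN(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 20, 32)            # (batch, time, features)
output, h_n = model(x)
loss = output.pow(2).mean()           # dummy loss just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale gradients so that their global norm is at most 1.0, preventing explosion.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```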
Code. In terms of code, we implement the following classes in order: RNN, LSTM, and GRU, as well as the wrappers Deep and Bidirectional that augment units into the respective architectures while keeping the same API. Note that we can also swap in PyTorch implementations, e.g.
Deep(Bidirectional(nn.LSTM))(EMBED_DIM, HIDDEN_DIM, num_layers=3, batch_first=True)
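To give a feel for the wrapper idea, here is a hypothetical sketch of a Deep wrapper (not the chapter's actual implementation); it assumes the wrapped unit follows the nn.RNN-style (input, state) -> (output, state) API and only covers the unidirectional case:

```python
import torch
import torch.nn as nn

def Deep(base):
    """Hypothetical sketch: stack `num_layers` independent single-layer `base`
    units, feeding each layer's output sequence into the next layer."""
    class DeepRNN(nn.Module):
        def __init__(self, input_dim, hidden_dim, num_layers=1, batch_first=True):
            super().__init__()
            dims = [input_dim] + [hidden_dim] * num_layers
            self.layers = nn.ModuleList(
                base(dims[i], dims[i + 1], batch_first=batch_first)
                for i in range(num_layers)
            )

        def forward(self, x, state=None):
            for layer in self.layers:
                x, state = layer(x)   # state of the last layer is returned
            return x, state

    return DeepRNN

# Usage with assumed dimensions: behaves like a 2-layer GRU over (batch, time, features).
net = Deep(nn.GRU)(32, 64, num_layers=2, batch_first=True)
out, h = net(torch.randn(8, 20, 32))
print(out.shape)   # torch.Size([8, 20, 64])
```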
Finally, all implementations are checked for correctness by comparing their outputs with PyTorch's. 🔥
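One way such a check can look, as a sketch that recomputes nn.RNN's forward pass by hand from its own weights (the actual tests in the chapter may differ):

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=4, hidden_size=5, batch_first=True)
x = torch.randn(2, 3, 4)                       # (batch, time, features)
out, h_n = rnn(x)

# Manual recurrence using nn.RNN's parameters:
# h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0
h = torch.zeros(2, 5)
outs = []
for t in range(x.shape[1]):
    h = torch.tanh(x[:, t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    outs.append(h)
manual_out = torch.stack(outs, dim=1)

print(torch.allclose(out, manual_out, atol=1e-6))   # True
```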