Recurrent Neural Networks

In this chapter, we deal with variable-length sequence data, which is fundamentally different from the fixed-shape data we have encountered so far. Variable-length data is abundant in the real world: tasks such as translating passages of text from one language to another, engaging in dialogue, or controlling a robot demand that models both ingest and output sequentially structured data. Here we focus on text, our primary interest. In particular, we sample character sequences as training data from a dataset of Spanish names; to increase complexity, we then continue with The Time Machine (1895) by H. G. Wells.
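To make the data preparation concrete, here is a minimal sketch of drawing character sequences for next-character prediction from a plain-text corpus. The file name, sequence length, and helper names are hypothetical; the loaders used later in the chapter may differ.

```python
import random

def load_corpus(path):
    """Read a plain-text file and return it as a list of characters."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    return list(text)

def sample_sequences(corpus, seq_len, num_samples):
    """Draw random character subsequences of length seq_len + 1.

    Each sample yields an input sequence and a target sequence shifted by
    one position, so a model trained on them learns next-character prediction.
    """
    samples = []
    for _ in range(num_samples):
        start = random.randint(0, len(corpus) - seq_len - 1)
        chunk = corpus[start : start + seq_len + 1]
        samples.append((chunk[:-1], chunk[1:]))
    return samples

# Hypothetical usage; 'time_machine.txt' stands in for the corpus used in this chapter.
# corpus = load_corpus("time_machine.txt")
# pairs = sample_sequences(corpus, seq_len=32, num_samples=4)
```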

For sequence modeling, we explore two approaches. In the previous chapter, we used fixed-length windows, or contexts, to predict the probability distribution of the next token; this allowed us to use familiar models such as MLPs and CNNs. Here we consider processing sequences of arbitrary length. In particular, we introduce Recurrent Neural Networks (RNNs), neural networks that capture the dynamics of sequences via recurrent connections (Fig. 55), which can be thought of as cycles that iteratively update a latent state vector. The resulting hidden representation depends on the specific order of the inputs; hence, RNNs inherit causality from the structure of the text.
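As a rough preview of this recurrent update, the sketch below (using NumPy, with hypothetical weight names and sizes) applies the same transformation once per input token, so that each hidden state depends on the current input and on the previous hidden state. The exact parameterization used later in the chapter may differ.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h, h0):
    """Run a vanilla RNN over a sequence of input vectors.

    inputs: list of arrays of shape (num_inputs,)
    h0:     initial hidden state of shape (num_hiddens,)
    The hidden state at step t depends on the input at step t and the hidden
    state at step t - 1, which is what the recurrent connection expresses.
    """
    h = h0
    states = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # recurrent update of the latent state
        states.append(h)
    return states

# Hypothetical sizes: 28 input features (e.g., one-hot characters), 32 hidden units.
num_inputs, num_hiddens = 28, 32
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(num_hiddens, num_inputs))
W_hh = rng.normal(scale=0.01, size=(num_hiddens, num_hiddens))
b_h = np.zeros(num_hiddens)
xs = [rng.normal(size=num_inputs) for _ in range(5)]
hidden_states = rnn_forward(xs, W_xh, W_hh, b_h, np.zeros(num_hiddens))
```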

To understand the challenges of training RNNs, we derive the backpropagation through time (BPTT) equations. We will see that RNNs accumulate gradients over a depth that corresponds to the number of time steps, rather than the number of layers as in MLPs[1]. In particular, RNNs struggle to model long-term dependencies (i.e., tokens that are spaced far apart but share a significant relationship), a difficulty that manifests as vanishing gradients. This has motivated the development of more advanced RNN architectures (e.g., LSTM [HS97b] and GRU [CvMG+14]) that aim to mitigate vanishing gradients.
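To preview why distant time steps are hard to connect, consider the chain rule applied across time (the notation here is a generic sketch with hidden state $h_t$ and loss $L$; the full derivation appears later in the chapter):

$$
\frac{\partial L}{\partial h_t}
  = \frac{\partial L}{\partial h_T}\,
    \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}.
$$

When the Jacobians $\partial h_k / \partial h_{k-1}$ have norms smaller than one, the product shrinks exponentially in the gap $T - t$ (vanishing gradients); when they are larger than one, it grows exponentially (exploding gradients).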

References and readings