Getting sequence data

In this section, we outline the process of extracting sequence data from text. In particular, we will extract character sequences from H. G. Wells’ The Time Machine, a book containing just over 30,000 words. While real applications will typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline.

Processing the sample text

!mkdir ./data
!curl "https://www.gutenberg.org/cache/epub/35/pg35.txt" --output ./data/time_machine.txt
mkdir: ./data: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  199k  100  199k    0     0   104k      0  0:00:01  0:00:01 --:--:--  104k

The downloaded file contains some boilerplate text from Project Gutenberg[1] that we have to remove:

text = open("./data/time_machine.txt").read()
print(text[:891])
The Project Gutenberg eBook of The Time Machine
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Time Machine

Author: H. G. Wells

Release date: October 2, 2004 [eBook #35]
                Most recently updated: March 30, 2021

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***




The Time Machine

An Invention

by H. G. Wells


CONTENTS

 I Introduction
 II The Machine
 III The Time Traveller Returns
 IV Time Travelling
 V In the Golden Age

To strip it, we simply locate the Project Gutenberg start and end markers:

start = "*** START OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***"
end = "*** END OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***"
text = text[text.find(start) + len(start): text.find(end)]
print(len(text.split()))
print(text[:129].strip())
print("...\n")
print(text[-95:].strip())
32453
The Time Machine

An Invention

by H. G. Wells


CONTENTS

 I Introduction
 II The Machine
 III The Time Traveller Returns
...

strength had gone, gratitude and a mutual tenderness
still lived on in the heart of man.

Tokenization

Tokens are the atomic units of text. Each time step corresponds to one token, but what constitutes a token is a design choice. For example, we can represent the sentence “Deep learning is fun” as a sequence of 4 tokens, with one token for every English word. The set of all words then comprises a large vocabulary (typically ~10-100K words). Or we can represent the same sentence as a much longer sequence of 20 characters, using a much smaller vocabulary (e.g. the 128 ASCII characters). The choice of vocabulary thus trades off sequence length against vocabulary size[2].
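
For instance, here is a minimal sketch (independent of the classes defined below; the variable names are ours) contrasting the two choices:

sentence = "Deep learning is fun"

# Word-level tokens: a short sequence drawn from a large vocabulary.
word_tokens = sentence.split()

# Character-level tokens: a longer sequence drawn from a small vocabulary.
char_tokens = list(sentence)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:8])
4 ['Deep', 'learning', 'is', 'fun']
20 ['D', 'e', 'e', 'p', ' ', 'l', 'e', 'a']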

Our implementation of the tokenizer builds[3] its vocabulary from the text, which is represented as a single large string. Tokenization, implemented in tokenize, converts a string into a list of tokens; the vocabulary is the list of all known tokens. Finally, the tokenizer implements an encode method, which converts a string into a list of integer indices, and a decode method, which converts a list of integers back into a string.

import re
from collections import Counter
from typing import Union, Optional, TypeVar, List

T = TypeVar("T")
ScalarOrList = Union[T, List[T]]


class Vocab:
    def __init__(self, 
        text: str, 
        min_freq: int = 0, 
        reserved_tokens: Optional[List[str]] = None,
        preprocess: bool = True
    ):
        # Build a character-level vocabulary: count token frequencies and
        # index tokens by decreasing frequency, with the unknown token first.
        text = self.preprocess(text) if preprocess else text
        tokens = list(text)
        counter = Counter(tokens)
        reserved_tokens = reserved_tokens or []
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        self.itos = [self.unk_token] + reserved_tokens + [
            tok for tok, freq in self.token_freqs if freq >= min_freq
        ]
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

    def __len__(self):
        return len(self.itos)
    
    def __getitem__(self, tokens: ScalarOrList[str]) -> ScalarOrList[int]:
        if isinstance(tokens, str):
            return self.stoi.get(tokens, self.unk)
        else:
            return [self.__getitem__(tok) for tok in tokens]

    def to_tokens(self, indices: ScalarOrList[int]) -> ScalarOrList[str]:
        if isinstance(indices, int):
            return self.itos[indices]
        else:
            return [self.itos[int(index)] for index in indices]
            
    def preprocess(self, text: str):
        return re.sub("[^A-Za-z]+", " ", text).lower().strip()

    @property
    def unk_token(self) -> str:
        return "▮"

    @property
    def unk(self) -> int:
        return self.stoi[self.unk_token]

    @property
    def tokens(self) -> List[int]:
        return self.itos

For simplicity (i.e. to get smaller models), the source text is preprocessed by replacing every non-letter character with a space and lowercasing the result. This yields a significantly smaller vocabulary, at the cost of discarding punctuation and capitalization, which matter for understanding text and for generating nuanced output.
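
As a small illustration of this preprocessing (the example string here is our own, not from the book):

sample = "Hello, World!! It's 1895..."
print(Vocab(sample).preprocess(sample))
hello world it s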

import torch

class Tokenizer:
    def __init__(self, vocab: Vocab):
        self.vocab = vocab

    def tokenize(self, text: str) -> List[str]:
        UNK = self.vocab.unk_token
        tokens = self.vocab.stoi.keys()
        return [c if c in tokens else UNK for c in list(text)]

    def encode(self, text: str) -> torch.Tensor:
        x = self.vocab[self.tokenize(text)]
        return torch.tensor(x, dtype=torch.int64)

    def decode(self, indices: Union[ScalarOrList[int], torch.Tensor]) -> str:
        return "".join(self.vocab.to_tokens(indices))

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

Since our vocab includes only lowercase letters and the space character, the unknown token ▮ replaces any other character when we encode the text:

text = open("./data/time_machine.txt").read()
vocab = Vocab(text)
tokenizer = Tokenizer(vocab)
print("vocab:", ", ".join(vocab.itos[:10]) + ", ...")
print("tokenization:", tokenizer.tokenize("Hello!"))
print("encoding-decoding:", "\nHello! ->", tokenizer.encode("Hello!"), "->", tokenizer.decode(tokenizer.encode("Hello!")))
vocab: ▮,  , e, t, a, i, o, n, s, r, ...
tokenization: ['▮', 'e', 'l', 'l', 'o', '▮']
encoding-decoding: 
Hello! -> tensor([ 0,  2, 12, 12,  6,  0]) -> ▮ello▮

Finally, we define a class for downloading and processing the dataset:

import os
import requests

from pathlib import Path

DATA_DIR = Path("./data")
DATA_DIR.mkdir(exist_ok=True)


class TimeMachine:
    def __init__(self, download=False, path=None):
        DEFAULT_PATH = str((DATA_DIR / "time_machine.txt").absolute())
        self.filepath = path or DEFAULT_PATH
        if download or not os.path.exists(self.filepath):
            self._download()
        
    def _download(self):
        url = "https://www.gutenberg.org/cache/epub/35/pg35.txt"
        print(f"Downloading text from {url} ...", end=" ")
        response = requests.get(url, stream=True)
        response.raise_for_status()
        print("OK!")
        with open(self.filepath, "wb") as output:
            output.write(response.content)
        
    def _load_text(self):
        with open(self.filepath, "r") as f:
            text = f.read()
        s = "*** START OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***"
        e = "*** END OF THE PROJECT GUTENBERG EBOOK THE TIME MACHINE ***"
        return text[text.find(s) + len(s): text.find(e)]
    
    def build(self, vocab: Optional[Vocab] = None):
        self.text = self._load_text()
        vocab = vocab or Vocab(self.text)
        tokenizer = Tokenizer(vocab)
        encoded_text = tokenizer.encode(vocab.preprocess(self.text))
        return encoded_text, tokenizer

Basic usage is to download the text and then build the encoded dataset:

tm = TimeMachine(download=True)
encoded_text, tokenizer = tm.build()
Downloading text from https://www.gutenberg.org/cache/epub/35/pg35.txt ... OK!

Here encoded_text is the cleaned text encoded as integer indices. It will be used later as training data (see the sketch below).

print(len(encoded_text), tokenizer.vocab_size)
print(tokenizer.decode(encoded_text[:100]) + "...")
174215 28
the time machine an invention by h g wells contents i introduction ii the machine iii the time trave...
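
As a preview of how encoded_text can be turned into training examples (a minimal sketch, assuming next-character prediction; seq_len and the slicing below are our own choices, not part of the classes above):

seq_len = 10

# Input: a window of characters; target: the same window shifted by one step.
x = encoded_text[:seq_len]
y = encoded_text[1:seq_len + 1]
print(tokenizer.decode(x), "->", tokenizer.decode(y))
the time m -> he time ma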

Appendix: Zipf’s law

Zipf’s law is the observation that token frequency follows an inverse power law. More precisely, it states that for texts in natural language the frequency \(f_i\) of the \(i\)-th most frequent word is inversely proportional to a power of its rank \(i\), after a few exceptions. Let \(a > 0\) characterize the rate of token frequency decay. Then,

\[f_i = {f_1} \cdot {i^{-a}}\]

or

\[\log f_i = -a \log i + \log f_1.\]

Note that the indexing drops a few words, i.e. \(i = 1\) corresponds to the rank of the first word that was not dropped. The parameter \(a\) and the number of skipped words are particular to the text and, to a larger extent, to the language used.
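
In particular, evaluating the last equation at two ranks \(i\) and \(j\) and subtracting gives the slope of the log–log plot, which is an estimate of \(-a\):

\[\frac{\log f_j - \log f_i}{\log j - \log i} = -a.\]

This is how \(a\) is estimated from the frequency arrays below.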

tm = TimeMachine(download=False)
data, tokenizer = tm.build()

def freqs(tokens):
    # Token frequencies sorted from most to least frequent.
    return sorted(Counter(tokens).values(), reverse=True)

words = tokenizer.vocab.preprocess(tm.text).split()
word_freqs = freqs(words)
bigram_freqs = freqs(["__".join(pair) for pair in zip(words[:-1], words[1:])])
trigram_freqs = freqs(["__".join(triple) for triple in zip(words[:-2], words[1:-1], words[2:])])
%matplotlib inline
%config InlineBackend.figure_format = "svg"
import matplotlib.pyplot as plt

plt.plot(word_freqs, label="words")
plt.plot(bigram_freqs, label="bigram")
plt.plot(trigram_freqs, label="trigram")
plt.yscale("log")
plt.xscale("log")
plt.ylabel("frequency $f_i$")
plt.xlabel("rank $i$")
plt.legend();
[Figure: log–log plot of token frequency \(f_i\) against rank \(i\) for words, bigrams, and trigrams.]

Estimating \(a\):

import math

print("1:", (math.log(word_freqs[100]) - math.log(word_freqs[10])) / (math.log(100) - math.log(10)))
print("2:", (math.log(bigram_freqs[100]) - math.log(bigram_freqs[10])) / (math.log(100) - math.log(10)))
print("3:", (math.log(trigram_freqs[100]) - math.log(trigram_freqs[10])) / (math.log(100) - math.log(10)))
1: -1.007012981390835
2: -0.5772364076029304
3: -0.38021124171160603

Remark. Similar behavior and similar \(a\)’s have been observed for other novels. For example, The Time Machine and Frankenstein have similar values. Other types of text, such as non-fiction or plays (e.g. Romeo and Juliet), have slightly, but significantly, different values.