Basic theory

Here we will consider binary classification, although the method can be easily extended to the multiclass setting. For each data point \(\boldsymbol{\mathsf{x}} \in \mathscr{X}\) the latent true label is denoted by \(y \in \mathscr{Y} = \{-1, +1\}\). We have \(n\) labeling functions \(\lambda_j\colon \mathscr{X} \to \mathscr{Y} \cup \{0\}\) for \(j = 1, \ldots, n.\) If an LF is not applicable to a data point, it returns \(0\) to abstain, which is why we augment the target space with the zero label. So if we have data points \(\boldsymbol{\mathsf{x}}_i\) for \(i = 1, \ldots, m\), then we get an \(m \times n\) matrix of LF outputs \(\Lambda_{ij} = \lambda_j(\boldsymbol{\mathsf{x}}_i)\).
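For intuition, a labeling function is typically a simple heuristic. The following toy spam LFs are made up purely for illustration (both the names and the rules are ours, not from any particular task); each one either votes or abstains:

def lf_contains_free_money(x: str) -> int:
    # Hypothetical LF: vote spam (+1) if a spammy phrase appears, else abstain (0).
    return +1 if "free money" in x.lower() else 0

def lf_no_url(x: str) -> int:
    # Hypothetical LF: vote not-spam (-1) if the text has no URL, else abstain (0).
    return -1 if "http" not in x.lower() else 0

x = "Claim your FREE MONEY now!"
[lf_contains_free_money(x), lf_no_url(x)]   # one row of Λ: [+1, -1]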

Our first step is to estimate the distribution \(p_{\delta, \gamma}(\Lambda=\lambda(\boldsymbol{\mathsf{x}}), Y=y)\), defined as the probability that the LFs output \(\lambda(\boldsymbol{\mathsf{x}}) = (\lambda_1(\boldsymbol{\mathsf{x}}), \ldots, \lambda_n(\boldsymbol{\mathsf{x}}))\) for an instance \(\boldsymbol{\mathsf{x}}\) with true label \(y\). The parameters of the model are the coverage \(\delta = (\delta_1, \ldots, \delta_n)\) and accuracy \(\gamma = (\gamma_1, \ldots, \gamma_n)\) of each LF, and they are chosen to maximize the marginal probability \(p_{\delta, \gamma}(\Lambda)\) of the observed LF outputs. Once the parameters \(\hat{\delta}\) and \(\hat{\gamma}\) have been learned, we can train a noise-aware discriminative model by minimizing the following loss function:

\[ \mathscr{L}(\Theta) = \frac{1}{m}\sum_{i=1}^m \sum_{y=-1,+1} \ell(f_\Theta(\boldsymbol{\mathsf{x}}_i), y) \cdot {p_{\hat{\delta}, \hat{\gamma}}(Y = y \mid \Lambda = \lambda(\boldsymbol{\mathsf{x}}_i))} \]

where \(\ell\) is the instance-level loss. This looks like the usual empirical loss, except that the contribution of each possible target \(y\) is weighted by the probability of that target given the LF outputs for the input.
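To make this concrete, here is a minimal sketch of the noise-aware loss in PyTorch, assuming \(\ell\) is the logistic loss, that \(f_\Theta\) outputs one logit per instance, and that the posteriors \(p_{\hat{\delta}, \hat{\gamma}}(Y = +1 \mid \Lambda)\) have already been computed (the names logits and prob_pos are illustrative):

import torch
import torch.nn.functional as F

def noise_aware_loss(logits, prob_pos):
    """Expected loss over y ∈ {-1, +1}, weighted by the label posteriors.

    logits:   shape (m,), raw scores f_Θ(x_i) of a binary classifier
    prob_pos: shape (m,), estimated p(Y = +1 | Λ = λ(x_i))
    """
    loss_pos = F.softplus(-logits)  # logistic loss ℓ(f, +1) = log(1 + e^{-f})
    loss_neg = F.softplus(+logits)  # logistic loss ℓ(f, -1) = log(1 + e^{+f})
    return (prob_pos * loss_pos + (1 - prob_pos) * loss_neg).mean()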

Generative model

For each LF \(\lambda_j\) we assign two parameters \((\delta_j, \gamma_j)\) corresponding to its coverage and accuracy. Coverage \(\delta_j\) is the probability that the LF labels an input, and accuracy \(\gamma_j\) is the probability that it labels the input correctly, given that it does not abstain. This assumes that each LF behaves the same way for both labels (e.g. it is not more accurate when the true label is positive). Moreover, we assume that the LF outputs are independent of each other given the true label. Hence,

\[\begin{split} \begin{aligned} p_{\delta, \gamma}(\Lambda = \lambda(\boldsymbol{\mathsf{x}})) &= \sum_{y=-1,+1} p_{\delta, \gamma}(\Lambda = \lambda(\boldsymbol{\mathsf{x}}), Y = y) \\ &= \sum_{y=-1,+1} p(Y = y) \cdot p_{\delta, \gamma}(\Lambda = \lambda(\boldsymbol{\mathsf{x}}) \mid Y = y) \\ &= \sum_{y=-1,+1} p(Y = y) \cdot \prod_{j=1}^n \begin{cases} 1 - \delta_j & \lambda_j(\boldsymbol{\mathsf{x}}) = 0 \\ \delta_j\,\gamma_j & y\,\lambda_j(\boldsymbol{\mathsf{x}}) = +1 \\ \delta_j\,(1 - \gamma_j) & y\,\lambda_j(\boldsymbol{\mathsf{x}}) = -1. \end{cases} \end{aligned} \end{split}\]

Our goal therefore is to find parameters that maximize the likelihood of the observed LF outputs for our dataset:

\[\begin{split} \begin{aligned} (\hat{\delta}, \hat{\gamma}) &= \underset{\delta,\gamma}{\text{arg min}} \sum_{i=1}^m -\log \; p_{\delta, \gamma}(\Lambda_{i}) \\ &= \underset{\delta,\gamma}{\text{arg min}} \sum_{i=1}^m -\log \left( \sum_{y=-1,+1} p_y \cdot \prod_{j=1}^n \begin{cases} 1 - \delta_j & \Lambda_{ij} = 0 \\ \delta_j\,\gamma_j & y\,\Lambda_{ij} = +1 \\ \delta_j\,(1 - \gamma_j) & y\,\Lambda_{ij} = -1 \end{cases} \right). \end{aligned} \end{split}\]

Remark. The assumption that accuracy is independent of the true label is strong: the rows of a classifier's confusion matrix generally do not have the same distribution. This can be fixed by assigning a separate set of LF parameters for each true label. For the multi-class case with \(K\) classes, we would then have to learn \(K-1\) free entries of each row of the confusion matrix of every LF. For the sake of simplicity, we stick with the idealized case for binary classification.
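Concretely, one way to write this relaxation (the per-label accuracy \(\gamma_{j,y}\) is our own notation, for illustration only) is to keep a shared coverage \(\delta_j\) and give each LF one accuracy per true label, so that the per-LF factor in the likelihood becomes

\[ p_{\delta, \gamma}(\lambda_j(\boldsymbol{\mathsf{x}}) \mid Y = y) = \begin{cases} 1 - \delta_j & \lambda_j(\boldsymbol{\mathsf{x}}) = 0 \\ \delta_j\,\gamma_{j,y} & y\,\lambda_j(\boldsymbol{\mathsf{x}}) = +1 \\ \delta_j\,(1 - \gamma_{j,y}) & y\,\lambda_j(\boldsymbol{\mathsf{x}}) = -1. \end{cases} \]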

Code implementation

We implement the above equations using some clever indexing:

import torch
torch.manual_seed(1)

delta = torch.tensor([0.3, 0.3, 0.3, 0.3])  # coverage
gamma = torch.tensor([0.7, 0.6, 0.8, 0.9])  # accuracy

n = 4
L = torch.tensor([
    [-1,  0, +1, +1],
    [+1, -1, -1,  0],
])

params = torch.stack([
    1 - delta,              # abstained
    delta * gamma,          # accurate
    delta * (1 - gamma)     # inaccurate
], dim=0)

params  # Note sum along dim=0 is 1, i.e. sum of all p(λ | y) for λ = -1, 0, 1 is 1.
tensor([[0.7000, 0.7000, 0.7000, 0.7000],
        [0.2100, 0.1800, 0.2400, 0.2700],
        [0.0900, 0.1200, 0.0600, 0.0300]])

We will use the empirical LF matrix to pick out the appropriate probability for each entry, depending on its value. Indexing params with L maps 0 to the abstain row, +1 to the accurate row, and -1 (which wraps around to the last row) to the inaccurate row:

# p(L | y = +1)
params[L, torch.arange(n)]
tensor([[0.0900, 0.7000, 0.2400, 0.2700],
        [0.2100, 0.1200, 0.0600, 0.7000]])

Notice that for the negative label the non-abstain probabilities flip, since negating L swaps the accurate and inaccurate rows:

# p(L | y = -1)
params[-L, torch.arange(n)]
tensor([[0.2100, 0.7000, 0.0600, 0.0300],
        [0.0900, 0.1800, 0.2400, 0.7000]])

Let us use a uniform class prior \(p_{y=-1} = p_{y=+1} = 0.5\). The marginal probability of the LF outputs for each instance is then given by:

py = [0.0, 0.5, 0.5]    # dummy at index 0 so that py[+1] and py[-1] index the class priors
p_pos = py[+1] * (params[+L, torch.arange(n)]).prod(dim=1) 
p_neg = py[-1] * (params[-L, torch.arange(n)]).prod(dim=1) 
p = p_pos + p_neg
p
tensor([0.0022, 0.0019])
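As an aside, p_pos and p_neg also give us the posterior needed by the noise-aware loss above (here with the hand-set parameters rather than learned ones), since \(p(Y = +1 \mid \Lambda) = p_{\text{pos}} / (p_{\text{pos}} + p_{\text{neg}})\):

prob_pos = p_pos / p    # posterior p(Y = +1 | Λ) per instance, roughly [0.94, 0.28] here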

Note that we generally have \(m \gg 1\) terms with values in \([0, 1]\), so we use \(\log\) to convert the product into a sum and avoid numerical underflow:

print("     p(Λ):", p.prod().item())
print("-log p(Λ):", -torch.log(p).sum().item())
     p(Λ): 4.107915628992487e-06
-log p(Λ): 12.402594566345215
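Finally, here is a minimal sketch of how \(\hat{\delta}\) and \(\hat{\gamma}\) could be fit by gradient descent on this negative log-likelihood. The sigmoid parameterization (to keep coverage and accuracy in \((0, 1)\)) and the optimizer settings are illustrative choices, not prescribed by the model; with only the two toy rows of L above the estimates are of course not meaningful, but the same code applies unchanged to a full LF matrix:

# Unconstrained parameters; sigmoid maps them into (0, 1).
delta_logit = torch.zeros(n, requires_grad=True)
gamma_logit = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([delta_logit, gamma_logit], lr=0.01)

for step in range(1000):
    d, g = torch.sigmoid(delta_logit), torch.sigmoid(gamma_logit)
    probs = torch.stack([1 - d, d * g, d * (1 - g)], dim=0)
    p_pos = py[+1] * probs[+L, torch.arange(n)].prod(dim=1)
    p_neg = py[-1] * probs[-L, torch.arange(n)].prod(dim=1)
    nll = -torch.log(p_pos + p_neg).sum()
    opt.zero_grad()
    nll.backward()
    opt.step()

delta_hat = torch.sigmoid(delta_logit).detach()  # estimated coverage
gamma_hat = torch.sigmoid(gamma_logit).detach()  # estimated accuracy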