(dl/01-intro)=
# Introduction to Neural Networks

Deep learning models are machine learning models with multiple layers of *learned representations*. A **layer** is any transformation that maps an input to its feature representation. In principle, any function that can be described as a composition of layers with *differentiable operations* can be called a **neural network**.

Representations are powerful. In machine translation, sentences are not translated word-for-word between two languages. Instead, a model learns an intermediate representation that captures the 'thought' shared by the translated sentences ({numref}`01-thought`).

<br>

```{figure} ../../../img/nn/01-thought.png
---
name: 01-thought
width: 80%
align: center
---
Thought is a representation.
```
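
The definition of a neural network as a composition of layers can be made concrete with a minimal sketch in plain Python. The helper names (`dense`, `tanh_layer`, `compose`, `random_dense`) and the layer sizes are illustrative assumptions, not part of any library, and the weights are random rather than learned.

```python
import math
import random

def dense(weights, bias):
    """Return an affine layer x -> Wx + b acting on a list of floats."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer

def tanh_layer(x):
    """Elementwise differentiable nonlinearity."""
    return [math.tanh(v) for v in x]

def compose(*layers):
    """Chain layers so that the output of one is the input of the next."""
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network

def random_dense(n_in, n_out, rng):
    """Hypothetical helper: a dense layer with random (untrained) weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return dense(weights, [0.0] * n_out)

rng = random.Random(0)
# A two-layer network: 3 inputs -> 4 hidden features -> 2 outputs.
network = compose(random_dense(3, 4, rng), tanh_layer, random_dense(4, 2, rng))
output = network([1.0, -2.0, 0.5])
print(output)
```

Each intermediate list produced inside `network` plays the role of a representation of the input. In practice the weights would be fitted by gradient-based training, which is what the differentiability requirement makes possible.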

The practical successes of deep learning can be attributed to three factors:

- **Scalability.** Deep neural networks ({numref}`01-imagenet-progress`) are created by stacking many layers (10 to 100+). Features at deeper layers can be thought of as combinatorial, higher-level, and less sensitive to noise, which improves generalization. On the other hand, *wider layers*, i.e. layers with more parameters, memorize features better at that level. There are tradeoffs between depth and width that have to be controlled in order to get effectively sized networks. But in general, the larger the network, the better it generalizes to test data ({numref}`01-gflops`).

+++

- **Massive datasets.** Large, deep networks ({numref}`01-imagenet-progress`) are necessary for tasks involving massive, complex, structured datasets like images and text, where the complexity of the model aligns with that of the underlying distributions in the data. Moreover, complex models allow for *automatic feature engineering*, provided that prior knowledge about the structure (or [modality](https://en.wikipedia.org/wiki/Multimodal_learning)) of the data is encoded into the network architecture. Another interesting property of neural networks is that knowledge can be transferred between tasks through **transfer learning**, where a large model pre-trained on a large dataset is adapted to specialized tasks using smaller datasets.

+++

- **Compute.** Both of these factors require significant computational resources ({numref}`01-gflops`). GPUs, with their parallel processing capabilities, are particularly well-suited to the large-scale matrix operations required in deep learning.

```{figure} ../../../img/nn/01-imagenet-progress.png
---
name: 01-imagenet-progress
width: 80%
align: center
---
Progress in top-5 error in the [ImageNet competition](https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge).
```
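
The transfer learning idea mentioned under **Massive datasets** can be sketched as follows: a feature extractor is kept frozen and only a small task-specific head is trained on a small dataset. This is a toy illustration under loud assumptions. The frozen weights here are random stand-ins for a genuinely pre-trained model, and the dataset, sizes, and names (`FROZEN_W`, `extract_features`, `predict`, `mean_loss`) are all hypothetical.

```python
import math
import random

rng = random.Random(0)

# ASSUMPTION: these fixed random weights stand in for a pre-trained feature
# extractor; a real one would be loaded from a model trained on a large dataset.
FROZEN_W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]

def extract_features(x):
    """Frozen layer: its weights are never updated during adaptation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]

def predict(head_w, head_b, feats):
    """New task-specific head: logistic regression on the frozen features."""
    z = sum(w * f for w, f in zip(head_w, feats)) + head_b
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(head_w, head_b, dataset):
    """Average binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for feats, y in dataset:
        p = min(max(predict(head_w, head_b, feats), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(dataset)

# Small toy dataset for the specialized task: label is 1 when x0 + x1 > 0.
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
dataset = [(extract_features(x), 1.0 if x[0] + x[1] > 0 else 0.0) for x in inputs]

head_w, head_b = [0.0] * 8, 0.0
initial_loss = mean_loss(head_w, head_b, dataset)

learning_rate = 0.1
for _ in range(200):
    grad_w, grad_b = [0.0] * 8, 0.0
    for feats, y in dataset:
        err = predict(head_w, head_b, feats) - y  # d(loss)/d(logit)
        grad_w = [g + err * f for g, f in zip(grad_w, feats)]
        grad_b += err
    # Only the head is updated; FROZEN_W is left untouched.
    head_w = [w - learning_rate * g / len(dataset) for w, g in zip(head_w, grad_w)]
    head_b -= learning_rate * grad_b / len(dataset)

final_loss = mean_loss(head_w, head_b, dataset)
print(initial_loss, final_loss)
```

Because the extractor is frozen, its features are computed once and reused at every step, which is part of why adapting a large model to a small dataset can be cheap. Fine-tuning some or all of the pre-trained layers is another common variant that this sketch does not cover.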

```{figure} ../../../img/nn/01-gflops.png
---
name: 01-gflops
width: 80%
align: center
---
[Top-1 accuracy](https://stackoverflow.com/questions/37668902/evaluation-calculate-top-n-accuracy-top-1-and-top-5) vs [GFLOPs](https://en.wikipedia.org/wiki/FLOPS) on ImageNet. Performance generally improves with increasing compute, but some architectures, such as ResNet, have a better tradeoff than others, e.g. VGG.
```
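
To make the link between depth, width, and compute more tangible, the sketch below counts parameters and multiply-add operations for a plain fully connected network. The function name `mlp_cost` and the layer widths are assumptions chosen for illustration; convolutional architectures such as ResNet and VGG are counted differently.

```python
def mlp_cost(widths):
    """Return (parameters, multiply-adds per input) for a fully connected network.

    `widths` lists the layer sizes from input to output, e.g. [784, 256, 10].
    """
    params = 0
    mult_adds = 0
    for n_in, n_out in zip(widths, widths[1:]):
        params += n_in * n_out + n_out  # weight matrix plus bias vector
        mult_adds += n_in * n_out       # one multiply-add per weight
    return params, mult_adds

wide = mlp_cost([784, 256, 10])         # one wide hidden layer
deep = mlp_cost([784, 64, 64, 64, 10])  # three narrow hidden layers
print(wide)  # (203530, 203264)
print(deep)  # (59210, 59008)
```

For dense layers the compute per input tracks the parameter count closely, since each weight is used once. The deeper, narrower network here is cheaper than the wide one, which is one way the depth and width tradeoff shows up in practice.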

## References

- [Berkeley CS 182. Lecture 1: Introduction](https://cs182sp21.github.io/static/slides/lec-1.pdf)
- [Berkeley CS 182. Lecture 2: ML Basics 1](https://cs182sp21.github.io/static/slides/lec-2.pdf)
- [Berkeley CS 182. Lecture 3: ML Basics 2](https://cs182sp21.github.io/static/slides/lec-3.pdf)
- [Cornell CS 4780. Lecture 12: Bias-Variance Tradeoff](https://www.cs.cornell.edu/courses/cs4780/2018sp/lectures/lecturenote12.html)
- {cite}`data-programming`