# Basic mathematical notations in Machine Learning

Although I try to avoid touching too much maths on this site, machine learning originated from computer science, so it is unavoidable to involve some maths and mathematical notations when explaining some of its concepts.

In this blog post, I will summarize and explain some important mathematical concepts and notations in machine learning. Please note that, for readability, the definitions in this post are simplified. For rigorous definitions, you can refer to the reference books listed at the end of the article.

## Vector

In machine learning, it is very common to represent an object in numbers because this is how a computer algorithm sees the object.

Let’s say we want to train a model to classify raccoons and red pandas. We would need to describe them using numbers, for instance their height, weight, or color. In most cases, we don’t rely solely on a single piece of information; we want to describe the object in as much detail as possible. A vector is the mathematical tool for storing this multidimensional information.

A **vector** is basically an array of numbers. It allows us to represent multidimensional information about a single entity. For example, \(x = [height, weight, r, g, b]\).

The dimension of a vector is usually stated as follows:

$$x\in\mathbb{R}^N$$

This expression means \(x\) is a vector consisting of \(N\) real numbers.
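In code, a vector is commonly stored as a numpy array. The sketch below builds a hypothetical feature vector following the \([height, weight, r, g, b]\) example above; the concrete values are made up for illustration.

```python
import numpy as np

# Hypothetical feature vector for an animal: height (cm), weight (kg),
# and average fur color as R, G, B values. The numbers are illustrative only.
x = np.array([60.0, 6.2, 150.0, 80.0, 40.0])

# x lives in R^5: a vector consisting of 5 real numbers.
print(x.shape)  # (5,)
```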

## Norm and Distance

Once objects are represented in terms of vectors, the next thing we want to know is how different the vectors are and eventually utilize this information to build a model to do prediction or classification.

To achieve this, we need a function that measures the length of a vector; a **norm** is the mathematical tool designed for this.

The norm of a vector \(x = [x_1, x_2, \cdots, x_N]\) is defined as

$$||x||_p = (\sum_{i=1}^{N} |x_i|^p)^{1/p}$$

Looks complicated? Don’t worry, most of the time we will only use 2-norm and 1-norm:

$$||x||_2 = \sqrt{\sum_{i=1}^{N} x_i^2} $$

$$||x||_1= \sum_{i=1}^{N} |x_i| $$
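The general \(p\)-norm formula above translates directly into a few lines of numpy. A minimal sketch (the vector values are arbitrary examples):

```python
import numpy as np

x = np.array([3.0, -4.0])

# General p-norm: (sum of |x_i|^p) raised to the power 1/p
def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1 / p)

print(p_norm(x, 2))  # 5.0, the 2-norm (same as np.linalg.norm(x))
print(p_norm(x, 1))  # 7.0, the 1-norm
```

For the common cases, `np.linalg.norm(x)` gives the 2-norm and `np.linalg.norm(x, ord=1)` gives the 1-norm.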

With the help of norms, we can define distance as follows:

$$d(a,b)=||a-b||_p$$
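For \(p = 2\) this is the familiar Euclidean distance. A short sketch with two example points chosen so the distance comes out to a round number:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# d(a, b) = ||a - b||_2: the Euclidean distance between the two vectors
d = np.linalg.norm(a - b)
print(d)  # 5.0
```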

However, if you dig deeper into machine learning, you may see other ways to define distance, for instance the **edit distance**. We will cover more on this topic when we need it.

## Optimization

Optimization is a large and important field of applied mathematics that is directly related to many real-world problems.

To be more specific, optimization is all about solving problems of the form **maximizing/minimizing something under certain constraints**. Problems in many domains can be formulated this way and, of course, this includes machine learning.

In machine learning, most of the time we are interested in two types of optimization problems.

One of them is to minimize the error \(E\) by changing the model parameters \(\theta\) based on data \(x,y\) :

$$\min \limits _\theta E(x,y,\theta)$$

Another one is to maximize the likelihood \(L\) by changing the model parameters \(\theta\) based on data \(x,y\):

$$\max \limits _\theta L(x,y,\theta)$$
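The error-minimization form can be sketched concretely with a linear model. The data below is synthetic (generated roughly from \(y = 2x + 1\)), and the error \(E\) is the sum of squared residuals; `np.polyfit` solves this particular \(\min_\theta E\) in closed form.

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 (illustrative values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

# Squared error E(x, y, theta) for a linear model y_hat = theta0 * x + theta1
def error(theta0, theta1):
    return np.sum((theta0 * x + theta1 - y) ** 2)

# np.polyfit solves min_theta E in closed form for polynomial models
theta0, theta1 = np.polyfit(x, y, 1)
print(theta0, theta1)  # close to 2 and 1
```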

## Gradient

The gradient is basically the slope of a function. Most methods for solving optimization problems use the gradient of the objective function (the function you want to minimize/maximize) to guide the search. Imagine you want to reach the bottom of a valley: the most intuitive way is to move in the direction opposite to the slope. When the gradient becomes gentle, you know you are getting close to the bottom of the valley.

We will talk more about gradient later.
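The valley analogy above is exactly what gradient descent does. A minimal sketch on the one-dimensional function \(f(t) = (t - 3)^2\), whose minimum is at \(t = 3\) (the learning rate and step count are arbitrary illustrative choices):

```python
# Gradient descent on f(t) = (t - 3)^2; the gradient is f'(t) = 2 * (t - 3).
def gradient_descent(start, lr=0.1, steps=100):
    t = start
    for _ in range(steps):
        grad = 2 * (t - 3)  # slope of f at the current point
        t -= lr * grad      # step opposite to the slope ("downhill")
    return t

print(gradient_descent(0.0))  # approximately 3.0
```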

## Bayes’ Rule

Bayes’ Rule is a theorem in probability theory that describes the relationship between the probability of an event and its prior conditions.

The equation itself is very simple:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

In the above equation, **A** and **B** are events and **P(X)** is the probability that event **X** happens.

**P(X|Y)** is the conditional probability that event **X** happens given that event **Y** has already happened.

This relationship is very common in machine learning since, in many cases, we want to know the probability that a new event **x** happens given a history of past events **D**, i.e. **P(x|D)**. With the help of Bayes’ Rule, we can link this posterior probability to the likelihood **P(D|x)**, or, with some assumptions, even to the joint probability **P(x, D)**.
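Bayes’ Rule is easiest to appreciate with numbers. The sketch below uses made-up figures for a disease-test scenario: **A** is "has the disease", **B** is "tests positive", and the probabilities are illustrative assumptions, not real statistics.

```python
# Hypothetical numbers (illustrative only):
p_a = 0.01             # P(A): prior probability of the disease
p_b_given_a = 0.95     # P(B|A): positive test given the disease
p_b_given_not_a = 0.05 # P(B|not A): false positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # about 0.161
```

Note that even with an accurate test, the posterior **P(A|B)** stays low here because the prior **P(A)** is small.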

## References

- Linear Algebra and Its Applications by David C. Lay, Steven R. Lay, and Judi J. McDonald
- Introduction to Linear Algebra by Gilbert Strang
- Convex Optimization by Stephen P. Boyd and Lieven Vandenberghe
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman