# Empirical Risk and the true Risk

We can try to learn \(f(x,\alpha)\) by choosing a function that performs well on training data:

$$
R_{emp}\left(\alpha\right)=\frac{1}{m}\sum_{i=1}^{m}{l\left(f\left(x_i,\alpha\right),y_i\right)}
$$

where \(l\) is the **zero-one** loss function.

Vapnik and Chervonenkis **showed** that **an upper bound on the truerisk** can be given by the empirical risk + and additional term:

$$ R\left(\alpha\right) \leq R_{emp}\left(\alpha\right)+\sqrt{\frac{h\left(\log\left(\frac{2m}{n}+1\right)\right)-\log \left(\frac{n}{4} \right)}{m}} $$

where \(h\) is the VC dimension of the set of functions parameterized by\(\alpha\)

The VC dimension of a set of functions is a measure of their capacity or complexity

If you can describe a lot of different phenomena with a set of functions then the value of h is large.

*VC dim = the maximum number of points that can be separated in allpossible ways by that set of fuctions.*

# VC dimension and capacity of functions

Simplification of bound:

Test Error \( \leq \) Training Error + Complexity of set of Models

# Capacity of hyperplans

Vapnik and Chevrvonenkis also showed the following:

Consider hyperplans \((wx=0)\)where \( w \) is normalized w.r.t. a set of points \( X^{\star} \) such that: \( min_i|wx_i|=1 \)

the set of decision functions \( f_w(x)=sign(wx) \) defined on \( X^{\star} \) such that \( ||w|| \leq A \) has a VC dimension satisfying

$$
h \leq R^2A^2
$$

where \( R \) is the radius of the smallest sphere around the origin containing \( X^{\star} \). So we would like to find the function which minimizes an objective like:

Training Error + Complexity term

We write that as:

$$
\frac{1}{n}\sum_{i=1}^{n}l\left(f\left(x_i,\alpha\right),y_i\right)+ \text{ Complexity term}
$$

For now we will choose the set of hyperplans, so \( f(x)=wx+b \):

$$\frac{1}{n}\sum_{i=1}^{n}l\left(f\left(x_i,\alpha\right),y_i\right)+{\|w\|}^2$$

subject to \( min_i|wx_i|=1 \). That function before was a little fifficult to minimize because of the step function in \( l(y,\hat{y}) \). Let’s assume we can seperate the data perfectly. Then we can optimize the following: minimize \( {\|w\|}^2 \), subject to:

$$
\begin{align}
(wx_i+b) & \geq 1, if \quad y_i=1 \\
(wx_i+b) & \leq -1, if \quad y_i=-1
\end{align}
$$

the last two constraints can be compacted to:

$$y_i(wx_i+b)\geq1$$

# Advantages and Disadvantages

## Advantages

It has a regularisation parameter, which makes the user think about avoiding over-fitting

It uses the kernel trick, so you can build in expert knowledge about the problem via engineering the kernel

An SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO)

It is an approximation to a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.

## Disadvantages

The main disadvantage of the SVM algorithm is that it has several key parameters that need to be set correctly to achieve the best classification results for any given problem. Parameters that may result in an excellent classification accuracy for problem A, may result in a poor classification accuracy for problem B. The user may, therefore, have to experiment with a number of different parameter settings in order to achieve a satisfactory result. The main parameters that the user should experiment with are the SVM kernel type and the kernel-specific parameters