
A short introduction to SVM

# Empirical Risk and the True Risk

We can try to learn $$f(x,\alpha)$$ by choosing a function that performs well on training data:

$$R_{emp}\left(\alpha\right)=\frac{1}{m}\sum_{i=1}^{m}{l\left(f\left(x_i,\alpha\right),y_i\right)}$$

where $$l$$ is the zero-one loss function.
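The empirical risk is just the average loss over the training set. A minimal numpy sketch (the decision function, points, and labels below are made-up toy values):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """Zero-one loss: 1 for a misclassification, 0 otherwise."""
    return (y_pred != y_true).astype(float)

def empirical_risk(f, X, y):
    """R_emp(alpha): average zero-one loss of f over the m training points."""
    y_pred = np.array([f(x) for x in X])
    return zero_one_loss(y_pred, y).mean()

# Toy check with a fixed linear decision function f(x) = sign(w.x)
w = np.array([1.0, -1.0])
f = lambda x: np.sign(w @ x)
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 3.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, -1.0, -1.0])
print(empirical_risk(f, X, y))  # 0.25 — f misclassifies one of the four points
```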
Vapnik and Chervonenkis showed that, with probability $$1-\eta$$, the true risk is bounded above by the empirical risk plus an additional term:

$$R\left(\alpha\right) \leq R_{emp}\left(\alpha\right)+\sqrt{\frac{h\left(\log\frac{2m}{h}+1\right)-\log\frac{\eta}{4}}{m}}$$

where $$h$$ is the VC dimension of the set of functions parameterized by $$\alpha$$.
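To get a feel for the bound, here is a small sketch that evaluates its second term in the standard form (with VC dimension $$h$$, sample size $$m$$, and confidence parameter $$\eta$$); the function name and the example values are my own:

```python
import math

def vc_confidence(h, m, eta=0.05):
    """Second term of the VC bound:
    sqrt((h * (log(2m/h) + 1) - log(eta/4)) / m)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

# The gap between training error and true risk shrinks as m grows
# relative to the capacity h:
for m in (100, 1000, 10000):
    print(m, round(vc_confidence(h=10, m=m), 3))
```

Note that the term grows with $$h$$ and shrinks with $$m$$: a richer function class needs more data before the training error becomes a trustworthy estimate of the true risk.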

1. The VC dimension of a set of functions is a measure of their capacity or complexity.

2. If you can describe many different phenomena with a set of functions, then $$h$$ is large.

The VC dimension is the maximum number of points that can be separated (shattered) in all possible ways by that set of functions.
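For example, lines in the plane shatter three non-collinear points, so their VC dimension is at least 3. A small sketch that checks this by brute force (the `separable` helper is mine; it uses the perceptron update rule, which converges exactly when a labeling is linearly separable):

```python
import itertools
import numpy as np

def separable(X, y, epochs=1000):
    """Try to find (w, b) with sign(w.x + b) matching y via the
    perceptron rule; returns True once an epoch has no mistakes."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w, b = w + yi * xi, b + yi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points: every one of the 2^3 labelings is
# linearly separable, so VC dim of lines in the plane is >= 3.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labelings = list(itertools.product([-1.0, 1.0], repeat=3))
print(all(separable(X, np.array(y)) for y in labelings))  # True
```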

# VC dimension and capacity of functions

A simplification of the bound:
Test Error $$\leq$$ Training Error + Complexity of set of Models

# Capacity of hyperplanes

Vapnik and Chervonenkis also showed the following.
Consider hyperplanes $$(wx=0)$$ where $$w$$ is normalized with respect to a set of points $$X^{\star}$$ such that $$\min_i|wx_i|=1$$.

Then the set of decision functions $$f_w(x)=\mathrm{sign}(wx)$$ defined on $$X^{\star}$$ such that $$\|w\| \leq A$$ has VC dimension satisfying

$$h \leq R^2A^2$$

where $$R$$ is the radius of the smallest sphere around the origin containing $$X^{\star}$$.

So we would like to find the function which minimizes an objective of the form

Training Error + Complexity term

which we write as:

$$\frac{1}{n}\sum_{i=1}^{n}l\left(f\left(x_i,\alpha\right),y_i\right)+ \text{ Complexity term}$$
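As a small numeric illustration of the capacity bound $$h \leq R^2A^2$$ above (the point set is made up): constraining $$\|w\| \leq A$$ with a small $$A$$, i.e. enforcing a large margin, directly caps the VC dimension of the resulting classifiers.

```python
import numpy as np

# Hypothetical point set X*; R is the radius of the smallest
# origin-centred sphere containing it, i.e. the largest point norm.
X = np.array([[0.5, 1.0], [-1.0, 0.5], [1.5, -0.5], [0.0, -1.0]])
R = np.linalg.norm(X, axis=1).max()

# Restricting hyperplanes to ||w|| <= A caps the VC dimension at R^2 * A^2:
for A in (1.0, 2.0, 4.0):
    print(A, R**2 * A**2)  # smaller A (larger margin) => lower capacity
```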

For now we will choose the set of hyperplanes, so $$f(x)=wx+b$$:

$$\frac{1}{n}\sum_{i=1}^{n}l\left(f\left(x_i,\alpha\right),y_i\right)+{\|w\|}^2$$

subject to $$\min_i|wx_i|=1$$. This objective is difficult to minimize directly because of the step function in $$l(y,\hat{y})$$. Let's assume we can separate the data perfectly. Then we can instead solve: minimize $${\|w\|}^2$$, subject to:

\begin{align} (wx_i+b) & \geq 1 \quad \text{if } y_i=+1 \\ (wx_i+b) & \leq -1 \quad \text{if } y_i=-1 \end{align}

The two constraints can be combined into a single one:

$$y_i(wx_i+b)\geq1$$
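To see this formulation in action, here is a rough sketch that approximates the constrained problem with the standard hinge-loss surrogate, minimized by plain subgradient descent on made-up separable toy data. A real SVM solver would use a QP method such as SMO; this is only meant to show the margin constraints being satisfied.

```python
import numpy as np

# Toy linearly separable data: +1 above the line x1 + x2 = 3, -1 below.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Subgradient descent on the hinge-loss surrogate
#   lam * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)),
# a soft relaxation of: minimize ||w||^2 s.t. y_i (w.x_i + b) >= 1.
w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.1
for _ in range(5000):
    viol = y * (X @ w + b) < 1            # points violating y(wx+b) >= 1
    grad_w = 2 * lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w, b = w - lr * grad_w, b - lr * grad_b

print(np.round(y * (X @ w + b), 2))       # margins: near or above 1
print(np.all(y * (X @ w + b) > 0))        # True: data perfectly separated
```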

This constrained problem defines the support vector machine. Why is it a good idea?

1. It has a regularisation parameter, which makes the user think about avoiding over-fitting.

2. It uses the kernel trick, so you can build in expert knowledge about the problem via engineering the kernel

3. An SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO)

4. It is an approximation to a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.