{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning from Data: Workshop 4 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " | Date set     | Hand-in date |\n",
    "|:------------------|:-----------------------------------|\n",
    "|10th February 2017  | **12:00 Monday 27th February 2017** |\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "This workshop is worth 10% of the total module mark.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Candidate number: ** Put your candidate number here **  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your report should consist of your IPython notebook showing what you did, what was the\n",
    "result, and what you can conclude from the exercise. Each report will be\n",
    "assessed on the following criteria:\n",
    "\n",
    "* Does it record what was done in the exercise?\n",
    "* Does it permit the results to be reproduced?\n",
    "* How does the work relate to the theoretical foundations discussed in lectures?\n",
    "* Is it well presented?\n",
    "\n",
    "Just write comments on the results, etc, using markdown in new cells.\n",
    "\n",
    "### Submitting the notebooks\n",
    "\n",
    "Only an electronic submissions is required.  Submit your notebook (the .ipynb file) to electronic copy via the  [electronic hand-in system](http://empslocal.ex.ac.uk/submit/) using the topic <cmd>2017-02-27~ECM3420~Richard Everson~Workshop 4 Regression and RBF networks</cmd>.\n",
    "\n",
    "You should be able to upload the notebook directly from wherever it is on your machine.  If you're not sure where it is, type <code>pwd</code> (print working directory) in a cell to find out."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "Although you will undoubtedly work collaboratively in the workshops themselves, these are *individual* exercises.  The reports you write should be about the results *you* obtained, and your attention is drawn to the College and University guidelines on collaboration and plagiarism. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-02-09T23:45:56.550704",
     "start_time": "2017-02-09T23:45:55.846940"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "% pylab inline\n",
    "import wget\n",
    "figsize(10, 8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the following little data set which contains one-dimensional data vectors $\\mathbf{x}$ and $\\mathbf{t}$.  They are stored as two columns, which the following cell splits into two vectors, <code>x</code> and <code>t</code>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-02-09T23:45:58.877994",
     "start_time": "2017-02-09T23:45:58.741511"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import wget\n",
    "try: \n",
    "    X = loadtxt('linreg.txt')\n",
    "except IOError:\n",
    "    wget.download('http://empslocal.ex.ac.uk/people/staff/reverson/ECM3420/linreg.txt')\n",
    "    X = loadtxt('linreg.txt')\n",
    "\n",
    "print(X.shape)\n",
    "x = X[:,0]\n",
    "t = X[:,1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "  The data were generated\n",
    "  according to\n",
    "  \\begin{equation*}\n",
    "    t_n = w_0 + w_1 x_n + \\epsilon_n\n",
    "  \\end{equation*}\n",
    "  where $\\epsilon_n$ is Gaussian-distributed noise: $\\epsilon_n \\sim\n",
    "  \\mathcal{N}(0, \\sigma^2)$. Use linear regression to identify the coefficients\n",
    "  $w_0$ and $w_1$.  Recall that to do this you need to set up a\n",
    "  *design matrix* $\\mathbf{X}$ that contains the features and the dummy\n",
    "  feature $1$ to go with the bias coefficient $w_0$; thus\n",
    "  \\begin{align*}\n",
    "    \\mathbf{X} =\n",
    "    \\begin{bmatrix}\n",
    "      1 & x_1\\\\\n",
    "      1 & x_2\\\\\n",
    "      1 & x_3\\\\\n",
    "      \\vdots & \\vdots\\\\\n",
    "      1 & x_N\n",
    "    \\end{bmatrix}\n",
    "  \\end{align*}\n",
    "  With $\\mathbf{X}$ on hand, you can find the coefficients from:\n",
    "  \\begin{align*}\n",
    "    \\mathbf{w} = \\mathbf{X}^\\dagger \\mathbf{t}\n",
    "  \\end{align*}\n",
    "  where  $\\mathbf{t}$ is the vector of the targets and $\\mathbf{X}^\\dagger$ is the\n",
    "  pseudo-inverse of $\\mathbf{X}$. Use <code>np.linalg.pinv</code> or \n",
    "  construct it yourself as $\\mathbf{X}^\\dagger = (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T$ -- see the lecture slides.\n",
    "  \n",
    "Plot the data and the\n",
    "  regression line.  Measure the correlation between the features and\n",
    "  targets.  How does it relate to the coefficients?\n",
    "\n",
    "  Estimate the variance of the noise by find the variance of the\n",
    "  differences between your prediction of the targets and the actual\n",
    "  targets.  Thus if $y_n = w_0 + w_1 x_n$ is the prediction of the $n$th\n",
    "  target, then you could estimate the variance $\\sigma^2$ as:\n",
    "  \\begin{align*}\n",
    "    \\sigma^2 = \\frac{1}{N} \\sum_{n=1}^N (t_n - y_n)^2\n",
    "  \\end{align*}\n",
    "  Does your estimate of the variance make sense in terms of the average\n",
    "  deviation of the targets from the prediction?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Robust linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Standard regression analysis minimises the  squared\n",
    "error between the regression line and the data, namely:\n",
    "\\begin{equation*}\n",
    "    E_2(\\mathbf{w}) = \\sum_{n=1}^N (t_n - y_n(\\mathbf{x}; \\mathbf{w}) )^2\n",
    "  \\end{equation*}\n",
    "  where $y_n(\\mathbf{w}) = w_0 + w_1 x_n$ and $\\mathbf{w} = (w_0, w_1)$. Recall that the\n",
    "  sum of squares error function $E_2$ comes from the negative log\n",
    "  likelihood and the assumption that the errors are normally (Gaussian) distributed.\n",
    "\n",
    "A heavy-tailed distribution that is more appropriate if there are\n",
    "  occasional large deviations from the systematic trend is the Laplacian\n",
    "  distribution:\n",
    "  \\begin{align*}\n",
    "    p(\\epsilon_n) = p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w}) \\propto \\exp\n",
    "    \\left\\{\n",
    "      - \\frac{| \\epsilon_n | }{\\sigma}\n",
    "    \\right\\}\n",
    "  \\end{align*}\n",
    "Substitute this expression for $p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w})$ into the\n",
    "  general expression for an error function $E(\\mathbf{w}) = -\\sum_{n=1}^N \\log\n",
    "  p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w}) $ to show that the error function that arises\n",
    "  from this noise distribution is\n",
    "  \\begin{equation*}\n",
    "    E_1(\\mathbf{w}) = \\sum_{n=1}^N |t_n - y_n(\\mathbf{x}; \\mathbf{w}) |\n",
    "  \\end{equation*}\n",
    "\n",
    "\n",
    "Please use the LaTeX mark-up to display the derivation.  You may find cutting and pasting from this cell convenient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file <code>outlier.txt</code> contains the same data as the one-dimensional linear regression data that you have just been using, but with one target value to be far from the general trend in the data.  You can download it and split it into features and targets with the following."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-02-09T23:46:35.551923",
     "start_time": "2017-02-09T23:46:35.342247"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "try: \n",
    "    X = loadtxt('outlier.txt')\n",
    "except IOError:\n",
    "    wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/outlier.txt')\n",
    "    X = loadtxt('outlier.txt')\n",
    "print(X.shape)\n",
    "x = X[:,0]\n",
    "t = X[:,1]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot $t_n$ versus $x_n$ and find the\n",
    "  linear regression line for these data using $E_2$. Notice how the\n",
    "  regression line is grossly affected by the single outlier.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write a\n",
    "  routine to fit a find a straight fitting the data by minimising\n",
    "  $E_1(\\mathbf{w})$.  Note that the pseudo-inverse will not work here.  One\n",
    "  possibility is to find the minimum error by trying a grid of combinations\n",
    "  of $w_0$ and $w_1$.  From your plot of the data you should be able to\n",
    "  estimate appropriate ranges of $w_0$ (the intercept) and $w_1$ (the\n",
    "  gradient) to search.   If you adopt this approach it is nice to plot a\n",
    "  contour or <code>pcolor</code> representation of $E_1(\\mathbf{w})$ as a function of\n",
    "  $w_0$ and $w_1$.\n",
    "\n",
    "   Plot and compare your fitted line with the line derived from the\n",
    "  squared error (all on the same graph).\n",
    "\n",
    "  Searching a grid like this works well when there are just two\n",
    "  coefficients to be found, but is computationally very expensive when\n",
    "  there are many. An alternative is to use a numerical minimiser such as\n",
    "  <code>scipy.optimize.fmin</code> to locate the minimum -- you might start the search\n",
    "  at the solution to the $E_2$ problem.  For example, the following cell will minimise the bannana function of two variables from the starting point <code>x0</code>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import scipy.optimize\n",
    "\n",
    "def banana(x):\n",
    "     return 100*(x[1]-x[0]**2)**2+(1-x[0])**2\n",
    "\n",
    "xopt = scipy.optimize.fmin(func=banana, x0=[-1.2,1])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how the $E_1$ regression line passes close to the majority of the data because the outlier carries less weight than in the $E_2$ case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Radial basis function regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the demonstration of radial basis functions that I showed in a lecture with a couple of modifications.\n",
    "\n",
    "The first cell just defines a generator that produces colours in turn, which is useful for plotting later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-02-09T23:48:16.570875",
     "start_time": "2017-02-09T23:48:16.567482"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from itertools import cycle\n",
    "colour = cycle(\"bgrcmykw\")\n",
    "for i in range(10):\n",
    "    print(next(colour))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Make some data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "N = 20\n",
    "xtr = rand(N)*3\n",
    "xtr[:N//2] += 4\n",
    "xtr = sorted(xtr)         # Sorting helps visualise the design matrix later\n",
    "ttr = sin(xtr) + randn(N)*0.2\n",
    "\n",
    "xte = rand(N)*3\n",
    "xte[:N//2] += 4\n",
    "xte = sorted(xte)         # Sorting helps visualise the design matrix later\n",
    "tte = sin(xte) + randn(N)*0.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot(xtr, ttr, 'bo')\n",
    "xx = linspace(0, 7, 200)\n",
    "plot(xx, sin(xx), 'g')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define radial basis functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "\\phi(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma} \\exp\\left\\{-x^2/(2\\sigma^2)\\right\\}\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def phi(x, c, sigma=0.1):\n",
    "    \"\"\"Radial basis function centred at c with \"radius\" sigma\"\"\"\n",
    "    return exp(-(x-c)**2/(2*sigma**2))/(sqrt(2*pi)*sigma)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def plot_phi():\n",
    "    x = linspace(-1, 5, 200)\n",
    "    c = 1.5\n",
    "    plot(x, phi(x, c, 0.25))\n",
    "    title('Basis function centred at %g' % c)\n",
    "figure(figsize=(10,4))\n",
    "plot_phi()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choose centres and find the activations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We could choose the centres randomly, but here we'll choose \n",
    "# every other one to get an even spread\n",
    "M = 10\n",
    "I = np.random.choice(N, M, replace=False)\n",
    "I = sorted(I)           # Only useful for plotting\n",
    "print(I)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the centres $x_m$ and the activations of each of the data points $\\phi(x_n-x_m)$.  Note how the activations are large only for the points close to a particular centre."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sigma = 0.5     # Choose the width of the basis functions\n",
    "\n",
    "fig, ax = plt.subplots(2, 1, sharex=True)\n",
    "\n",
    "ax[0].plot(xtr, ttr, 'bo')\n",
    "ax[0].plot(xx, sin(xx), 'g')\n",
    "\n",
    "\n",
    "for i in I:\n",
    "    colour = next(colour)\n",
    "    ax[1].hold(True)\n",
    "    ax[1].plot(xx, phi(xx, xtr[i], sigma=sigma), c=colour)\n",
    "    activation = phi(xtr, xtr[i], sigma=sigma)\n",
    "    ax[1].plot(xtr, activation, ls='', c=colour, marker='o')\n",
    "    ax[0].plot(xtr[i], 0.0, marker='s', c=colour)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Construct a design matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\\begin{align*}\n",
    "    \\newcommand{\\bx}{\\mathbf{x}}\n",
    "    \\newcommand{\\bX}{\\mathbf{X}}\n",
    "    \\newcommand{\\bw}{\\mathbf{w}}\n",
    "      \\bX =\n",
    "      \\begin{bmatrix}\n",
    "        1 &\\phi_1(\\bx_1) & \\phi_2(\\bx_1) & \\ldots & \\phi_M(\\bx_1)\\\\\n",
    "        1& \\phi_1(\\bx_2) & \\phi_2(\\bx_2) & \\ldots & \\phi_M(\\bx_2)\\\\\n",
    "        1 & \\phi_1(\\bx_3) & \\phi_2(\\bx_3) & \\ldots & \\phi_M(\\bx_3)\\\\\n",
    "        \\vdots & & & & \\vdots\\\\\n",
    "        1 & \\phi_1(\\bx_N) & \\phi_2(\\bx_N) & \\ldots & \\phi_M(\\bx_N)\\\\\n",
    "      \\end{bmatrix}\n",
    "    \\end{align*}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X = zeros((N,M+1))\n",
    "X[:,0] = 1    # Bias\n",
    "for m, i in enumerate(I):\n",
    "    activation = phi(xtr, xtr[i], sigma=sigma)\n",
    "    X[:,m+1] = activation\n",
    "\n",
    "imshow(X, interpolation='nearest')\n",
    "colorbar()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solve for the weights using the pseudo-inverse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "w = np.linalg.pinv(X) @ ttr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "figure(figsize=(10,6))\n",
    "plot(w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predictions $y(x)$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\\begin{align*}\n",
    "   y(\\mathbf{x}; \\mathbf{w}) &= w_0 + \\sum_{m=1}^M w_m \\phi(\\mathbf{w}-\\mathbf{x}_m)\\\\\n",
    "   &= w_0 + \\sum_{m=1}^M w_m \\phi_m(\\mathbf{x})\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Make predictions at lots of points to get a smooth curve\n",
    "Npred = 200\n",
    "xtest = linspace(0, 7, Npred)\n",
    "\n",
    "X = zeros((Npred,M+1))\n",
    "X[:,0] = 1    # Bias\n",
    "for m, i in enumerate(I):\n",
    "    activation = phi(xtest, xtr[i], sigma=sigma)\n",
    "    X[:,m+1] = activation\n",
    "\n",
    "ytest = X@w\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "figure(figsize=(10,6))\n",
    "plot(xtest, ytest, 'r')\n",
    "plot(xtr, ttr, 'bo')\n",
    "plot(xx, sin(xx), 'g')\n",
    "\n",
    "for i in I:\n",
    "    plot(xtr[i], ttr[i], 'rs')\n",
    "ylim(ymin=-4, ymax=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploring $\\sigma$ and $M$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the above code as a model write a function\n",
    "\n",
    "    rbf(xtr, ttr, xte, M, sigma)\n",
    "\n",
    "that will use the training data <code>xtr</code> and <code>ttr</code> to make predictions for the features <code>xte</code>, using <code>M</code> centres and a width <code>sigma</code> for the radial basis functions.   Your function should return the predictions for <code>xte</code> and the vector of coefficients <code>w</code>.\n",
    "\n",
    "Use your function to explore the effect of changing $M$ and $\\sigma$.  What happens when they are large and small?  Plot both the predictions and the weights.  Notice that $\\sigma$ has a smoothing effect when it is large and that poor predictions are made when $\\sigma$ is too small.  Also explore what happens when $M$ is large (it can't be larger than the number of training points).  Notice that in this situation the weights can become very large in magnitude and if $\\sigma$ is not large enough to provide lots of smoothing then the predictions are very poor where there is not much data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choosing a few centres uniformly at random, as we have done here, seems like a good idea because it should ensure that there are a basis functions covering every region of the feature space.  However, if we are unlucky the basis functions all end up close to each other, leaving other parts of the space without any coverage.  You may have noticed this happening as you experimented. Two alternatives to counter this are:\n",
    "\n",
    "*  Cluster the data and place a basis function at the centre of each cluster.  Here we still have to choose the number of centres to use.\n",
    "\n",
    "* Put a basis function on *every* point. This could be expensive with lots of data, but we won't need to worry about this.  However, the weights may become very large and the predictions very poor, particularly if $\\sigma$ is small.\n",
    "\n",
    "It would be nice to have a radial basis function regressor that didn't require us to choose $M$ and $\\sigma$.  To achieve this we'll put a basis function on every training data point, removing the need to choose $M$.  A reasonable strategy for choosing $\\sigma$ is to set it equal to a multiple of the average distance to neighbouring data points. \n",
    "\n",
    "Copy your <code>rbf</code> function to make a new function that chooses $M = N$ and $\\sigma$ to be, say 10 times, the average distance to the nearest neighbours.  How well does this perform?  Quantify how well it does by calculating the root mean squared error on the test data.  Is 10 times the average nearest neighbour distance a reasonable choice?  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overfitting and regularisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Minimising the mean squared error on the training data can lead to **overfitting**, where the training data is fitted very well, but the test data is fitted poorly: the model does not **generalise** well.  This is because the model weights have been learned so that they fit not only the systematic trends in the data, but also the noise.  This is particularly a problem when there are lots of weights because in this case there is lots of flexibility in the model so that it is possible to fit the noise.\n",
    "\n",
    "One way to counteract this in RBF regression is to adjust $\\sigma$ so that it provides enough smoothing.   However, this limits the expressiveness of the model and a more general way of controlling the flexibility of the model is to prevent the weights from becoming too large.  You will have noticed that when the model is overfitting it tends to have large weights and it is intuitively clear that if the output of the model can only be large or change rapidly with $x$ if the weights are large. \n",
    "\n",
    "We can thus **regularise** the model by adding a penalty to the usual error function that makes the error large if the weights are large.  The penalised error function is\n",
    "\\begin{align}\n",
    "    E(\\mathbf{w}) = E_{data}(\\mathbf{w}) + \\alpha ||\\mathbf{w}||^2\n",
    "\\end{align}\n",
    "where $E_{data}(\\mathbf{w}) = E_2(\\mathbf{w}) $ is the mean squared error function that we have been using that measures the difference between the data and the output of the model, and $||\\mathbf{w}||^2$ is the sum of the squares of the weights.  Thus the usual error function, the first term, is penalised by the second term: if (over)fitting the data would lead to large weights then that also means that the second term and thus the overall error is large.  Consequently minimising this error term arrives at a balance between fitting the data well and having small weights, effectively controlling the smoothness of the model.  The coefficient $\\alpha$ controls how important the penalty is.  If $\\alpha$ is small, the penalty is unimportant and the weights can be large; if $\\alpha$ is large, the penalty means that the weights must be small and the output of the model smooth.  We will have to choose $\\alpha$.\n",
    "\n",
    "This is known as **weight decay regularisation** because it tends to make the weights small. Do some reading about weight decay regularisation; any of the recommended books will do.\n",
    "\n",
    "A nice feature of WDR for regression is that $E(\\mathbf{w})$ is still quadratic and so the optimum weights can be found by linear algebra.   The regularised weights are found as:\n",
    "\\begin{align}\n",
    "   \\newcommand{\\bX}{\\mathbf{X}}\n",
    "\\mathbf{w} = (\\bX^T \\bX + \\alpha\\mathbf{I})^{-1} \\bX^T\\mathbf{t}\n",
    "\\end{align}\n",
    "where $\\mathbf{X}$ is the design matrix as above and $\\mathbf{t}$ is the vector of training targets.  Note that when $\\alpha = 0$ we recover the expression for the pseudo-inverse."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Weight decay regularisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copy and modify your RBF regression function to make a function     \n",
    "\n",
    "    rbfwdr(xtr, ttr, xte, sigma, alpha)\n",
    "\n",
    "that will use the training data <code>xtr</code> and <code>ttr</code> to make predictions for the features <code>xte</code> using a weight decay regularisation coefficent <code>alpha</code> and RBF width <code>sigma</code>.  As before your function should return the predictions for <code>xte</code> and the vector of coefficients <code>w</code>.\n",
    "\n",
    "Plot graphs of the predictions and weights with large and small $\\alpha$ and verify that it does control the smoothness of the model output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choosing $\\alpha$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming that you have a reasonable value for $\\sigma$, it remains to choose $\\alpha$ which is done by cross validation.   The data set <code>nonlinreg-train.txt</code> and <code>nonlinreg-test.txt</code> contain features and targets for another one-dimensional, nonlinear regression problem.  Use your <code>rbfwdr</code> function to make predictions for these data and choose $\\alpha$ by estimating the generalisation error with cross validation.  You will need to evaluate the training and validation errors for $\\alpha$ over a wide range, perhaps $10^{-5}$ to $10^1$; the function <code>logspace</code> is useful for producing equally spaced values: \n",
    "\n",
    "    alpha = logspace(-5, 1, 20)\n",
    "    print(alpha)\n",
    "    \n",
    "    [  1.00000000e-05   2.06913808e-05   4.28133240e-05   8.85866790e-05\n",
    "       1.83298071e-04   3.79269019e-04   7.84759970e-04   1.62377674e-03\n",
    "       3.35981829e-03   6.95192796e-03   1.43844989e-02   2.97635144e-02\n",
    "       6.15848211e-02   1.27427499e-01   2.63665090e-01   5.45559478e-01\n",
    "       1.12883789e+00   2.33572147e+00   4.83293024e+00   1.00000000e+01]\n",
    "\n",
    "Plot a graph of the training and validation errors versus $\\alpha$ (<code>semilogx</code> is useful) and so choose the best $\\alpha$ as the one that minimises the validataion error.\n",
    "\n",
    "You probably won't need to use leave-one-out cross validation, 5-fold cross validation will probably be sufficient.  You can either write your own or you could use the functions in scikit-learn; see <http://scikit-learn.org/stable/modules/cross_validation.html>.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell loads the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "name = 'nonlinreg-train.txt'\n",
    "try: \n",
    "    X = loadtxt(name)\n",
    "except:\n",
    "    wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/'+name)\n",
    "    X = loadtxt(name)\n",
    "print(X.shape)\n",
    "xtr = X[:,0]\n",
    "ttr = X[:,1]\n",
    "\n",
    "name = 'nonlinreg-test.txt'\n",
    "try: \n",
    "    X = loadtxt(name)\n",
    "except IOError:\n",
    "    wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/'+name)\n",
    "    X = loadtxt(name)\n",
    "print(X.shape)\n",
    "xte = X[:,0]\n",
    "tte = X[:,1]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Evaluate your final, cross validated model on the actual test data.  There are many more examples of this so you should get an accurate result, but make sure you haven't used the test data during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Upto now we have used a value of $\\sigma$ chosen by rule-of-thumb or by a bit of experimentation.  A nicer way to choose it would be by cross validation too.  However, since $\\sigma$ and $\\alpha$ interact, you really need to find the best value for them both jointly, by searching a grid of possible $(\\sigma, \\alpha)$ combinations.\n",
    "\n",
    "Do this for these data and plot (<code>pcolor</code>) the validation error versus \n",
    "$\\sigma$ and $\\alpha$.  Evaluate the performance of your optimised model on the test data.\n",
    "\n",
    "(There is machinery in sklearn to help with this, but it involves wrapping your RBF network in a class and providing some methods -- see [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).  It's probably easiest just to write generate your own $(\\sigma, \\alpha)$ pairs using nested loops.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "285px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}