{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning from Data: Workshop 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workshop is not assessed, but if you give me the completed notebook, I'll give you formative feedback on it.\n", "\n", "There are three sections and quite a lot of work if you do all of them:\n", "\n", "1. Constructing a classifier based on the Mahalanobis distance\n", "1. Probabilities, transformations of probabilities and Gaussians\n", "1. Linear regression and robust linear regression. This will form part of the assessed work for the next workshop, so you might want to start it now.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:45:32.510110", "start_time": "2017-02-02T23:45:31.939272" }, "collapsed": false }, "outputs": [], "source": [ "%pylab inline\n", "figsize(10, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifier using Mahalanobis distances\n", "\n", "Use the Mahalanobis distance to construct a classifier that\n", "discriminates between the two classes.\n", "\n", "Make plots similar to those in lectures that show the minimum Mahalanobis\n", " distance from every point in a grid of $(x, y)$ points. One way of doing\n", " this is along the following lines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:45:35.183036", "start_time": "2017-02-02T23:45:34.883582" }, "collapsed": false }, "outputs": [], "source": [ "# Load and standardise the tremor data\n", "import wget\n", "\n", "try: \n", " X = loadtxt('raw-tremor.txt')\n", "except IOError:\n", " wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/raw-tremor.txt')\n", " X = loadtxt('raw-tremor.txt')\n", "\n", "t = X[:,2]\n", "X = X[:,:2]\n", "print(t.shape, X.shape)\n", "\n", "normal = t == 1\n", "patient = t == 0\n", "\n", "\n", "plot(X[normal,0], X[normal,1], 'bo', label='Normal')\n", "plot(X[patient,0], X[patient,1], 'rs', label='Patient')\n", "legend(loc=2)\n", "title('Standardised data')\n", "\n", "# Divide into training and test sets\n", "from numpy.random import permutation\n", "N = X.shape[0]\n", "I = permutation(N) # Shuffled indices 0,..., N-1\n", "Itr = I[:N//2]\n", "Ite = I[N//2:]\n", "\n", "Xtr = X[Itr,:]\n", "ttr = t[Itr]\n", "\n", "Xte = X[Ite,:]\n", "tte = t[Ite]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:45:36.554096", "start_time": "2017-02-02T23:45:36.227808" }, "collapsed": false }, "outputs": [], "source": [ "N = 30; # Number of grid points in x direction\n", "M = 40; # Number of grid points in y direction\n", "xloc = linspace(X[:,0].min(), X[:,0].max(), N); # Vector of x coordinates to use\n", "yloc = linspace(X[:,1].min(), X[:,1].max(), M); # Vector of y coordinates to use\n", "D = zeros((M, N)); # Space for the distances\n", "\n", "for nx, x in enumerate(xloc):\n", " for ny, y in enumerate(yloc):\n", " D[ny,nx] = x**2\n", " # Here is where you calculate the minimum Mahalanobis distance \n", " # between the point (x, y) and the centre of each of the classes\n", "\n", "\n", "pcolor(xloc, yloc, D, cmap=cm.gray) # False colour rendition of D\n", "axis('tight')\n", "colorbar()\n", "# Draw the data. The names of your variables might be different\n", "I = ttr == 1\n", "scatter(Xtr[I,0], Xtr[I,1], c='b')\n", "I = ttr == 0\n", "scatter(Xtr[I,0], Xtr[I,1], c='r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modify your code so that it not only draws the picture, but it also\n", "classifies data into the two classes.\n", "Find the classification rates (that is, the proportion of examples that\n", "are assigned to the correct class) for the training and test data\n", "Xte and tte. Remember that when you are classifying the\n", "test data, you should use the covariance matrices from the\n", "*training* data to calculate the Mahalanobis distances. Why\n", "are the classification rates for the training and test data different?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple modification to your code is to use the ordinary Euclidean distance. How does this change the classification rate?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could also use the Mahalanobis classifier to classify the digits data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PDFs and histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Generate 1000 uniformly distributed\n", " samples from a Normal/Gaussian density with zero mean and variance 1,\n", " $x_n \\sim \\mathcal{N}(0, 1)$, using the randn function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the samples using the hist function. By\n", " default hist uses 10 bins, but experiment with changing the number of\n", " bins. The number of samples you would expect to fall in a histogram bin of\n", " width $\\Delta$ centred at $x$ would be $p(x) \\Delta$ where $p(x)$ is the\n", " probability density function (pdf). Since you know the pdf for this\n", " Gaussian density (mean $0$ and variance $1$) is\n", " \\begin{equation*}\n", " p(x) = \\frac{1}{\\sqrt{2\\pi}} \\exp\n", " \\left\\{\n", " -\\frac{x^2}{2}\n", " \\right\\}\n", " \\end{equation*}\n", " compare the histogram with the expected number of samples by drawing the\n", " curve of the expected number of samples over the histogram.\n", " Calling hist like counts, edges, patches = hist(...) will return the counts in each bin, the locations of the bin edges (and the graphical patches making the histogram figure) to help work\n", " out the relevant $x$ and $\\Delta$ for each bin; see\n", " the doc-string for hist.\n", "\n", " What are the trade-offs involved in choosing the\n", " number of bins? What do you think is the best number of bins for these data? Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PDFs and expectations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Generate 1000 uniformly distributed\n", " samples $x_n \\sim \\mathcal{U}(0,1)$ using the rand\n", " function. Plot a histogram of them using the function hist.\n", "\n", " Write a little function to find the variance of the samples and check\n", " your result using the var function. Work out an expression for\n", " the variance as the expected value:\n", " \\begin{align*}\n", " \\mathbb{E}((x-\\mu)^2) &= \\int (x-\\mu)^2 p(x)\\,dx\\\\\n", " &= \\int_0^1 (x-\\mu)^2 \\,dx\n", " \\end{align*}\n", " (For simplicity, assume that you know the mean, $\\mu = 0.5$.)\n", "\n", " Calculate the variance using $5, 10, 20, 50, 100, 500, \\ldots $ samples:\n", " how does the sample variance behave as the number of samples increases?\n", "\n", " Generalise your result for the variance of $\\mathcal{U}(0,1)$ to\n", " $\\mathcal{U}(a,b)$. If you're feeling like an exercise in integration (it's good for the soul),\n", " show that the variance of a Gaussian density\n", " \\begin{equation*}\n", " p(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma} \\exp\n", " \\left\\{\n", " -\\frac{x^2}{2\\sigma^2}\n", " \\right\\}\n", " \\end{equation*}\n", " is $\\sigma^2$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gaussian densities" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Here is a little demo of sampling from a Gaussian density in two dimensions.\n", "\n", "First create a mean vector and a covariance matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:46:17.062696", "start_time": "2017-02-02T23:46:17.056702" }, "collapsed": false }, "outputs": [], "source": [ "mu = np.array([3, 2])\n", "print(\"Mean\")\n", "print(mu)\n", "\n", "\n", "Sigma = np.array([[5.5, -4.5], [-4.5,5.5]])\n", "print(\"Covariance\")\n", "print(Sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need a function to evaluate the pdf of a Normal density. Surprisingly one isn't readily available, so here's one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:46:21.893392", "start_time": "2017-02-02T23:46:21.886896" }, "collapsed": false }, "outputs": [], "source": [ "def gaussian_pdf(x, mu, Sigma):\n", " \"\"\"\n", " Evaluate the probability of each row of x using a multivariate Gaussian density\n", " with mean mu and covariance Sigma.\n", " \"\"\"\n", " assert len(mu.shape) == 1\n", " assert len(Sigma.shape) == 2\n", " d = mu.shape[0]\n", " assert d == Sigma.shape[0] and d == Sigma.shape[1]\n", " invSigma = np.linalg.inv(Sigma) # Should be invertible\n", " dx = x - mu\n", " fact = np.sum(np.dot(dx, invSigma)*dx, axis=1)\n", " p = np.exp(-fact/2)/np.sqrt((2*np.pi)**d*np.linalg.det(Sigma))\n", " return p\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:46:23.708795", "start_time": "2017-02-02T23:46:23.024630" }, "collapsed": false }, "outputs": [], "source": [ "# Plot contours of probability over a region (-5, 10) x (-4, 10)\n", "# In this cell we just set up the grid of locations at which the pdf will be evaluated.\n", "Nx = 40\n", "Ny = 50\n", "cx = linspace(-5, 10, Nx)\n", "cy = linspace(-4, 10, Ny)\n", "\n", "xx, yy = meshgrid(cx, cy)\n", "# xx and yy contain the x and y coordinates for the grid. Plot them to see what's going on.\n", "subplot(1,2,1)\n", "pcolor(cx, cy, xx)\n", "subplot(1,2,2)\n", "pcolor(cx, cy, yy)\n", "\n", "# Turn the separate x and y locations into a matrix where each row is a single (x,y) location\n", "xy = vstack((xx.ravel(), yy.ravel())).T\n", "print(xy.shape)\n", "\n", "# Evaluate the pdf at every point on the grid. \n", "p = gaussian_pdf(xy, mu, Sigma)\n", "print(p.shape)\n", "p = reshape(p, xx.shape)\n", "figure()\n", "cplot = contour(cx, cy, p)\n", "clabel(cplot, inline=1, fontsize=10)\n", "axis('scaled')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Draw the contours again with some samples on top \n", "cplot = contour(cx, cy, p)\n", "clabel(cplot, inline=1, fontsize=10)\n", "\n", "# Draw some samples using the multivariate_normal function\n", "from numpy.random import multivariate_normal\n", "samples = multivariate_normal(mu, Sigma, 200)\n", "plot(samples[:,0], samples[:,1], 'bo', markerfacecolor='none')\n", "axis('scaled')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a similar manner to above, plot\n", " 200 samples and contours of two-dimensional Gaussian density\n", " functions with the following means and covariances:\n", " \\begin{align*}\n", " \\newcommand{\\bmu}{\\boldsymbol{\\mu}}\n", " \\bmu &=\n", " \\begin{bmatrix}\n", " 1.0\\\\ 0.0\n", " \\end{bmatrix} \\qquad\\qquad\n", " \\Sigma =\n", " \\begin{bmatrix}\n", " 0.2 & 0.0\\\\\n", " 0.0 & 0.4\n", " \\end{bmatrix}\\\\\n", " \\bmu &=\n", " \\begin{bmatrix}\n", " 1.0\\\\ 0.0\n", " \\end{bmatrix} \\qquad\\qquad\n", " \\Sigma =\n", " \\begin{bmatrix}\n", " 0.4 & 0.05\\\\\n", " 0.05 & 0.1\n", " \\end{bmatrix}\\\\\n", " \\bmu &=\n", " \\begin{bmatrix}\n", " 1.0\\\\ 1.0\n", " \\end{bmatrix} \\qquad\\qquad\n", " \\Sigma =\n", " \\begin{bmatrix}\n", " 0.1 & 0.0\\\\\n", " 0.0 & 0.1\n", " \\end{bmatrix}\\\\\n", " \\end{align*}\n", "\n", "Make sure that you use the axis('scaled') command to \n", "ensure that the axes are scaled so that circles come out as circles and\n", "not as ellipses. This is important in order to relate the shape of the\n", "contours to the covariance matrices. In each case say what in the\n", "covariance matrices gives the contours of constant probability their\n", "particular shape and orientation.\n", "\n", "For each of the Gaussians calculate the covariance matrix directly\n", "from the samples and make sure that it (approximately) corresponds to\n", "the covariance matrix you used to generate the samples. Either use cov or, \n", "just as easy, write your own." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the following little data set which contains one-dimensional data vectors $\\mathbf{x}$ and $\\mathbf{t}$. They are stored as two columns, which the following cell splits into two vectors, x and t" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import wget\n", "try: \n", " X = loadtxt('linreg.txt')\n", "except IOError:\n", " wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/linreg.txt')\n", " X = loadtxt('linreg.txt')\n", "\n", "print(X.shape)\n", "x = X[:,0]\n", "t = X[:,1]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The data were generated\n", " according to\n", " \\begin{equation*}\n", " t_n = w_0 + w_1 x_n + \\epsilon_n\n", " \\end{equation*}\n", " where $\\epsilon_n$ is Gaussian-distributed noise: $\\epsilon_n \\sim\n", " \\mathcal{N}(0, \\sigma^2)$. Use linear regression to identify the coefficients\n", " $w_0$ and $w_1$. Recall that to do this you need to set up a\n", " *design matrix* $\\mathbf{X}$ that contains the features and the dummy\n", " feature $1$ to go with the bias coefficient $w_0$; thus\n", " \\begin{align*}\n", " \\mathbf{X} =\n", " \\begin{bmatrix}\n", " 1 & x_1\\\\\n", " 1 & x_2\\\\\n", " 1 & x_3\\\\\n", " \\vdots & \\vdots\\\\\n", " 1 & x_N\n", " \\end{bmatrix}\n", " \\end{align*}\n", " With $\\mathbf{X}$ on hand, you can find the coefficients from:\n", " \\begin{align*}\n", " \\mathbf{w} = \\mathbf{X}^\\dagger \\mathbf{t}\n", " \\end{align*}\n", " where $\\mathbf{t}$ is the vector of the targets and $\\mathbf{X}^\\dagger$ is the\n", " pseudo-inverse of $\\mathbf{X}$. Use np.linalg.pinv or \n", " construct it yourself as $\\mathbf{X}^\\dagger = (\\mathbf{X}^T \\mathbf{X})^{-1} \\mathbf{X}^T$ -- see the lecture slides.\n", "\n", " Plot the data and the\n", " regression line. Measure the correlation between the features and\n", " targets. How does it relate to the coefficients?\n", "\n", " Estimate the variance of the noise by find the variance of the\n", " differences between your prediction of the targets and the actual\n", " targets. Thus if $y_n = w_0 + w_1 x_n$ is the prediction of the $n$th\n", " target, then you could estimate the variance $\\sigma^2$ as:\n", " \\begin{align*}\n", " \\sigma^2 = \\frac{1}{N} \\sum_{n=1}^N (t_n - y_n)^2\n", " \\end{align*}\n", " Does your estimate of the variance make sense in terms of the average\n", " deviation of the targets from the prediction?" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Robust linear regression" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "_This will probably appear as part of the next (assessed) workshop too, so now might be a good time to do it!_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard regression analysis minimises the squared\n", "error between the regression line and the data, namely:\n", "\\begin{equation*}\n", " E_2(\\mathbf{w}) = \\sum_{n=1}^N (t_n - y_n(\\mathbf{x}; \\mathbf{w}) )^2\n", " \\end{equation*}\n", " where $y_n(\\mathbf{w}) = w_0 + w_1 x_n$ and $\\mathbf{w} = (w_0, w_1)$. Recall that the\n", " sum of squares error function $E_2$ comes from the negative log\n", " likelihood and the assumption that the errors are normally (Gaussian) distributed.\n", "\n", "A heavy-tailed distribution that is more appropriate if there are\n", " occasional large deviations from the systematic trend is the Laplacian\n", " distribution:\n", " \\begin{align*}\n", " p(\\epsilon_n) = p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w}) \\propto \\exp\n", " \\left\\{\n", " - \\frac{| \\epsilon_n | }{\\sigma}\n", " \\right\\}\n", " \\end{align*}\n", " Substitute this expression for $ p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w})$ into the\n", " general expression for an error function $E(\\mathbf{w}) = -\\sum_{n=1}^N \\log\n", " p(t_n \\,|\\, \\mathbf{x}_n, \\mathbf{w}) $ to show that the error function that arises\n", " from this noise distribution is\n", " \\begin{equation*}\n", " E_1(\\mathbf{w}) = \\sum_{n=1}^N |t_n - y_n(\\mathbf{x}; \\mathbf{w}) |\n", " \\end{equation*}\n", "\n", "\n", "You can (hand)write this out on paper rather than typing it out in LaTeX." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file outlier.txt contains the same data as the one-dimensional linear regression data that you used last workshop. You can download it and split it into features and targets with the following." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "try: \n", " X = loadtxt('outlier.txt')\n", "except IOError:\n", " wget.download('http://empslocal.ex.ac.uk/~reverson/ECM3420/outlier.txt')\n", " X = loadtxt('outlier.txt')\n", "print(X.shape)\n", "x = X[:,0]\n", "t = X[:,1]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot $t_n$ versus $x_n$ and find the\n", " linear regression line for these data using $E_2$. Notice how the\n", " regression line is grossly affected by the single outlier.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a\n", " routine to fit a find a straight fitting the data by minimising\n", " $E_1(\\mathbf{w})$. Note that the pseudo-inverse will not work here. One\n", " possibility is to find the minimum error by trying a grid of combinations\n", " of $w_0$ and $w_1$. From your plot of the data you should be able to\n", " estimate appropriate ranges of $w_0$ (the intercept) and $w_1$ (the\n", " gradient) to search. If you adopt this approach it is nice to plot a\n", " contour or pcolor representation of $E_1(\\mathbf{w})$ as a function of\n", " $w_0$ and $w_1$.\n", "\n", " Plot and compare your fitted line with the line derived from the\n", " squared error (all on the same graph).\n", "\n", " Searching a grid like this works well when there are just two\n", " coefficients to be found, but is computationally very expensive when\n", " there are many. An alternative is to use a numerical minimiser such as\n", " scipy.optimize.fmin to locate the minimum -- you might start the search\n", " at the solution to the $E_2$ problem. For example, the following cell will minimise the bannana function of two variables from the starting point x0." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-02T23:48:02.921155", "start_time": "2017-02-02T23:48:02.732956" }, "collapsed": false }, "outputs": [], "source": [ "import scipy.optimize\n", "\n", "def banana(x):\n", " return 100*(x[1]-x[0]**2)**2+(1-x[0])**2\n", "\n", "xopt = scipy.optimize.fmin(func=banana, x0=[-1.2,1])\n", "\n", "print(xopt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "159px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }