{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning from Data: Workshop 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This workshop is not assessed per se, but I will be happy to talk about it and parts of it may appear in a later workshop." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-02-17T09:22:09.006376", "start_time": "2017-02-17T09:22:08.337632" }, "collapsed": false }, "outputs": [], "source": [ "%pylab inline\n", "figsize(10, 8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision boundaries and the Bayes error rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Bayes error rate is the minimum error rate that can be achieved with perfect knowledge of the probability distributions involved. In this exercise we will assume that the *class conditional probability densities* are known and are one-dimensional Gaussian densities:\n", "\\begin{align*}\n", "p(x | C_0) = \\mathcal{N}(x; \\mu_0, \\sigma^2_0)\\\\\n", "p(x | C_1) = \\mathcal{N}(x; \\mu_1, \\sigma^2_1)\n", "\\end{align*}\n", "For convenience, take $\\mu_0 = 0$ and $\\sigma^2_i = 1$ for $i = 0, 1$. Remember that these give the probability of observing a feature $x$ given that we know which class it belongs to." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will take the *prior probabilities* $p(C_0)$ and $p(C_1)$ to be $0.3$ and $0.7$ respectively. These are the probabilities that an observation is in one class or the other before we have observed the actual data point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data generation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can generate some data from this model. The procedure is:\n", "\n", "    for each data point:\n", "        draw the class, t, according to the prior probabilities\n", "        draw the feature, x, from a Gaussian distribution with the mean and variance given by the class\n", "\n", "Write a loop to generate vectors of pairs $(x, t)$. Draw lots of samples (say 10,000) and plot histograms of the samples (one for each class, on the same plot) to make sure they correspond to your expectations. For this exercise, take $\\mu_1 = 3.5$. With so many data points you'll be able to use plenty of bins (say 50 or 100); the second argument of hist controls the number of bins." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the histograms you've plotted are proportional to $p(x | C_0) p(C_0)$ and $p(x | C_1) p(C_1)$. Based on this, estimate by eye where the ideal decision boundary is." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Empirical error rate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start, let's see how the empirical error rate varies as the decision boundary is moved. Call the location of the decision boundary $\\lambda$. Then if $x_n < \\lambda$, $x_n$ is classified to $C_0$; otherwise it is classified to $C_1$. \n", "\n", "Make a plot of how the accuracy of this simple classifier varies as the decision boundary $\\lambda$ is varied from -6 to +10. From your plot, what is the optimal location of the decision boundary?
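" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like a starting point, the next cell is a minimal sketch of one possible approach (it is an illustration, not the required solution): it samples the data as described in the data-generation section and then sweeps the boundary over a grid. The variable names (priors, means, boundaries, accuracy and so on) are just illustrative choices." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Hedged sketch: sample (x, t) pairs and sweep the decision boundary.\n", "# Illustrative only -- write your own version in the empty cell further down.\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "priors = [0.3, 0.7]   # p(C_0), p(C_1)\n", "means = [0.0, 3.5]    # mu_0, mu_1 (unit variances)\n", "n_samples = 10000\n", "\n", "# Draw the class first, then the feature given the class.\n", "t = (np.random.rand(n_samples) < priors[1]).astype(int)\n", "x = np.array(means)[t] + np.random.randn(n_samples)\n", "\n", "# Accuracy of the rule 'x < boundary means C_0' over a grid of boundaries.\n", "boundaries = np.linspace(-6, 10, 200)\n", "accuracy = [np.mean((x >= b) == (t == 1)) for b in boundaries]\n", "\n", "plt.plot(boundaries, accuracy)\n", "plt.xlabel('decision boundary')\n", "plt.ylabel('accuracy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "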
Does it concur with your \"by eye\" estimate from the histograms above? What is an *estimate* of the Bayes error rate?\n", "\n", "Programming gotcha: beware that lambda is a Python keyword, so you can't use that name (spelled like that) as a variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will find the Bayes error rate using our knowledge of the distributions from which the data was generated. \n", "\n", "Using the formula for $p(x|C_i) = \\mathcal{N}(x; \\mu_i, \\sigma^2_i)$, draw on the same plot $p(x | C_0) p(C_0)$ and $p(x | C_1) p(C_1)$. The point at which these curves cross is the location of the ideal decision boundary (why?)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If, as before, the decision boundary is at $x = \\lambda$, then the probability of misclassifying a point from $C_0$ is $\\int_{\\lambda}^\\infty p(x | C_0) p(C_0) dx$. Since we know that $p(x | C_0)$ is a Gaussian density, this integral can be worked out in terms of the (complementary) error function. \n", "\n", "Recall that \n", "\\begin{align}\n", "\\int_t^\\infty \\mathcal{N}(x; 0, 1) dx = \\frac{1}{2}\\mbox{erfc}(t/\\sqrt{2})\n", "\\end{align}\n", "where $\\mbox{erfc}(t)$ is the complementary error function, which the next cell plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.special\n", "t = linspace(-3, 3, 200)\n", "plot(t, scipy.special.erfc(t))\n", "xlabel('t')\n", "ylabel('erfc(t)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the complementary error function to draw a graph of the probability of making an error in classifying $C_0$ points as $\\lambda$ is varied. (Don't forget to account for the prior probability.)\n", "\n", "Augment your graph with another curve, showing the probability of making an error in classifying $C_1$ points (you may find it easier to use the error function instead of the complementary error function).\n", "\n", "Now plot the sum of these two curves, which is the total probability of making an error for any given $\\lambda$. Where is the best place for the decision boundary and what is the Bayes error rate?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Posterior probabilities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we know the class conditional probabilities, we can use Bayes rule to work out the posterior probabilities exactly. Use Bayes rule (see the lecture slides) to plot graphs of the posterior probabilities $p(C_i | x)$ for $i = 0, 1$. Check that these cross at the Bayes optimal decision boundary. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rejection region" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you only want to classify examples about which you are more than 80% confident, where should the rejection region lie? In this case, what fraction of the data is rejected?"
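] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a rough sketch of one way to approach this (again only an illustration): it computes the posteriors $p(C_i | x)$ on a grid using Bayes rule, finds where the larger posterior drops below 0.8, and then measures how much probability mass the mixture puts on that interval. The small gaussian helper is defined in the cell itself rather than taken from a library." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Hedged sketch: rejection region where neither posterior reaches 0.8.\n", "import numpy as np\n", "\n", "priors = [0.3, 0.7]\n", "means = [0.0, 3.5]\n", "\n", "def gaussian(z, mu, var=1.0):\n", "    return np.exp(-(z - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)\n", "\n", "grid = np.linspace(-6, 10, 2000)\n", "joint = np.array([p * gaussian(grid, mu) for p, mu in zip(priors, means)])\n", "posterior = joint / joint.sum(axis=0)   # Bayes rule: normalise the joint densities\n", "\n", "rejected = posterior.max(axis=0) < 0.8  # neither class reaches 80% confidence\n", "lo, hi = grid[rejected].min(), grid[rejected].max()\n", "mass = np.trapz(joint.sum(axis=0)[rejected], grid[rejected])\n", "print('rejection region roughly [%.2f, %.2f]' % (lo, hi))\n", "print('fraction of data rejected: about %.3f' % mass)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the empty cell below for your own version, and compare the rejected fraction with what you find empirically from the 10,000 samples."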
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### With less data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The answers you obtained above for the Bayes error rate and decision boundary should concur with the answers you got empirically (make sure they do agree!). However, the empirical estimate used a large number of data points and the data are one-dimensional. Usually the data are high-dimensional (so the decision boundary isn't a single $\\lambda$) and we don't have that many data points, so empirically finding the Bayes error is infeasible. \n", "\n", "Suppose that instead of 10,000 data points you only have 20. Use the first 20 of the 10,000 that you generated above. \n", "\n", "In this case, trying to estimate $p(x | C_i)p(C_i)$ by histograms is going to be too crude even with one-dimensional features (you could try it as you did at the start above). \n", "\n", "An alternative approach is to assume (correctly in this case) that the data come from Gaussians, but to estimate their means and standard deviations from the data. Using means and standard deviations calculated from the data, find the best decision boundary location and estimate the Bayes error rate. \n", "\n", "Since you used a single particular sample of 20 data points, you might have got particularly lucky or unlucky. Write a loop (and maybe a function!) to find the best decision boundary and error rate averaged over several 20-data-point samples. How does the estimate vary as you increase the amount of data available?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Receiver operating characteristics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the \"empirical error rate\" cell above, you will have noticed that there is a trade-off between two sorts of errors: when $\\lambda$ is very negative, everything in $C_1$ is correctly classified, but at the expense of a poor classification rate for $C_0$. The reverse is true when $\\lambda$ is large and positive. \n", "\n", "A common way of viewing this trade-off is to focus on one of the classes, calling it the *positive class*. Often we are more interested in one class than the other: in spam classification, for example, we might designate emails that are spam as the positive class, while in medical screening the condition being screened for is the positive class. The types of classification that can be made are then:\n", "\n", "* True positives: when $x$ is classified as positive and it is in fact positive;\n", "* False positives: when $x$ is classified as positive, but is actually in the negative class.\n", "\n", "Clearly, a good classifier will have a large true positive rate (the number of true positive classifications divided by the actual number of positive instances in the data) and a low false positive rate.\n", "\n", "Make a plot of the true positive rate versus the false positive rate as $\\lambda$ is varied from -6 to +10. Use the set of 10,000 examples to get a nice smooth curve. This curve, which displays the range of available true and false positive rates, is known as the receiver operating characteristic or ROC curve. It allows the user to select a decision boundary (here $\\lambda$) that takes into account the relative costs of misclassification.
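" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a starting point, the next cell sketches one way the ROC curve might be traced out. It is only an illustration and it assumes that the arrays x and t from the data-generation step are still in memory; the positive class is taken to be $C_1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Hedged sketch: trace the ROC curve by sweeping the decision boundary.\n", "# Assumes x and t (the 10,000 samples) exist from the data-generation cell.\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "boundaries = np.linspace(-6, 10, 400)\n", "tpr = [np.mean(x[t == 1] >= b) for b in boundaries]   # true positive rate\n", "fpr = [np.mean(x[t == 0] >= b) for b in boundaries]   # false positive rate\n", "\n", "plt.plot(fpr, tpr)\n", "plt.xlabel('false positive rate')\n", "plt.ylabel('true positive rate')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "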
In general, the better the classifier, the greater the area under the ROC curve. Calculate the AUC in this case.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "174px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }