{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning from Data: Workshop 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Date set | Hand-in date |\n",
"|:------------------|:-----------------------------------|\n",
"|27th January 2017 | **12:00 Thursday 13th February 2017** |\n",
"\n",
"\n",
"This workshop is worth 10% of the total module mark.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Candidate number: ** Put your candidate number here ** "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your report should consist of your IPython notebook showing what you did, what was the\n",
"result, and what you can conclude from the exercise. Each report will be\n",
"assessed on the following criteria:\n",
"\n",
"* Does it record what was done in the exercise?\n",
"* Does it permit the results to be reproduced?\n",
"* How does the work relate to the theoretical foundations discussed in lectures?\n",
"* Is it well presented?\n",
"\n",
"Just write comments on the results, etc, using markdown in new cells.\n",
"\n",
"### Submitting the notebooks\n",
"\n",
"Only an electronic submissions is required. Submit your notebook (the .ipynb file) to electronic copy via the [electronic hand-in system](http://empslocal.ex.ac.uk/submit/) using the topic pwd
(print working directory) in a cell to find out."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
cov
to return the covariance matrix (remember that the covariance matrix should be a 2 by 2 matrix for these data). Check that the diagonal entries are what you expect from the standard deviations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
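{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of one way to do this, assuming the data matrix `X` (one observation per row) was loaded in an earlier cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: assumes X is the N by 2 data matrix from an earlier cell.\n",
"import numpy as np\n",
"\n",
"S = np.cov(X, rowvar=False)       # rowvar=False: columns are the variables\n",
"print(S)                          # should be a 2 by 2 matrix\n",
"print(np.sqrt(np.diag(S)))        # square roots of the diagonal entries...\n",
"print(np.std(X, axis=0, ddof=1))  # ...should match the sample standard deviations"
]
},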
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write a loop to calculate the covariance matrix by hand:\n",
"\\begin{align*}\n",
" S_{ij} = \\frac{1}{N-1}\\sum_{n=1}^N (x_{ni} - \\bar{x}_{i})(x_{nj} - \\bar{x}_{j})\n",
"\\end{align*}\n",
"for each $i$ and $j$.\n",
"\n",
"Check that you get the same result using both methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
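{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of the double loop (again assuming `X` from an earlier cell; the name `S_hand` is purely illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: covariance matrix computed element by element from the formula.\n",
"import numpy as np\n",
"\n",
"N, D = X.shape\n",
"xbar = X.mean(axis=0)   # column means\n",
"S_hand = np.zeros((D, D))\n",
"for i in range(D):\n",
"    for j in range(D):\n",
"        S_hand[i, j] = np.sum((X[:, i] - xbar[i]) * (X[:, j] - xbar[j])) / (N - 1)\n",
"print(S_hand)           # should match np.cov(X, rowvar=False)"
]
},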
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the\n",
"correlation between the two variables $x_1$ and $x_2$ from the standardised\n",
"covariance matrix. You can use corrcoef
to check your results, but you should be able to read it from the covariance matrix of the standardised data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-27T00:11:15.749585",
"start_time": "2017-01-27T00:11:15.746440"
},
"collapsed": false
},
"outputs": [],
"source": []
},
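{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of how the correlation can be read off (assuming `X` from earlier; `Z` is an illustrative name for the standardised data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: standardise X, then read the correlation off the covariance matrix.\n",
"import numpy as np\n",
"\n",
"Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardise each column\n",
"print(np.cov(Z, rowvar=False))       # the off-diagonal entry is the correlation\n",
"print(np.corrcoef(X, rowvar=False))  # check against corrcoef"
]
},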
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## k-nearest neighbour classifier\n",
"\n",
"Now we will use a k-nn classifier to classify the data. You will have to divide the data into a training and a test set. You can use the k-nn classifier from the sklearn
module as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-27T00:13:17.641138",
"start_time": "2017-01-27T00:13:17.629243"
},
"collapsed": false
},
"outputs": [],
"source": [
"# Divide the data into training and test sets\n",
"from numpy.random import permutation\n",
"N = X.shape[0]\n",
"I = permutation(N) # Shuffled indices 0,..., N-1\n",
"Itr = I[:N//2]\n",
"Ite = I[N//2:]\n",
"\n",
"Xtr = X[Itr,:]\n",
"ttr = t[Itr]\n",
"\n",
"Xte = X[Ite,:]\n",
"tte = t[Ite]\n",
"\n",
"# Make a copy of the features as you will need it later.\n",
"Xtr_copy = Xtr.copy()\n",
"Xte_copy = Xte.copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot your training and test sets to make sure that they look like a fair random division of the data.\n",
"\n",
"The training data are to be used to construct the classifier. The test data, which should not be used at all during training, are used to evaluate how well the classifier works."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
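{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to plot the division (using `Xtr`, `ttr`, `Xte` and `tte` from the split above; the marker styles are an arbitrary choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: training points as dots, test points as crosses, coloured by class.\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.plot(Xtr[ttr==0, 0], Xtr[ttr==0, 1], 'b.', label='train, class 0')\n",
"plt.plot(Xtr[ttr==1, 0], Xtr[ttr==1, 1], 'r.', label='train, class 1')\n",
"plt.plot(Xte[tte==0, 0], Xte[tte==0, 1], 'bx', label='test, class 0')\n",
"plt.plot(Xte[tte==1, 0], Xte[tte==1, 1], 'rx', label='test, class 1')\n",
"plt.legend()\n",
"plt.show()"
]
},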
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standardisation\n",
"\n",
"Since the scales of the data are so different, it will be important to standardise the data before trying to classify it.\n",
"\n",
"\n",
"Find the mean and standard deviations of the *training* data and use these to standardise the training data. (You can use the commands\n",
"mean
and std
to find the mean and standard deviation.) Use the training data mean and standard deviation to standardise the test data. Note that it's important to use the training data statistics (rather than the test data statistics) because both data should be treated in *exactly* the same way and we might only have a single test data point to classify.\n",
"\n",
" \n",
"Plot the standardised data \n",
"and check your result by finding its mean and covariance matrix. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
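{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the standardisation step; note that only the *training* statistics are used for both sets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: standardise using the training-set statistics.\n",
"# (Xtr_copy and Xte_copy, made earlier, keep the raw features.)\n",
"import numpy as np\n",
"\n",
"mu = Xtr.mean(axis=0)             # training means\n",
"sigma = Xtr.std(axis=0, ddof=1)   # training standard deviations\n",
"\n",
"Xtr = (Xtr - mu) / sigma\n",
"Xte = (Xte - mu) / sigma          # the *same* transformation for the test data\n",
"\n",
"print(Xtr.mean(axis=0))           # should be close to zero\n",
"print(np.cov(Xtr, rowvar=False))  # diagonal should be close to one"
]
},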
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the k-nearest neighbour classifier from scikit learn, which is quite an extensive implementation of various machine learning algorithms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-27T00:04:20.487325",
"start_time": "2017-01-27T00:04:20.107360"
},
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn import neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general you can train the clasifier using Xtr
and Xte
as follows:\n",
"\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "knn.fit(Xtr, ttr) # Train it\n", "\n", "y = knn.predict(Xte) # Predict the class of the testing data\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell shows you how to classify points on a grid in feature space. This is so that we can gain an understanding of how the classifier works for a whole range of points. Note that this will give poor results unless you have first standardised your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2017-01-27T00:04:51.760312", "start_time": "2017-01-27T00:04:50.359883" }, "collapsed": false }, "outputs": [], "source": [ "k = 5 # Chosse the number of nearest neighbours\n", "\n", "knn = neighbors.KNeighborsClassifier(n_neighbors=k)\n", "knn.fit(Xtr, ttr)\n", "\n", "# Plot the prediction by the classifier of the class probability \n", "# (estimated from the fraction of points of each class in the k\n", "# nearest neighbours) for data on a grid. \n", "\n", "N, M = 40, 30 # Make these larger to get a smoother picture\n", "\n", "Xgrid = linspace(-3.0, 3.0, N)\n", "Ygrid = linspace(-3.0, 3.0, M)\n", "pred = zeros((M,N))\n", "prob = zeros((M,N,2))\n", "# Writing this double loop is not very efficient, but it is clear.\n", "for ny, y in enumerate(Ygrid):\n", " for nx, x in enumerate(Xgrid):\n", " pred[ny, nx] = knn.predict([[x, y]]) # Predict expects a matrix of features\n", " prob[ny, nx, :] = knn.predict_proba([[x, y]]) # Probabilities of belonging to one class\n", "pcolor(Xgrid, Ygrid, pred, cmap=cm.gray, alpha=0.2)\n", "colorbar()\n", "plot(Xtr[ttr==0,0], Xtr[ttr==0,1], 'b.')\n", "plot(Xtr[ttr==1,0], Xtr[ttr==1,1], 'r.')\n", "axis('tight')\n", "\n", "# Plot the class probabilites\n", "figure()\n", "pcolor(X, Y, prob[:,:,0], cmap=cm.coolwarm, alpha=0.8)\n", "colorbar()\n", "plot(Xtr[ttr==0,0], Xtr[ttr==0,1], 'bo')\n", "plot(Xtr[ttr==1,0], Xtr[ttr==1,1], 'ro')\n", "axis('tight')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "knn.fit(Xtr, ttr) # Train it\n", "\n", "y = knn.predict(Xte) # Predict the class of the testing data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use your classifier to carry out $k=1$ classification of the\n", "
Xte
tremor data. Following the example above, plot the\n",
"training data according to its class and plot the test data according to\n",
"both its true class (from tte
) and its predicted class from your\n",
"classifier. Where the predicted class differs from the true class make\n",
"sure that you can see from the plot why the classifier has classified the\n",
"way it has. Work out the classification accuracy for your\n",
"classifier, that is the fraction of examples in Xte
for which the\n",
"classifier predicts the correct class.\n",
"\n",
"Repeat the above for $k = 3$ and $k = 10$ and give an explanation for\n",
"your results. Automate the procedure to plot a graph of the\n",
"classification accuracy of Xte
versus $k$ for $k$ up to about 40. What's the best $k$ to use? Why are smaller $k$ worse? Why are larger $k$ worse?\n",
"\n",
"Now plot the classification accuracy for the training data (that\n",
"is call your classifier like knn.predict(Xtr)
).\n",
"Explain the shape of the curve."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
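{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible sketch for automating the accuracy-versus-$k$ loop (using `Xtr`, `ttr`, `Xte` and `tte` from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: test-set classification accuracy for a range of k.\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"ks = range(1, 41)\n",
"acc = []\n",
"for k in ks:\n",
"    knn = KNeighborsClassifier(n_neighbors=k)\n",
"    knn.fit(Xtr, ttr)\n",
"    acc.append(np.mean(knn.predict(Xte) == tte))  # fraction classified correctly\n",
"\n",
"plt.plot(list(ks), acc, 'o-')\n",
"plt.xlabel('k')\n",
"plt.ylabel('Test accuracy')\n",
"plt.show()"
]
},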
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now use the k-nn classifier to find the accuracy using the standardised data. How does it compare with the raw data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
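{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to make the comparison, using the copies `Xtr_copy` and `Xte_copy` made earlier to recover the raw features ($k=5$ here is just an illustrative choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: k-nn accuracy on the raw versus the standardised features.\n",
"import numpy as np\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"for name, Atr, Ate in [('raw', Xtr_copy, Xte_copy), ('standardised', Xtr, Xte)]:\n",
"    knn = KNeighborsClassifier(n_neighbors=5)\n",
"    knn.fit(Atr, ttr)\n",
"    print(name, 'accuracy:', np.mean(knn.predict(Ate) == tte))"
]
},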
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-26T23:35:20.559965",
"start_time": "2017-01-26T23:35:20.557688"
}
},
"source": [
"### Cross validation\n",
"\n",
"Above we used all the training data and guessed the value of $k$. Much better is to estimate the optimum value of $k$, but dividing the training data into a training and a validation set; the generalisation error is then estimated on the validation set and the $k$ giving the minimum error is used for making predictions about unknown data.\n",
"\n",
"Better than just dividing the training data into two is to use $k$ fold cross validation (don't confuse the $k$ in $k$ cross validation with the $k$ in $k$ nearest neighbours!\n",
"\n",
"To perform $k$-fold cross validation divide the training data into several (Nfold) portions/folds, use all but one of them to train the classifier, and evaluate the accuracy on the fold that you have reserved. Do this for each fold in turn and average the error on the reserved folds to find an overall *validation error*, which is an estimate of the *generalisation error*. Usually dividing the data into 5 or 10 folds will be enough. \n",
"\n",
"\n",
"You can either write your own code to do this or use the cross validation machinery provided by scikit learn. The following cell shows how the sklearn routines may be used to produce training and validation sets automatically. More information at