Today we will get to know the package scikit-learn (sklearn). It has many different machine learning algorithms already implemented, so we will be using it for the next five classes. The first algorithm we will learn today is the k-nearest neighbors algorithm. It can be used for classification as well as for regression.
Take a look at the file src/nn_iris.py. We will implement the TODOs step by step:
- Install the scikit-learn package with pip install -r requirements.txt or directly via pip install scikit-learn.
 The iris dataset is very popular in machine learning example tasks. For this reason it ships directly with the sklearn package.
- Navigate to the __main__ function of src/nn_iris.py and load the iris dataset from sklearn.datasets.
 The dataset contains several plants from different species of the genus Iris. For each example, the width and length of the petal and the sepal of the flower were measured.
  
- Find out how to access the attributes of the dataset (Hint: set a breakpoint and examine the variable). Print the shape of the data matrix and the number of target entries. Print the names of the labels. Print the names of the features.
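As a rough sketch of this inspection step, assuming the dataset is loaded with load_iris (the variable name iris is only illustrative):

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch, a dictionary-like object whose keys
# can also be read as attributes.
iris = load_iris()

print(iris.keys())            # available attributes
print(iris.data.shape)        # (n_samples, n_features)
print(iris.target.shape[0])   # number of target entries
print(iris.target_names)      # names of the labels (species)
print(iris.feature_names)     # names of the features
```

Printing iris.keys() is an alternative to the breakpoint if you just want to see which attributes exist.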
Your goal is to determine the species for an example, based on the dimensions of its petals and sepals. But first we need to inspect the dataset.
- Use a histogram (class distribution) to check whether the iris dataset is balanced. To plot a histogram you can, for example, use pandas.Series.hist or matplotlib.pyplot.hist. Fortunately, the iris dataset is balanced: it has the same number of samples for each species. Balanced datasets make it simple to proceed directly to the classification phase. Otherwise we would have to take additional steps to reduce the negative effects (e.g. collect more data) or use algorithms other than k-nearest neighbors (e.g. random forests).
- We can also use the pandas function scatter_matrix to visualize some trends in our data. A scatter matrix (pairs plot) compactly plots all the numeric variables in a dataset against each other.
 Plot the scatter matrix. To make the different species visually distinguishable, use the parameter c=iris.target in pandas.plotting.scatter_matrix to color the data points according to their target species.
 In the scatter matrix you can see the value ranges as well as the distributions of each attribute. You can also compare the groups in the scatter plots over all pairs of attributes. These plots suggest that the groups are mostly well separated, although two of them overlap slightly.
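The two inspection steps above could look roughly like the following sketch. It assumes pandas and matplotlib are installed. The names class_counts, df and axes are illustrative, and figsize is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()

# Numeric counterpart of the class histogram: samples per species.
class_counts = np.bincount(iris.target).tolist()
print(dict(zip(iris.target_names, class_counts)))

# The same information as a histogram, drawn in its own figure.
plt.figure()
pd.Series(iris.target).hist()

# Scatter matrix of all feature pairs, colored by target species.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
axes = scatter_matrix(df, c=iris.target, figsize=(8, 8), diagonal="hist")

# Call plt.show() afterwards to display both figures.
```

scatter_matrix returns a grid of axes with one row and one column per feature, so the four iris features give a 4x4 grid.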
First, we need to split the dataset into train and test data. Then we are ready to train the model.
- Use train_test_split from sklearn.model_selection and create a train and a test set with the ratio 75:25. Print the dimensions of the train and the test set. You can use the parameter random_state to set the seed for the random number generator. That will make your results reproducible. Set this value to 29.
- Define a classifier knn from the class KNeighborsClassifier and set the hyperparameter n_neighbors to 1.
- Train the classifier on the training set. The method fit() is present in all the estimators of the package scikit-learn.
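A minimal sketch of the split and the training step, assuming the data comes from load_iris. The variable names X_train, X_test, y_train and y_test are a common convention, not something the exercise file prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# test_size=0.25 yields the 75:25 ratio; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=29
)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# k-nearest neighbors classifier that looks at a single neighbor.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```

With 150 samples, the 75:25 split leaves 112 samples for training and 38 for testing.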
The trained model is now able to receive the input data and produce predictions of the labels.
- Predict the labels first for the train and then for the test data.
- Comparing the predicted labels with the true labels tells us how well our model performs. The simplest performance measure is the ratio of correct predictions to all predictions, called accuracy. Implement a function compute_accuracy to calculate the accuracy of predictions. Use your function to evaluate your model by calculating the accuracy on the train set and the test set. Print both results.
- To judge whether our model performs well, we compare its performance to other models. Since we currently know only one classifier, we will compare it to dummy models. These dummy models are not trained on the data. Instead, they follow a simple rule to decide which prediction to make. One dummy model is the "most frequent" model. It always predicts the label that occurs most often in the train set. If the train set is balanced, we choose one of the classes. Implement the function accuracy_most_frequent to compute the accuracy of the most frequent model. (Hint: the function numpy.bincount might be helpful.) Print the result.
- (Optional) Another dummy model is a stratified model. A stratified model assigns random labels based on the ratio of the labels in the train set. Labels that occur more frequently have a higher chance of being chosen, but there is still a chance that a rarer label is picked. (Hint: numpy.random.choice might help.) Implement the function accuracy_stratified to compute the accuracy of the stratified model. Call the function several times and print the results. You will see that the results differ. To make the results reproducible, it is useful to set a seed. Use numpy.random.seed before calling the function to set the seed. Set it to 29.
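One possible shape for the three functions named in the steps above is sketched here. The function names come from the exercise, but the signatures are assumptions, and the code assumes the labels are non-negative integers, as they are in iris. The toy labels are made up for illustration.

```python
import numpy as np


def compute_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def accuracy_most_frequent(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    # argmax returns the first maximum, so ties go to the smallest label.
    most_frequent = np.bincount(y_train).argmax()
    y_pred = np.full(len(y_test), most_frequent)
    return compute_accuracy(y_test, y_pred)


def accuracy_stratified(y_train, y_test):
    """Accuracy of drawing labels at random with the training label ratios."""
    counts = np.bincount(y_train)
    probabilities = counts / counts.sum()
    y_pred = np.random.choice(len(counts), size=len(y_test), p=probabilities)
    return compute_accuracy(y_test, y_pred)


# Toy labels, made up for illustration only.
y_train = np.array([0, 0, 0, 1, 1, 2])
y_test = np.array([0, 1, 2, 0])

print(compute_accuracy(y_test, [0, 1, 1, 0]))    # 3 of 4 correct
print(accuracy_most_frequent(y_train, y_test))   # always predicts label 0

np.random.seed(29)  # same seed, same random draws
print(accuracy_stratified(y_train, y_test))
```

With a fitted classifier, the model itself would be evaluated as compute_accuracy(y_train, knn.predict(X_train)) and compute_accuracy(y_test, knn.predict(X_test)).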
Another common method to evaluate the performance of a classifier is to construct a confusion matrix, which shows not only the accuracy for each class (label) but also which classes the classifier confuses most often.
- Use the function confusion_matrix to compute the confusion matrix for the test set.
- (Optional) The accuracy of the prediction can be derived from the confusion matrix as the sum of the matrix diagonal divided by the sum of the whole matrix. Compute the accuracy using the information obtained from the confusion matrix. Print the result.
- We can also visualize the confusion matrix in the form of a heatmap. Use ConfusionMatrixDisplay to plot a heatmap of the confusion matrix for the test set. Use display_labels=iris.target_names for better visualization.
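The confusion matrix steps can be sketched on toy labels as follows. The arrays y_true and y_pred and the list label_names are made up for illustration. In the exercise they would be the test labels, the predictions on the test set, and iris.target_names. The plot requires matplotlib.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy labels, made up for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
label_names = ["setosa", "versicolor", "virginica"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Heatmap of the matrix; call matplotlib.pyplot.show() to display it.
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_names)
display.plot()
```

In this toy case one sample of class 1 is predicted as class 2, so the only nonzero off-diagonal entry sits in row 1, column 2.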
Now we need to find the best value for our hyperparameter k. We will use a common procedure called grid search to search the space of the possible values. Since our train dataset is small, we will perform cross-validation in order to compute the validation error for each value of k. Implement this hyperparameter tuning in the function cv_knearest_classifier following these steps:
- Define a second classifier knn2. Define a grid of parameter values for k from 1 to 25 (Hint: numpy.arange). This grid must be stored in a dictionary with n_neighbors as the key in order to use GridSearchCV with it.
- Use the class GridSearchCV to perform grid search. It can also perform n-fold cross-validation, so use the parameter cv to set the number of folds to 3. When everything is set, you can train your knn2 by fitting the GridSearchCV object.
After the training you can access the best parameter best_params_, the corresponding validation accuracy best_score_ and the corresponding estimator best_estimator_.
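A sketch of the grid search under a few assumptions: the split uses the same settings as before, "from 1 to 25" is read as including 25, and the names param_grid, grid and best_knn are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=29
)

knn2 = KNeighborsClassifier()
# arange excludes the stop value, so 26 gives k = 1, ..., 25.
param_grid = {"n_neighbors": np.arange(1, 26)}

# 3-fold cross-validation over every value of k in the grid.
grid = GridSearchCV(knn2, param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best value of k found on the grid
print(grid.best_score_)    # mean cross-validated accuracy for that k

# With the default refit=True the best estimator is refitted on the
# whole training set, so it can be used for predictions directly.
best_knn = grid.best_estimator_
print(best_knn.score(X_train, y_train), best_knn.score(X_test, y_test))
```

For classifiers, score() returns the mean accuracy, so it should agree with a hand-written accuracy function on the same predictions.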
- Use the best estimator to compute the accuracy on your train and test sets. Print the results. Has the test accuracy improved after the hyperparameter tuning?
- Plot the new confusion matrix for the test set.
Navigate to the __main__ function of nn_regression.py in the src directory and fill in the blanks by implementing the TODOs.