# Some Machine Learning with Python

The following problems appeared as part of a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI).

## I. Perceptron Learning Algorithm

In this question, the perceptron learning algorithm (“PLA”) will be implemented for a linearly separable 2-D dataset. The dataset contains a series of data points. Each point is a comma-separated ordered triple, representing feature_1, feature_2, and the label for the point. The values of the features can be though of as the x- and y-coordinates of each point. The label takes on a value of positive or negative one. The label can be thought of as separating the points into two categories. Implement your PLA in a file called problem1.py, which will be executed like so:

With each iteration of the PLA, your program must print a new line to the output file, containing a comma-separated list of the weights w_1, w_2, and b in that order. Upon convergence, the program will stop, and the final values of w_1, w_2, and b will be printed to the output file. This defines the decision boundary that your PLA has computed for the given dataset.

Here are first few rows of the sample dataset and the scatter plot of the dataset:

x1 x2 y
0 8 -11 1
1 7 7 -1
2 12 -20 1
3 14 -3 -1
4 12 8 -1

The following figure and the animation show how the perceptron algorithm converges to find the 2-D hyperplane (line) that separates two classes.

## II. Linear Regression

In this problem, multiple linear regression will be implemented using gradient descent on a dataset with multiple features. Again, the dataset contains a series of data points. Each point is a comma-separated ordered triple, representing age, weight, and height (derived from CDC growth charts data). A few rows of the sample dataset is shown below.

Out[26]:
age weight height
0 2.00 10.21027 0.805248
1 2.04 13.07613 0.919474
2 2.13 11.44697 0.908350
3 2.21 14.43984 0.803756
4 2.29 12.59622 0.811357

Data Preparation and Normalization. Once the dataset is loaded, the content needs to be explored to identify each feature. The intercept is needed to be added ahead of the data matrix. Since the features are not on the same scale, e.g., they represent age (years), and weight (kilograms), they need to normalized. What is the mean and standard deviation of each feature? The last column is the label, and represents the height (meters). Each feature (i.e. age and weight, except the intercept) has to be scaled by its standard deviation, and its mean has to be set to zero, using the following formula (for z-score normalization):

Gradient Descent: Gradient descent is to be implemented to find a regression model. The β’s are to be initialized to zero. The empirical risk and gradient descent rule are as follows:

The gradient descent algorithm is to be run using the following learning rates: α ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. For each value of α, the algorithm is to be run for exactly 100 iterations and the convergence rates to be compared when α is small versus large. Also, the ideal learning rate to obtain an accurate model has to be computed.

The animation below shows how the regression plane is learnt with gradient descent.

## III. Classification

In this problem the support vector classifiers in the sklearn package is to be used to learn a classification model for a chessboard-like dataset. Again, the input dataset containins a series of data points as shown below.

A B label
0 3.40 0.35 1
1 0.70 0.20 1
2 3.55 2.50 0
3 3.10 0.35 1
4 0.70 1.30 1

SVM with different kernels is to be used to build a classifier. The dataset is to be splitted into training (60%) and testing (40%), using stratified sampling (i.e. same ratio of positive to negative in both the training and testing datasets). K-Fold Cross validation (with the number of folds k = 5) is to be used instead of a validation set. SVM with Linear Kernel. The performance of the SVM with linear kernel is to be observed. A good setting of parameters is to be searched to obtain high classification accuracy. Specifically, the values of C = [0.1, 0.5, 1, 5, 10, 50, 100] are to be tried. After locating the optimal parameter value by using the training data, the corresponding best score (accuracy) achieved needs to e recorded Then by applying the testing data to the model, and the actual test score is to be recored. Both scores will be a number between zero and one.

SVM with Polynomial Kernel. (Similar to above). Try values of C = [0.1, 1, 3], degree = [4, 5, 6], and gamma = [0.1, 1].

SVM with RBF Kernel. (Similar to above). Try values of C = [0.1, 0.5, 1, 5, 10, 50, 100] and gamma = [0.1, 0.5, 1, 3, 6, 10].

Logistic Regression. (Similar to above). Try values of C = [0.1, 0.5, 1, 5, 10, 50, 100].

k-Nearest Neighbors. (Similar to above). Try values of n_neighbors = [1, 2, 3, …, 50] and leaf_size = [5, 10, 15, …, 60].

Decision Trees. (Similar to above). Try values of max_depth = [1, 2, 3, …, 50] and min_samples_split = [1, 2, 3, …, 10].

Random Forest. (Similar to above). Try values of max_depth = [1, 2, 3, …, 50] and min_samples_split = [1, 2, 3, …, 10].

The following plots show the cross validation accuracies for a few different classifiers with different values of the hyper-parameters.

Here are the decision boundaries learnt with different classifiers using the best-fit models: