# Some Optimization: Implementing the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) Algorithms with Octave / Matlab

The following problems appeared in a project in the edX course 236862.1x: Introduction to the Fundamentals of Sparse Representations by Prof. Michael Elad from The Technion – Israel Institute of Technology. The problem descriptions are taken directly from the project.

## Pursuit Algorithms

In this article we demonstrate the Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) algorithms by running them on a set of test signals and checking whether they provide the desired outcome for the P0 problem.

## Some Theory

Our goal is to solve the following problem, known as the P0 problem (the objective to be minimized is the L0 pseudo-norm of the vector x, i.e., its number of non-zero entries):

(P0):  min over x of ||x||_0  subject to  Ax = b

This problem seeks the sparsest solution to a linear system. Seeking the sparsest representation of data is of central importance to data processing in a universal way (e.g., in machine learning, image processing, computer vision, remote sensing), although the problem is NP-hard in general.

In order to solve it in practice, an approximation algorithm has to be used, based on either a greedy or a relaxation approach. In this article, the implementation of the two algorithms above and some experimental results on approximately solving the P0 problem with a synthetic signal dataset are described. The experiments proceed as follows: construct a dictionary A and sparse vectors x0, run each pursuit algorithm on the signals b = Ax0, and compare the estimated solutions against the ground truth.

## Data Construction

In this part we create a dictionary and synthetic signals that will be used in the experiments. Here are the ingredients involved:

• The matrix A is the “dictionary” (of size 50-by-100) under which the ideal signal is known to be sparse. We first construct this dictionary by drawing at random a matrix with i.i.d. Gaussian entries, then normalize each atom (column) to unit L2 norm.

• Then we generate a sparse vector x0 of cardinality s: draw the locations of the non-zero entries at random, then draw the value of each entry from a uniform distribution in the range [1, 3] and multiply each by a random sign.
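The two construction steps above can be sketched in NumPy (a minimal Python sketch for illustration; the course project itself uses Octave/Matlab, and the seed and cardinality below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 50, 100, 4  # dictionary size 50-by-100, cardinality s (hypothetical value)

# Dictionary with i.i.d. Gaussian entries, each atom (column) normalized to unit L2 norm.
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)

# Sparse vector x0: s random support locations, magnitudes uniform in [1, 3], random signs.
x0 = np.zeros(m)
support = rng.choice(m, size=s, replace=False)
x0[support] = rng.uniform(1, 3, size=s) * rng.choice([-1, 1], size=s)

b = A @ x0  # the clean signal observed by the pursuit algorithms
```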

The following figure shows the heatmap of the randomly generated dictionary A.

## Greedy Pursuit

The greedy OMP algorithm proceeds as follows: starting from an empty support and the residual r = b, repeatedly select the atom most correlated with the current residual, add it to the support, solve a least-squares problem restricted to the selected atoms, and update the residual; stop after s atoms have been chosen (or once the residual is small enough).

## Basis Pursuit

Even though the P0 problem is relaxed to P1 (replacing the L0 pseudo-norm with the L1 norm), it is still not ready to be solved by an LP solver, since the objective function is still not linear. As explained in this thesis, the P1 problem can be converted to an LP by introducing new variables: write x = u − v with u, v ≥ 0, so that minimizing ||x||_1 becomes minimizing the linear objective Σ(u_i + v_i) subject to A(u − v) = b.

## Analyzing the Results
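The two pursuit routines being compared can be sketched in Python (a minimal illustration, not the course's Octave/Matlab code; `scipy.optimize.linprog` stands in for a generic LP solver, and the function names are this sketch's own):

```python
import numpy as np
from scipy.optimize import linprog

def omp(A, b, s):
    """Greedy pursuit: pick the atom most correlated with the residual, s times."""
    x = np.zeros(A.shape[1])
    support = []
    r = b.astype(float).copy()
    for _ in range(s):
        support.append(int(np.argmax(np.abs(A.T @ r))))
        # Least-squares fit restricted to the selected atoms, then update the residual.
        xs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ xs
    x[support] = xs
    return x

def bp(A, b):
    """Basis Pursuit as an LP: writing x = u - v with u, v >= 0 turns
    min ||x||_1 s.t. Ax = b into min sum(u + v) s.t. A(u - v) = b."""
    m = A.shape[1]
    res = linprog(c=np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method='highs')
    return res.x[:m] - res.x[m:]
```

Both routines return a full-length estimate of x; for OMP the support size s is assumed known, as in the experiments of this article.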

In this part we compare the results obtained via the OMP and BP, by executing the following steps.

• Plot the average relative ℓ2 error, obtained by the OMP and BP versus the cardinality.
• Plot the average support recovery error, obtained by the OMP and BP versus the cardinality.
• Discuss the presented graphs and explain the results.

## Discussion

As can be seen from the above results,

• The Linear Programming-based relaxed implementation of Basis Pursuit has higher accuracy (in terms of both the L2 error and the support-recovery error) than the greedy Orthogonal Matching Pursuit, particularly once the cardinality of the true solution increases beyond 6.

• Both OMP and BP (LP) perform equally well (with almost zero L2 error) up to cardinality 6; beyond that, OMP clearly performs worse than BP (LP).

• The average L2 error for OMP increases to about 0.2 as the cardinality of the true solution grows to 15, whereas that of BP increases only very slightly.

• A similar pattern can be seen for the probability of error in the support.
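The two error measures reported above can be computed as follows (a sketch; the helper names are this sketch's own, and the exact support-error definition used in the project may differ slightly — here it is the relative size of the support mismatch):

```python
import numpy as np

def relative_l2_error(x_hat, x0):
    # ||x_hat - x0||_2 / ||x0||_2
    return np.linalg.norm(x_hat - x0) / np.linalg.norm(x0)

def support_error(x_hat, x0, tol=1e-6):
    # 1 - |S_hat intersect S0| / max(|S_hat|, |S0|), where S is the set of
    # indices with magnitude above the tolerance.
    S_hat = set(np.flatnonzero(np.abs(x_hat) > tol))
    S0 = set(np.flatnonzero(np.abs(x0) > tol))
    return 1 - len(S_hat & S0) / max(len(S_hat), len(S0))
```

Averaging these quantities over many random realizations, per cardinality, yields the curves discussed above.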

## Mutual Coherence of A and Theoretical Guarantees for OMP and BP

The mutual coherence of a matrix A with L2-normalized columns a_i is defined as

μ(A) = max over i ≠ j of |a_iᵀ a_j|

and the classical guarantee states that OMP and BP recover the solution of P0 whenever the cardinality satisfies s < (1 + 1/μ(A)) / 2. For our experiment, the mutual coherence is μ(A) = 0.45697. Hence, as per the theoretical guarantee, OMP and BP are guaranteed to find the solution of P0 for s < 1.594164. In practice we observe that the solution found is quite accurate up to s = 6, since the theoretical bound above covers only the worst case.
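The mutual coherence and the resulting bound can be checked with a few lines of NumPy (a minimal sketch; the helper name is this sketch's own):

```python
import numpy as np

def mutual_coherence(A):
    # Normalize the atoms, then take the largest absolute off-diagonal
    # entry of the Gram matrix.
    An = A / np.linalg.norm(A, axis=0)
    G = np.abs(An.T @ An)
    np.fill_diagonal(G, 0.0)
    return G.max()

# The classical guarantee: OMP and BP recover the P0 solution whenever
# s < (1 + 1/mu(A)) / 2.  With mu(A) = 0.45697 this gives s < 1.594164.
bound = (1 + 1 / 0.45697) / 2
```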


# Data Science with Python: Exploratory Analysis with Movie-Ratings and Fraud Detection with Credit-Card Transactions

The following problems are taken from the projects / assignments in the edX course Python for Data Science (UCSanDiegoX) and the Coursera course Applied Machine Learning in Python (UMich).

## 1. Exploratory Analysis to Find Trends in Average Movie Ratings for different Genres

### Dataset

● The IMDB Movie Dataset (MovieLens 20M) is used for the analysis.
● The dataset is downloaded from here.
● This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users, and was released in April 2015.
● The csv files movies.csv and ratings.csv are used for the analysis.

### Motivation

● Understand the trend in average ratings for different movie genres over the years (from 1995 to 2015) and the correlation between the trends for different genres (seven genres are considered: Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical).

● This will give us insight into how people’s liking for the different movie genres changes over time, and into the strength of association between the trends of different genres; such insights could be useful for critics.

### Research Questions

The following research questions will be answered using exploratory analysis and visualization:

1) The trend in average ratings for different movie genres: how do the average ratings for a few different movie genres (namely, Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical) change over time (across the years from 1995 to 2015)? How do the average ratings for the different genres compare with one another?

2) Correlation between the trends for different genres: How are the trends for the genres correlated? For example, are the average ratings for Comedy and Sci-Fi movies positively associated with each other? What is the strength of association?

### Pre-processing the dataset

The next table shows the first few rows of the movies dataset, loaded in a pandas DataFrame.

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
```python
print(movies.shape)
# (27278, 3)
```
Using an explode() function implementation similar to this StackOverflow post, the DataFrame is transformed to the one shown below.
```python
movies.genres = movies.genres.str.split('|')
movies = explode(movies, ['genres'])
movies.head()
```
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy
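For reference, recent versions of pandas (0.25+) provide a built-in `DataFrame.explode`, so a custom helper is no longer needed; a minimal sketch on a couple of the rows above:

```python
import pandas as pd

movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy'],
})

# Split the pipe-separated genres, then give each genre its own row.
movies['genres'] = movies['genres'].str.split('|')
movies = movies.explode('genres').reset_index(drop=True)
```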

The next table shows the first few rows of the ratings dataset, again loaded with pandas.

userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580

The input tables are pre-processed using the following code to get the data in the desired format, ready for the analysis.

```python
import time

ratings['timestamp'] = ratings['timestamp'].apply(lambda x: time.strftime('%Y', time.localtime(x)))
ratings.head()
```
userId movieId rating timestamp
0 1 2 3.5 2005
1 1 29 3.5 2005
2 1 32 3.5 2005
3 1 47 3.5 2005
4 1 50 3.5 2005
```python
movies = movies.drop('title', axis=1)
movies.head()
```
movieId genres
0 1 Adventure
1 1 Animation
2 1 Children
3 1 Comedy
4 1 Fantasy
```python
ratings = ratings.merge(movies, left_on='movieId', right_on='movieId', how='inner')
ratings.head()
```
userId movieId rating timestamp genres
0 1 2 3.5 2005 Adventure
1 1 2 3.5 2005 Children
2 1 2 3.5 2005 Fantasy
3 5 2 3.0 1996 Adventure
4 5 2 3.0 1996 Children
```python
print(ratings.shape)
# (53177059, 5)
```
```python
ratings = ratings.loc[ratings['genres'].isin(['Sci-Fi', 'Animation', 'Comedy', 'Romance', 'Thriller', 'Horror', 'Musical'])]

mean_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
mean_ratings.rename(columns={'timestamp': 'year'}, inplace=True)

sd_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.std)
sd_ratings.rename(columns={'timestamp': 'year'}, inplace=True)
sd_ratings.head()
```
year genres rating
0 1995 Comedy 0.000000
1 1995 Romance NaN
2 1995 Thriller 1.414214
3 1996 Animation 0.953820
4 1996 Comedy 1.010918

### Visualizations

Let’s first visualize (using seaborn) the distributions of the movie ratings across the different genres, as shown in the following figure. The next figure shows the trends of the average user ratings for the different genres across the years. The vertical error bars represent the standard deviations of the average ratings (ratings for different movies, averaged over users) within each genre (i.e., the s.d. across the movies belonging to the same genre).

```python
ratings2 = ratings.groupby(['movieId', 'timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
ratings2.head()
```
movieId year genres rating
0 1 1996 Animation 4.132947
1 1 1996 Comedy 4.132947
2 1 1997 Animation 3.875221
3 1 1997 Comedy 3.875221
4 1 1998 Animation 3.883929

The next figure shows the trends of the ratings (averaged over users and movies for each genre) for the different genres across the years.

```python
ratings3 = ratings.groupby(['userId', 'genres'], as_index=False)['rating'].aggregate(np.mean)
ratings3.head()
```
userId genres rating
0 1 Animation 3.650000
1 1 Comedy 3.731707
2 1 Horror 3.744444
3 1 Musical 3.666667
4 1 Romance 3.954545
```python
pivot = ratings3.pivot(index='userId', columns='genres', values='rating')
pivot.head()
```
genres Animation Comedy Horror Musical Romance Sci-Fi Thriller
userId
1 3.650000 3.731707 3.744444 3.666667 3.954545 3.712500 3.761905
2 3.000000 3.900000 3.555556 3.000000 3.833333 4.608696 4.263158
3 3.750000 4.057692 3.937500 4.000000 4.062500 4.000000 4.260000
4 4.000000 3.545455 NaN 4.000000 3.500000 3.000000 3.461538
5 4.666667 4.083333 3.000000 4.375000 3.937500 4.600000 4.333333
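The correlation between genre trends can then be read off such a pivot table with `DataFrame.corr()`; a minimal sketch on hypothetical per-user genre averages (only three genres and four users here, purely for illustration):

```python
import pandas as pd

# Hypothetical per-user average ratings per genre (same shape as the pivot above).
pivot = pd.DataFrame({
    'Comedy': [3.7, 3.9, 4.1, 3.5],
    'Sci-Fi': [3.7, 4.6, 4.0, 3.0],
    'Horror': [3.7, 3.6, 3.9, 2.9],
})

# Pairwise Pearson correlation between genres; missing entries are ignored pairwise.
corr = pivot.corr()
```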

The next figures show the trends for different genres for different sub-windows of time and with the variations (inter-quartile ranges)  in average ratings for each genre.

The next figures show how correlated the trends in average ratings are for the different genres.

## Some Findings

● There is a decreasing trend in the average ratings for all seven genres during 1995–98; the ratings then remain stable during 1999–2007 and increase again afterwards.

● Horror movies always have the lowest average ratings. Sci-Fi and Comedy movies also get low average ratings.

● Musical, Animation and Romance movies get the highest average ratings.

● Sci-Fi and Animation movies show very similar trends; they become popular again during 2009–2013.

● Trends in the average ratings of Romance and Horror movies show positive association between them.

## 2. Fraud Detection with Classification

### Abstract

In this project, we aim to build machine learning models that automatically detect frauds in credit card transactions. Several supervised binary classification models are trained using 75-25 validation on this credit card transaction dataset from Kaggle. Given a transaction instance, a model predicts whether or not it is a fraud. The models are then evaluated on a held-out subset of the data by measuring how effectively they predict instances of credit card fraud. The model with the best recall value (the one able to detect the highest number of true frauds) will be selected for prediction. We found that the k-Nearest Neighbors (k=3) and LogisticRegression models perform the best when only recall (sensitivity) is of concern, whereas the RandomForest model achieves fairly high specificity and sensitivity at the same time.

### Motivation

Using the credit card transaction dataset, we want to train a few machine learning models that can predict whether an unseen transaction in the future is likely to be fraud or not. This automatic prediction / detection of fraud can immediately raise an alarm and the transaction could be stopped before it completes. This can also help by expediting the process of manual inspection / examination / investigation of the predicted fraudulent transactions, thereby ensuring safety for the financial institution / bank / customers.

### Dataset

● This dataset from Kaggle is used for credit card fraud detection. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available here and here.

● The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Each row in the file creditcard.csv corresponds to a credit card transaction; the dataset contains 284,807 rows and 30 feature columns plus the Class label.

● Features (Columns) include an integer variable Time (which represents the seconds elapsed between each transaction and the first transaction in the dataset), 28 confidential variables V1 through V28 (these are numerical input variables which are the result of a PCA transformation – unfortunately, due to confidentiality issues, the original features and more background information about the data could not be provided), as well as the variable Amount which is the amount of the transaction.

● The target is stored in the Class column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

● The next table shows the first few rows.

### Data Preparation and Cleaning

● The main problem with this dataset is that it is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. There are only 492 frauds out of 284,807 transactions: too many negative instances and too few positive (fraud) instances.

● In order to mitigate this high imbalance ratio, so that the models can see enough fraud examples during training, the following techniques were used.

1. Oversampling: the Synthetic Minority Oversampling Technique (SMOTE) is used to generate new fraud (minority-class) samples by interpolating between existing fraud samples and their k nearest minority neighbors. Using this technique, the number of positive examples was increased to 5000.
2. Under-sampling: the NearMiss-1 algorithm is used to select those samples from the majority (non-fraud) class for which the average distance to the k nearest samples of the minority class is smallest. Using this technique, the number of negative examples was decreased to 10000.
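The core idea behind SMOTE can be sketched in a few lines of NumPy (a toy illustration of the interpolation step only, not the imbalanced-learn implementation used in practice; `smote_sample` and its parameters are this sketch's own names):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbors, and interpolate at a
    random point on the segment between them."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X_min)
    # Pairwise distances within the minority class; ignore self-distances.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    new = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)
        j = neighbors[i, rng.integers(min(k, n - 1))]
        lam = rng.random()  # random point on the segment [x_i, x_j]
        new[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return new
```

Because each synthetic point lies on a segment between two real minority samples, the new samples stay inside the minority class's local geometry.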

● The input feature Time is not very relevant for predicting fraud, so that feature is dropped from the input.

### Research Questions

Predict whether a given transaction is fraudulent or not.

● Given a credit card transaction (represented by the values of its input features), the goal is to answer the following question: is the transaction a fraud?

● More mathematically, given the labelled data, we want to learn a function f : X → {0, 1} (mapping a transaction’s feature vector to a fraud / not-fraud label) from the data. We shall also be interested in computing the conditional probability Prob(fraud | transaction).

● We want to use the learnt function to predict new transactions (not seen while learning the function).

● We want to evaluate how correctly we can find frauds from the unseen data and find which model performs the best (model selection).

### Methods

● The dataset is first split (with sklearn) into random train (75%) and test (25%) subsets. Both subsets roughly maintain the ratio of majority vs. minority class.

### Split the data into train and test

```python
from sklearn.model_selection import train_test_split

df = pd.read_csv(path + 'creditcard.csv')  # path points to the folder holding the Kaggle csv
df.drop('Time', axis=1, inplace=True)

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

● The training dataset is highly imbalanced w.r.t. the class ratio (only 372 fraud instances out of 213,607 total instances), so it is balanced using SMOTE (oversampling: increasing the number of fraud instances to 5000) and NearMiss-1 (under-sampling: decreasing the number of non-fraud instances to 10000). The test dataset is left untouched.

### Preprocessing: Resample to balance the dataset

● A few sklearn models (kNN, SVM, LogisticRegression, RandomForest, DecisionTree, AdaBoost, Naive Bayes) are then trained separately on the training dataset, and every time a model is learnt, it is used to predict the classes of the hitherto-unseen test dataset.

● Given the class imbalance ratio, one of the recommended measures for model evaluation is the Area Under the Precision-Recall Curve (AUPRC), since confusion-matrix accuracy is not meaningful for unbalanced classification. For example, a dummy classifier that classifies everything as the majority class of the training data obtains an accuracy of 0.996643 but a very poor recall of 0.008333, as shown below.

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), let’s train a dummy classifier that classifies everything as the majority class of the training data, and compute its accuracy and recall.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, precision_score

# 'most_frequent' makes the classifier always predict the majority class.
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = float(sum(y_pred == y_test)) / len(y_test)
recall = recall_score(y_test, y_pred)
print(accuracy, recall)
# (0.9966433527148114, 0.0083333333333333332)
```

● In particular, we shall be interested in a high recall, since ideally we want all the fraud instances to be predicted correctly as frauds by the model, with zero False Negatives.
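Sensitivity (recall) and specificity can be computed directly from the confusion-matrix counts; a minimal sketch (the helper name is this sketch's own):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (recall) = TP/(TP+FN); specificity = TN/(TN+FP)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```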

### Findings

● The next figure shows the prediction evaluation results on the test dataset using the python sklearn Support Vector Classifier with RBF kernel. As can be seen, with training on the resampled data the recall becomes quite high (~91.7%), although the accuracy and precision drop.

```python
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt

clf = SVC(random_state=0, C=1e9, gamma=1e-07, probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)
precision, recall, _ = precision_recall_curve(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_prob[:, 1], pos_label=1)
print(auc(fpr, tpr))
cnf_matrix = confusion_matrix(y_test, y_pred)
plt.figure()
# plot_confusion_matrix is the usual plotting helper from the sklearn examples
plot_confusion_matrix(cnf_matrix, classes=['0', '1'], title='Confusion matrix, without normalization')
```

● The next figure shows the prediction evaluation results on the test dataset using the python sklearn LogisticRegression classifier. As can be seen, with training on the resampled data the recall becomes even higher (~94.2%), although the accuracy and precision drop further.

● The next figure again shows the prediction recall values on the test dataset using the sklearn LogisticRegression classifier, but this time using grid search over different hyper-parameter values (for C and the regularization penalty).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

parameters = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters, scoring='recall')
clf.fit(X_train, y_train)
print(np.reshape(clf.cv_results_['mean_test_score'], (5, 2)))
```

● As can be seen, C = 100 with the L1 penalty obtains the best recall among the values searched on the grid.

● Also, there are 120 fraud instances in the test dataset, out of which all but 7 are detected correctly by the best Logistic Regression model. The next table shows a few rows, highlighted in red, for which the model failed to predict a fraud instance.

● The next figure shows visualizations of the classification decision functions learnt from the training dataset and the class labels predicted for the test dataset, projected onto the variables V1 and V2, for different classifiers.

● For every classifier, the (specificity, sensitivity) metrics on the test dataset are also computed and shown.

● As can be seen, the RandomForest classifier does a decent job of keeping both specificity and sensitivity quite high (89% and 88% respectively), but its precision is very poor.

● The k-NearestNeighbors (with k = 3) and LogisticRegression classifiers do the best in terms of recall (95% and 94% respectively), but their specificity is quite poor and their precision is very poor.

### Limitations and Future Work

● The models are learnt from this particular credit card fraud dataset and hence may not generalize to other fraud datasets.

● The dataset being highly imbalanced, other methods (such as different variants of SMOTE and ADASYN oversampling, or Tomek’s links and variants of the Edited Nearest Neighbor method, e.g. AllKNN) might be tried as a pre-processing step in order to create a more balanced dataset. Also, the resampling produced a final training dataset with a 1:2 ratio of minority to majority instances; different ratios (e.g., 1:1 or 2:3) could be tried.

● 75-25 validation is used to evaluate the classifier models; instead, k-fold (e.g., k = 10) cross-validation could be used to obtain more generalizable models.

● Hyper-parameter fine-tuning could be done for all models; it was only done with grid search for LogisticRegression. For probabilistic classifiers (e.g., Logistic Regression), different classification thresholds could be tried instead of the default.

● We assumed the transactions to be independent of each other (not time-dependent) and ignored the Time feature; one could instead use Time to build a sequential model for prediction.

● The entire dataset might be used with data augmentation for the minority fraud class, and deep learning models (a CNN, or an RNN if the Time feature is used) could be learnt and later used for prediction, to improve specificity and sensitivity (recall) simultaneously.

● We could try to find the patterns in the input features for a probable fraud instance.
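The classification-threshold idea mentioned among the future-work items can be sketched as follows (toy probabilities and labels, purely illustrative; the helper name is this sketch's own):

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels.
y_prob = np.array([0.05, 0.4, 0.6, 0.2, 0.9, 0.35])
y_true = np.array([0, 1, 1, 0, 1, 1])

def recall_at_threshold(y_true, y_prob, t):
    # Predict fraud whenever the estimated probability reaches the threshold t.
    y_pred = (y_prob >= t).astype(int)
    return np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)
```

Lowering the threshold below the default 0.5 raises recall at the cost of more false alarms, which may be an acceptable trade-off for fraud detection.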

### Conclusions

● As can be seen, with the LogisticRegression and k-NearestNeighbors (k = 3) classifiers we obtain the best recall on the held-out dataset (although the precision / specificity of these models is really poor). Hence, if recall is the only metric of concern, we can go ahead with either of these models for fraud detection / prediction of a fraud instance.

## References

● Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.