The following problems appeared as assignments in the **Coursera course** **Data-Driven Astronomy** (by the University of Sydney). The descriptions of the problems are taken mostly from the course assignments and from https://groklearning.com/learn/data-driven-astro/.

## 1. Building a Regression Model to predict Redshift

The Sloan data (*sdss_galaxy_colors*) will be used for this purpose; the first few rows are shown below. The columns **‘u’**–**‘z’** are the flux magnitude columns. The data also includes **spec_class** and **redshift_err** columns.

Now let’s compute four color features **u – g, g – r, r – i** and **i – z**. Our targets are the corresponding **redshifts**. The following shows the preprocessed data ready for training the regression model.
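The color computation can be sketched as below. This is a minimal illustration, assuming the catalogue is loaded as a NumPy structured array with columns named `u`, `g`, `r`, `i`, `z` and `redshift`; the helper name `get_features_targets` and the two sample rows are made up for the example.

```python
import numpy as np

def get_features_targets(data):
    """Return the four color indices as features and the redshift as target."""
    features = np.column_stack([
        data['u'] - data['g'],
        data['g'] - data['r'],
        data['r'] - data['i'],
        data['i'] - data['z'],
    ])
    targets = data['redshift']
    return features, targets

# Tiny made-up catalogue with the assumed column names (illustrative values only).
dtype = [(name, float) for name in ('u', 'g', 'r', 'i', 'z', 'redshift')]
data = np.array([
    (19.8, 18.7, 18.2, 17.9, 17.7, 0.08),
    (20.1, 19.9, 19.7, 19.6, 19.5, 1.50),
], dtype=dtype)

features, targets = get_features_targets(data)
print(features.shape)  # one row per object, one column per color
```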

The following figure shows how the features correlate with each other and also how the *redshift* changes with the feature values.

Now let’s split our data into training and testing subsets, use our features and targets to train a regression tree from the training dataset and then make a prediction on the held-out dataset. How do we know if the tree is actually any good at predicting redshifts?
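A minimal sketch of the split, train and predict steps, assuming scikit-learn's `DecisionTreeRegressor` is the tree implementation. The synthetic `features` and `targets` below are stand-ins for the real color features and redshifts, and the 50/50 split is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the four color features and the redshift targets.
features = rng.normal(size=(200, 4))
targets = 0.1 * np.abs(features[:, 0]) + 0.01 * rng.normal(size=200)

# Hold out the second half of the rows for testing.
split = features.shape[0] // 2
train_features, test_features = features[:split], features[split:]
train_targets, test_targets = targets[:split], targets[split:]

# Train on the first half only, then predict the held-out half.
dtr = DecisionTreeRegressor(random_state=0)
dtr.fit(train_features, train_targets)
predictions = dtr.predict(test_features)
```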

In regression we compare the predictions generated by our model with the actual values to test how well our model is performing. The difference between the predicted values and actual values (sometimes referred to as *residuals*) can tell us a lot about where our model is performing well and where it is not.

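While there are a few different ways to characterize these differences, we will use the median of the absolute differences between our predicted and actual values. This is given by

$$\mathrm{med\_diff} = \mathrm{median}\left(\left|Y_{i,\mathrm{pred}} - Y_{i,\mathrm{act}}\right|\right)$$

where $Y_{i,\mathrm{pred}}$ and $Y_{i,\mathrm{act}}$ are the predicted and actual redshifts of the $i$-th galaxy.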

Splitting the data into a training set and a held-out test set in this way is the most basic approach to validation and is called *hold-out validation*. We will use the *med_diff* accuracy measure and hold-out validation to assess the accuracy of our decision tree.
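The *med_diff* measure can be sketched in a few lines. This assumes the differences are taken as absolute values, so that over- and under-predictions do not cancel; the function name `median_diff` and the sample numbers are illustrative.

```python
import numpy as np

def median_diff(predicted, actual):
    """Median of the absolute differences between predictions and true values."""
    return np.median(np.abs(predicted - actual))

# Made-up redshifts: three good predictions and one large outlier.
predicted = np.array([0.10, 0.25, 0.40, 1.90])
actual = np.array([0.12, 0.20, 0.41, 0.90])

score = median_diff(predicted, actual)
print(score)
```

Because the median is used, the single badly predicted object barely affects the score, whereas it would dominate a mean of the same differences.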

### Overfitting

*Decision* / *Regression* trees have some limitations, though, one of the biggest being that they tend to overfit the data. This means that, left unchecked, they will create an overly complicated tree that attempts to account for outliers in the data. This comes at the expense of the accuracy of the general trend.

Part of the reason for this overfitting is that the algorithm works by trying to optimise the decision locally at each node. There are ways in which this can be mitigated, and in the next problem we will see how constraining the number of decision node rows (the tree depth) affects the accuracy of our predictions.

In order to see how the regression tree is *overfitting* we would like to examine how our decision tree performs for different tree depths. Specifically, we would like to see how it performs on test data compared to the data that was used to train it.

Naïvely, we’d expect that the *deeper* the tree, the better it should perform. However, as the model overfits, we see a difference between its accuracy on the training data and on the more general testing data.
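One way to examine this is to train a tree at several maximum depths and record the median difference on both the training and the testing half. The sketch below assumes scikit-learn's `DecisionTreeRegressor` and uses synthetic data; the helper name `accuracy_by_tree_depth` and the chosen depths are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def median_diff(predicted, actual):
    return np.median(np.abs(predicted - actual))

def accuracy_by_tree_depth(features, targets, depths):
    """Return (train, test) median differences for each maximum tree depth."""
    split = features.shape[0] // 2
    train_f, test_f = features[:split], features[split:]
    train_t, test_t = targets[:split], targets[split:]
    train_diffs, test_diffs = [], []
    for depth in depths:
        dtr = DecisionTreeRegressor(max_depth=depth, random_state=0)
        dtr.fit(train_f, train_t)
        train_diffs.append(median_diff(dtr.predict(train_f), train_t))
        test_diffs.append(median_diff(dtr.predict(test_f), test_t))
    return train_diffs, test_diffs

# Synthetic noisy data standing in for the color features and redshifts.
rng = np.random.default_rng(1)
features = rng.normal(size=(400, 4))
targets = np.abs(features[:, 0]) + 0.3 * rng.normal(size=400)

depths = [2, 5, 10, 30]
train_diffs, test_diffs = accuracy_by_tree_depth(features, targets, depths)
for depth, tr, te in zip(depths, train_diffs, test_diffs):
    print(depth, tr, te)
```

Plotting `train_diffs` and `test_diffs` against `depths` gives the kind of curve discussed below.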

The following figures show the decision trees with *maximum depths of 2, 3, 4 and 5* learnt from the training dataset.

**Regression Tree** with **max depth = 2**

**Regression Tree** with **max depth = 3**

**Regression Tree** with **max depth = 4**

**Regression Tree** with **max depth = 5**

We can see that the accuracy of the regression tree on the training set gets better as we allow the tree to grow to greater depths. In fact, at a depth of 27 our error goes to zero!

Conversely, the accuracy measure of the predictions for the test set gets better initially and then worse at larger tree depths. At a tree depth of ~19 the regression tree starts to *overfit* the data. This means it tries to take into account outliers in the training set and loses its general predictive accuracy.

*Overfitting* is a common problem with decision / regression trees and can be mitigated by adjusting parameters like the tree depth or setting a minimum number of cases at each node. For now, we will set a maximum tree depth of 19 to prevent overfitting in our redshift problem.

### K-Fold cross validation

The method we used to validate our model so far is known as *hold-out validation*. Hold-out validation splits the data in two: one set to test with and the other to train with. It is the most basic form of validation.

While hold-out validation is better than no validation, the measured accuracy (i.e. our median of differences) will vary depending on how we split the data into testing and training subsets. The med_diff that we get from one randomly sampled training set will differ from that of a different random training set of the same size.

In order to be more certain of our model's accuracy we should use *k*-fold cross validation. *k*-fold validation works in a similar way to hold-out validation, except that we split the data into *k* subsets. We train and test the model *k* times, recording the accuracy each time. Each time we use a different combination of *k* − 1 subsets to train the model and the remaining subset to test. We take the average of the accuracy measurements to be the overall accuracy of the model.
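The procedure can be sketched with scikit-learn's `KFold`, which yields the train and test indices for each of the *k* splits. The helper name `cross_validate_model`, the synthetic data, and the choice to shuffle before splitting are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def median_diff(predicted, actual):
    return np.median(np.abs(predicted - actual))

def cross_validate_model(model, features, targets, k):
    """Train and test the model k times, returning one median difference per fold."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    diffs = []
    for train_idx, test_idx in kf.split(features):
        # k - 1 subsets are used for training, the remaining one for testing.
        model.fit(features[train_idx], targets[train_idx])
        predictions = model.predict(features[test_idx])
        diffs.append(median_diff(predictions, targets[test_idx]))
    return diffs

# Synthetic stand-in for the color features and redshifts.
rng = np.random.default_rng(2)
features = rng.normal(size=(300, 4))
targets = np.abs(features[:, 0]) + 0.1 * rng.normal(size=300)

dtr = DecisionTreeRegressor(max_depth=19)
diffs = cross_validate_model(dtr, features, targets, k=10)
mean_diff = np.mean(diffs)  # overall accuracy estimate
```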

Cross validation is an important part of assessing the accuracy of any machine learning model. When we plotted our predicted vs measured redshifts, we were able to see that for many of our galaxies we got a reasonably accurate prediction of redshift. However, there are also several outliers where our model does not give a good prediction.

Our sample of galaxies consists of two different populations: regular galaxies and quasi-stellar objects (QSOs). QSOs are a type of galaxy that contains an actively (and intensely) accreting supermassive black hole. This is often referred to as an *Active Galactic Nucleus* (**AGN**).

The light emitted from the **AGN** is significantly brighter than the rest of the galaxy and we are able to detect these **QSO**s out to much higher redshifts. In fact, most of the normal galaxies we have been using to create our models have redshifts less than *z~0.4*, while the QSOs have redshifts all the way out to *z~6*. Due to this contribution from the AGN, the flux magnitudes measured at different wavelengths might not follow the typical profile we assumed when predicting redshifts.

Next we are going to look at whether there is a difference in the accuracy of the decision trees between *QSOs* and *regular galaxies*.
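One way to make this comparison is to mask the predictions by the **spec_class** column and compute the median difference for each population separately. The sketch below assumes the class labels are the strings `'GALAXY'` and `'QSO'`; the helper `split_by_class` and all numbers are made up purely to show the mechanics and say nothing about the real result.

```python
import numpy as np

def median_diff(predicted, actual):
    return np.median(np.abs(predicted - actual))

def split_by_class(predictions, targets, spec_class, label):
    """Keep only the predictions and targets of objects with the given class label."""
    mask = spec_class == label
    return predictions[mask], targets[mask]

# Made-up values for illustration only.
spec_class = np.array(['GALAXY', 'QSO', 'GALAXY', 'QSO', 'GALAXY'])
targets = np.array([0.05, 2.10, 0.12, 3.40, 0.30])
predictions = np.array([0.06, 1.60, 0.10, 2.40, 0.33])

galaxy_diff = median_diff(*split_by_class(predictions, targets, spec_class, 'GALAXY'))
qso_diff = median_diff(*split_by_class(predictions, targets, spec_class, 'QSO'))
print(galaxy_diff, qso_diff)
```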

## 2. Exploring Machine Learning Classification to predict galaxy classes

There is a wide range of galaxy types observed by the *Sloan Digital Sky Survey* in the *Galaxy Zoo*. In this activity, we will limit our dataset to three types of galaxy: *spirals, ellipticals and mergers*, as shown below.

The galaxy catalog we are using is a sample of galaxies where at least 20 human classifiers (such as yourself) have come to a consensus on the galaxy type. Examples of spiral and elliptical galaxies were selected where there was a unanimous classification. Due to low sample numbers, we included merger examples where at least 80% of human classifiers selected the merger class. We need this high quality data to train our classifier.

The features that we will be using to do our galaxy classification are *color indices, adaptive moments, eccentricities* and *concentrations*. These features are provided as part of the SDSS catalogue.

Color indices are the same colors (u-g, g-r, r-i, and i-z) we used for regression. Studies of galaxy evolution tell us that spiral galaxies have younger star populations and therefore are ‘bluer’ (brighter at shorter wavelengths). Elliptical galaxies have an older star population and are brighter at longer wavelengths (‘redder’).

Eccentricity approximates the shape of the galaxy by fitting an ellipse to its profile. Eccentricity is the ratio of the two axes (semi-major and semi-minor). The De Vaucouleurs model was used to obtain these two axes. To simplify our experiments, we will use the median eccentricity across the 5 filters.

Adaptive moments also describe the shape of a galaxy. They are used in image analysis to detect similar objects at different sizes and orientations. We use the fourth moment here for each band.

Concentration is similar to the luminosity profile of the galaxy, which measures what proportion of a galaxy’s total light is emitted within what radius. A simplified way to represent this is to take the ratio of the radii containing 50% and 90% of the Petrosian flux.

The Petrosian method allows us to compare the radial profiles of galaxies at different distances. We will use the concentration from the u, r and z bands. For these experiments, we will define concentration as:

$$\mathrm{conc} = \frac{\mathrm{petroR50}}{\mathrm{petroR90}}$$

We have extracted the **SDSS** and *Galaxy Zoo* data for 780 galaxies; the first few rows of the dataset are shown below:

As described earlier, the data has the following fields:

- colors: u-g, g-r, r-i, and i-z
- eccentricity: ecc
- 4th adaptive moments: m4_u, m4_g, m4_r, m4_i, and m4_z
- 50% Petrosian: petroR50_u, petroR50_r, petroR50_z
- 90% Petrosian: petroR90_u, petroR90_r, petroR90_z
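The fields above can be assembled into a feature matrix as sketched below. The concentration in each band is taken as the ratio of the 50% to the 90% Petrosian radius, following the description of concentration given earlier. The sketch assumes the data is a NumPy structured array using the field names listed above plus a label column named `class`; that column name, the helper name and the sample rows are assumptions.

```python
import numpy as np

COLOR_COLS = ['u-g', 'g-r', 'r-i', 'i-z']
MOMENT_COLS = ['m4_u', 'm4_g', 'm4_r', 'm4_i', 'm4_z']
BANDS = ['u', 'r', 'z']

def generate_features_targets(data):
    """Build 13 features: 4 colors, eccentricity, 5 moments, 3 concentrations."""
    columns = [data[c] for c in COLOR_COLS]
    columns.append(data['ecc'])
    columns += [data[c] for c in MOMENT_COLS]
    # Concentration per band: 50% Petrosian radius over 90% Petrosian radius.
    columns += [data['petroR50_' + b] / data['petroR90_' + b] for b in BANDS]
    features = np.column_stack(columns)
    targets = data['class']
    return features, targets

# Two made-up rows with the assumed field names (illustrative values only).
names = (COLOR_COLS + ['ecc'] + MOMENT_COLS
         + ['petroR50_' + b for b in BANDS] + ['petroR90_' + b for b in BANDS])
dtype = [(n, float) for n in names] + [('class', 'U10')]
data = np.zeros(2, dtype=dtype)
for n in names:
    data[n] = 1.0
data['petroR50_u'] = [2.0, 3.0]
data['petroR90_u'] = [4.0, 12.0]
data['class'] = ['spiral', 'merger']

features, targets = generate_features_targets(data)
```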

Now, let’s split the data and generate the features, then train a **decision tree classifier** and perform *hold-out validation*: we predict the classes of the held-out galaxies and keep their actual classes for later comparison.
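A minimal sketch of this step, assuming scikit-learn's `DecisionTreeClassifier`. The synthetic 13-column feature matrix, the class label strings and the 70/30 split are stand-ins chosen for the example, not the real catalogue.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
classes = np.array(['elliptical', 'merger', 'spiral'])

# Synthetic data: 13 features, with the first one carrying the class signal.
labels = classes[rng.integers(0, 3, size=300)]
features = rng.normal(size=(300, 13))
features[:, 0] += 3 * (labels == 'spiral') - 3 * (labels == 'elliptical')

# Train on the first 70% of rows and hold out the rest.
split = int(0.7 * len(labels))
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(features[:split], labels[:split])

predictions = dtc.predict(features[split:])
actual = labels[split:]  # kept for later comparison
```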

The decision tree learnt with *grid search cross validation* is shown below:

The *accuracy* of classification problems is a lot simpler to calculate than for regression problems. The simplest measure is the fraction of objects that are correctly classified, that is, the number of correct predictions divided by the total number of predictions. The *accuracy measure* is often called the model score. While the way of calculating the score can vary depending on the model, the accuracy is the most common for classification problems.
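The fraction of correctly classified objects can be computed directly by comparing the two label arrays. The function name `calculate_accuracy` and the labels below are illustrative.

```python
import numpy as np

def calculate_accuracy(predicted, actual):
    """Fraction of objects whose predicted class matches the actual class."""
    return np.mean(predicted == actual)

predicted = np.array(['spiral', 'merger', 'spiral', 'elliptical'])
actual = np.array(['spiral', 'spiral', 'spiral', 'elliptical'])

accuracy = calculate_accuracy(predicted, actual)
print(accuracy)  # three of the four labels match
```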

In addition to an overall accuracy score, we’d also like to know where our model is going wrong. For example, were the incorrectly classified mergers misclassified as spirals or ellipticals? To answer this type of question we use a confusion matrix. The confusion matrix computed for our problem is shown below:
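As a toy illustration, separate from the result for the real dataset, a confusion matrix can be computed with scikit-learn's `confusion_matrix`. Rows correspond to the actual class and columns to the predicted class, in the order given by `labels`; the six labelled objects are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_labels = ['elliptical', 'merger', 'spiral']

# Made-up labels: two of the three mergers are misclassified.
actual = np.array(['merger', 'merger', 'merger', 'spiral', 'elliptical', 'spiral'])
predicted = np.array(['merger', 'spiral', 'elliptical', 'spiral', 'elliptical', 'spiral'])

cm = confusion_matrix(actual, predicted, labels=class_labels)
print(cm)
```

Reading along the merger row of `cm` shows how many true mergers ended up in each predicted class.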

### Random Forest

So far we have used a single *decision tree model*. However, we can improve the *accuracy* of our classification by using a collection (or ensemble) of trees known as a random forest.

A *random forest* is a collection of decision trees that have each been independently trained using different subsets of the training data and/or different combinations of features in those subsets.

When making a prediction, every tree in the forest gives its own prediction and the most common classification is taken as the overall forest prediction (in regression the mean prediction is used).
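The combination rule described above can be sketched without any machine learning library. The function names and the per-tree predictions are hypothetical; this illustrates the voting idea only and is not a description of how any particular library combines its trees internally.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Classification: return the most common class among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_mean(tree_predictions):
    """Regression: return the mean of the trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical predictions from individual trees for one galaxy.
votes = ['spiral', 'merger', 'spiral', 'elliptical', 'spiral']
overall_class = forest_vote(votes)

# Hypothetical redshift predictions from three regression trees.
overall_redshift = forest_mean([0.10, 0.12, 0.14])
```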

The following figure shows the *confusion matrix* computed with *random forest classifier*.

Did the *random forest* improve the *accuracy* of the model? The answer is yes: we see a substantial increase in accuracy. When we look at the 10-fold cross validation results, we see that the random forest systematically outperforms a single decision tree. The random forest is around *6-7%* more accurate than a standard decision tree.
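A comparison of this kind can be sketched with scikit-learn's `cross_val_score`, which returns one accuracy score per fold. The synthetic data, the number of trees and the random seeds are assumptions for the example, so the scores it prints are not the figures quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with a weak class signal in three of 13 features.
rng = np.random.default_rng(4)
labels = rng.integers(0, 3, size=300)
features = rng.normal(size=(300, 13))
features[:, :3] += labels[:, None]

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# 10-fold cross validation: one accuracy score per fold for each model.
tree_scores = cross_val_score(tree, features, labels, cv=10)
forest_scores = cross_val_score(forest, features, labels, cv=10)

print('tree  :', tree_scores.mean())
print('forest:', forest_scores.mean())
```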