Implementing a kd-tree for fast range search, nearest-neighbor search and k-nearest-neighbor search in 2D (with applications in simulating flocking boids, i.e., modeling the motion of a flock of birds, and in learning a kNN classifier, a supervised ML model for binary classification) in Java and Python

The following problem appeared as an assignment in the Coursera course Algorithms-I by Prof. Robert Sedgewick of Princeton University a few years back (and also in the course cos226 offered at Princeton). The problem definition and description are taken from the course website and lectures. The original assignment was to be done in Java; in this article both the Java and a corresponding Python implementation are described.

  • Use a 2d-tree to support
    • efficient range search (find all of the points contained in a query rectangle)
    • nearest-neighbor search (find a closest point to a query point).

    2d-trees have numerous applications, ranging from classifying astronomical objects to computer animation to speeding up neural networks to mining data to image retrieval. The figure below describes the problem:

    kdtree-ops.png

    2d-tree implementation: A 2d-tree is a generalization of a BST to two-dimensional keys. The idea is to build a BST with points in the nodes, using the x– and y-coordinates of the points as keys in strictly alternating sequence, starting with the x-coordinates, as shown in the next figure.

    kdtree3

    • Search and insert. The algorithms for search and insert are similar to those for BSTs, but at the root we use the x-coordinate (if the point to be inserted has a smaller x-coordinate than the point at the root, go left; otherwise go right); then at the next level, we use the y-coordinate (if the point to be inserted has a smaller y-coordinate than the point in the node, go left; otherwise go right); then at the next level the x-coordinate, and so forth.
    • The prime advantage of a 2d-tree over a BST is that it supports efficient implementation of range search and nearest-neighbor search. Each node corresponds to an axis-aligned rectangle, which encloses all of the points in its subtree. The root corresponds to the entire plane [(−∞, −∞), (+∞, +∞ )]; the left and right children of the root correspond to the two rectangles split by the x-coordinate of the point at the root; and so forth.
      • Range search: To find all points contained in a given query rectangle, start at the root and recursively search for points in both subtrees using the following pruning rule: if the query rectangle does not intersect the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a subtree only if it might contain a point contained in the query rectangle.
      • Nearest-neighbor search: To find a closest point to a given query point, start at the root and recursively search in both subtrees using the following pruning rule: if the closest point discovered so far is closer than the distance between the query point and the rectangle corresponding to a node, there is no need to explore that node (or its subtrees). That is, search a node only if it might contain a point that is closer than the best one found so far. The effectiveness of the pruning rule depends on quickly finding a nearby point. To do this, organize the recursive method so that when there are two possible subtrees to go down, you choose first the subtree that is on the same side of the splitting line as the query point; the closest point found while exploring the first subtree may enable pruning of the second subtree (a minimal Python sketch of the insert and nearest-neighbor routines is given after this list).
      • k-nearest neighbors search: This method returns the k points that are closest to the query point (in any order); return all n points in the data structure if n ≤ k. It must do this in an efficient manner, i.e. using the technique from kd-tree nearest neighbor search, not from brute force.
      • BoidSimulator: Once the k-nearest-neighbors search is implemented, we can simulate boids: how a flock of birds flies together and how a hawk preys on them. Behold their flocking majesty. The following figures show the theory that is going to be used, taken from the lecture slides of the same course.
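
The following is a minimal Python sketch of the 2d-tree insert and nearest-neighbor routines described above; the class and function names are illustrative (not the assignment's API), and the pruning test uses the distance to the splitting line rather than to the full node rectangle, which is a common simplification.

class Node:
    """One node of a 2d-tree; depth parity decides the splitting coordinate."""
    def __init__(self, point, depth=0):
        self.point, self.depth = point, depth
        self.left = self.right = None

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def insert(node, point, depth=0):
    if node is None:
        return Node(point, depth)
    axis = depth % 2                      # 0: compare x-coordinates, 1: compare y-coordinates
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or dist2(query, node.point) < dist2(query, best):
        best = node.point
    axis = node.depth % 2
    # go down the subtree on the same side of the splitting line as the query point first
    near, far = ((node.left, node.right) if query[axis] < node.point[axis]
                 else (node.right, node.left))
    best = nearest(near, query, best)
    # prune the far subtree unless it could contain a closer point
    if (query[axis] - node.point[axis]) ** 2 < dist2(query, best):
        best = nearest(far, query, best)
    return best

root = None
for p in [(0.7, 0.2), (0.5, 0.4), (0.2, 0.3), (0.4, 0.7), (0.9, 0.6)]:
    root = insert(root, p)
print(nearest(root, (0.3, 0.9)))          # -> (0.4, 0.7)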

    f1f2f3f4

    f5

    Results

    The following figures and animations show how the 2d-tree is grown with recursive space-partitioning for a few sample datasets.

  • Circle 10 dataset

circle10

test1test2test3test4test6test7test8test9

  • Circle 100 dataset

circle100.gif

test1test6test16test32test64test72test99
The following figure shows the result of the range search algorithm on the same dataset after the 2d-tree is grown. The yellow points are the points found by the algorithm inside the query rectangle shown.

test-range

The next animations show the nearest neighbor search algorithm for a given query point (the fixed white point with black border: the point (0.3, 0.9)) and how the branches are traversed and the points (nodes) are visited in the 2d-tree until the nearest neighbor is found.

nn.gif

nearest1.gif

The next animation shows how the kd-tree is traversed for nearest-neighbor search for a different query point (0.04, 0.7).

nearest2

The next figures show the result of k-nearest-neighbor search, by extending the previous algorithm with different values of k (15, 10, 5 respectively).

test_knn2test_knn3test_knn4

 

Runtime of the algorithms with a few datasets in Python

As can be seen from the next figure, the runtimes of 2d-tree building (insertion), nearest-neighbor search and k-nearest-neighbor queries depend not only on the size of the datasets but also on their geometry.

runtime.png

 

Flocking Boids simulator

Finally, the flocking-boids simulator is implemented with 2d-trees. The following two animations (Java and Python, respectively) show how the flock of birds flies together; the black / white ones are the boids and the red one is the predator hawk.

birds

birds1

 

Implementing a kNN Classifier with kd tree from scratch

Training phase

Build a 2d-tree from a labeled 2D training dataset (points marked with red or blue represent 2 different class labels).

Testing phase

  • For a query point (a new test point with an unknown class label), run a k-nearest-neighbor search on the 2d-tree (for a fixed value of k, e.g., 3).
  • Take a majority vote on the class labels of the k nearest neighbors (with known class labels) obtained by querying the 2d-tree. Label the query point with the class label that the majority of its neighbors have.
  • Repeat for different values of k (a minimal sketch of this procedure is given below).
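
A minimal sketch of this testing phase, assuming scipy's cKDTree stands in for the hand-rolled 2d-tree above; the toy dataset and function names are purely illustrative.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def knn_classify(train_points, train_labels, query_points, k=3):
    tree = cKDTree(train_points)                     # "training" = building the kd-tree
    _, idx = tree.query(query_points, k=k)           # indices of the k nearest neighbors
    idx = np.asarray(idx).reshape(len(query_points), -1)
    labels = np.asarray(train_labels)
    # majority vote on the neighbors' class labels
    return [Counter(labels[row]).most_common(1)[0][0] for row in idx]

# toy labeled 2D dataset: class 1 if x + y > 1, else class 0
X = np.random.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
print(knn_classify(X, y, [[0.9, 0.9], [0.1, 0.2]], k=5))   # most likely [1, 0]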

 

The following figures show how the kd-tree built can be used to classify (randomly generated) 2D datasets and how the decision boundaries learnt change with k = 3, 5 and 10, respectively.

test_classification3test_classification5test_classification10


Using Particle Filter for localization: tracking objects from noisy measurements in 2D (in R)

In this article, we shall see how a particle filter can be used to predict the positions of moving objects from a few sampled particles in 2D. This article is inspired by a programming assignment from the Coursera course Robotics Learning by the University of Pennsylvania. The same algorithm is often used for self-localization of a robot from noisy sensor measurements.

The basic idea for the Particle Filter is described in the figure below, taken from  this video by Bert Huang.

p1.png

 

The next few figures, again taken from the lectures of the Coursera course Robotics Learning, describe the basics of the particle filter.

p2.png
p3p4

  • The particle filter algorithm is described in the next figure, taken from a lecture video by Udacity.

    p5

 

  • The next set of figures / animations shows how the position of a moving bug is tracked using a particle filter.
    • First, the noisy measurements of the positions of the bug are obtained at different time instants.
    • Then M=100 particles are sampled and later propagated, based upon some prior assumptions on position uncertainty.
    • Next, the noisy measurement of the position is computed (simulated by adding random noise to the true position).
    • Measurement update step: for each particle, the probability of the particle is calculated to update the particle weights, assuming some measurement noise.
    • Next, the best particle is chosen to update the pose.
    • Re-sampling step: if the effective number of particles is smaller than a threshold (e.g., 0.8), the particles are re-sampled.
    • Next, the particles along with their probability weights are used to compute the estimated location (e.g., by computing the weighted average of the particle positions); a minimal sketch of a single filter step is given at the end of this section.
    • The next animation shows the steps; the probability cloud represents the density over the particles chosen at each step.

      motion2

    • The position of the bug, as shown in the animation above, is moving randomly in the x and y directions in a grid defined by the rectangle [-100,100]×[-100,100].
    • The above figure also shows how at different iterations the Particle Filter predicts the position of the bug.
    • The next animation shows the density of the particles for a different measurement uncertainty.

motion1.gif
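
The following is a minimal sketch of one particle-filter step for this 2D tracking problem; the motion / measurement noise levels are illustrative, and the resampling threshold of 0.8 is interpreted here as a fraction of the number of particles.

import numpy as np

M, motion_sigma, meas_sigma = 100, 2.0, 5.0
particles = np.random.uniform(-100, 100, size=(M, 2))    # initial particles in the grid
weights = np.ones(M) / M

def particle_filter_step(particles, weights, z):
    # 1. propagate the particles with motion noise (prior position uncertainty)
    particles = particles + np.random.normal(0, motion_sigma, particles.shape)
    # 2. measurement update: re-weight by the Gaussian likelihood of the measurement z
    d2 = np.sum((particles - z) ** 2, axis=1)
    weights = weights * np.exp(-d2 / (2 * meas_sigma ** 2))
    weights /= weights.sum()
    # 3. estimate the position as the weighted average of the particle positions
    estimate = np.average(particles, axis=0, weights=weights)
    # 4. resample if the effective number of particles drops below the threshold
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < 0.8 * M:
        idx = np.random.choice(M, size=M, p=weights)
        particles, weights = particles[idx], np.ones(M) / M
    return particles, weights, estimate

# one step with a (simulated) noisy measurement of the bug's true position
z = np.array([10.0, -20.0]) + np.random.normal(0, meas_sigma, 2)
particles, weights, est = particle_filter_step(particles, weights, z)
print(est)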

Unsupervised Deep learning with AutoEncoders on the MNIST dataset (with Tensorflow in Python)

  • Deep learning,  although primarily used for supervised classification / regression problems, can also be used as an unsupervised ML technique, the autoencoder being a classic example. As explained here,  the aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of  dimensionality reduction.
  • Deep learning can be used to learn a different representation (typically a set of input features in a  low-dimensional space) of the data that can be used for pre-training for example in transfer-learning.
  • In this article, a few vanilla autoencoder implementations will be demonstrated for the MNIST dataset.
  • The following figures describe the theory (ref: Coursera course Neural Networks for Machine Learning, 2012, by Prof. Hinton, University of Toronto). As explained, an autoencoder trained with back-propagation can be used to generalize linear dimensionality-reduction techniques such as PCA, since the hidden layers can learn non-linear manifolds with non-linear activation functions (e.g., ReLU, sigmoid).

 

p2p3

 

  • The input and output units of an autoencoder are identical; the idea is to learn the input itself as a different representation with one or multiple hidden layer(s).
  • The MNIST images are of size 28×28, so the number of nodes in the input and the output layer is always 784 for the autoencoders shown in this article.
  • The left side of an autoencoder network is typically a mirror image of the right side and the weights are tied (the weights learnt in the left-hand side of the network are reused, to better reproduce the input at the output).
  • The next figures and animations show the outputs of the following simple autoencoder with just 1 hidden layer, on the input MNIST data. A ReLU activation function is used in the hidden layer, along with L2 regularization on the weights learnt (a minimal TensorFlow sketch of this architecture is given after the figures below). As can be seen from the next figure, the inputs are roughly reproduced, with some variation, at the output layer, as expected.

aec1
a1i_0.001

a1o_0.001
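
A minimal sketch of such a single-hidden-layer autoencoder with tied weights, written against the TensorFlow 1.x-style API used in this post; the hidden size, regularization constant and learning rate are illustrative choices, not necessarily the ones used for the figures.

import tensorflow as tf

n_in, n_hidden, lam = 784, 400, 0.001
x = tf.placeholder(tf.float32, shape=[None, n_in])           # flattened 28x28 MNIST images

W = tf.Variable(tf.truncated_normal([n_in, n_hidden], stddev=0.1))
b_h = tf.Variable(tf.zeros([n_hidden]))
b_o = tf.Variable(tf.zeros([n_in]))

h = tf.nn.relu(tf.matmul(x, W) + b_h)                        # encoder
x_hat = tf.matmul(h, tf.transpose(W)) + b_o                  # decoder with tied (transposed) weights

loss = tf.reduce_mean(tf.square(x_hat - x)) + lam * tf.nn.l2_loss(W)
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# training loop (each batch is a [batch_size, 784] numpy array of MNIST images):
# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     for batch in batches:
#         sess.run(train_step, feed_dict={x: batch})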

  • The next animations visualize the hidden layer weights learnt (for the 400 hidden units) and the output of the autoencoder with the same input training dataset, with a different value of the regularization parameter.

 

a1h_10

a1o_10

 

  • The next figure visualizes the hidden layer weights learnt with yet another different regularization parameter value.

 

hidden

 

  • The next animation visualizes the output of the autoencoder with the same input training dataset, but this time with no activation function used at the hidden layer.

 

a1o_0.001.na

  • The next animations show the results with a deeper autoencoder with 3 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.

 

aec2

a2i_0.001a2o_0.001

 

  • Let's implement an even deeper autoencoder. The next animations show the results with a deeper autoencoder with 5 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.

 

aec3
a2i_0.001a3o_0.00001

 

Diffusion, PDE and Variational Methods in Image Processing and Computer Vision (Python implementation)

This article is inspired by the lecture videos by Prof. Daniel Cremers and also by the coursera course Image and Video Processing: From Mars to Hollywood with a Stop at the Hospital (by Duke University).

1. The Diffusion Equation and Gaussian Blurring

  • Diffusion is a physical process that gradually smooths out differences in the spatial concentration u(x,t) of a substance over time.
  • The following figure shows the PDE of general diffusion (from Fick's law); when the diffusivity g is a constant, the diffusion process becomes linear, isotropic and homogeneous.

 

f0.png

 

  • Here we shall concentrate mainly on linear (Gaussian blur) and non-linear (e.g., edge-preserving) diffusion techniques.
  • The following figures show that Gaussian blurring of an image can be thought of as heat flow, since the solution of the (linear) diffusion equation is a convolution with a Gaussian kernel. As shown for the 1-D case, the diffusion equation has a unique solution obtained by convolving the initial image with the Gaussian kernel G(0, σ=√(2t)), whose bandwidth grows with the square root of time (a quick numerical check follows the figures below).
  • We can verify that this solution satisfies the equation, but more generally the solution of the diffusion equation can be obtained analytically using the Fourier transform, as described in this post.

 

f1

f2
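
A quick numerical illustration of this equivalence, assuming scipy is available: diffusing a gray-scale image for time t amounts (up to discretization) to a Gaussian blur with σ = √(2t).

import numpy as np
from scipy.ndimage import gaussian_filter

t = 2.0
u0 = np.random.rand(128, 128)                       # any initial gray-scale image / concentration
u_t = gaussian_filter(u0, sigma=np.sqrt(2 * t))     # solution of the linear diffusion at time t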

  • Since linear diffusion can be implemented more efficiently with a Gaussian filter, let's focus on non-linear diffusion techniques. One interesting example of non-linear diffusion is the edge-preserving diffusion from the seminal paper by Perona and Malik, as shown in the next figure (a minimal implementation sketch follows it).

 

f4
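
A minimal sketch of Perona-Malik-style edge-preserving diffusion with explicit gradient-descent updates; the exponential diffusivity and the step size dt follow the usual formulation and are assumptions, not necessarily the exact code behind the animations below.

import numpy as np

def perona_malik(u, n_iter=100, kappa=10.0, dt=0.15):
    """Edge-preserving (anisotropic) diffusion on a gray-scale image u."""
    u = u.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)       # edge-preserving diffusivity g(|grad u|)
    for _ in range(n_iter):
        # differences towards the four neighbors (periodic boundary via np.roll)
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u,  1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u,  1, axis=1) - u
        # explicit update: diffuse strongly in flat regions, weakly across edges
        u += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u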

  • The next animations show the results of a non-linear diffusion implementation with gradient descent, with 100 iterations, using the edge-preserving diffusivity defined above, on the lena image with different λ (called kappa in the Perona-Malik paper) values: the first one with λ=10 and the second one with λ=100.

diffusion_k10_l1_4diffusion_k100_l1_4

  • The next animation shows blurring the same image with simple Gaussian-kernel filtering (equivalent to linear diffusion). As can be seen, the edge-preserving non-linear diffusion does a much better job at preserving edges for small values of the parameter λ.

lena_gauss.gif

  • The next animations show the results of a non-linear diffusion implementation with gradient descent, this time with 50 iterations, using the edge-preserving diffusivity on yet another image (monument) with different λ (called kappa in the Perona-Malik paper) values: the first one with λ=10 and the second one with λ=100.

diffusion_monument_k_10diffusion_monument_k_100

  • The next image shows the difference between the output after GD and the input image.

diffusion_diff_monument

The above image shows how edge detection can be done with edge-preserving non-linear diffusion. The next results compare the edge-detection result obtained using edge-preserving non-linear diffusion with the one obtained with the Canny edge detector on my image.

diffusion_00me
me
diffusion_49me
The next image shows the edges detected with the difference image obtained with non-linear diffusion, which are much subtler than Canny's.
diffusion_diff_49me
canny_me

2. Denoising Images with Variational Methods, Euler-Lagrange equation for the functional and Gradient Descent

  • The following figures describe the theory required to de-noise an image with variational methods, using gradient descent (a minimal sketch follows the figures).

 

f3f6
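
A minimal gradient-descent denoising sketch, assuming a quadratic (Tikhonov-style) energy E(u) = ∫ (u − f)² + λ|∇u|² dx whose Euler-Lagrange gradient is 2(u − f) − 2λΔu; the post's actual functional (e.g., a total-variation regularizer) may differ, so treat this only as an illustration of the descent scheme.

import numpy as np

def denoise(f, lam=1.0, tau=0.1, n_iter=100):
    """Gradient descent on E(u) = sum (u - f)^2 + lam * |grad u|^2 for a noisy image f."""
    u = f.astype(float).copy()
    for _ in range(n_iter):
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)    # discrete Laplacian
        grad_E = 2 * (u - f) - 2 * lam * lap                    # Euler-Lagrange gradient
        u -= tau * grad_E
    return u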

  • The following animations and figures show the implementation results of the above technique for a noisy ship image, taken from the book Variational Methods in Image Processing, by Luminita A. Vese and Carole Le Guyader.

denoise_ship1.gif

denoise_grad_val_99ship1.png

 

  • The following figure shows the difference image in between the original input image and the gradient descent output image.

 

denoise_grad_diff_99ship1.png

Dogs vs. Cats: Image Classification with Deep Learning using TensorFlow in Python

The problem

Given a set of labeled images of cats and dogs, a machine learning model is to be learnt and later used to classify a set of new images as cats or dogs. This problem appeared in a Kaggle competition and the images are taken from this Kaggle dataset.

  • The original dataset contains a huge number of images (25,000 labeled cat/dog images for training and 12,500 unlabeled images for test).
  • Only a few sample images are chosen (1100 labeled images for cat/dog as training and 1000 images from the test dataset) from the dataset, just for the sake of  quick demonstration of how to solve this problem using deep learning (motivated by the Udacity course Deep Learning by Google), which is going to be described (along with the results) in this article.
  • The sample test images chosen are manually labeled to compute model accuracy later with the model-predicted labels.
  • The accuracy on the test dataset is not going to be good in general for the above-mentioned reason. In order to obtain good accuracy  on the test dataset using deep learning, we need to train the models with a large number of input images (e.g., with all the training images from the kaggle dataset).
  • A few sample labeled images from the training dataset are shown below.

Dogs
download.png

Cats
download (1).png

  • As a pre-processing step, all the images are first resized to 50×50 pixel images.

Classification with a few off-the-shelf classifiers

  • First, each image from the training dataset is flattened and represented as a 2500-length vector (one for each channel).
  • Next, a few sklearn models are trained on this flattened data (a minimal sketch of this baseline is given after the results below). Here are the results:

m1
m2
m4m3

As shown above, the test accuracy is quite poor with a few sophisticated off-the-shelf classifiers.
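
A minimal version of this baseline pipeline; the random arrays below stand in for the real (flattened) cat / dog images and labels, and the particular classifiers shown are illustrative.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# stand-ins for the real data: (n, 50, 50) gray-scale images, labels 0 = cat, 1 = dog
images = np.random.rand(200, 50, 50)
labels = np.random.randint(0, 2, 200)

X = images.reshape(len(images), -1)               # flatten to 2500-length vectors
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(), SVC()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))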

Classifying images using Deep Learning with Tensorflow

Now let’s first train a logistic regression and then a couple of neural network models by introducing L2 regularization for both the models.

  • First, all the images are converted to gray-scale images.
  • The following figures visualize the weights learnt for the cat vs. the dog class during training of the logistic regression model with SGD with L2-regularization (λ=0.1, batch size=128).

    cd_nn_no_hidden.png

clr.png

Test accuracy: 53.6%
  • The following animation visualizes the weights learnt for 400 randomly selected hidden units using a neural net with a single hidden layer with 4096 hidden nodes, by training the neural net model with SGD with L2-regularization (λ1=λ2=0.05, batch size=128).

    cd_nn_hidden1.png

    Minibatch loss at step 0: 198140.156250
    Minibatch accuracy: 50.0%
    Validation accuracy: 50.0%

    Minibatch loss at step 500: 0.542070
    Minibatch accuracy: 89.8%
    Validation accuracy: 57.0%

    Minibatch loss at step 1000: 0.474844
    Minibatch accuracy: 96.9%
    Validation accuracy: 60.0%

    Minibatch loss at step 1500: 0.571939
    Minibatch accuracy: 85.9%
    Validation accuracy: 56.0%

    Minibatch loss at step 2000: 0.537061
    Minibatch accuracy: 91.4%
    Validation accuracy: 63.0%

    Minibatch loss at step 2500: 0.751552
    Minibatch accuracy: 75.8%
    Validation accuracy: 57.0%

    Minibatch loss at step 3000: 0.579084
    Minibatch accuracy: 85.9%
    Validation accuracy: 54.0%

    Test accuracy: 57.8%
    

 

n1.gif

Clearly, the model learnt above overfits the training dataset; the test accuracy improved a bit, but is still quite poor.

  • Now, let's train a deeper neural net with two hidden layers, the first one with 1024 nodes and the second one with 64 nodes.
    cd_nn_hidden2.png
  • Minibatch loss at step 0: 1015.947266
    Minibatch accuracy: 43.0%
    Validation accuracy: 50.0%
  • Minibatch loss at step 500: 0.734610
    Minibatch accuracy: 79.7%
    Validation accuracy: 55.0%
  • Minibatch loss at step 1000: 0.615992
    Minibatch accuracy: 93.8%
    Validation accuracy: 55.0%
  • Minibatch loss at step 1500: 0.670009
    Minibatch accuracy: 82.8%
    Validation accuracy: 56.0%
  • Minibatch loss at step 2000: 0.798796
    Minibatch accuracy: 77.3%
    Validation accuracy: 58.0%
  • Minibatch loss at step 2500: 0.717479
    Minibatch accuracy: 84.4%
    Validation accuracy: 55.0%
  • Minibatch loss at step 3000: 0.631013
    Minibatch accuracy: 90.6%
    Validation accuracy: 57.0%
  • Minibatch loss at step 3500: 0.739071
    Minibatch accuracy: 75.8%
    Validation accuracy: 54.0%
  • Minibatch loss at step 4000: 0.698650
    Minibatch accuracy: 84.4%
    Validation accuracy: 55.0%
  • Minibatch loss at step 4500: 0.666173
    Minibatch accuracy: 85.2%
    Validation accuracy: 51.0%
  • Minibatch loss at step 5000: 0.614820
    Minibatch accuracy: 92.2%
    Validation accuracy: 58.0%
Test accuracy: 55.2%
  • The following animation visualizes the weights learnt for 400 randomly selected hidden units from the first hidden layer, by training the neural net model with SGD with L2-regularization (λ1=λ2=λ3=0.1, batch size=128, dropout rate=0.6).

    h1c

  • The next animation visualizes the weights learnt for all the 64 hidden units of the second hidden layer.

    h2c

  • Clearly, the second, deeper neural net model learnt above overfits the training dataset more; the test accuracy decreased a bit.

Classifying images with Deep Convolution Network

Let’s use the following conv-net shown in the next figure.

 

f8.png

As shown above, the ConvNet uses the following (a minimal TensorFlow sketch of one such block is given after this list):

  • convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 1×1 stride 
    • SAME padding
  • Max pooling layers each with
    • 2×2 kernel
    • 2×2 stride 
  • 64 hidden nodes
  • 128 batch size
  • 5K iterations
  • 0.7 dropout rate
  • No learning decay
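
A minimal TensorFlow 1.x-style sketch of one convolution + max-pooling block followed by the fully connected part, with the layer sizes taken from the list above; the variable names, the single conv block shown and the use of 0.7 as the keep probability are assumptions, not the post's exact code.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 50, 50, 1])     # gray-scale 50x50 input images
y = tf.placeholder(tf.float32, shape=[None, 2])             # one-hot cat / dog labels

W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 16], stddev=0.1))   # 5x5 kernel, 16 filters
b1 = tf.Variable(tf.zeros([16]))
conv1 = tf.nn.relu(tf.nn.conv2d(x, W1, strides=[1, 1, 1, 1], padding='SAME') + b1)
pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

flat = tf.reshape(pool1, [-1, 25 * 25 * 16])                 # 50x50 halved by the 2x2 pool
W2 = tf.Variable(tf.truncated_normal([25 * 25 * 16, 64], stddev=0.1))
b2 = tf.Variable(tf.zeros([64]))
hidden = tf.nn.dropout(tf.nn.relu(tf.matmul(flat, W2) + b2), keep_prob=0.7)

W3 = tf.Variable(tf.truncated_normal([64, 2], stddev=0.1))
b3 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(hidden, W3) + b3

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)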

Results

Minibatch loss at step 0: 1.783917
Minibatch accuracy: 55.5%
Validation accuracy: 50.0%

Minibatch loss at step 500: 0.269719
Minibatch accuracy: 89.1%
Validation accuracy: 54.0%

Minibatch loss at step 1000: 0.045729
Minibatch accuracy: 96.9%
Validation accuracy: 61.0%

Minibatch loss at step 1500: 0.015794
Minibatch accuracy: 100.0%
Validation accuracy: 61.0%

Minibatch loss at step 2000: 0.028912
Minibatch accuracy: 98.4%
Validation accuracy: 64.0%

Minibatch loss at step 2500: 0.007787
Minibatch accuracy: 100.0%
Validation accuracy: 62.0%

Minibatch loss at step 3000: 0.001591
Minibatch accuracy: 100.0%
Validation accuracy: 63.0%

Test accuracy: 61.3%

The following animations show the features learnt (for the first 16 images for each SGD batch) at different convolution and Maxpooling layers:

c1n.gif

c2n.gif

c3n

c4n

  • Clearly, the simple convolution neural net outperforms all the previous models in terms of test accuracy, as shown below.

f14.png

  • Only 1100 labeled images (randomly chosen from the training dataset) were used to train the model and predict 1000 test images (randomly chosen from the test dataset). Clearly the accuracy can be improved a lot if a large number of images are used for training with deeper / more complex networks (with more parameters to learn).

 

 

 

Deep Learning with TensorFlow in Python: Convolution Neural Nets

The following problems appeared in the assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments (continued from the last post).

Classifying the alphabets with notMNIST dataset with Deep Network

Here is what some sample images from the dataset look like:

im13

Let’s try to get the best performance using a multi-layer model! (The best reported test accuracy using a deep network is 97.1%).

  • One avenue you can explore is to add multiple layers.
  • Another one is to use learning rate decay.

 

Learning L2-Regularized  Deep Neural Network with SGD

The following figure recapitulates the neural network with 3 hidden layers, the first one with 2048 nodes, the second one with 512 nodes and the third one with 128 nodes, each with ReLU intermediate outputs. The L2 regularizations applied on the loss function for the weights learnt at the input and the hidden layers are λ1, λ2, λ3 and λ4, respectively.

nn_hidden3.png

 

The next 3 animations visualize the weights learnt for 400 randomly selected nodes from hidden layer 1 (out of 2048 nodes), then another 400 randomly selected nodes from hidden layer 2 (out of 512 nodes), and finally all 128 nodes from hidden layer 3, at different steps of SGD with the L2-regularized loss function (with λ1 = λ2 = λ3 = λ4 = 0.01). As can be seen below, the weights learnt gradually capture (as the SGD step increases) the different features of the alphabets at the corresponding output neurons.

 

hidden1

 

hidden2
hidden3

Results with SGD

Initialized
Validation accuracy: 27.6%
Minibatch loss at step 0: 4.638808
Minibatch accuracy: 7.8%
Validation accuracy: 27.6%

Validation accuracy: 86.3%
Minibatch loss at step 500: 1.906724
Minibatch accuracy: 86.7%
Validation accuracy: 86.3%

Validation accuracy: 86.9%
Minibatch loss at step 1000: 1.333355
Minibatch accuracy: 87.5%
Validation accuracy: 86.9%

Validation accuracy: 87.3%
Minibatch loss at step 1500: 1.056811
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%

Validation accuracy: 87.5%
Minibatch loss at step 2000: 0.633034
Minibatch accuracy: 93.8%
Validation accuracy: 87.5%

Validation accuracy: 87.5%
Minibatch loss at step 2500: 0.696114
Minibatch accuracy: 85.2%
Validation accuracy: 87.5%

Validation accuracy: 88.3%
Minibatch loss at step 3000: 0.737464
Minibatch accuracy: 86.7%
Validation accuracy: 88.3%

Test accuracy: 93.6%

f2

 

Batch size = 128, number of iterations = 3001 and Drop-out rate = 0.8 for training dataset are used for the above set of experiments, with learning decay. We can play with the hyper-parameters to get better test accuracy.

Convolution Neural Network

Previously  we trained fully connected networks to classify notMNIST characters. The goal of this assignment is to make the neural network convolutional.

Let’s build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more expensive computationally, so we’ll limit its depth and number of fully connected nodes. The below figure shows the simplified architecture of the convolution neural net.

f2.png

As shown above, the ConvNet uses:

  • 2 convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 2×2 strides 
    • SAME padding
  • 64 hidden nodes
  • 16 batch size
  • 1K iterations


Results

Initialized
Minibatch loss at step 0: 3.548937
Minibatch accuracy: 18.8%
Validation accuracy: 10.0%

Minibatch loss at step 50: 1.781176
Minibatch accuracy: 43.8%
Validation accuracy: 64.7%

Minibatch loss at step 100: 0.882739
Minibatch accuracy: 75.0%
Validation accuracy: 69.5%

Minibatch loss at step 150: 0.980598
Minibatch accuracy: 62.5%
Validation accuracy: 74.5%

Minibatch loss at step 200: 0.794144
Minibatch accuracy: 81.2%
Validation accuracy: 77.6%

Minibatch loss at step 250: 1.191971
Minibatch accuracy: 62.5%
Validation accuracy: 79.1%

Minibatch loss at step 300: 0.441911
Minibatch accuracy: 87.5%
Validation accuracy: 80.5%

Minibatch loss at step 350: 0.605005
Minibatch accuracy: 81.2%
Validation accuracy: 79.3%

Minibatch loss at step 400: 1.032123
Minibatch accuracy: 68.8%
Validation accuracy: 81.5%

Minibatch loss at step 450: 0.869944
Minibatch accuracy: 75.0%
Validation accuracy: 82.1%

Minibatch loss at step 500: 0.530418
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%

Minibatch loss at step 550: 0.227771
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 600: 0.697444
Minibatch accuracy: 75.0%
Validation accuracy: 82.5%

Minibatch loss at step 650: 0.862341
Minibatch accuracy: 68.8%
Validation accuracy: 83.0%

Minibatch loss at step 700: 0.336292
Minibatch accuracy: 87.5%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213392
Minibatch accuracy: 93.8%
Validation accuracy: 82.6%

Minibatch loss at step 800: 0.553639
Minibatch accuracy: 75.0%
Validation accuracy: 83.3%

Minibatch loss at step 850: 0.533049
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.415935
Minibatch accuracy: 87.5%
Validation accuracy: 83.9%

Minibatch loss at step 950: 0.290436
Minibatch accuracy: 93.8%
Validation accuracy: 84.0%

Minibatch loss at step 1000: 0.400648
Minibatch accuracy: 87.5%
Validation accuracy: 84.0%

Test accuracy: 90.3%

f3.png

 

The following figures visualize the feature representations at different layers for the first 16 images for the last batch with SGD during training:

f4f5f6f7

The next animation shows how the features learnt at convolution layer 1 change with iterations.

c1.gif

 


Convolution Neural Network with Max Pooling

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides by a max pooling operation of stride 2 and kernel size 2. The below figure shows the simplified architecture of the convolution neural net with MAX Pooling layers.

f8.png

As shown above, the ConvNet uses:

  • 2 convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • 64 hidden nodes
  • 16 batch size
  • 1K iterations


Results

Initialized
Minibatch loss at step 0: 4.934033
Minibatch accuracy: 6.2%
Validation accuracy: 8.9%

Minibatch loss at step 50: 2.305100
Minibatch accuracy: 6.2%
Validation accuracy: 11.7%

Minibatch loss at step 100: 2.319777
Minibatch accuracy: 0.0%
Validation accuracy: 14.8%

Minibatch loss at step 150: 2.285996
Minibatch accuracy: 18.8%
Validation accuracy: 11.5%

Minibatch loss at step 200: 1.988467
Minibatch accuracy: 25.0%
Validation accuracy: 22.9%

Minibatch loss at step 250: 2.196230
Minibatch accuracy: 12.5%
Validation accuracy: 27.8%

Minibatch loss at step 300: 0.902828
Minibatch accuracy: 68.8%
Validation accuracy: 55.4%

Minibatch loss at step 350: 1.078835
Minibatch accuracy: 62.5%
Validation accuracy: 70.1%

Minibatch loss at step 400: 1.749521
Minibatch accuracy: 62.5%
Validation accuracy: 70.3%

Minibatch loss at step 450: 0.896893
Minibatch accuracy: 75.0%
Validation accuracy: 79.5%

Minibatch loss at step 500: 0.610678
Minibatch accuracy: 81.2%
Validation accuracy: 79.5%

Minibatch loss at step 550: 0.212040
Minibatch accuracy: 93.8%
Validation accuracy: 81.0%

Minibatch loss at step 600: 0.785649
Minibatch accuracy: 75.0%
Validation accuracy: 81.8%

Minibatch loss at step 650: 0.775520
Minibatch accuracy: 68.8%
Validation accuracy: 82.2%

Minibatch loss at step 700: 0.322183
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213779
Minibatch accuracy: 100.0%
Validation accuracy: 82.9%

Minibatch loss at step 800: 0.795744
Minibatch accuracy: 62.5%
Validation accuracy: 83.7%

Minibatch loss at step 850: 0.767435
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.354712
Minibatch accuracy: 87.5%
Validation accuracy: 83.8%

Minibatch loss at step 950: 0.293992
Minibatch accuracy: 93.8%
Validation accuracy: 84.3%

Minibatch loss at step 1000: 0.384624
Minibatch accuracy: 87.5%
Validation accuracy: 84.2%

Test accuracy: 90.5%

f9.png

As can be seen from the above results, with MAX POOLING, the test accuracy increased slightly.

The following figures visualize the feature representations at different layers for the first 16 images during training with Max Pooling:

f10f11f12f13

So far the convnets we have tried are quite small and we did not obtain high enough accuracy on the test dataset. Next we shall make our convnet deeper to increase the test accuracy.

Deep Convolution Neural Network with Max Pooling 

Let's try to get the best performance we can using a convolutional net. Look, for example, at the classic LeNet5 architecture, adding dropout and/or learning rate decay.

Let’s try with a few convnets:

1. The following ConvNet uses:

  • 2 convolution layers (with Relu) each using
    • 3×3 kernel
    • 16 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.01
  • single hidden layer (fully connected) with 1024 hidden nodes
  • 128 batch size
  • 3K iterations
  • 0.01 (=λ1=λ2) for regularization
  • No dropout
  • No learning decay

Results

Minibatch loss at step 0: 2.662903
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%

Minibatch loss at step 500: 2.493813
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%

Minibatch loss at step 1000: 0.848911
Minibatch accuracy: 82.8%
Validation accuracy: 79.6%

Minibatch loss at step 1500: 0.806191
Minibatch accuracy: 79.7%
Validation accuracy: 81.8%

Minibatch loss at step 2000: 0.617905
Minibatch accuracy: 85.9%
Validation accuracy: 84.5%

Minibatch loss at step 2500: 0.594710
Minibatch accuracy: 83.6%
Validation accuracy: 85.7%

Minibatch loss at step 3000: 0.435352
Minibatch accuracy: 91.4%
Validation accuracy: 87.2%

Test accuracy: 93.4%

 a1.png


As we can see, by introducing a couple of convolution layers, the accuracy increased from 90% (refer to the earlier blog) to 93.4% under the same settings.

Here is how the hidden-layer weights (400 out of 1024, chosen randomly) change; the features don't clearly resemble the alphabets anymore, which is quite expected.

a1.gif

2. The following ConvNet uses:

  • 2 convolution layers (with Relu) each using
    • 3×3 kernel
    • 32 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.1
  • hidden layers (fully connected) both with 256 hidden nodes
  • 128 batch size
  • 6K iterations
  • 0.7 dropout
  • learning decay starting with 0.1

Results

Minibatch loss at step 0: 9.452210
Minibatch accuracy: 10.2%
Validation accuracy: 9.7%
Minibatch loss at step 500: 0.611396
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%
Minibatch loss at step 1000: 0.442578
Minibatch accuracy: 85.9%
Validation accuracy: 83.3%
Minibatch loss at step 1500: 0.523506
Minibatch accuracy: 83.6%
Validation accuracy: 84.8%
Minibatch loss at step 2000: 0.411259
Minibatch accuracy: 89.8%
Validation accuracy: 85.8%
Minibatch loss at step 2500: 0.507267
Minibatch accuracy: 82.8%
Validation accuracy: 85.9%
Minibatch loss at step 3000: 0.414740
Minibatch accuracy: 89.1%
Validation accuracy: 86.6%
Minibatch loss at step 3500: 0.432177
Minibatch accuracy: 85.2%
Validation accuracy: 87.0%
Minibatch loss at step 4000: 0.501300
Minibatch accuracy: 85.2%
Validation accuracy: 87.1%
Minibatch loss at step 4500: 0.391587
Minibatch accuracy: 89.8%
Validation accuracy: 87.7%
Minibatch loss at step 5000: 0.347674
Minibatch accuracy: 90.6%
Validation accuracy: 88.1%
Minibatch loss at step 5500: 0.259942
Minibatch accuracy: 91.4%
Validation accuracy: 87.8%
Minibatch loss at step 6000: 0.392562
Minibatch accuracy: 85.9%
Validation accuracy: 88.4%

Test accuracy: 94.6%

p1

 

3. The following ConvNet uses:

  • 3 convolution layers (with Relu) each using
    • 5×5 kernel
    • with 16, 32 and 64 filters, respectively
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.1
  • hidden layers (fully connected) with 256, 128 and 64 hidden nodes respectively
  • 128 batch size
  • 10K iterations
  • 0.7 dropout
  • learning decay starting with 0.1

Results

Minibatch loss at step 0: 6.788681
Minibatch accuracy: 12.5%
Validation accuracy: 9.8%
Minibatch loss at step 500: 0.804718
Minibatch accuracy: 75.8%
Validation accuracy: 74.9%
Minibatch loss at step 1000: 0.464696
Minibatch accuracy: 86.7%
Validation accuracy: 82.8%
Minibatch loss at step 1500: 0.684611
Minibatch accuracy: 80.5%
Validation accuracy: 85.2%
Minibatch loss at step 2000: 0.352865
Minibatch accuracy: 91.4%
Validation accuracy: 85.9%
Minibatch loss at step 2500: 0.505062
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%
Minibatch loss at step 3000: 0.352783
Minibatch accuracy: 87.5%
Validation accuracy: 87.0%
Minibatch loss at step 3500: 0.411505
Minibatch accuracy: 88.3%
Validation accuracy: 87.9%
Minibatch loss at step 4000: 0.457463
Minibatch accuracy: 84.4%
Validation accuracy: 88.1%
Minibatch loss at step 4500: 0.369346
Minibatch accuracy: 89.8%
Validation accuracy: 88.7%
Minibatch loss at step 5000: 0.323142
Minibatch accuracy: 89.8%
Validation accuracy: 88.5%
Minibatch loss at step 5500: 0.245018
Minibatch accuracy: 93.8%
Validation accuracy: 89.0%
Minibatch loss at step 6000: 0.480509
Minibatch accuracy: 85.9%
Validation accuracy: 89.2%
Minibatch loss at step 6500: 0.297886
Minibatch accuracy: 92.2%
Validation accuracy: 89.3%
Minibatch loss at step 7000: 0.309768
Minibatch accuracy: 90.6%
Validation accuracy: 89.3%
Minibatch loss at step 7500: 0.280219
Minibatch accuracy: 92.2%
Validation accuracy: 89.5%
Minibatch loss at step 8000: 0.260540
Minibatch accuracy: 93.8%
Validation accuracy: 89.7%
Minibatch loss at step 8500: 0.345161
Minibatch accuracy: 88.3%
Validation accuracy: 89.6%
Minibatch loss at step 9000: 0.343074
Minibatch accuracy: 87.5%
Validation accuracy: 89.8%
Minibatch loss at step 9500: 0.324757
Minibatch accuracy: 92.2%
Validation accuracy: 89.9%
Minibatch loss at step 10000: 0.513597
Minibatch accuracy: 83.6%
Validation accuracy: 90.0%

Test accuracy: 95.5%

To be continued…

Some Machine Learning with Astronomy data (in Python)

The following problems appeared as assignments in the Coursera course Data-Driven Astronomy (by the University of Sydney). The descriptions of the problems are taken mostly from the course assignments and from https://groklearning.com/learn/data-driven-astro/.

1. Building a Regression Model to predict Redshift

The Sloan data (sdss_galaxy_colors) is going to be used for this purpose; the first few rows are shown below. The columns 'u' through 'z' are the flux magnitude columns. The data also includes spec_class and redshift_err columns.

f16

Now let’s compute four color features u – g, g – r, r – i and i – z. Our targets are the corresponding redshifts. The following shows the preprocessed data ready for training the regression model.

f17.png

The following figure shows how the features correlate with each other and also how the redshift changes with the feature values.

f18.png

Now let’s split our data into training and testing subsets, use our features and targets to train a regression tree from the training dataset and then make a prediction on the held-out dataset. How do we know if the tree is actually any good at predicting redshifts?

In regression we compare the predictions generated by our model with the actual values to test how well our model is performing. The difference between the predicted values and actual values (sometimes referred to as residuals) can tell us a lot about where our model is performing well and where it is not.

While there are a few different ways to characterize these differences, we will use the median of the differences between our predicted and actual values. This is given by

f13

This method of validation is the most basic approach to validation and is called hold-out validation. We will use the med_diff accuracy measure and hold-out validation to assess the accuracy of our decision tree (a minimal sketch is shown below).
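
A minimal hold-out sketch of this regression model and the med_diff metric; the .npy filename and the field names (u, g, r, i, z, redshift) follow the description above and are assumptions about the actual data file.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

def median_diff(predicted, actual):
    # median of the absolute differences between predicted and actual redshifts
    return np.median(np.abs(predicted - actual))

data = np.load('sdss_galaxy_colors.npy')          # hypothetical filename for the catalogue above
features = np.column_stack([data['u'] - data['g'], data['g'] - data['r'],
                            data['r'] - data['i'], data['i'] - data['z']])
targets = data['redshift']

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.3)
model = DecisionTreeRegressor().fit(X_train, y_train)
print('Median difference:', median_diff(model.predict(X_test), y_test))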

Median difference: 0.022068

A plot of how the colors change with redshift tells us that it is quite a complex function, for example redshift versus u – g:

f19
The following figures again show how a few of the features correlate with each other, how their joint density varies and also how the redshift changes with the feature values.

f20f21

The median of differences of ~0.02 means that half of our galaxies have an error in the prediction of < 0.02, which is pretty good. One of the reasons we chose the median of differences as our accuracy measure is that it gives a fair representation of the errors, especially when the distribution of errors is skewed. The graph below shows the distribution of residuals (differences) for our model along with the median and inter-quartile values.

f22

Overfitting

Decision / regression trees have some limitations though, one of the biggest being that they tend to overfit the data. What this means is that, if they are left unchecked, they will create an overly complicated tree that attempts to account for outliers in the data. This comes at the expense of the accuracy of the general trend.

Part of the reason for this over-fitting is that the algorithm works by trying to optimise the decision locally at each node. There are ways in which this can be mitigated, and in the next problem we will see how constraining the number of decision node rows (the tree depth) impacts the accuracy of our predictions.

In order to see how the regression tree is overfitting we would like to examine how our decision tree performs for different tree depths. Specifically, we would like to see how it performs on test data compared to the data that was used to train it.

Naïvely, we'd expect that the deeper the tree, the better it should perform. However, as the model overfits, we see a difference in its accuracy on the training data and on the more general testing data.

The following figures show the decision trees with maximum depth 2,3,4 and 5 learnt from the training dataset.

Regression Tree with max depth = 2

regression_tree3.png

Regression Tree with max depth = 3

regression_tree2

 

Regression Tree with max depth = 4

regression_tree1

Regression Tree with max depth = 5

regression_tree

We can see that the accuracy of the regression tree on the training set gets better as we allow the tree to grow to greater depths. In fact, at a depth of 27 our error goes to zero!

Conversely, the accuracy measure of the predictions for the test set gets better initially and then worse at larger tree depths. At a tree depth ~19 the regression tree starts to overfit the data. This means it tries to take into account outliers in the training set and loses its general predictive accuracy.

Overfitting is a common problem with decision / regression trees and can be circumvented by adjusting parameters like the tree depth or setting a minimum number of cases at each node. For now, we will set a maximum tree depth of 19 to prevent over-fitting in our redshift problem.

Depth with lowest median difference : 19

f23

 

K-Fold cross validation

The method we used to validate our model so far is known as hold-out validation. Hold out validation splits the data in two, one set to test with and the other to train with. Hold out validation is the most basic form of validation.

While hold-out validation is better than no validation, the measured accuracy (i.e., our median of differences) will vary depending on how we split the data into testing and training subsets. The med_diff that we get from one randomly sampled training set will differ from that of a different random training set of the same size.

In order to be more certain of our model's accuracy we should use k-fold cross validation. k-fold validation works in a similar way to hold-out validation, except that we split the data into k subsets. We train and test the model k times, recording the accuracy each time. Each time we use a different combination of k-1 subsets to train the model and the remaining subset to test. We take the average of the accuracy measurements to be the overall accuracy of the model.
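
A minimal k-fold version of the same validation (k = 10 assumed), reusing the features, targets and median_diff from the regression sketch above.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(features):
    model = DecisionTreeRegressor(max_depth=19)               # depth capped to limit over-fitting
    model.fit(features[train_idx], targets[train_idx])
    scores.append(median_diff(model.predict(features[test_idx]), targets[test_idx]))
print('k-fold median difference:', np.mean(scores))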

It is an important part of assessing the accuracy of any machine learning model. When we plot our predicted vs. measured redshifts, we are able to see that for many of our galaxies we get a reasonably accurate prediction of redshift. However, there are also several outliers where our model does not give a good prediction.

f24.png

Our sample of galaxies consists of two different populations: regular galaxies and quasi-stellar objects (QSOs). QSOs are a type of galaxy that contain an actively (and intensely) accreting supermassive black hole. This is often referred to as an Active Galactic Nucleus (AGN).

agn.png

The light emitted from the AGN is significantly brighter than the rest of the galaxy and we are able to detect these QSOs out to much higher redshifts. In fact, most of the normal galaxies we have been using to create our models have redshifts less than z~0.4, while the QSOs have redshifts all the way out to z~6. Due to this contribution from the AGN, the flux magnitudes measured at different wavelengths might not follow the typical profile we assumed when predicting redshifts.

Next we are going to look at whether there is a difference in the accuracy of the decision trees between QSOs and regular galaxies.

 

2. Exploring Machine Learning Classification to predict galaxy classes

There is a wide range of galaxy types observed by the Sloan Digital Sky Survey in the Galaxy Zoo. In this activity, we will limit our dataset to three types of galaxy: spirals, ellipticals and mergers, as shown below.

 

Classes.png

 

The galaxy catalog we are using is a sample of galaxies where at least 20 human classifiers (such as yourself) have come to a consensus on the galaxy type. Examples of spiral and elliptical galaxies were selected where there was a unanimous classification. Due to low sample numbers, we included merger examples where at least 80% of human classifiers selected the merger class. We need this high quality data to train our classifier.

The features that we will be using to do our galaxy classification are color index, adaptive moments, eccentricities and concentrations. These features are provided as part of the SDSS catalogue.

Color indices are the same colors (u-g, g-r, r-i, and i-z) we used for regression. Studies of galaxy evolution tell us that spiral galaxies have younger star populations and therefore are ‘bluer’ (brighter at lower wavelengths). Elliptical galaxies have an older star population and are brighter at higher wavelengths (‘redder’).

Eccentricity approximates the shape of the galaxy by fitting an ellipse to its profile. Eccentricity is the ratio of the two axes (semi-major and semi-minor). The de Vaucouleurs model was used to obtain these two axes. To simplify our experiments, we will use the median eccentricity across the 5 filters.

Adaptive moments also describe the shape of a galaxy. They are used in image analysis to detect similar objects at different sizes and orientations. We use the fourth moment here for each band.

Concentration is similar to the luminosity profile of the galaxy, which measures what proportion of a galaxy’s total light is emitted within what radius. A simplified way to represent this is to take the ratio of the radii containing 50% and 90% of the Petrosian flux.

The Petrosian method allows us to compare the radial profiles of galaxies at different distances. If you are interested, you can read more here on the need for the Petrosian approach. We will use the concentration from the u, r and z bands. For these experiments, we will define concentration as:

f14.png

We have extracted the SDSS and Galaxy Zoo data for 780 galaxies; the first few rows of the dataset are shown below:

 

f25.png

As described earlier, the data has the following fields:

  • colors: u-g, g-r, r-i, and i-z
  • eccentricity: ecc
  • 4th adaptive moments: m4_u, m4_g, m4_r, m4_i, and m4_z;
  • 50% Petrosian: petroR50_u, petroR50_r, petroR50_z;
  • 90% Petrosian: petroR90_u, petroR90_r, petroR90_z.

Now, let's split the data, generate the features, train a decision tree classifier and perform a hold-out validation, predicting the classes of the held-out galaxies for later comparison with their actual classes (a minimal sketch is shown below).
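
A minimal sketch of the feature generation and decision-tree training; the field names below follow the list above and, together with the structured-array loading, are assumptions about the actual catalogue file.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# data: structured array (assumed to be loaded beforehand) with the fields listed above
features = np.column_stack([
    data['u-g'], data['g-r'], data['r-i'], data['i-z'],              # colors
    data['ecc'],                                                     # eccentricity
    data['m4_u'], data['m4_g'], data['m4_r'], data['m4_i'], data['m4_z'],
    data['petroR50_u'] / data['petroR90_u'],                         # concentrations
    data['petroR50_r'] / data['petroR90_r'],
    data['petroR50_z'] / data['petroR90_z'],
])
targets = data['class']                                              # 'spiral' / 'elliptical' / 'merger'

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.3)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('accuracy:', np.mean(clf.predict(X_test) == y_test))           # fraction correctly classified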

The decision tree learnt with grid search cross validation is shown below:

decision_tree

The accuracy of classification problems is a lot simpler to calculate than for regression problems. The simplest measure is the fraction of objects that are correctly classified, as shown below. The accuracy measure is often called the model score. While the way of calculating the score can vary depending on the model, the accuracy is the most common for classification problems.

f26

In addition to an overall accuracy score, we'd also like to know where our model is going wrong. For example, were the incorrectly classified mergers misclassified as spirals or ellipticals? To answer this type of question we use a confusion matrix. The confusion matrix computed for our problem is shown below:

confusion.png

Random Forest

So far we have used a single decision tree model. However, we can improve the accuracy of our classification by using a collection (or ensemble) of trees known as a random forest.

A random forest is a collection of decision trees that have each been independently trained using different subsets of the training data and/or different combinations of features in those subsets.

When making a prediction, every tree in the forest gives its own prediction and the most common classification is taken as the overall forest prediction (in regression the mean prediction is used); swapping the forest in for the single tree above takes only a couple of lines, as sketched below.
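
A minimal swap-in of a random forest for the single decision tree above, reusing the train/test split from the previous sketch; n_estimators is an illustrative choice.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print('random forest accuracy:', np.mean(rf.predict(X_test) == y_test))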

The following figure shows the confusion matrix computed with random forest classifier.

confusion2.png

Did the random forest improve the accuracy of the model? The answer is yes: we see a substantial increase in accuracy. When we look at the 10-fold cross-validation results, we see that the random forest systematically outperforms a single decision tree: the random forest is around ~6-7% more accurate than a standard decision tree.

Some Analysis with Astronomy data (in Python)

Data-Driven Astronomy

The following problems appeared as assignments in the Coursera course Data-Driven Astronomy (by the University of Sydney). The descriptions of the problems are taken mostly from the course assignments and from https://groklearning.com/learn/data-driven-astro/.

 

One of the most widely used formats for astronomical images is the Flexible Image Transport System. In a FITS file, the image is stored in a numerical array. The FITS files shown below are some of the publicly available ones downloaded from the following sites:

f0.png

 

1. Computing the Mean and Median Stacks from a set of noisy FITS files

In this assignment, we shall focus on calculating the mean of a stack of FITS files. Each individual file may or may not have a detected pulsar, but in the final stack we should be able to see a clear detection.

Computing the Mean FITS

The following figure shows 5 noisy FITS files, which will be used to compute the mean FITS file.

f3.png

The following figure shows the mean FITS file computed from the above FITS files. The mean being an algebraic measure, it is possible to compute a running mean by loading one file at a time into memory (a minimal sketch follows the figure).

imm.png
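
A minimal running-mean stack over FITS files using astropy; the file names are illustrative, and only one image needs to be in memory at any time.

import numpy as np
from astropy.io import fits

def mean_fits(filenames):
    total, n = None, 0
    for fname in filenames:
        with fits.open(fname) as hdulist:
            data = hdulist[0].data.astype(float)     # image array of the primary HDU
        total = data if total is None else total + data
        n += 1
    return total / n

# mean_image = mean_fits(['image0.fits', 'image1.fits', 'image2.fits'])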

Median FITS

Now let's look at a different statistical measure: the median, which in many cases is considered to be a better measure than the mean due to its robustness to outliers. The median can be a more robust measure of the average trend of a dataset than the mean, as the latter is easily skewed by outliers.

However, a naïve implementation of the median algorithm can be very inefficient when dealing with large datasets. The median, being a holistic measure, requires all the data to be loaded into memory for exact computation when implemented in a naïve manner.

Calculating the median requires all the data to be in memory at once. This is an issue in typical astrophysics calculations, which may use hundreds of thousands of FITS files. Even with a machine with lots of RAM, it's not going to be possible to find the median of more than a few tens of thousands of images. This isn't an issue for calculating the mean, since the sum only requires one image to be added at a time.

Computing the approximate running median: the BinApprox Algorithm

If there were a way to calculate a “running median” we could save space by only having one image loaded at a time. Unfortunately, there’s no way to do an exact running median, but there are ways to do it approximately.

The binapprox algorithm (http://www.stat.cmu.edu/~ryantibs/papers/median.pdf) does just this. The idea behind it is to find the median from the data’s histogram.

First of all, it can be proved that the median always lies within one standard deviation of the mean, as shown below:

f4.png

The algorithm to find the running approximate median is shown below:

f5.png

As soon as the relevant bin is updated the data point being binned can be removed from memory. So if we’re finding the median of a bunch of FITS files we only need to have one loaded at any time. (The mean and standard deviation can both be calculated from running sums so that still applies to the first step).

The downside of using binapprox is that we only get an answer accurate to σ/B by using B bins. Scientific data comes with its own uncertainties though, so as long as you keep B large enough this isn't necessarily a problem.
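
A minimal single-pass sketch of binapprox, following the algorithm in the figure above: bin the values lying within one standard deviation of the mean into B bins, count everything below that range, and walk the bins until the cumulative count passes the midpoint.

import numpy as np

def binapprox_median(values, B=100):
    values = np.asarray(values, dtype=float)
    n, mu, sigma = len(values), values.mean(), values.std()
    lo, hi = mu - sigma, mu + sigma
    ignored = np.sum(values < lo)                       # count of values below mu - sigma
    in_range = values[(values >= lo) & (values < hi)]
    counts, edges = np.histogram(in_range, bins=B, range=(lo, hi))
    # walk the bins until the cumulative count passes the midpoint (N + 1) / 2
    total, target = ignored, (n + 1) / 2
    for b in range(B):
        total += counts[b]
        if total >= target:
            return (edges[b] + edges[b + 1]) / 2        # midpoint of the median bin
    return hi                                           # fallback (median above mu + sigma)

print(binapprox_median(np.random.randn(10**6)))         # close to 0, within sigma / B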

The following figure shows the histogram of 1 million data points generated randomly. Now the binapprox algorithm will be used to compute the running median, and the error in the approximation will be computed for different numbers of bins B.

f6

f7

f8

As can be seen from above, as the number of bins B increases, the error in the approximation of the running median decreases.

 

Computing the Median FITS

Now we can use the binapprox algorithm to efficiently estimate the median of each pixel from a set of astronomy images in FITS files. The following figure shows 10 noisy FITS files, which will be used to compute the median FITS file.

f12.png

The following figure shows the median FITS file computed from the above FITS files using the binapprox algorithm.

immed.png

 

2. Cross-matching

When investigating astronomical objects, like active galactic nuclei (AGN), astronomers compare data about those objects from different telescopes at different wavelengths.

This requires positional cross-matching to find the closest counterpart within a given radius on the sky.

In this activity you’ll cross-match two catalogues: one from a radio survey, the AT20G Bright Source Sample (BSS) catalogue and one from an optical survey, the SuperCOSMOS all-sky galaxy catalogue.

The BSS catalogue lists the brightest sources from the AT20G radio survey while the SuperCOSMOS catalogue lists galaxies observed by visible light surveys. If we can find an optical match for our radio source, we are one step closer to working out what kind of object it is, e.g. a galaxy in the local Universe or a distant quasar.

We’ve chosen one small catalogue (BSS has only 320 objects) and one large one (SuperCOSMOS has about 240 million) to demonstrate the issues you can encounter when implementing cross-matching algorithms.

The positions of stars, galaxies and other astronomical objects are usually recorded in either equatorial or Galactic coordinates.

Equatorial coordinates are fixed relative to the celestial sphere, so the positions are independent of when or where the observations took place. They are defined relative to the celestial equator (which is in the same plane as the Earth’s equator) and the ecliptic (the path the sun traces throughout the year).

A point on the celestial sphere is given by two coordinates:

  1. Right ascension: the angle from the vernal equinox to the point, going east along the celestial equator.
  2. Declination: the angle from the celestial equator to the point, going north (negative values indicate going south).

The vernal equinox is the intersection of the celestial equator and the ecliptic where the ecliptic rises above the celestial equator going further east.

  • Right ascension is often given in hours-minutes-seconds (HMS) notation, because it was convenient to calculate when a star would appear over the horizon. A full circle in HMS notation is 24 hours, which means 1 hour in HMS notation is equal to 15 degrees. Each hour is split into 60 minutes and each minute into 60 seconds.
  • Declination, on the other hand, is traditionally recorded in degrees-minutes-seconds (DMS) notation. A full circle is 360 degrees, each degree has 60 arcminutes and each arcminute has 60 arcseconds.

To crossmatch two catalogs we need to compare the angular distance between objects on the celestial sphere.

People loosely call this a "distance", but technically it's an angular distance: the projected angle between objects as seen from Earth.

Angular distances have the same units as angles (degrees). There are other equations for calculating the angular distance but this one, called the haversine formula, is good at avoiding floating point errors when the two points are close together.

If we have an object on the celestial sphere with right ascension and declination (α1,δ1), then the angular distance to another object with coordinates (α2,δ2) is given below:

f1
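
A direct Python translation of the haversine formula above, working entirely in degrees (NumPy is used so it also vectorises over whole arrays of coordinates):

import numpy as np

def angular_dist(ra1, dec1, ra2, dec2):
    """Haversine angular distance; inputs and result in decimal degrees."""
    r1, d1, r2, d2 = np.radians([ra1, dec1, ra2, dec2])
    a = np.sin(np.abs(d1 - d2) / 2) ** 2
    b = np.cos(d1) * np.cos(d2) * np.sin(np.abs(r1 - r2) / 2) ** 2
    return np.degrees(2 * np.arcsin(np.sqrt(a + b)))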

Before we can crossmatch our two catalogs we first have to import the raw data. Every astronomy catalog tends to have its own unique format so we’ll need to look at how to do this with each one individually.

We’ll look at the AT20G bright source sample survey first. The raw data we’ll be using is the file table2.dat from this page (http://cdsarc.u-strasbg.fr/viz-bin/Cat?J/MNRAS/384/775#sRM2.2) in the VizieR archives, but we’ll use the filename bss.dat from now on.

Every catalogue in VizieR has a detailed README file that gives you the exact format of each table in the catalogue.

The full catalogue of bright radio sources contains 320 objects. The first few rows look like this:

f2.png

The catalogue is organised in fixed-width columns; the format of the columns is listed below, followed by a short loading sketch:

  • 1: Object catalogue ID number (sometimes with an asterisk)
  • 2-4: Right ascension in HMS notation
  • 5-7: Declination in DMS notation
  • 8-: Other information, including spectral intensities
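
A loading sketch for bss.dat, under the assumption that the HMS and DMS fields are the whitespace-separated columns 2-7 of every row (np.loadtxt only converts the requested columns, so the trailing text columns are ignored); hms2dec and dms2dec are the conversion helpers sketched earlier.

import numpy as np

# Columns 1..6 (0-indexed) hold RA hours/min/sec and Dec degrees/arcmin/arcsec.
data = np.loadtxt('bss.dat', usecols=range(1, 7))

bss_cat = []
for row_id, (h, m, s, d, dm, ds) in enumerate(data, start=1):
    # Store (catalogue id, RA in degrees, Dec in degrees).
    bss_cat.append((row_id, hms2dec(h, m, s), dms2dec(d, dm, ds)))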

The SuperCOSMOS all-sky catalogue is a catalogue of galaxies generated from several visible light surveys.

The original data is available on this page (http://ssa.roe.ac.uk/allSky) in a package called SCOS_stepWedges_inWISE.csv.gz. The file is extracted to a csv file named super.csv.

The first few lines of super.csv look like this:

 

f9.png

The catalog uses a comma-separated value (CSV) format. Aside from the first row, which contains column labels, the format is as follows (a loading sketch is given after the list):

  • 1: Right ascension in decimal degrees
  • 2: Declination in decimal degrees
  • 3: Other data, including magnitude and apparent shape.
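
And a corresponding sketch for super.csv, where the first two comma-separated columns are already in decimal degrees:

import numpy as np

# Skip the header row of column labels; keep only RA and Dec.
data = np.loadtxt('super.csv', delimiter=',', skiprows=1, usecols=[0, 1])
super_cat = [(i + 1, ra, dec) for i, (ra, dec) in enumerate(data)]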

Naive Cross-matcher

Let’s implement a naive crossmatch function that crossmatches two catalogs within a maximum distance and returns a list of matches and non-matches for the first catalog (BSS) against the second (SuperCOSMOS). The maximum distance is given in decimal degrees (e.g., with a maximum distance of 5 degrees, the nearest object within 5 degrees of a BSS source is considered its match).
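
A minimal sketch of such a naive cross-matcher, using the angular_dist helper from earlier; both catalogues are assumed to be lists of (id, ra, dec) tuples in decimal degrees.

import numpy as np

def naive_crossmatch(cat1, cat2, max_dist):
    """For every object in cat1, scan all of cat2 for its nearest neighbour."""
    matches, no_matches = [], []
    for id1, ra1, dec1 in cat1:
        best_id, best_dist = None, np.inf
        for id2, ra2, dec2 in cat2:
            dist = angular_dist(ra1, dec1, ra2, dec2)
            if dist < best_dist:
                best_id, best_dist = id2, dist
        # Accept the nearest neighbour only if it lies within max_dist degrees.
        if best_dist < max_dist:
            matches.append((id1, best_id, best_dist))
        else:
            no_matches.append(id1)
    return matches, no_matches

The running time is O(N × M) angular-distance evaluations for N objects in the first catalogue and M in the second, which is exactly why the timings below grow so quickly as the second catalogue gets larger.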

The 320 objects in the BSS catalog are compared with the first n objects from the SuperCOSMOS catalog. The value of n is gradually increased from 500 to
125,000 and the impact on the running time and the number of matched objects is noted.

The figures below show the time taken for the naive cross-matching as the number of objects in the second catalog is increased, and also the final matches produced, drawn as a bipartite graph.

f10.png

 

bipartite.png

 

An efficient Cross-Matcher

Cross-matching is a very common task in astrophysics, so optimized implementations of it already exist. A popular approach uses objects called k-d trees to perform crossmatching incredibly quickly: a k-d tree is constructed out of the second catalogue, which then allows an efficient search for a match for each object in the first catalogue. Constructing a k-d tree is similar in spirit to binary search: the k-dimensional space is divided into two parts recursively until each division contains only a single object. Creating a k-d tree from an astronomy catalogue works like this:

  1. Find the object with the median right ascension and split the catalogue into two partitions: objects to the left and to the right of it
  2. Find the object with the median declination in each partition and split each partition into smaller partitions: objects below and above it
  3. Find the object with the median right ascension in each of these partitions and split them into smaller partitions: objects to the left and to the right of it
  4. Repeat 2-3 until each partition only has one object in it

This creates a binary tree where each object used to split a partition (a node) links to the two objects that then split the partitions it has created (its children), similar to the one shown in the image below.

 

k-d_tree_standin.png
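
A compact sketch of this median-split construction over (ra, dec) pairs, using a plain dictionary per node purely for illustration (the optimized implementations mentioned above use more compact, array-based layouts):

def build_kdtree(points, depth=0):
    # points is a list of (ra, dec) tuples; alternate the splitting axis
    # (0: right ascension, 1: declination) at each level of the tree.
    if not points:
        return None
    if len(points) == 1:
        return {'point': points[0], 'left': None, 'right': None}
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2            # the median object becomes this node
    return {'point': points[mid],
            'left':  build_kdtree(points[:mid], depth + 1),
            'right': build_kdtree(points[mid + 1:], depth + 1)}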

Once a k-d tree is constructed out of a catalogue, finding a match to an object then works like this:

  1. Calculate the distance from the object to highest level node (the root node), then go to the child node closest (in right ascension) to the object
  2. Calculate the distance from the object to this child, then go to the child node closest (in declination) to the object
  3. Calculate the distance from the object to this child, then go to the child node closest (in right ascension) to the object
  4. Repeat 2-3 until you reach a child node with no further children (a leaf node)
  5. Find the shortest distance of all distances calculated, this corresponds to the closest object

Since each node branches into two children, a catalogue of N objects will have, on average, log2(N) nodes from the root to any leaf, as shown in the following figure.
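
One concrete way to do this in Python (a sketch, not the course's exact solution) is astropy's SkyCoord.match_to_catalog_sky, which builds a k-d tree over the second catalogue internally and queries it for the nearest neighbour of every object in the first; the (id, ra, dec) tuple format is the same assumption as in the naive version above.

import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

def kdtree_crossmatch(cat1, cat2, max_dist):
    ids1, ra1, dec1 = map(np.array, zip(*cat1))
    ids2, ra2, dec2 = map(np.array, zip(*cat2))
    coords1 = SkyCoord(ra=ra1 * u.degree, dec=dec1 * u.degree, frame='icrs')
    coords2 = SkyCoord(ra=ra2 * u.degree, dec=dec2 * u.degree, frame='icrs')
    # Nearest neighbour in cat2 for every object in cat1 (k-d tree inside).
    idx, sep2d, _ = coords1.match_to_catalog_sky(coords2)
    dists = sep2d.degree
    matched = dists < max_dist
    matches = list(zip(ids1[matched], ids2[idx[matched]], dists[matched]))
    no_matches = list(ids1[~matched])
    return matches, no_matches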

k-d_tree_search_standin2.png

f11.png

bipartite2.png

 

As can be seen above, the kd-tree based implementation is much faster than the naive counterpart for cross-matching. While the naive implementation takes more than 20 minutes to match against 125,000 objects in the second catalog, the kd-tree based implementation takes only about 1 second to produce the same set of results, as shown above.

An Open Science Project on Statistics: Doing the power analysis, equivalence test, NHST t-test, POST and computing the Bayes Factor to compare the IMDB Ratings of a few most recent movies by the legendary film directors Satyajit Ray and Akira Kurosawa (in R)

The following appeared as a project assignment (using Open Science Framework) in the coursera course Improving your Statistical Inferences (by Eindhoven University of Technology). The project is available here.

First we need to do a pre-registration to control the (type-I) error rates and reduce publication bias, as required by the OSF and shown below:

f1.png

Theoretical hypothesis

The theoretical hypothesis we are going to test is the following: both Satyajit Ray (from Kolkata, India) and Akira Kurosawa (from Japan) are great directors, both of them won the Academy Award for their Lifetime Achievement. Because they are both great, the movies they directed are equally good.

Dependent Variables to be measured

  • The dependent variables to be measured are the IMDB ratings (scores) and the number of users who rated each movie.
  • First, an IMDB search will be run separately for each of the two legendary directors to get all the hits.
  • Then the search results will be sorted by release date, and the 29 most recent full movies (excluding documentaries / TV series) will be used.
  • So in this case, we shall use the 29 most recent movies each of Satyajit Ray and Akira Kurosawa directed (excluding documentaries / TV series), as of the date we did the IMDB search.
  • The following table shows the data collected for Satyajit Ray movies.
Movie Rating (Out of 10) #Users Rated Release Year
The Stranger (Agantuk) 8.1 1760 1991
Shakha Prosakha 7.6 453 1990
Ganashatru 7.3 662 1989
Ghare Baire 7.7 812 1984
Hirak Rajar Deshe 8.8 1387 1980
Jai Baba Felunath 7.9 1086 1979
Shatranj Ke Khilari 7.8 2370 1977
Jana Aranya 8.3 887 1976
Sonar Kella 8.5 1308 1974
Distant Thunder (Ashani Sanket) 8.2 908 1973
Company Limited (Seemabaddha) 8.0 782 1971
Pratidwandi 8.2 1051 1970
Days and Nights in the Forest (Aranyer Din Ratri) 8.3 1720 1970
Goopy Gayen Bagha Bayen 8.8 1495 1969
Chiriyakhana 7.2 477 1967
Nayak 8.3 1974 1966
Mahapurush 7.3 719 1965
The Coward (Kapurush) 7.8 858 1965
Charulata 8.3 3597 1964
Mahanagar 8.3 2275 1963
Abhijaan 8.0 781 1962
Kanchenjungha 8.0 706 1962
Teen Kanya 8.2 991 1961
Devi 8.0 1407 1960
The World of Apu (Apur Sansar) 8.2 8058 1959
The Music Room (Jalshaghar) 8.1 3872 1958
Paras-Pathar 7.8 723 1958
Aparajito 8.2 7880 1956
Pather Panchali 8.4 15799 1955
  • The following table shows the data collected for Akira Kurosawa movies.
Movie Rating (Out of 10) #Users Rated Release Year
Maadadayo 7.4 4035 1993
Rhapsody in August 7.3 5131 1991
Dreams 7.8 19373 1990
Ran 8.2 84277 1985
Kagemusha 8.0 25284 1980
Dersu Uzala 8.3 18898 1975
Dodes’ka-den 7.5 4839 1970
Red Beard 8.3 12295 1965
High and Low 8.4 19989 1963
Sanjuro 8.2 22296 1962
Yojimbo 8.3 80906 1961
The Bad Sleep Well 8.1 8082 1960
The Hidden Fortress 8.1 25980 1958
The Lower Depths 7.5 3776 1957
Throne of Blood 8.1 34723 1957
I Live in Fear 7.4 3090 1955
Seven Samurai 8.7 247406 1954
Ikiru 8.3 46692 1952
The Idiot 7.4 3533 1951
Rashomon 8.3 112668 1950
Scandal 7.4 2580 1950
Stray Dog 7.9 11789 1949
The Quiet Duel 7.5 2131 1949
Drunken Angel 7.8 7422 1948
One Wonderful Sunday 7.3 1988 1947
Waga seishun ni kuinashi 7.2 2158 1946
Asu o tsukuru hitobito 6.6 119 1946
The Men Who Tread on the Tiger’s Tail 6.8 2567 1945
Zoku Sugata Sanshirô 6.2 1419 1945
Zoku Sugata Sanshirô 5.8 1229 1944

Justify the sample size

  • We want to predict no difference, and thus we shall do a power analysis for an equivalence test. We want to be pretty sure that we can reject our smallest effect size of interest, so we shall design a study with 84% power. For this educational assignment, we do not collect a huge amount of data.
  • As long as we can exclude a large effect (Cohen’s d = 0.8 or larger) we shall be happy for this assignment.
  • The power analysis estimates that the sample size we need to show that the difference between the ratings for movies directed by Satyajit Ray and Akira Kurosawa is smaller than Cohen’s d = 0.8 (assuming the true effect size is 0, with an α of 0.05, and aiming for 84% power) is 29 movie ratings from Satyajit Ray and 29 movie ratings from Akira Kurosawa, as can be seen from the following R code and figures (and from the Python sketch after this list).
  • The α-level we found acceptable is 0.05.
  • We performed a two-sided test.
  • We used 84% power for this study.
  • The expected effect size is 0.78948 ≈ 0.8, as shown below.
  • Given that Satyajit Ray directed a total of 29 full movies, we can only collect 29 observations for him; we also collected an equal amount of sample data (29 movies) for each of the directors.
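
As a cross-check in Python (the project itself is done in R, with the code shown in the figures below), statsmodels can solve for the per-group sample size of a two-sided independent-samples t-test with d = 0.8, α = 0.05 and 84% power; rounding up gives the 29 observations per director used above.

import math
from statsmodels.stats.power import TTestIndPower

# Number of observations per group needed to detect d = 0.8 with
# alpha = 0.05 and 84% power in a two-sided independent-samples t-test.
n_per_group = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05,
                                          power=0.84, alternative='two-sided')
print(n_per_group, '->', math.ceil(n_per_group))  # a little over 28 -> 29 per director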

The following theory is going to be used for the statistical tests:

p11.png

Results

p1
p2.png

  • As can be seen from above, the sample size required to obtain 84% power is 29.

Specify the statistical test to conduct

  • We need to translate our theoretical hypothesis to a statistical hypothesis.
  • Let’s calculate the (90%) CI around the effect size.
  • When the 90% CI falls below, and excludes, a Cohen’s d of 0.8, we can consider the ratings of the movies directed by Satyajit Ray and Akira Kurosawa as equivalent (a Python sketch of this check is given below).
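
For completeness, a rough Python analogue of the equivalence (TOST) test whose R output is shown in the figures below, using statsmodels; the ratings are the 29 most recent entries of each table above, and the ±0.8 Cohen's d bounds are converted to raw rating units via the pooled standard deviation (an assumption of this sketch, so the numbers need not match the R output exactly).

import numpy as np
from statsmodels.stats.weightstats import ttost_ind

# 29 most recent IMDB ratings for each director, from the tables above.
ray = np.array([8.1, 7.6, 7.3, 7.7, 8.8, 7.9, 7.8, 8.3, 8.5, 8.2, 8.0, 8.2,
                8.3, 8.8, 7.2, 8.3, 7.3, 7.8, 8.3, 8.3, 8.0, 8.0, 8.2, 8.0,
                8.2, 8.1, 7.8, 8.2, 8.4])
kurosawa = np.array([7.4, 7.3, 7.8, 8.2, 8.0, 8.3, 7.5, 8.3, 8.4, 8.2, 8.3,
                     8.1, 8.1, 7.5, 8.1, 7.4, 8.7, 8.3, 7.4, 8.3, 7.4, 7.9,
                     7.5, 7.8, 7.3, 7.2, 6.6, 6.8, 6.2])

# Convert the +/-0.8 Cohen's d equivalence bounds into raw rating units
# using the pooled standard deviation of the two samples.
n1, n2 = len(ray), len(kurosawa)
s_pooled = np.sqrt(((n1 - 1) * ray.var(ddof=1) +
                    (n2 - 1) * kurosawa.var(ddof=1)) / (n1 + n2 - 2))
low, upp = -0.8 * s_pooled, 0.8 * s_pooled

# TOST: two one-sided tests against the equivalence bounds; a small overall
# p-value would support statistical equivalence of the two sets of ratings.
p_tost, lower_test, upper_test = ttost_ind(ray, kurosawa, low, upp)
print(p_tost, lower_test, upper_test)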

p3.png

p4.png

p5.png

  • As can be seen from the NHST test above, the effect is statistically significant, since the 90% confidence interval around the effect size does not contain 0.
  • Also, the TOST procedure results shown above indicate that the observed effect size d = 0.69 was not significantly within the equivalence bounds of d = −0.8 and d = 0.8, t(29) = 2.86, p = 0.997.
  • Also, the 90% CI (0.24, 1.14) around the effect size includes a Cohen’s d of 0.8; hence, we cannot consider the ratings of the movies directed by Satyajit Ray and Akira Kurosawa as equivalent.
  • Hence, the effect is statistically significant, but not statistically equivalent.
  • Supporting the alternative with Bayes Factors: As can be seen from the following results, the Bayes Factor of 50.17844 increases our belief in the alternative hypothesis (H1) over the null hypothesis (H0), starting with a small prior belief of 0.2 on the effect size.
  • The following code is taken from the course itself and modified as required; it was originally written by © Professor Daniel Lakens, 2016 and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/).

p6.png

p7

p8

p9

p10

SIR Epidemic model for influenza A (H1N1): Modeling the outbreak of the pandemic in Kolkata, West Bengal, India in 2010 (Simulation in Python & R)

This appeared as a project in the edX course DelftX: MathMod1x Mathematical Modelling Basics and the project report can be found here. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Summary

In this report, the spread of the pandemic influenza A (H1N1) that had an outbreak in Kolkata, West Bengal, India, 2010 is going to be simulated. The basic epidemic SIR model will be used; it describes three populations: a susceptible population, an infected population, and a recovered population, and assumes the total population (the sum of these 3 populations) to be fixed over the period of study.

There are two parameters for this model: namely the attack rate (β) per infected person per day through contacts and the recovery rate (α). Initially there will be a small number of infected persons in the population. Now the following questions are to be answered with the simulation / analysis:

1. Whether the number of infected persons will increase substantially, producing an epidemic, or the flu will fizzle out.
2. Assuming there is an epidemic, how will it end? Will there still be any susceptibles left when it is over?
3. How long will the epidemic last?

Euler’s method will be primarily used to solve the system of differential equations for the SIR model and to compute the equilibrium points (along with some analytic solution attempts for a few simplified special cases). Here are the conclusions obtained from the simulations:

1. When the recovery rate α is ≈ 0 or very low compared to the attack rate β, the flu will turn out to be an epidemic and the entire population will be infected first (the higher β is, the quicker the epidemic breaks out).

2. To be more precise, when the initial susceptible population S(0) is greater than the inverse of the basic reproduction number 1/ R0 = α / β, a proper epidemic will break out.

3. When the initial susceptible population S(0) is less than the inverse of the basic reproduction number 1/R0=α/β, then a proper epidemic will never break out.

4. If the initial susceptible population is non-zero, in the end (at equilibrium) there will always be some susceptible population.

5. When there is an epidemic, it will eventually end in the equilibrium point with 0 infected population, how fast it reaches the equilibrium depends upon the recovery rate (the higher α is the quicker the infection removal).

6. The time to reach the equilibrium can be computed using Euler’s method; it depends on the parameters α (the higher the quicker) and β (the higher the quicker) and the initial infected population size I(0) (the higher the quicker).

lv.png


Introduction

In 2010, the pandemic influenza A (H1N1) had an outbreak in Kolkata, West Bengal, India. An increased number of cases with influenza like illness (ILI) were reported in the Greater Kolkata Metropolitan Area (GKMA) during July and August 2010, as stated in [3]. The main motivation for this research project will be to understand the spread of the pandemic, compute the equilibrium points and find the impact of the initial infected population size and the attack / recovery rate parameters on the spread of the epidemic, using simulations with the basic epidemic SIR model.

Euler’s method will be primarily used to solve the system of differential equations for the SIR model and compute the equilibrium points. First a few simplified special cases will be considered, and both analytic and numerical methods (with the Euler method) will be used to compute the equilibrium points. Then the solution for the generic model will be found. As described in [6], the SIR model can also be effectively used (in a broader context) to model the propagation of computer viruses in computer networks, particularly for networks with Erdos-Renyi type random graph topology.


SIR Epidemic Model

The SIR model is an epidemiological model that computes the theoretical number of people infected with a contagious illness in a closed population over time. One of the basic one strain SIR models is Kermack-McKendrick Model. The Kermack-McKendrick Model is used to explain the rapid rise and fall in the number of infective patients observed in epidemics. It assumes that the population size is fixed (i.e., no births, no deaths due to disease nor by natural causes), incubation period of the infectious agent
is instantaneous, and duration of infectivity is the same as the length of the disease. It also assumes a completely homogeneous population with no age, spatial, or social structure.

The following figure 2.1 shows an electron microscope image of the re-assorted H1N1 influenza virus photographed at the CDC Influenza Laboratory. The viruses are 80 − 120 nm in diameter [1].

virus.png

2.1 Basic Mathematical Model

The starting model for an epidemic is the so-called SIR model, where S stands for susceptible population, the people that can be infected. I is the already infected population, the people that are contagious, and R stands for the recovered population, people who are not contagious any more.

2.1.1 Differential Equations

The SIR model can be defined using the following ordinary differential equations 2.1:

dif1

• The terms dS/dt, dI/dt and dR/dt in the differential equations indicate the rate of change of the susceptible population size, the infected population size and the recovered population size, respectively.

• The terms β and α indicate the attack rate (the number of susceptible persons that get infected per day) and the recovery rate of the flu (the inverse of the number of days a person remains infected), respectively.

• A high value of α means a person stays infected with the flu for fewer days, and a high value of β means that the epidemic will spread quickly.

• Also, as can be seen from below, from the differential equations it can be shown that the population (S + I + R) is assumed to be constant.

dif2.png

2.1.2 Collected data, units and values for the constants

• As can be seen from the following figure 2.2, the focus of this analysis will be limited to  the population in Kolkata Metropolitan Corporation (KMC, XII) area where the population can be assumed to be ≈ 4.5 million or 4500 thousands, as per [7].

Units

– All population (S, I, R) units will be in thousands of persons (so that the total population N = 4500).
– As can be derived from the differential equations 2.1, the unit of β will be 10^(−6) /person/day (β = 25 means 25 persons in a million get infected by susceptible-infected contact per infected person per day).
– Similarly, the unit of α will be 10^(−3) /day (α = 167 means a fraction 167 × 10^(−3) of the infected population recovers per day).

• The attack rate is 20-29/100000 and the number of days infected (i.e. the inverse of recovery rate) = 5−7 days on average (with a few exceptions), as per [3].

• Typical values for β and α can be assumed to be 25 (i.e., 25 × 10^(−6) /person/day) and 10^3/6 ≈ 167 (i.e., 167 × 10^(−3) /day) in the units above, respectively.

f1.jpg

2.2 Simplified Model 1 (with α = 0)

• At first a simplified model is created assuming that α = 0 (/ day) and that R = 0, so once infected, a person stays contagious forever. Because S(t) + I(t) + R(t) = S(t) + I(t) = N is constant (since population size N is fixed), S(t) can be eliminated and a single differential equation in just I(t) is obtained, as shown in the equation 2.2 below.

f2

• Also, let the (fixed) population size N = 4500 = S(0) + I(0) (in thousand persons), the initial number of persons infected I(0) = 1 (in thousand persons) and the initial number of persons susceptible S(0) = N − I(0) = 4499 (in thousand persons). Let β = 25 × 10^(−6) /person/day to start with.

2.2.1 Analytic Solution

• The analytic solution can be found by following the steps shown in the Appendix A and the final solution is shown in the below equations 2.3:

f3.png
• The following figure 2.3 shows the logistic (bounded) growth in I(t) (in thousand persons) w.r.t. the time (in days) for different values of attack rate β×10^(−6) (/ person / day). As expected, the higher the attack rate, the quicker all the persons in the population become infected.

f4.png

2.2.2 Finding the equilibrium points for I

• The equilibrium points are the points where the rate of change in I is zero, the points that satisfy the following equation

f5

• Considering a small neighborhood of the equilibrium point at I = 0, it can be seen from the figure 2.4 that whenever I > 0, dI/dt > 0, so I increases and goes away from the equilibrium point.

• Hence, the equilibrium point at I = 0 is unstable.

• At I = N = 4500 (in thousand persons) it is a stable equilibrium. As can be seen from the following figure 2.4, in a small neighborhood of the equilibrium point at I = 4500, it always increases / decreases towards the equilibrium point.

• In a small ε > 0 neighborhood of I = 4500 (in thousand persons),

1. dI/dt > 0, so I increases when I <= 4500 − ε.
2. dI/dt < 0, so I decreases when I >= 4500 + ε.

• The same can be observed from the direction fields from the figure 2.5.

• Hence, the equilibrium at I = 4500 is stable.

f6.png

f7.png

2.2.3 Numerical Solution with the Euler’s Method

• The algorithm (taken from the course slides) shown in the following figure 2.6 will be used for numerical computation of the (equilibrium) solution using the Euler’s method.

f8.png

• As can be seen from the figure 2.6, the infection at the next timestep can be (linearly) approximated (iteratively) as the sum of the infection at the current timestep and the product of the timestep size and the derivative of the infection evaluated at the current timestep, as sketched in the code below.
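
A minimal sketch of this update rule for the simplified model dI/dt = βI(N − I), with the parameter values of section 2.2 in the report's units (population in thousands of persons); it also searches for the equilibrium time teq at which N − I(t) first drops below the 10^(−6) tolerance used later.

def euler_simplified_model(beta=25e-6, N=4500.0, I0=1.0, dt=1/512, tol=1e-6):
    t, I = 0.0, I0
    while N - I >= tol:
        # I(t + dt) is approximated by I(t) + dt * dI/dt evaluated at time t.
        I = I + dt * beta * I * (N - I)
        t += dt
    return t

print(euler_simplified_model())  # should land close to the teq of ~272.3 days reported below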

2.2.3.1 Finding the right step size (with β = 25 × 10^(−6)/person/day)

• In order to decide the best step size for the Euler method, first the Euler’s method is run with different step sizes as shown in the figure 2.7.

• As can be seen from the following table 2.1 and the figure 2.7, the largest differences in the value of I (with two consecutive step sizes) occur around t = 78 days:

f9.png

f10.png
• As can be seen from the table in the Appendix B, the first time when the error becomes less than 1 person (in thousands) is with the step size 1/512 , hence this step size will be used for the Euler method.

2.2.3.2 Computing the (stable) equilibrium point

• Now, this timestep will be used to solve the problem of finding the equilibrium time teq (in days): find teq such that N − I(teq) < ε = 10^(−6); the solution obtained is teq = 272.33398 days ≈ 273 days.

• Now, from the analytic solution 2.3 and the following figure 2.8, it can be verified that the teq solution that Euler’s method obtained is pretty accurate (to the ε tolerance).

f11.png

2.2.3.3 Results with β = 29 × 10^(−6) / person / day, I(0) = 1 person

• Following the same iterations as above, the steepest error is obtained at t = 67 days in this case, as shown in the figure 2.9.

• The first time the error becomes less than one person at t = 67 days with Euler’s method is again with step size 1/512.

• The solution obtained is teq = 234.76953125 days ≈ 235 days, so the equilibrium (when all the population becomes infected) is obtained earlier as expected, since the attack rate β is higher.

f12.png

2.2.3.4 Results with β = 25 × 10^(−6) / person / day, with different initial values for infected persons (I(0))

• Following the same iterations as above, the equilibrium point is computed using the Euler’s method with different values of initial infected population I(0), as shown in the figure 2.10.

• The solutions obtained are teq = 272.33, 258.02, 251.85, 248.23, 245.66 days for I(0) = 1, 5, 10, 15, 20 (in thousand persons), respectively. So the equilibrium is reached earlier when the initial infected population size is higher, as expected.

f13.png

2.3 Simplified Model 2 (with β = 0)

• Next, yet another simplified model is considered by assuming that β = 0 and that α > 0, so the flu can no longer infect anyone (there being no susceptibles left, possibly because everyone got infected), and an infected person recovers from the flu at rate α. This situation can be described again with a single differential equation in just I(t), as shown in the equation 2.4 below.

f14

• Also, let the entire population be infected, N = 4500 = I(0) (in thousand persons), and the number of persons initially susceptible S(0) = 0. Let α = 167 × 10^(−3) /day to start with.

2.3.1 Analytic Solution

• The analytic solution can be found by following the steps shown in the below equations 2.5:

f15

• The following figure 2.11 shows the exponential decay in I(t) (in thousand persons)  w.r.t. the time (in days) for different values of recovery rate α × 10^(−3)  (/ day). As expected, the higher the recovery rate, the quicker all the persons in the population get rid of the infection.

f16.png

• Now, I(t) + R(t) = N (since S(t) = 0 forever, as there is no more infection) and I(0) = N; combining this with the above analytic solution I(t) = I(0).exp(−αt) = N.exp(−αt), the following equation is obtained:
f17

• The following figure 2.12 shows the growth in R(t) (in thousand persons) w.r.t. the time (in days) for different values of recovery rate α × 10^(−3) (/ day). As expected, the higher the recovery rate, the quicker all the persons in the population move to the removed state.

f18.png

2.3.2 Numerical Solution with the Euler’s Method

2.3.2.1 Solution with α = 167 × 10−3 / day

• Following the same iterations as above, the steepest error is obtained at t = 6 days in this case, as shown in the figure 2.16.

• The first time the error becomes less than one person at t = 6 days with Euler’s method is with step size 1/256.

• The solution obtained with Euler’s method is 133.076171875 days ≈ 133 days to remove the infection from the population with a 10^(−6) tolerance. From the analytic solution, I(133) = N.exp(−α·133) = 1.016478E−06, so a similar result is obtained.

2.3.2.2 Results

The following figure 2.16 shows the solutions obtained with different step sizes using the Euler’s method.

f19.png

2.4 Generic Model (with α, β > 0)

First, the numeric solution will be attempted for the generic model (using the Euler’s method) and then some analytic insights will be derived for the generic model.

2.4.1 Numerical Solution with the Euler’s  Method

• The following algorithm 2.14 shown in the next figure is going to be used to obtain the solution using Euler method (the basic program for Euler’s method, adapted to include three dependent variables and three differential equations).

f20.png

• As can be seen from the figure 2.14, first the vector X(0) is formed by combining the three variables S, I, R at timestep 0. Then the value of the vector at the next timestep can be (linearly) approximated (iteratively) as the (vector) sum of the vector value at the current timestep and the product of the timestep size and the derivative of the vector evaluated at the current timestep, as sketched in the code below.
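
A sketch of this vectorised Euler step for the full SIR system (population in thousands of persons, as before); the default parameter values below are the ones used for figure 2.16.

import numpy as np

def sir_derivative(X, beta, alpha):
    S, I, R = X
    return np.array([-beta * S * I,             # dS/dt
                     beta * S * I - alpha * I,  # dI/dt
                     alpha * I])                # dR/dt

def euler_sir(beta=25e-5, alpha=167e-3, N=4500.0, I0=1.0, dt=1/512, t_max=600):
    X = np.array([N - I0, I0, 0.0])     # X(0) = (S(0), I(0), R(0))
    trajectory = [X]
    for _ in range(int(t_max / dt)):
        # X(t + dt) is approximated by X(t) + dt * X'(t).
        X = X + dt * sir_derivative(X, beta, alpha)
        trajectory.append(X)
    return np.array(trajectory)

traj = euler_sir()
print(traj[-1])  # near equilibrium I is ~0 while S + I + R is still 4500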

2.4.1.1 Equilibrium points

• At the equilibrium point,

f21

There will be no infected person at the equilibrium point (infection should get removed).

• As can be seen from the following figure 2.15 also, I = 0 is an equilibrium point, which is quite expected, since in the equilibrium all the infected population will move to the removed state.

• Also, at every point the invariant S + I + R = N holds.

• In this particular case shown in figure 2.15, the susceptible population S also becomes 0 at equilibrium (since all the population got infected initially, all of them need to move to removed state) and R = N = 4500 (in thousand persons).

f22.png

2.4.1.2 Results with Euler method

• As explained in the previous sections, the same iterative method is used to find the right stepsize for the Euler method. The minimum of the two stepsizes determined is ∆t = 1/512 day, and again this stepsize is going to be used for Euler’s method.

• The following figures show the solutions obtained with different values of α, β with the initial infected population size I(0) = 1 (in thousand persons). Higher values for the parameter β obtained from the literature are used for the simulation, since β = 25 × 10^(−6) /person/day is too small (the results are not interesting) for the growth of the epidemic using Euler’s method (at least down to ∆t = 1/2^15, after which the iterative Euler’s method becomes very slow).

• As can be seen from the figures 2.16, 2.17 and 2.19, at equilibrium, I becomes zero.

• The solution (number of days to reach equilibrium) obtained at α = 167×10^(−3) /day and β = 25×10^(−5) /person /day is teq = 143.35546875 ≈ 144 days with I(0) = 1 (in thousand persons), the corresponding figure is figure 2.16.

• The solution (number of days to reach equilibrium) obtained at α = 167 × 10^(−3) /day and β = 5 × 10^(−5) /person /day is teq ≈ 542 days with I(0) = 1 (in thousand persons), the corresponding figure is figure 2.17.

• Hence, the higher the β value, the earlier the equilibrium is reached.

• The solution obtained at α = 500 × 10^(−3) /day and β = 25 × 10^(−5) /person /day is
teq ≈ 78 days with I(0) = 1 (in thousand persons), the corresponding figure is figure 2.19.

• Hence, the higher the α value, the earlier the equilibrium is reached.

• The solution obtained at α = 167×10^(−3) /day and β = 25×10^(−5) /person /day is teq = 140 days with I(0) = 10 (in thousand persons). Hence, as expected, the higher the initial infected population size, the quicker the equilibrium is reached.

• At equilibrium, S does not necessarily become close to zero, since sometimes the entire population may never get infected, as shown in the figure 2.17, where at equilibrium the susceptible population is non-zero.

f23

f24

f25

f26

• As can be seen from the phase planes in the following figure 2.21, at equilibrium, the infected population becomes 0.

f27.png

2.4.2 Analytic Solution and Insights

2.4.2.1 Basic Reproduction Number (R0)

The basic reproduction number (also called the basic reproduction ratio) is defined by R0 = β / α (with the units used here, 1/R0 has the units of a population size). As explained in [2], this ratio is derived as the expected number of new infections (these new infections are sometimes called secondary infections) from a single infection in a population where all subjects are susceptible. How the dynamics of the system depend on R0 will be discussed next.

2.4.2.2 The dynamics of the system as a function of R0

• By dividing the first equation by the third in 2.1, as done in [2], the following equation is obtained:

f28.png

• Now, as t → ∞, the equilibrium must already have been reached and all infections must have been removed, so that lim (t→∞) I(t) = 0.

• Also, let R∞ = lim (t→∞) R(t).

• Then from the above equation 2.7, R∞ = N − S(0).exp(−R0.(R∞ − R(0))) (a quick numerical check of this relation is sketched below).
• As explained in [2], the above equation shows that at the end of an epidemic, unless
S(0) = 0, not all individuals of the population have recovered, so some must remain  susceptible.

• This means that the end of an epidemic is caused by the decline in the number of infected individuals rather than an absolute lack of susceptible subjects [2].
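
A quick numerical sanity check of this final-size relation by simple fixed-point iteration (a sketch; the parameter values are the ones used for figure 2.16, and R0 = β/α is in the report's per-person units):

import math

def final_recovered(beta=25e-5, alpha=167e-3, N=4500.0, I0=1.0, iters=200):
    R0 = beta / alpha
    S0, R_init = N - I0, 0.0
    R_inf = N / 2.0            # any starting guess strictly between 0 and N
    for _ in range(iters):
        # Iterate R_inf = N - S(0) * exp(-R0 * (R_inf - R(0))) until it settles.
        R_inf = N - S0 * math.exp(-R0 * (R_inf - R_init))
    return R_inf

print(final_recovered())  # strictly less than N whenever S(0) > 0

The result being strictly less than N whenever S(0) > 0 is exactly the point made above: some susceptibles always remain at the end of the epidemic.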

• The role of the basic reproduction number is extremely important, as explained in [2]. From the differential equation, the following equation can be obtained:

f29

S(t) > 1/R0 ⇒ dI(t)/dt > 0 ⇒ there will be a proper epidemic outbreak with an increase of the number of infectious individuals (which can reach a considerable fraction of the population).

S(t) < 1/R0 ⇒ dI(t)/dt < 0 ⇒ independently of the initial size of the susceptible population, the disease can never cause a proper epidemic outbreak.

• As can be seen from the following figures 2.21 and 2.22 (from the simulation results obtained with the Euler method), when S(0) > 1/R0, there is a peak in the infection curve, indicating a proper epidemic outbreak.

• Also, from the figures 2.21 and 2.22, when S(0) > 1/R0, the higher the gap between S(0) and 1/R0, the higher the peak is (the more people get infected) and the quicker the peak is attained.

• Again, from the figure 2.22, when 4490 = S(0) < 1/R0 = 5000, the disease never causes a proper epidemic outbreak.

f30

f31.png

 

• Again, by dividing the second equation by the first in 2.1, the following equation is obtained:

f32.png

f33.png

 

• As can be noticed from the above figure 2.23, because the formulas differ only by an additive constant, these curves are all vertical translations of each other.

• The line I(t) = 0 consists of equilibrium points.

• Starting out at a point on one of these curves with I(t) > 0, as time goes on one needs to travel along the curve to the left (because dS/dt < 0), eventually approaching some positive value of S(t).

• This must happen since, on any of these curves, I(t) → −∞ as S(t) → 0 (from equation 2.8), so the curve cannot be followed all the way to S(t) = 0.

• So the answer to question (2) is that the epidemic will end with S(t) approaching some positive value, and thus there must always be some susceptibles left over.

• As can be seen from the following figure 2.24 (from the simulation results obtained with Euler’s method), when S(0) > 1/R0, the smaller the gap between S(0) and 1/R0, the larger the population that remains susceptible at equilibrium (i.e., as t → ∞).

f34.png

Conclusions

In this report, the spread of the pandemic influenza A (H1N1) that had an outbreak in Kolkata, West Bengal, India, 2010 was simulated using the basic epidemic SIR model. Initially there was a small number of infected persons in the population, most of the population was susceptible (still not infected but prone to infection), and there were zero removed persons. Given the initial values of the variables and the parameter values (the attack and recovery rates of the flu), the following questions were attempted to be answered with the simulation / analysis:

1. Whether the number of infected persons will increase substantially, producing an epidemic, or the flu will fizzle out.

2. Assuming there is an epidemic, how will it end? Will there still be any susceptibles left when it is over?

3. How long will the epidemic last?

The following conclusions were obtained after running the simulations with different values of the parameters and the initial values of the variables:

1. When the recovery rate α is ≈ 0 or very low compared to the attack rate β (so that R0 = β / α >> 1) and I(0) > 1, the flu will turn out to be an epidemic and the entire population will be infected first (the higher β is, the quicker the epidemic breaks out).

2. To be more precise, when the initial susceptible population S(0) is greater than the inverse of the basic reproduction number 1/R0 = α / β, a proper epidemic will break out.

3. When the initial susceptible population S(0) is less than the inverse of the basic reproduction number 1/R0 = α/β, then a proper epidemic will never break out.

4. If the initial susceptible population is non-zero, in the end (at equilibrium) there will always be some susceptible population.

5. When there is an epidemic, it will eventually end in the equilibrium point with 0 infected population, how fast it reaches the equilibrium depends upon the recovery rate (the higher α is the quicker the infection removal).

6. The time to reach the equilibrium can be computed using the Euler method; it depends on the parameters α (the higher the quicker) and β (the higher the quicker) and the initial infected population size I(0) (the higher the quicker).

7. Scope of improvement: The SIR model could be extended to The Classic Endemic Model [5] where the birth and the death rates are also considered for the population (this will be particularly useful when a disease takes a long time to reach the equilibrium state).

f35.png

f36.png

f37.png