This article describes a few approaches and algorithms for **temporal neighborhood discretization** from a couple of papers accepted in the conference ICDM (2009) and in the journal IDA (2014), authored by me and my fellow professors Dr. Aryya Gangopadhyay and Dr. Vandana Janeja at** UMBC**. Almost all of the texts / figures in this blog post are taken from the papers.

**Neighborhood discovery** is a precursor to knowledge discovery in complex and large datasets such as **Temporal data**, which is a sequence of data tuples measured at successive time instances. Hence instead of mining the entire data, we are

interested in dividing the huge data into several **smaller intervals** of interest which we call as **temporal neighborhoods**. In this article we propose 4 novel of algorithms to generate temporal neighborhoods through unequal depth discretization:

**Similarity**-based simple Merging (**SMerg**)**Markov Stationary distribution**based Merging (**StMerg**)**Greedy**Merging (**GMerg**)**Optimal**Merging with**Dynamic Programming**(**OptMerg**).

The next figure shows the schematic diagram of the four algorithms:

All the algorithms start by dividing a **given time series data** into a few (user-chosen) initial number of bins (with **equal-frequency binning**) and then they proceed **merging similar** bins (resulting a few possibly **unequal-width bins**) using different approaches, finally the algorithms **terminate** if we obtain the user-specified number of output (merged) bins .

Both the algorithms **SMerg** and **STMerg** start by setting a **Markov Model** with the initial bins as the **states **(each state being summarized using the corresponding bin’s **mean** and **variance**) and **Transition Matrix** computed with the **similarity** (computed using a few different **distribution-similarity computation metrics**) in between the bins, as shown in the following figure:

The following** distribution-distance measures **are used to compute the** similarity **in between the bins (we assume the temporal data in each of the bins as **normally distributed** with the bin mean and the variance).

We discuss the results for a one dimensional special case here, however, our approach is generalizable to multidimensional temporal data.

**
**The

**Markov Transition Matrix**is defined as shown in the following figure as a

**block-diagonal matrix**with a given

**Temporal window size**

*w*.

The following figure shows the** pre-processing **step** Init-bin **for setting up the** Markov Model**

In this approach we**Similarity based merging (SMerg)****iteratively**merge**highly similar**bins (based on a**threshold**). At every iteration the transition matrix is recomputed since bin population has changed.

2. **Markov Stationary distribution based Merging (StMerg)**

Given the initial equal-depth bins** **and the transition matrix** T, **we compute the** steady-state vector **at the** stationary distribution, **using** power-iteration **method. Once we have this **vector** we need to detect the **spikes** in the vector which indicate the **split** points such that the data on either side of any split point is very **dissimilar** and the data within two split points is highly **similar**. In order to detect these spikes we use Discrete Fourier Transform (**DFT**) and Inverse Discrete Fourier Transform (**IDFT**).

The following figure shows how the steady state vector **approximate** the **pattern** in the **original data** for a few **synthetically generated** and public **transportation** datasets, thereby achieving **data compression **(since number of states in the steady state vector =

#** output bins** obtained after bin merging << **# initial bins**).

3. **Greedy Approach (GMerg)**

This algorithm starts with the initial bins and starts merging the **most similar adjacent bins** right away (does not require a **Markov Model**). It is greedy in the sense that it chooses the current **most similar intervals** for **merging** and hence may find a **local optimum** instead of obtaining a **globally optimal merging**.

4. **Optimal merging (OPTMerg)**

Starting with the initial equal depth intervals (bins), we introduce the notion of **optimal merging** by defining **optimal partitioning**,** **as described in the following figures:

Next we use **dynamic programming** to find **optimal partition**, as shown in the next figure:

The following figures show the **optimal partition trees** obtained on a few datasets:

**Experimental Results with Synthetic and CAT Lab data**

The following figures show the output bins obtained for few of the datasets using few of the algorithms and distance measures.

From the regions in the following figures marked (a, b) we can see that the** weekend days** are clearly **different** from the **weekdays** marked as (d, e). Further if we consider a very high level granularity such as the region (c) using **StMerg** with the **KL distance** we can entirely demarcate the **weekends** as compared to the **weekdays **(the **global outliers** can be detected, marked by a circle).

In the next figure, **temporal interval** (a) and (b) obtained after merging the intervals shows that there is an **anomaly** in the data from 05/12/2009 23 : 03 : 39 to 05/13/2009 00 03 39. We indeed find that speed value recorded by the censor is 20, constant all over the time period, and we verified from CATT lab representative that this is due to a possible ”free flow speed condition” where the sensor may go into a powersave mode if the traffic intensity is very minimal or close to zero.

We also have extended the existing well-known and popular technique **SAX** (Symbolic Aggregate Approximation) to **SAXMerg **(where we merge the adjacent bins represented by **same symbol**) as shown below.

We then compare our results with those obtained using **SAXMerg**, as shown in the following figures.

It can be seen from the merged output bins from the next figure, the **OPTMerg** with **Mahalanobis** distance measure is more **robust** to **local outliers** than the **SAXMerg** is.

Some results with **precision** and **recall**: