## Dynamic Programming – 4 : Sequence Alignment

Sequence Alignment

We are given 2 sequences $X = $ and $Y = < y_1, \ldots, y_n>$. Our task is to produce a valid matching of the elements of $X$ and $Y$ such that the cost of the matching is minimized.

Terminology:

A valid matching denotes a one-to-one correspondence of (some subset of) elements of $X$ and $Y$ with the added condition that there should be no crossing in the matching.i.e. if $x_i$ is matched to $y_j$, and $x_p$ is matched to $y_q$, then $i < p$ implies $j < q$.

e.g.

X = < e l e p h a n t >

Y = < m o n k e y >

The matching $M = \{ (1,2), (3,3), (6, 5) \}$ refers to the valid matching where we match $x_1, y_2$ and $x_3, y_3$ and $x_6, y_5$. Note that there are no crossings in $M$.

Cost of a matching $M$ is the sum of:

(1) matching cost $m_{x_i,y_j}$ for each $(i,j) \in M$.

(2) gap cost $d$ for each element of $X$ or $Y$ which is not part of $M$.

Solution.

Given $X$ and $Y$, we have 2 choices: either include $(m,n)$ in the optimum matching, or exclude it from the optimum matching.

Case 1:

If $(m,n)$ is part of optimum matching, then the optimum cost would be $m_{x_m, y_n}$ plus the optimum cost of matching the sequences $ and $< y_1, \ldots, y_{n-1}$.

Notation: Let $OPT(i,j)$ denote the optimum cost of matching $$ and $$.

Therefore, we have:

$OPT (m,n) = m_{x_m, y_n} + OPT(m-1, n-1)$

Case 2:

If $(m,n)$ is not part of the optimum matching, then one of the following should hold true (because of the no crossing condition):

(1) $x_m$ is not part of the optimum matching.

or, (2) $y_n$ is not part of the optimum matching.

Therefore, we have:

$OPT(m,n) = min \{ OPT(m-1, n) + d, OPT (m,n-1) + d\}$

—-

Generalizing the above recurrence, we have, for $1 \leq i \leq m$ and $1 \leq j \leq n$:

$OPT (i, j) = min \{ m_{x_i, y_j} + OPT (i-1, j-1), OPT (i-1, j) + d, OPT (i, j-1) + d\}$

Also, we have the base conditions that:

$OPT (i, 0) = i \times d$ for all $0 \leq i \leq m$; and analogously for $OPT (0, j)$.

—-

Pseudocode for computing optimum value:

Initialize OPT [i, 0] and OPT [0,j] as given above.

For i from 1 to m:

For j from 1 to n:

Compute OPT [i,j] using the given recurrence.

End of algorithm.

—-

The above algorithm computes the optimum value $OPT (m,n)$ in $O(mn)$ time.

—-

Computing the optimum solution using OPT [i,j] values:

Let us denote :

(1) $m_{x_i, y_j} + OPT (i-1, j-1)$ by Expr1.

(2) $d + OPT (i, j-1)$ by Expr2.

(3) $d + OPT (i-1, j)$ by Expr3.

Function: OPT_Solution (i,j)

If (i = = 0) OR ( j = = 0)

Return;

If Expr1 <= both Expr2 and Expr3

OPT_Solution (i-1, j-1);

Print: Index i of X is matched to Index j of Y.

Else if: Expr2 <= both Expr1 and Expr3

OPT_Solution (i, j-1);

Else:

OPT_Solution (i-1, j);

End of function.

—-

What we want to compute is OPT_Solution (m, n). The above function computes that in time $O(m+n)$.

—-