Dynamic Programming – 4 : Sequence Alignment

Sequence Alignment

We are given 2 sequences X = <x_1, \ldots, x_m> and Y = < y_1, \ldots, y_n>. Our task is to produce a valid matching of the elements of X and Y such that the cost of the matching is minimized.

Terminology:

A valid matching denotes a one-to-one correspondence of (some subset of) elements of X and Y with the added condition that there should be no crossing in the matching.i.e. if x_i is matched to y_j, and x_p is matched to y_q, then i < p implies j < q.

e.g.

X = < e l e p h a n t >

Y = < m o n k e y >

The matching M = \{ (1,2), (3,3), (6, 5) \} refers to the valid matching where we match x_1, y_2 and x_3, y_3 and x_6, y_5. Note that there are no crossings in M.

Cost of a matching M is the sum of:

(1) matching cost m_{x_i,y_j} for each (i,j) \in M.

(2) gap cost d for each element of X or Y which is not part of M.

Solution.

Given X and Y, we have 2 choices: either include (m,n) in the optimum matching, or exclude it from the optimum matching.

Case 1:

If (m,n) is part of optimum matching, then the optimum cost would be m_{x_m, y_n} plus the optimum cost of matching the sequences <x_1, \ldots, x_{m-1} and < y_1, \ldots, y_{n-1}.

Notation: Let OPT(i,j) denote the optimum cost of matching <x_1, \ldots, x_i> and <y_1, \ldots, y_j>.

Therefore, we have:

OPT (m,n) = m_{x_m, y_n} + OPT(m-1, n-1)

Case 2:

If (m,n) is not part of the optimum matching, then one of the following should hold true (because of the no crossing condition):

(1) x_m is not part of the optimum matching.

or, (2) y_n is not part of the optimum matching.

Therefore, we have:

OPT(m,n) = min \{ OPT(m-1, n) + d, OPT (m,n-1) + d\}

—-

Generalizing the above recurrence, we have, for 1 \leq i \leq m and 1 \leq j \leq n:

OPT (i, j) = min \{ m_{x_i, y_j} + OPT (i-1, j-1), OPT (i-1, j) + d, OPT (i, j-1) + d\}

Also, we have the base conditions that:

OPT (i, 0) = i \times d for all 0 \leq i \leq m; and analogously for OPT (0, j).

—-

Pseudocode for computing optimum value:

Initialize OPT [i, 0] and OPT [0,j] as given above.

For i from 1 to m:

For j from 1 to n:

Compute OPT [i,j] using the given recurrence.

End of algorithm.

—-

The above algorithm computes the optimum value OPT (m,n) in O(mn) time.

—-

Computing the optimum solution using OPT [i,j] values:

Let us denote :

(1) m_{x_i, y_j} + OPT (i-1, j-1) by Expr1.

(2) d + OPT (i, j-1) by Expr2.

(3) d + OPT (i-1, j) by Expr3.

Function: OPT_Solution (i,j)

If (i = = 0) OR ( j = = 0)

Return;

If Expr1 <= both Expr2 and Expr3

OPT_Solution (i-1, j-1);

Print: Index i of X is matched to Index j of Y.

Else if: Expr2 <= both Expr1 and Expr3

OPT_Solution (i, j-1);

Else:

OPT_Solution (i-1, j);

End of function.

—-

What we want to compute is OPT_Solution (m, n). The above function computes that in time O(m+n).

—-

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: