Distance between a given pair of sequences is calculated depending on the alignment resulting from running an algorithm like Smith-Waterman, Needleman-Wunsch, or Blast. In general the alignment of two sequences may appear as shown below.
Each square represents a character and dashes indicate gaps. Characters and gaps existing outside the aligned region is possible only with a global alignment algorithm like Needleman-Wunsch. Local alignments resulting from both Smith-Waterman and Blast will have aligned region being identical to the entire alignment. Also, note that with a local alignment the starting pair and ending pair of characters from the two sequences will not include a gap character.
If the two aligned sequences are S1 and S2 then aligned region is defined from StartIndex to EndIndex inclusively defined as below.
StartIndex = Max (S1.FirstNonGapIndex, S2.FirstNonGapIndex)
EndIndex = Min(S1.LastNonGapIndex, S2.LastNonGapIndex)
The length of the aligned region, AlignedLenth = EndIndex - StartIndex + 1
- Percent Identity (PID) Distance
- Let NumOfIdenticalPairs be the number of identical pairs within the aligned region. For example in the above picture there are five such pairs (2 greens, 1 purple, 1 blue, 1 red)
- PID = NumOfIdenticalPairs / AlignedLength
- Convert percent identity as a distance by taking 1 - PID
- Score
- Each pair of aligned characters is assigned a score using the substitution matrix and gap penalties.
- The summation of all such scores is called the score of the alignment. The alignment algorithm always tries to maximize this value.
- See [1] for more details.
- Normalized Score
- Compute the score for aligning S1 with S2, S1 with S1, and S2 with S2. Let these values be named as S1S2, S1S1, and S2S2 respectively.
- NormalizedScore = 2 * S1S2 / (S1S1 + S2S2)
- BitScore
- Blast alignment has a value called BitScore, which is a log scaled version of the Score.
- See [1,2] for more details.
- Normalized BitScore
- Similar to NormalizedScore, compute BitScore for aligning S1 with S2, S1 with S1, and S2 with S2.
- Used the same formula as in NormalizedScore to compute NormalizedBitScore
No comments:
Post a Comment