Tuesday, May 29, 2012

COG 95672 NW PID Log(1-d^4) with Sammon


Description

DataSet: COG Size: 95672 Unique: Yes
Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4
DistanceType: (1 - PercentIdentity) Transformation: TM12,TP4
Mapping: Sammon DistanceCut: None
Initialization: Random
Fixed: None
Varied: All
DensitySat: 0.85

Links

Images


Full Sample with Selected Clusters




Full Sample with Selected Clusters Zoomed-in


COG 95672 NW PID Log(1-d^2) with Sammon


Description

DataSet: COG Size: 95672 Unique: Yes
Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4
DistanceType: (1 - PercentIdentity) Transformation: TM12,TP2
Mapping: Sammon DistanceCut: None
Initialization: Random
Fixed: None
Varied: All
DensitySat: 0.85

Links

Images


Full Sample with Selected Clusters




Full Sample with Selected Clusters Zoomed-in


Friday, May 25, 2012

COG 95672 NW PID Log(1-d^6) with Sammon


Description

DataSet: COG Size: 95672 Unique: Yes
Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4
DistanceType: (1 - PercentIdentity) Transformation: TM12,TP6
Mapping: Sammon DistanceCut: None
Initialization: Random
Fixed: None
Varied: All
DensitySat: 0.85

Links

Images


Full Sample with Selected Clusters


Full Sample with Selected Clusters Zoomed-in


Full Sample with Selected Clusters Zoomed-in Further




Wednesday, May 23, 2012

COG 95672 NW PID Log(1-d) with Sammon

Description

DataSet: COG Size: 95672 Unique: Yes
Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4
DistanceType: (1 - PercentIdentity) Transformation: TM12
Mapping: Sammon DistanceCut: None
Initialization: Random
Fixed: None
Varied: All
DensitySat: 0.85

Links

Images

 

Full Sample with Selected Clusters



Full Sample with Selected Clusters Zoomed-in

Friday, March 30, 2012

The Role of Seven Clusters

The seven clusters were chosen early on as an interesting way to examine the value of the transformation. They are:
COG0444 137 members
COG4608 130 members
COG1131 240 members
COG1126 114 members
COG1136 195 members
COG3842 110 members
COG3849 135 members

We show the analysis in terms of
Original distance versus Euclidean distance in the 3D map
and
Original distance for two different methods

The intracluster set is the collection of all pairs of points inside the same cluster; it measures how well individual clusters are mapped.
The intercluster set is the collection of all pairs with one point in one of the seven clusters and the other point in a different one. The quality of these plots measures the relative placement of the clusters.
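
The two pair collections described above can be sketched as follows. This is only an illustration, not the code used for the plots: the names split_pairs and distance_pairs, the labels dictionary (point number to cluster name), and the original and coords containers are all hypothetical.

```python
import math
from itertools import combinations


def split_pairs(labels):
    """Split all point pairs into same-cluster and cross-cluster lists.

    labels is a hypothetical dict mapping point number -> cluster name.
    """
    same, cross = [], []
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            same.append((a, b))   # both points in the same cluster
        else:
            cross.append((a, b))  # one point in each of two clusters
    return same, cross


def distance_pairs(pairs, original, coords):
    """Pair each original distance with the Euclidean distance in the 3D map.

    original: dict (a, b) -> original (possibly transformed) distance
    coords:   dict point number -> (x, y, z) from the 3D map
    """
    return [(original[(a, b)], math.dist(coords[a], coords[b]))
            for a, b in pairs]
```

Plotting the first element of each tuple against the second, separately for the two lists, gives the kind of original-versus-mapped comparison described above.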

Wednesday, March 28, 2012

Distance Types

The distance between a given pair of sequences is calculated from the alignment produced by an algorithm such as Smith-Waterman, Needleman-Wunsch, or Blast. In general, the alignment of two sequences may appear as shown below.

Each square represents a character and dashes indicate gaps. Characters and gaps outside the aligned region are possible only with a global alignment algorithm like Needleman-Wunsch. For local alignments from Smith-Waterman and Blast, the aligned region is identical to the entire alignment. Also note that in a local alignment neither the starting pair nor the ending pair of characters from the two sequences includes a gap character.

If the two aligned sequences are S1 and S2, the aligned region runs from StartIndex to EndIndex, inclusive, defined as below.

StartIndex = Max(S1.FirstNonGapIndex, S2.FirstNonGapIndex)
EndIndex = Min(S1.LastNonGapIndex, S2.LastNonGapIndex)

The length of the aligned region is AlignedLength = EndIndex - StartIndex + 1
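
As a minimal sketch of the definitions above, the aligned region can be computed from two gapped strings of equal length. The function names and the use of "-" as the gap character are assumptions made for illustration; indices here are 0-based.

```python
def first_non_gap(seq, gap="-"):
    """Index of the first character in seq that is not a gap."""
    for i, ch in enumerate(seq):
        if ch != gap:
            return i
    raise ValueError("sequence contains only gaps")


def last_non_gap(seq, gap="-"):
    """Index of the last character in seq that is not a gap."""
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] != gap:
            return i
    raise ValueError("sequence contains only gaps")


def aligned_region(s1, s2, gap="-"):
    """Return (StartIndex, EndIndex, AlignedLength), indices inclusive."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    start = max(first_non_gap(s1, gap), first_non_gap(s2, gap))
    end = min(last_non_gap(s1, gap), last_non_gap(s2, gap))
    return start, end, end - start + 1
```

For the made-up global alignment "--ACGTTA--" against "TTAC-TTAGG", the region runs from index 2 to index 7 and has length 6; the leading and trailing overhangs of the second sequence fall outside it.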

  • Percent Identity (PID) Distance
    • Let NumOfIdenticalPairs be the number of identical pairs within the aligned region. For example, in the above picture there are five such pairs (2 greens, 1 purple, 1 blue, 1 red)
    • PID = NumOfIdenticalPairs / AlignedLength
    • Convert percent identity to a distance by taking 1 - PID
  • Score
    • Each pair of aligned characters is assigned a score using the substitution matrix and gap penalties. 
    • The summation of all such scores is called the score of the alignment. The alignment algorithm always tries to maximize this value.
    • See [1] for more details.
  • Normalized Score
    • Compute the score for aligning S1 with S2, S1 with S1, and S2 with S2. Call these values S1S2, S1S1, and S2S2 respectively.
    • NormalizedScore = 2 * S1S2 / (S1S1 + S2S2) 
  • BitScore
    • Blast alignment has a value called BitScore, which is a log scaled version of the Score.
    • See [1,2] for more details. 
  • Normalized BitScore
    • Similar to NormalizedScore, compute BitScore for aligning S1 with S2, S1 with S1, and S2 with S2.
    • Use the same formula as in NormalizedScore to compute NormalizedBitScore
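
The PID distance and NormalizedScore definitions in the list above can be sketched as below. This is a hedged illustration rather than the Manxcat implementation: the function names and the "-" gap character are assumptions, and the three score arguments are expected to come from whatever aligner is in use.

```python
def pid_distance(s1, s2, gap="-"):
    """Return 1 - PID for two aligned sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    # Aligned region: from the later first non-gap index to the
    # earlier last non-gap index (0-based, inclusive).
    start = max(len(s1) - len(s1.lstrip(gap)), len(s2) - len(s2.lstrip(gap)))
    end = min(len(s1.rstrip(gap)), len(s2.rstrip(gap))) - 1
    aligned_length = end - start + 1
    if aligned_length <= 0:
        raise ValueError("empty aligned region")
    # Count columns where both sequences carry the same non-gap character.
    identical = sum(
        1 for a, b in zip(s1[start:end + 1], s2[start:end + 1])
        if a == b and a != gap
    )
    return 1.0 - identical / aligned_length


def normalized_score(s1s2, s1s1, s2s2):
    """NormalizedScore = 2 * S1S2 / (S1S1 + S2S2).

    Passing bit scores instead of raw scores gives NormalizedBitScore.
    """
    return 2.0 * s1s2 / (s1s1 + s2s2)
```

With the made-up alignment "--ACGTTA--" against "TTAC-TTAGG", five of the six columns in the aligned region are identical, so PID is 5/6 and the distance is 1/6. Aligning a sequence with itself gives a normalized score of 1.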
References:

FAQ

  1. What does PID stand for? 
    • PID stands for Percent Identity. It means that the particular Manxcat run used the (1 - PID) value of each aligned pair of sequences as the distance between the two original sequences in that pair. 
    • See more on different distance types at DistanceTypes
  2. What is the Simple Points file?
    • Given the input sequence file used in the particular Manxcat run, the Simple Points file presents 3D coordinates for each sequence in order. These coordinates are computed by the Manxcat program with its best effort to preserve the original distance between each pair. The term original distance refers to the distance (the transformed distance if one is specified; see Distances and Transformations) calculated by aligning the corresponding two sequences.
    • Note: When Blast is used to do the alignment, it may not produce alignments for certain sequence pairs. In such cases Manxcat may not produce coordinates for all the sequences, so some point numbers may be missing from the Simple Points file even though the points are ordered by point number. A Distance Cut may also cause pairs of sequences with a distance greater than the cut value to be ignored, resulting in similar missing points in the output. 
  3. How do the coordinates in Simple Points correspond to COG clusters?
    • A predefined cluster assignment is available for each sequence in the set of COG sequences used. These assignments are available on the Introduction page.
  4. What is the difference between COG95672 and COG50000? 
  5. Can you please give more details on distance transformations, e.g. Transformation: TM10,TP4?
  6. What does DistanceCut: 0.96 mean?
  7. Is PlotViz available for Linux?
    • Currently, PlotViz is available for Windows and Mac environments only.
  8. What are the selected clusters?