The COG Project

Tuesday, May 29, 2012

COG 95672 NW PID Log(1-d^4) with Sammon

Description

DataSet: COG Size: 95672 Unique: Yes

Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4

DistanceType: (1 - PercentIdentity) Transformation: TM12,TP4

Mapping: Sammon DistanceCut: None

Initialization: Random

Fixed: None

Varied: All

DensitySat: 0.85

Images

Full Sample with Selected Clusters

Full Sample with Selected Clusters Zoomed-in

COG 95672 NW PID Log(1-d^2) with Sammon

Description

DataSet: COG Size: 95672 Unique: Yes

Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4

DistanceType: (1 - PercentIdentity) Transformation: TM12,TP2

Mapping: Sammon DistanceCut: None

Initialization: Random

Fixed: None

Varied: All

DensitySat: 0.85

Images

Full Sample with Selected Clusters

Full Sample with Selected Clusters Zoomed-in

Friday, May 25, 2012

COG 95672 NW PID Log(1-d^6) with Sammon

Description

DataSet: COG Size: 95672 Unique: Yes

Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4

DistanceType: (1 - PercentIdentity) Transformation: TM12,TP6

Mapping: Sammon DistanceCut: None

Initialization: Random

Fixed: None

Varied: All

DensitySat: 0.85

Images

Full Sample with Selected Clusters

Full Sample with Selected Clusters Zoomed-in

Full Sample with Selected Clusters Zoomed-in Further

Wednesday, May 23, 2012

COG 95672 NW PID Log(1-d) with Sammon

Description

DataSet: COG Size: 95672 Unique: Yes

Aligner: NeedlemanWunsch ScoringMatrix: BLOSUM62 GapOpen: -16 GapExt: -4

DistanceType: (1 - PercentIdentity) Transformation: TM12

Mapping: Sammon DistanceCut: None

Initialization: Random

Fixed: None

Varied: All

DensitySat: 0.85

Images

Full Sample with Selected Clusters

Full Sample with Selected Clusters Zoomed-in

Friday, March 30, 2012

The Role of Seven Clusters

The seven clusters were chosen early on as an interesting way to examine the value of the distance transformation. They are:
COG0444 137 members
COG4608 130 members
COG1131 240 members
COG1126 114 members
COG1136 195 members
COG3842 110 members
COG3849 135 members

We show the analysis in terms of
original distance versus Euclidean distance in the 3D map
and
original distance for two different methods.

The intracluster set is the collection of all pairs of points inside the same cluster; it measures how well individual clusters are mapped.
The intercluster set is the collection of all pairs with one point in one of the seven clusters and the other point in a different one. The quality of these plots measures the relative placement of the clusters.
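The two pair collections described above can be sketched in a few lines. This is a minimal illustration, not the code used for the plots: the names `split_pairs` and `original_vs_mapped`, the label list, and the nested-list distance matrix are all assumptions made for the example. Points outside the seven clusters are marked with `None` and skipped.

```python
from itertools import combinations
from math import dist


def split_pairs(labels):
    """Split all point pairs into same-cluster and cross-cluster pairs.

    labels[i] is the cluster name of point i, or None if the point is
    not in one of the selected clusters (hypothetical convention).
    """
    same_cluster, cross_cluster = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] is None or labels[j] is None:
            continue
        if labels[i] == labels[j]:
            same_cluster.append((i, j))
        else:
            cross_cluster.append((i, j))
    return same_cluster, cross_cluster


def original_vs_mapped(pairs, original, coords):
    """Pair each original distance with the Euclidean distance in the 3D map."""
    return [(original[i][j], dist(coords[i], coords[j])) for i, j in pairs]


labels = ["COG0444", "COG0444", "COG4608", None, "COG4608"]
same, cross = split_pairs(labels)
print(same)   # [(0, 1), (2, 4)]
print(cross)  # [(0, 2), (0, 4), (1, 2), (1, 4)]
```

Plotting the output of `original_vs_mapped` for the same-cluster pairs and for the cross-cluster pairs separately gives the two kinds of scatter plot discussed in this post.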

Wednesday, March 28, 2012

Distance Types

The distance between a given pair of sequences is calculated from the alignment produced by an algorithm such as Smith-Waterman, Needleman-Wunsch, or Blast. In general, the alignment of two sequences may appear as shown below.

Each square represents a character and dashes indicate gaps. Characters and gaps outside the aligned region are possible only with a global alignment algorithm like Needleman-Wunsch. For local alignments produced by Smith-Waterman and Blast, the aligned region is identical to the entire alignment. Also note that in a local alignment neither the starting pair nor the ending pair of characters from the two sequences includes a gap character.

If the two aligned sequences are S1 and S2, the aligned region runs from StartIndex to EndIndex inclusive, defined as follows.

StartIndex = Max (S1.FirstNonGapIndex, S2.FirstNonGapIndex)

EndIndex = Min(S1.LastNonGapIndex, S2.LastNonGapIndex)

The length of the aligned region is AlignedLength = EndIndex - StartIndex + 1
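As a rough sketch of the definitions above, the aligned region can be computed directly from the two gapped rows of an alignment. The function name `aligned_region`, the use of `-` as the gap character, and the example sequences are assumptions made for illustration.

```python
def aligned_region(s1: str, s2: str, gap: str = "-"):
    """Return (StartIndex, EndIndex, AlignedLength) for two aligned rows."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")

    def first_non_gap(s: str) -> int:
        for i, c in enumerate(s):
            if c != gap:
                return i
        raise ValueError("sequence contains only gaps")

    def last_non_gap(s: str) -> int:
        return len(s) - 1 - first_non_gap(s[::-1])

    start = max(first_non_gap(s1), first_non_gap(s2))
    end = min(last_non_gap(s1), last_non_gap(s2))
    return start, end, end - start + 1


# A global alignment with overhangs on both sides.
print(aligned_region("--ACGTTA--", "GGAC-TTACC"))  # (2, 7, 6)
```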

Percent Identity (PID) Distance

Let NumOfIdenticalPairs be the number of identical pairs within the aligned region. For example, in the above picture there are five such pairs (2 green, 1 purple, 1 blue, 1 red).
PID = NumOfIdenticalPairs / AlignedLength
Convert percent identity to a distance by taking 1 - PID.
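The PID distance can be sketched as follows, assuming the alignment is given as two gapped strings of equal length with `-` as the gap character. The function name `pid_distance` and the example sequences are illustrative only. The aligned region is found by stripping leading and trailing gaps from each row.

```python
def pid_distance(s1: str, s2: str, gap: str = "-") -> float:
    """Return 1 - PID, where PID is computed over the aligned region."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")

    # StartIndex = Max of the first non-gap indices of the two rows.
    start = max(len(s1) - len(s1.lstrip(gap)), len(s2) - len(s2.lstrip(gap)))
    # EndIndex = Min of the last non-gap indices of the two rows.
    end = min(len(s1.rstrip(gap)) - 1, len(s2.rstrip(gap)) - 1)
    aligned_length = end - start + 1
    if aligned_length <= 0:
        raise ValueError("empty aligned region")

    # A column counts as identical only if both characters match and are not gaps.
    identical = sum(
        1
        for a, b in zip(s1[start:end + 1], s2[start:end + 1])
        if a == b and a != gap
    )
    return 1.0 - identical / aligned_length


# Five identical pairs in an aligned region of length six: 1 - 5/6.
print(pid_distance("--ACGTTA--", "GGAC-TTACC"))
```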

Score

Each pair of aligned characters is assigned a score using the substitution matrix and gap penalties.
The summation of all such scores is called the score of the alignment. The alignment algorithm always tries to maximize this value.
See [1] for more details.
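The sketch below shows how such a score can be summed over an existing alignment. It rests on several assumptions: the toy substitution function stands in for a real matrix such as BLOSUM62, and the affine gap convention used here (the first gap character costs GapOpen and each further character in the same gap costs GapExt) is one common choice that may differ from the one used by the aligner in these runs.

```python
def alignment_score(s1, s2, substitution, gap_open=-16, gap_ext=-4, gap="-"):
    """Sum substitution scores and affine gap penalties over an alignment."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    open_row = None  # row (1 or 2) that currently has an open gap, if any
    for a, b in zip(s1, s2):
        if a == gap or b == gap:
            row = 1 if a == gap else 2
            # Extending an existing gap is cheaper than opening a new one.
            total += gap_ext if open_row == row else gap_open
            open_row = row
        else:
            total += substitution(a, b)
            open_row = None
    return total


def toy_substitution(a, b):
    """Hypothetical scores: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


# Four matches (+8), one gap opened (-16) and extended once (-4).
print(alignment_score("ACGTTA", "AC--TA", toy_substitution))  # -12
```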

Normalized Score

Compute the score for aligning S1 with S2, S1 with S1, and S2 with S2. Call these values S1S2, S1S1, and S2S2 respectively.
NormalizedScore = 2 * S1S2 / (S1S1 + S2S2)
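A small sketch of the formula, taking the three alignment scores as already computed by whatever aligner is in use. The function name and the sample numbers are made up for illustration.

```python
def normalized_score(s1s2: float, s1s1: float, s2s2: float) -> float:
    """Return 2 * S1S2 / (S1S1 + S2S2)."""
    denominator = s1s1 + s2s2
    if denominator == 0:
        raise ValueError("self-alignment scores sum to zero")
    return 2 * s1s2 / denominator


print(normalized_score(30, 50, 70))  # 0.5
print(normalized_score(50, 50, 50))  # 1.0
```

When S1 and S2 are the same sequence, all three scores are equal and the result is 1.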

BitScore

Blast alignment has a value called BitScore, which is a log scaled version of the Score.
See [1,2] for more details.

Normalized BitScore

Similar to NormalizedScore, compute BitScore for aligning S1 with S2, S1 with S1, and S2 with S2.
Use the same formula as in NormalizedScore to compute NormalizedBitScore.

References:

[1] homepages.ulb.ac.be/~dgonze/TEACHING/stat_scores.pdf

[2] http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

FAQ

What does PID stand for?

PID stands for Percent Identity. It indicates that the particular Manxcat run used the (1 - PID) value of each aligned pair of sequences as the distance between the two original sequences in that pair.
See the Distance Types post for more on the different distance types.

What is the Simple Points file?

Given the input sequence file used in a particular Manxcat run, the Simple Points file lists the 3D coordinates of each sequence in order. The Manxcat program computes these coordinates with its best effort to preserve the original distance between each pair. The term original distance refers to the distance (the transformed distance if a transformation is specified; see Distances and Transformations) calculated by aligning the corresponding two sequences.
Note: when Blast is used for the alignment, some sequence pairs may not produce an alignment. In such cases Manxcat may not produce coordinates for all the sequences, so some point numbers may be missing from the Simple Points file even though the points are ordered by point number. A Distance Cut may also cause pairs of sequences with a distance greater than the cut to be ignored, resulting in similar missing points in the output.

How do the coordinates in Simple Points correspond to COG clusters?

A predefined cluster assignment is available for each sequence in the set of COG sequences used. These assignments are available on the Introduction page.

What is the difference between COG95672 and COG50000?

COG95672 stands for the unique sequences filtered from the original set of sequences. It has 95672 sequences.
COG50000 stands for the first 50,000 sequences taken out of the unique sequences file.
Also note both these files preserve the same order as in the original set of sequences. In other words the sequences are not randomized.

Can you please give more details on distance transformations, e.g. Transformation: TM10,TP4?

Please refer to the separate post on Distances and Transformations

What does DistanceCut: 0.96 mean?

Please refer to the separate post on Distance Cut

Is PlotViz available for Linux?

Currently, PlotViz is available for Windows and Mac environments only.

What are the selected clusters?

Please refer to the separate post on The Role of Seven Clusters
