- Original sequences (including duplicates)
  - Count: 100514
  - Sequence file: http://salsahpc.indiana.edu/manxcat/cog/input/cog.fasta
  - Cluster assignment in [sequence_number]<tab>[cluster_name]: http://salsahpc.indiana.edu/manxcat/cog/input/clustAssign2_sorted_by_pnum(sbp)_clean.txt
- Filtered unique sequences
  - Count: 95672
  - Sequence file: http://salsahpc.indiana.edu/manxcat/cog/input/cog_unique.fasta
  - Cluster assignment in [sequence_number]<tab>[cluster_name]: http://salsahpc.indiana.edu/manxcat/cog/input/clustAssign2_sorted_by_pnum_unique_95672(sbpu).txt
  - Cluster assignment in [sequence_number]<tab>[cluster_number]: http://salsahpc.indiana.edu/manxcat/cog/input/clustAssign2_sorted_by_pnum_unique_95672(sbpu)_numbers.txt
- Filtered unique first 50,000 sequences
  - Count: 50000
  - Cluster assignment in [sequence_number]<tab>[cluster_name]: http://salsahpc.indiana.edu/manxcat/cog/input/clustAssign2_sorted_by_pnum_unique_first50k(sbpu).txt
  - Cluster assignment in [sequence_number]<tab>[cluster_number]: http://salsahpc.indiana.edu/manxcat/cog/input/clustAssign2_sorted_by_pnum_unique_first50k(sbpu)_numbers.txt
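The cluster assignment files listed above are described as one [sequence_number]<tab>[cluster] pair per line. A minimal sketch of reading that layout might look like the following. The function name and the sample lines are hypothetical, and the sketch assumes there is no header row and that the sequence number is an integer; neither is confirmed by the files themselves.

```python
def parse_cluster_assignments(lines):
    """Parse '[sequence_number]<tab>[cluster]' lines into a dict.

    Assumes no header row and integer sequence numbers (not verified
    against the real files).
    """
    assignments = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so cluster names may contain spaces.
        seq, cluster = line.split("\t", 1)
        assignments[int(seq)] = cluster.strip()
    return assignments


# Hypothetical sample lines, for illustration only.
sample = ["0\tclusterA\n", "1\tclusterB\n", "2\tclusterA\n"]
print(parse_cluster_assignments(sample))
```

For a downloaded file, the same function can be given an open file object, since iterating over a file yields its lines.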
General Comments
In addition to the program output, we link to plot files that can be viewed with Plotviz http://salsahpc.indiana.edu/pviz3/ (latest version http://salsahpc.indiana.edu/pviz3/data/PVIZ3-0.8.11-win64.exe). Plotviz is available for Mac and Windows; there is currently no Linux version.
We also give typical screen dumps of 2D projections of the 3D mappings. In each case there are 3 heat maps, which plot original versus mapped distance for
- Full data sample
- Distances between points inside 7 clusters (intra)
- Distances between points in different clusters, over all pairs of the 7 clusters (inter); pairs of points inside the same cluster are excluded
There are too many entries to use normal scatterplots, so we use heat maps, which allow a more sophisticated representation of density.
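The binning behind such a heat map can be sketched as a 2D histogram of (original distance, mapped distance) pairs. This is only an illustration and not the code used to make the plots; the function name, the bin count and the assumption that distances lie in [0, max_dist] are all hypothetical choices.

```python
def density_grid(original, mapped, bins=50, max_dist=1.0):
    """Count (original, mapped) distance pairs in a bins x bins grid.

    grid[row][col]: row indexes the mapped distance, col the original one.
    Values outside [0, max_dist] are clamped into the edge bins.
    """
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(original, mapped):
        col = min(max(int(x / max_dist * bins), 0), bins - 1)
        row = min(max(int(y / max_dist * bins), 0), bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical distances, for illustration only.
original = [0.10, 0.12, 0.90, 0.95]
mapped = [0.11, 0.10, 0.85, 0.97]
grid = density_grid(original, mapped, bins=2)
print(grid)
```

Each cell count can then be mapped to a color, which keeps the plot readable when the number of distance pairs is far too large for individual points.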
Manxcat is one of our two major dimension reduction programs. It can cope with methods like Sammon that have non-unit weights in the sum of distance discrepancies. It can also cope with undefined distances.
We will be adding more explanation
Note that Needleman Wunsch, Blast Bitscore and SWG Score give reasonable answers; SWG PID and Blast PID do not.
Distance Cut
There is some evidence that distance computations are unreliable when they are near 1. This is illustrated by the lack of correlation between Blast and Needleman Wunsch for Blast distances above 0.9.
So in some dimension reduction runs we remove terms where distances are larger than "Distance Cut", whose value is specified on input. Remember we are minimizing a sum of N(N-1)/2 terms, one per pair of points i and j, each of the form (distance between i and j - Euclidean distance between mapped i and j)^2. So we drop the terms where the distance between i and j > Distance Cut. Typically each point is still left with some distances below the cut, so its mapping can still be determined. For Blast, distances are sometimes not calculated, and those terms are also left out of the sum.
Manxcat is the only dimension reduction program that allows missing distances. It removes any point that has Linkcut or fewer valid distances. Linkcut defaults to 5.
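The effect of Distance Cut and Linkcut on the sum being minimized can be sketched as follows. This is not Manxcat's code: the function names and the dict-of-pairs layout are hypothetical, unit weights are assumed, undefined distances are modeled simply as pairs absent from the dict, and the Linkcut filter is applied in a single pass (whether Manxcat repeats it after points are removed is not stated here).

```python
import math


def apply_distance_cut(distances, distance_cut):
    """Keep only pairs whose distance is not larger than distance_cut.

    distances: dict {(i, j): d}; pairs missing from the dict are
    treated as undefined distances.
    """
    return {pair: d for pair, d in distances.items() if d <= distance_cut}


def apply_link_cut(pairs, link_cut=5):
    """Drop points with link_cut or fewer valid distances (single pass)."""
    counts = {}
    for i, j in pairs:
        counts[i] = counts.get(i, 0) + 1
        counts[j] = counts.get(j, 0) + 1
    kept = {p for p, c in counts.items() if c > link_cut}
    remaining = {(i, j): d for (i, j), d in pairs.items()
                 if i in kept and j in kept}
    return kept, remaining


def stress(pairs, coords):
    """Sum of (original distance - mapped Euclidean distance)^2."""
    return sum((d - math.dist(coords[i], coords[j])) ** 2
               for (i, j), d in pairs.items())


# Hypothetical toy data: three points with 3D coordinates.
distances = {(0, 1): 0.5, (0, 2): 0.95, (1, 2): 0.3}
coords = {0: (0.0, 0.0, 0.0), 1: (0.5, 0.0, 0.0), 2: (0.5, 0.3, 0.0)}
pairs = apply_distance_cut(distances, distance_cut=0.9)  # drops (0, 2)
kept, pairs = apply_link_cut(pairs, link_cut=0)
print(sorted(kept), stress(pairs, coords))
```

A real run would minimize `stress` over `coords`; the sketch only evaluates it for fixed coordinates to show which terms survive the two cuts.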
Distances and Transformations
Distances are rather arbitrary. We have different "biology" approaches including
a) Needleman Wunsch
b) Smith Waterman Gotoh (SWG) Percentage Identity (PID)
c) SWG Score
d) Blast PID
e) Blast Bitscore
Further, given any distance d(i,j), we can replace it by m(d(i,j)) for any monotonic mapping function y --> m(y) satisfying
m(y1) > m(y2) for y1>y2
Typically the mapping is chosen to make the distribution of distances "more reasonable". Note that for a high dimensional random distribution, one can show that
D = Formal Dimension = 2 Mean^2 / Sigma^2
This shows that high dimension corresponds to a standard deviation of distances (sigma) that is small compared to the mean. High dimension translates into mapped points concentrated at the edge of the surface (a sphere) when mapped to 3D. So cases like Blast and SWG, with a huge peak at distance = 1, have a high dimension because sigma is small compared to the mean.
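As a small illustration, the formal dimension formula above can be evaluated directly from a list of distances. The function name and both sample lists are hypothetical, and the population variance is used for Sigma^2, which is an assumption.

```python
from statistics import fmean, pvariance


def formal_dimension(distances):
    """D = 2 * Mean^2 / Sigma^2, with Sigma^2 the population variance."""
    mean = fmean(distances)
    var = pvariance(distances, mu=mean)
    if var == 0:
        raise ValueError("all distances are equal; dimension is undefined")
    return 2.0 * mean * mean / var


# Hypothetical samples: one spread over (0, 1], one peaked just below 1.
broad = [k / 10 for k in range(1, 11)]
peaked = [0.9 + k / 100 for k in range(1, 11)]
print(formal_dimension(broad), formal_dimension(peaked))
```

The peaked sample has a much smaller sigma relative to its mean, so its formal dimension comes out far larger than that of the broad sample.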
So we choose mapping m to reduce dimension while retaining ordering of distances. We have looked at several choices of m.
a) Transformation Method 10
m(y) = y^TP where TP is the transformation parameter; TP = 2, 4 and 6 were investigated
b) Transformation Method 8
m(y) maps to 4 dimensions. If you assume your data is randomly distributed in a sphere in dimension D, you can analytically derive a mapping m such that m(d) corresponds to points randomly distributed in 2 or 4 dimensions. Transformation Method 8 implements this mapping for a final dimension of 4. Note that the original data is not random, so one doesn't get an exact final dimension, but it is typically around 4.
c) Transformation Method 9 or SQRT(4D)
Here we start with the mapping m8(d), which maps to 4 dimensions. Then we INCREASE the formal mapped dimension by
m9(d) = m8(d)^0.5
Note that in Transformation Method 10, TP > 1 lowers the formal dimension but TP < 1 increases it. Thus m9(d) has a larger formal dimension than m8(d), which is ~4. For COG, m9(d) has a formal dimension of around 14.
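The effect of the power mapping m(y) = y^TP on the formal dimension can be checked numerically. The sketch below is only an illustration under assumptions: the function names are hypothetical, the sample distances are synthetic (evenly spaced between 0.5 and 1) rather than COG data, and the formal dimension is computed as 2 Mean^2 / Sigma^2 with the population variance.

```python
from statistics import fmean, pvariance


def power_transform(distances, tp):
    """Apply m(y) = y**tp; monotonic for y >= 0 and tp > 0."""
    return [d ** tp for d in distances]


def formal_dimension(values):
    """D = 2 * Mean^2 / Sigma^2 (population variance assumed)."""
    mean = fmean(values)
    return 2.0 * mean * mean / pvariance(values, mu=mean)


# Synthetic distances, evenly spaced in [0.5, 1.0]; not real COG data.
distances = [0.5 + 0.5 * k / 99 for k in range(100)]
for tp in (0.5, 1, 2, 4, 6):
    transformed = power_transform(distances, tp)
    print(tp, round(formal_dimension(transformed), 1))
```

On this sample the printed dimension falls as TP grows and rises for TP = 0.5, which matches the direction described for Transformation Methods 10 and 9.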