Wednesday, March 28, 2012

Distances and Transformations

Distances are rather arbitrary. We have several "biology" approaches to choose from, including
a) Needleman-Wunsch
b) Smith-Waterman-Gotoh (SWG) Percentage Identity (PID)
c) SWG Score
d) Blast PID
e) Blast Bitscore

Further, given any distance d(i,j), I can replace it by m(d(i,j)) for any monotonically increasing mapping function y --> m(y), i.e. one satisfying
m(y1) > m(y2) for y1 > y2
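
As a quick check of the monotonicity condition above, here is a minimal sketch that tests whether a candidate mapping keeps the ordering of a set of distances. The helper name `preserves_ordering` and the sample values are made up for illustration; they are not part of any of the alignment tools listed.

```python
from itertools import combinations

def preserves_ordering(distances, m):
    """Return True if m keeps the strict ordering of every pair of distances."""
    return all(
        (m(a) < m(b)) == (a < b) and (m(a) > m(b)) == (a > b)
        for a, b in combinations(distances, 2)
    )

# Hypothetical sample distances in [0, 1]
sample = [0.05, 0.2, 0.5, 0.9, 0.99, 1.0]

def square(y):
    return y ** 2      # strictly increasing for y >= 0

def flip(y):
    return 1.0 - y     # strictly decreasing, so the ordering is reversed

ok_square = preserves_ordering(sample, square)
ok_flip = preserves_ordering(sample, flip)
print(ok_square, ok_flip)
```

Only mappings like `square`, which pass this check, are candidates for m.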

Typically the mapping is chosen to make the distribution of distances "more reasonable". Note that if you have a high dimensional random distribution, one can show that

D = Formal Dimension = 2 Mean^2 / Sigma^2
showing that high dimension corresponds to a standard deviation of distances, Sigma, that is small compared to the mean.
High dimension translates into mapped points concentrated at the edge (the surface of a sphere) when mapped to 3D. So cases like Blast and SWG, with a huge peak at distances = 1, have a high dimension because Sigma is small compared to the mean.
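
To make the formula concrete, here is a rough sketch that computes the formal dimension of a set of distances and applies it to random points. The Gaussian point clouds and the use of plain Euclidean distance are assumptions made only for this illustration. The absolute value you get depends on the distribution and on how the distance is defined, so the sketch only shows the trend: a higher embedding dimension gives a smaller Sigma relative to the mean and hence a larger formal dimension.

```python
import math
import random
import statistics

def formal_dimension(distances):
    """Formal dimension 2 * Mean^2 / Sigma^2 of a list of distances."""
    mean = statistics.fmean(distances)
    variance = statistics.pvariance(distances)
    return 2.0 * mean ** 2 / variance

def random_pairwise_distances(dim, n_points=100, seed=0):
    """Euclidean distances between all pairs of Gaussian random points
    (an assumed test case, not real sequence data)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

low = formal_dimension(random_pairwise_distances(dim=2))
high = formal_dimension(random_pairwise_distances(dim=20))
print(f"formal dimension, 2D points:  {low:.1f}")
print(f"formal dimension, 20D points: {high:.1f}")
```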

So we choose the mapping m to reduce the dimension while retaining the ordering of distances. We have looked at several choices of m.

a) Transformation Method 10

m(y) = y^TP where TP is the transformation parameter; TP = 2, 4, 6 were investigated

b) Transformation Method 8
    m(y) maps to 4 dimensions. If you assume your data is randomly distributed in a sphere in dimension D, you can analytically derive a formula for the mapping so that m(d) corresponds to points randomly distributed in 2 or 4 dimensions. Transformation Method 8 implements this mapping for final dimension 4. Note that the original data is not random, so one doesn't get an exact final dimension, but it is typically around 4.

c) Transformation Method 9 or SQRT(4D)
  Here we start with the mapping m8(d) to 4 dimensions.
Then we INCREASE the formal mapped dimension by
m9(d) = m8(d)^0.5

Note that in Transformation Method 10, TP > 1 lowers the formal dimension but TP < 1 increases it.
Thus m9(d) has a larger formal dimension than m8, which is ~4. For COG, m9(d) has formal dimension around 14.
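
The effect of the power on the formal dimension can be seen in a small experiment. This is only a sketch: the Beta(20, 1) samples are a hypothetical stand-in for distances with a huge peak near 1, not real Blast or SWG output, and `formal_dimension` is simply the formula 2 Mean^2 / Sigma^2 written out.

```python
import random
import statistics

def formal_dimension(distances):
    """Formal dimension 2 * Mean^2 / Sigma^2 of a list of distances."""
    mean = statistics.fmean(distances)
    return 2.0 * mean ** 2 / statistics.pvariance(distances, mean)

rng = random.Random(0)
# Hypothetical distances piled up near 1: Beta(20, 1) samples in [0, 1]
distances = [rng.betavariate(20, 1) for _ in range(20000)]

dims = {}
for tp in (0.5, 1, 2, 4, 6):
    # Transformation Method 10 style power map m(y) = y^TP (TP = 1 is the identity)
    dims[tp] = formal_dimension([y ** tp for y in distances])
    print(f"TP = {tp}: formal dimension ~ {dims[tp]:.0f}")
```

The printed values fall as TP grows and rise for TP = 0.5. When Sigma is small compared to the mean, a power map rescales the formal dimension by roughly 1/TP^2 to first order, so a square root raises it by about a factor of 4. That is in line with m9 going from ~4 to around 14 rather than exactly 16, since the real data are not tightly peaked after m8.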


