Python performance optimize, a practise

Question

Assume that we have 10000 genes, which is quantitied in normal ppl and turmor ppl. We have a algorithm to convert it to a bool table, than we want to pick up the combinations of gene have satisfied sensitivity and specificity

Sensitivity (also called the true positive rate, the recall, or probability of detection[1] in some fields) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

First idea

Seem that intutive idea is like this

  1. read bool table
  2. generate combination?
  3. check the Sen. and Spe.

use python’s itertools.combination to generate a iterator of combination of potential result and check it. However, even while handling 100 genes, the result may be tremendous. for example

1
5C100 = 75287520

The Language like Python or perl, handle these kind of loop was terriblly slow, one of my mate takes nearly 35 hours to finish a 5C100 recombination in Perl.

Is every combination needed?

After pick some genes, we actually have a bool matrix like this:

1
2
3
|<-----normal---->|<----turmor----->|
g1 0 0 1 1 1 0 1 0 0 0
g2 1 1 1 0 1 1 0 1 1 1

than we can calculate the spe, filter the set whose spe or sen < 0.8:

1
2
3
4
5
6
7
spec, sen = 0, 0
spec = len([x for x in range(normal)
if g1[x] and ge2[x]
]) / normal
sen = len([x for x in range(normal, turmor)
if g1[x] and ge2[x]
]) / turmor

Here we feel diffcult to reduct the execute time of this part of codes. However, we find out that if one gene’s spe or sen < 0.8, than the combination won’t satisfied the requirements. So some improvements could be made.

Little improvements

so assume we have a list initial, than we can filter the unsatisfied single gene.

1
candidate = [gene for gene in initial if is_satisfied(gene)]

Than we run first combination, 2 C len(candidate), far less than the inital count.

Go deeper

when we get the result of 2 elements’s combination, we next want to generate the 3 element’s combination.

1
2
3
4
5
6
def combination(list1, list2):
result = []
for x in list1:
for y in list2:
list.append(frozenset(x + [y]))
return set(result)