Using beginner python and parallel lists to find frequency of dysfunctional alleles in a population

Let's say I have a database DNA sequences from 90 people

seqList, a list, contains:

Seq1: ATGGGGGTTAAGGATA

Seq2: GCCACACTTTGGGGGA

Seq3: AGTAAAAGAACACACA... and so on until Seq90

Each separate sequence of the 90 is known as a variant

I have been given a list that gives positions within these DNA sequences where mutations regularly occur

slim_pos = [0,4,10,13] -> in ALL 90 sequences referred to above, these are the indices of where mutations typically happen.

So for Seq1, the nucleotides/positions affected would be [A,G,A,A]. Pattern continues for all 90 sequences.

I have also been given a list called dys_nuc, that runs parallel to slim_pos. They have the same length and are entirely corresponding.

dys_nuc = ['A','T', 'A', 'A'] <- these nucleotides (the letters) match up with the positions listed in slim_pos. dys_nuc is a list that details the nucelotides in these positions where mutations occurs that are associated with dysfunction and disease.

My task is to find the frequency of people that have dysfunctional nucleotides the positions otulined by slim_pos. There are 90 different people/sequences, and I want to know the frequency of the dysfunctional mutations at those sites in the entire population.

I am very new to python and am completely stuck on how to go about this. I've been at it for 2 days now and think I need some help... Below is some code showing what I have to this point. It does not work because of a list indexing error, but this is more to show you how I am thinking of approaching the problem. My first thought was to make a counter (dysfunctional_counter) to record where each variant had a dysfunctional mutation in the defined mutation sites and calculate frequencies using this counter somehow.. but I am completely lost now and confused.

dysfunctional_count=[]
normal_count =[]
count = [] 
for variant in seqList:
    for(ind, val) in enumerate(slim_pos):
        if variant[val] == dys_nuq[ind]:
            count[ind] += 1
           # dysfunctional_count[variant] += 1
        else:
            normal_count[variant] += 1

Eventually, I am hoping to create an output like this:

Dysfunctional Sequence Site (this is slim_pos) Dysfunctional nucleotide Frequency of population who have dysfunctional nucleotide in the sequence site
0 A 0.06666666666666667
4 T 0.011111111111111112
10 A 0.011111111111111112
14 A 0.03333333333333333

Any guidance is very much appreciated :)

1 answer

  • answered 2021-10-15 15:35 BTables

    The index error you're seeing is because you start with an empty list count and try to index into it. You also define normal_count as a list, but the index you're using is of type str. I would personally use dictionaries for this, and to make lookup after easier.

    dys_count = {dys: 0 for dys in dys_nuq} 
    for variant in seqList:
        for(ind, val) in enumerate(slim_pos):
            if variant[val] == dys_nuq[ind]:
                dys_count[dys_nuq[ind]] += 1
    
    
        
    

    Then, all you have to do is look up each nucleotide by name. For example

    freq = dys_count['A'] / len(seqList)
    

    Or, if you wanted to loop over all the data

    for nucl, occur in dys_count.items():
        print(f'Nucleotide {nucl} was found {occur} times')
    

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum