iniwap 2011年04月15日 星期五 20:53 | 2549次浏览 | 2条评论
======================================================
In this project we will apply Python programming to the field of biology. In biology, scientist work with DNA sequences. A DNA sequence is made up of four symbols which each represent a nucleic acid base as follows:
Symbol | Nucleic Acid Base |
---|---|
A | Adenine |
G | Guanine |
C | Cytosine |
T | Thymine |
Following is an example (truncated for brevity) of the human chromosome 22 (source: National Center for Biotechnology Information):
GTGTCTCATGCCTGTAATCCCAACACTTTGGGAGCCTGAGGTAGAAGTTGAGGTCAGGAGGTCCAAGACA AGCCTGGGCAACATAGTGAGTCTACAAAAAAAATATTGAGAATCTACATAAATATAGTGGGGGTGGGGGT GGGGAACAAGAAAACAAAAAATTCAAAGCATAGTGAAAAGAAATATAGCAAAACCCAGCCGGGCATGGTG CCTCACGCCTGTAATCTCAGCACTTTGAGAGGCCGAGGCATGTGGATCACGGGGTCAGGAGATCGAGACC ATCCTGGCTAACACAGTGAAACCTGTCTCTACTAAAAATACAGAAAAATTAGCCACATGTGGTGGCGGGT GCCTGCAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAG TGAGCCGATATCGCGCCACTGCACTCCAGCCTGGGCGATAGAGCGAGACTCCGTCTCACAAAAAAAAAAA
In this project we will perform operations on DNA sequences such as these. The project is divided into three topics FASTA Data, Digram Frequency, and Sequence Scoring.
There are many file formats for storing and communicating DNA sequence data. For this project we will be working with the FASTA file format. The FASTA file format is described in detail at this web address: http://en.wikipedia.org/wiki/FASTA_format . The first line is called the header. The header line begins with the >
symbol. Subsequent lines that begin with a ;
symbol are comment lines. All lines that do not begin with a >
or ;
are considered to be part of the sequence. Once the header and comment lines are processed, all following lines are part of the sequence.
Create a function named parseFastaFile
. This function will take as its only parameter a variable named filepath
which is a string containing the path to the FASTA file stored locally on your computer. It will then perform the following operations:
filepath
header
, comments
, and sequence
header
variable is a String in which you must store the first line from the file that begins with the >
symbolcomments
variable is a List in which you must store a string element for each line that begins with the ;
symbolsequence
variable is a String in which you must store the DNA sequence(header, comments, sequence)
Given the human22.txt file, the function would produce (line breaks added for readability on this wiki page):
('>gi|224514636:1-29755346 Homo sapiens chromosome 22 genomic contig, GRCh37.p2 reference primary assembly', ['; Human chromosome 22', '; Source: NCBI', '; Truncated for brevity, the actual file is much longer'], 'GTGTCTCATGCCTGTAATCCCAACACTTTGGGAGCCTGAGGTAGAAGTTGAGGTCAGGAGGTCCAAGACA AGCCTGGGCAACATAGTGAGTCTACAAAAAAAATATTGAGAATCTACATAAATATAGTGGGGGTGGGGGT GGGGAACAAGAAAACAAAAAATTCAAAGCATAGTGAAAAGAAATATAGCAAAACCCAGCCGGGCATGGTG CCTCACGCCTGTAATCTCAGCACTTTGAGAGGCCGAGGCATGTGGATCACGGGGTCAGGAGATCGAGACC ATCCTGGCTAACACAGTGAAACCTGTCTCTACTAAAAATACAGAAAAATTAGCCACATGTGGTGGCGGGT GCCTGCAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAG TGAGCCGATATCGCGCCACTGCACTCCAGCCTGGGCGATAGAGCGAGACTCCGTCTCACAAAAAAAAAAA')
A digram is a sequence of two symbols in a DNA sequence. For example, the following are a few of the digrams that could be contained in a DNA sequence: 'AA', 'AC', and 'GT'. Given the set of symbols 'A', 'C', 'G', and 'T' you must first figure out all possible two symbol digrams.
For all possible digrams that you calculated above, you must find frequency of occurrence of the digram in the DNA sequence. Please note that some digrams may not appear in a given DNA sequence, in which case their frequency would be 0.
The frequency of digrams in a DNA sequence can assist scientist in DNA analysis. The frequency of a digram is calculated by counting the number of occurrences of that digram in a sequence. For additional information on frequencies please take a look at Project 3 and Project 4 solutions. Be careful, as these frequency analysis worked over strings and calculated only the frequency for each letter.
Create a function named digramFreqScores
. This function will take as its only parameter a variable named s
which is a String containing a DNA sequence. It will then perform the following operations:
scores
containing a Dictionary in which you must store:scores
is a digram: for example 'AA' or 'GT'scores
is an integer representing the total number of occurrences for the digram in the sequence s
s
into the Dictionary scores
scores
Given the sequence from human22.txt file (the third element from the tuple returned by parseFastaFile), the function would produce and return the following dictionary (line breaks added for readability on this wiki page):
{'AA': 60, 'AC': 25, 'GT': 26, 'AG': 43, 'CC': 25, 'CA': 38, 'CG': 14, 'TT': 10, 'GG': 46, 'GC': 32, 'AT': 26, 'GA': 36, 'TG': 36, 'TA': 21, 'TC': 23, 'CT': 28}
In this TODO we will present the data calculated by TODO 2.1 in another form, organized using a matrix. The following figure is a visual representation of the matrix we will be generating in this TODO (Please note that the following diagram illustrates the logical arrangements of values. The matrix you produce with this function will not contain the label row and column (first row, first column). Your matrix will only contain the integer values):
A | G | C | T | |
---|---|---|---|---|
A | 60 | 43 | 25 | 26 |
G | 36 | 46 | 32 | 26 |
C | 38 | 14 | 25 | 28 |
T | 21 | 36 | 23 | 10 |
Notice that the digram 'AG' corresponds to the position: row 1 and column 0 of the matrix and at this position in the matrix we store the frequency of the digram obtained from the dictionary we produced in TODO 2.1 (namely 36).
The labels given to the rows and colums in the picture above are only there to help illustrate what the matrix must contain. They WILL NOT be part of the matrix itself (please see the example output below).
Create a function named digramFreqMatrix
. This function will take as its only parameter a variable named d
which is a Dictionary containing scores as returned by the digramFreqScores
function of TODO 2.1 and will return as its output the matrix described above. The function must perform the following operations:
matrix
containing a List of Lists (a matrix encoded row-by-row) in which you must store:matrix
Given a dictionary of digram sequence scores from the sequence found in the human22.txt file, the function would produce:
[[60, 43, 25, 26], [36, 46, 32, 26], [38, 14, 25, 28], [21, 36, 23, 10]]
For example, the digram 'CG' is found at row C, column G and has the value 14.
The goal of this TODO is to calculate the similarity of two DNA sequences. This is a common calculation performed in computational biology.
One way to do this is through sequence scoring. We tackle a simplified version of the scoring problem in this project. To generate a score, we will use a scoring matrix (the scoring matrix is NOT the digram frequency matrix you calculated in TODO 2.2 but is given in a file - see below). The scoring matrix will give us a score for each nucleic acid base (A, G, C, T) when compared against another nucleic acid base. So for example, if we had two DNA sequences “AA” and “AA” the score would be high because it is an exact match. Whereas, if we had three sequences “AA”, “AC”, and “AT”, we might like to know if “AA” is more similar to “AC” or more similar to the sequence “AT”. We score each character in the sequence and then add all the scores for each character to get a final score.
Let's introduce the terms “haystack” and “needle”. Our haystack is a DNA sequence against which we will be scoring other sequences. Our needle is a smaller DNA sequence which we will be comparing to the larger sequence. To calculate the score for a needle in a haystack, we compare the needle string to the haystack string at each position of the larger sequence. For each position in the larger sequence we calculate a score using the scoring matrix. We want to find the position that produces the highest score. To illustrate this concept visually, please see the examples below.
Needle and haystack:
haystack = "GTGTCT" needle = "TCT"
Scoring matrix:
A G C T A 6 1 2 1 G 1 6 1 2 C 2 1 6 1 T 1 2 1 6
Sample scoring run:
Step 1: GTGTCT TCT 212 Score 5 at position 0 Step 2: GTGTCT TCT 616 Score 13 at position 1 Step 3: GTGTCT TCT 211 Score 4 at position 2 Step 4: GTGTCT TCT 666 Score 18 at position 3
In our example the score in step 4 is the highest and therefor the DNA sequences are most similar at this position. So we say that the highest score occurred at position 3.
Create a function named parseScoringFile
. This function will take as its only parameter a variable named filepath
which is a string containing the path to the scoring CSV file stored locally on your computer (the file is given below). The function will then perform the following operations:
scoreMatrix
containing a List of Lists (matrix encoded row-by-row) in which you must store:scoreMatrix
corresponds to a row from the filescoreMatrix
Given the scoring.txt file, the function would produce:
[[6, 1, 2, 1], [1, 6, 1, 2], [2, 1, 6, 1], [1, 2, 1, 6]]
Create a function named scoreSequence
. This function will take three parameters (in this order):
haystack
which contains the sequence to test against (like from the FASTA file).needle
which contains another sequence for scoring.scoring_m
which holds the scoring matrix (like from the CSV file).It will then perform the following operations:
max_score
containing a Tuple which stores two values (in order):max_score
(NOTE: if there are multiple positions that contain the high score you must return the last such position)Given the sequence from human22.txt (output sequence from parseFastaFile) and the scoring matrix derived from scoring.csv (output from parseScoringFile) the function would produce:
>>> scoreSequence(sequence, "TAATCTCAGCACTTTGAGAGGCCGAGGCAT", scoring_m) (221, 180)
Create a function named findHighScore
. This function will take three parameters (in this order):
haystack
, a String, which contains the sequence to test against (like from the FASTA file).needles
, a List of Strings, which contains multiple sequences for scoring.scoring_m
, a List of Lists (matrix), which holds the scoring matrix (like from the CSV file).It will then perform the following operations:
high_scorer
containing a Tuple which stores three values (in order):high_scorer
Given the sequence from human22.txt (output sequence from parseFastaFile) and the scoring matrix derived from scoring.csv (output from parseScoringFile) the function would produce:
>>> findHighScore(sequence, ["TAATCTCAGCACTTT", "GAGAGGCCGAGGCAT"], scoring_m) (221, 90, 'TAATCTCAGCACTTT')
As usual, you must now create the main() function that “glues” together what you did in previous TODOs. You must think carefully about which functions return values that must be used by other functions.
The function main() must do the following:
parseFastaFile
(see TODO 1)digramFreqScores
(see TODO 2.1)digramFreqMatrix
(see TODO 2.2)scoreSequence
(see TODO 2.3)parseScoringFile
(see TODO 3.1)scoreSequence
(see TODO 3.2)findHighScore
. (see (TODO 3.3)======================分割线============================
#-*-coding:utf-8-*-
# Project 5
#TODO1
'''
Open file specified by filepath
Create three variables to store the header, comments, and sequence
The header variable is a String in which you must store the first line from the file that begins with the > symbol
The comments variable is a List in which you must store a string element for each line that begins with the ; symbol
The sequence variable is a String in which you must store the DNA sequence
Return a tuple containing the (header, comments, sequence)
'''
def parseFastaFile(filepath):
f=open(filepath)
FASTA=f.readlines()
f.close()
header=FASTA[0]
comments=[''.join(line.split('\n')) for line in FASTA if line[0]==';']
sequence=''.join([''.join(line.split('\n')) for line in FASTA if line[0]!=';' and line[0]!='>'])
return (header,comments,sequence)
#TODO2.1
'''
Create a variable named scores containing a Dictionary in which you must store:
The key of each element in scores is a digram: for example 'AA' or 'GT'
The value of each element in scores is an integer representing the total number of occurrences for the digram in the sequence s
For all possible digrams, store the total number of occurrences in sequence s into the Dictionary scores
Return scores
'''
def digramFreqScores(s):
scores ={'AA':0, 'AC':0, 'GT':0, 'AG':0, 'CC':0, 'CA':0, 'CG':0, 'TT':0,
'GG':0, 'GC':0, 'AT':0, 'GA':0, 'TG':0, 'TA':0, 'TC':0, 'CT':0}
for key in scores.keys():
if(key=='AA' or key=='CC' or key=='GG' or key=='TT'):
count=0
for i in range(len(s)-1):
if(s[i:i+2])==key:
count+=1
scores[key]=count
continue
scores[key]=s.count(key)
return scores
#TODO2.2
'''
Create a variable named matrix containing a List of Lists (a matrix encoded row-by-row) in which you must store:
A frequency count corresponding to the digram at the position row, column:
the row corresponds to the first character in a digram
the column corresponds to the second character in the digram
Note: The order of the rows and columns must be AGCT (see example)
Return matrix
'''
def digramFreqMatrix(d):
matrix=[[0 for i in range(4)] for j in range(4)]
AGCT='AGCT'
for rAGCT in range(4):
for cAGCT in range(4):
matrix[rAGCT][cAGCT]=d[AGCT[rAGCT]+AGCT[cAGCT]]
return matrix
#TODO3.1
'''
Create a variable named scoreMatrix containing a List of Lists (matrix encoded row-by-row) in which you must store:
Each List in scoreMatrix corresponds to a row from the file
The elements in each list correspond to the values from the file which are comma separated
Return scoreMatrix
'''
def parseScoringFile(filepath):
scoreMatrix=[[0 for i in range(4)] for j in range(4)]
f=open(filepath)
scoring=f.readlines()
f.close()
for i in range(4):
scoreMatrix[i]=[int(scoring[i][j]) for j in range(8) if j%2==0]
return scoreMatrix
#TODO3.2
'''
Create a function named scoreSequence. This function will take three parameters (in this order):
Variable named haystack which contains the sequence to test against (like from the FASTA file).
Variable named needle which contains another sequence for scoring.
Variable named scoring_m which holds the scoring matrix (like from the CSV file).
It will then perform the following operations:
Calculate a word score for needle when aligned at each position in haystack (see above example)
Note: Do not count partial alignments (when the last character of needle is aligned with the last character of haystack, stop)
Create a variable named max_score containing a Tuple which stores two values (in order):
Position: The index (Integer) of character from haystack which produces the maximum word score for needle
Score: The maximum word score (Integer)
Return max_score (NOTE: if there are multiple positions that contain the high score you must return the last such position)
'''
def scoreSequence(haystack,needle,scoring_m):
max_score=max_s=0
position=pos=0
r=c='AGCT'
scores ={'AA':0, 'AC':0, 'GT':0, 'AG':0, 'CC':0, 'CA':0, 'CG':0, 'TT':0,
'GG':0, 'GC':0, 'AT':0, 'GA':0, 'TG':0, 'TA':0, 'TC':0, 'CT':0}
for i in range(4):
for j in range(4):
scores[r[i]+c[j]]=scoring_m[i][j]
while pos<=(len(haystack)-len(needle)):
for i in range(len(needle)):
max_s+=scores[haystack[pos+i]+needle[i]]
if(max_s>max_score):
max_score=max_s
position=pos
max_s=0
pos+=1
return (position,max_score)
#TODO 3.3
'''
Create a function named findHighScore. This function will take three parameters (in this order):
Variable named haystack, a String, which contains the sequence to test against (like from the FASTA file).
Variable named needles, a List of Strings, which contains multiple sequences for scoring.
Variable named scoring_m, a List of Lists (matrix), which holds the scoring matrix (like from the CSV file).
It will then perform the following operations:
Create a variable named high_scorer containing a Tuple which stores three values (in order):
Position: The index (Integer) of character from haystack which produces the maximum word score for maximum needle (the highest scorer)
Score: The maximum word score (Integer) for maximum needle (the highest scorer)
Needle: The needle (String) which produces the maximum score amound all needles
Return high_scorer
'''
def findHighScore(haystack,needles,scoring_m):
high_score=0
for needle in needles:
pos=scoreSequence(haystack,needle,scoring_m)[0]
score=scoreSequence(haystack,needle,scoring_m)[1]
if(high_score<score):
high_score=score
high_scorer=(pos,high_score,needle)
return high_scorer
#TODO4
'''
The function main() must do the following:
call the function parseFastaFile (see TODO 1)
call the function digramFreqScores (see TODO 2.1)
call the function digramFreqMatrix (see TODO 2.2)
call the function scoreSequence (see TODO 2.3)
call the function parseScoringFile (see TODO 3.1)
call the function scoreSequence (see TODO 3.2)
call the function findHighScore. (see (TODO 3.3)
'''
def main():
#TODO 1
s=parseFastaFile('d:\\human22.txt')
#TODO 2.1
d=digramFreqScores(s)
#TODO 2.2
digramFreqMatrix(d)
#TODO 3.1
scoring_m=parseScoringFile('d:\\scoring.txt')
#TODO 3.2
scoreSequence(s,"TAATCTCAGCACTTTGAGAGGCCGAGGCAT",scoring_m)
#TODO 3.3
print(findHighScore(s[2],["TAATCTCAGCACTTT", "GAGAGGCCGAGGCAT"],scoring_m))
if __name__=="__main__":
main()
Zeuux © 2025
京ICP备05028076号
回复 王指晓 2011年04月19日 星期二 20:02