From: John Posner on
On 1/28/2010 10:50 AM, evilweasel wrote:
> I will make my question a little clearer. I have close to 60,000
> lines of the data similar to the one I posted. There are various
> numbers next to the sequence (this is basically the number of times
> the sequence has been found in a particular sample). So, I would need
> to ignore the ones containing '0' and write all other sequences
> (excluding the number, since it is trivial) in a new text file, in the
> following format:
>
>> seq59902
> TTTTTTTATAAAATATATAGT
>
>> seq59903
> TTTTTTTATTTCTTGGCGTTGT
>
>> seq59904
> TTTTTTTGGTTGCCCTGCGTGG
>
>> seq59905
> TTTTTTTGTTTATTTTTGGG
>
> The number next to 'seq' is the line number of the sequence. When I
> run the above program, what I expect is an output file that is similar
> to the above output but with the ones containing '0' ignored. But, I
> am getting all the sequences printed in the file.
>
> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> improve in the next few months. Thanks to all those who replied. I
> really appreciate it. :)

Your program is a good first try. It contains a newbie error (looking
for the number 0 instead of the string "0"). But more importantly,
you're doing too much work yourself, rather than letting Python do the
heavy lifting for you. These practices and tools make life a lot easier:

* As others have noted, don't accumulate output in a list. Just write
data to the output file line-by-line.

* You don't need to initialize every variable at the beginning of the
program. But there's no harm in it.

* Use the enumerate() function to provide a line counter:

for counter, line in enumerate(file1):

This eliminates the need to accumulate output data in a list, then use
the index variable "j" as the line counter.

* Use string formatting. Each chunk of output is a two-line string, with
the line-counter and the DNA sequence as variables:

outformat = """seq%05d
%s
"""

... later, inside your loop ...

resultsfile.write(outformat % (counter, sequence))
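
Putting these suggestions together, a minimal sketch might look like the following. It is not the OP's program: the function name filter_sequences and the assumed input layout (one "SEQUENCE COUNT" pair per line, separated by whitespace) are illustrative guesses, and in-memory files stand in for the real input and output files.

```python
import io

# Output template from the advice above: a header line, then the sequence.
OUTFORMAT = "seq%05d\n%s\n"


def filter_sequences(infile, outfile):
    """Write every sequence whose count is not "0" to outfile.

    Assumes each input line looks like "SEQUENCE COUNT" (whitespace
    separated); that layout is a guess, not taken from the original data.
    """
    for counter, line in enumerate(infile):
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sequence, count = fields
        # Fields read from a file are strings, so compare with "0", not 0.
        if count == "0":
            continue
        # Write each record immediately instead of collecting it in a list.
        outfile.write(OUTFORMAT % (counter, sequence))


if __name__ == "__main__":
    sample = io.StringIO("AAAT 3\nGGCC 0\nTTGA 1\n")
    result = io.StringIO()
    filter_sequences(sample, result)
    print(result.getvalue(), end="")
```

Note that enumerate() starts counting at 0; if the sequence labels should start at 1, enumerate(infile, 1) does that.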

HTH,
John
From: Jean-Michel Pichavant on
Using regexp may increase readability (if you are familiar with it).
What about

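import re
import sys

output = open("sequences1.txt", 'w')

for index, line in enumerate(open(sys.argv[1], 'r')):
    # re.match() needs the text to search as its second argument.
    # This pattern keeps only lines whose count starts with 1.
    match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
    if match:
        output.write('seq%s\n%s\n' % (index, match.group('sequence')))

output.close()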
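
The pattern above only keeps lines whose count begins with 1. If the counts can be any number, one possible variation is to capture the count as well and skip only the zeros. This is a sketch under assumptions: the names LINE_RE and nonzero_sequences are made up for illustration, and the input layout (sequence, whitespace, integer count) is a guess.

```python
import re

# A sequence, whitespace, then an integer count at the end of the line.
# The count is captured so it can be tested separately.
LINE_RE = re.compile(r"(?P<sequence>[GATC]+)\s+(?P<count>\d+)\s*$")


def nonzero_sequences(lines):
    """Yield (index, sequence) for each line whose count is not zero."""
    for index, line in enumerate(lines):
        match = LINE_RE.match(line)
        # Lines that do not fit the pattern are silently skipped.
        if match and int(match.group("count")) != 0:
            yield index, match.group("sequence")


if __name__ == "__main__":
    sample = ["AAAT 3\n", "GGCC 0\n", "TTGA 12\n", "bad line\n"]
    for index, sequence in nonzero_sequences(sample):
        print("seq%s\n%s" % (index, sequence))
```

Converting the count with int() also treats a count written as "00" as zero, which a plain string comparison would not.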


Jean-Michel
From: D'Arcy J.M. Cain on
On Thu, 28 Jan 2010 18:49:02 +0100
Jean-Michel Pichavant <jeanmichel(a)sequans.com> wrote:
> Using regexp may increase readability (if you are familiar with it).

If you have a problem and you think that regular expressions are the
solution then now you have two problems. Regex is really overkill for
the OP's problem and it certainly doesn't improve readability.

From: Jean-Michel Pichavant on
D'Arcy J.M. Cain wrote:
> On Thu, 28 Jan 2010 18:49:02 +0100
> Jean-Michel Pichavant <jeanmichel(a)sequans.com> wrote:
>
>> Using regexp may increase readability (if you are familiar with it).
>>
>
> If you have a problem and you think that regular expressions are the
> solution then now you have two problems. Regex is really overkill for
> the OP's problem and it certainly doesn't improve readability.
>
>
It depends on the reader's ability to understand a *simple* regexp.
It is also strange to get such an answer after taking so many precautions,
so let me quote myself:

"Using regexp *may* increase readability (*if* you are *familiar* with it)."

I honestly find it quite readable in the sample code I provided, and it
spares all the if-len-startswith-strip logic, but if the OP does not
agree, that's fine with me. There's no need to be so certain that I'm
completely wrong.

JM


From: Steven Howe on
On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
> Using regexp may increase readability (if you are familiar with it).

Finally!

After reading 8 or 9 messages about finding a line ending with '1', someone
suggests regex.
It was my first thought.

Steven