From: D'Arcy J.M. Cain on
On Thu, 28 Jan 2010 07:07:04 -0800 (PST)
evilweasel <karthikramaswamy88(a)gmail.com> wrote:
> I am a newbie to python, and I would be grateful if someone could

Welcome.

> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:

You don't say how it isn't working. As a first step you should read
http://catb.org/~esr/faqs/smart-questions.html.

> The text is nothing but DNA sequences, and there is a number next to
> it. What I will have to do is, ignore those lines that have 0 in it,

Your code doesn't completely ignore them. See below.

> and print all other lines (excluding the number) in a new text file
> (in a particular format called as FASTA format). This is the program I
> wrote for that:
>
> seq1 = []
> list1 = []
> lister = []
> listers = []
> listers1 = []
> a = []
> d = []
> i = 0
> j = 0
> num = 0

This seems like an awful lot of variables for such a simple task.

>
> file1 = open(sys.argv[1], 'r')
> for line in file1:

This is good. You aren't trying to load the whole file into memory at
once. If the file is huge as you say then that would have been bad. I
would have made one small optimization that saves one assignment and
one extra variable.

for line in open(sys.argv[1], 'r'):

>    if not line.startswith('\n'):
>        seq1 = line.split()
>        if len(seq1) == 0:
>            continue

This is redundant and perhaps not even correct at the end of the file.
It assumes that the last line ends with a newline. Look at what
'\n'.split() gives you and see if you can't improve the above code.

Another small optimization - "if not seq1" is better than
"if len(seq1) == 0".
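
As a rough sketch of what split() does with blank lines (the sample
strings below are made up for illustration, they are not from your
file): with no arguments it returns an empty list for an empty or
whitespace-only line, and an empty list is false, so one test covers
every case.

```python
# Hypothetical sample lines; only the behaviour of split() matters here.
samples = ['\n', '   \n', '', 'AAAAAGACTCGAGTGCGCGGA   0\n']

fields_per_line = [line.split() for line in samples]

for line, fields in zip(samples, fields_per_line):
    if not fields:  # an empty list is false, so no len() call is needed
        print('%r -> skipped' % line)
    else:
        print('%r -> %r' % (line, fields))
```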

>
>        a = seq1[0]
>        list1.append(a)

Aha! I may have found your bug. Are you mixing tabs and spaces?
Don't do that. Either always use spaces or always use tabs. My
suggestion is to use spaces and choose a short indent such as three or
even two but that's a religious issue.

>
>        d = seq1[1]
>        lister.append(d)

You can also do "a, d = seq1". Of course you must be sure that you
have two fields. Perhaps that's guaranteed for your input but a quick
sanity test wouldn't hurt here.
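
A minimal sketch of that sanity test, assuming every good line has
exactly two whitespace-separated fields. The helper name parse_fields
is made up for this example.

```python
def parse_fields(line):
    """Return (sequence, count) as strings, or None for a malformed line."""
    fields = line.split()
    if len(fields) != 2:
        # Blank lines and lines with missing or extra fields end up here.
        return None
    sequence, count = fields  # safe to unpack: exactly two items
    return sequence, count
```

Anything that comes back as None can be skipped or reported, whichever
suits your data.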

However, I don't understand all of the above. It may also be a source
of problems. You say the files are huge. Are you filling up memory
here? You did the smart thing reading the file but you lose it here.
In any case, see below.

> b = len(lister)
> for j in range(0, b):

Go look up zip().
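
For instance, a small sketch with two short lists standing in for your
list1 and lister (the values are taken from your sample data): zip()
pairs the items up so that no index variable is needed.

```python
# Stand-ins for the two parallel lists built by the posted program.
list1 = ['AAAAAGACTCGAGTGCGCGGA', 'AAAAAGGGGGCTCACAGGGGAGGGGTAT']
lister = ['0', '1']

# zip() yields one (sequence, count) pair per position.
pairs = list(zip(list1, lister))

for sequence, count in pairs:
    print('%s %s' % (sequence, count))
```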

>    if lister[j] == 0:

I think that you will find that lister[j] is "0", not 0.
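
A short sketch of the difference, using one made-up line in the same
format as your sample: split() always returns strings, and a string
never compares equal to an integer, so the "else" branch runs for
every line.

```python
# Hypothetical line in the same format as the posted sample data.
fields = 'AAAAAGACTCGAGTGCGCGGA   0'.split()
count = fields[1]

equals_int_zero = (count == 0)         # False: a str never equals an int
equals_str_zero = (count == '0')       # True: compare text with text
equals_after_int = (int(count) == 0)   # True: or convert to int first
```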

>        listers.append(j)
>    else:
>        listers1.append(j)

Why are you collecting the input? Just toss the '0' ones and write the
other lines directly to the output.
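
Something along these lines, as a sketch rather than a drop-in
replacement. The function name write_nonzero is made up, and both the
numbering from 1 and the blank line after each record are assumptions
about the output you want.

```python
def write_nonzero(lines, out):
    """Write a FASTA-style record for each line whose count is not '0'."""
    for lineno, line in enumerate(lines, 1):  # assumption: number from 1
        fields = line.split()
        # Skip blank or malformed lines and the ones with a zero count.
        if len(fields) == 2 and fields[1] != '0':
            out.write('>seq%d\n%s\n\n' % (lineno, fields[0]))
```

It takes any iterable of lines and any object with a write() method,
so the two open file objects can be passed in directly and nothing is
held in memory beyond the current line.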

Hope this helps with this script and in further understanding the power
and simplicity of Python. Good luck.

--
D'Arcy J.M. Cain <darcy(a)druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
From: Krister Svanlund on
On Thu, Jan 28, 2010 at 4:31 PM, Krister Svanlund
<krister.svanlund(a)gmail.com> wrote:
> On Thu, Jan 28, 2010 at 4:28 PM, Krister Svanlund
> <krister.svanlund(a)gmail.com> wrote:
>> On Thu, Jan 28, 2010 at 4:07 PM, evilweasel
>> <karthikramaswamy88(a)gmail.com> wrote:
>>> Hi folks,
>>>
>>> I am a newbie to python, and I would be grateful if someone could
>>> point out the mistake in my program. Basically, I have a huge text
>>> file similar to the format below:
>>>
>>> AAAAAGACTCGAGTGCGCGGA   0
>>> AAAAAGATAAGCTAATTAAGCTACTGG     0
>>> AAAAAGATAAGCTAATTAAGCTACTGGGTT   1
>>> AAAAAGGGGGCTCACAGGGGAGGGGTAT     1
>>> AAAAAGGTCGCCTGACGGCTGC  0
>>>
>>> The text is nothing but DNA sequences, and there is a number next to
>>> it. What I will have to do is, ignore those lines that have 0 in it,
>>> and print all other lines (excluding the number) in a new text file
>>> (in a particular format called as FASTA format). This is the program I
>>> wrote for that:
>>>
>>> seq1 = []
>>> list1 = []
>>> lister = []
>>> listers = []
>>> listers1 = []
>>> a = []
>>> d = []
>>> i = 0
>>> j = 0
>>> num = 0
>>>
>>> file1 = open(sys.argv[1], 'r')
>>> for line in file1:
>>>    if not line.startswith('\n'):
>>>        seq1 = line.split()
>>>        if len(seq1) == 0:
>>>            continue
>>>
>>>        a = seq1[0]
>>>        list1.append(a)
>>>
>>>        d = seq1[1]
>>>        lister.append(d)
>>>
>>>
>>> b = len(lister)
>>> for j in range(0, b):
>>>    if lister[j] == 0:
>>>        listers.append(j)
>>>    else:
>>>        listers1.append(j)
>>>
>>>
>>> print listers1
>>> resultsfile = open("sequences1.txt", 'w')
>>> for i in listers1:
>>>    resultsfile.write('\n>seq' + str(i) + '\n' + list1[i] + '\n')
>>>
>>> But this isn't working. I am not able to find the bug in this. I would
>>> be thankful if someone could point it out. Thanks in advance!
>>>
>>> Cheers!

I'm trying this again:

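import sys

newlines = []

# Read the input file named by the first command-line argument.
with open(sys.argv[1], 'r') as f:
   text = f.read()
   for line in (l.strip() for l in text.splitlines()):
       if line:
           line_elem = line.split()
           # Keep only two-field lines whose count is exactly '1'.
           if len(line_elem) == 2 and line_elem[1] == '1':
               newlines.append('seq'+line_elem[0])

# Write the kept lines to the file named by the second argument.
with open(sys.argv[2], 'w') as f:
   f.write('\n'.join(newlines))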
From: evilweasel on
I will make my question a little clearer. I have close to 60,000
lines of the data similar to the one I posted. There are various
numbers next to the sequence (this is basically the number of times
the sequence has been found in a particular sample). So, I would need
to ignore the ones containing '0' and write all other sequences
(excluding the number, since it is trivial) in a new text file, in the
following format:

>seq59902
TTTTTTTATAAAATATATAGT

>seq59903
TTTTTTTATTTCTTGGCGTTGT

>seq59904
TTTTTTTGGTTGCCCTGCGTGG

>seq59905
TTTTTTTGTTTATTTTTGGG

The number next to 'seq' is the line number of the sequence. When I
run the above program, what I expect is an output file that is similar
to the above output but with the ones containing '0' ignored. But, I
am getting all the sequences printed in the file.

Kindly excuse the 'newbieness' of the program. :) I am hoping to
improve in the next few months. Thanks to all those who replied. I
really appreciate it. :)
From: nn on
People have already given you some pointers to your problem. In the
end you will have to "tweak the details" because only you have access
to the data not us.

Just as an example, here is another way to do what you are doing:

with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
    partgen=(line.split() for line in infile)
    dnagen=(str(i+1)+'\n'+part[0]+'\n'
            for i,part in enumerate(partgen)
            if len(part)>1 and part[1]!='0')
    outfile.writelines(dnagen)

From: Arnaud Delobelle on
nn <pruebauno(a)latinmail.com> writes:

> People have already given you some pointers to your problem. In the
> end you will have to "tweak the details" because only you have access
> to the data not us.
>
> Just as an example, here is another way to do what you are doing:
>
> with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
>     partgen=(line.split() for line in infile)
>     dnagen=(str(i+1)+'\n'+part[0]+'\n'
>             for i,part in enumerate(partgen)
>             if len(part)>1 and part[1]!='0')
>     outfile.writelines(dnagen)

I think that generator expressions are overrated :) What's wrong with:

with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
    for i, line in enumerate(infile):
        parts = line.split()
        if len(parts) > 1 and parts[1] != '0':
            outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))

(untested)

--
Arnaud