From: candide on
Suppose you have a sequence s, say a string, for instance this one :

spppammmmegggssss

We want to split s into the following parts :

['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

i.e. each part is a run of one repeated character.

What is the pythonic way to answer this question?

A naive solution would be the following :


# -------------------------------
z='spppammmmegggssss'

zz=[]
while z:
    k=1
    while z[:k]==k*z[0]:
        k+=1
    zz+=[z[:k-1]]
    z=z[k-1:]

print(zz)
# -------------------------------


but I guess this code is not very idiomatic :(
From: Tim Chase on
On 08/10/10 19:37, candide wrote:
> Suppose you have a sequence s, say a string, for instance this one :
>
> spppammmmegggssss
>
> We want to split s into the following parts :
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> i.e. each part is a run of one repeated character.

While I'm not sure it's idiomatic, the overabuse of regexps in
Python certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')
s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

Additionally, you have all the properties of the match object
(including the start/end) available too if you need them.

You don't specify what you want to have happen with non-letters
(whitespace, punctuation, etc). The above just treats them like
any other character, finding repeats. If you just want "word"
characters, you can use the 2nd ("\w") version, or adjust
accordingly.
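
To make the two points above concrete, here is a minimal sketch. It assumes the single-group form of the pattern, `(.)\1*`, and the sample string and variable names are made up for illustration. It shows the start/end offsets carried by each match object and how the `\w` variant skips whitespace and punctuation:

```python
import re

# (.) captures one character, \1* matches further copies of it.
run_pattern = re.compile(r'(.)\1*')
# Same idea, restricted to "word" characters.
word_pattern = re.compile(r'(\w)\1*')

s = 'aa  bb!!'

# Each match object carries the run text plus its start/end offsets.
runs_with_spans = [(m.group(0), m.start(), m.end())
                   for m in run_pattern.finditer(s)]
print(runs_with_spans)

# With \w, the runs of spaces and of '!' are simply not matched.
word_runs = [m.group(0) for m in word_pattern.finditer(s)]
print(word_runs)
```

Note that `.` does not match a newline by default, so newlines would be skipped unless `re.DOTALL` is passed.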

-tkc





From: Chris Rebert on
On Tue, Aug 10, 2010 at 5:37 PM, candide <candide(a)free.invalid> wrote:
> Suppose you have a sequence s, say a string, for instance this one :
>
> spppammmmegggssss
>
> We want to split s into the following parts :
>
> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>
> i.e. each part is a run of one repeated character.
>
> What is the pythonic way to answer this question?

If you're doing an operation on an iterable, always leaf thru itertools first:
http://docs.python.org/library/itertools.html

from itertools import groupby
def split_into_runs(seq):
    return ["".join(run) for letter, run in groupby(seq)]
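
As a quick usage sketch of the groupby approach: `"".join` limits the function above to strings, so the example also includes `run_lengths`, a hypothetical helper (not part of itertools) that returns (item, count) pairs for any iterable:

```python
from itertools import groupby

def split_into_runs(seq):
    # groupby yields one (key, group) pair per run of equal consecutive items.
    return ["".join(run) for _, run in groupby(seq)]

def run_lengths(seq):
    # Hypothetical helper: count each run instead of joining it,
    # so it also works for lists, tuples, and other iterables.
    return [(item, sum(1 for _ in run)) for item, run in groupby(seq)]

print(split_into_runs('spppammmmegggssss'))
print(run_lengths([1, 1, 2, 3, 3, 3]))
```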


If itertools didn't exist:

def split_into_runs(seq):
    if not seq: return []

    iterator = iter(seq)
    letter = next(iterator)
    count = 1
    words = []
    for c in iterator:
        if c == letter:
            count += 1
        else:
            word = letter * count
            words.append(word)
            letter = c
            count = 1
    words.append(letter*count)
    return words
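
The same loop can also be written as a generator. This is only a sketch, and `iter_runs` is a name made up here. Catching StopIteration instead of testing `if not seq` means an empty iterator (which is always truthy) is handled as well as an empty string:

```python
def iter_runs(seq):
    """Yield each run of equal characters in seq as a string, lazily."""
    iterator = iter(seq)
    try:
        letter = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    count = 1
    for c in iterator:
        if c == letter:
            count += 1
        else:
            yield letter * count
            letter, count = c, 1
    # Emit the final run, which the loop never flushes.
    yield letter * count

print(list(iter_runs('spppammmmegggssss')))
```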

Cheers,
Chris
--
http://blog.rebertia.com
From: MRAB on
Tim Chase wrote:
> On 08/10/10 19:37, candide wrote:
>> Suppose you have a sequence s, say a string, for instance this
>> one :
>>
>> spppammmmegggssss
>>
>> We want to split s into the following parts :
>>
>> ['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>
>> i.e. each part is a run of one repeated character.
>
> While I'm not sure it's idiomatic, the overabuse of regexps in Python
> certainly seems prevalent enough to be idiomatic ;-)
>
> As such, you can use:
>
> import re
> r = re.compile(r'((.)\1*)')
> #r = re.compile(r'((\w)\1*)')

That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')
#r = re.compile(r'(\w)\1*')
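
To spell out the numbering: groups are counted by their opening parenthesis, so in `((.)\2*)` the outer group is 1 and the inner `(.)` is 2. A minimal sketch checking both corrected forms follows; the variable names are made up for illustration:

```python
import re

s = 'spppammmmegggssss'

# Nested form: group 1 is the whole run, group 2 is the single character.
nested = re.compile(r'((.)\2*)')
# Flat form: the only group is the single character; group(0) is the whole match.
flat = re.compile(r'(.)\1*')

nested_runs = [m.group(1) for m in nested.finditer(s)]
flat_runs = [m.group(0) for m in flat.finditer(s)]
print(nested_runs == flat_runs)

# The form with \1 inside the outer group refers to group 1 while it is
# still open, which the re module rejects at compile time.
try:
    re.compile(r'((.)\1*)')
    rejected = False
except re.error:
    rejected = True
print(rejected)
```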

> s = 'spppammmmegggssss'
> results = [m.group(0) for m in r.finditer(s)]
>
> Additionally, you have all the properties of the match object (including
> the start/end) available too if you need them.
>
> You don't specify what you want to have happen with non-letters
> (whitespace, punctuation, etc). The above just treats them like any
> other character, finding repeats. If you just want "word" characters,
> you can use the 2nd ("\w") version, or adjust accordingly.
>
From: Tim Chase on
On 08/10/10 20:30, MRAB wrote:
> Tim Chase wrote:
>> r = re.compile(r'((.)\1*)')
>> #r = re.compile(r'((\w)\1*)')
>
> That should be \2, not \1.
>
> Alternatively:
>
> r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination
of them into one malfunctioning regexp. My original trouble with
the 2nd one was that r.findall() (not .finditer) was only
returning the first letter of each run, because that's all the single
group captured. Wrapping it in the extra set of parens and using "\2"
returned the actual data in sub-tuples:

>>> s = 'spppammmmegggssss'
>>> import re
>>> r = re.compile(r'(.)\1*')
>>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>> r = re.compile(r'((.)\2*)')
>>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'),
('ggg', 'g'), ('ssss', 's')]
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

By then changing to .finditer() it made them both work the way I
wanted.
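
If you would rather stay with .findall(), one option (a sketch, not something tested beyond this sample string) is to keep the two-group pattern and pick the first element of each tuple:

```python
import re

s = 'spppammmmegggssss'
r = re.compile(r'((.)\2*)')

# With two groups, findall returns (whole_run, single_char) tuples;
# keep only the whole run from each.
runs = [whole for whole, _char in r.findall(s)]
print(runs)
```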

Thanks for catching my mistranscription.

-tkc