From: Steve Howell on
On Jan 26, 11:34 pm, Arnaud Delobelle <arno...(a)googlemail.com> wrote:
> Steve Howell <showel...(a)yahoo.com> writes:
> > On Jan 25, 1:32 pm, Arnaud Delobelle <arno...(a)googlemail.com> wrote:
> >> Steve Howell <showel...(a)yahoo.com> writes:
>
> >> [...]
>
> >> > My algorithm does exactly N pops and roughly N list accesses, so I
> >> > would be going from N*N + N to N + N log N if switched to blist.
>
> >> Can you post your algorithm?  It would be interesting to have a concrete
> >> use case to base this discussion on.
>
> > I just realized you meant the Python code itself.  It is here:
>
> >https://bitbucket.org/showell/shpaml_website/src/tip/shpaml.py
>
> Hi - I didn't have the time to look at it before today.  You scan
> through a list called prefix_lines, and you use pop(0) as a means to
> keep track of your position in this list.  It seems to me that it would
> be more effective to explicitely keep track of it - and it would remove
> the need to the numerous copies of sublists of indent_lines that you
> have to create.
>
> I have rewritten part of the three relevant functions (html_block_tag,
> get_indented_block and recurse, within indent_lines) to show you what I
> mean (see below).  I have tried to keep the changes to a minimum - you
> can see that the code is no more complex like this.  The advantage is
> that only one list is created, prefix_lines, and there is no need to
> mutate it or copy parts of it during the algorithm.  I have not had the
> time to test it though (only on one of the examples on the examples
> webpage).
>
> Code follows:
>
> [...]
>
> def html_block_tag(output, block, start, end, recurse):
>     append = output.append
>     prefix, tag = block[start]
>     if RAW_HTML.regex.match(tag):
>         append(prefix + tag)
>         recurse(block, start + 1, end)
>     else:
>         start_tag, end_tag = apply_jquery_sugar(tag)
>         append(prefix + start_tag)
>         recurse(block, start + 1, end)
>         append(prefix + end_tag)
>
> [...]
>
> def get_indented_block(prefix_lines, start, end):
>     prefix, line = prefix_lines[start]
>     len_prefix = len(prefix)
>     i = start + 1
>     while i < end:
>         new_prefix, line = prefix_lines[i]
>         if line and len(new_prefix) <= len_prefix:
>             break
>         i += 1
>     while i-1 > start and prefix_lines[i-1][1] == '':
>         i -= 1
>     return i
>
> [...]
>
> def indent_lines(lines,
>             output,
>             branch_method,
>             leaf_method,
>             pass_syntax,
>             flush_left_syntax,
>             flush_left_empty_line,
>             indentation_method,
>             get_block,
>             ):
>     append = output.append
>     def recurse(prefix_lines, start, end):
>         while start < end:
>             prefix, line = prefix_lines[start]
>             if line == '':
>                 start += 1
>                 append('')
>             else:
>                 block_end = get_block(prefix_lines, start, end)
>                 if block_end == start + 1:
>                     start += 1
>                     if line == pass_syntax:
>                         pass
>                     elif line.startswith(flush_left_syntax):
>                         append(line[len(flush_left_syntax):])
>                     elif line.startswith(flush_left_empty_line):
>                         append('')
>                     else:
>                         append(prefix + leaf_method(line))
>                 else:
>                     branch_method(output, prefix_lines, start, block_end, recurse)
>                     start = block_end
>         return
>     prefix_lines = map(indentation_method, lines)
>     recurse(prefix_lines, 0, len(prefix_lines))

I think your rewrite makes sense for the way that Python is
implemented today.

Instead of mutating the list as you consume elements, you propose to
advance the "start" variable, which is essentially a pointer.

There are only two disadvantage of the "start" approach that I can
think of:

1) You have to pass around two parameters where before there was
only one. (Aside: for stylistic concerns, would you bundle them in a
tuple, since they kind of go hand in hand?)
2) You hold on to memory from the elements longer.

Pushing this complexity down into CPython of course would have similar
disadvantages:

1) you have to store two pointers where before there was only one
2) you hold on to memory from pointers to orphaned elements longer

Disadvantage #1 is more than offset by the fact that you would have
had to waste memory with the "start" in the Python program.

Disadvantage #2 is offset by the fact that you would have been holding
on to the elements themselves in the Python program.

Admittedly this a pretty narrow context where pushing the logic down
into CPython gains overall simplicity and performance. Disadvantage
#1 is the sticky point when you consider the patch in the broader
context of all programs.

Thanks for looking at the code, by the way.