From: Piotr Wyderski on
Many compilers for IA-32 generate "repne scasb" in order
to find the length of a given C string. However, it is possible
to implement strlen (and many other string functions) using
the SSE2 instruction set: pcmpeqb + pmovmaskb until there
is a set bit, then bsf to find its index. On Core 2 this is roughly
9.3 times faster, and about 6.5 times faster on Pentium 4.
The same technique could obviously be applied to wider
data types as well.
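For illustration, here is a minimal C sketch of that pcmpeqb + pmovmaskb + bsf loop using intrinsics (the function name is made up; __builtin_ctz is the GCC/Clang stand-in for bsf). It assumes the string is 16-byte aligned, so every load stays inside one aligned block and cannot cross a page boundary:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch only: assumes s is 16-byte aligned. */
size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p); /* aligned load */
        /* pcmpeqb against zero, pmovmaskb to get one bit per byte */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)  /* some byte in this block is the terminator */
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask); /* bsf */
        p += 16;
    }
}
```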

My question is: why isn't the "repne scas" idiom microcoded
internally to implement the above (or better!) sequence using
the inherent power of the on-chip SIMD execution engines?
The same applies to repe cmp (that is, strcmp, memcmp)...

Best regards
Piotr Wyderski

From: Terje Mathisen on
Piotr Wyderski wrote:
> Many compilers for IA-32 generate "repne scasb" in order
> to find the length of a given C string. However, it is possible
> to implement strlen (and many other string functions) using the SSE2
> instruction set: pcmpeqb + pmovmaskb until there
> is a set bit, then bsf to find its index. On Core 2 this is roughly
> 9.3 times faster, and about 6.5 times faster on Pentium 4.

I assume this is measured starting with 16-byte aligned strings?

To handle arbitrary starting alignment you have to special-case the first
block, probably by POR'ing in a 16*16 = 256-byte lookup table value, or
you could use a single 31-byte table with a MOVUPS unaligned load?

However, if you're willing to look at misaligned data, then it is
probably much better to simply handle the first block with a MOVUPS
load, do the testing, then move on to aligned accesses from then on.

This is because (a) quite a few strings will be at least 8-byte aligned
anyway, and (b) most C-style zero-terminated strings are quite short,
needing only a single iteration of a 16-byte wide operation.
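That strategy can be sketched like this (hypothetical names, GCC/Clang __builtin_ctz for bsf): an unaligned MOVUPS load covers the first block, then the loop continues with aligned loads from the next 16-byte boundary. Note the caveat raised later in the thread: the unaligned first load can fault if the string sits within 15 bytes of the end of the last mapped page.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: MOVUPS first block, aligned loads afterwards. */
size_t strlen_movups_first(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i chunk = _mm_loadu_si128((const __m128i *)s);  /* movups */
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    if (mask)  /* short-string fast path: done after one iteration */
        return (size_t)__builtin_ctz(mask);

    /* continue with aligned 16-byte loads from the next boundary;
       re-checking the overlap with the first block is harmless */
    const char *p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    for (;; p += 16) {
        chunk = _mm_load_si128((const __m128i *)p);
        mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```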

> The same technique could obviously be applied to wider
> data types as well.
>
> My question is: why isn't the "repne scas" idiom microcoded
> internally to implement the above (or better!) sequence using
> the inherent power of the on-chip SIMD execution engines?

According to discussions I've had with a well-known CPU architect,
hardware could indeed do a much better job, and afaik, the current 'Fast
String' support on some chips is a step in that direction.

The main problem is how to make it totally general, what with variously
cached/uncached memory buffers in particular.

> The same applies to repe cmp (that is, strcmp, memcmp)...

See above.

Terje
>
> Best regards
> Piotr Wyderski
>


--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Piotr Wyderski on
Terje Mathisen wrote:

> I assume this is measured starting with 16-byte aligned strings?

Yes, as I can align all the dynamically allocated strings to that
boundary with my custom memory allocator and enforce an
appropriate alignment of the static strings using language
extensions (#pragma, __attribute__ etc.).

> To handle arbitrary starting alignment you have to special-case the first
> block, probably by POR'ing in a 16*16 = 256-byte lookup table value, or you
> could use a single 31-byte table with a MOVUPS unaligned load?

Why such a complex solution? :-) In case of strlen()/memchr() all you need
is to check whether the string is aligned (test with 0x0f + a statically
predicted "not taken" branch to the unaligned mode handler) and if it is
not, then align it down manually, perform the comparison operation, mask
out the invalid "prefix bytes" and stop if found/follow the aligned path if
not found. This way you'll check the overlapping part two times, but who
cares? :-) The movups-based solution also has its merits in case of
short strings, but perhaps a generic instruction-based while loop will
be enough, since the unaligned part requires at most 15 iterations?
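The align-down approach described above can be sketched in C (hypothetical names; __builtin_ctz for bsf): align the pointer down to a 16-byte boundary, do one aligned compare, mask off the bits for the "prefix bytes" before the real start, then continue on the aligned path. Every load is aligned, so none can cross a page boundary:

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: align-down first block, mask out the invalid prefix bytes. */
size_t strlen_align_down(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    const char *p = (const char *)(addr & ~(uintptr_t)15); /* align down */
    unsigned prefix = (unsigned)(addr & 15);  /* bytes before s in block */
    const __m128i zero = _mm_setzero_si128();

    __m128i chunk = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
    mask &= ~0u << prefix;  /* drop matches in the prefix bytes */
    if (mask)
        return (size_t)__builtin_ctz(mask) - prefix;

    for (;;) {  /* aligned path */
        p += 16;
        chunk = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
    }
}
```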

> According to discussions I've had with a well-known CPU architect, hardware
> could indeed do a much better job, and afaik, the current 'Fast String'
> support on some chips is a step in that direction.

But is there any "fast string support"? Almost everywhere the rep family
is as slow as its direct replacement based on test/jnz. Only the block
copy/fill operations are optimized.

> The main problem is how to make it totally general

Have you seen the SSE4's pcmpestri instruction? It is pretty darn general.
:-)
Do they have a large array of 256 comparators to execute it in a single
cycle, or is it unrolled in time (which would mean 8+ cycles, hence quite
useless)?
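As an aside, pcmpistri can itself implement strlen: with an all-zero first operand every byte of it is "invalid", and EQUAL_EACH aggregation then flags exactly the bytes at or after the first NUL in the data, so the returned index is the index of that NUL (or 16 if none). A hedged sketch (hypothetical name; assumes a 16-byte-aligned string and a GCC/Clang-style target attribute for SSE4.2):

```c
#include <nmmintrin.h>  /* SSE4.2: pcmpistri */
#include <stddef.h>

/* Sketch only: assumes s is 16-byte aligned. */
__attribute__((target("sse4.2")))
size_t strlen_pcmpistri(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (const char *p = s; ; p += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16)  /* NUL found in this block */
            return (size_t)(p - s) + (size_t)idx;
    }
}
```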

Best regards
Piotr Wyderski

From: Terje Mathisen on
Piotr Wyderski wrote:
> Terje Mathisen wrote:
>> To handle arbitrary starting alignment you have to special-case the
>> first block, probably by POR'ing in a 16*16 = 256-byte lookup table
>> value, or you could use a single 31-byte table with a MOVUPS unaligned
>> load?
>
> Why such a complex solution? :-) In case of strlen()/memchr() all you need
> is to check whether the string is aligned (test with 0x0f + a statically
> predicted "not taken" branch to the unaligned mode handler) and if it is
> not, then align it down manually, perform the comparison operation, mask
> out the invalid "prefix bytes" and stop if found/follow the aligned path if
> not found. This way you'll check the overlapping part two times, but who
> cares? :-) The movups-based solution also has its merits in case of
> short strings, but perhaps a generic instruction-based while loop will
> be enough, since the unaligned part requires at most 15 iterations?

There is one very important situation that Alpha did consider:

If you have a misaligned string that ends less than 8/16 (Alpha/SSE)
bytes before the end of the last allocated page, then a full-width load
can fault by reading past the end of the page.

Alpha handled this by having an instruction which could convert the
lower three bits of an address into a mask, said mask to be used
together with an aligned load (which it also had: Load from 0xnnnn3 and
you would get the quadword starting at 0xnnnn0). This made sure that all
load operations were aligned, and therefore couldn't cross a page boundary.

With SSE you don't have a similarly fast way of turning a potentially
misaligned load into an aligned load + mask, and since average C string
lengths are just a few bytes (single digits, afair from reading some
paper 10+ years ago), the first block handling is _crucial_.
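The Alpha idea carries over to plain 64-bit C: use only aligned 8-byte loads (which can never cross a page boundary) and turn the low address bits into a byte mask for the first word. A hedged little-endian sketch, using the classic SWAR zero-byte test rather than any Alpha-specific instruction (names are made up):

```c
#include <stddef.h>
#include <stdint.h>

/* Classic SWAR test: nonzero iff v contains a zero byte. */
static inline uint64_t haszero(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Sketch: aligned 8-byte loads only, first word masked (little-endian). */
size_t strlen_aligned8(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    const uint64_t *p = (const uint64_t *)(addr & ~(uintptr_t)7);
    unsigned prefix = (unsigned)(addr & 7);

    /* Force the bytes before s to 0xFF so a stray 0 there is ignored. */
    uint64_t word = *p | ~(~0ULL << (8 * prefix));

    for (;;) {
        uint64_t z = haszero(word);
        if (z)  /* lowest set 0x80 bit marks the first zero byte */
            return (size_t)((const char *)p - s)
                 + (size_t)(__builtin_ctzll(z) >> 3);
        word = *++p;
    }
}
```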

> Have you seen the SSE4's pcmpestri instruction? It is pretty darn
> general. :-)

Sure.

In fact, it is so general that there are just two possible ways to
implement it:

a) Microcode, which will effectively expand to a _lot_ of code: handle
alignment, locate the shortest (implicit) string length input, do a
bunch of 16-byte loads from the two source addresses, mask the compare
if the shortest string ends anywhere inside, do the compare, then loop.

b) A _huge_ slice of hardware directly implementing all the logic of
(a), which would turn this into one of the CISC'iest non-OS instructions
ever.

The documentation I have doesn't specify enough detail, but my guess
would be that it simply returns the index of the first non-equal byte,
then it is up to the calling code to figure out which string is longer.

The C-style version is in some ways even more hairy, since it requires
two sets of comparators (but not three):

One 16-byte wide set which compares pairs of input bytes and another
which checks for zero bytes. The second can of course re-use the first
hw with one of the inputs forced to zero, and what with all the other
stuff going on, I don't believe the actual XOR+merge 8/16 bits part is
going to be the bottleneck.
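The two-comparator-set structure described above can be sketched with SSE2 intrinsics (hypothetical name; assumes both strings are 16-byte aligned): one pcmpeqb compares the two inputs byte-pair-wise, and the "second set" is the same hardware with one input forced to zero, finding terminators. The two masks are merged to locate the first byte that differs or ends a string:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sketch only: assumes a and b are both 16-byte aligned. */
int sse2_strcmp(const char *a, const char *b)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        /* bits set where the byte pairs differ */
        int diff = ~_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) & 0xFFFF;
        /* bits set where a has a NUL (second comparator set, input = 0) */
        int nul = _mm_movemask_epi8(_mm_cmpeq_epi8(va, zero));
        int mask = diff | nul;
        if (mask) {  /* first difference or end of string */
            int k = __builtin_ctz(mask);
            return (unsigned char)a[i + k] - (unsigned char)b[i + k];
        }
    }
}
```

If a is shorter, its NUL pairs with a non-NUL in b, so the diff mask already catches it; checking a alone for terminators is enough.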

> Do they have a large array of 256 comparators to execute it in a single
> cycle, or is it unrolled in time (which would mean 8+ cycles, hence quite
> useless)?

16 or 32 is enough, unless it actually works on 64-byte cache blocks, in
which case they also need one or two 64-byte barrel shifters to align
the strings with each other.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
Piotr Wyderski wrote:
> Terje Mathisen wrote:
> But is there any "fast string support"? Almost everywhere the rep family
> is as slow as its direct replacement based on test/jnz. Only the block
> copy/fill operations are optimized.
>
>> The main problem is how to make it totally general
>
> Have you seen the SSE4's pcmpestri instruction? It is pretty darn
> general. :-)
> Do they have a large array of 256 comparators to execute it in a single
> cycle, or is it unrolled in time (which would mean 8+ cycles, hence quite
> useless)?

OK, I've found some more docs about these soon-to-appear opcodes, and
you're right, it is pretty darn general, but not what I'd think of as a
"string" operation, more like a building block.

They don't need anything like 256 comparators though:

The way they've probably implemented it is to use the same 16 byte-wide
comparators needed by lots of existing opcodes, with one of the
immediate bits (which really should be considered part of the opcode,
resulting in 64 times as many opcodes, I guess) used to determine
whether pairs of byte operations should be joined into words.

Each of these comparators must be two-state (</=) or (=/>), so that the
two bits used for equal_any/range/equal_each/ordered can be used to
select the terms to merge.

What it does look like is an opcode that can only ever be used as part
of library code, or in the form of specific canned operations that are
output from a compiler.

I bet you could use this opcode plus a store and have a two-instruction
Turing-equivalent machine, using self-modification for all interesting
stuff. :-)

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"