From: rs1n on
I am currently rewriting some code that writes data to the display.
For anyone who has experience in timings of loops:

How much time overhead does a fairly short loop cost compared with
unrolling it? (I am trying to optimize some code for both speed and
size.) Also, is there any way to compare run times other than wrapping
each snippet of code in an even larger loop that reruns it enough
times to make a valid comparison?
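
For reference, the "wrap it in a larger loop" method can be sketched as follows. This is a generic Python illustration of the technique, not calculator code; the names per_call_seconds and snippet are made up for the example. It times an empty loop first so that the loop's own overhead can be subtracted.

```python
import time

def per_call_seconds(snippet, repeats=100_000):
    """Average time of one snippet() call, minus the empty-loop overhead."""
    # Time the bare loop so its cost can be subtracted afterwards.
    start = time.perf_counter()
    for _ in range(repeats):
        pass
    baseline = time.perf_counter() - start

    # Time the same loop with the snippet inside it.
    start = time.perf_counter()
    for _ in range(repeats):
        snippet()
    elapsed = time.perf_counter() - start

    return (elapsed - baseline) / repeats

if __name__ == "__main__":
    print(per_call_seconds(lambda: sum(range(10))))
```

The result is only as good as the timer resolution, so the repeat count has to be large enough that the total run dwarfs one timer tick.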

Take for example the following code, which renders one character on
the screen (first as straight-line code, then as a loop):

Entry:
D1 -> current screen position
A[6] = 6 nibbles, each nibble containing row data of a char (FNT1 data)

   P= 1-1
   DAT1=A P
   D1=D1+ 16
   D1=D1+ 16
   D1=D1+ 2
   P= 2-1
   DAT1=A P
   D1=D1+ 16
   D1=D1+ 16
   D1=D1+ 2
   .
   .
   .
   P= 6-1
   DAT1=A P
   P= 0

versus


   P= 15-6
-  DAT1=A 0
   D1=D1+ 16
   D1=D1+ 16
   D1=D1+ 2
   ASR W
   P=P+1
   GONC -
From: Raymond Del Tondo on
Hi Han,

you could also simply add up cycle times,
which can be found in SASM.DOC;-)

HTH

Raymond
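
Adding up cycle times amounts to multiplying each instruction's cycle count by how often it executes. A minimal Python sketch of that bookkeeping follows. The cost table is a placeholder with every instruction set to 1 cycle, NOT values from SASM.DOC, so with it the totals are simply the number of instructions executed. The function names are made up for the example, and the instruction counts are read off the two listings in the original post.

```python
def total_cycles(executed, cycles):
    """Sum count * cycles over every instruction that gets executed."""
    return sum(count * cycles[name] for name, count in executed.items())

def unrolled(rows):
    # Straight-line version: one P= and one write per row, three D1
    # increments between consecutive rows, plus the final P= 0.
    return {"P= n": rows + 1, "DAT1=A P": rows, "D1=D1+ n": 3 * (rows - 1)}

def looped(passes):
    # Loop version: one P= for setup, then the loop body once per pass.
    return {"P= n": 1, "DAT1=A n": passes, "D1=D1+ n": 3 * passes,
            "ASR W": passes, "P=P+1": passes,
            "GONC taken": passes - 1, "GONC not taken": 1}

# Placeholder costs (assumption): replace each 1 with the real cycle count.
PLACEHOLDER = {name: 1 for name in (
    "P= n", "DAT1=A P", "DAT1=A n", "D1=D1+ n", "ASR W", "P=P+1",
    "GONC taken", "GONC not taken")}

if __name__ == "__main__":
    print(total_cycles(unrolled(6), PLACEHOLDER))  # 28 instructions executed
    print(total_cycles(looped(6), PLACEHOLDER))    # 43 instructions executed
```

Keeping "GONC taken" and "GONC not taken" as separate entries allows for the two cases having different costs.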


From: Han on
Hmm, I'll look into that.

On Dec 8, 11:10 pm, "Raymond Del Tondo" <Ih8...(a)nowhere.com> wrote:
> Hi Han,
>
> you could also simply add up cycle times,
> which can be found in SASM.DOC;-)
>
> HTH
>
> Raymond

From: Dave Hayden on
On Dec 8, 11:17 pm, rs1n <handuongs...(a)gmail.com> wrote:
> I am currently rewriting some code that writes data to the display.
> For anyone who has experience in timings of loops:
>

I'm not sure that Raymond's suggestion to add up cycle times will
work. After all, the Saturn is emulated on an ARM processor now, so
the cycle times are probably no longer accurate.

This might be a case where you'd be better off dropping into ARM
assembly code. There is an example of how to do this in the 50G
Advanced Users' Reference. It's a little tricky because you usually
have to move the ARM code onto a 4-byte boundary before executing it.
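
The boundary arithmetic itself is simple. Here is a small Python sketch of rounding an address up to the next 4-byte boundary. It assumes addresses counted in bytes and is not the AUR's code; align_up is a name made up for the example.

```python
def align_up(address, boundary=4):
    """Smallest multiple of boundary that is >= address.

    boundary must be a power of two for the mask trick to work.
    """
    return (address + boundary - 1) & ~(boundary - 1)

if __name__ == "__main__":
    print(hex(align_up(0x1001)))  # 0x1004
    print(hex(align_up(0x1004)))  # 0x1004 (already aligned)
```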

Also, although the example in the AUR doesn't show it, I worry that
you may need to flush the cache after moving the code.

Good luck with it!
Dave
From: Yann on
In your case, I would use the loop without hesitation, both for
compact code and for maintenance reasons (easier modifications later
if needed).

Operations on P are very fast, to the point of being unnoticeable.
Tests on the carry are also cheap (especially when no jump occurs, in
which case they are almost free).

The only minor change I would suggest to your version is getting rid
of the operation on the full W field of the register, which takes
substantially longer.

If you could store your data in A[6] in reverse order, the loop would
be more straightforward:

   P= 6-1
-  DAT1=A P
   D1=D1+ 16
   D1=D1+ 16
   D1=D1+ 2
   P=P-1
   GONC -
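
The number of passes such a P-controlled loop makes can be checked with a small Python model. It rests on an assumption about the carry behaviour: P is a 4-bit register, P=P-1 sets carry when it wraps from 0 to 15, P=P+1 sets carry when it wraps from 15 to 0, and GONC loops back only while carry is clear. The function names are made up for the example.

```python
def passes_counting_down(p_start):
    """Passes of: body; P=P-1; GONC loop. Returns (passes, final P)."""
    p, passes = p_start, 0
    while True:
        passes += 1          # loop body runs once per pass
        carry = (p == 0)     # decrementing 0 wraps around and sets carry
        p = (p - 1) % 16     # P holds a single nibble
        if carry:            # GONC falls through once carry is set
            return passes, p

def passes_counting_up(p_start):
    """Passes of: body; P=P+1; GONC loop. Returns (passes, final P)."""
    p, passes = p_start, 0
    while True:
        passes += 1
        carry = (p == 15)    # incrementing 15 wraps around and sets carry
        p = (p + 1) % 16
        if carry:
            return passes, p

if __name__ == "__main__":
    print(passes_counting_down(6 - 1))  # (6, 15)
    print(passes_counting_up(15 - 6))   # (7, 0)
    print(passes_counting_up(16 - 6))   # (6, 0)
```

If the model is right, the count-down loop makes exactly 6 passes with P running from 5 down to 0, and leaves P at 15, so a trailing P= 0 as in the straight-line version would still be needed. A count-up loop started at 15-6 would make 7 passes. Starting it at 16-6 gives 6.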