From: Colin Paul Gloster on
Hello,

I have been improving a program by suppressing C++ in it. After
speeding it up a lot by making changes, I have found one considerable
part which calls a library routine, which is unfortunately very slow
as provided as standard with a number of Ada compilers but which is
very fast with implementations of other languages. For some Ada
compilers perhaps I shall simply need to not rely on an implementation
of this routine by an Ada vendor. Perhaps this post will motivate Ada
vendors to speed up their implementations or for people to report
timings for efficient implementations which are used by default by
other Ada compilers.

At present the routine (which is still being called by a C++ part of
the code) takes up maybe approximately one per cent of the running
time. (I have profiled it but I do not have the timing profiles
available where I am typing this.) This might not sound as if it is
much and it is certainly much less than much of the C++ overhead which
I have removed, but it is trivial to speed it up and speeding up this
one per cent or so would reduce running time by hours.

To contrast different implementations, I have produced the C++ program
#include<iostream>
#include<cmath>
int main()
{
double answer = 0.0;
for(int I=1; I<=1000000; ++I)
{
#include"body_in_CPlusPlus.cc"
}
std::cout << answer;
}
where body_in_CPlusPlus.cc contained
answer += std::log10(0.10000000000000000000
);
/*Lines produced by
for((i=1; i<=500; ++i)) do echo -n 'answer += std::log10(0' ; echo "$i/10" | bc -l ; echo ');'; done
*/
answer += std::log10(050.00000000000000000000
);
and the Ada program
with Ada.Numerics.Generic_Elementary_Functions;
with Interfaces.C;
with Ada.Text_IO;
procedure Logarithmic_Work_In_Ada is
answer : Interfaces.C.double := 0.0;
package double_library is new Ada.Numerics.Generic_Elementary_Functions(Interfaces.C.double);
package double_output_library is new Ada.Text_IO.Float_IO(Interfaces.C.double);
begin

for I in 1 .. 1_000_000 loop --I would not have this loop but one compiler crashed if
--the source file was sized a few megabytes.

answer := Interfaces.C."+"(answer, double_library.log(0.10000000000000000000
, 10.0 ));
--Lines produced by
--for((i=1; i<=500; ++i)) do echo -n 'answer := Interfaces.C."+"(answer, double_library.log(0' ; echo "$i/10" | bc -l ; echo ', 10.0 ));'; done
answer := Interfaces.C."+"(answer, double_library.log(050.00000000000000000000
, 10.0 ));


end loop;

double_output_library.Put(answer);
end;

The aforementioned code which I have been speeding up did not consist
of merely 500 million calls to std::log10() as consecutive calls, but
instead it consisted of hundreds of times as many calls to
std::log10() interspersed across maybe 200 other functions.

Of the two programs shown, the fastest C++ implementation on one test
platform took less than one millisecond and the fastest Ada
implementation took one minute and 31 seconds and 874 milliseconds on
the same platform. Both g++ and gnatmake were from the same
installation of GCC 4.1.2 20080704 (Red Hat 4.1.2-44).

g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_with_-ffast-math

time ./logarithmic_work_in_CPlusPlus_with_-ffast-math
6.34086e+08
real 0m0.005s
user 0m0.000s
sys 0m0.000s


g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus

time ./logarithmic_work_in_CPlusPlus
6.34086e+08
real 0m46.443s
user 0m46.435s
sys 0m0.000s


gnatmake -O3 -ffast-math logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-math
(which did
gcc -c -O3 -ffast-math logarithmic_work_in_Ada.adb
logarithmic_work_in_Ada.adb:4:11: warning: file name does not match unit name, should be "logarithmic_work_in_ada.adb"
gnatbind -x logarithmic_work_in_Ada.ali
gnatlink logarithmic_work_in_Ada.ali -ffast-math -o logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-mat
whereas trying -cargs resulted in no compilation)

time ./logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-math
6.34086408536266E+08

real 1m31.879s
user 1m31.874s
sys 0m0.000s



gnatmake -O3 logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_GNAT

time ./logarithmic_work_in_Ada_compiled_by_GNAT
6.34086408536266E+08

real 1m33.338s
user 1m33.338s
sys 0m0.000s


proprietary_compiler -O -m logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_a_proprietary_compiler

time ./logarithmic_work_in_Ada_compiled_by_a_proprietary_compiler
6.34086408605593E+08
real 1m41.966s
user 1m41.954s
sys 0m0.000s
From: Colin Paul Gloster on
On Mon, 15 Feb 2010, John B. Matthews wrote:

|--------------------------------------------------------------------|
|"In article |
|<alpine.LNX.2.00.1002151055530.17315(a)Bluewhite64.example.net>, |
| Colin Paul Gloster <Colin_Paul_Gloster(a)ACM.org> wrote: |
| |
|> gnatmake -O3 logarithmic_work_in_Ada.adb -o |
|> logarithmic_work_in_Ada_compiled_by_GNAT |
|> |
|> time ./logarithmic_work_in_Ada_compiled_by_GNAT |
|> 6.34086408536266E+08 |
|> |
|> real 1m33.338s |
|> user 1m33.338s |
|> sys 0m0.000s |
| |
|I get a different answer: 698970 = 1000000 * (log10(50) - 1). |
| |
|$ make clean logada ; time ./logada |
|rm -f *.o *.ali b~* core logada |
|gnatmake logada -cargs -O3 -gnatwa -bargs -shared -largs -dead_strip|
|gcc -c -O3 -gnatwa logada.adb |
|gnatbind -shared -x logada.ali |
|gnatlink logada.ali -shared-libgcc -dead_strip |
| 6.98970004334243E+05 |
| |
|real 0m0.138s |
|user 0m0.136s |
|sys 0m0.002s |
| |
|-- |
|John B. Matthews |
|trashgod at gmail dot com |
|<http://sites.google.com/site/drjohnbmatthews> " |
|--------------------------------------------------------------------|

Hi,

Thanks for running it. The Ada program for timing had 500 statements
in the body of the loop. I reproduced only the first and last
verbatim: I showed a bash (Bourne Again SHell) line for producing
all 500 statements in a comment in-between the the first statement and
the last statement. You ran a program with 498 of the statements
missing.

Anyway, the answer produced by the program is not so much of concern
as the relative speeds of different implementations. Did g++ produce
a faster result for you than GNAT? It did for me for many versions of
GCC today on a different platform than I used in the beginning of this
thread...


g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3_with_-ffast-math

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3_with_-ffast-math
6.34086e+08
real 0m0.012s
user 0m0.008s
sys 0m0.004s



g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2_with_-ffast-math

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2_with_-ffast-math
6.34086e+08
real 0m0.012s
user 0m0.012s
sys 0m0.000s



g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2
6.34086e+08
real 0m0.567s
user 0m0.560s
sys 0m0.000s



g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3
6.34086e+08
real 0m0.566s
user 0m0.564s
sys 0m0.004s



g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.3

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.3
6.34086e+08
real 0m0.583s
user 0m0.572s
sys 0m0.004s



gnatmake -O3 -ffast-math Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3_with_-ffast-math

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3_with_-ffast-math
6.34086408536266E+08

real 0m31.618s
user 0m31.618s
sys 0m0.000s




gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3
6.34086408606382E+08

real 0m32.750s
user 0m32.746s
sys 0m0.004s


gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2
6.34086408606382E+08

real 0m33.537s
user 0m33.506s
sys 0m0.004s



gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.3

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.3
6.34086408606382E+08

real 0m33.717s
user 0m33.718s
sys 0m0.000s



g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.2.4

time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.2.4
6.34086e+08
real 0m34.000s
user 0m33.982s
sys 0m0.000s



gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.2.4

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.2.4
6.34086408606382E+08

real 0m34.037s
user 0m34.034s
sys 0m0.004s



gnatmake -O3 -ffast-math Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2_with_-ffast-math

time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2_with_-ffast-math
6.34086408536266E+08

real 0m35.093s
user 0m35.090s
sys 0m0.004s
From: Jeffrey R. Carter on
Colin Paul Gloster wrote:
> with Ada.Numerics.Generic_Elementary_Functions;
> with Interfaces.C;
> with Ada.Text_IO;
> procedure Logarithmic_Work_In_Ada is
> answer : Interfaces.C.double := 0.0;
> package double_library is new
> Ada.Numerics.Generic_Elementary_Functions(Interfaces.C.double);
> package double_output_library is new
> Ada.Text_IO.Float_IO(Interfaces.C.double);
> begin
>
> for I in 1 .. 1_000_000 loop --I would not have this loop but one
> compiler crashed if
> --the source file was sized a few
> megabytes.
>
> answer := Interfaces.C."+"(answer,
> double_library.log(0.10000000000000000000
> , 10.0 ));
> --Lines produced by
> --for((i=1; i<=500; ++i)) do echo -n 'answer := Interfaces.C."+"(answer,
> double_library.log(0' ; echo "$i/10" | bc -l ; echo ', 10.0 ));'; done
> answer := Interfaces.C."+"(answer,
> double_library.log(050.00000000000000000000
> , 10.0 ));
>
>
> end loop;
>
> double_output_library.Put(answer);
> end;

IIUC, your Ada program is equivalent to

with Ada.Numerics.Generic_Elementary_Functions;
with Ada.Text_IO;

with Interfaces.C;

procedure Optimization_Test is
subtype Double is Interfaces.C.Double;
use type Double;

package Math is new Ada.Numerics.Generic_Elementary_Functions (Double);

Answer : Double := 0.0;
begin -- Optimization_Test
for I in 1 .. 1_000_000 loop
Answer := Answer + Math.Log (0.1, 10.0);

for J in 1 .. 500 loop
Answer := Answer + Math.Log (Double (J) / 10.0, 10.0);
end loop;

Answer := Answer + Math.Log (50.0, 10.0);
end loop;

Ada.Text_IO.Put (Answer'Img);
end Optimization_Test;

On my machine I get something like

jrcarter(a)jrcarter-acer-laptop:~/Code$ gnatmake -O2 -gnatnp optimization_test.adb
gcc-4.3 -c -O2 -gnatnp optimization_test.adb
gnatbind -x optimization_test.ali
gnatlink optimization_test.ali
jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test
6.34785378608078E+08

real 1m33.817s
user 1m33.810s
sys 0m0.000s

Note that suppressing runtime checks (-gnatp) is needed to be sort of equivalent
to C++.

A good compiler could recognize that this adds a constant amount to Answer each
time through the outer loop, and the outer loop is run a static constant number
of times, and might optimize it to something like

with Ada.Numerics.Generic_Elementary_Functions;
with Ada.Text_IO;

with Interfaces.C;

procedure Optimization_Test is
subtype Double is Interfaces.C.Double;
use type Double;

package Math is new Ada.Numerics.Generic_Elementary_Functions (Double);

function Inner_Loop return Double is
Result : Double := 0.0;
begin -- Inner_Loop
for J in 1 .. 500 loop
Result := Result + Math.Log (Double (J) / 10.0, 10.0);
end loop;

return Result;
end Inner_Loop;
pragma Inline (Inner_Loop);

Answer : constant Double :=
1.0E6 * (Math.Log (0.1, 10.0) + Inner_Loop + Math.Log (50.0, 10.0) );
begin -- Optimization_Test
Ada.Text_IO.Put (Answer'Img);
end Optimization_Test;

This, on my machine and with the same options, gives something like

jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test 6.34785378539470E+08

real 0m0.009s
user 0m0.000s
sys 0m0.010s

I suspect this optimization is what the C++ compiler is doing, and the Ada
compilers are not.

Whether a similar optimization may be applied to the calculations in your real
program of interest is another question.

--
Jeff Carter
"Friends don't let friends program in C++."
Ludovic Brenta
114
From: Colin Paul Gloster on
On Tue, 16 Feb 2010, Colin Paul Gloster alleged:

|-------------------------------------------------------------------------------------------------|
|"[..] |
| |
|time ./Optimization_Test_with_many_assignments_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math |
| 6.34785378608078E+08 |
| |
|real 0m34.319s |
|user 0m34.318s |
|sys 0m0.004s |
|here. |
| |
|[..] |
| |
|time ./Optimization_Test_with_a_single_assignment_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math|
| 6.34785378539470E+08 |
| |
|real 0m0.002s |
|user 0m0.000s |
|sys 0m0.008s |
|here. |
| |
|[..]" |
|-------------------------------------------------------------------------------------------------|

I apologize, those two runs were actually with GCC4.2.4.
From: Colin Paul Gloster on
On Mon, 15 Feb 2010, Jeffrey R. Carter sent:

|--------------------------------------------------------------------------------|
|"[..] |
| |
|IIUC, your Ada program is equivalent to |
| |
|with Ada.Numerics.Generic_Elementary_Functions; |
|with Ada.Text_IO; |
| |
|with Interfaces.C; |
| |
|procedure Optimization_Test is |
| subtype Double is Interfaces.C.Double; |
| use type Double; |
| |
| package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); |
| |
| Answer : Double := 0.0; |
|begin -- Optimization_Test |
| for I in 1 .. 1_000_000 loop |
| Answer := Answer + Math.Log (0.1, 10.0); |
| |
| for J in 1 .. 500 loop |
| Answer := Answer + Math.Log (Double (J) / 10.0, 10.0);" |
|--------------------------------------------------------------------------------|

Aside from not encouraging division instructions when not necessary,
as I mentioned on
HTTP://Programmer.97Things.OReilly.com/wiki/index.php/Execution_Speed_versus_Maintenance_Effort
, I wanted to not have too high overhead obscuring the focus of
measuring speeds of log implementations.

|--------------------------------------------------------------------------------|
|" end loop; |
| |
| Answer := Answer + Math.Log (50.0, 10.0); |
| end loop; |
| |
| Ada.Text_IO.Put (Answer'Img); |
|end Optimization_Test; |
| |
|On my machine I get something like |
| |
|jrcarter(a)jrcarter-acer-laptop:~/Code$ gnatmake -O2 -gnatnp optimization_test.adb|
|gcc-4.3 -c -O2 -gnatnp optimization_test.adb |
|gnatbind -x optimization_test.ali |
|gnatlink optimization_test.ali |
|jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test |
| 6.34785378608078E+08 |
| |
|real 1m33.817s |
|user 1m33.810s |
|sys 0m0.000s" |
|--------------------------------------------------------------------------------|

time ./Optimization_Test_with_many_assignments_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math
6.34785378608078E+08

real 0m34.319s
user 0m34.318s
sys 0m0.004s
here.

|---------------------------------------------------------------------------------|
|"Note that suppressing runtime checks (-gnatp) is needed to be sort of equivalent|
|to C++." |
|---------------------------------------------------------------------------------|

Thanks for the tip, but I do not program in Ada to really program in
C++ with Ada syntax.

|--------------------------------------------------------------------------------|
|"A good compiler could recognize that this adds a constant amount to Answer each|
|time through the outer loop, and the outer loop is run a static constant number |
|of times, and might optimize it to something like |
| |
|with Ada.Numerics.Generic_Elementary_Functions; |
|with Ada.Text_IO; |
| |
|with Interfaces.C; |
| |
|procedure Optimization_Test is |
| subtype Double is Interfaces.C.Double; |
| use type Double; |
| |
| package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); |
| |
| function Inner_Loop return Double is |
| Result : Double := 0.0; |
| begin -- Inner_Loop |
| for J in 1 .. 500 loop |
| Result := Result + Math.Log (Double (J) / 10.0, 10.0); |
| end loop; |
| |
| return Result; |
| end Inner_Loop; |
| pragma Inline (Inner_Loop); |
| |
| Answer : constant Double := |
| 1.0E6 * (Math.Log (0.1, 10.0) + Inner_Loop + Math.Log (50.0, 10.0) ); |
|begin -- Optimization_Test |
| Ada.Text_IO.Put (Answer'Img); |
|end Optimization_Test; |
| |
|This, on my machine and with the same options, gives something like |
| |
|jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test |
|6.34785378539470E+08 |
| |
|real 0m0.009s |
|user 0m0.000s |
|sys 0m0.010s" |
|--------------------------------------------------------------------------------|

time ./Optimization_Test_with_a_single_assignment_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math
6.34785378539470E+08

real 0m0.002s
user 0m0.000s
sys 0m0.008s
here.

|--------------------------------------------------------------------------------|
|"I suspect this optimization is what the C++ compiler is doing, and the Ada |
|compilers are not." |
|--------------------------------------------------------------------------------|

Actually, a C++ compiler optimized out calls to std::log10().

|--------------------------------------------------------------------------------|
|"Whether a similar optimization may be applied to the calculations in your real |
|program of interest is another question." |
|--------------------------------------------------------------------------------|

Such an optimization can not help the log10 calls like that in the
real code because they are spread out in conditional statements
between hundreds of other statements instead of being an analytically
simple sequence of millions of calls to log, as per the following...
G4double G4LogLogInterpolation::Calculate(G4double x, G4int bin,
const G4DataVector& points,
const G4DataVector& data) const
{
G4int nBins = data.size() - 1;
G4double value = 0.;
if (x < points[0])
{
value = 0.;
}
else if (bin < nBins)
{
/*..*/
value = (std::log10(d1)*std::log10(e2/x) + std::log10(d2)*std::log10(x/e1)) /*..*/
/*..*/
}
and
G4double G4SemiLogInterpolation::Calculate(G4double x, G4int bin,
const G4DataVector& points,
const G4DataVector& data) const
{
G4int nBins = data.size() - 1; /*C++ programmers have a way of
replicating statements across functions.*/
G4double value = 0.;
if (x < points[0])
{
value = 0.;
}
else if (bin < nBins)
{
/*..*/
value = (d1*std::log10(e2/x) + d2*std::log10(x/e1)) /*..*/
/*..*/
}
and
inline
size_t G4PhysicsLogVector::FindBinLocation(G4double theEnergy) const
{
// For G4PhysicsLogVector, FindBinLocation is implemented using
// a simple arithmetic calculation.
//
// Because this is a virtual function, it is accessed through a
// pointer to the G4PhyiscsVector object for most usages. In this
// case, 'inline' will not be invoked. However, there is a possibility
// that the user access to the G4PhysicsLogVector object directly and
// not through pointers or references. In this case, the 'inline' will
// be invoked. (See R.B.Murray, "C++ Strategies and Tactics", Chap.6.6)

return size_t( std::log10(theEnergy)//*Ugh, this division is unnecessary.*/dBin - baseBin );
}
and
G4double G4ProtonInelasticCrossSection::
GetCrossSection(G4double kineticEnergy, G4double atomicNumber, G4double nOfProtons)
{
/*..*/
if(nOfNeutrons>1.5) fac2=std::log((nOfNeutrons));
/*..*/
}
and
G4double G4LinLogLogInterpolation::Calculate(G4double x, G4int bin,
const G4DataVector& points,
const G4DataVector& data) const
{
G4int nBins = data.size() - 1;
G4double value = 0.;
if (x < points[0])
{
value = 0.;
}
else if (bin < nBins)
{
/*..*/ /*C++ programmers have a way of replicating statements across functions...*/
if(d1 > 0.0 && d2 > 0.0) { /*... and not remembering to include
checks for mathematical
well-definededness in each code
clone nor indeed for each call to
std::log10() in each statement
guarded by two checks for
mathematical well-definededness.*/
value = (std::log10(d1)*std::log10(e2/x) + std::log10(d2)*std::log10(x/e1)) /*..*/
/*..*/
}