From: Robert Myers on
On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:

> Any problem that is not I/O bound that can be solved using a
> non-imperative language can be made to work faster in an imperative
> language with the appropriate library (i.e. if you can write it in ML,
> it can be re-written in C-as-high-level-assembly, and it will perform
> better).
>
> The questions that arise are:
> - how much speedup does the C version get?
> - how much more effort is it to write the C version?
> - how many machines will this version need to support, and how much
> rewrite is required to support all these machines?

If your original claim is correct even for one machine (and the truth
of your claim is not obvious to me, given the lengthy discussion here
about memory models), does it follow that the same is possible for a
second, or a third, machine? If the cost of the rewrite is bounded for
one machine (that is to say, you eventually succeed in getting the
program to work correctly--or at least you *think* you got it to work
correctly), is it bounded for a second or a third machine?

I'll accept that Nick is an expert at finding and fixing weird memory
problems. From reading his posts, I'm convinced that the memory model
of C is not well-defined or even agreed upon in any practical sense.
Thus, you have to worry not only about whether your program will work
on a second machine, but even about whether it will work with a second
compiler that has a different notion of critical details of C.

So, even if you could do such a rewrite once, whether or not you had a
correct translation in any other circumstance would always be an open
question.

From an economic point of view, the proposed trade doesn't even seem
rational: trade a program with transparently correct memory semantics
(perhaps because there are no memory semantics to be incorrect) for
one that is faster but may or may not do the same thing under some
limited set of circumstances.

From the POV of scientific computing, where reproducibility is
essential, the idea that you're never quite sure whether the software
will work correctly on two different machines or under two different
compilers seems completely unacceptable.

Perhaps, on the other hand, because the world of finance is
essentially a casino, it's a plausible proposal for a manufacturer of
business machines.

Robert.

From: Gavin Scott on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
> But that's PERFORMANCE tuning.

Speaking of the P word, offhand I can only think of one
interesting front in the battle for performance these days,
that of executing JavaScript inside web browsers.

There are at least three groups actively competing to produce the
fastest JavaScript execution engine, based on the idea that this
is going to be a key part of success in the upcoming "web os" age.

It might be interesting to look at some of that work and their
specific challenges as a counterpoint to the usual "we can't do
it because C doesn't let us" arguments. It's a different language
and environment which is having real money and brainpower thrown
at it.

G.
From: Bernd Paysan on
Gavin Scott wrote:
> There are at least three groups actively competing to produce the
> fastest JavaScript execution engine, based on the idea that this
> is going to be a key part of success in the upcoming "web os" age.
>
> It might be interesting to look at some of that work and their
> specific challenges as a counterpoint to the usual "we can't do
> it because C doesn't let us" arguments. It's a different language
> and environment which is having real money and brainpower thrown
> at it.

The tragedy is that it was never designed for that job. It just
happened. The consequence of all these accidental pieces growing
together as the "next OS" is a nightmare. We already have enough bad
operating systems, bad programming languages, bad protocols, and bad
user interfaces, but apparently, the direction is straight downhill.

Fefe recently gave a talk about bad APIs at the 26C3; if you understand
German, you can see the talk on CCC-TV:

http://media.ccc.de/browse/congress/2009/26c3-3691-de-vier_fuste_fr_ein_halleluja.html

JavaScript is not even part of that presentation; he got lost in the
layers below ;-).

Making JavaScript fast is actually not really a challenge. After all,
JavaScript is a Lisp derivative in disguise, so the compiler technology
has to be a mix of a fast Lisp-style compiler (OCaml or the like) and
Forth (the latter because compiler speed matters, so it has to be an
incremental compiler). And actually, that's what those people above do
(at least two use Forth as part of their JavaScript engine).
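
To make the Forth comparison concrete, here's a toy sketch in plain C
(my illustration, nothing like a real engine): the "compiler" appends a
primitive to a code buffer as each word of the source arrives, and
execution is just a walk over that buffer. A real engine layers type
specialization, inline caches, etc. on top of something like this.

#include <stdio.h>

/* A toy stack machine: each primitive operates on a small data stack. */
typedef void (*prim_t)(void);

static int stack[64], sp = 0;
static void push1(void) { stack[sp++] = 1; }
static void push2(void) { stack[sp++] = 2; }
static void add(void)   { sp--; stack[sp-1] += stack[sp]; }
static void print(void) { printf("%d\n", stack[--sp]); }

/* "Incremental compilation": as the source is scanned, append the
   primitive for each word to the code buffer; no separate pass. */
static prim_t code[256];
static int here = 0;
static void compile(prim_t p) { code[here++] = p; }

/* Execution is just a walk over the threaded code. */
static void run(void)
{
    int i;
    for (i = 0; i < here; i++)
        code[i]();
}

int main(void)
{
    /* Compiled as the input "1 2 + print" arrives, then run at once. */
    compile(push1); compile(push2); compile(add); compile(print);
    run();
    return 0;
}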

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
> In article <4B3BFDF9.2070607(a)patten-glew.net>,
> Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>> Herlihy's non-blocking lock free algorithms using compare-and-swap
>> depend on there being a single pointer that dominates the data structure
>> you are changing.
>
> One of the many reasons that so many people feel that a double form
> of compare-and-swap is essential.

I've read Greenwald's thesis, and I am sympathetic. In fact, I have
proposed implementations for DCAS, twice at Intel and once at AMD.
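
For reference, the semantics under discussion, written out as ordinary
C (a real DCAS would, of course, execute the whole body as one atomic
step; this is a specification sketch, not an implementation):

/* Atomically: if both locations hold their expected values, update
   both; otherwise change neither. */
int dcas(void **a1, void *old1, void *new1,
         void **a2, void *old2, void *new2)
{
    /* --- all of this happens as one atomic step --- */
    if (*a1 == old1 && *a2 == old2) {
        *a1 = new1;
        *a2 = new2;
        return 1;   /* both locations updated */
    }
    return 0;       /* neither location changed */
}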

However

1) It doesn't meet all needs - witness the folks who ask for KCSS
(k-compare single-swap) and Transactional Memory with arbitrary
numbers of locations involved.

2) It doesn't seem to be necessary in a theoretical sense:

"DCAS is not a silver bullet for nonblocking algorithm design". 16th
annual ACM symposium on Parallelism in algorithms and architectures,
2004, pp. 216�224. Doherty et al, inter alia Mark Moir and Guy Steele.

http://en.wikipedia.org/wiki/Double_compare-and-swap: "More recently,
however, it has been shown that an STM can be implemented with
comparable properties using only CAS."


>> Whereas with a subset of linked datastructure that does not have a
>> single dominator, you can't necessarily copy back in a changed
>> datastructure, with nodes that have been added or deleted.
>
> No. You have to add other constraints on what can be done. Not
> impossible, but not easy, either.

That's the clincher. Algorithms and datastructures that are efficient
non-parallel are not necessarily efficient in parallel.

In a non-parallel code you can operate on an arbitrary sub-mesh of a
linked datastructure, with arbitrary numbers of links in and out. You
can operate in place, replace it, etc.

Whereas in parallel you either need complicated locking, or, if you
want to do it lock-free, you have to change the datastructure to have
only a single pointer entry point, or maybe 2 with DCAS. Or maybe K
with KCSS. Or maybe ... well, TM (Transactional Memory) doesn't really
give you an arbitrary number of locations. Maybe an L1 cache full of
locations. Maybe. (OK, some forms of TM fall back to locking, e.g. a
single global lock, when the L1 cache - or whatever structure they are
tracking the transaction footprint in - overflows. De facto, that means
you don't want to go larger than some fraction of the L1 cache, reduced
by a factor related to the load.)
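
As an illustration of the single-pointer-entry-point case: the classic
lock-free stack push, sketched here with the GCC __sync builtin (my
sketch, not anyone's production code):

struct node { struct node *next; int payload; };

/* The whole structure is reachable through one pointer, so a single
   CAS on that pointer is enough to publish a new node atomically. */
static struct node * volatile top;

void push(struct node *n)
{
    struct node *old;
    do {
        old = top;        /* snapshot the single entry pointer */
        n->next = old;    /* link the new node in front of it */
    } while (!__sync_bool_compare_and_swap(&top, old, n));
}

/* pop() needs the same retry loop plus ABA protection (tagged
   pointers or hazard pointers); omitted to keep the sketch short. */

Try the same thing on a sub-mesh with several links in and out and the
single CAS no longer covers everything that has to change together -
which is the point above.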

(Which would you rather have: a TM implementation that is limited to
an L1 cache's worth of data lines, for *all* of the data being
manipulated? Or a KCSS where K was an L1 cache's worth - where you
could use KCSS to handle all of the links into and out of the sub-mesh,
but where you could access an arbitrary amount of data within the
sub-mesh?)

===

The thing I don't like about DCAS or KCSS or TM is that most of the
hardware proposals depend on having a cache coherence protocol.

As explained in other posts, of late I have been thinking in terms of
supercomputers without a cache coherence protocol.

Whereas a single location CAS (of any size up to the interleave or bank
width) can be implemented in a machine with no caches - by letting the
memory interface implement it.

DCAS or KCSS or TM, in a system without caches, requires the ability to
lock memory locations. The locked set can be arbitrarily large if the
locking is done via a memory tag bit, e.g. stolen from ECC. But many
people want to avoid that, and avoiding it introduces an arbitrary
limit on the number of locations involved. There are also the usual
deadlock issues. Those can be resolved, e.g. by physical address
sorting. (The unwillingness to provide hardware to do the physical
address sorting efficiently is why DCAS is not in the x86 instruction
set.)
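
The address-sorting trick is also what you do in software if you fake
DCAS with per-location locks - a sketch (my code; it assumes the two
locations are distinct and each maps to its own lock):

#include <pthread.h>
#include <stdint.h>

/* Take the two per-location locks in address order, so two threads
   doing overlapping operations can never wait on each other in a
   cycle. Hardware proposals sort by physical address; here we sort
   by the (virtual) addresses of the locations themselves. */
int dcas_locked(void **a1, void *old1, void *new1,
                void **a2, void *old2, void *new2,
                pthread_mutex_t *lock1, pthread_mutex_t *lock2)
{
    pthread_mutex_t *first = lock1, *second = lock2;
    int ok;

    if ((uintptr_t)a2 < (uintptr_t)a1) {
        first = lock2;
        second = lock1;
    }

    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
    ok = (*a1 == old1 && *a2 == old2);
    if (ok) {
        *a1 = new1;
        *a2 = new2;
    }
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
    return ok;        /* 1: both updated, 0: neither changed */
}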

From: "Andy "Krazy" Glew" on
Gavin Scott wrote:
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
>> But that's PERFORMANCE tuning.
>
> Speaking of the P word, offhand I can only think of one
> interesting front in the battle for performance these days,
> that of executing JavaScript inside web browsers.
>
> There are at least three groups actively competing to produce the
> fastest JavaScript execution engine, based on the idea that this
> is going to be a key part of success in the upcoming "web os" age.
>
> It might be interesting to look at some of that work and their
> specific challenges as a counterpoint to the usual "we can't do
> it because C doesn't let us" arguments. It's a different language
> and environment which is having real money and brainpower thrown
> at it.


Care to share pointers to the three leading JavaScript groups? Plus
maybe add some commentary?

Or are you going to make us google it ourselves? :-(