From: Dmitry A. Kazakov on 7 Feb 2006 09:40
On 6 Feb 2006 19:32:11 -0800, topmind wrote:
> Dmitry A. Kazakov wrote:
>> On 5 Feb 2006 14:00:22 -0800, topmind wrote:
>> Really? What is the result of add("fee", "123") ?
>> "1069"? "4201"? "fee123"? "sysadmin(a)error.org"?
> An error.
Why? "fee" was the number 4078, in hexadecimal! What did you think? (:-))
>> Huh, the code you presented is 1) longer, 2) far more confused.
> I don't have to define and hunt down definitions of types. You will
> have something like:
> a = new integer(min=-223452345, max=234234234);
> b = new integer(min=-09809809809, max=242093423423);
> x = a.add(b);
type I is new Integer; -- Compare with a CREATE TABLE statement
A, B, X : I;
X := A + B;
>> As I have shown, it is not type free. It is typed in a definite way.
> I suspect we would need a clear definition of "types" to settle this.
> Either way, it is not relying on "types" in the usual strong-typed way.
> By the way, many scripting languages make no internal distinction
> x = 123
> x = "123"
And "1 23"? A type is a set of values and operations. It is irrelevant how
you denote values!
>> How do
>> you know that this way is the proper one? Do you have use-tests to verify
>> that? Did you count all these tests? Where are use tests for accuracy,
>> rounding errors?
> Rounding errors for "add"?
Add 1.0 to 1000000000.0 in IEEE 32-bit and print the result. That should be
the very first thing students should do in classes. This is why different
numeric models exist. There is a trade-off between space, performance,
accuracy and the set of closed operations. It isn't rocket science...
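That exercise needs no special hardware; a minimal sketch in Python, rounding through the IEEE 754 32-bit format with the standard struct module:

```python
import struct

def f32(x):
    # Round a Python float (64-bit) to the nearest IEEE 754 32-bit value.
    return struct.unpack('f', struct.pack('f', x))[0]

a = f32(1000000000.0)  # 1e9 is exactly representable in 32 bits
b = f32(a + 1.0)       # adjacent float32 values near 1e9 are 64 apart,
                       # so adding 1.0 rounds straight back to 1e9
print(b == a)          # True: the addition is lost entirely
```

The spacing between representable 32-bit values near 1e9 is 64, so any addend below 32 vanishes in the rounding: exactly the kind of error "add" can produce.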
>> Fine, show us an outline of a better language limited to solely relations
>> and we'll see.
> Hold on, are we talking about implementation here, or the language? I
> don't see how the language relates to compactly representing pixels.
Of course it relates. I want to access each pixel relationally, using
SELECT. How would you implement, say, motion detection in video images if
you cannot access pixels? You say you are thinking in tables. Please, do it.
> As far as implementation, one could create a table such as:
> table: images
> imageRef // image ID
Fine, this is what I want. Now, write in your relational language:
1. Motion detection
2. Image compression of your choice
3. Line detection
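As a sketch of what such a pixel table implies, here is the pixels-as-rows idea in Python with SQLite; the schema, frame data and threshold are invented for illustration, and the "motion detection" is the naive frame difference:

```python
import sqlite3

# One row per pixel per frame -- the relational encoding of an image.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pixels (frame INTEGER, x INTEGER, y INTEGER, value INTEGER)")

# Two tiny 2x2 frames; exactly one pixel changes between them.
frame0 = [(0, 0, 0, 10), (0, 1, 0, 10), (0, 0, 1, 10), (0, 1, 1, 10)]
frame1 = [(1, 0, 0, 10), (1, 1, 0, 10), (1, 0, 1, 10), (1, 1, 1, 200)]
con.executemany("INSERT INTO pixels VALUES (?, ?, ?, ?)", frame0 + frame1)

# Naive motion detection: pixels whose value changed more than a threshold.
moved = con.execute("""
    SELECT a.x, a.y
    FROM pixels a JOIN pixels b ON a.x = b.x AND a.y = b.y
    WHERE a.frame = 0 AND b.frame = 1 AND ABS(a.value - b.value) > 50
""").fetchall()
print(moved)  # [(1, 1)] -- the one changed pixel
```

Whether this scales is another question: at one row per pixel per frame, a single second of ordinary video is already millions of rows.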
>> 1. Row ID. It falls out of the model, completely. In a truly relational model
>> you cannot mark either cells or rows, they have no identity of their own.
> Please clarify. Relational does not limit auto-generated keys. Some say
> it "encourages" one to use domain keys, but this has never been
> settled, and some argue that auto-gen keys are or can be domain keys.
Auto-generated keys defeat the model. It is a work-around. Is uniqueness a
property of a table, row, cell, value, DB, Universe? Can I copy it? Is the
copy unique again? In which scope? Can I add such keys? Sort them? These
questions have many contradicting answers. Sometimes I need one answer,
sometimes another. ADT solves that by having clear contracts for the objects involved.
>> 2. Tables of tables.
> This is a *good* restriction of the relational model. Hard-nesting is
> hard-wiring a particular viewpoint or (access path) into the model,
> which goes against the relativistic philosophy of relational. Dr. Codd
> set out to purposely avoid hard-wired access paths when he started
> thinking about relational due to the navigational messes that others
> were creating. However, it does not limit what can be modelled
> externally. One can still model nested stuff using non-nested tables.
> It is not a technical "limit", but a philosophical guide-wall.
Yes. It means that your model isn't complete. Note also that relations, as
found in mathematics, do not impose any such limitations. I can have a set of
sets, a set of sets of sets, etc., and define relations on them.
> Relational imposes rules to be called "relational". Otherwise, it would
> turn into the navigational messes that motivated the creation of
> relational to begin with.
So data in RDBs cannot be properly structured. Fortunately there still
exist hierarchical DBs!
>> 3. Generic programming in terms of sets of tables sharing common
> You seem to want to model paradigm X in paradigm Y. That is the wrong
> approach. You *use* paradigm X, not change it to look like another
> paradigm. Relational is a paradigm. If you don't like its rules, then
> leave it.
This is not how complex software can be built. We are in the XXI century.
>> Differential GPS resolution is about 30cm horizontally. Let's take, say, a
>> Germany map (not such a big country) with 30x30cm mesh... [Hint: roads have
>> limited width.]
> One could model roads as polygons if you want that precision. GIGO in
> effect here.
No, we wish to model them as relations! A polygon isn't a relation. You can
have a table of rows representing polygons, that's OK. Now, write the
SELECT statement that gives me the car position, movement direction and
distance to the next turn I have to make. Show how this problem can be
decomposed using the relational approach.
> And, I don't know where you are getting your size estimates.
From the GPS resolution of 30cm. A non-relational approach can reduce the
amount of data needed for search, using k-d trees, for example. But you have
to stick to exact matches like X=a AND Y=b, which is absolutely unrealistic.
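The k-d tree mentioned above is the kind of index a non-relational engine can use natively; a minimal 2-d sketch in Python, with invented points:

```python
import math

# A minimal 2-d k-d tree: nodes split alternately on x and y, and the
# nearest-neighbour search prunes whole half-planes. Points are invented.

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def build(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or dist(point, target) < dist(best, target):
        best = point
    # Descend into the half containing the target first...
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    best = nearest(near, target, best)
    # ...and visit the other half only if it could hold a closer point.
    if abs(target[axis] - point[axis]) < dist(best, target):
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))  # (8, 1), the closest stored point
```

The pruning step is what a plain X=a, Y=b predicate cannot express: whole regions are skipped by comparing a single coordinate against the splitting plane.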
> I don't know. I do notice that academics tend to be poorly trained in
Hmm, AFAIK, academics aren't trained, they train others! (:-))
> They could simply be afraid of them out of unfamiliarity.
No, it is just that people usually don't even consider approaches known to
be unsuitable. A perpetuum mobile is not worth a try.
> Further, if there are specialized engines already built to process
> things such as Bayesean networks or neural nets, obviously it makes
> sense to go with those specific already-built solutions. RDBMS shine
> where you have *different uses* for the *same* info. Those things you
> list tend to be *same* uses for the same info. See the difference?
Yes. In short, RDBMS is not a paradigm. End of story.
Had you started with this, everybody would agree.
>> But actually, they are incomparable not because relational is incomplete
>> and thus bad. It isn't bad. They are incomparable, because relational is of
>> lower level. You cannot compare a vehicle with a wheel, though a circus
rider could use the latter. I, personally, would rather take a car. Some
would take a jet or a ship.
> No, it is actually higher-level than OOP. (This will turn into "level"
> definition fight I fear, similar to the "complexity" metrics mess.)
No, it will not, because you have already conceded that whatever level it
could be, it is unsuitable outside some niche applications.
>> I don't care about the CPU.
> I said compiler/interpreter, not the CPU. However, CPU is a similar
> example: the machine code is just data to it.
Yep, and I don't care about machine code.
>>> A developer may think of a function as
>>> "behavior", but the interpreter treats it more like data if we look at
>>> other processes that read what we normally call "data".
>> That's the point: software is developed, maintained and finally scrapped by
> I don't see how this relates to the relativistic viewpoint of behavior
> and data.
You need a paradigm conformant to this relativism [=data abstraction]. ADT
is the vehicle for this. Whether it is used with OO or with pure relational
(so that ADTs are limited to cells) does not matter; the latter is just much
weaker. What you propose is outside of that.
Dmitry A. Kazakov
From: Dmitry A. Kazakov on 7 Feb 2006 09:41
On 6 Feb 2006 12:01:11 -0800, Mikito Harakiri wrote:
> Dmitry A. Kazakov wrote:
>>>> If you want a challenge, fine. Take any machine learning method. Training
>>>> sets are ideal tables, rows and columns, nothing else. Take any method of
>>>> your choice and implement it in SQL! For introduction to existing methods,
>>>> see the excellent tutorials by Andrew Moore:
>>> How about something from the domain of custom biz apps. I have already
>>> conceded out of ignorance of the domain that DB's may not be good for
>>> heavy-duty numerical analysis.
>> This is not numerical analysis. This is a challenge to the relational approach
>> in general. There is a problem specified relationally, that has an obvious
>> relational solution, yet nobody would even try to use RDBMS for that. How
>> could THAT happen?
>> [ Answer: RDBMS are completely unsuitable either for matching in
>> multidimensional spaces, or for best-match criteria. The idea of separating
>> data and using normalized search schemas simply does not work beyond
>> trivial cases. Now, you will tell me that all biz-applications are trivial,
>> but I won't buy it. And above all, I won't buy a tool limited to trivial
>> cases! ]
> It is good that you already know the answer. Let me provide the
> relational one. Assume you have the learning sample as a binary
> relation S:
> Therefore, the standard outer join can be viewed as a primitive machine
> learning method.
Right, it is a trivial translation of the training set into a classifier
which rejects anything that isn't in the training set. It is not yet
learning, though, which uses some additional information to generalize the
training set.
So machine learning is *naturally* defined relationally. The point is that
this does not help.
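That trivial translation, a join that acts as a lookup "classifier" and returns nothing for unseen inputs, can be sketched with SQLite from Python; the schema and data are invented:

```python
import sqlite3

# Training set as a relation from feature to class; an outer join of the
# query points against it is the "classifier": exact matches get a class,
# unseen inputs come back with NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE training (feature INTEGER, class TEXT)")
con.executemany("INSERT INTO training VALUES (?, ?)",
                [(1, 'A'), (2, 'A'), (10, 'B')])
con.execute("CREATE TABLE queries (feature INTEGER)")
con.executemany("INSERT INTO queries VALUES (?)", [(2,), (3,)])

rows = con.execute("""
    SELECT q.feature, t.class
    FROM queries q LEFT OUTER JOIN training t ON q.feature = t.feature
    ORDER BY q.feature
""").fetchall()
print(rows)  # [(2, 'A'), (3, None)] -- 3 was never seen, so no class
```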
> More sophisticated methods (e.g. those listed on the referenced page
> http://www.autonlab.org/tutorials/) are just variants of outer join.
More precisely, ultimately, they all fall under nearest-neighbour search.
The difference is only in how the distance is defined. The problem of RDBMS
is that they can neither effectively implement such searches nor provide an
abstraction layer for performing different types of searches on the same
data sets. An ADT system can do this.
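Such an abstraction layer can be illustrated in Python: one nearest-neighbour search parameterized by the distance function, run twice over the same data set. Data and names are invented, and the points are chosen so that the two metrics genuinely disagree:

```python
# A nearest-neighbour search abstracted over the distance function: the
# kind of contract an ADT can state directly. Data are invented.

def nearest_class(training, query, distance):
    # training: list of (feature_tuple, class) pairs
    _, label = min(training, key=lambda row: distance(row[0], query))
    return label

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Same data set, two different notions of "nearest":
training = [((2.0, 2.0), 'A'), ((0.0, 3.0), 'B')]
print(nearest_class(training, (0.0, 0.0), euclidean))  # A
print(nearest_class(training, (0.0, 0.0), manhattan))  # B
```

The search code never changes; only the distance contract is swapped, which is exactly the "different types of searches on the same data" requirement.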
Dmitry A. Kazakov
From: Mikito Harakiri on 7 Feb 2006 13:12
Dmitry A. Kazakov wrote:
> > More sophisticated methods (e.g. those listed on the referenced page
> > http://www.autonlab.org/tutorials/) are just variants of outer join.
> More precisely, ultimately, they all fall under nearest-neighbour search.
> The difference is only in how the distance is defined. The problem of RDBMS
> is that they can neither effectively implement such searches nor provide
> an abstraction layer for performing different types of searches on the
> same data sets. An ADT system can do this.
No, induction in general is a much more complex concept than
nearest-neighbour search. As you can see, induction relationally is just
a form of outer join: we just specify which method to use as a
parameter. Of course, the code has to be in the relational engine --
either natively, or through relational extensions.
BTW, data mining stuff is already integrated into DBMS. It is not
represented technically as outer join, but that's a minor detail.
Having it formally as the outer join may help optimization (that is, all
the optimization rules valid for outer join would have to work for
induction as well), but due to my lack of experience in that area I'm
not sure how important this idea is. The point is that in your data
warehouse you can write a query that makes a prediction of future
sales, and then massage the output relation, do various aggregations,
etc. In a word, there is life beyond ADT.
From: Dmitry A. Kazakov on 7 Feb 2006 15:00
On 7 Feb 2006 10:12:52 -0800, Mikito Harakiri wrote:
> Dmitry A. Kazakov wrote:
>>> More sophisticated methods (e.g. those listed on the referenced page
>>> http://www.autonlab.org/tutorials/) are just variants of outer join.
>> More precisely, ultimately, they all fall under nearest-neighbour search.
>> The difference is only in how the distance is defined. The problem of RDBMS
>> is that they can neither effectively implement such searches nor provide
>> an abstraction layer for performing different types of searches on the
>> same data sets. An ADT system can do this.
> No, induction in general is a much more complex concept than
> nearest-neighbour search. As you can see, induction relationally is just
> a form of outer join:
You probably mean inference. No, inference is not learning. You need some
additional knowledge beyond the training set to learn. This knowledge can
be formalized as a metric distance in the feature space. So it goes as
follows (for example): knowledge = "let features be independent random
variables distributed normally, and classes don't intersect"; then build a
classifier minimizing the probability of error.
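That recipe can be sketched in Python as a per-feature Gaussian classifier (one feature here, so independence is trivial); the data and class labels are invented:

```python
import math

# The "knowledge" above -- features normally distributed, classes
# disjoint -- fixes the classifier: fit a mean and variance per class,
# then pick the class with the higher likelihood at the query point.

def fit(samples):
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def log_likelihood(x, mean, var):
    # Log of the normal density N(mean, var) at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

class_a = fit([1.0, 1.2, 0.8, 1.1])  # invented samples around 1
class_b = fit([5.0, 5.3, 4.7, 5.1])  # invented samples around 5

def classify(x):
    return 'A' if log_likelihood(x, *class_a) > log_likelihood(x, *class_b) else 'B'

print(classify(1.3), classify(4.8))  # A B
```

The training set alone never mentions 1.3 or 4.8; the normality assumption is the extra knowledge that generalizes it.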
But if you want to do inference, I won't object. I'm dying to see a
relational theorem prover...
> we just specify which method to use as a
Yes, I know that functional decomposition is a forbidding mountain for
relationalists. OK, let's fix the method for simplicity.
> Of course, the code has to be in the relational engine --
> either natively, or through relational extensions.
That is the whole point. Why does
SELECT Class FROM Training_Set WHERE Distance (X,Y) < Threshold
work only on paper? Why is it impossible even to write one statement
working for any set of features (X and Y are tuples (X1, X2, ...))?
So much for a "paradigm."
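To be fair to the statement itself: for a fixed set of feature columns it can be made to run by registering a user-defined Distance function, e.g. in SQLite from Python. What cannot be written is one such statement generic over any tuple of features; X1, X2, ... must be spelled out column by column, which is the point above. Schema and data are invented:

```python
import math
import sqlite3

# Distance as a user-defined function over a FIXED pair of feature
# columns; the tuple (X1, X2) has to be flattened into the signature.
con = sqlite3.connect(":memory:")
con.create_function("Distance", 4,
                    lambda x1, x2, y1, y2: math.hypot(x1 - y1, x2 - y2))

con.execute("CREATE TABLE Training_Set (X1 REAL, X2 REAL, Class TEXT)")
con.executemany("INSERT INTO Training_Set VALUES (?, ?, ?)",
                [(0.0, 0.0, 'A'), (10.0, 10.0, 'B')])

rows = con.execute("""
    SELECT Class FROM Training_Set
    WHERE Distance(X1, X2, ?, ?) < 2.0
""", (1.0, 1.0)).fetchall()
print(rows)  # [('A',)] -- only the point within the threshold
```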
> BTW, data mining stuff is already integrated into DBMS.
No more than numbers are. Ready to roll up a relational implementation of
least-squares non-linear regression? [BTW, ultimately, approximation is
also a nearest-neighbour search.]
Dmitry A. Kazakov
From: Mikito Harakiri on 7 Feb 2006 16:14
Dmitry A. Kazakov wrote:
> > No, induction in general is a much more complex concept than
> > nearest-neighbour search. As you can see, induction relationally is just
> > a form of outer join:
> You probably mean inference. No, inference is not learning.
I meant induction:
> You need some
> additional knowledge beyond the training set to learn. This knowledge can
> be formalized as a metric distance in the feature space.
Induction can be as simple as Lagrange polynomial interpolation over N
graph points. I fail to follow your idea that "knowledge can be
formalized as a metric distance in the feature space", and am very
sceptical about naive methods of separating points in hyperspace with a
hyperplane, which are so common in the AI area. The same goes for distance-
and metrics-based methods. They are too unsophisticated to produce anything.
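The Lagrange example is small enough to write out; a sketch in Python over invented sample points:

```python
# Lagrange interpolation over N graph points: the unique polynomial of
# degree N-1 through the points, evaluated at x -- induction in the
# simple sense above, predicting values the sample never contained.

def lagrange(points, x):
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]  # three samples of y = x^2
print(lagrange(pts, 3.0))  # 9.0 -- recovers the quadratic exactly
```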
> So it goes as
> follows (for example): knowledge = "let features be independent random
> variables distributed normally and classes don't intersect", build a
> classifier minimizing the probability of error.
Once again, with the "probability" concept not firmly established, this is
just yet another ad hoc "machine learning" method.
> But if you want to do inference, I won't object. I'm dying to see a
> relational theorem prover...
No, we don't discuss inference here. Speaking of inference, RDBMS is
already an inference engine, admittedly with quite limited
capabilities. Deductive and constraint databases are perceived as a
next big step, but (as it is common in programming world) promises are
short on delivery.
> > Of course, the code has to be in the relational engine --
> > either natively, or through relational extensions.
> That is the whole point. Why does
> SELECT Class FROM Training_Set WHERE Distance (X,Y) < Threshold
> work only on paper?
The distance query is not the method I advocated.
> Why is it impossible even to write one statement
> working for any set of features (X and Y are tuples (X1, X2, ...))?
You have to be more specific here for me to follow. Is there a particular
query-expressiveness restriction that you have in mind?
If you refer to the quite modest success of RDBMS in the spatial/temporal
area, then you are right. The list of spatial operators is far from
succinct, and the implementation is far from great. The 13(!) operators
for the interval datatype imply that the interval ADT is simply a wrong
idea. For example, writing
select * from Intervals a, Intervals b
is less elegant and more ambiguous than
select * from Intervals a, Intervals b
where a.x between b.x and b.y
Yet, even with all the drawbacks, querying is a far superior abstraction
to OOP. It isolates a particular implementation from the user and,
unlike OOP, tries hard not to introduce unnecessary and ugly artifacts.
Hence all the noise about "logical" versus "physical".