Discussion:
[STORK] Pondering Operating Systems
Jeff Bone
2003-05-12 23:11:09 UTC
Permalink
At the risk of colliding headlong with undesirable implications of
Joe's nascent government = OS argument, here are a few thoughts about
operating systems.

OSes exist for one reason and one reason only: to factor out all the
common stuff that cuts across application domains so that application
vendors and users aren't stuck having to reinvent the wheel (and, for
users, deal with the consequences) every time some problem is
encountered.

They succeed (or fail) at this to varying degrees. UNIX is highly
successful in this regard because it takes a small number of
abstractions and attempts to uniformly apply them everywhere. (As
mentioned previously, BSD Sockets broke this, Plan 9 gets it back.)
The net result is that the entire system is a giant integration
toolkit, and new applications are so trivial to build that even green
users do so and throw them away w/o a thought. A side-effect of this
is something I call "information integration and reuse" --- disparate
sources and kinds of information can be synthesized on the fly into
newer, richer sources using a few simple tools and techniques. The
entire system has a high degree of transparency and usefulness as a
result of this.

Windows fails in this regard. It takes a much larger number of
abstractions and conventions --- often overlapping, almost always
inconsistent --- and applies them in an unconstrained and rather
undisciplined way. As a result, you get much larger (cognitive surface
area, APIs used, etc.) and more monolithic applications which are in
some sense islands unto themselves --- which have a low degree of
reusability and a high degree of conceptual and operational overlap.
These mega-apps balkanize and fragment the system at all levels, impede
information integration and reuse, and make the user experience far
more tedious while delivering far less functionality. And with
Longhorn's structured containers / multi-stream files, this tendency
will merely accelerate.

No-OS (perhaps counterintuitively) has the potential to accelerate this
trend of functional and informational balkanization. Many if not most
Web applications unfortunately disguise their informational models
behind idiosyncratic and (in the case of e.g. SOAP and XML-RPC) opaque
interfaces. Integration complexity between these apps grows
non-linearly and indeed these applications form "islands" of
functionality and data which communicate poorly with each other, if at
all.

What is needed is a new notion of OS, one that provides the same kind
of cross-cutting factorization for Web-based and distributed
applications that UNIX provides. An "Internet OS," to toss out that
bit of flame bait. IMHO, it's almost there already; all that's
lacking is a better set of toolkits for doing RESTful apps, a Web-aware
shell, and some compositional end-user tools for XML-based data akin to
e.g. grep, cut, etc. (Yes, I'm aware these exist; they just aren't
good enough yet to justify building the URI-aware shell for them.) (NB:
I will omit the discussion of why Web Services aren't this, that topic
having been covered ad nauseam here and elsewhere earlier.)
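
To make the grep/cut analogy concrete, here's a minimal sketch of the kind
of filter I have in mind, in Python with only the standard library; the
script name, input file, and XPath expression are all made up for
illustration:

# xgrep.py -- a grep-like filter for XML: print subtrees matching a
# limited XPath expression, so it can sit in a pipeline the way grep does.
import sys
import xml.etree.ElementTree as ET

def xgrep(source, path):
    """Parse XML from a file-like object and emit matching elements."""
    root = ET.parse(source).getroot()
    for elem in root.findall(path):
        sys.stdout.write(ET.tostring(elem, encoding="unicode"))
        sys.stdout.write("\n")

if __name__ == "__main__":
    # e.g.  python xgrep.py ".//item[@status='forsale']" < listings.xml
    xgrep(sys.stdin, sys.argv[1])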

But it seems to me that one of the things that must frustrate Russell
the most and therefore motivate the No-OS idea is the tight coupling
today between data and device. And I agree, that's a huge problem.
But the solution is neither to do away with the OS nor to do away with
local storage.

Local storage is useful for those times when you "aren't connected."
Let's take a moment to elaborate that: "connectivity" is a spectrum;
"disconnected" is just the edge case of latency. Even with pervasive
WiFi or other IP, there will still be a continuum of bandwidth and
latency available. Some apps naturally have higher demands than
others. Any system designed with the assumption of (some level of)
constant connectivity will ultimately fail due to pressures in this
area; it's a very fragile assumption.

You need local storage. That storage might well be best viewed as a
cache, and all important data should certainly be replicated elsewhere
transparently, but that local storage is important. And since you're
going to have that notion of "my storage" even if it's device
independent --- it should cut across applications in order to
facilitate information integration and reuse. No good for my mail to
be at hotmail, my FOAF to be at Friendster, my buddy list to be at ICQ,
etc. That's Gen 1 Web. Better to have some device-independent,
redundant, and disconnectable notion of a home directory or workspace,
floatable, accessible from and to any application. Gen 2 Web is much
more intertwingly.

$0.02,

jb
Russell Turpin
2003-05-13 00:05:08 UTC
Permalink
But it seems to me that one of the things that must frustrate Russell the
most and therefore motivate the No-OS idea is the tight coupling today
between data and device.
Yep.
You need local storage. That storage might well be best viewed as a cache,
and all important data should certainly be replicated elsewhere
transparently, but that local storage is important. And since you're going
to have that notion of "my storage" even if its device independent --- it
should cut across applications in order to facilitate information
integration and reuse. No good for my mail to be at hotmail, my FOAF to be
at Friendster, my buddy list to be at ICQ, etc. That's Gen 1 Web. Better
to have some device-
independent, redundant, and disconnectable notion of a home directory or
workspace, floatable, accessible from and to any application.
Two comments. First, let's be clear that "my data"
is different from cache. Cache has to be local to
serve its purpose to no-OS (running app,
whatever). Cache can be discarded. "My data"
need not be local; at times, I might access it
over the internet simply because I'm on the
move and didn't take it with me. So why should
it have a second API when I'm in the same room
with it? There are a variety of reasons I might
want "my data" to be local, from paranoia about
its security to latency, but those are orthogonal
to the issues with cache.

Second, the problem with having mail at
Hotmail, FOAF at Friendster, etc., isn't that
the data is distributed geographically, but that
you have to run disconnected apps to access
it. If it were presented to you as a cohesive
whole, you would no more care that it was
stored on different servers in different states
than you do that much of your data now is on
different tracks or in different directories.
Where the bits actually reside should matter
only when you start to think about security
issues, e.g., threats from fire and theft, and
how to replicate or encrypt to protect
against these threats.

I don't know the solution to this. If only
all web services made "your data" accessible
first through some application neutral
standard .. click your heels three times, and
go home.
Extensible attributes in the filesystem, coupled with indexing of arbitrary
attributes. BeFS got it right. ..
You know I love attributes, but I can't
fully agree. However nice it is, this kind of
thing introduces a sort of data apartheid.
I might be a believer were all of BeOS's
files zero-length, i.e., were all the data
stored in attributes. There's something that
bothers me about having that one special
attribute, "contents," which only applications
familiar with it can interpret correctly. One
of the neat things about the old Unices is
that everything was a file and all files were
conceptually streams of bytes. The only
exceptions were file name and access
rights. OK, and sticky bit. And a few other
things. Never mind. I thought I had a point,
here. Feh.

Jeff Bone
2003-05-13 00:21:18 UTC
Permalink
Post by Russell Turpin
Extensible attributes in the filesystem, coupled with indexing of
arbitrary attributes. BeFS got it right. ..
You know I love attributes, but I can't
fully agree. However nice it is, this kind of
thing introduces a sort of data apartheid.
Probably disagree with the conclusion...
Post by Russell Turpin
There's something that
bothers me about having that one special
attribute, "contents," which only applications
familiar with it can interpret correctly.
Agreed. But most file formats mix data and metadata willy-nilly.
Indeed, you might say that file formats *exist* for that reason, i.e.
because the filesystem was metadata-impoverished! File formats are
application-specific containers for a mix of application-specific
metadata, application-independent metadata, and data (whatever that
is.) And therefore each application has its own API for storing,
accessing, querying, etc. all that --- even when in many cases the data
in question is useful across apps.

Extensible attributes take care of this in two ways: first, harvesters
can post-facto (perhaps triggered by change notifications) dig in and
grab extensible attributes or even synthesize them. We do some of this
today at Deepfile, and the approach is sound and scalable. But more
importantly, it de-emphasizes the need for different file formats. If
attributes could be stored, accessed, queried, etc. via a common
system-wide API --- perhaps via a wrinkle in the filesystem itself,
attribute directories == directories, attributes == files, values ==
file contents a la ReiserFS, BeFS, and others --- then over time people
would leverage this rather than trying to cram it all into yet another
file format or XML Schema. (Indeed, I expect that over time filesystem
access and XML processing might cease to be distinct: what if the
filesystem, abstractly, IS an XML object? And/or vice versa?
There's also a nice, incremental path, here --- mapping XSD directly
onto the DB, leveraging system indexing and change notification, etc.
Sweet.)
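
For a flavor of what a common, format-free metadata API looks like today,
here's a minimal sketch using Linux extended attributes from Python; it
assumes a filesystem mounted with user xattrs enabled, and the attribute
names are invented for illustration:

# harvest.py -- sketch of a post-facto harvester stamping extensible
# attributes onto an ordinary file via one system-wide call (Linux xattrs).
import os
import sys

def harvest(path):
    # "Synthesize" some metadata and attach it -- any app can read it
    # back through the same common API, no private file format required.
    st = os.stat(path)
    os.setxattr(path, "user.size_bytes", str(st.st_size).encode())
    os.setxattr(path, "user.harvested_by", b"harvest.py")

def dump(path):
    for name in os.listxattr(path):
        print(name, "=", os.getxattr(path, name).decode(errors="replace"))

if __name__ == "__main__":
    harvest(sys.argv[1])
    dump(sys.argv[1])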

Much as I hate to admit it, MUCH of the data reality we deal w/ is
hierarchical in nature, rather than traditional relational.
Normalization is neither always possible nor always desirable. It's
time to dig in and attempt once more to crack that network /
hierarchical DB nut. Hans Reiser gets it. Dominic Giampaolo gets it.
Apparently some of the Microsoft crew *almost* gets it. We'll see what
happens.
Post by Russell Turpin
One
of the neat things about the old Unices is
that everything was a file and all files were
conceptually streams of bytes.
Indeed. And extensible metadata moves us back that direction, rather
than away from it.

jb
Russell Turpin
2003-05-13 01:30:34 UTC
Permalink
Extensible attributes take care of this in two ways: first, harvesters can
post-facto (perhaps triggered by change notifications) dig in and grab
extensible attributes or even synthesize them. .. If attributes could be
stored, accessed, queried, etc. via a common system-wide API --- perhaps
via wrinkle
in the filesystem itself, attribute directories == directories, attributes
== files, values == file contents ala ReiserFS, BeFS, and others --- then
over time people would leverage this rather than trying to cram it all into
yet another file format or XML Schema.
Good points, both.

Any significant move in this direction is going to
depend critically on semantic normalization. It
doesn't do any good to have tools that operate
on attributes on files from different sources
unless they have some way to determine that
"wantdeal=forsale," "fortrade=sale," and
"wanttosell" all mean the same thing as file
attributes. I understand the dislike for heavy,
burdensome industry standardization. The
alternative is to do the semantic normalization
after the data is created. Without normalization
somewhere, there will continue to be data
balkanization. Yeah, normalization rules can be
exposed just as parameters to the scripts/
tools. But a lot of them have persistent value.
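
A toy sketch of what I mean by normalizing after the fact, in Python; the
rule tables here are invented, and in practice they would be maintained
and shared:

# normalize.py -- after-the-fact semantic normalization: map idiosyncratic
# attribute names and values onto a canonical vocabulary.
CANONICAL_NAMES = {
    "wantdeal": "listing_type",
    "fortrade": "listing_type",
    "wanttosell": "listing_type",
}
CANONICAL_VALUES = {
    "forsale": "sale",
    "sale": "sale",
}

def normalize(attrs):
    """Rewrite a {name: value} attribute dict into canonical terms."""
    out = {}
    for name, value in attrs.items():
        out[CANONICAL_NAMES.get(name, name)] = CANONICAL_VALUES.get(value, value)
    return out

if __name__ == "__main__":
    # Three differently-labelled sources, one meaning after normalization.
    print(normalize({"wantdeal": "forsale"}))
    print(normalize({"fortrade": "sale"}))
    print(normalize({"wanttosell": "sale"}))
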
Indeed, I expect that over time filesystem access and XML processing might
cease to be distinct: what if the filesystem, abstractly, IS an XML
object?
Push this far enough, and you get rid of the file's
contents, per se. It's all just objects, attributes, and
values, where values can be other objects with
their own attributes and values. (That gives the
network/hierarchical structure.) I think you're right
that the transition is keeping the notion of file while
heavily mining it to surface attributes.

Adam L. Beberg
2003-05-13 01:58:27 UTC
Permalink
Post by Russell Turpin
Any significant move in this direction is going to
depend critically on semantic normalization. It
doesn't do any good to have tools that operate
on attributes on files from different sources
unless they have some way to determine that
"wantdeal=forsale," "fortrade=sale," and
"wanttosell" all mean the same thing as file
attributes.
This is exactly why XML hasn't done anyone a damn bit of good.

Back in the dark ages of 2001 we started with SQL databases that
couldn't work together because the columns had different names.
ProductID != SCU != PIN.

So some genius decided the solution was to text encode and label all
the columns explicitly. They made a fortune I'm sure.

But now we have XML interfaces that can't work together because the
schemas have different names. <ProductID=.../> != <SCU=.../>.

And it's 3x the bandwidth, 100x the interface code, and 10x slower.
BONUS!
Cisco and Intel thank you!

Go geeks! Keep on reinventing that wheel. Now stop FoRKing over the
customer and make it round for a change.

And as far as file metadata, it's useless. Applications can store their
metadata in their own files where it belongs. Publish your file format
if you want to be harvested by Google.

- Adam L. Beberg - ***@mithral.com
http://www.mithral.com/~beberg/
Stephen D. Williams
2003-05-13 03:10:59 UTC
Permalink
Post by Russell Turpin
Any significant move in this direction is going to
depend critically on semantic normalization. It
doesn't do any good to have tools that operate
on attributes on files from different sources
unless they have some way to determine that
"wantdeal=forsale," "fortrade=sale," and
"wanttosell" all mean the same thing as file
attributes.
Indeed.
This is exactly why XML hasn't done anyone a damn bit of good for anyone.
Back in the dark ages of 2001 we started with SQL databases that
couldn't work together because the columns had different names.
ProductID != SCU != PIN.
So some genius decided the solution was to text encode and label all
the columns explicitly. They made a fortune I'm sure.
But now we have XML interfaces that can't work together because the
schema have different names. <ProductID=.../> != <SCU=.../>.
And it's 3x the bandwidth, 100x the interface code, and 10x slower.
BONUS!
Cisco and Intel thank you!
However: the schema is flexible (when are relational databases going to
stop using fixed row schema?), the data is self-describing, the protocol
is no longer proprietary (ODBC, being API-based as is MS's wont,
provided a counter-incentive to a common RDBMS access protocol), the data
model is regular and extensible, and everyone can work on solving higher
level problems.
Go geeks! Keep on reinventing that wheel. Now stop FoRKing over the
customer and make it round for a change.
And as far as file metadata, it's useless. Applications can store
their metadata in their own files where it belongs. Publish your file
format if you want to be harvested by Google.
http://www.mithral.com/~beberg/
Jeff Bone
2003-05-13 02:05:06 UTC
Permalink
Post by Russell Turpin
Any significant move in this direction is going to
depend critically on semantic normalization.
Yeah, yeah. Ontologies, topic maps, attribute namespaces, mapping and
transformation via XSLT or other standard tools, etc. This isn't
"magic happens here" --- it's just heavy lifting. But the basic
infrastructure to apply all this is still sorely lacking. Enter
filesystems with extensible metadata...

jb
Jeff Bone
2003-05-13 02:11:29 UTC
Permalink
Post by Adam L. Beberg
And as far as file metadata, it's useless. Applications can store
their metadata in their own files where it belongs. Publish your file
format if you want to be harvested by Google.
Bzzzzt, thanks for playing. Until and unless we move away from that,
real knowledge management (not to mention data lifecycle management,
app bloat because of a gazillion file formats and APIs, etc.) on almost
any significant scale is a non-starter. BTW, significant difference
between "store it in the file using your own idiosyncratic container
structure and API" and "store it in the directory-like thing that's
really a graph of metadata, looking exactly like a directory structure
using a standard API."

As for XML, yeah, it's a bitch. But they got the data model right, at
least. Don't confuse the syntax with the semantics.

jb
Stephen D. Williams
2003-05-13 03:11:18 UTC
Permalink
Any filesystem that has full granularity for metadata must also support
rolling up small grains into large-grain operations. This is the
problem with associative databases as some of them stand now, and made
clear by the mbox vs. qmail dichotomy. You must support the full object
paradigm without much overhead most of the time.

In other words, it's ok to view everything as a giant XML forest as long
as that 'subdirectory' can at some level be treated as a sequence of
bytes to be rapidly loaded, stored, transmitted, etc. You can't force
fine granularity everywhere and expect anything to work for more
than toy problems.

My bsXML/SDOM ideas are an attempt to solve some of the performance,
versioning/copy-on-write, rollup needs while retaining the usefulness of
XML. It is designed to avoid parsing altogether, except at 'edges', and
be usable for in-memory, 'serialization' (where most work is also
avoided), and XML databases. It could also be used at the filesystem level.

Application specific files can be dealt with via a library of access
code which could be used to build or emulate normalized metadata. The
real problem is proprietary files that haven't been reverse engineered
or otherwise documented. Of course, if an application is
knowledge-management shared-ontology aware, then you really don't need a
separate meta-data store.

sdw
Post by Jeff Bone
Post by Adam L.Beberg
And as far as file metadata, it's useless. Applications can store
their metadata in their own files where it belongs. Publish your file
format if you want to be harvested by Google.
Bzzzzt, thanks for playing. Until and unless we move away from that,
real knowledge management (not to mention data lifecycle management,
app bloat because of a gazillion file formats and APIs, etc.) on
almost any significant scale is a non-starter. BTW, significant
difference between "store it in the file using your own idiosyncratic
container structure and API" and "store it in the directory-like thing
that's really a graph of metadata, looking exactly like a directory
structure using a standard API."
As for XML, yeah, it's a bitch. But they got the data model right, at
least. Don't confuse the syntax with the semantics.
jb
Gavin Thomas Nicol
2003-05-13 05:05:31 UTC
Permalink
Post by Stephen D. Williams
My bsXML/SDOM ideas are an attempt to solve some of the performance,
versioning/copy-on-write, rollup needs while retaining the usefulness of
XML. It is designed to avoid parsing altogether, except at 'edges', and
be usable for in-memory, 'serialization' (where most work is also
avoided), and XML databases. It could also be used at the filesystem level.
I've been doing XML/SGML for more years than I care to remember, large
documents (>100MB) and versioned XML databases with incremental fulltext and
structural indexing for a large portion of that time. I don't see how SDOM or
bsXML really solve the problems I've run into... though I might be wrong.
Care to explain a bit more?
Stephen D. Williams
2003-05-13 05:32:01 UTC
Permalink
Sure. And I want to know more about the range of implementation methods
that were used to do versioned XML databases (if it's anything beyond
multiple copies).

First, and this was a big misconception from someone who reviewed my
full-argument whitepaper, I'm not directly trying to compete with, out
do, or concentrate on XML databases. What I'm primarily focused on is
the application use of XML-based business objects as they flow through
N-tiered applications or are otherwise frequently
read->parsed->represented->traversed->modified->serialized->emitted->repeat.
An application that I architected and helped develop in 1998, using
modest XML objects, Netscape Application Server (which required
flattening for session state replication), and a rule engine,
illustrated the problem to me. One pass through the application
generated over 250,000 Java objects that had to be allocated, filled
with bits of data, referenced, traversed, and later garbage collected.
Using bsXML-based processing, this could be reduced to a couple dozen
objects or even less with object storage caching. When you optimize in
this situation, almost immediately the overhead dominates your
processing and you are out of luck without replacing the infrastructure
you just invested in. You mention 100MB documents. In your experience,
how much overhead was required to process that document compared to
simply reading that data into memory as a block of data? With bsXML,
the latter is all that is required to be ready to traverse, query, or
modify the object. You could think of bsXML as a persistent DOM, except
that it attempts to solve a number of extra constraints. I'm also not
going after 'compression' directly, although I have some interest in that.

Second, XML programming, at least xml-as-data-object programming, should
be as straightforward as storing values into a structure (possibly
simpler as the structure can be implicit, created by access) or creation
of files in a directory structure. Part of the reason for this is to
gain efficiency by avoiding copies and object allocation/memory
management. DOM for instance uses a broken model where you create an
object, copy in data, and then link that object in the DOM tree. SDOM
uses a model where you plug the data directly into the tree in a single
copy. A major added advantage is that you can intermediate between an
application and its data for sophisticated semantics. A simple example
might be to log all access or modifications to a data object. To me,
this seems a bit revolutionary since I can write a very tiny amount of
code to get something done and avoid creating data structures,
getters/setters, serialization code, etc. I also have a clever way to
match OO code and objects with this representation.
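
As a toy illustration only (this is not SDOM, and the class and method
names are invented), a few lines of Python show how routing every read and
write through one choke point makes that kind of intermediation nearly
free:

# toy.py -- a toy analogue of the idea: all access goes through one place,
# so logging (or any other intermediation) falls out for free.
class LoggedTree:
    def __init__(self):
        self._root = {}
        self.log = []

    def set(self, path, value):
        node = self._root
        parts = path.split("/")
        for part in parts[:-1]:          # implicit structure creation
            node = node.setdefault(part, {})
        node[parts[-1]] = value          # data plugged in, single copy
        self.log.append(("set", path, value))

    def get(self, path):
        node = self._root
        for part in path.split("/"):
            node = node[part]
        self.log.append(("get", path))
        return node

t = LoggedTree()
t.set("customer/address/country", "US")
print(t.get("customer/address/country"))   # US
print(t.log)                                # full access/modification trail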

I'm still just doing previews here, but do you see the value in
something with those characteristics? What are the problems that may be
non-obvious?

I decided quite a while ago to publish and open source all of this
because there just isn't a viable path to a product that isn't
hopelessly tainted. See binXML for A) a related but substantially
different idea and B) a hopeless situation where venture money is
preventing the publishing of all details and an open implementation and
relegating the whole project to backwaters. I believe they are still
IDL based which isn't acceptable to me.

sdw
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
My bsXML/SDOM ideas are an attempt to solve some of the performance,
versioning/copy-on-write, rollup needs while retaining the usefulness of
XML. It is designed to avoid parsing altogether, except at 'edges', and
be usable for in-memory, 'serialization' (where most work is also
avoided), and XML databases. It could also be used at the filesystem level.
I've been doing XML/SGML for more years than I care to remember, large
documents (>100MB) and versioned XML databases with incremental fulltext and
structural indexing for a large portion of that time. I don't see how SDOM or
bsXML really solve the problems I've run into... though I might be wrong.
Care to explain a bit more?
Gavin Thomas Nicol
2003-05-13 06:09:40 UTC
Permalink
Post by Stephen D. Williams
Sure. And I want to know more about the range of implementation methods
that were used to do versioned XML databases (if it's anything beyond
multiple copies).
Differential back-chained versioning with workspace, and version-aware
incremental fulltext and structural indexing in one system, a
non-differential system (with better branch support) in another.
Post by Stephen D. Williams
First, and this was a big misconception from someone who reviewed my
full-argument whitepaper, I'm not directly trying to compete with, out
do, or concentrate on XML databases.
...the reaction really hinges on the claim to avoid COW semantics for
modifiying XML documents, and to avoid the overhead of parsing. That kind-of
implies a database-like system.
Post by Stephen D. Williams
One pass through the application generated over 250,000 Java objects
that had to be allocated, filled with bits of data, referenced, traversed,
and later garbage collected. Using bsXML-based processing, this could be
reduced to a couple dozen objects or even less with object storage caching.
Sure... I'm not saying there's not a problem. In fact, I think I was one of
the very first people to say that the verbosity of XML-RPC causes issues at
the TCP/IP level, all the way up.
Post by Stephen D. Williams
You mention 100MB documents. In your experience,
how much overhead was required to process that document compared to
simply reading that data into memory as a block of data?
Those 100MB documents (in most cases) had side indexes to get around the
"parse from the beginning" issues, so there was little overhead. FWIW. Most
of those kinds of documents are primarily read-only after creation.
Post by Stephen D. Williams
You could think of bsXML as a persistant DOM
That's where the database side comes in. A persistent DOM is not XML though...
Post by Stephen D. Williams
Second, XML programming, at least xml-as-data-object programming, should
be as straightforward as storing values into a structure (possibly
simpler as the structure can be implicit, created by access) or creation
of files in a directory structure.
Personally, I think "XML programming", except in a few specialised cases, to
be one of the biggest mistakes of the last few years... so while I agree with
the sentiment, I think the solution is not to improve XML programming, but to
*remove* it.
Post by Stephen D. Williams
DOM for instance uses a broken model where you create an
object, copy in data, and then link that object in the DOM tree.
The main thing to remember about the DOM is that it was a fairly good balance
between vastly different sets of requirements. It is not reflective of best
practices, necessarily.
Post by Stephen D. Williams
I'm still just doing previews here, but do you see the value in
something with those characteristics? What are the problems that may be
non-obvious?
My main question is "why?"

The XML databases I mentioned above have a clear purpose, but if I want to
hide the gory details of using XML in SOAP or data integrations, I can do
that with better API's and/or by removing XML (especially after the first
parse).

BTW. Is there a link to the latest bsXML spec?
Stephen D. Williams
2003-05-13 06:31:09 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Sure. And I want to know more about the range of implementation methods
that were used to do versioned XML databases (if it's anything beyond
multiple copies).
Differential back-chained versioning with workspace, and version-aware
incremental fulltext and structural indexing in one system, a
non-differential system (with better branch support) in another.
Those sound like really nice shorthand descriptions of the methods, but
they aren't precise enough for me to understand exactly what was
implemented. Do you have a more in depth description you wouldn't mind
sharing or references?
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
First, and this was a big misconception from someone who reviewed my
full-argument whitepaper, I'm not directly trying to compete with, out
do, or concentrate on XML databases.
...the reaction really hinges on the claim to avoid COW semantics for
modifiying XML documents, and to avoid the overhead of parsing. That kind-of
implies a database-like system.
What do you mean by 'avoid COW semantics'? I'm trying to get COW
semantics, and I'm not sure what you're saying needs to be avoided.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
You mention 100MB documents. In your experience,
how much overhead was required to process that document compared to
simply reading that data into memory as a block of data?
Those 100MB documents (in most cases) had side indexes to get around the
"parse from the beginning" issues, so there was little overhead. FWIW. Most
of those kinds of documents are primarily read-only after creation.
Ahh, different problem space. While I should be able to handle that
efficiently, I have the constraint of supporting random, frequent
modification of anything in the object/document.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
You could think of bsXML as a persistant DOM
That's where the database side comes in. A persistent DOM is not XML though...
Why must that imply a database? (In the sense of an external server
that handles secondary storage.)
Of course anything that isn't vanilla XML 1.0 isn't XML by definition;
however, something that keeps (eventually) all of the semantics of XML
1.0 and is just encoded differently (not character encoding, tag,
structure, etc.) could be said to be XML with bsXML encoding. I'm not
trying to quibble about semantics at that level or confuse XML as a
label, which is why I coined bsXML (binary structured XML, and also
because just about everything else was 'taken' and I was the only one
gutsy enough to label my own design BS. ;-)).
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Second, XML programming, at least xml-as-data-object programming, should
be as straightforward as storing values into a structure (possibly
simpler as the structure can be implicit, created by access) or creation
of files in a directory structure.
Personally, I think "XML programming", except in a few specialised cases, to
be one of the biggest mistakes of the last few years... so while I agree with
the sentiment, I think the solution is not to improve XML programming, but to
*remove* it.
What would you prefer? What are the alternatives? My argument is that
native data structures only make sense when A) all of your processing
with that data structure is internal to the application or B) the
processing is so intense (e.g. video, computational models) that
external access is minor overhead.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
DOM for instance uses a broken model where you create an
object, copy in data, and then link that object in the DOM tree.
The main thing to remember about the DOM is that it was a fairly good balance
between vastly different sets of requirements. It is not reflective of best
practises necessarily.
Sure, I'm just saying it is less effective than what is possible for
application data object programming.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
I'm still just doing previews here, but do you see the value in
something with those characteristics? What are the problems that may be
non-obvious?
My main question is "why?"
Efficiency, efficiency, efficiency. I've built systems that do several thousand
transactions per second (C/C++, binary data messages pipelined in a
custom MQ system) and one transaction every two seconds (Java, XML,
Web). XML is the only candidate to help clean up a lot of application
development situations, yet the more you use it, the more your performance
falls behind. I want elegance, standardization, extensibility, minimal
coding, and efficiency.
Post by Gavin Thomas Nicol
The XML databases I mentioned above have a clear purpose, but if I want to
hide the gory details of using XML in SOAP or data integrations, I can do
that with better API's and/or by removing XML (especially after the first
parse).
BTW. Is there a link to the latest bsXML spec?
Not public, but it will be shortly. I'm swamped with project work, but
I'm determined to make progress on this and in fact need it for a large
project.

Thanks
sdw
Gavin Thomas Nicol
2003-05-13 12:08:05 UTC
Permalink
Post by Stephen D. Williams
Those sound like really nice shorthand descriptions of the methods, but
they aren't precise enough for me to understand exactly what was
implemented. Do you have a more in depth description you wouldn't mind
sharing or references?
Maybe off-list... but the problem is essentially managing a persistent
n-dimensional space, and then being able to efficiently slice the space.
Post by Stephen D. Williams
What do you mean by 'avoid COW semantics'? I'm trying to get COW
semantics, and I'm not sure what you're saying needs to be avoided.
What I meant was creating a full copy for each change, or copying data, rather
than creating data in place.
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
That's where the database side comes in. A persistent DOM is not XML though...
Why must that imply a database? (In the sense of an external server
that handles secondary storage.)
That's not what I meant. To me a database is really just a store of some
form... like the Berkeley embeddable stuff, for example. bsXML sounds similar
to a database from that perspective.
Post by Stephen D. Williams
What would you prefer? What are the alternatives? My argument is that
native data structures only make sense when A) all of your processing
with that data structure is internal to the application or B) the
processing is so intense (i.e. video, computational models) that
external access is minor overhead.
Valid assertions, but I think (A) and (B) are often true. Also, there are
other reasons, such as type safety etc. that need to be considered too...
it's like the prototyping phase where every data structure is a hashtable of
hashtables ;-)
Post by Stephen D. Williams
Sure, I'm just saying it is less effective than what is possible for
application data object programming.
Very true. I still don't understand why so many people use the DOM, *except*
that XML use has kind-of pushed it onto them. For example, I was recently
writing a DAV server (finally caved in ;-)) and the worst part of the whole
thing was dealing with the XML, which, at the end of the day, is really
irrelevant to the problem space.
Post by Stephen D. Williams
XML is the only candidate to help clean up a lot of application
development situations
OK. I think we'll have to agree to disagree on the assertion here. My
experience has been that XML is sometimes convenient, but seldom ideal for
such things.
Stephen D. Williams
2003-05-13 12:58:32 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Those sound like really nice shorthand descriptions of the methods, but
they aren't precise enough for me to understand exactly what was
implemented. Do you have a more in depth description you wouldn't mind
sharing or references?
Maybe off-list... but the problem is essentially managing a persistent
n-dimensional space, and then being able to efficiently slice the space.
Please, I'd appreciate that.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
What do you mean by 'avoid COW semantics'? I'm trying to get COW
semantics, and I'm not sure what you're saying needs to be avoided.
What I meant was creating a full copy for each change, or copying data, rather
than creating data in place.
I believe I can avoid that. A fundamental data structure I use called
"elastic memory" tracks byte ranges using a couple alternative methods
for efficiency. I added the ability for an elastic memory space to
reference a parent space with copy/insert/delete semantics. This is all
below the level of the binary structure so it becomes transparent. Like
I mentioned, I was planning to do this at a higher level, but my current
evaluation is that it will be much better at this level. Imagine a
large stack of deltas: there are a number of ways to optimize such a
range-threaded unification vs. logical manipulation of sparse XML tree
equivalencies. For example, all delta COW layers except the active,
writable one could be collapsed.
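
A heavily simplified sketch of the layering idea, in Python; it models only
in-place replacement (no insert/delete), and it is not elastic memory
itself, just the shape of the frozen-layers-plus-one-writable-layer scheme:

# cow_sketch.py -- stack of copy-on-write delta layers over a parent byte
# range; everything except the active writable layer can be collapsed.
class Layered:
    def __init__(self, base: bytes):
        self.base = bytearray(base)
        self.frozen = []        # older, read-only delta layers
        self.active = {}        # offset -> replacement byte, writable

    def write(self, offset, data: bytes):
        for i, b in enumerate(data):
            self.active[offset + i] = b

    def freeze(self):
        self.frozen.append(self.active)
        self.active = {}

    def read(self):
        out = bytearray(self.base)
        for layer in self.frozen + [self.active]:
            for off, b in layer.items():
                out[off] = b
        return bytes(out)

    def collapse(self):
        # fold all frozen layers into the base; the writable layer survives
        for layer in self.frozen:
            for off, b in layer.items():
                self.base[off] = b
        self.frozen = []

m = Layered(b"hello world")
m.write(0, b"J")
m.freeze()
m.write(6, b"W")
print(m.read())    # b'Jello World'
m.collapse()
print(m.read())    # same bytes, fewer layers
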
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
That's where the database side comes in. A persistent DOM is not XML though...
Why must that imply a database? (In the sense of an external server
that handles secondary storage.)
That's not what I meant. To me a database is really just a store or some
form... like the Berkely embeddable stuff, for example. bsXML sounds similar
to a database from that perspective.
Ok, yes, part of what I want is database semantics in-memory as well as
remote. It's a database like a collection library, dictionary, etc. is
a database.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
What would you prefer? What are the alternatives? My argument is that
native data structures only make sense when A) all of your processing
with that data structure is internal to the application or B) the
processing is so intense (i.e. video, computational models) that
external access is minor overhead.
Valid assertions, but I think (A) and (B) are often true. Also, there are
other reasons, such as type safety etc. that need to be considered too...
it's like the prototyping phase where every data structure is a hashtable of
hashtables ;-)
A and B are almost never true for enterprise web applications, as one
example. Type safety is an issue, however there are a couple avenues
that make this a non-issue. First, you could wrap all access with 3GL
application objects that do whatever validation is needed. Second, you
could use a schema or template/metadata-based extended validation that
is handled uniformly by the bsXML library or its wrapper.
Intermediation allows you to do just about anything. For instance, I
could build, again, a forward and backward chaining rule engine very
easily because I could track the semantics needed: which rules accessed
or modified which 'facts'. One design I had for validation was using a
template (by which, in this case, I mean an XML tree with cardinality of
1 for each possible leaf path with metadata attributes) and using Jython
expressions for extended validation.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Sure, I'm just saying it is less effective than what is possible for
application data object programming.
Very true. I still don't understand why so many people use the DOM, *except*
that XML use has kind-of pushed it onto them. For example, I was recently
writing a DAV server (finally caved in ;-)) and the worst part of the whole
thing was dealing with the XML, which, at the end of he day, is really
irrelevant to the problem space.
Post by Stephen D. Williams
XML is the only candidate to help clean up a lot of application
development situations
OK. I think we'll have to agree to disagree on the assertion here. My
experience has been that XML is somethings convenient, but seldom ideal for
such things.
Fine, what are the candidates that have many of the positives of XML for
enterprise development? INI files? Postscript? CSV?

XML is seldom ideal because the library APIs do not map to the highest
semantic level possible and processing is very inefficient. You could
also argue that some semantics are poorly supported or missing, like
pointers, COW, etc. My goal is XML+Semantic sugar+better API+a few
extra semantics -> much closer to all around ideal data
structure/processing than anything else.

I find it interesting that it is so difficult to illustrate the whole
solution and get agreement that it might be useful or much better. I'm
sure that I'm right and I'll release code soon to prove it. It
reinforces my assessment that Open Source is the way to go. Suggestions
for improvement are welcome. I have convinced a development team in one
important project that SDOM was useful and they seemed to like using it.

sdw
Gavin Thomas Nicol
2003-05-13 14:18:41 UTC
Permalink
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
Valid assertions, but I think (A) and (B) are often true. Also, there are
other reasons, such as type safety etc. that need to be considered too...
it's like the prototyping phase where every data structure is a hashtable
of hashtables ;-)
A and B are almost never true for enterprise web applications, as one
example.
We'll have to agree to disagree on that. My experience is that XML is useful
glue for data integration, and for certain forms of document, and that's
about it.
Post by Stephen D. Williams
First, you could wrap all access with 3GL
application objects that do whatever validation is needed. Second, you
could use a schema or template/metadata-based extended validation that
is handled uniformly by the bsXML library or its wrapper.
Right, and you've added so much more than XML that if you stand back and
*really* think about it, I don't think XML adds much. Why use XML schemas
when Java compilers can do the type checking for me? Why use XML, when EJB
generators do everything for me (bad example ;-)).
Post by Stephen D. Williams
Fine, what are the candidates that have many of the positives of XML for
enterprise development? INI files? Postscript? CSV?
Depends on the application. Postscript is fine for some things, and terrible
for others. INI files just suck ;-)
Post by Stephen D. Williams
XML is seldom ideal because the library APIs do not map to the highest
semantic level possible and processing is very inefficient.
Right, but once you map into the semantic domain, the XML, for all intents and
purposes, should be buried. A good example *is* INI files, resource bundles,
etc. There are clearly good reasons to migrate those things to XML for
*management* purposes, but in an API to them, I should neither know, nor care
that XML is involved... after all, in some cases, I may not *want* to use XML
for managing such things, but rather a relational database, or somesuch.
Post by Stephen D. Williams
I find it interesting that it is so difficult to illustrate the whole
solution and get agreement that it might be useful or much better.
Well, like I said, I've been doing XML/SGML for longer than I care to
remember, so there's nothing you've said that I haven't either done or
thought about. I did LISP/S-expressions before that, and had a close to
religious fervor there (that is one reason I got involved in SGML... similar
enough to S-expressions that I could see nifty information applications
springing from it), so I'm no stranger to the "everything is a tree" or
"everything is a list", or "everything is just slots" viewpoints.

Maybe I just have a jaundiced view on the "one data structure to bind them"
mentality.
Stephen D. Williams
2003-05-13 15:03:06 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
Valid assertions, but I think (A) and (B) are often true. Also, there are
other reasons, such as type safety etc. that need to be considered too...
it's like the prototyping phase where every data structure is a hashtable
of hashtables ;-)
A and B are almost never true for enterprise web applications, as one
example.
We'll have to agree to disagree on that. My experience is that XML is useful
glue for data integration, and for certain forms of document, and that's
about it.
What do you prefer as communication between tiers of an application?
RMI? CORBA? COM? DCOM? SOAP? ;-) When I'm writing tiers in Java, C++,
Python, Perl, PHP, etc., what should I use that isn't an
application-specific hand-packed message format?

I prefer to think in terms of messages, partly because I know that
pipelined, asynchronous message oriented processing is by far the most
efficient distributed processing paradigm, and I want those messages to
be in a canonical, network/language/platform neutral format. At the
same time, I want to avoid the tyranny of IDLs and frozen interfaces.
What's left?
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
First, you could wrap all access with 3GL
application objects that do whatever validation is needed. Second, you
could use a schema or template/metadata-based extended validation that
is handled uniformly by the bsXML library or its wrapper.
Right, and you've added so much more than XML, that if you stand back and
*really* think about it, I don't think XML adds much. Why use XML schemas
when Java compilers can to the type checking for me? Why use XML, when EJB
generators do everything for me (bad example ;-)).
Gavin Thomas Nicol
2003-05-13 15:24:33 UTC
Permalink
Post by Stephen D. Williams
What do you prefer as communication between tiers of an application?
RMI? CORBA? COM? DCOM? SOAP? ;-)
Depends on the application.
Post by Stephen D. Williams
When I'm writing tiers in Java, C++,
Python, Perl, PHP, etc., what should I use that isn't an
application-specific hand-packed message format?
Why do you need something that isn't 'application-specific hand-packed message
format'? I'd choose based on application/architectural need.
Post by Stephen D. Williams
My goal, which has been accomplished with
non-public code, is to have an API that is nearly as good or better than
traditional, native data structures and methods.
That's really a subjective call.
Post by Stephen D. Williams
Exactly. Since you neither know or care that XML is involved, why not
just directly express and operate on data that is XML-like but
efficient, a la bsXML.
...but why? Sounds to me like you want to replace XML, but maintain the
architectures that have sprung up around it. IMHO. That only solves a
smallish part of the problem.
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
Maybe I just have a jaundiced view on the "one data structure to bind
them" mentality.
I can understand that, but I think that my particular synergy has merit.
Maybe not as broadly as I think, but we'll see.
Sure, and for all I know, bsXML might just be the greatest thing since sliced
bread. Time will tell.
Joseph S. Barrera III
2003-05-13 15:38:09 UTC
Permalink
Post by Gavin Thomas Nicol
Maybe I just have a jaundiced view on the "one data structure to bind
them" mentality.
Array of bytes. Works every time.

You're free to add your own structuring on top, of course.

- Joe
Gavin Thomas Nicol
2003-05-13 15:53:43 UTC
Permalink
Post by Joseph S. Barrera III
Array of bytes. Works every time.
;-)
Post by Joseph S. Barrera III
You're free to add your own structuring on top, of course.
That's the problem with streams.... half the time the bit you want is at the
end ;-)
Gavin Thomas Nicol
2003-05-13 04:58:27 UTC
Permalink
Post by Jeff Bone
As for XML, yeah, it's a bitch. But they got the data model right, at
least. Don't confuse the syntax with the semantics.
Which one? I don't remember having one when we did the XML spec, and the ones
since aren't *really* XML (despite pithy titles like "The Essence of XML").
Russell Turpin
2003-05-13 02:56:31 UTC
Permalink
.. all that's lacking is .. and some compositional end-user
tools for XML-based data akin to e.g. grep, cut, etc. (Yes, I'm aware
these exist; they just aren't good enough yet to justify building the
URI-aware shell for them.)
It seems to me that if these tools are going to be
truly compositional, they must produce the same
kind of thing as that which they operate upon, and
the platform must support that kind of thing as a
basic data type for buffering, display, passing
across interfaces, etc. Grep, awk, et al were/are
so useful not just because they all work on text
streams, but also because Unices support text
streams as glue, as building block, as command,
as roofing tile, etc.

In the case of attributed files, "that kind of thing"
is a set of attributed objects. (Here, I use object
not in the OOP sense, but in the sense of
something with identity on which attributes can
be pinned.)

This is almost semantically equivalent to an XML
file, but not quite. To my (very limited)
understanding of XML, every proper XML file
has to reference a schema, where the results
of composition typically will be an object that
mixes fully-qualified attributes across a set of
schemas. In other words, it's necessary to lose
the notion of schema as something that types
or validates an object, and use it simply as
something that qualifies attributes, along with
other schemas, arbitrarily mixed. (I don't know
how the XML gurus think of this.)

Of course, the actual "kind of thing" should
not be XML itself, but attributes objects in
the abstract, for which (almost) XML serves
as a textual representation. Build this into the
platform as the uniform medium that passes
through all the plumbing, and the compositional
tools will emerge, because then they will pay
off.

The problem with pondering operating
systems is that we're not Bill Gates. In that
sense, Beberg may be correct. For the next
few years, rather than being a panacea for
customer problems, XML may mostly provide
work for programmers. (I'm surprised Beberg
didn't see this as an upside. No, wait -- I take
that back.)

Stephen D. Williams
2003-05-13 03:40:40 UTC
Permalink
Post by Russell Turpin
.. all that's lacking is .. and some compositional end-user
tools for XML-based data akin to e.g. grep, cut, etc. (Yes, I'm
aware these exist; they just aren't good enough yet to justify
building the URI-aware shell for them.)
It seems to me that if these tools are going to be
truly compositional, they must produce the same
kind of thing as that which they operate upon, and
the platform must support that kind of thing as a
basic data type for buffering, display, passing
across interfaces, etc. Grep, awk, et al were/are
so useful not just because they all work on text
streams, but also because Unices support text
streams as glue, as building block, as command,
as roofing tile, etc.
Absolutely. I have pined for a complete set of Unix-like XML tools
before, and as noted there are some. We should rethink this and see how
far we can get with the semantics. Obviously you need sgrep (
http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html ), grep that uses
XPATHs, and versions of cut, sort, etc. One of the biggest problems,
and a mistake in the XML 1.0 spec, I think, is the requirement that there
be one root. You end up processing 'XML fragment' documents and
wrapping them with a proper header line and a dummy root element. This
would be an acceptable compromise, IMHO.
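
For example, a sketch of the wrapping workaround in Python (standard
library only; it assumes the fragments don't carry their own XML
declarations):

# fragcat.py -- wrap a stream of rootless XML fragments in a dummy root
# (plus whatever header line you need) so standard XML tools can parse it.
import sys
import xml.etree.ElementTree as ET

def parse_fragments(text):
    wrapped = "<dummy_root>" + text + "</dummy_root>"
    return list(ET.fromstring(wrapped))   # children == original fragments

if __name__ == "__main__":
    # e.g.  some-fragment-producing-tool | python fragcat.py
    for frag in parse_fragments(sys.stdin.read()):
        print(frag.tag)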

Apropos an earlier comment about the problems with /proc, I've commented
before that the Linux community should immediately support .xml versions
of nearly everything under /proc (not mem, etc. of course).
Post by Russell Turpin
In the case of attributed file, "that kind of thing"
is a set of attributed objects. (Here, I use object
not in the OOP sense, but in the sense of
something with identity on which attributes can
be pinned.)
No need to be pedantic. An 'object' in Java, C++, etc., doesn't carry a
copy of its code with each instance. The language, when processing it,
knows that certain code applies to certain object types. An XML
hierarchy is just like an actual data object in a 3GL, right down to the
type reference that can dereference to actual 'object' code. Semantic
sugar. You could argue that for a schema to be complete it should
include actual portable code (Java, whatever) that an engine could load
(since, if validating, it has to load the schema anyway) to validate, or
even process or intermediate access to, an object. The semantic
restrictions of cardinality, etc., are never going to be complete
enough in a schema to fully validate, so this makes some sense if a
consensus could be reached by enough of the market on a portable code
method.
Post by Russell Turpin
This is almost semantically equivalent to an XML
file, but not quite. To my (very limited)
understanding of XML, every proper XML file
No, not 'every proper XML file'. XML 1.0 files are not required to
reference anything external.
Post by Russell Turpin
has to reference a schema, where the results
of composition typically will be an object that
mixes fully-qualified attributes across a set of
schemas. In other words, it's necessary to lose
the notion of schema as something that types
or validates an object, and use it simply as
something that qualifies attributes, along with
Necessary? Pourquoi?
Post by Russell Turpin
other schemas, arbitrarily mixed. (I don't know
how the XML gurus think of this.)
A schema validates and potentially provides default values. A namespace
can act as a sort of multi-inheritance typing method.
Post by Russell Turpin
Of course, the actual "kind of thing" should
not be XML itself, but attributes objects in
the abstract, for which (almost) XML serves
as a textual representation. Build this into the
platform as the uniform medium that passes
through all the plumbing, and the compositional
tools will emerge, because then they will pay
off.
The problem with pondering operating
systems is that we're not Bill Gates. In that
Windows = Microsoft
Linux (and a bit more) = everyone else
There is no problem pondering operating systems if you don't mind trying
your ideas out on Linux.

In actuality, you can implement a good range of interesting things in
Windows also. Look at Cygwin and various virtual filesystems including
Rational ClearCase (I don't want to use it, but it has elegance).
Post by Russell Turpin
sense, Beberg may be correct. For the next
few years, rather than being a panacea for
customer problems, XML may mostly provide
work for programmers. (I'm surprised Beberg
didn't see this as an upside. No, wait -- I take
that back.)
I'm trying to improve that at least a little. Think 4GL STL-like data
collections in 3GL languages.


sdw
Adam L. Beberg
2003-05-13 03:27:26 UTC
Permalink
Post by Stephen D. Williams
However: the schema is flexible (when are relational databases going
to stop using fixed row schema?),
Postgres is pretty darn flexible. Objects and everything, since about
'94.
Post by Stephen D. Williams
the data is self-describing,
So is SQL, unless you give your columns REALLY bad names.
Post by Stephen D. Williams
the protocol is no longer proprietary
ANSI/ISO actually.
Post by Stephen D. Williams
and everyone can work on solving higher level problems.
Just as soon as they all decide what to name their "columns". Which
they never will.

My point was that they didn't solve the problem - I want to R/W your
data - but instead made a bunch of things a whole lot worse.

- Adam L. Beberg - ***@mithral.com
http://www.mithral.com/~beberg/
Stephen D. Williams
2003-05-13 05:10:04 UTC
Permalink
Post by Stephen D. Williams
However: the schema is flexible (when are relational databases going
to stop using fixed row schema?),
Postres is pretty darn flexible. Objects and everything, since about '94.
I was talking about schema migration (adding columns) or the avoidance
thereof.

There is no ideal way, for instance, of creating an inventory database
that includes all of the salient attributes of every item in a
hypermarket. Each of the tens or hundreds of thousands of different
items has an attribute set that it only shares identically with a few
neighbors. If databases, even plain relational databases, stored rows
as XML objects and automatically indexed each node name/value as
metadata/value instances (i.e., each node is a potential column), then
you could have a single table for inventory description and search for
any attribute type/value combination.
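
A rough sketch of that single-table idea using SQLite and ElementTree from
Python; the table layout and element names are invented, and a real system
would index far more cleverly:

# xmlrows.py -- store the whole item as an XML blob, and automatically
# index every node name/value pair in a side table so that any attribute
# becomes queryable, i.e. every node is a potential column.
import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, doc TEXT)")
db.execute("CREATE TABLE item_attr (item_id INTEGER, name TEXT, value TEXT)")
db.execute("CREATE INDEX idx_attr ON item_attr (name, value)")

def insert_item(xml_text):
    cur = db.execute("INSERT INTO item (doc) VALUES (?)", (xml_text,))
    item_id = cur.lastrowid
    for node in ET.fromstring(xml_text).iter():
        if node.text and node.text.strip():
            db.execute("INSERT INTO item_attr VALUES (?, ?, ?)",
                       (item_id, node.tag, node.text.strip()))
    return item_id

insert_item("<item><name>garden hose</name><length_ft>50</length_ft></item>")
insert_item("<item><name>AA battery</name><voltage>1.5</voltage></item>")

# Any node is a potential "column":
rows = db.execute("SELECT item_id FROM item_attr WHERE name=? AND value=?",
                  ("voltage", "1.5")).fetchall()
print(rows)    # [(2,)]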

Another problem is that relational databases essentially force you to
explode objects that are really business documents that always occur
atomically. For a project last year, there was an existing schema of 20
tables to represent a key business object. I reduced that to an XML
document blob and 5 or so fields to be indexed. While you can do that
with a relational database, it would still be nice to be able to query
outside and inside objects seamlessly.
Post by Stephen D. Williams
the data is self-describing,
So is SQL, unless you give your columns REALLY bad names.
Post by Stephen D. Williams
the protocol is no longer proprietary
ANSI/ISO actually.
Really? Although I know some recent protocols have been proposed for
OpenODBC, etc., what ANSI/ISO protocol does MS SQL, Oracle, Sybase, etc.
follow that allows me to write to them without using their libraries?
Post by Stephen D. Williams
and everyone can work on solving higher level problems.
Just as soon as they all decide what to name their "columns". Which
they never will.
My point was that they didn't solve the problem - I want to R/W your
data - but instead made a bunch of things a whole lot worse.
http://www.mithral.com/~beberg/
Jeff Bone
2003-05-13 03:33:11 UTC
Permalink
Post by Adam L. Beberg
Post by Stephen D. Williams
However: the schema is flexible (when are relational databases going
to stop using fixed row schema?),
Postgres is pretty darn flexible. Objects and everything, since about
'94.
Postgres DOES NOT SOLVE THIS PROBLEM. AFAIK, and I've been looking at
this a lot lately, if they have a solution it eludes me. Now, granted,
tables can inherit from and extend other tables, and queries can apply
to a specific type or all subtypes, recursively. But it does not solve
the problem of having lots of objects with arbitrary attributes which
are not known a priori to the database designer.

You've got to do that nasty table-per-attribute or
table-per-attribute-value-type trick, and expensively roll / unroll
things up. Yuck.
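
For anyone who hasn't hit it, the roll/unroll "trick" looks roughly like
this (table and attribute names hypothetical): every attribute lives in
one generic name/value table, so each additional attribute in a predicate
costs another self-join, and unrolling a whole object means pivoting many
rows back into one.

public class AttributeValueRollup {
    // Hypothetical table: obj_attr(obj_id, attr_name, attr_value).
    // Finding objects where color = 'blue' AND kind = 'hammer' needs one
    // self-join per attribute; N attributes in the predicate means N joins.
    static final String TWO_ATTRIBUTE_QUERY =
        "SELECT a1.obj_id " +
        "FROM obj_attr a1 " +
        "JOIN obj_attr a2 ON a2.obj_id = a1.obj_id " +
        "WHERE a1.attr_name = 'color' AND a1.attr_value = 'blue' " +
        "  AND a2.attr_name = 'kind'  AND a2.attr_value = 'hammer'";
}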

jb
Russell Turpin
2003-05-13 04:05:57 UTC
Permalink
Absolutely. I have pined for a complete set of Unix-like XML tools before,
and as noted there are some.
I think it may be a mistake to focus on XML,
rather than on attributed objects (or whatever
we want to call the underlying data model).
XML is a specific character-oriented presentation,
whereas the logic one wants to implement is
really a much cleaner abstraction. Yes, fine,
use XML as an interchange format for sending
down the wire, or for export and import. But
don't build an OS around a specific text format,
or even a set of platform tools.

BTW, one of the powerful things about this
data model is that a large part of inferential
logic is tractable for it, and there are
reasonable rules for identifying that part.
This won't be so much database as file
system, as it will be deductive database as
file system.

Stephen D. Williams
2003-05-13 04:58:10 UTC
Permalink
When I say, XML, I always have in mind my bsXML
soon-to-be-proposed-derivative-standard. bsXML is a portable format for
chunked data spans that have XML 1.0 semantics but involve no parsing,
can be modified in-place inexpensively, and have other semantics like
cheap intra-object pointers and copy-on-write.

Additionally, SDOM makes it reasonable to use this structure for the one
and only business-object data structure by replacing the
navigate-traverse-getter and create-populate-link model of DOM with
something more akin to setters/getters and STL-like collection interfaces:

raw OO 3GL:
message.customer.address.setCountry("US");

SDOM/SPATH:
message.set("customer/address/country", "US");

The DOM version is long; the SAX version boils down to a lot of
parse-management code plus the 3GL example.
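
For comparison, a rough sketch of what the "long" DOM version tends to
look like in Java (org.w3c.dom; the element names are hypothetical): each
level of the path has to be located or created by hand before the value
can be set.

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomSetCountry {
    // Roughly the DOM equivalent of message.set("customer/address/country", "US"):
    // walk (or create) each intermediate element by hand, then set the text.
    public static void setCountry(Document message) {
        Element customer = child(message.getDocumentElement(), "customer", message);
        Element address  = child(customer, "address", message);
        Element country  = child(address, "country", message);
        country.setTextContent("US");
    }

    // Find the first element with the given name under parent
    // (getElementsByTagName searches all descendants, which is close
    // enough for this sketch), creating a child element if none exists.
    private static Element child(Element parent, String name, Document doc) {
        NodeList found = parent.getElementsByTagName(name);
        if (found.getLength() > 0) {
            return (Element) found.item(0);
        }
        Element created = doc.createElement(name);
        parent.appendChild(created);
        return created;
    }
}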

This does actually give you attributed objects that can be readily
converted to textual XML 1.0, but with the characteristics much closer
to a 3GL structure/object. I think of it as a bit of semantic sugar
that avoids creation of a lot of 3GL expression when you are dealing
with objects that are represented externally anyway.

I've been talking about this for quite a while (since 1998), have implemented
both the front-side and back-side in projects, and am finally putting it
all together as open source. Oddly, few seem to see enough merit to
think I'll be widely successful, but that isn't necessarily a bad sign.

sdw
Post by Russell Turpin
Post by Stephen D. Williams
Absolutely. I have pined for a complete set of Unix-like XML tools
before, and as noted there are some.
I think it may be a mistake to focus on XML,
rather than on attributed objects (or whatever
we want to call the underlying data model).
XML is a specific character-oriented presentation,
whereas the logic one wants to implement is
really a much cleaner abstraction. Yes, fine,
use XML as an interchange format for sending
down the wire, or for export and import. But
don't build an OS around a specific text format,
or even a set of platform tools.
BTW, one of the powerful things about this
data model is that a large part of inferential
logic is tractable for it, and there are
reasonable rules for identifying that part.
This won't be so much database as file
system, as it will be deductive database as
file system.
Jeff Bone
2003-05-13 05:07:07 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Jeff Bone
As for XML, yeah, it's a bitch. But they got the data model right, at
least. Don't confuse the syntax with the semantics.
Which one? I don't remember having one when we did the XML spec. and
the ones
since aren't *really* XML (despite pithy titles like "The Essence of
XML").
Okay, that's a reasonable point. I guess I mean the kind of 2-d
intuitive model most people assume.

jb
Gavin Thomas Nicol
2003-05-13 05:28:23 UTC
Permalink
Post by Jeff Bone
Okay, that's a reasonable point. I guess I mean the kind of 2-d
intuitive model most people assume.
If you mean the tree-like thing... I personally think most models I've seen of
XML are far too syntax-specific. Even the good work by Wadler (taken as such)
is really too tied to the syntax.

That said, there's a reason for that: the data models are for *XML* not what
the XML *represents*. This is the biggest problem with most uses of XML...
they treat XML as the canonical form, when in many cases it should be treated
as an encoding.

I have an old saying: "The best XML is the XML you don't see".
Stephen D. Williams
2003-05-13 05:39:08 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Jeff Bone
Okay, that's a reasonable point. I guess I mean the kind of 2-d
intuitive model most people assume.
If you mean the tree-like thing... I personally think most models I've seen of
XML are far too syntax-specific. Even the good work by Wadler (taken as such)
is really too tied to the syntax.
That said, there's a reason for that: the data models are for *XML* not what
the XML *represents*. This is the biggest problem with most uses of XML...
they treat XML as the canonical form, when in many cases it should be treated
as an encoding.
I have an old saying: "The best XML is the XML you don't see".
Ahh, but when should XML BE the canonical form? I posit that in most
business application situations, XML should be the canonical form. You
are interfacing with XML, you are storing it, loading it, logging to it,
designing schemas, etc. Why do you need another form that gets encoded
into XML when the object has to go in and out of XML constantly anyway?
Why introduce the impedance mismatch? Conversion overhead? Extra
programming steps? bsXML and SDOM are centered around the idea that
XML, or at least an efficient equivalent representation of it, can be
the canonical form, thus obviating a lot of work and overhead.

This is quite separate from protocol, multimedia, and other cases where
it makes sense to stay close to the base metal, BEEP notwithstanding.
Even application infrastructure, if it is stable, well designed, and
performant, shouldn't necessarily be expressed in XML. While my bsXML
work is intended to broaden the range of applications that can use
something XML-like (in the sense of a complete derivative, not subset)
efficiently, the goal is business objects. Business objects are special
because they must be self-describing, sometimes for posterity, they have
to ease schema migration (adding or changing fields), and are frequently
used to interface with external entities and projects.

sdw
Gavin Thomas Nicol
2003-05-13 06:18:20 UTC
Permalink
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
I have an old saying: "The best XML is the XML you don't see".
Why do you need another form that gets encoded
into XML when the object has to go in and out of XML constantly anyway?
But do you really model the object in XML, or *encode* it in XML? What is the
real object?

I am a firm believer in what I call "edge transformations", where, at the
edges of my application, I convert to whatever form your application needs
for purposes of integration. Currently, this interchange format is for the
most part, encoded in XML (with edge transformations there too, to handle
dialect differences). This is a good thing... but it does *not* imply that my
objects are, or should be, XML.
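
A minimal sketch of the edge-transformation idea (Java assumed; the class
and element names are invented for illustration): the application works
with a plain internal object, and only the boundary code turns it into
the XML dialect the integration partner expects.

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EdgeTransform {
    // Plain internal object; nothing about it is XML.
    static class Customer {
        String id;
        String country;
    }

    // The "edge": convert the internal object into the XML the
    // integration partner expects, only when it leaves the application.
    public static String toPartnerXml(Customer c) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("customer");
        root.setAttribute("id", c.id);
        Element country = doc.createElement("country");
        country.setTextContent(c.country);
        root.appendChild(country);
        doc.appendChild(root);

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
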
Post by Stephen D. Williams
Extra programming steps? bsXML and SDOM are centered around the idea that
XML, or at least an efficient equivalent representation of it, can be
the canonical form, thus obviating a lot of work and overhead.
bsXML and SDOM need to be converted to text to be considered XML for purposes
of interchange. How are they then any different from JavaBeans, or
S-expressions, or any number of other object models that can be easily
serialised into XML? Why should I introduce a foreign object model into my
programming environment, possibly resulting in the inability to use proper
runtime/static type checking etc?
Post by Stephen D. Williams
Business objects are special because they must be self-describing,
The objects or the encoding thereof?
Stephen D. Williams
2003-05-13 06:43:47 UTC
Permalink
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Post by Gavin Thomas Nicol
I have an old saying: "The best XML is the XML you don't see".
Why do you need another form that gets encoded
into XML when the object has to go in and out of XML constantly anyway?
But do you really model the object in XML, or *encode* it in XML? What is the
real object?
A bsXML object has the same format on the wire as it does in memory
(although, after modifications, it might optionally be 'condensed'). You
could consider it an encoding of XML 1.0 or XML 1.0 as an encoding of
it. You are modeling the object in bsXML, which from a conceptual and
API perspective is the same as modeling it in XML.
Post by Gavin Thomas Nicol
I am a firm believer in what I call "edge transformations", where, at the
edges of my application, I convert to whatever form your application needs
for purposes of integration. Currently, this interchange format is for the
most part, encoded in XML (with edge transformations there too, to handle
dialect differences). This is a good thing... but it does *not* imply that my
objects are, or should be, XML.
Of course, you typically need to be able to integrate with many things.
My thinking is that a bsXML library will give you a common edge that
can talk either bsXML (which is intended to be a network-portable
standardized format) or XML 1.0 (via someone else's parser, at least
initially, that does SAX->bsXML).
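
A rough sketch of what such a SAX-driven adapter might look like
(standard Java SAX assumed; the target structure below is a stand-in, not
bsXML's actual layout): someone else's parser produces the events, and
the handler builds whatever internal representation the library wants.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Builds a generic in-memory node tree from SAX events; a real
// SAX->bsXML adapter would build the bsXML block layout instead.
public class SaxToTreeHandler extends DefaultHandler {
    public static class Node {
        public final String name;
        public final StringBuilder text = new StringBuilder();
        public final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Deque<Node> stack = new ArrayDeque<>();
    private Node root;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Node node = new Node(qName);
        if (root == null) {
            root = node;
        } else {
            stack.peek().children.add(node);
        }
        stack.push(node);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        stack.peek().text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        stack.pop();
    }

    public Node getRoot() { return root; }
}

Feeding a document through javax.xml.parsers.SAXParserFactory.newInstance()
.newSAXParser().parse(file, handler) then yields the tree via getRoot().
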
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Extra programming steps? bsXML and SDOM are centered around the idea that
XML, or at least an efficient equivalent representation of it, can be
the canonical form, thus obviating a lot of work and overhead.
bsXML and SDOM need to be converted to text to be considered XML for purposes
of interchange. How are they then any different from JavaBeans, or
S-expressions, or any number of other object models that can be easily
serialised into XML? Why should I introduce a foreign object model into my
programming environment, possibly resulting in the inability to use proper
runtime/static type checking etc?
If every transition requires XML 1.0 representation on one side, then
you lose much, but not all, of the efficiency gains. The idea is to
stabilize the bsXML format as a de facto (and eventually formal) standard.
This won't happen immediately since the goal is efficiency and I will
want people to beat my initial data structure methods while still
meeting my requirements and constraints. The difference is that, at
least when you have bsXML aware applications on each side, there is no
parsing or serialization. There is only simple, efficient IO of blocks
ready to go without processing (or with very simple processing to
condense 'stretched' memory). Numerous methods exist to provide a nice
development model, but all of those incur extra overhead rather than
potentially removing nearly all of it.
Post by Gavin Thomas Nicol
Post by Stephen D. Williams
Business objects are special because they must be self-describing,
The objects or the encoding thereof?
In this sense, it's the encoding that must be self-describing, while the
objects themselves could benefit from also being self-describing.

sdw
Adam L. Beberg
2003-05-13 06:34:01 UTC
Permalink
Post by Stephen D. Williams
There is no ideal way, for instance, of creating an inventory database
that includes all of the salient attributes of every item in a
hypermarket. Each of the tens or hundreds of thousands of different
items has an attribute set that it only shares identically with a few
neighbors. If databases, even plain relational databases, stored rows
as XML objects and automatically indexed each node name/value as
metadata/value instances (i.e., each node is a potential column), then
you could have a single table for inventory description and search for
any attribute type/value combination.
I always viewed this as a simple tradeoff of flexibility vs. performance.
I used to toy with all the dynamic stuff of Postgres, but you know
what, it was SLOW. Sure I _can_ have dynamic attributes on everything
in the store, but how useful are they?
If I have to search through the entire raw data set every time, it's
going to suck.

In other words, with XML, your metadata (indexes, tags, parsers, etc.)
is larger than your data. You're better off just parsing the raw data
every time you run a query than trying to look in the metadata, which is
larger.

I can't find the screwdrivers in inventory because the database is
currently busy worrying which hammers are blue.

So what do you do, you index (row) the stuff that matters (columns),
and blob the rest. Just like we did in the dark ages, *gasp*!

I have yet to see a single use for XML that isn't completely
unacceptable performance-wise other than file exchange. But we'll be
fixing all the damage XML is causing for years, so I'm all for other
people using it. I need to get back to my roots of 3rd generation
clients anyway - screwed by a geek, screwed by a corp, then they are
ready to get serious and listen.
Post by Stephen D. Williams
Really? Although I know some recent protocols have been proposed for
OpenODBC, etc., what ANSI/ISO protocol does MS SQL, Oracle, Sybase,
etc. follow that allows me to write to without using their libraries?
Oh that, yea that's just a mess. It's also their only way to lock in
users. They are all almost identical except for price until you get
over 10TB of data, and how many users is that?

- Adam L. Beberg - ***@mithral.com
http://www.mithral.com/~beberg/
James Rogers
2003-05-13 06:47:18 UTC
Permalink
Post by Jeff Bone
Extensible attributes take care of this in two ways: first, harvesters
can post-facto (perhaps triggered by change notifications) dig in and
grab extensible attributes or even synthesize them. We do some of this
today at Deepfile, and the approach is sound and scalable. But more
importantly, it de-emphasizes the need for different file formats. If
attributes could be stored, accessed, queried, etc. via a common
system-wide API --- perhaps via a wrinkle in the filesystem itself,
attribute directories == directories, attributes == files, values ==
file contents ala ReiserFS, BeFS, and others --- then over time people
would leverage this rather than trying to cram it all into yet another
file format or XML Schema. (Indeed, I expect that over time filesystem
access and XML processing might cease to be distinct: what if the
filesystem, abstractly, IS an XML object? And/or vice versa?
There's also a nice, incremental path here --- mapping XSD directly
onto the DB, leveraging system indexing and change notification, etc.
Sweet.)
The problem with this is that it ultimately fails because it is brittle.
Not that it doesn't do a good job compared to what we have now. The
brittleness in this context is that the measure of information is synthetic,
arbitrary, and absolute. Your idea of meta-data may not be my idea of
meta-data, and even if it is, the meta-data may not even be accessible
without a context-sensitive transform operation that is most definitely NOT
universal in every example I've seen given.

The second problem is that the mechanisms described cannot be BOTH universal
AND tractable. It's in the math; you can't get there from XML or analogous
protocols. Sorry.

Fortunately, mathematics clearly indicates that there are tractable,
scalable solutions for finite systems, but it does not take more than
cursory analysis to prove that XML-based (and similar) systems are a
figurative breach of the protocol. Unfortunately, it is apparent that few
people have looked at the mathematics of the problem. The problem,
ultimately, is the insistence of treating a file as a discrete object, or
even a collection of discrete objects. That assumption inevitably leads to
a representational format that is easy to use within limited scopes and
rigid rules, but which is by no means universal and is trivial to "break".
What is needed is a format that implicitly orders and correlates all
abstract data whether conceived of by the author or not. Fortunately, there
are extraordinarily efficient ways to do this (provably so), but I don't
want to go off on this mathematical tangent tonight.

Incidentally, I cannot for the life of me figure out what is so spiffy about
XML. Other than being a convention, it brings nothing to the table. People
could have agreed on any number of format standards, including quite a few
with a much smarter and more compact representation of data in general. And
from an information theoretic standpoint, it is butt ugly any way you slice
it. For example, I always found the 8-bit stack machine protocols/formats
to be fascinating, though horribly unwieldy to use in practice. But then,
I am one of those perverse individuals who never distinguishes between data
and machine, which probably explains why I find XML painful and inefficient
in practice.

Cheers,

-James Rogers
***@best.com
Stephen D. Williams
2003-05-13 07:05:26 UTC
Permalink
Post by James Rogers
...
Incidentally, I cannot for the life of me figure out what is so spiffy about
XML. Other than being a convention, it brings nothing to the table. People
could have agreed on any number of format standards, including quite a few
with a much smarter and more compact representation of data in general. And
from an information theoretic standpoint, it is butt ugly any way you slice
it. For example, I always found the 8-bit stack machine protocols/formats
to be fascinating, though horribly unwieldy to use in practice. But then,
I am one of those perverse individuals who never distinguishes between data
and machine, which probably explains why I find XML painful and inefficient
in practice.
I thought the same thing for a while. IMHO, the 'magic' of XML has
almost nothing to do with the format, syntax, or most details, some of
which (DTDs) are downright ugly. The great thing about XML, beyond the
fact that it has extensible structure, is self-describing, supports
Unicode, etc., is that it is surrounded by an expectation of advanced
idioms of usage. These idioms could have been implemented earlier, but
just like a number of great ideas that were expressed in Java, their
time had come yet they needed the right environment. This has
effectively moved the industry forward even though most of these ideas
were fairly obvious because the general level of sophistication has
increased.

For instance, prior to the use of XML (and SGML) for data encoding, you
would hand code data structures, conversion and parsing code, and be
done with it. That was the standard method. Documentation was loosely
linked to the actual code. With XML, you had critical mass to create
and expect support for DTDs/Schemas that were an executable (for
validation at least) specification of your format.
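
As one concrete instance of the "executable specification" point (Java's
javax.xml.validation assumed; the file arguments are hypothetical), the
same schema document that describes the format can be run as a validator:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateAgainstSchema {
    // The schema is both documentation and an executable check: the same
    // file that describes the format can reject non-conforming instances.
    public static void validate(File schemaFile, File instanceFile) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(schemaFile));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(instanceFile));  // throws if invalid
    }
}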

Another example is an application receiving a message, operating on
it, and sending it on to another application. Prior to XML, and still
in practice with CORBA, COM, etc., you could create a hard-coded
interface with your exact parameters and then have to version it when there
were changes. With XML, an application is expected, although not
required, to be tolerant of new extensions to the data format, and even
to preserve them when modifying objects, without necessarily failing.
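
A minimal sketch of that tolerant behaviour (Java DOM assumed; element
names hypothetical): read the fields you know about, ignore what you
don't recognize, and leave the Document itself untouched so unknown
extensions survive re-serialization.

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TolerantReader {
    // Pull out the one field this application cares about; elements added
    // by a newer version of the format are simply left in place in the
    // Document and survive the round trip untouched.
    public static String readStatus(Document order) {
        NodeList status = order.getDocumentElement().getElementsByTagName("status");
        return status.getLength() > 0 ? status.item(0).getTextContent() : null;
    }
}
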
Post by James Rogers
Cheers,
-James Rogers
Gavin Thomas Nicol
2003-05-13 14:30:36 UTC
Permalink
Post by James Rogers
Incidentally, I cannot for the life of me figure out what is so spiffy
about XML. Other than being a convention, it brings nothing to the table.
That's another satori most people miss. XML itself is nothing... but the
network effects around it are great. That's really the point of any
standard... to generate network effects through a well-defined level of
interoperability.
Post by James Rogers
But then, I am one of those perverse individuals who never
distinguishes between data and machine, which probably explains why I find
XML painful and inefficient in practice.
Maybe... but if you flip your head sideways and adopt the "a program is just
data, and data is just another kind of program" mentality, XML fits in. It's
not pretty, I agree. I got really interested in SGML because to me, SGML/XML
documents can be looked at as programs and/or data.

That view is fundamentally different from XML/SGML as *text*.
Jeff Bone
2003-05-13 06:52:07 UTC
Permalink
Post by James Rogers
The problem with this is that it ultimately fails because it is
brittle.
Which "this" are you talking about, James? There was a chunk of
context there. You're either failing to parse, talking out of your
ass, or being uselessly nonspecific. Please clarify your damage. ;-)

jb
Russell Turpin
2003-05-13 13:19:53 UTC
Permalink
Ahh, but when should XML BE the canonical form? ..
For what started this conversation, I think it should NOT
be. The underlying canonical data that is used as glue
on a platform, to communicate between applications, to
undergird scripting and plumbing, etc., should be based
on a clean data model. XML might be a default
expression of it, but it should not be defined as XML.
I posit that in most business application situations, XML should be the
canonical form. You are interfacing with
XML, you are storing it, loading it, logging to it, designing schemas, etc.
..
You're describing an interchange format, which needs
to be generated and parsed only at the far edge where
it has to be put into a text container. There's no reason
to generate and reparse simply for gluing applications
and utilities together. I'm tempted to argue efficiency
-- i.e., we don't want to store all those files, including
a good number of small ones, as XML -- but really the
issue is that for thinking about a better platform, one
wants to start with a clean data model, and not get
too tied to something that is already as crufty as XML.

Now, for other purposes, yeah, maybe it should just
be XML.

Gavin Thomas Nicol
2003-05-13 14:22:37 UTC
Permalink
Post by Russell Turpin
For what started this conversation, I think it should NOT
be. The underlying canonical data that is used as glue
on a platform, to communicate between applications, to
undergird scripting and plumbing, etc., should be based
on a clean data model. XML might be a default
expression of it, but it should not be defined as XML.
I think this is the small satori most people miss.

XML is *not* a data model, but an encoding thereof, and the XML data model(s)
that do exist are clearly not even close to ideal for most problem domains.
Stephen D. Williams
2003-05-13 14:28:19 UTC
Permalink
Post by Russell Turpin
Ahh, but when should XML BE the canonical form? ..
For what started this conversation, I think it should NOT
be. The underlying canonical data that is used as glue
on a platform, to communicate between applications, to
undergird scripting and plumbing, etc., should be based
on a clean data model. XML might be a default
expression of it, but it should not be defined as XML.
I'm talking about using bsXML, not XML. You're arguing against XML, not
bsXML, which is a qualitatively different situation, as I've described
it. I should have been more clear in my statement.

"When should a conceptual model compatible with XML expressed in an
efficient bsXML data model be the canonical form/model?"

What are the candidates for a 'clean data model' besides text
files, XML files, and something like bsXML?
Post by Russell Turpin
I posit that in most business application situations, XML should be
the canonical form. You are interfacing with
XML, you are storing it, loading it, logging to it, designing
schemas, etc. ..
You're describing an interchange format, which needs
to be generated and parsed only at the far edge where
it has to be put into a text container. There's no reason
to generate and reparse simply for gluing applications
and utilities together. I'm tempted to argue efficiency
-- i.e., we don't want to store all those files, including
a good number of small ones, as XML -- but really the
issue is that for thinking about a better platform, one
wants to start with a clean data model, and not get
too tied to something that is already as crufty as XML.
What kind of clean data model? What do you use for gluing applications
and utilities together? What is out there that doesn't require parsing,
conversion, and expression in a memory data structure that is
significantly different than wire/file format?

Down with serialization, down with conversion code, stamp out parsing of
application data except at the edges (e.g., web serving, interfacing to
'legacy XML', debugging). I'm synthesizing ideas like zero copy,
pipelining, COW, virtual pointers (think of the way that mark and point
work in Emacs), 'elastic memory', etc. to bypass processing and code
that is now required everywhere.
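
A heavily simplified sketch of the "same layout on the wire and in
memory" idea (the record layout here is invented for illustration and is
not bsXML's actual format): fields are read straight out of the received
buffer at computed offsets, so "deserialization" is just arithmetic.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class InPlaceRecordView {
    // Toy record layout (NOT bsXML): [int nameLen][name bytes][int valueLen][value bytes].
    // The buffer received off the wire is used directly; the structure is
    // never unpacked into a separate object graph, and only the requested
    // field is materialized on demand.
    private final ByteBuffer buf;

    public InPlaceRecordView(ByteBuffer receivedBuffer) {
        this.buf = receivedBuffer;
    }

    public String name() {
        int len = buf.getInt(0);
        return readString(4, len);
    }

    public String value() {
        int nameLen = buf.getInt(0);
        int valueOffset = 4 + nameLen;
        int len = buf.getInt(valueOffset);
        return readString(valueOffset + 4, len);
    }

    private String readString(int offset, int len) {
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) {
            bytes[i] = buf.get(offset + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
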
Post by Russell Turpin
Now, for other purposes, yeah, maybe it should just
be XML.
sdw
--
Stephen D. Williams ***@lig.net http://sdw.st 703-724-0118W 703-995-0407Fax
Professional: ***@hpti.com www.hpti.com 703-371-9362W
Russell Turpin
2003-05-13 15:02:57 UTC
Permalink
What are the candidates for a 'clean data model' besides text
files, XML files, and something like bsXML?
As an initial caveat, let me publicly disclaim familiarity with
bsXML. I don't know how much of what I say next
overlaps with it.

Second, I'm going to speak in the spirit of imagining a
better platform. Thinking in this mode, people should
free themselves from the constraints of the applications
they now work with, of necessity on existing platforms,
which supposedly suffer in comparison to the new and
improved platform we're trying to conceive.

With those caveats, I would suggest creating a fairly
formal model of attributed objects, much in the same
fashion that there are fairly formal models of
s-expressions, the relational algebra, etc. The
developmental sequence might be something like:
object identity and flat object space, domains of
simple values (integers, strings, dates, etc.) and
operations on them, attributes as maps from objects
to objects or value domains, attribute typing, attribute
inference rules (probably as Horn clauses), and finally
the definition of a database/"file"/scope as (a) set of
objects, (b) attributions on those objects, and
(c) set of rules. With that as a basis, then proceed to
define queries, updates, scope composition, etc.

Programs and scripts can both operate on a scope,
and instantiate parts of the data model, e.g., by
providing attributions for opaque files. The platform
provides the core elements of this data model, and
the framework for programs to work on it or
supply parts of it. Instead of a traditional file system,
the platform supports the operation of making a
scope persistent. Scopes that aren't persistent get
reclaimed when all relevant programs terminate.
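
A minimal sketch of the attributed-object scope described above (Java
assumed; everything here is illustrative, not a spec): objects are bare
identities in a flat space, attributes are maps from objects to values or
other objects, and a scope bundles the objects, their attributions, and
(eventually) rules.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Scope {
    // Objects are bare identities in a flat object space.
    public static final class Obj {
        private final long id;
        public Obj(long id) { this.id = id; }
        public long id() { return id; }
    }

    // (a) the set of objects, (b) attributions on those objects:
    // attribute name -> (object -> value), where a value may itself be an Obj.
    private final Set<Obj> objects = new HashSet<>();
    private final Map<String, Map<Obj, Object>> attributes = new HashMap<>();
    // (c) rules (e.g. Horn clauses) would live here; omitted in this sketch.

    public void add(Obj o) { objects.add(o); }

    public void attribute(Obj o, String name, Object value) {
        objects.add(o);
        attributes.computeIfAbsent(name, k -> new HashMap<>()).put(o, value);
    }

    // A trivial query: all objects carrying a given attribute value.
    public Set<Obj> withAttribute(String name, Object value) {
        Set<Obj> result = new HashSet<>();
        Map<Obj, Object> attr = attributes.getOrDefault(name, Map.of());
        for (Map.Entry<Obj, Object> e : attr.entrySet()) {
            if (value.equals(e.getValue())) result.add(e.getKey());
        }
        return result;
    }
}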

Now yeah, you can roll an XML schema and
representation for a scope, at any point in time.
(Note, however, that its schema might change,
since new attributions can be added on the fly.)
But this is just a way to represent the data model
in a text file. If bsXML has a clean definition like the
above, then great, maybe that's the right choice.
That would surprise me a bit, just because XML is
so crufty. But like I said, I don't know it, so I
can't say.

Stephen D. Williams
2003-05-13 15:16:29 UTC
Permalink
I only grok part of what you want. Could you put together some kind of
straw-man example of a model that satisfies your rich meta-model?
(Extra points for XML representation! ;-) )

What do the "inference rules" and "set of rules" do? How are they used?
Validation and what else?

sdw
Post by Russell Turpin
Post by Stephen D. Williams
What are the candidates for a 'clean data model' besides text
files, XML files, and something like bsXML?
As an initial caveat, let me publicly disclaim familiarity with
bsXML. I don't know how much of what I say next
overlaps with it.
Second, I'm going to speak in the spirit of imagining a
better platform. Thinking in this mode, people should
free themselves from the constraints of the applications
they now work with, of necessity on existing platforms,
which supposedly suffer in comparison to the new and
improved platform we're trying to conceive.
With those caveats, I would suggest creating a fairly
formal model of attributed objects, much in the same
fashion that there are fairly formal models of
s-expressions, the relational algebra, etc. The
object identity and flat object space, domains of
simple values (integers, strings, dates, etc.) and
operations on them, attributes as maps from objects
to objects or value domains, attribute typing, attribute
inference rules (probably as Horn clauses), and finally
the definition of a database/"file"/scope as (a) set of
objects, (b) attributions on those objects, and
(c) set of rules. With that as a basis, then proceed to
define queries, updates, scope composition, etc.
Programs and scripts can both operate on a scope,
and instantiate parts of the data model, e.g., by
providing attributions for opaque files. The platform
provides the core elements of this data model, and
the framework for programs to work on it or
supply parts of it. Instead of a traditional file system,
the platform supports the operation of making a
scope persistent. Scopes that aren't persistent get
reclaimed when all relevant programs terminate.
Now yeah, you can roll an XML schema and
representation for a scope, at any point in time.
(Note, however, that its schema might change,
since new attributions can be added on the fly.)
But this is just a way to represent the data model
in a text file. If bsXML has a clean definition like the
above, then great, maybe that's the right choice.
That would surprise me a bit, just because XML is
so crufty. But like I said, I don't know it, so I
can't say.
--
Stephen D. Williams ***@lig.net http://sdw.st 703-724-0118W 703-995-0407Fax
Professional: ***@hpti.com www.hpti.com 703-371-9362W
James Rogers
2003-05-15 20:22:43 UTC
Permalink
Post by Jeff Bone
Which "this" are you talking about, James? There was a chunk of
context there. You're either failing to parse, talking out of your
ass, or being uselessly nonspecific. Please clarify your damage. ;-)
Sorry about that -- it was late and I was tired.

My critique was less of your system than of the approach to the underlying
problem. <terminology type="math" field="Kolmogorov Information Theory">

The brittleness is in the requirement of an explicit meta-data system
external to the data itself as the information representation scheme. This
can't be both universal/general AND tractable at the same time using the
kind of representation you are talking about. Now, you don't seem to be
talking about the universal case, which is why it is tractable, but to really
solve the problem in the long term the universal case must be addressed.

Scalable meta-data representation is a hard problem, because universal
solutions generally don't scale beyond toy cases. All zero-order type
representations (what you are talking about) have a hard resource take-off
that occurs VERY quickly in the universal case. This is actually mandated by
the mathematics no matter what your representation scheme is, even for the
most efficient n-order representational systems. To give an idea of the
tractability problem of zero-order representation systems, the limit on
current systems hits the wall at about 5 bytes worth of information using
optimal universal zero-order algorithms.

The problem, then, is exponent management. Most of what we use in software
today and what you are talking about is non-universal zero-order information
representation. The value of this is that it is a very well-understood bit
of computer science and easy to implement, which is why they are the "go-to"
scheme everywhere in software. Unfortunately, you can't build a scalable
universal system this way so we end up with systems that are either A)
non-universal but scalable, or B) universal but not scalable. Since B) is
useless for most real work, we are stuck with A).

Where it gets interesting is that we now know that very efficient high-order
information representation systems (which qualitatively look quite different
than traditional zero-order algorithms) have a sufficiently small exponent
that you can build large and useful systems in the real world without the
resource take-off becoming intractable. Unfortunately, the number of people
that really grok this esoteric corner of theoretical computer science can
probably be counted on your fingers (assuming, of course, that you aren't a
table saw operator). Mathematically, this is off in one of the darker but
more interesting corners of algorithmic information theory, and relatively
little is published on it even though it has become a hot topic in some
circles over the last few years.

So by "brittle", what I meant was that you cannot have a useful universal
solution using the representational models you are describing in the
abstract for your software even though it is a fine non-universal solution.
Consequently, any single implementation will "suck" to one degree or another
for someone trying to use it for any given purpose. We know that better
solutions can be built, possibly even ones that are truly universal, but it
requires taking a significant vacation conceptually from where you are now
as far as I can tell.

Cheers,

-James Rogers
***@best.com
Jim Whitehead
2003-05-15 22:48:10 UTC
Permalink
Unfortunately, the number of people
that really grok this esoteric corner of theoretical computer science can
probably be counted on your fingers (assuming, of course, that
you aren't a table saw operator).
OK, I'll bite. What are one or two references that can serve as an entry
point into this literature?

- Jim
James Rogers
2003-05-16 18:19:16 UTC
Permalink
Post by James Rogers
Unfortunately, the number of people
that really grok this esoteric corner of theoretical
computer science
can probably be counted on your fingers (assuming, of
course, that you
aren't a table saw operator).
OK, I'll bite. What are one or two references that can serve
as an entry point into this literature?
Start at Solomonoff induction and universal predictors and head west. It
takes a couple significant theoretical leaps from readily available
publication to what I was talking about. In particular, the finite
expression of these mathematical constructs is very relevant, but the amount
of published papers on this is VERY limited, and usually targeted at other
aspects of it. You'll also find some distant relevance in standard coding
theory and non-axiomatic systems, so being familiar with these helps.

It is essentially all Kolmogorov information theory of one flavor or another,
and this in particular cuts across multiple sub-genres. A problem in this
particular area of mathematics (as I see it) is that only a small amount of
the current research is published, in no small part because new results in
this field can have significant commercial value. As a result, where the
pavement ends is a fair distance from where the pioneers actually are and
you have to hoof it for quite a bit of the mathematics to get to the real
frontier. If you've been working out in that part of math-land forever (and
I have), it is pretty easy to keep up and the context is sufficient to know
what everyone else is doing in the field even if they don't tell you
everything. But for someone starting out, they have to bridge the
"publication gap" themselves which takes time. The hapax legomenon in the
above will take you to the edge of the pavement.

This has been roughly my primary area of theoretical research for many
years, so I am as guilty of non-publication as the next guy. Lately it has
become a somewhat "hot" area for theoretical computer science, so there is
more being written about it now than in the past. A lot of the good
discussion happens on lists or in person between researchers rather than in
publication or at public conferences. It isn't too difficult to find
evidence of the universality and tractability/exponentiation issues I was
talking about, but you'll find almost no details on the proofs of such, just
implicit agreement from various mathematics folks working on it. I greatly
simplified the space/issue for the sake of readability.

Cheers,

-James Rogers
***@best.com
Stephen D. Williams
2003-05-16 20:27:44 UTC
Permalink
I still haven't seen you lay out a clearly defined problem and solution,
or at least the characteristics of a solution.

A) Sounds like a lot of mental masturbation.

B) If you can't tell the world what you have been doing, i.e. publish,
then you might as well not have done it. (And source code generally
beats beating around the bush with partial revelation.)

C) Name some specific applications, algorithms, and problems that this
is crucial to solving. Why do I care?

I suspect that you could draw relatedness comparisons to 'complex' and
'advanced' theories all day long on a large number of systems and
algorithms, say all of the optimization methods in gcc for instance, but
that doesn't make the mathematical analysis worthwhile. What grand
mathematical analysis led to Huffman coding, for instance? (None, he
just thought it up one day according to the story.)

sdw
Post by James Rogers
Post by James Rogers
Unfortunately, the number of people
that really grok this esoteric corner of theoretical
computer science
can probably be counted on your fingers (assuming, of
course, that you
aren't a table saw operator).
OK, I'll bite. What are one or two references that can serve
as an entry point into this literature?
Start at Solomonoff induction and universal predictors and head west. It
takes a couple significant theoretical leaps from readily available
publication to what I was talking about. In particular, the finite
expression of these mathematical constructs is very relevant, but the amount
of published papers on this is VERY limited, and usually targeted at other
aspects of it. You'll also find some distant relevance in standard coding
theory and non-axiomatic systems, so being familiar with these helps.
It is essentially all Kolmogorov information theory of one flavor or another,
and this in particular cuts across multiple sub-genres. A problem in this
particular area of mathematics (as I see it) is that only a small amount of
the current research is published, in no small part because new results in
this field can have significant commercial value. As a result, where the
pavement ends is a fair distance from where the pioneers actually are and
you have to hoof it for quite a bit of the mathematics to get to the real
frontier. If you've been working out in that part of math-land forever (and
I have), it is pretty easy to keep up and the context is sufficient to know
what everyone else is doing in the field even if they don't tell you
everything. But for someone starting out, they have to bridge the
"publication gap" themselves which takes time. The hapax legomenon in the
above will take you to the edge of the pavement.
This has been roughly my primary area of theoretical research for many
years, so I am as guilty of non-publication as the next guy. Lately it has
become a somewhat "hot" area for theoretical computer science, so there is
more being written about it now than in the past. A lot of the good
discussion happens on lists or in person between researchers rather than in
publication or at public conferences. It isn't too difficult to find
evidence of the universality and tractability/exponentiation issues I was
talking about, but you'll find almost no details on the proofs of such, just
implicit agreement from various mathematics folks working on it. I greatly
simplified the space/issue for the sake of readability.
Cheers,
-James Rogers
--
***@hpti.com http://www.hpti.com Personal: ***@lig.net http://sdw.st
Stephen D. Williams 43392 Wayside Cir,Ashburn,VA 20147-4622
703-724-0118W 703-995-0407Fax Oct2002
Jeff Bone
2003-05-16 20:41:30 UTC
Permalink
Post by Stephen D. Williams
I still haven't seen you lay out a clearly defined problem and
solution, or at least the characteristics of a solution.
A) Sounds like a lot of mental masturbation.
C) Name some specific applications, algorithms, and problems that this
is crucial to solving.  Why do I care?
Yup, I would agree. If he can't even state a specific application /
problem domain, then it's highly suspect.

I for one don't know how to "head west" from a theory. ;-) I think
maybe I thought I did when I was younger and even more of an idiot.

;-)

jb
Joseph S. Barrera III
2003-05-16 20:58:27 UTC
Permalink
I for one don't know how to "head west" from a theory. ;-) I think
maybe I thought I did when I was younger and an even more of an idiot.
I've seen theories go south. Is that the same sort of thing?

- Joe
Gavin Thomas Nicol
2003-05-17 00:20:53 UTC
Permalink
Post by Jeff Bone
Yup, I would agree. If he can't even state a specific application /
problem domain, then it's highly suspect.
I ran into it recently below...

http://homepages.cwi.nl/~tromp/cl/cl.html
James Rogers
2003-05-16 22:26:24 UTC
Permalink
On Friday, May 16, 2003, at 15:27 US/Central, Stephen D.
Post by Stephen D. Williams
I still haven't seen you lay out a clearly defined problem and
solution, or at least the characteristics of a solution.
A) Sounds like a lot of mental masturbation.
C) Name some specific applications, algorithms, and
problems that this
Post by Stephen D. Williams
is crucial to solving.  Why do I care?
Yup, I would agree. If he can't even state a specific application /
problem domain, then it's highly suspect.
Sorry folks, but I don't have the amount of time required to go into this at
the level of detail required to make it interesting, nor am I particularly
interested in convincing you. One could even argue that it is not in my
best interest to convince you. This is not a simple topic, but there is
ample evidence floating around out there that I might be clueful. I'm not
really trying to sell anything, only point out some larger issues that are
being actively addressed by other people and highlighting some relevant
items. <shrug>

And even if I did spend the time, I have a fiduciary obligation not to get
into too much detail, so I have to dance around some specific areas. Many of
the "problems" I discuss in the abstract have in fact been solved -- I've
seen implementations running on real systems -- but I'm not going to say
which ones or to what extent.

As a side note, if you are even nominally familiar with that area of
mathematics, nothing I stated should be a surprise or need to be
substantiated. My previous description was a heavily glossed "101" level
overview of the fundamental theorems and their application. What do you
want, page references to Li and Vitanyi? This is the *other* reason I don't
have time for this; I'm not terribly interested in discussing things at that
level.

Off to fight a fire,

-James Rogers
***@best.com
James Rogers
2003-05-16 22:42:48 UTC
Permalink
What grand mathematical analysis led to Huffman coding, for instance?
(None, he just thought it up one day according to the story.)
Talk about irrelevant. Huffman coding has a real theoretical basis, though
one could argue that it was really a mediocre but simple approximation of
arithmetic coding. Occam's Razor existed for centuries before it was proven
in mathematics to be a concise description of some important theorems, but
that doesn't detract from its correctness or its applicability. This sounds
very much like a "God of the gaps" argument.

This touches on a very important point: mathematics never really prescribes
an algorithmic implementation. The hard part isn't solving the mathematical
models, but figuring out how to make a real-world algorithmic
implementation of the mathematics. In practice there is an important reason
for this (UTM assumptions), but this frequently leads to a decades-long
evolutionary walk towards optimal universal algorithms. The prescription
problem is as much a consequence of history and convention as anything else
though, like many things in theoretical computer science.


-James Rogers
***@best.com
Jeff Bone
2003-05-17 01:45:49 UTC
Permalink
Post by James Rogers
Sorry folks, but I don't have the amount of time required to go into
this at
the level of detail required to make it interesting, nor am I
particularly
interested in convincing you.
Ah, basically "trust me, I know what I'm talking about but I don't have
time to make you understand." VERY convincing!

Not.
Post by James Rogers
One could even argue that it is not in my
best interest to convince you.
The question is not this, but rather whether it's in our best interest
to be convinced. Secondarily, the question is why you even care about
whether we're convinced. That is, if the problem is so valuable to
solve, why do you care if we know / care about your solution?

Puzzle that one out and you might learn something useful about yourself.

Him with sin, throwing stones anyway...

;-)

jb
