EmbeddedRelated.com
The 2026 Embedded Online Conference

SGML/XML tools for arbitrary access to entities within document?

Started by Don Y July 18, 2014
Hi,

I am driving much of my software from data excised from
formal documents that describe the algorithms involved
in much more detail (and modalities) than is possible with
(textual) "source code".

I'm a huge fan of table-driven applications so I often
express components of algorithms in tables, then excise
the tables from the document and propagate them into
the formal "sources" (all mechanically, of course).

This minimizes the chance for typographical errors to creep
into the "code" between the documentation and the executable.
It also makes it difficult for the code to evolve without
the documentation coming along *with* it!

And, of course, it allows things to be expressed in forms
that are more intuitive/self-documenting than would otherwise
be available with "ASCII text".

So far, I've been creating ad hoc tools to extract the needed
components from the documents.  The markup language used in the
documents is well documented and the way I build my documents
makes it fairly easy to isolate the components of interest and
"extract them".

For example, to extract a particular table, I invoke a tool I
wrote with the command line:
    gettable TABLETITLE <column> [,<column>]
and redirect the output to a file (which is later massaged by
an application-specific tool/script to get it into a form suitable
for #include in a source file).  The tool parses the (nested)
tags of the markup (MU) language until it finds the table having the
specified TABLETITLE (string); it then extracts lists of
(font,string) tuples for each cell in the specified columns of
the table.

[other tags associated with the cell only contain cosmetic
information -- line spacing, text alignment, etc. -- so they
can be ignored]
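
A minimal sketch of the kind of extraction described above, assuming the markup can be represented as XML. The element and attribute names (`table`, `title`, `row`, `cell`, `run`, `font`) and the 1-based column numbering are assumptions made for illustration; the real markup language and the real `gettable` are not shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML stand-in for the markup; every tag and attribute
# name here is an assumption, not the real MU language.
SAMPLE = """
<doc>
  <table title="Other"><row><cell><run font="Roman">x</run></cell></row></table>
  <table title="Gains">
    <row>
      <cell align="left"><run font="Bold">Kp</run></cell>
      <cell align="right"><run font="Roman">1.5</run></cell>
    </row>
    <row>
      <cell align="left"><run font="Bold">Ki</run></cell>
      <cell align="right"><run font="Roman">0.</run><run font="Italic">25</run></cell>
    </row>
  </table>
</doc>
"""

def gettable(root, title, columns):
    """Return one list per row; each row holds one list of
    (font, string) tuples per requested column (1-based, assumed)."""
    for table in root.iter("table"):
        if table.get("title") != title:
            continue  # keep walking until the title matches
        rows = []
        for row in table.findall("row"):
            cells = row.findall("cell")
            picked = []
            for col in columns:
                runs = cells[col - 1].findall("run")
                # Cosmetic attributes such as align are simply never read.
                picked.append([(r.get("font"), r.text or "") for r in runs])
            rows.append(picked)
        return rows
    raise KeyError(title)

root = ET.fromstring(SAMPLE)
for row in gettable(root, "Gains", [1, 2]):
    print(row)
```

Each printed row is a list of cells, and each cell is a list of (font, string) tuples, so a cell built from several substrings keeps the font of each one.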

But, I'm looking at other options for a more generic solution
to this problem.

E.g., I wrote a formal grammar for the markup language so I can
build a specific parser to extract what I need *using* arbitrary
parts of that grammar (e.g., if I later decide the *color* of
the text in a cell is important -- highly doubtful!).

I'm also looking at building a formal DTD for the MU language
and seeing what XML-ish tools exist to do these sorts of things.

The downside of a more "involved"/capable solution is it gets
more tedious to maintain -- especially as the MU language evolves!
And, testing the tool becomes a project in itself!  :<

So, specifically, what sorts of OTS tools (prefer ones with sources
that I can modify) exist that will let me do things like parsing
to a particular nested tag, verifying the attribute associated with
it matches what I seek (e.g., TABLETITLE) then extracting all (and
ONLY!) attributes of specific tags contained nested *within* that
context?

I.e., I want to be able to specify what parts of the tree to extract
based on criteria I specify on a command line.

Thx,
--don
On 19/07/14 08:43, Don Y wrote:
> So, specifically, what sorts of OTS tools (prefer ones with sources
> that I can modify) exist that will let me do things like parsing
> to a particular nested tag, verifying the attribute associated with
> it matches what I seek (e.g., TABLETITLE) then extracting all (and
> ONLY!) attributes of specific tags contained nested *within* that
> context?
>
> I.e., I want to be able to specify what parts of the tree to extract
> based on criteria I specify on a command line.

Do you know what XPATH is? It's a widely known, implemented and used
language for exactly that.

DTDs suck. Use XSD if you need that kind of thing. Again, all XML tools
have support for it.

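To make the XPath suggestion concrete, here is a hedged sketch using the limited XPath subset built into Python's `xml.etree.ElementTree`. It assumes the markup has been rendered as XML; the tag and attribute names (`table`, `title`, `row`, `cell`, `run`, `font`) and the helper `select` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a document; names are assumptions.
DOC = ET.fromstring("""
<doc>
  <table title="Limits">
    <row><cell><run font="Bold">Vmax</run></cell><cell><run font="Roman">5.0</run></cell></row>
    <row><cell><run font="Bold">Vmin</run></cell><cell><run font="Italic">0.3</run></cell></row>
  </table>
  <table title="Other">
    <row><cell><run font="Roman">n/a</run></cell><cell><run font="Italic">skip</run></cell></row>
  </table>
</doc>
""")

def select(root, title, column, tag="run"):
    """Return the attributes (and only the attributes) of every <tag>
    nested in the given 1-based column of the table whose title matches.
    Note: a title containing a single quote would break this path."""
    path = f".//table[@title='{title}']/row/cell[{column}]/{tag}"
    return [dict(el.attrib) for el in root.findall(path)]

print(select(DOC, "Limits", 2))
```

The path string is the part that could come straight from a command line. A full XPath engine accepts richer predicates than the ElementTree subset used here.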
Hi Clifford,

On 7/18/2014 7:54 PM, Clifford Heath wrote:
> On 19/07/14 08:43, Don Y wrote:
>> So, specifically, what sorts of OTS tools (prefer ones with sources
>> that I can modify) exist that will let me do things like parsing
>> to a particular nested tag, verifying the attribute associated with
>> it matches what I seek (e.g., TABLETITLE) then extracting all (and
>> ONLY!) attributes of specific tags contained nested *within* that
>> context?
>>
>> I.e., I want to be able to specify what parts of the tree to extract
>> based on criteria I specify on a command line.
>
> Do you know what XPATH is? It's a widely known, implemented and used
> language for exactly that.

Thanks, I looked at this.  But, after reviewing my code, I see
a lot of complexity involved in tracking "state" throughout the
table being parsed.  This is because tables can have cells that
straddle the underlying (row,column) matrix that is defined by
the markup.

So, the VISIBLE contents (i.e., what a human reader perceives as
the contents of that cell) of a <cell> may actually be defined in
a previous row *above* (or, above AND TO THE LEFT) or to the left
of this cell.

Obviously, you can't retrieve the effective contents of the
*desired* cell without examining the rest of the table structure
and the attributes in those other cells (indicating where the
straddle occurs -- if ever).

So, I'd have to artificially replicate the contents of the
effective cell into ALL straddled "markup" cells in order to
effectively retrieve contents solely by accessing a specific
node.  <frown>  That doesn't really save any labor.

*But*, this is a worthwhile tool to keep in mind!  Perhaps I can
use it to extract other aspects of the documents that are more
well-behaved (like figures).

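The straddle bookkeeping described above can be sketched as a single row-major pass that records, for every (row,column) position, which cell actually supplies its visible contents. This is only an illustration under stated assumptions: the anchor cell is taken to be the top-left cell of the straddle, it is assumed to carry hypothetical `rows` and `cols` attributes, and covered cells are assumed to be present but empty in the markup. The real markup may encode straddles differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "rows"/"cols" on the top-left cell of a straddle.
TABLE = ET.fromstring("""
<table title="Modes">
  <row><cell rows="2" cols="2">A</cell><cell/><cell>B</cell></row>
  <row><cell/><cell/><cell>C</cell></row>
  <row><cell>D</cell><cell cols="2">E</cell><cell/></row>
</table>
""")

def effective_grid(table):
    """Return the visible contents of every (row, column) position,
    replicating a straddling cell's text into the cells it covers."""
    rows = [row.findall("cell") for row in table.findall("row")]
    owner = {}  # (row, col) -> (row, col) of the cell that supplies the text
    for r, cells in enumerate(rows):
        for c, cell in enumerate(cells):
            if (r, c) in owner:
                continue  # already covered by a straddle above or to the left
            height = int(cell.get("rows", "1"))
            width = int(cell.get("cols", "1"))
            for dr in range(height):
                for dc in range(width):
                    owner[(r + dr, c + dc)] = (r, c)
    grid = []
    for r, cells in enumerate(rows):
        line = []
        for c in range(len(cells)):
            ar, ac = owner[(r, c)]
            line.append(rows[ar][ac].text or "")
        grid.append(line)
    return grid

for line in effective_grid(TABLE):
    print(line)
```

Once such a grid exists, a lookup by (row,column) no longer needs to inspect neighbouring cells, which is the replication step the post describes having to do by hand.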
> DTDs suck. Use XSD if you need that kind of thing. Again, all XML tools
> have support for it.

The advantage the DTD has is that it gives you a (reasonably)
terse overview of the overall "structure" of the document -- in
much the same way a ToC gives you an overview of a document's
contents.

E.g., the DTD would make it clear that tables are stored in row
major order; that cells can span rows *or* columns; that the
contents of a cell can consist of several substrings, each of
which can have a particular font tag, etc.

(sigh)  I think I'll stick with the ad hoc approach and just keep
the tools I've developed in mind when I consider how I construct
(organize) the rest of my documents!