Cloning XML


Tim Channon
There are many XML decoding libraries of varying degrees of capability
and portability. Their focus is decoding.

Discussion of regenerating identical XML from the decoded form seems to
get lost, particularly when the user has no idea what the XML means and
simply wants to intercept textually known entities within a context.

I'm trying to do this where I need to hack values in XML as a fix for a
known deficiency in a major package, i.e. a workaround.

FYI, the package's SVG output provides a simple hack target: recolouring
interpolated colours where verbatim colours are needed. I'm highly amused
that the nameless package involved is stated on various discussion sites
to be unable to draw filled polygons, which would be a solution. The SVG
output is... filled polygons! You can't make this up.

Refer to ancient cartoon about the tree swing. So so true in IT.

Re: Cloning XML

Dirk Laurie-2
2015-05-18 8:22 GMT+02:00 Dirk Laurie <[hidden email]>:
> 2015-05-18 2:24 GMT+02:00 Tim Channon <[hidden email]>:
>> There are many XML decoding libraries of varying degrees of capability and
>> portability. The focus of these is decode.
>>
>> Discussion of regenerating identical XML from the decode seems to get lost
>> and particularly when the user has no idea of the XML meaning, simply wants
>> to intercept textually known entities within a context.

I found it quite easy to regenerate identical XML from the output of Roberto's
parser on the lua-users.org Wiki, but in a very specialized context: the XML
files were those produced by `pdftohtml -xml`, for which the DTD is
self-contained and a mere 49 lines long. A far cry from SVG's over 300 lines
mainly pulling in other files, I'll admit.

Here is what I did: I tweaked Roberto's code to provide a metatable
for every table-valued item.

element_mt = { __tostring =
   function(s)
      -- strings pass through unchanged; tables are rendered by tag
      if type(s)=='string' then return s end
      assert(s.tag)
      local r = render[s.tag]   -- look up the renderer for this tag
      if type(r)=='string' then return r
      elseif type(r)=='function' then return r(s)
      else error("Can't convert type '"..s.tag.."' to a string")
      end
   end }

Then the regenerate routine becomes:

local tconcat = table.concat

local function assemble(document)
   -- walk the linked list of boxes, rendering each through __tostring
   local s = {}
   local box = document.first
   while box do
      s[#s+1] = tostring(box)
      box = box.next
   end
   return tconcat(s,'\n')
end

The global or upvalue "render" is

local render = {
   pdf2xml = lines,
   page = lines,
   document = assemble,
   text = contents,
   b = function(s) return '**'..contents(s)..'**' end,
   i = function(s) return '*'..contents(s)..'*' end,
   a = contents,
   outline = "<outline>",
   fontspec = "<fontspec>" }

with

contents = bind_concat""
lines = bind_concat"\n"

where

function bind_concat(sep)
--- table.concat with bound separator and `tostring` filter
   return function(t)
      local u={}
      if type(t)=='string' then return t end
      for k,v in ipairs(t) do u[k]=tostring(v) end
      return tconcat(u,sep)
   end
end
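
For readers who want to try this, here is a minimal sketch that rearranges the pieces above into an order that runs as a single chunk. It is not the original module: only a subset of "render" is kept, and the `node` helper and the hand-built tree are hypothetical stand-ins for what the tweaked parser is assumed to produce (a table with a `tag` field and its children in the array part).

```lua
local tconcat = table.concat

local function bind_concat(sep)
   -- table.concat with bound separator and `tostring` filter
   return function(t)
      if type(t) == 'string' then return t end
      local u = {}
      for k, v in ipairs(t) do u[k] = tostring(v) end
      return tconcat(u, sep)
   end
end

local contents = bind_concat ""
local lines = bind_concat "\n"

-- Subset of the "render" table from the post
local render = {
   page = lines,
   text = contents,
   b = function(s) return '**' .. contents(s) .. '**' end,
   i = function(s) return '*' .. contents(s) .. '*' end,
   fontspec = "<fontspec>",
}

local element_mt = { __tostring = function(s)
   local r = render[assert(s.tag)]
   if type(r) == 'string' then return r
   elseif type(r) == 'function' then return r(s)
   else error("Can't convert type '" .. s.tag .. "' to a string")
   end
end }

-- Hypothetical helper: builds a node the way the tweaked parser is assumed to
local function node(tag, ...)
   return setmetatable({ tag = tag, ... }, element_mt)
end

local page = node('page',
   node('fontspec'),
   node('text', 'plain ', node('b', 'bold'), ' and ', node('i', 'italic')))

local rendered = tostring(page)
print(rendered)
```

The point of the metatable is that `tostring` recurses through the tree on its own: each renderer only has to say how its own tag is joined.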

Note that "render" depends on the DTD. Writing a module that
can generate "render" from an arbitrary DTD was not
part of my purposes; writing a different "render" that converts
to, say, Markdown rather than plain text is, but not yet to the
level where I can share it.
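
Since the thread is about regenerating XML, a generic serializer may be a useful companion to the text-producing "render" above. The following is only a sketch under stated assumptions: the node layout (`tag`, an ordered `attr` list of {name, value} pairs, children in the array part) is invented for illustration and is not the layout produced by the Wiki parser. Attributes are kept in a list because a plain Lua hash table does not remember the order in which they appeared.

```lua
-- Escape the characters that must not appear raw in text or attribute values
local function escape(s)
   return (s:gsub('[&<>"]', {
      ['&'] = '&amp;', ['<'] = '&lt;', ['>'] = '&gt;', ['"'] = '&quot;' }))
end

-- Assumed node layout: { tag = 'name', attr = { {'k','v'}, ... }, child1, child2, ... }
local function to_xml(node)
   if type(node) == 'string' then return escape(node) end
   local out = { '<', node.tag }
   for _, a in ipairs(node.attr or {}) do
      out[#out + 1] = ' ' .. a[1] .. '="' .. escape(a[2]) .. '"'
   end
   if #node == 0 then
      out[#out + 1] = '/>'              -- no children: emit the empty-tag form
   else
      out[#out + 1] = '>'
      for _, child in ipairs(node) do out[#out + 1] = to_xml(child) end
      out[#out + 1] = '</' .. node.tag .. '>'
   end
   return table.concat(out)
end

local doc = {
   tag = 'text', attr = { { 'top', '10' }, { 'font', '0' } },
   'a < b ', { tag = 'b', 'bold' }, { tag = 'br' },
}
print(to_xml(doc))
```

This regenerates equivalent XML, not byte-identical XML: whitespace inside tags, the quote style of attributes, the choice between `<br/>` and `<br></br>`, and the original spelling of entities are all decided by the serializer rather than copied from the input.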

Re: Cloning XML

Tim Channon
On 18/05/2015 10:41, Dirk Laurie wrote:

> I found it quite easy to regenerate identical XML from the output of Roberto's
> parser on the lua-users.org Wiki, but in a very specialized context: the XML
> files were those produced by `pdftohtml -xml`, for which the DTD is
> self-contained and a mere 49 lines long. A far cry from SVG's over 300 lines
> mainly pulling in other files, I'll admit.

This seems to confirm the problem is not trivial, one of many similar
computing situations where the intent and the tools do not match well.

Given what seemed intractable, and given limited resources, I have turned
to a direct ad hoc approach on what is dubious XML (*). Perhaps
critically, this allows a strict (stable) sequence to be maintained.
I am able to identify sections and data on the fly in one pass, writing
back out as the input flows past. The decoded XML could also be stored.
This is not quite accurate yet: one elusive bug remains, but nevertheless
several SVG display tools accept a rewritten file without complaint, and
validation produces a warning, not an error. Recolouring the subset of
polygons to blue produces blue. Relating the polygons to the original
co-ordinates is, ah well, what fun: they are translated. Resolvable.

I don't like ad hoc when formal ought to be easy.

I use Roberto's parser elsewhere so your writing is welcome.

I'll continue dabbling and you've given me some things to try.

Thank you Dirk.

* Self-closing tags, even where not expected, are not illegal by XML
rules, and here there is erratic usage of paired or self-closing forms for
the same tags within one file, e.g. both <text> </text> and <text />
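
A one-pass textual rewrite of the kind described above could look roughly like the sketch below. It is not the code used in this post: the element name `polygon`, the double-quoted `fill` attribute, the colour values, and the function name `recolour` are all assumptions for illustration. Everything outside a matching attribute value is copied through untouched, which is what keeps the sequence and the rest of the bytes stable.

```lua
-- Replace fill="<wanted>" by fill="<replacement>" inside <polygon ...> start tags only.
-- Assumes no '>' occurs inside attribute values and that fill uses double quotes.
local function recolour(svg, wanted, replacement)
   return (svg:gsub('<polygon%s[^>]*>', function(tag)
      return (tag:gsub('(%sfill=")([^"]*)(")', function(pre, colour, post)
         if colour == wanted then
            return pre .. replacement .. post
         end
         return nil   -- nil tells gsub to keep the original text
      end, 1))
   end))
end

local input = table.concat({
   '<svg>',
   '  <polygon fill="#7f7fff" points="0,0 1,0 1,1"/>',
   '  <polygon points="2,2 3,2 3,3" fill="#ff0000" />',
   '  <rect fill="#7f7fff"/>',
   '</svg>',
}, '\n')

print(recolour(input, '#7f7fff', 'blue'))
```

Because this works on text rather than on a parse tree, it does not care whether a tag is written as a pair or self-closed, but it is also only as reliable as its patterns: comments, CDATA sections, or single-quoted attributes would need extra handling.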




Re: Cloning XML

Jay Carlson
On 2015-05-18, at 11:36 AM, Tim Channon <[hidden email]> wrote:

> * not allowed self closing tags are not illegal by XML rules and here there is erratic usage of named or self closing for the same tags within one file. eg. both <text> </text> and <text> />

I think everybody should use a fully-conforming parser like expat, or something that uses expat, while getting started on XML.

expat delivers an open event and a close event for every tag. Empty tags in XML can be written two different but equivalent ways.

expat does not distinguish character data expressed as CDATA encoding, just as it does not distinguish &# references.

expat signals an error on character set issues.

These all reflect the XML Infoset, which is the only data guaranteed to be communicated between parties using XML.
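
To make the point about empty tags concrete, here is a toy event generator in plain Lua. It is emphatically not a conforming parser and not expat (which Lua programs normally reach through the LuaExpat binding); the function name `events` and the pattern are assumptions for illustration, and attributes, comments, CDATA and character data are ignored. It only shows that the two spellings of an empty element collapse to the same open/close event stream, so a consumer of such events cannot tell which spelling the file used.

```lua
-- Toy sketch: report start/end events for tags, ignoring everything else
local function events(xml)
   local out = {}
   for close, name, rest in xml:gmatch('<(/?)([%w_:%-%.]+)([^>]*)>') do
      if close == '/' then
         out[#out + 1] = 'end ' .. name
      else
         out[#out + 1] = 'start ' .. name
         -- a trailing '/' marks the empty-tag form: emit the close event too
         if rest:sub(-1) == '/' then out[#out + 1] = 'end ' .. name end
      end
   end
   return table.concat(out, ', ')
end

print(events('<a><text></text></a>'))
print(events('<a><text/></a>'))
```

Both calls print the same line, which is why round-tripping the original spelling needs a tool that looks at the text itself rather than at parser events.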

I should probably implement lazykit in modern Lua one of these decades….

Jay

Re: Cloning XML

Coda Highland
On Mon, May 18, 2015 at 2:15 PM, Jay Carlson <[hidden email]> wrote:

> On 2015-05-18, at 11:36 AM, Tim Channon <[hidden email]> wrote:
>
>> * not allowed self closing tags are not illegal by XML rules and here there is erratic usage of named or self closing for the same tags within one file. eg. both <text> </text> and <text> />
>
> I think everybody should use a fully-conforming parser like expat, or something that uses like expat, while getting started on XML.
>
> expat delivers an open event and a close event for every tag. Empty tags in XML can be written two different but equivalent ways.
>
> expat does not distinguish character data expressed as CDATA encoding, just as it does not distinguish &# references.
>
> expat signals an error on character set issues.
>
> These all reflect the XML Infoset, which is the only data guaranteed to be communicated between parties using XML.
>
> I should probably implement lazykit in modern Lua one of these decades….
>
> Jay

This is explicitly counter to the stated goal of round-tripping the document.

/s/ Adam

Re: Cloning XML

Jay Carlson
On 2015-05-18, at 5:19 PM, Coda Highland <[hidden email]> wrote:
>
> On Mon, May 18, 2015 at 2:15 PM, Jay Carlson <[hidden email]> wrote:
>> On 2015-05-18, at 11:36 AM, Tim Channon <[hidden email]> wrote:
>>
>>> * not allowed self closing tags are not illegal by XML rules and here there is erratic usage of named or self closing for the same tags within one file. eg. both <text> </text> and <text> />
>>
>> I think everybody should use a fully-conforming parser like expat, or something that uses like expat, while getting started on XML.

[…]

>>
>> These all reflect the XML Infoset, which is the only data guaranteed to be communicated between parties using XML.
>
> This is explicitly counter to the stated goal of round-tripping the document.

If you are an XML processor[1] there is nothing you can notice outside the Infoset.

Jay

[1]: “It’s sort of a limited existence.”

Re: Cloning XML

Coda Highland
On Mon, May 18, 2015 at 7:40 PM, Jay Carlson <[hidden email]> wrote:

> On 2015-05-18, at 5:19 PM, Coda Highland <[hidden email]> wrote:
>>
>> On Mon, May 18, 2015 at 2:15 PM, Jay Carlson <[hidden email]> wrote:
>>> On 2015-05-18, at 11:36 AM, Tim Channon <[hidden email]> wrote:
>>>
>>>> * not allowed self closing tags are not illegal by XML rules and here there is erratic usage of named or self closing for the same tags within one file. eg. both <text> </text> and <text> />
>>>
>>> I think everybody should use a fully-conforming parser like expat, or something that uses like expat, while getting started on XML.
>
> […]
>
>>>
>>> These all reflect the XML Infoset, which is the only data guaranteed to be communicated between parties using XML.
>>
>> This is explicitly counter to the stated goal of round-tripping the document.
>
> If you are an XML processor[1] there is nothing you can notice outside the Infoset.
>
> Jay
>
> [1]: “It’s sort of a limited existence.”

All that means is that there are tools which operate on XML which are
not XML processors.

I'm okay with this. XML is a mess anyway.

/s/ Adam

Re: Cloning XML

Matthew Wild
On 19 May 2015 at 03:44, Coda Highland <[hidden email]> wrote:

> On Mon, May 18, 2015 at 7:40 PM, Jay Carlson <[hidden email]> wrote:
>> On 2015-05-18, at 5:19 PM, Coda Highland <[hidden email]> wrote:
>>> On Mon, May 18, 2015 at 2:15 PM, Jay Carlson <[hidden email]> wrote:
>>>> These all reflect the XML Infoset, which is the only data guaranteed to be communicated between parties using XML.
>>>
>>> This is explicitly counter to the stated goal of round-tripping the document.
>>
>> If you are an XML processor[1] there is nothing you can notice outside the Infoset.
>>
>> Jay
>>
>> [1]: “It’s sort of a limited existence.”
>
> All that means is that there are tools which operate on XML which are
> not XML processors.
>
> I'm okay with this. XML is a mess anyway.

So it would be ok to request a JSON processor that has to correctly
modify a JSON document while preserving things irrelevant to the data
- like whitespace, whether key names are quoted, and what kind of
quotes are used?

If a system has requirements for such a thing in its design, I'd
suggest there may be a flaw in that design[1].

Regards,
Matthew

[1]: I appreciate we don't always get to design the systems we have to
interop with...

Re: Cloning XML

Javier Guerra Giraldez
On Wed, May 20, 2015 at 1:58 PM, Matthew Wild <[hidden email]> wrote:
> So it would be ok to request a JSON processor that has to correctly
> modify a JSON document while preserving things irrelevant to the data
> - like whitespace, whether key names are quoted, and what kind of
> quotes are used?


nitpicking: JSON mandates all keys to be quoted, and only double
quotes are valid.  the only "style" variations allowed are whitespace
and field order.

in general, i agree with you: a requirement to preserve non-significant
details is broken by design.

unfortunately.... in XML some of those things are not non-significant,
mostly because of the schizophrenic nature of XML, where it can be a text
document with annotations or a data structure.  the requirements for
each are very different, so a general tool might have to preserve
things that many uses consider irrelevant.

--
Javier

Re: Cloning XML

Matthew Wild
On 20 May 2015 at 20:06, Javier Guerra Giraldez <[hidden email]> wrote:

> On Wed, May 20, 2015 at 1:58 PM, Matthew Wild <[hidden email]> wrote:
>> So it would be ok to request a JSON processor that has to correctly
>> modify a JSON document while preserving things irrelevant to the data
>> - like whitespace, whether key names are quoted, and what kind of
>> quotes are used?
>
>
> nitpicking: JSON mandates all keys to be quoted, and only double
> quotes are valid.  the only "style" variations allowed are whitespace
> and field order.

I know that. But almost every JSON implementation is willing to accept
a random combination of these violations (some don't require any
string to be quoted). But it's XML that is a mess :)

Data storage formats are like programming languages, text editors and
operating systems - doomed to suffer the "latest cool thing" trends
and spark flame wars until the end of time.

> unfortunately.... in XML some of those things are not non-significant.
> mostly because schizophrenic nature of XML, where it can be a text
> document with annotations or a data structure.  the requirements for
> each are very different, so a general tool might have to preserve
> things that many uses consider irrelevant.

Mixing human-generated content with machine-generated content in the
same document that way never turns out nicely :)

Regards,
Matthew

Re: Cloning XML

Jay Carlson
On 2015-05-20, at 3:20 PM, Matthew Wild <[hidden email]> wrote:
>
> Mixing human-generated content with machine-generated content in the
> same document that way never turns out nicely :)

We are on the same side, but I am in a pedantic mood:

There’s not really any human-generated content. The last reasonably close mapping was paper tape punched offline. Even then, there is structure to that.

One of my favorite programming languages does not allow string characters outside printable ASCII, plus space. This was very instructive in how much structure there was in “plain text”. Many people use writing systems not 1-to-1-mapped to teletypes/movable type, and those people have the same machine-mediation experience immediately.

The world beyond teletypes makes programming language string support look sad. We need better, easier ways to manipulate structured data than jamming it through char *.

Jay