[Pharo-dev] How to get rid of empty XML nodes?

monty monty2 at programmer.net
Mon Jan 29 08:00:51 EST 2018


I attached a commit patch (apply with `git am ...`) to the 'books.pharo.org' repo to update the Scraping .pdf link. (The .pdf it links to now is obsolete.)

> Sent: Friday, January 26, 2018 at 2:30 PM
> From: "Stephane Ducasse" <stepharo.self at gmail.com>
> To: "Pharo Development List" <pharo-dev at lists.pharo.org>
> Subject: Re: [Pharo-dev] How to get rid of empty XML nodes?
>
> Tx Monty!
> This is a really important addition :)
> Because it is a super frequent scenario.
> 
> Stef
> 
> On Fri, Jan 26, 2018 at 8:37 AM, monty <monty2 at programmer.net> wrote:
> > See #removeAllFormattingNodes and its comment in the latest version.
> >
> > And instances of SAXHandler and subclasses are meant to be created with #on: (or another "instance creation" message), _not #new_, otherwise they won't be properly initialized. The class comment is clear about this, but I should have overridden #new to raise an error like Stream does. Your misuse was helpful in bringing this to my attention, and I added a Stream-like #new implementation to SAXHandler.
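> >
> > A minimal sketch of both points, assuming #removeAllFormattingNodes is sent to the parsed document (the sample XML string is made up for illustration):
> >
> >     | xmlString doc |
> >     xmlString := '<one> <two/> </one>'.
> >     "Create the parser with #on: (not #new) so it is properly initialized."
> >     doc := (XMLDOMParser on: xmlString) parseDocument.
> >     "Ask the document to drop formatting-only string nodes; see the method comment for the exact rules."
> >     doc removeAllFormattingNodes.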
> >
> >> Sent: Friday, December 08, 2017 at 9:21 AM
> >> From: "Stephane Ducasse" <stepharo.self at gmail.com>
> >> To: "Pharo Development List" <pharo-dev at lists.pharo.org>
> >> Subject: Re: [Pharo-dev] How to get rid of empty XML nodes?
> >>
> >> Hi monty
> >>
> >>
> >> On Fri, Dec 8, 2017 at 9:03 AM, monty <monty2 at programmer.net> wrote:
> >> > By "empty XML nodes," do you mean whitespace-only string nodes?
> >>
> >> Yes
> >>
> >> > Those are included because all in-element whitespace is assumed significant by the spec: https://www.w3.org/TR/xml/#sec-white-space
> >>
> >> I know. There was a discussion a while ago. I just lost a couple of
> >> hours understanding that :(
> >>
> >> But this is a super super super annoying practice.
> >> We had to test each node to see if it is an empty node, so it makes
> >> everything a lot more complex without real justification,
> >> besides the fact that these standardizers probably never implemented
> >> some real cases.
> >> This standard is really out of touch with reality from that perspective.
> >>
> >> > The exception is if the element is declared in the DTD as only having element children ("element content"): https://www.w3.org/TR/xml/#dt-elemcontent
> >>
> >> Well, the XML files that I had (I did not choose XML because I would
> >> have preferred JSON :) ) had no DTD :(
> >>
> >> So at the end of the day, this wonderful standard puts all the stress
> >> and burden on people.
> >>
> >> >
> >> > For example, if you declare an element like this:
> >> >
> >> > <!ELEMENT one (two,three*,four?)>
> >> >
> >> > Any whitespace around a "two," "three," or "four" element child of a "one" element is insignificant and ignored (unless #preservesIgnorableWhitespace: is true). Other parsers, like LibXML2 and Xerces, behave the same way.
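> >> >
> >> > A sketch of such a document, assuming the default parser settings process the internal DTD subset (the "two," "three," and "four" declarations are added only to make the example complete):
> >> >
> >> >     | doc |
> >> >     doc := XMLDOMParser parse: '<!DOCTYPE one [
> >> >       <!ELEMENT one (two,three*,four?)>
> >> >       <!ELEMENT two EMPTY>
> >> >       <!ELEMENT three EMPTY>
> >> >       <!ELEMENT four EMPTY>
> >> >     ]>
> >> >     <one>
> >> >       <two/>
> >> >       <three/>
> >> >     </one>'.
> >> >     "With 'one' declared as element content, the whitespace between its children should not show up as string nodes here."
> >> >     doc root nodes.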
> >> >
> >> > I'll see if I can come up with some easier way to deal with this, like an optional parser setting, new enumeration methods, or maybe a tree transformation.
> >>
> >> It would be A HUGE PLUS!!!!!!!!!!!!!!!!!!
> >>
> >>
> >> Because reality is that people have XML files with just nodes and no
> >> empty nodes and they are forced to
> >> Let me know because I could try.
> >>
> >> I was showing Pharo learners how to use Pharo to import code, and
> >> this was a big drag.
> >>
> >> Stef
> >>
> >>
> >> I tried to set some values in the parser but it did not work.
> >> BTW I saw that the configuration logic forces one to write the following:
> >>
> >> | parser doc visitor |
> >> parser := XMLDOMParser new
> >>    on: self xmlContents;
> >>    preservesIgnorableWhitespace: true.
> >>
> >> and not
> >>
> >> | parser doc visitor |
> >> parser := XMLDOMParser new
> >>     preservesIgnorableWhitespace: true;
> >>     on: self xmlContents.
> >>
> >>
> >> >
> >> >> Sent: Tuesday, December 05, 2017 at 8:29 AM
> >> >> From: "Stephane Ducasse" <stepharo.self at gmail.com>
> >> >> To: "Pharo Development List" <pharo-dev at lists.pharo.org>
> >> >> Subject: [Pharo-dev] How to get rid of empty XML nodes?
> >> >>
> >> >> Hi
> >> >>
> >> >> we are manipulating an XML document and I would like to get rid of the
> >> >> spurious empty strings.
> >> >> We saw that the gt panes are doing it.
> >> >>
> >> >> (aNodeWithElements isStringNode
> >> >> and: [aNodeWithElements isEmpty
> >> >> or: [aNodeWithElements isWhitespace]])
> >> >>
> >> >> Is there a way not to produce empty nodes?
> >> >> Is there a simple way not to have to handle them?
> >> >>
> >> >> Now each time we are dealing with a node we have to check.
> >> >>
> >> >> Stef
> >> >>
> >> >>
> >> >
> >>
> >>
> >
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-updated-Scraping-booklet-.pdf-link.patch
Type: text/x-patch
Size: 1280 bytes
Desc: not available
URL: <http://lists.pharo.org/pipermail/pharo-dev_lists.pharo.org/attachments/20180129/c2d10a8d/attachment.patch>

