pharo-users@lists.pharo.org

NeoCSVReader and wrong number of fieldAccessors

J
jtuchel@objektfabrik.de
Mon, Jan 4, 2021 1:36 PM

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.
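
Roughly like this in a Playground (just a sketch, not our real import
code; the exact counts will of course depend on the NeoCSV version you
have loaded):

| input records |
input := 'a1,a2,a3,a4
b1,b2,b3,b4
c1,c2,c3,c4'.
records := (NeoCSVReader on: input readStream)
	addField;
	addField;
	addField; "only three fieldAccessors for a four-column file"
	upToEnd.
records size. "I would expect 3 here, but the mismatch produces 6"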

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check whether the number of
separators in each line matches the number of fieldAccessors minus 1 (and
go through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds would I hand a Stream with the whole
line to the reader and do a #next.
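
A naive sketch of what I mean (it deliberately ignores the quoting
problem; #buildReaderOn: is just a placeholder for however we construct
the configured NeoCSVReader from the user-defined mapping, and
aReadStream is the uploaded file's stream):

| expectedFieldCount records |
expectedFieldCount := 4. "taken from the user-defined import definition"
records := OrderedCollection new.
[ aReadStream atEnd ] whileFalse: [
	| line |
	line := aReadStream nextLine.
	(line occurrencesOf: $,) = (expectedFieldCount - 1)
		ifTrue: [ records add: (self buildReaderOn: line readStream) next ]
		ifFalse: [ self error: 'Line does not match the import definition: ', line ] ].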

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible with the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim

J
jtuchel@objektfabrik.de
Mon, Jan 4, 2021 1:46 PM

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

PD
Paul DeBruicker
Mon, Jan 4, 2021 6:23 PM

After instantiating the reader and before doing the reading you can
#readHeader and check that the reader field count and header field count
match.  Would that help?

If the CSV doesn't use headers, then you can process the "header" as the
first record and then process the rest of the file.
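
Something along these lines (just a sketch; expectedFieldCount stands in
for however many fields your import definition configures, and aStream
is the uploaded CSV):

| reader header expectedFieldCount |
expectedFieldCount := 4.
reader := NeoCSVReader on: aStream.
header := reader readHeader.
header size = expectedFieldCount
	ifFalse: [ self error: 'Header has ', header size printString,
		' columns, but the import definition expects ', expectedFieldCount printString ].
"only now add the fieldAccessors and read the records as usual"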

jtuchel wrote

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb

jtuchel@

:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:

jtuchel@

Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

NeoCSVEndlessLoopTest.st (2K)
<http://forum.world.st/attachment/5125853/0/NeoCSVEndlessLoopTest.st>

SV
Sven Van Caekenberghe
Mon, Jan 4, 2021 7:57 PM

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 7:21 AM

Paul,

thank you very much for this idea. Your suggestion is probably "good
enough" to at least catch errors when the number of columns doesn't
match in the whole file or the first row. For my use case, it wouldn't
make any difference if the first row contains header information or not.
There might be cases where you should check each and every line and
produce an error if the number of columns doesn't match.

So I'll definitely take a look at #readHeader and see if I can use this
as a guard against maybe 80% of error cases, which is already a great
step forward.

For now I am still hunting for the endless loop, which I guess is
related, but I cannot say for sure yet. Maybe your fix also
significantly reduces the risk of these loops.

Joachim

As a side note: isn't it interesting how many creative ideas people have
around what they call "CSV export"? Like a header of 12 nicely formatted
text lines that beautifully displays in Excel, but of course doesn't
resemble anything near the idea of CSV... Some even add some column
separators at the beginning of header lines to make the text appear in
column B or C, but never append any empty fields in their header lines...

Am 04.01.21 um 19:23 schrieb Paul DeBruicker:

After instantiating the reader and before doing the reading you can
#readHeader and check that the reader field count and header field count
match.  Would that help?

If the CSV doesn't use headers then you can process the "header" as  the
first record and then process the rest of the file.

jtuchel wrote

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb
jtuchel@
:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:
jtuchel@
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

NeoCSVEndlessLoopTest.st (2K)
<http://forum.world.st/attachment/5125853/0/NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 7:49 AM

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I
remember reading your comment on this in some other discussion long ago.
You wrote you don't see it as a responsibility of a parser and that you
wouldn't want to add this to NeoCSV. I must say I tend to agree mostly.
Whatever you do at parsing can only cover part of the problems related
to validation. There will be checks that require access to other fields
from the same line, or some object that will be the owner of the
Collection that you are just importing, so a lot of validation must be
done after parsing anyway.

So I think we can mostly ignore the validation part. Whatever a reader
will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with
#addField:converter: would help a lot, however.
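
Right now I would probably have to wrap the converter block myself,
roughly like this (just a sketch; the field names and the error text are
made up, the #addField:converter: and recordClass: usage follows your
Playground examples):

(NeoCSVReader on: 'foo,1
bar,not-a-number' readStream)
	recordClass: Dictionary;
	addField: [ :obj :value | obj at: #name put: value ];
	addField: [ :obj :value | obj at: #amount put: value ]
		converter: [ :string |
			[ Number readFrom: string ]
				on: Error
				do: [ :ex | self error: 'cannot convert ', string printString, ' for field #amount' ] ];
	upToEnd.

That at least stops the import with a descriptive error instead of
silently producing a broken record, but having something built in would
be nicer.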

I am glad you agree on the underflow bug. This is more a question of
well-formedness than of validation. If a reader finds out it doesn't fit
the file's structure, it should tell the user/developer about it or at
least gracefully return some more or less incomplete object resembling
what it could parse. But it shouldn't cross line borders and return a
wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an
ideal situation if one user of our Seaside Application completely blocks
an image that may be serving a few other users by just using a CSV
parser that doesn't fit with the file. I suspect this has to do with
#readEndOfLine in some special case of the underflow bug, but cannot
prove it yet. But I have a file and parser that reliably goes into an
endless loop. I just need to isolate the bare CSV parsing from the whole
machinery we've built around NeoCSV reader for these user-defined
mappings... I wouldn't be surprised if it is a problem buried somewhere
in our preparations in building a parser from user-defined data... I
will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method
called #peekChar, but it doesn't #peek. It buffers a character and does
read the #next character. I tried replacing the #next with #peek, but
that is definitely a shortcut to 100% CPU, because #peekChar is used a
lot, not only for consuming an "unmapped remainder" of a line... I
somehow have the feeling that at least in #readEndOfLine the next char
should be peeked instead of consumed in order to find out if it's
workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is
more data after all fieldAccessors have been applied to the line (see
#readNextRecordAsObject)? Otoh, at one point the reader has to skip to
the next line, so I am not sure if peek has any place here... I need to
debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  1. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

SV
Sven Van Caekenberghe
Tue, Jan 5, 2021 4:49 PM

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new behaviour.
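
By way of illustration only (this sketch is not taken from the commit; whether a field/column mismatch now signals, and which error class it uses, is best checked against the tests themselves), a reader defined with fewer fields than the input has columns can be exercised like this, with a handler ready in case the new code signals:

| input reader |
input := 'foo,1
bar,2
foobar,3'.
reader := (NeoCSVReader on: input readStream) addField; yourself. "1 field defined, 2 columns in the data"
[ reader upToEnd ]
	on: Error
	do: [ :ex | 'field definitions do not match the input: ', ex messageText ].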

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.
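
A workable stop-gap with the current API is to make the converter block defend itself. A sketch only; the Dictionary record class, the #name/#amount keys and the nil fallback are illustrative choices, not NeoCSV requirements:

| reader |
reader := NeoCSVReader on: 'foo,1
bar,oops' readStream.
reader
	recordClass: Dictionary;
	addField: [ :obj :str | obj at: #name put: str ];
	addField: [ :obj :value | obj at: #amount put: value ]
		converter: [ :str |
			"fall back to nil (or remember str and ex for a report) when the conversion fails"
			[ Number readFrom: str ] on: Error do: [ :ex | nil ] ].
reader upToEnd.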

I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit for a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside Application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit with the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably goes into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...
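
For reference, the #peek / #next distinction this hinges on, shown on a plain ReadStream (nothing NeoCSV-specific, just a throwaway snippet):

| s |
s := 'a,b' readStream.
s peek.    "$a, does not advance the stream"
s next.    "$a, advances the stream"
s peek.    "$,"
s upToEnd. "',b'"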

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 6:52 PM

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in
order to try your changes....
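
For anyone in the same situation, loading the GitHub version into a Pharo image is usually a Metacello one-liner (a sketch; the 'repository' subdirectory is an assumption about the project layout):

Metacello new
	baseline: 'NeoCSV';
	repository: 'github://svenvc/NeoCSV/repository';
	load.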

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't
understand it is expected to return a Boolean. Not sure why. I thought
of 'read' as a command, not a question in simple past..., so I thought
its job should be to read the rest of the line if we're not there yet)

Joachim

Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit for a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside Application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit with the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably goes into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 8:06 PM

Sven,

I tested your change with the file and filter (our own way of defining
csv mappings by the end users) which used to send our application into
an endless loop.

And voila: we get an exception instead of a frozen image! I will give
the conversion errors a test drive tomorrow.
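
For others guarding user-triggered imports the same way, the shield around the read can be as small as this sketch (csvContents stands in for the uploaded text, and plain Error stands in for whatever more specific error the new NeoCSV code signals):

| csvContents reader |
csvContents := 'a,1
b,2'. "stands in for whatever the end user uploaded"
reader := (NeoCSVReader on: csvContents readStream)
	addField;
	addField;
	yourself.
[ reader upToEnd ]
	on: Error
	do: [ :ex |
		"report ex messageText back to the user instead of letting the request hang or fail"
		#() ].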

I am absolutely happy with your change. Thank you very much.

Joachim

P.S: I even learned a little bit about Iceberg. I am not really sure
each of my mouse clicks made sense, but I had your commit in the image
and could test it and port the deltas over to my Smalltalk dialect...

Am 05.01.21 um 19:52 schrieb jtuchel@objektfabrik.de:

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in
order to try your changes....

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't
understand it is expected to return a Boolean. Not sure why. I thought
of 'read' as a command, not a question in simple past..., so I thought
its job should be to read the rest of the line if we're not there yet)

Joachim

Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the
new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I
remember reading your comment on this in some other discussion long
ago. You wrote you don't see it as a responsibility of a parser and
that you wouldn't want to add this to NeoCSV. I must say I tend to
agree mostly. Whatever you do at parsing can only cover part of the
problems related to validation. There will be checks that require
access to other fields from the same line, or some object that will
be the owner of the Collection that you are just importing, so a lot
of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a
reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with
#addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of
well-formedness than of validation. If a reader finds out it doesn't
fit for a file structure, it should tell the user/developer about it
or at least gracefully return some more or less incomplete object
resembling what it could parse. But it shouldn't cross line borders
and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not
an ideal situation if one user of our Seaside Application completely
blocks an image that may be serving a few other users by just using
a CSV parser that doesn't fit with the file. I suspect this has to
do with #readEndOfLine in some special case of the underflow bug,
but cannot prove it yet. But I have a file and parser that reliably
goes into an endless loop. I just need to isolate the bare CSV
parsing from the whole machinery we've built around the NeoCSV reader
for these user-defined mappings... I wouldn't be surprised if it is
a problem buried somewhere in our preparations in building a parser
from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a
method called #peekChar, but it doesn't #peek. It buffers a
character and does read the #next character. I tried replacing the
#next with #peek, but that is definitely a shortcut to 100% CPU,
because #peekChar is used a lot, not only for consuming an "unmapped
remainder" of a line... I somehow have the feeling that at least in
#readEndOfLine the next char should be peeked instead of consumed
in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there
is more data after all fieldAccessors have been applied to the line
(see #readNextRecordAsObject)? Otoh, at one point the reader has to
skip to the next line, so I am not sure if peek has any place
here... I need to debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to
think more about this and experiment a bit. This is what I came up
with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField;
upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary;
addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary;
addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj
:str | obj at: #two put: str]; addField: [ :obj :str | obj at:
#three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary;
emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one
put: str]; addField: [ :obj :str | obj at: #two put: str];
addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be
    read and there are not enough of them in the input (underflow), or
    there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to
be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or
overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that
would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean.
There is just some nonsense Business Object class and a simple
test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less
crazy year than 2020...

I have a question that sounds a bit strange, but we have two
effects with NeoCSVReader related to wrong definitions of the
reader.

One effect is that reading a Stream #upToEnd leads to an endless
loop, the other is that the Reader produces twice as many objects
as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file
to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users
can define using a more or less nice assistant in our
Application. The CSV files they upload to our App come from third
parties like payment providers, banks and other sources. These
change their file structures whenever they feel like it and never
tell anybody. So a CSV import that may have been working for
years may one day tear a whole web server image down because of a
wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working
CSV Reader and comment out one of the addField: commands before
you use the NeoCSVReader to parse a CSV file. Say your CSV file
has 3 lines with 4 columns each. If you remove one of the
fieldAccessors, an #upToEnd will yield an Array of 6 objects
rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented.
It seems to not peek forward to the end of the line. I have the
gut feeling #peekChar should peek instead of reading the #next
character from the input Stream, but #peekChar has too many
senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line
using PositionableStream>>#nextLine and first check each line if
the number of separators matches the number of fieldAccessors
minus 1 (and go through the hoops of handling separators in
quoted fields and such...). Only if that test succeeds, I would
then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files.
Of course I could do this only for some lines, maybe just the
first one. Whatever.

But somehow I have the feeling I should get an exception telling
me the line is not compatible to the Reader's definition or such.
Or #readAtEndOrEndOfLine should just walk the line to the end and
ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

RO
Richard O'Keefe
Wed, Jan 6, 2021 4:10 AM

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530   astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737   NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (where that made sense) to return an Array of Strings.
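
For anyone who wants to repeat this kind of measurement, a rough playground
sketch would look something like the following. The file name is a placeholder
and this shows only the general shape of the timing, not the exact harness
used for the numbers above:

    "Baseline: just read every character of the file."
    charTime := [ 'data.csv' asFileReference readStreamDo: [ :in |
        [ in atEnd ] whileFalse: [ in next ] ] ] timeToRun.

    "NeoCSVReader in its default state, one Array of Strings per record."
    csvTime := [ 'data.csv' asFileReference readStreamDo: [ :in |
        (NeoCSVReader on: in) upToEnd ] ] timeToRun.

    { charTime. csvTime }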
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the
occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)
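
One workaround, independent of any particular CSV library, is to peel those
junk lines off first and only hand the remaining text to the reader. A minimal
sketch, assuming a fixed number of preamble lines ('export.csv' and the count
of 3 are invented values):

    "Skip a known number of preamble lines, then hand the rest to NeoCSVReader.
     The file name and the line count are made up for the example."
    parseSkipping := [ :fileName :linesToSkip | | stream |
        stream := fileName asFileReference contents readStream.
        linesToSkip timesRepeat: [ stream nextLine ].
        (NeoCSVReader on: stream) upToEnd ].

    records := parseSkipping value: 'export.csv' value: 3.

Trailing junk can be handled the same way, by cutting the text before building
the read stream.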

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.
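
Failing a shared suite, even shared synthetic input would make the numbers
comparable across readers. A throwaway generator along these lines (file name
and row layout invented for the example) would be enough:

    "Build a simple synthetic CSV file to benchmark against."
    data := String streamContents: [ :out |
        out nextPutAll: 'id,name,value'; lf.
        1 to: 10000 do: [ :i |
            out print: i; nextPut: $,;
                nextPutAll: 'row'; print: i; nextPut: $,;
                print: i * 3; lf ] ].
    'bench.csv' asFileReference writeStreamDo: [ :out | out nextPutAll: data ].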
