pharo-users@lists.pharo.org

NeoCSVReader and wrong number of fieldAccessors

J
jtuchel@objektfabrik.de
Mon, Jan 4, 2021 1:36 PM

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.
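
Roughly like this in a Playground (just a sketch, not our real import
code; the exact counts will of course depend on the NeoCSV version you
have loaded):

| input records |
input := 'a1,a2,a3,a4
b1,b2,b3,b4
c1,c2,c3,c4'.
records := (NeoCSVReader on: input readStream)
	addField;
	addField;
	addField; "only three fieldAccessors for a four-column file"
	upToEnd.
records size. "I would expect 3 here, but the mismatch produces 6"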

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check whether the number of
separators in each line matches the number of fieldAccessors minus 1 (and
go through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds would I hand a Stream with the whole
line to the reader and do a #next.
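
A naive sketch of what I mean (it deliberately ignores the quoting
problem; #buildReaderOn: is just a placeholder for however we construct
the configured NeoCSVReader from the user-defined mapping, and
aReadStream is the uploaded file's stream):

| expectedFieldCount records |
expectedFieldCount := 4. "taken from the user-defined import definition"
records := OrderedCollection new.
[ aReadStream atEnd ] whileFalse: [
	| line |
	line := aReadStream nextLine.
	(line occurrencesOf: $,) = (expectedFieldCount - 1)
		ifTrue: [ records add: (self buildReaderOn: line readStream) next ]
		ifFalse: [ self error: 'Line does not match the import definition: ', line ] ].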

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible with the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim

J
jtuchel@objektfabrik.de
Mon, Jan 4, 2021 1:46 PM

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

PD
Paul DeBruicker
Mon, Jan 4, 2021 6:23 PM

After instantiating the reader and before doing the reading you can
#readHeader and check that the reader field count and header field count
match.  Would that help?

If the CSV doesn't use headers, then you can process the "header" as the
first record and then process the rest of the file.
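
Something along these lines (just a sketch; expectedFieldCount stands in
for however many fields your import definition configures, and aStream
is the uploaded CSV):

| reader header expectedFieldCount |
expectedFieldCount := 4.
reader := NeoCSVReader on: aStream.
header := reader readHeader.
header size = expectedFieldCount
	ifFalse: [ self error: 'Header has ', header size printString,
		' columns, but the import definition expects ', expectedFieldCount printString ].
"only now add the fieldAccessors and read the records as usual"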

jtuchel wrote

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb

jtuchel@

:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:

jtuchel@

Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

NeoCSVEndlessLoopTest.st (2K)
<http://forum.world.st/attachment/5125853/0/NeoCSVEndlessLoopTest.st>

SV
Sven Van Caekenberghe
Mon, Jan 4, 2021 7:57 PM

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 7:21 AM

Paul,

thank you very much for this idea. Your suggestion is probably "good
enough" to at least catch errors when the number of columns doesn't
match in the whole file or the first row. For my use case, it wouldn't
make any difference if the first row contains header information or not.
There might be cases where you should check each and every line and
produce an error if the number of columns doesn't match.

So I'll definitely take a look at #readHeader and see if I can use this
as a guard against maybe 80% of error cases, which is already a great
step forward.

For now I am still hunting for the endless loop, which I guess is
related, but I cannot say for sure yet. Maybe your fix also
significantly reduces the risk of these loops.

Joachim

As a side note: isn't it interesting how many creative ideas people have
around what they call "CSV export"? Like a header of 12 nicely formatted
text lines that beautifully displays in Excel, but of course doesn't
resemble anything near the idea of CSV... Some even add some column
separators at the beginning of header lines to make the text appear in
column B or C, but never append any empty fields in their header lines...

Am 04.01.21 um 19:23 schrieb Paul DeBruicker:

After instantiating the reader and before doing the reading you can
#readHeader and check that the reader field count and header field count
match.  Would that help?

If the CSV doesn't use headers then you can process the "header" as  the
first record and then process the rest of the file.

jtuchel wrote

Please find attached a small test case to demonstrate what I mean. There
is just some nonsense Business Object class and a simple test case in
this fileout.

Am 04.01.21 um 14:36 schrieb
jtuchel@
:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there
are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to
poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import
that may have been working for years may one day tear a whole web
server image down because of a wrong number of fieldAccessors. This is
bad on many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use
the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
with 4 columns each. If you remove one of the fieldAccessors, an
#upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It
seems to not peek forward to the end of the line. I have the gut
feeling #peekChar should peek instead of reading the #next character
from the input Stream, but #peekChar has too many senders to just go
ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number
of separators matches the number of fieldAccessors minus 1 (and go
through the hoops of handling separators in quoted fields and
such...). Only if that test succeeds, I would then hand a Stream with
the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me
the line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:
jtuchel@
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

NeoCSVEndlessLoopTest.st (2K)
<http://forum.world.st/attachment/5125853/0/NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 7:49 AM

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I
remember reading your comment on this in some other discussion long ago.
You wrote you don't see it as a responsibility of a parser and that you
wouldn't want to add this to NeoCSV. I must say I tend to agree mostly.
Whatever you do at parsing can only cover part of the problems related
to validation. There will be checks that require access to other fields
from the same line, or some object that will be the owner of the
Collection that you are just importing, so a lot of validation must be
done after parsing anyway.

So I think we can mostly ignore the validation part. Whatever a reader
will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with
#addField:converter: would help a lot, however.
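
Right now I would probably have to wrap the converter block myself,
roughly like this (just a sketch; the field names and the error text are
made up, the #addField:converter: and recordClass: usage follows your
Playground examples):

(NeoCSVReader on: 'foo,1
bar,not-a-number' readStream)
	recordClass: Dictionary;
	addField: [ :obj :value | obj at: #name put: value ];
	addField: [ :obj :value | obj at: #amount put: value ]
		converter: [ :string |
			[ Number readFrom: string ]
				on: Error
				do: [ :ex | self error: 'cannot convert ', string printString, ' for field #amount' ] ];
	upToEnd.

That at least stops the import with a descriptive error instead of
silently producing a broken record, but having something built in would
be nicer.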

I am glad you agree on the underflow bug. This is more a question of
well-formedness than of validation. If a reader finds out it doesn't fit
the file's structure, it should tell the user/developer about it or at
least gracefully return some more or less incomplete object resembling
what it could parse. But it shouldn't cross line borders and return a
wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an
ideal situation if one user of our Seaside Application completely blocks
an image that may be serving a few other users by just using a CSV
parser that doesn't fit with the file. I suspect this has to do with
#readEndOfLine in some special case of the underflow bug, but cannot
prove it yet. But I have a file and parser that reliably goes into an
endless loop. I just need to isolate the bare CSV parsing from the whole
machinery we've built around NeoCSV reader for these user-defined
mappings... I wouldn't be surprised if it is a problem buried somewhere
in our preparations in building a parser from user-defined data... I
will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method
called #peekChar, but it doesn't #peek. It buffers a character and does
read the #next character. I tried replacing the #next with #peek, but
that is definitely a shortcut to 100% CPU, because #peekChar is used a
lot, not only for consuming an "unmapped remainder" of a line... I
somehow have the feeling that at least in #readEndOfLine the next char
should be peeked instead of consumed in order to find out if it's
workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is
more data after all fieldAccessors have been applied to the line (see
#readNextRecordAsObject)? Otoh, at one point the reader has to skip to
the next line, so I am not sure if peek has any place here... I need to
debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  1. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

SV
Sven Van Caekenberghe
Tue, Jan 5, 2021 4:49 PM

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new behaviour.
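
By way of illustration only (this sketch is not taken from the commit; whether a field/column mismatch now signals, and which error class it uses, is best checked against the tests themselves), a reader defined with fewer fields than the input has columns can be exercised like this, with a handler ready in case the new code signals:

| input reader |
input := 'foo,1
bar,2
foobar,3'.
reader := (NeoCSVReader on: input readStream) addField; yourself. "1 field defined, 2 columns in the data"
[ reader upToEnd ]
	on: Error
	do: [ :ex | 'field definitions do not match the input: ', ex messageText ].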

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.
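
A workable stop-gap with the current API is to make the converter block defend itself. A sketch only; the Dictionary record class, the #name/#amount keys and the nil fallback are illustrative choices, not NeoCSV requirements:

| reader |
reader := NeoCSVReader on: 'foo,1
bar,oops' readStream.
reader
	recordClass: Dictionary;
	addField: [ :obj :str | obj at: #name put: str ];
	addField: [ :obj :value | obj at: #amount put: value ]
		converter: [ :str |
			"fall back to nil (or remember str and ex for a report) when the conversion fails"
			[ Number readFrom: str ] on: Error do: [ :ex | nil ] ].
reader upToEnd.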

I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit for a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside Application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit with the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably goes into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...
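
For reference, the #peek / #next distinction this hinges on, shown on a plain ReadStream (nothing NeoCSV-specific, just a throwaway snippet):

| s |
s := 'a,b' readStream.
s peek.    "$a, does not advance the stream"
s next.    "$a, advances the stream"
s peek.    "$,"
s upToEnd. "',b'"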

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 6:52 PM

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in
order to try your changes....
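
For anyone in the same situation, loading the GitHub version into a Pharo image is usually a Metacello one-liner (a sketch; the 'repository' subdirectory is an assumption about the project layout):

Metacello new
	baseline: 'NeoCSV';
	repository: 'github://svenvc/NeoCSV/repository';
	load.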

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't
understand it is expected to return a Boolean. Not sure why. I thought
of 'read' as a command, not a question in simple past..., so I thought
its job should be to read the rest of the line if we're not there yet)

Joachim

Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit for a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside Application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit with the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably goes into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Tue, Jan 5, 2021 8:06 PM

Sven,

I tested your change with the file and filter (our own way of defining
csv mappings by the end users) which used to send our application into
an endless loop.

And voila: we get an exception instead of a frozen image! I will give
the conversion errors a test drive tomorrow.
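
For others guarding user-triggered imports the same way, the shield around the read can be as small as this sketch (csvContents stands in for the uploaded text, and plain Error stands in for whatever more specific error the new NeoCSV code signals):

| csvContents reader |
csvContents := 'a,1
b,2'. "stands in for whatever the end user uploaded"
reader := (NeoCSVReader on: csvContents readStream)
	addField;
	addField;
	yourself.
[ reader upToEnd ]
	on: Error
	do: [ :ex |
		"report ex messageText back to the user instead of letting the request hang or fail"
		#() ].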

I am absolutely happy with your change. Thank you very much.

Joachim

P.S: I even learned a little bit about Iceberg. I am not really sure
each of my mouse clicks made sense, but I had your commit in the image
and could test it and port the deltas over to my Smalltalk dialect...

Am 05.01.21 um 19:52 schrieb jtuchel@objektfabrik.de:

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in
order to try your changes....

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't
understand it is expected to return a Boolean. Not sure why. I thought
of 'read' as a command, not a question in simple past..., so I thought
its job should be to read the rest of the line if we're not there yet)

Joachim

Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the
new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I
remember reading your comment on this in some other discussion long
ago. You wrote you don't see it as a responsibility of a parser and
that you wouldn't want to add this to NeoCSV. I must say I tend to
agree mostly. Whatever you do at parsing can only cover part of the
problems related to validation. There will be checks that require
access to other fields from the same line, or some object that will
be the owner of the Collection that you are just importing, so a lot
of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a
reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with
#addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of
well-formedness than of validation. If a reader finds out it doesn't
fit for a file structure, it should tell the user/developer about it
or at least gracefully return some more or less incomplete object
resembling what it could parse. But it shouldn't cross line borders
and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not
an ideal situation if one user of our Seaside Application completely
blocks an image that may be serving a few other users by just using
a CSV parser that doesn't fit with the file. I suspect this has to
do with #readEndOfLine in some special case of the underflow bug,
but cannot prove it yet. But I have a file and parser that reliably
goes into an endless loop. I just need to isolate the bare CSV
parsing from the whole machinery we've built around the NeoCSV reader
for these user-defined mappings... I wouldn't be surprised if it is
a problem buried somewhere in our preparations in building a parser
from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a
method called #peekChar, but it doesn't #peek. It buffers a
character and does read the #next character. I tried replacing the
#next with #peek, but that is definitely a shortcut to 100% CPU,
because #peekChar is used a lot, not only for consuming an "unmapped
remainder" of a line... I somehow have the feeling that at least in
#readEndOfLine the next char should be peeked instead of consumed
in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there
is more data after all fieldAccessors have been applied to the line
(see #readNextRecordAsObject)? Otoh, at one point the reader has to
skip to the next line, so I am not sure if peek has any place
here... I need to debug a little more to understand...

Joachim

Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to
think more about this and experiment a bit. This is what I came up
with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField;
upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary;
addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary;
addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj
:str | obj at: #two put: str]; addField: [ :obj :str | obj at:
#three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary;
emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one
put: str]; addField: [ :obj :str | obj at: #two put: str];
addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be
    read and there are not enough of them in the input (underflow), or
    there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to
be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or
overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that
would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean.
There is just some nonsense Business Object class and a simple
test case in this fileout.

Am 04.01.21 um 14:36 schrieb jtuchel@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less
crazy year than 2020...

I have a question that sounds a bit strange, but we have two
effects with NeoCSVReader related to wrong definitions of the
reader.

One effect is that reading a Stream #upToEnd leads to an endless
loop, the other is that the Reader produces twice as many objects
as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file
to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users
can define using a more or less nice assistant in our
Application. The CSV files they upload to our App come from third
parties like payment providers, banks and other sources. These
change their file structures whenever they feel like it and never
tell anybody. So a CSV import that may have been working for
years may one day tear a whole web server image down because of a
wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working
CSV Reader and comment out one of the addField: commands before
you use the NeoCSVReader to parse a CSV file. Say your CSV file
has 3 lines with 4 columns each. If you remove one of the
fieldAccessors, an #upToEnd will yield an Array of 6 objects
rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented.
It seems to not peek forward to the end of the line. I have the
gut feeling #peekChar should peek instead of reading the #next
character from the input Stream, but #peekChar has too many
senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line
using PositionableStream>>#nextLine and first check each line if
the number of separators matches the number of fieldAccessors
minus 1 (and go through the hoops of handling separators in
quoted fields and such...). Only if that test succeeds, I would
then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files.
Of course I could do this only for some lines, maybe just the
first one. Whatever.

But somehow I have the feeling I should get an exception telling
me the line is not compatible to the Reader's definition or such.
Or #readAtEndOrEndOfLine should just walk the line to the end and
ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

RO
Richard O'Keefe
Wed, Jan 6, 2021 4:10 AM

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530   astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737   NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (where that made sense) to return an Array of Strings.
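
For anyone who wants to repeat this kind of measurement, a rough playground
sketch would look something like the following. The file name is a placeholder
and this shows only the general shape of the timing, not the exact harness
used for the numbers above:

    "Baseline: just read every character of the file."
    charTime := [ 'data.csv' asFileReference readStreamDo: [ :in |
        [ in atEnd ] whileFalse: [ in next ] ] ] timeToRun.

    "NeoCSVReader in its default state, one Array of Strings per record."
    csvTime := [ 'data.csv' asFileReference readStreamDo: [ :in |
        (NeoCSVReader on: in) upToEnd ] ] timeToRun.

    { charTime. csvTime }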
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the
occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)
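
One workaround, independent of any particular CSV library, is to peel those
junk lines off first and only hand the remaining text to the reader. A minimal
sketch, assuming a fixed number of preamble lines ('export.csv' and the count
of 3 are invented values):

    "Skip a known number of preamble lines, then hand the rest to NeoCSVReader.
     The file name and the line count are made up for the example."
    parseSkipping := [ :fileName :linesToSkip | | stream |
        stream := fileName asFileReference contents readStream.
        linesToSkip timesRepeat: [ stream nextLine ].
        (NeoCSVReader on: stream) upToEnd ].

    records := parseSkipping value: 'export.csv' value: 3.

Trailing junk can be handled the same way, by cutting the text before building
the read stream.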

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.
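
Failing a shared suite, even shared synthetic input would make the numbers
comparable across readers. A throwaway generator along these lines (file name
and row layout invented for the example) would be enough:

    "Build a simple synthetic CSV file to benchmark against."
    data := String streamContents: [ :out |
        out nextPutAll: 'id,name,value'; lf.
        1 to: 10000 do: [ :i |
            out print: i; nextPut: $,;
                nextPutAll: 'row'; print: i; nextPut: $,;
                print: i * 3; lf ] ].
    'bench.csv' asFileReference writeStreamDo: [ :out | out nextPutAll: data ].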
