pharo-users@lists.pharo.org

NeoCSVReader and wrong number of fieldAccessors

Stéphane Ducasse
Wed, Jan 6, 2021 8:33 AM

John

I’m sorry to tell you this, but you cannot write a mail like that without telling us where the code of CSVParser is.
You cannot simply discredit the work of Sven without providing code to compare.

S.

On 6 Jan 2021, at 05:10, Richard O'Keefe raoknz@gmail.com wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.
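
A minimal snippet shows the effect (the two-column input is made up, and
only one fieldAccessor is defined):

| input reader |
input := 'a,1
b,2
c,3'.
reader := NeoCSVReader on: input readStream.
reader addField.       "one accessor for a two-column file"
reader upToEnd size.   "6 instead of the expected 3 -- the doubling described above"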

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.
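
Roughly along these lines (just a sketch: it assumes a plain comma
separator with no quoted fields, and stream, expectedFieldCount and
#readerOn: are placeholders for our own setup):

| records |
records := OrderedCollection new.
[ stream atEnd ] whileFalse: [ | line |
    line := stream nextLine.
    (line occurrencesOf: $,) = (expectedFieldCount - 1)
        ifTrue: [ records add: (self readerOn: line readStream) next ]
        ifFalse: [ self error: 'line does not match the reader definition' ] ].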

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim


Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France

Stéphane Ducasse
Wed, Jan 6, 2021 8:35 AM

Another point:

In open source, and in this community, either the code people mention
is open source and accessible, or it does not exist.
If it does not exist, then this is easy :)

S.

On 6 Jan 2021, at 05:10, Richard O'Keefe raoknz@gmail.com wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim


Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France

jtuchel@objektfabrik.de
Wed, Jan 6, 2021 9:22 AM

Richard,

I am not sure what point you are trying to make here.
You have something cooler and faster? Great, how about sharing?
You could make a faster one if it doesn't convert numbers and such?
Great. I guess the time will be spent after parsing in 95% of the use
cases anyway. It depends. And that is exactly what you are saying: the
word "efficient" means nothing without context. How is that related to
this thread?

I think this thread mostly shows the strength of a community, especially
when there are members who are active, friendly and highly motivated. My
problem got solved at blazing speed without me paying anything for it.
Just because Sven thought my problem could be other people's problem as
well.

I am happy with NeoCSV's speed, even if there may be more lightweight and
faster solutions. Tbh, my main concern with NeoCSV is not speed, but how
well I can understand problems and fix them. I care about data types on
parsing. A non-configurable CSV parser gives me a bunch of dictionaries
and Strings. That could be a waste of cycles and memory once you need
the data as objects.
My use case is not importing trillions of records all day, and for a few
hundred or maybe sometimes thousands, it is good/fast enough.

Joachim

On 06.01.21 at 05:10, Richard O'Keefe wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to? What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
 method                time(ms)
 Just read characters   410
 CSVDecoder>>next      3415   astc's CSV reader (defaults). 1.26 x CSVParser
 NeoCSVReader>>next    4798   NeoCSVReader (default state). 1.78 x CSVParser
 CSVParser>>next       2701   pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
 method                time(ms)
 Just read characters    93
 CSVDecoder>>next       530   astc's CSV reader (defaults). 1.26 x CSVParser
 NeoCSVReader>>next     737   NeoCSVReader (default state). 1.75 x CSVParser
 CSVParser>>next        421   pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the
occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:

 Happy new year to all of you! May 2021 be an increasingly less crazy
 year than 2020...


 I have a question that sounds a bit strange, but we have two effects
 with NeoCSVReader related to wrong definitions of the reader.

 One effect is that reading a Stream #upToEnd leads to an endless
 loop,
 the other is that the Reader produces twice as many objects as
 there are
 lines in the file that is being read.

 In both scenarios, the reason is that the CSV Reader has a wrong
 number
 of column definitions.

 Of course that is my fault: why do I feed a "malformed" CSV file
 to poor
 NeoCSVReader?

 Let me explain: we have a few import interfaces which end users can
 define using a more or less nice assistant in our Application. The
 CSV
 files they upload to our App come from third parties like payment
 providers, banks and other sources. These change their file
 structures
 whenever they feel like it and never tell anybody. So a CSV import
 that
 may have been working for years may one day tear a whole web server
 image down because of a wrong number of fieldAccessors. This is
 bad on
 many levels.

 You can easily try the doubling effect at home: define a working CSV
 Reader and comment out one of the addField: commands before you
 use the
 NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
 with 4
 columns each. If you remove one of the fieldAccessors, an #upToEnd
 will
 yield an Array of 6 objects rather than 3.

 I haven't found the reason for the cases where this leads to an
 endless
 loop, but at least this one is clear...

 I *guess* this is due to the way #readEndOfLine is implemented. It
 seems
 to not peek forward to the end of the line. I have the gut feeling
 #peekChar should peek instead of reading the #next character from the
 input Stream, but #peekChar has too many senders to just go ahead and
 mess with it ;-)

 So I wonder if there are any tried approaches to this problem.

 One thing I might do is not use #upToEnd, but read each line using
 PositionableStream>>#nextLine and first check each line if the
 number of
 separators matches the number of fieldAccessors minus 1 (and go
 through
 the hoops of handling separators in quoted fields and such...).
 Only if
 that test succeeds, I would then hand a Stream with the whole line to
 the reader and do a #next.

 This will, however, mean a lot of extra cycles for large files. Of
 course I could do this only for some lines, maybe just the first one.
 Whatever.


 But somehow I have the feeling I should get an exception telling
 me the
 line is not compatible to the Reader's definition or such. Or
 #readAtEndOrEndOfLine should just walk the line to the end and ignore
 the rest of the line, returning an incomplete object....


 Maybe I am just missing the right setting or switch? What best
 practices
 did you guys come up with for such problems?


 Thanks in advance,


 Joachim

--

Objektfabrik Joachim Tuchel          jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

Sven Van Caekenberghe
Wed, Jan 6, 2021 9:52 AM

Hi Richard,

Benchmarking is a can of worms; many factors have to be considered. But the first requirement is obviously to be completely open about what you are doing and what you are comparing.

NeoCSV contains a simple benchmark suite called NeoCSVBenchmark, which was used during development. Note that it is a bit tricky to use: you need to run a write benchmark with a specific configuration before you can try read benchmarks.

The core data is a 100,000-line file (2.5 MB) like this:

1,-1,99999
2,-2,99998
3,-3,99997
4,-4,99996
5,-5,99995
6,-6,99994
7,-7,99993
8,-8,99992
9,-9,99991
10,-10,99990
...

That parses in ~250ms on my machine.
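
Generating and timing data of that shape takes only a few lines (a sketch,
not NeoCSVBenchmark itself; the numbers will of course differ per machine):

| data milliseconds |
data := String streamContents: [ :out |
    1 to: 100000 do: [ :i |
        out print: i; nextPut: $,; print: i negated; nextPut: $,;
            print: 100000 - i; nextPut: Character lf ] ].
milliseconds := Time millisecondsToRun: [
    (NeoCSVReader on: data readStream) upToEnd ].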

NeoCSV has quite a few features and handles various edge cases. Obviously, a minimal, custom implementation could be faster.

NeoCSV is called efficient not just because it is reasonably fast, but because it can be configured to generate domain objects without intermediate structures and because it can convert individual fields (parse numbers, dates, times, ...) while parsing.
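
For example, a reader can build Dictionaries (or any record class) and
convert a field while parsing; here the conversion happens right in the
accessor block, and the field names #name and #count are only illustrative
(#addField:converter:, mentioned elsewhere in this thread, factors the
conversion out per field):

| input |
input := 'foo,1
bar,2
foobar,3'.
(NeoCSVReader on: input readStream)
    recordClass: Dictionary;
    addField: [ :record :string | record at: #name put: string ];
    addField: [ :record :string | record at: #count put: string asInteger ];
    upToEnd.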

Like you said, some generated CSV output out in the wild is very irregular. I try to stick with standard CSV as much as possible.

Sven

On 6 Jan 2021, at 05:10, Richard O'Keefe raoknz@gmail.com wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim

jtuchel@objektfabrik.de
Wed, Jan 6, 2021 10:21 AM

Hi Sven,

I must say I am really happy with your change. We get a nice exception
whenever the number of fields in a line doesn't match the number of
defined fieldAccessors. So far it seems the endless loops are gone as
well. What a leap forward!
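
Something along these lines is now enough to keep a malformed upload from
freezing the image (just a sketch; Error stands in for the specific
exception class, and #reportImportProblem: is a placeholder for our own
error handling):

[ reader upToEnd ]
    on: Error
    do: [ :ex | self reportImportProblem: ex ]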

I'm adding an issue on GitHub about the conversion errors; I hope that
is a convenient place for such comments/ideas?

Joachim

On 05.01.21 at 21:06, jtuchel@objektfabrik.de wrote:

Sven,

I tested your change with the file and filter (our own way of defining
csv mappings by the end users) which used to send our application into
an endless loop.

And voila: we get an exception instead of a frozen image! I will give
the conversion errors a test drive tomorrow.

I am absolutely happy with your change. Thank you very much.

Joachim

P.S: I even learned a little bit about Iceberg. I am not really sure
each of my mouse clicks made sense, but I had your commit in the image
and could test it and port the deltas over to my Smalltalk dialect...

On 05.01.21 at 19:52, jtuchel@objektfabrik.de wrote:

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in
order to try your changes....

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't
understand it is expected to return a Boolean. Not sure why. I
thought of 'read' as a command, not a question in simple past..., so
I thought its job should be to read the rest of the line if we're not
there yet)

Joachim

On 05.01.21 at 17:49, Sven Van Caekenberghe wrote:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the
new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV,
I remember reading your comment on this in some other discussion
long ago. You wrote you don't see it as a responsibility of a
parser and that you wouldn't want to add this to NeoCSV. I must say
I tend to agree mostly. Whatever you do at parsing can only cover
part of the problems related to validation. There will be checks
that require access to other fields from the same line, or some
object that will be the owner of the Collection that you are just
importing, so a lot of validation must be done after parsing anyway.

So I think we can mostly ignore the validation part. Whatever a
reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with
#addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question
of well-formedness than of validation. If a reader finds out it
doesn't fit for a file structure, it should tell the user/developer
about it or at least gracefully return some more or less incomplete
object resembling what it could parse. But it shouldn't cross line
borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not
an ideal situation if one user of our Seaside Application
completely blocks an image that may be serving a few other users by
just using a CSV parser that doesn't fit the file. I suspect
this has to do with #readEndOfLine in some special case of the
underflow bug, but cannot prove it yet. But I have a file and
parser that reliably goes into an endless loop. I just need to
isolate the bare CSV parsing from the whole machinery we've built
around NeoCSV reader for these user-defined mappings... I wouldn't
be surprised if it is a problem buried somewhere in our
preparations in building a parser from user-defined data... I will
report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a
method called #peekChar, but it doesn't #peek. It buffers a
character and does read the #next character. I tried replacing the
#next with #peek, but that is definitely a shortcut to 100% CPU,
because #peekChar is used a lot, not only for consuming an
"unmapped remainder" of a line... I somehow have the feeling that
at least in #readEndOfLine the next char should be peeked instead
of consumed in order to find out if it's workload or part of the
crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there
is more data after all fieldAccessors have been applied to the line
(see #readNextRecordAsObject)? Otoh, at one point the reader has to
skip to the next line, so I am not sure if peek has any place
here... I need to debug a little more to understand...

Joachim

On 04.01.21 at 20:57, Sven Van Caekenberghe wrote:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to
think more about this and experiment a bit. This is what I came up
with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream)
    recordClass: Dictionary;
    addField: [ :obj :str | obj at: #one put: str ];
    upToEnd.
(NeoCSVReader on: input readStream)
    recordClass: Dictionary;
    addField: [ :obj :str | obj at: #one put: str ];
    addField: [ :obj :str | obj at: #two put: str ];
    addField: [ :obj :str | obj at: #three put: str ];
    upToEnd.
(NeoCSVReader on: input readStream)
    recordClass: Dictionary;
    emptyFieldValue: #passNil;
    addField: [ :obj :str | obj at: #one put: str ];
    addField: [ :obj :str | obj at: #two put: str ];
    addField: [ :obj :str | obj at: #three put: str ];
    upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be
    read and there are not enough of them in the input (underflow), or
    there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to
be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or
overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example;
that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I
mean. There is just some nonsense Business Object class and a
simple test case in this fileout.

On 04.01.21 at 14:36, jtuchel@objektfabrik.de wrote:

Happy new year to all of you! May 2021 be an increasingly less
crazy year than 2020...

I have a question that sounds a bit strange, but we have two
effects with NeoCSVReader related to wrong definitions of the
reader.

One effect is that reading a Stream #upToEnd leads to an endless
loop, the other is that the Reader produces twice as many
objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong
number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file
to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users
can define using a more or less nice assistant in our
Application. The CSV files they upload to our App come from
third parties like payment providers, banks and other sources.
These change their file structures whenever they feel like it
and never tell anybody. So a CSV import that may have been
working for years may one day tear a whole web server image down
because of a wrong number of fieldAccessors. This is bad on many
levels.

You can easily try the doubling effect at home: define a working
CSV Reader and comment out one of the addField: commands before
you use the NeoCSVReader to parse a CSV file. Say your CSV file
has 3 lines with 4 columns each. If you remove one of the
fieldAccessors, an #upToEnd will yield an Array of 6 objects
rather than 3.

I haven't found the reason for the cases where this leads to an
endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented.
It seems to not peek forward to the end of the line. I have the
gut feeling #peekChar should peek instead of reading the #next
character from the input Stream, but #peekChar has too many
senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line
using PositionableStream>>#nextLine and first check each line if
the number of separators matches the number of fieldAccessors
minus 1 (and go through the hoops of handling separators in
quoted fields and such...). Only if that test succeeds, I would
then hand a Stream with the whole line to the reader and do a
#next.

This will, however, mean a lot of extra cycles for large files.
Of course I could do this only for some lines, maybe just the
first one. Whatever.

But somehow I have the feeling I should get an exception telling
me the line is not compatible to the Reader's definition or
such. Or #readAtEndOrEndOfLine should just walk the line to the
end and ignore the rest of the line, returning an incomplete
object....

Maybe I am just missing the right setting or switch? What best
practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

SV
Sven Van Caekenberghe
Wed, Jan 6, 2021 10:43 AM

Joachim,

On 6 Jan 2021, at 11:21, jtuchel@objektfabrik.de wrote:

Hi Sven,

I must say I am really happy with your change. We get a nice exception whenever the number of fields doesn't match the number of defined fieldAccessors. So far it also seems the endless loops are gone as well. What a leap forward!

Thank you for your kind words.

But thank you as well: it really helps to get constructive feedback from actual users, to improve the code for everyone.

I'm adding an issue on github about the conversion errors, I hope that is a convenient place for such comments/ideas?

Did you see NeoCSVReaderTests>>#testConversionErrors ?

It is not perfect, but you do get an error when a number conversion fails; you could make your own conversions fail in the same way.

(NeoCSVReader on: 'a' readStream) addIntegerField; upToEnd.
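
Building on that, here is a minimal sketch of what making your own conversions fail could look like, using only the two-argument #addField: block form shown earlier in this thread; the field names, the sample data and the error text are invented for illustration:

"Hypothetical sketch: a user-defined field whose conversion signals a
 descriptive Error when the value does not parse, similar in spirit to
 what #addIntegerField does for numbers."
(NeoCSVReader on: 'foo,1
bar,not-a-number' readStream)
	recordClass: Dictionary;
	addField: [ :obj :str | obj at: #name put: str ];
	addField: [ :obj :str |
		obj at: #count put: (str asInteger
			ifNil: [ Error signal: 'not a number: ', str ]) ];
	upToEnd.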

Like you said: some validation can be done at the CSV level, but certainly not everything.

Sven

Joachim

On 05.01.21 at 21:06, jtuchel@objektfabrik.de wrote:

Sven,

I tested your change with the file and filter (our own way of defining csv mappings by the end users) which used to send our application into an endless loop.

And voila: we get an exception instead of a frozen image! I will give the conversion errors a test drive tomorrow.

I am absolutely happy with your change. Thank you very much.

Joachim

P.S: I even learned a little bit about Iceberg. I am not really sure each of my mouse clicks made sense, but I had your commit in the image and could test it and port the deltas over to my Smalltalk dialect...

On 05.01.21 at 19:52, jtuchel@objektfabrik.de wrote:

Hi Sven,

all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now in order to try your changes....

Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't understand that it is expected to return a Boolean. Not sure why. I thought of 'read' as a command, not a question in simple past..., so I thought its job should be to read the rest of the line if we're not there yet)

Joachim

On 05.01.21 at 17:49, Sven Van Caekenberghe wrote:

Hi Joachim,

Have a look at the following commit:

https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuchel@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.

A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit for a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.

I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside Application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit with the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably goes into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!

One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...

Joachim

On 04.01.21 at 20:57, Sven Van Caekenberghe wrote:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.

In my opinion there are two distinct issues:

  1. what to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

  2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuchel@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

On 04.01.21 at 14:36, jtuchel@objektfabrik.de wrote:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...

I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the other is that the Reader produces twice as many objects as there are lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our Application. The CSV files they upload to our App come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check each line if the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds, I would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.

But somehow I have the feeling I should get an exception telling me the line is not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

<NeoCSVEndlessLoopTest.st>

--

Objektfabrik Joachim Tuchel mailto:jtuchel@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

RO
Richard O'Keefe
Thu, Jan 7, 2021 6:15 AM

You aren't sure what point I was making?
How about the one I actually wrote down:
What test data was NeoCSV benchmarked with
and can I get my hands on it?
THAT is the point.  The data points I showed (and
many others I have not) are not satisfactory to me.
I have been searching for CSV test collections.
One site offered 6 files of which only one downloaded.
I found a "benchmark suite" for CSV containing no
actual CSV files.
So where else should I look for benchmark data than the data
associated with a parser that people in this community are
generally happy with and that is described as "efficient"?

Is it so unreasonable to suspect that my results might
be a fluke?  Is it bad manners to assume that something
described as efficient has tests showing that?
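
In the absence of a shared corpus, one way to make such comparisons at least repeatable is to generate the test data in-image and time the reader on it. A minimal sketch follows; the file name, column layout and row count are invented, and only the NeoCSVReader calls already shown in this thread are used:

"Hypothetical sketch: write a synthetic CSV file, then measure how long
 NeoCSVReader takes to read it back as Arrays of Strings."
| file |
file := 'synthetic-benchmark.csv' asFileReference.
file ensureDelete.
file writeStreamDo: [ :out |
	out nextPutAll: 'id,name,value'; nextPut: Character lf.
	1 to: 10000 do: [ :i |
		out
			print: i; nextPut: $,;
			nextPutAll: 'row'; print: i; nextPut: $,;
			print: i * 3; nextPut: Character lf ] ].
Time millisecondsToRun: [
	file readStreamDo: [ :in |
		(NeoCSVReader on: in) upToEnd ] ].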

On Wed, 6 Jan 2021 at 22:23, jtuchel@objektfabrik.de wrote:

Richard,

I am not sure what point you are trying to make here.
You have something cooler and faster? Great, how about sharing?
You could make a faster one when it doesn't convert numbers and stuff?
Great. I guess the time will be spent after parsing in 95% of the use
cases. It depends. And that is exactly what you are saying. The word
efficient means nothing without context. How is that related to this thread?

I think this thread mostly shows the strength of a community, especially
when there are members who are active, friendly and highly motivated. My
problem got solved at blazing speed without me paying anything for it. Just
because Sven thought my problem could be other people's problem as well.

I am happy with NeoCSV's speed, even if there may be more lightweight and
faster solutions. Tbh, my main concern with NeoCSV is not speed, but how
well I can understand problems and fix them. I care about data types on
parsing. A non-configurable csv parser gives me a bunch of dictionaries and
Strings. That could be a waste of cycles and memory once you need the data
as objects.
My use case is not importing trillions of records all day, and for a few
hundred or maybe sometimes thousands, it is good/fast enough.

Joachim

On 06.01.21 at 05:10, Richard O'Keefe wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters   410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next       2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next       530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next     737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the
occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

J
jtuchel@objektfabrik.de
Thu, Jan 7, 2021 7:29 AM

Richard,

On 07.01.21 at 07:15, Richard O'Keefe wrote:

You aren't sure what point I was making?

Exactly: the thread you answered was about a possible bug in the NeoCSV
parser, while your post was about your doubts regarding the efficiency
claim on the parser's web site. So you threw in a completely unrelated
topic and came across as more or less destructive (maybe that word is
too harsh, I am not a native English speaker... maybe "challenging"
is a better word?).

I cannot comment on the efficiency of NeoCSV, other than that it is fast
enough for my use case and gives me the option of combining reading CSV
and producing objects in one run, even if some checks, back-pointering
and the like have to be done after parsing. It has a nice API and is
supported quite well. The thread and Sven's reaction underline this last
point quite impressively: my bug was fixed within hours.

How about the one I actually wrote down:
  What test data was NeoCSV benchmarked with
  and can I get my hands on it?

That is a valid question. It is off-topic in this thread, however, and
maybe your tone was a bit less kind than it should have been.
Nevertheless, the discussion itself is worth its own thread. If the raw
speed of reading lots of CSV data is a concern in a use case, we should
look for and at alternatives. You are of course free to ask about
alternatives, present your measurements or an alternative implementation
and ask for comments, ideas, all kinds of input. That's what yields progress.

THAT is the point. The data points I showed (and
many others I have not) are not satisfactory to me.

Fine, and absolutely worth discussing, perhaps in its own thread opened
with a friendly invitation to discuss. Your post read more like "Oh,
and, by the way, NeoCSV sucks". That may be unintended, but it is what
I read.

I have been searching for CSV test collections.
One site offered 6 files of which only one downloaded.
I found a "benchmark suite" for CSV containing no
actual CSV files.
So where else should I look for benchmark data than the data
associated with a parser that people in this community are
generally happy with and that is described as "efficient"?

So you would like the developers of NeoCSV to provide test data that
allows for benchmarking and comparison? A valid point.

Is it so unreasonable to suspect that my results might
be a fluke?  Is it bad manners to assume that something
described as efficient has tests showing that?

Well, no. It is absolutely okay to ask if a claim like "efficient" can
be proven. You are free to present better choices and discuss your
definition of efficiency.

For me personally, your post sounded a bit like some earlier ones of
yours which seemed to have no other point than "I have something better,
but I won't show you". Hence my reaction. I may have read something into
your post that you didn't actually write. Sorry for that.

Joachim

On Wed, 6 Jan 2021 at 22:23, jtuchel@objektfabrik.de wrote:

 Richard,

 I am not sure what point you are trying to make here.
 You have something cooler and faster? Great, how about sharing?
 You could make a faster one when it doesn't convert numbers and
 stuff? Great. I guess the time will be spent after parsing in 95%
 of the use cases. It depends. And that is exactly what you are
 saying. The word efficient means nothing without context. How is
 that related to this thread?

 I think this thread mostly shows the strength of a community,
 especially when there are members who are active, friendly and
 highly motivated. My problem got solved at blazing speed without
 me paying anything for it. Just because Sven thought my problem
 could be other people's problem as well.

 I am happy with NeoCSV's speed, even if there may be more
 lightweight and faster solutions. Tbh, my main concern with NeoCSV
 is not speed, but how well I can understand problems and fix them.
 I care about data types on parsing. A non-configurable csv parser
 gives me a bunch of dictionaries and Strings. That could be a
 waste of cycles and memory once you need the data as objects.
 My use case is not importing trillions of records all day, and for
 a few hundred or maybe sometimes thousands, it is good/fast enough.


 Joachim





 On 06.01.21 at 05:10, Richard O'Keefe wrote:
 NeoCSVReader is described as efficient.  What is that
 in comparison to?  What benchmark data are used?
 Here are benchmark results measured today.
 (5,000 data line file, 9,145,009 characters).
  method                time(ms)
  Just read characters   410
  CSVDecoder>>next      3415   astc's CSV reader (defaults). 1.26 x CSVParser
  NeoCSVReader>>next    4798   NeoCSVReader (default state). 1.78 x CSVParser
  CSVParser>>next       2701   pared-to-the-bone CSV reader. 1.00 reference.

 (10,000 data line file, 1,544,836 characters).
  method                time(ms)
  Just read characters    93
  CSVDecoder>>next       530   astc's CSV reader (defaults). 1.26 x CSVParser
  NeoCSVReader>>next     737   NeoCSVReader (default state). 1.75 x CSVParser
  CSVParser>>next        421   pared-to-the-bone CSV reader. 1.00 reference.

 CSVParser is just 78 lines and is not customisable.  It really is
 stripped to pretty much an absolute minimum.  All of the parsers
 were configured (if that made sense) to return an Array of Strings.
 Many of the CSV files I've worked with use short records instead
 of ending a line with a lot of commas.  Some of them also have
 the occasional stray comment off to the right, not mentioned in
 the header.
 I've also found it necessary to skip multiple lines at the beginning
 and/or end. (Really, some government agencies seem to have NO idea
 that anyone might want to do more with a CSV file than eyeball it in
 Excel.)

 If there is a benchmark suite I can use to improve CSVDecoder,
 I would like to try it out.

 On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de wrote:

      Happy new year to all of you! May 2021 be an increasingly less crazy
      year than 2020...

      I have a question that sounds a bit strange, but we have two effects
      with NeoCSVReader related to wrong definitions of the reader.

      One effect is that reading a Stream #upToEnd leads to an endless loop,
      the other is that the Reader produces twice as many objects as there
      are lines in the file that is being read.

      In both scenarios, the reason is that the CSV Reader has a wrong
      number of column definitions.

      Of course that is my fault: why do I feed a "malformed" CSV file to
      poor NeoCSVReader?

      Let me explain: we have a few import interfaces which end users can
      define using a more or less nice assistant in our Application. The CSV
      files they upload to our App come from third parties like payment
      providers, banks and other sources. These change their file structures
      whenever they feel like it and never tell anybody. So a CSV import
      that may have been working for years may one day tear a whole web
      server image down because of a wrong number of fieldAccessors. This is
      bad on many levels.

      You can easily try the doubling effect at home: define a working CSV
      Reader and comment out one of the addField: commands before you use
      the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines
      with 4 columns each. If you remove one of the fieldAccessors, an
      #upToEnd will yield an Array of 6 objects rather than 3.

      I haven't found the reason for the cases where this leads to an
      endless loop, but at least this one is clear...

      I *guess* this is due to the way #readEndOfLine is implemented. It
      seems to not peek forward to the end of the line. I have the gut
      feeling #peekChar should peek instead of reading the #next character
      from the input Stream, but #peekChar has too many senders to just go
      ahead and mess with it ;-)

      So I wonder if there are any tried approaches to this problem.

      One thing I might do is not use #upToEnd, but read each line using
      PositionableStream>>#nextLine and first check each line if the number
      of separators matches the number of fieldAccessors minus 1 (and go
      through the hoops of handling separators in quoted fields and
      such...). Only if that test succeeds, I would then hand a Stream with
      the whole line to the reader and do a #next.

      This will, however, mean a lot of extra cycles for large files. Of
      course I could do this only for some lines, maybe just the first one.
      Whatever.

      But somehow I have the feeling I should get an exception telling me
      the line is not compatible to the Reader's definition or such. Or
      #readAtEndOrEndOfLine should just walk the line to the end and ignore
      the rest of the line, returning an incomplete object....

      Maybe I am just missing the right setting or switch? What best
      practices did you guys come up with for such problems?

      Thanks in advance,

      Joachim
 --
 -----------------------------------------------------------------------
 Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
 Fliederweg 1                         http://www.objektfabrik.de
 D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
 Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

SV
Sven Van Caekenberghe
Thu, Jan 7, 2021 8:05 AM

On 7 Jan 2021, at 07:15, Richard O'Keefe raoknz@gmail.com wrote:

You aren't sure what point I was making?
How about the one I actually wrote down:
What test data was NeoCSV benchmarked with
and can I get my hands on it?
THAT is the point.  The data points I showed (and
many others I have not) are not satisfactory to me.
I have been searching for CSV test collections.
One site offered 6 files of which only one downloaded.
I found a "benchmark suite" for CSV containing no
actual CSV files.
So where else should I look for benchmark data than
associated with a parser people in this community are
generally happy with that is described as "efficient"?

Did you actually read my email and look at the code?

NeoCSVBenchmark generates its own test data.
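
For anyone who wants something concrete to poke at, here is a minimal sketch (Pharo, using only the public NeoCSVWriter API) that produces data of the same shape as the generated benchmark input quoted further down in this thread; it is an illustration, not the actual NeoCSVBenchmark code:

| out writer |
out := WriteStream on: String new.
writer := NeoCSVWriter on: out.
writer fieldWriter: #raw. "unquoted fields, matching the sample rows; #quoted is the default"
1 to: 100000 do: [ :each |
    writer nextPut: {
        each printString.
        each negated printString.
        (100000 - each) printString } ].
out contents. "'1,-1,99999', '2,-2,99998', ... one record per line"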

Is it so unreasonable to suspect that my results might
be a fluke?  Is it bad manners to assume that something
described as efficient has tests showing that?

On Wed, 6 Jan 2021 at 22:23, jtuchel@objektfabrik.de jtuchel@objektfabrik.de wrote:
Richard,

I am not sure what point you are trying to make here.
You have something cooler and faster? Great, how about sharing?
You could make a faster one when it doesn't convert numbers and stuff? Great. I guess the time will be spent after parsing in 95% of the use cases. It depends. And that is exactly what you are saying. The word efficient means nothing without context. How is that related to this thread?

I think this thread mostly shows the strength of a community, especially when there are members who are active, friendly and highly motivated. My problem got solved at blazing speed without me paying anything for it, simply because Sven thought my problem could be other people's problem as well.

I am happy with NeoCSV's speed, even if there may be more lightweight and faster solutions. To be honest, my main concern with NeoCSV is not speed, but how well I can understand problems and fix them. I care about data types on parsing. A non-configurable CSV parser gives me a bunch of dictionaries and Strings. That could be a waste of cycles and memory once you need the data as objects.
My use case is not importing trillions of records all day, and for a few hundred or maybe sometimes thousands, it is good/fast enough.

Joachim

Am 06.01.21 um 05:10 schrieb Richard O'Keefe:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next      2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next      530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the occasional stray comment off to the right, not mentioned in the header.
I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de jtuchel@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim

--

Objektfabrik Joachim Tuchel
mailto:jtuchel@objektfabrik.de

Fliederweg 1
http://www.objektfabrik.de

D-71640 Ludwigsburg
http://joachimtuchel.wordpress.com

Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

RO
Richard O'Keefe
Thu, Jan 7, 2021 8:54 AM

Thank you very much.
I converted your benchmark to my Smalltalk dialect and was
pleased with the results.  This gave me the impetus I needed
to implement the equivalent of NeoCSVReader's #recordClass: feature
in my own reader, although in my case it requires the class to
implement #withAll: and the operand is a (reused) OrderedCollection.
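
For readers following along, this is roughly what the feature looks like on the NeoCSVReader side: the reader instantiates the record class itself and sends the per-column setters, so no intermediate Arrays or Dictionaries are built. A minimal sketch; the Transaction class and its #id: #amount: #balance: setters are made up for the example:

| reader |
reader := NeoCSVReader on: '1,-1,99999
2,-2,99998
3,-3,99997' readStream.
reader
    recordClass: Transaction;
    addIntegerField: #id:;
    addIntegerField: #amount:;
    addIntegerField: #balance:.
reader upToEnd. "a collection of Transaction instances, one per input line"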

There's one difference between CSVEncoder and NeoCSVWriter that
might be of interest: you can't tell CSVEncoder whether a field
is #raw or #quoted because it always figures that out for itself.
I was prepared to pay an efficiency penalty to make sure I did not
get this wrong, and am pleased to find it wasn't as much of a
penalty as I feared.
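
For comparison, NeoCSVWriter makes that choice explicitly and up front rather than inferring it per value; a minimal sketch, assuming the usual #fieldWriter: symbols (#quoted, #raw, #object):

| out |
out := WriteStream on: String new.
(NeoCSVWriter on: out)
    fieldWriter: #quoted; "every field is written quoted; #raw would write fields verbatim"
    nextPut: #('alpha' 'needs, quoting' 'plain').
out contents.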

On Wed, 6 Jan 2021 at 22:52, Sven Van Caekenberghe sven@stfx.eu wrote:

Hi Richard,

Benchmarking is a can of worms; many factors have to be considered. But
the first requirement is obviously to be completely open about what you
are doing and what you are comparing.

NeoCSV contains a simple benchmark suite called NeoCSVBenchmark, which was
used during development. Note that it is a bit tricky to use: you need to
run a write benchmark with a specific configuration before you can try read
benchmarks.

The core data is a 100,000-line file (2.5 MB) like this:

1,-1,99999
2,-2,99998
3,-3,99997
4,-4,99996
5,-5,99995
6,-6,99994
7,-7,99993
8,-8,99992
9,-9,99991
10,-10,99990
...

That parses in ~250ms on my machine.
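
A rough, self-contained sketch of timing such a plain read in Pharo (the ~250 ms above is the figure from Sven's machine; results will differ elsewhere):

| data milliseconds |
data := String streamContents: [ :stream |
    1 to: 100000 do: [ :each |
        stream
            print: each; nextPut: $,;
            print: each negated; nextPut: $,;
            print: 100000 - each;
            crlf ] ].
milliseconds := Time millisecondsToRun: [
    (NeoCSVReader on: data readStream) upToEnd ].
milliseconds.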

NeoCSV has quite a few features and handles various edge cases.
Obviously, a minimal, custom implementation could be faster.

NeoCSV is called efficient not just because it is reasonably fast, but
because it can be configured to generate domain objects without
intermediate structures and because it can convert individual fields (parse
numbers, dates, times, ...) while parsing.

Like you said, some generated CSV output out in the wild is very
irregular. I try to stick with standard CSV as much as possible.

Sven

On 6 Jan 2021, at 05:10, Richard O'Keefe raoknz@gmail.com wrote:

NeoCSVReader is described as efficient.  What is that
in comparison to?  What benchmark data are used?
Here are benchmark results measured today.
(5,000 data line file, 9,145,009 characters).
method                time(ms)
Just read characters  410
CSVDecoder>>next      3415  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next    4798  NeoCSVReader (default state). 1.78 x CSVParser
CSVParser>>next       2701  pared-to-the-bone CSV reader. 1.00 reference.

(10,000 data line file, 1,544,836 characters).
method                time(ms)
Just read characters    93
CSVDecoder>>next       530  astc's CSV reader (defaults). 1.26 x CSVParser
NeoCSVReader>>next     737  NeoCSVReader (default state). 1.75 x CSVParser
CSVParser>>next        421  pared-to-the-bone CSV reader. 1.00 reference.

CSVParser is just 78 lines and is not customisable.  It really is
stripped to pretty much an absolute minimum.  All of the parsers
were configured (if that made sense) to return an Array of Strings.
Many of the CSV files I've worked with use short records instead
of ending a line with a lot of commas.  Some of them also have the
occasional stray comment off to the right, not mentioned in the header.

I've also found it necessary to skip multiple lines at the beginning
and/or end.  (Really, some government agencies seem to have NO idea
that anyone might want to do more with a CSV file than eyeball it in
Excel.)

If there is a benchmark suite I can use to improve CSVDecoder,
I would like to try it out.

On Tue, 5 Jan 2021 at 02:36, jtuchel@objektfabrik.de <jtuchel@objektfabrik.de> wrote:

Happy new year to all of you! May 2021 be an increasingly less crazy
year than 2020...

I have a question that sounds a bit strange, but we have two effects
with NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop,
the other is that the Reader produces twice as many objects as there are
lines in the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number
of column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can
define using a more or less nice assistant in our Application. The CSV
files they upload to our App come from third parties like payment
providers, banks and other sources. These change their file structures
whenever they feel like it and never tell anybody. So a CSV import that
may have been working for years may one day tear a whole web server
image down because of a wrong number of fieldAccessors. This is bad on
many levels.

You can easily try the doubling effect at home: define a working CSV
Reader and comment out one of the addField: commands before you use the
NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
columns each. If you remove one of the fieldAccessors, an #upToEnd will
yield an Array of 6 objects rather than 3.

I haven't found the reason for the cases where this leads to an endless
loop, but at least this one is clear...

I guess this is due to the way #readEndOfLine is implemented. It seems
to not peek forward to the end of the line. I have the gut feeling
#peekChar should peek instead of reading the #next character from the
input Stream, but #peekChar has too many senders to just go ahead and
mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using
PositionableStream>>#nextLine and first check each line if the number of
separators matches the number of fieldAccessors minus 1 (and go through
the hoops of handling separators in quoted fields and such...). Only if
that test succeeds, I would then hand a Stream with the whole line to
the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of
course I could do this only for some lines, maybe just the first one.
Whatever.

But somehow I have the feeling I should get an exception telling me the
line is not compatible to the Reader's definition or such. Or
#readAtEndOrEndOfLine should just walk the line to the end and ignore
the rest of the line, returning an incomplete object....

Maybe I am just missing the right setting or switch? What best practices
did you guys come up with for such problems?

Thanks in advance,

Joachim
