pharo-users@lists.pharo.org


NeoCSVReader and wrong number of fieldAccessors

Kasper Osterbye
Fri, Jan 22, 2021 6:42 AM

As it happened, I ran into the exact same scenario as Joachim just the
other day: the external provider of my CSV had added some new columns. In
my case it manifested itself as an error that an integer field was not an
integer (because the new columns were added in the middle).

Reading through this whole thread leaves me with the feeling that no
matter what Sven adds, there is still a risk of error. Nevertheless, my
suggestion would be to add functionality to #skipHeaders, or to make a
sister method #assertAndSkipHeaders: numberOfColumns onFailDo: aBlock,
given the actual number of headers. That would give me a way to handle
the error up front.

This will only be interesting if your data has headers, of course.

Thanks for NeoCSV which I use all the time!

Best,

Kasper
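
A minimal sketch of how such a sister method might look, assuming
NeoCSVReader's #readHeader answers the parsed header fields as a
collection (the #assertAndSkipHeaders:onFailDo: selector is Kasper's
proposal, not existing NeoCSV API):

```smalltalk
"Hypothetical extension method on NeoCSVReader; #readHeader is assumed
 to answer the header fields. The fail block receives the actual headers
 so the caller can decide how to recover."
NeoCSVReader >> assertAndSkipHeaders: numberOfColumns onFailDo: aBlock
	| headers |
	headers := self readHeader.
	headers size = numberOfColumns
		ifFalse: [ ^ aBlock value: headers ].
	^ headers
```

A caller could then pass a block that raises a domain-specific error, or
logs the unexpected header line, before any records are misread.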

jtuchel@objektfabrik.de
Fri, Jan 22, 2021 7:06 AM

Kasper,

I think this is somewhat close to another thing I am describing here:
https://github.com/svenvc/NeoCSV/issues/20

The problem with extending NeoCSV endlessly is that some of the things
we need for "real-life" CSV files stem from the fact that they are not
CSV files at all, so it is hard to tell whether it is a good idea to
litter NeoCSV with methods for edge cases that exist only because parts
of our files are not CSV at all...

I somehow have the feeling that some of what we need is a subclass of
Stream that knows about constraints like "a line break is only the end
of a record if it is not part of a quoted field". So I am somewhat torn
between wanting that stuff in NeoCSV and not wanting to mix CSV parsing
with handling of the stupid ideas people have when exporting data in
some CSV-like file.

Maybe it is a good idea to collect a few concepts that have been
mentioned in these threads:

  • Sometimes we want to skip lines (header, footer) without
    interpreting their contents and without any side effects (not
    CSV-compliant)
  • Sometimes we want to ignore "additional data" after the end of a
    defined number of columns (not CSV-compliant)
  • Sometimes we need to know which line/column couldn't be parsed
    correctly (related to CSV and non-CSV)
    A plus would be if we could add the column name to the error message
    like in "The column 'amount' in line 34 cannot be interpreted as
    monetary amount" - but this is surely quite some work!
  • Sometimes we want to interpret columns by the column names in the
    header line (which may or may not be the first line of the file,
    only the former being CSV-compliant)
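
The third point can be approximated today by keeping a record counter
outside the reader; a rough sketch, assuming #next answers the next
record and signals an Error on a failed conversion (the selectors are
my reading of the NeoCSV API, not guaranteed):

```smalltalk
"Count records while reading so a parse failure can report its position.
 Reporting the offending column name would need extra bookkeeping, as
 noted above."
| reader recordNumber records |
reader := NeoCSVReader on: aReadStream.
recordNumber := 0.
records := OrderedCollection new.
[ reader atEnd ] whileFalse: [
	recordNumber := recordNumber + 1.
	[ records add: reader next ]
		on: Error
		do: [ :ex |
			Error signal: 'Record ', recordNumber printString,
				' could not be parsed: ', ex messageText ] ].
```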

Of course this all doesn't mean I am not a fan of NeoCSV. It is
well-written, works very well for "real" CSV and performs very well for
my use cases. Most of the things we are talking about here are problems
that arise when a CSV-file is not a CSV-file...

Joachim


--

Objektfabrik Joachim Tuchel          mailto:jtuchel@objektfabrik.de
Fliederweg 1                        http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0        Fax: +49 7141 56 10 86 1

Tim Mackinnon
Fri, Jan 22, 2021 9:22 AM

I’m not doing any CSV processing at the moment, but have in the past - so was interested in this thread.

@Kasper, can’t you just use #readHeader up front, do the assertion yourself, and then proceed to loop through your records? It would seem that NeoCSV caters for what you are suggesting - and if you want to add a helper method extension, you already have the building blocks to do this.

The only flaw I can think of is if there is no header present; I can’t recall what NeoCSV does then. Ideally it throws an exception so you can decide what to do - potentially continue if the number of columns is what you expect and the data matches the columns, or fail with an error that a header is required. But I think you would always need to do some basic initial checks when processing CSV, due to the nature of the format.

Tim
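
What Tim describes could look roughly like this. Treat the selectors as
a sketch against NeoCSV's reader API (#readHeader answering the header
fields is an assumption), and the expected column names are made up for
illustration:

```smalltalk
"Validate the header before configuring field accessors and reading
 the records."
| reader headers expected |
expected := #('id' 'name' 'amount').
reader := NeoCSVReader on: aReadStream.
headers := reader readHeader.
headers asArray = expected
	ifFalse: [ Error signal: 'Unexpected columns: ', headers printString ].
reader addField; addField; addIntegerField.
reader upToEnd
```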


jtuchel@objektfabrik.de
Fri, Jan 22, 2021 11:15 AM

Tim,

On 22.01.21 at 10:22, Tim Mackinnon wrote:

> I’m not doing any CSV processing at the moment, but have in the past -
> so was interested in this thread.
>
> @Kasper, can’t you just use #readHeader upfront, and do the assertion
> yourself, and then proceed to loop through your records? It would seem
> that the Neo caters for what you are suggesting - and if you want to
> add a helper method extension you have the building blocks to already
> do this?

This is a good idea. One caveat, however: #readHeader in its current
implementation does two things:

  • read the line, respecting each field (thereby respecting line breaks
    within quoted fields - perfect for this purpose)
  • update the number of columns for further reading (assuming
    #readHeader's purpose is to interpret the header line)

This second thing is in our way, because it may influence the way the
following lines will be interpreted. That is exactly why I created an
issue on GitHub (https://github.com/svenvc/NeoCSV/issues/20).
A method that reads a line without any side effects (other than pushing
the position pointer forward to the next line) would come in handy for
such scenarios. But you can always argue that this has nothing to do
with CSV, because in CSV all lines have the same number of columns, each
of them containing the same kind of information, and there may be
exactly one header line. Anything else is just some file that may
contain CSV-y stuff in it. So I am really not sure if NeoCSV should
build lots of stuff for such files. I'd love to have this, but I'd
understand if Sven refused to integrate it.... ;-)
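
Until such a side-effect-free read exists, one workaround (for
positionable streams only, and treating these selectors as assumptions
about the NeoCSV API) is to peek at the header with a throwaway reader
and rewind before configuring the real one:

```smalltalk
"Inspect the header without mutating the reader that will read the
 records: remember the stream position, peek, then restore it."
| start headers reader |
start := aReadStream position.
headers := (NeoCSVReader on: aReadStream) readHeader.
aReadStream position: start.
"Choose field accessors based on what was seen in headers, then let the
 configured reader consume the header line itself."
reader := NeoCSVReader on: aReadStream.
reader skipHeader.
```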

> The only flaw I can think of, is if there is no header present then I
> can’t recall what Neo does - ideally throws an exception so you can
> decide what to do - potentially continue if the number of columns is
> what you expect and the data matches the columns - or you fail with an
> error that a header is required. But I think you would always need to
> do some basic initial checks when processing CSV due to the nature of
> the format?

Right. You'd always have to write some specific logic for this
particular file format and make NeoCSV ignore the right stuff...

Joachim



Sven Van Caekenberghe
Thu, May 13, 2021 3:36 PM

There is now the following commit:

https://github.com/svenvc/NeoCSV/commit/0acc2270b382f52533c478f2f1585341e390d4b5

which should address a couple of issues.

