pharo-users@lists.pharo.org

Any question about pharo is welcome

How to handle (recover) from a ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding error?

Tim Mackinnon
Tue, Jul 20, 2021 8:31 AM

Hi - I’m doing a bit of log file processing with Pharo - and I’ve hit an unexpected error and am wondering what the best way to approach it is.

It seems that I have a log file that has unexpected characters, and so my readStream loop that reads lines gets an error: "ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding".

For some reason this file (unlike my others) seems to contain characters that it shouldn't - but what is the best way for me to continue processing? Should I be opening my files in a different way, or can I resume the error somehow? I'm not familiar with this area of Pharo and am after a bit of advice.

My code is like this (and I get the error when doing nextLine)

    parseStream: aFileStream with: aBlock
        | line items |
        [ (line := aFileStream nextLine) isNil ]
            whileFalse: [
                items := $/ split: line.
                items size = 3 ifTrue: [ aBlock value: items ] ]

My stream is created like this:

    firmEfs := (pathName , '/' , firmName , '_files') asFileReference.
    details parseStream: firmEfs readStream.

Should I be opening the stream a bit differently - or can I catch that encoding error and resume it with some safe character?

Thanks for any help.

Tim

Sven Van Caekenberghe
Tue, Jul 20, 2021 9:03 AM

Hi Tim,

An introduction to this part of the system is in https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html [Character Encoding and Resource Meta Description] from the "Enterprise Pharo" book.

The error means that the file you are trying to read as UTF-8 contains byte sequences that are invalid with respect to the UTF-8 standard.

Are you sure the file is in UTF-8? Maybe it is in ASCII, Latin-1 or something else.

It is possible to customise the encoding to something different than the default UTF-8. For non-UTF encoders, there is a strict/lenient option to disallow/allow illegal stuff (but then you will get these in your strings).

I can show you how to do that if you want.

Sven

Sven Van Caekenberghe
Tue, Jul 20, 2021 9:45 AM

    "Default: decode as UTF-8"
    '/var/log/system.log' asFileReference readStreamDo: [ :in | in upToEnd ].

    "Strict ASCII: illegal bytes signal an error"
    '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
        (ZnCharacterReadStream on: in encoding: #ascii) upToEnd ].

    "Lenient ASCII: illegal bytes are allowed and end up in the string"
    '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
        (ZnCharacterReadStream on: in encoding: ZnCharacterEncoder ascii beLenient) upToEnd ].

HTH

Guillermo Polito
Tue, Jul 20, 2021 10:11 AM

There is also readStreamEncoded: (and readStreamEncoded:do:), which is a bit more concise but does the same thing :)
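
A minimal sketch of that shorthand next to the longer form from the previous message. The path and the choice of 'latin1' are assumptions made for illustration only; they are not taken from Tim's setup.

```smalltalk
| file contents |
"Hypothetical path, used only for illustration."
file := '/var/log/system.log' asFileReference.

"Short form: open a character stream with the given encoding;
 the stream is closed when the block returns."
file readStreamEncoded: 'latin1' do: [ :in | contents := in upToEnd ].

"Long form: wrap a binary stream in a ZnCharacterReadStream by hand."
file binaryReadStreamDo: [ :in |
    contents := (ZnCharacterReadStream on: in encoding: 'latin1') upToEnd ].
```

Both forms are expected to produce the same string; the short form simply hides the wrapping step.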

Sven Van Caekenberghe
Tue, Jul 20, 2021 10:42 AM

Yes indeed!

Tim Mackinnon
Tue, Jul 20, 2021 1:47 PM

Hey, thanks guys - so looking at readStreamEncoded: - how do I know what the valid encodings are? Skimming the docs Sven referenced, I can start to pick out some - but is there a list? I see that the method parameter is named "anEncoding", but that name is misleading as a type hint: it seems like it's a String, or is it a Symbol? If I search for Encoder classes I do find ZnCharacterEncoder, and it has class methods for latin1, utf8 and ascii - so is this the definitive list? And should the encoding strings used in those methods be constants or something I can reference in my code?

Gosh - this raises a whole host of things I just naively assumed happened for me.

So it looks like the file giving me issues seems to have characters like £ or ¬ in it. So I'm wondering how I know what the proper encoding format would be (I think these files were written out with some PHP app) - is it just a trial and error thing?

I tried changing my code to:

    details parseStream: (firmEfs readStreamEncoded: 'iso-8859-1').

and other variants like 'ASCII' and 'latin1' - and this then gives me another error:
"ZnCharacterEncodingError: Character Unicode code point outside encoder range"

So it does sound like I have a file that isn't conforming to known standards - and I guess I have to use the #beLenient option.

Sven - in the examples for using #beLenient, you seem to show something that assumes you will iterate with Do. As my existing code takes a stream that it wants to do a #nextLine on, would it be bad to do something like this:

    efsStream := (firmEfs readStreamEncoded: 'latin1').
    efsStream encoder beLenient.

    details parseStream: efsStream.

That is - get the encoder from my stream and make it lenient?

Appreciate the pointers on this guys - I’m definitely learning something new here.

Tim

Sven Van Caekenberghe
Tue, Jul 20, 2021 1:59 PM

For the list of valid encodings, there is ZnCharacterEncoder knownEncodingIdentifiers.

You either provide an identifier from this list (as string or symbol) or an instance (the argument gets sent #asZnCharacterEncoder if you want to know).
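As a rough sketch of what that means in practice, the snippet below passes the same encoding in the three accepted forms. It reads from an in-memory byte array rather than a file so it can be run in a Playground as is; the variable names are only illustrative.

```smalltalk
| bytes fromString fromSymbol fromInstance |
"Two sample bytes: 163 is the pound sign in Latin-1, 49 is the digit 1."
bytes := #[163 49].

"Answers the identifiers that Zinc knows about in this image."
ZnCharacterEncoder knownEncodingIdentifiers.

"The encoding argument can be a String, a Symbol or an encoder instance."
fromString := (ZnCharacterReadStream on: bytes readStream encoding: 'latin1') upToEnd.
fromSymbol := (ZnCharacterReadStream on: bytes readStream encoding: #latin1) upToEnd.
fromInstance := (ZnCharacterReadStream on: bytes readStream encoding: ZnCharacterEncoder latin1) upToEnd.
```

All three streams should decode the bytes to the same String, since the identifier forms are converted to an encoder instance before use.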

Most text editors will tell you the encoding they are using to read your file and you can use that to inspect the contents.
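If no suitable editor is at hand, one possible way to look at the contents from within Pharo is to collect the byte values above 127, which are the ones an ASCII or UTF-8 reader can trip over. This is only a sketch: the byte array stands in for the file contents, which could be obtained with binaryReadStreamDo: [ :in | in upToEnd ] on the file reference.

```smalltalk
| bytes suspects |
"Stand-in for the raw file contents: $a, byte 163, $b, byte 172."
bytes := #[97 163 98 172].

"Keep only the distinct byte values outside the 7-bit ASCII range."
suspects := (bytes select: [ :each | each > 127 ]) asSet.
```

A lone byte 163 where a pound sign is expected hints at a single-byte encoding such as Latin-1, because UTF-8 would store that character as the two bytes 194 163.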

If you want, you can send me such a file privately.

Yes, you can access the encoder from the character read stream to configure it further. Or you can configure it upfront by passing an encoder instance instead of an identifier.
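Both options could look roughly like the sketch below. It uses an in-memory byte array instead of a file so that it runs on its own; with a file reference, readStreamEncoded: should give you the same kind of character stream. The variable names are made up for the example.

```smalltalk
| bytes stream lenient firstLine sameLine |
"The text a/b/c, a line feed, then the Latin-1 byte for the pound sign."
bytes := #[97 47 98 47 99 10 163].

"Option 1: create the stream from an identifier, then relax its encoder."
stream := ZnCharacterReadStream on: bytes readStream encoding: #latin1.
stream encoder beLenient.
firstLine := stream nextLine.

"Option 2: configure an encoder instance first and pass it in."
lenient := ZnCharacterEncoder latin1 beLenient.
stream := ZnCharacterReadStream on: bytes readStream encoding: lenient.
sameLine := stream nextLine.
```

Either way the stream still understands #nextLine, so an existing line-by-line loop should not need to change.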

On 20 Jul 2021, at 15:47, Tim Mackinnon tim@testit.works wrote:

Hey thanks guys - so looking at readStreamEncoded: - how do I know what the valid encodings are? Skimming those docs Sven referenced, I can start to pick out some - but is there a list? I see that the method parameter says “anEncoding” but the type hint on that is misleading, as it seems like it's a String, or is it a Symbol? If I search for Encoder classes - I do find ZnCharacterEncoder - and it has class methods for latin1, utf8, ascii - so is this the definitive list? And should the encoding strings used in those methods be constants or something I can reference in my code?

Gosh - this raises a whole host of things I just naively assumed happened for me.

So it looks like the file giving me issues - seems to have characters like £ or ¬ in it. So I’m wondering how I know what the proper encoding format would be (I think these files were written out with some PHP app) - is it just a trial and error thing?

I tried changing my code to:

details parseStream: (firmEfs readStreamEncoded: 'iso-8859-1'). - and other variants like 'ASCII' and 'latin1' - and this then gives me another error:
"ZnCharacterEncodingError: Character Unicode code point outside encoder range”

So it does sound like I have a file that isn’t conforming to known standards - and I guess I have to use #beLenient option.

Sven - In the examples for using #beLenient - you seem to show something that assumes you will iterate with Do - as my existing code takes a stream, that it wants to do a #nextLine on - would it be bad to do something like this:

efsStream := (firmEfs readStreamEncoded: 'latin1').
efsStream encoder beLenient.

details parseStream: efsStream.

That is - get the encoder from my Stream and make it lenient?

Appreciate the pointers on this guys - I’m definitely learning something new here.

Tim

On 20 Jul 2021, at 12:11, Guillermo Polito guillermopolito@gmail.com wrote:

El 20 jul 2021, a las 11:45, Sven Van Caekenberghe sven@stfx.eu escribió:

On 20 Jul 2021, at 11:03, Sven Van Caekenberghe sven@stfx.eu wrote:

Hi Tim,

An introduction to this part of the system is in https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html [Character Encoding and Resource Meta Description] from the "Enterprise Pharo" book.

The error means that a file that you try to read as UTF-8 does contain things that are invalid with respect to the UTF-8 standard.

Are you sure the file is in UTF-8, maybe it is in ASCII, Latin-1 or something else ?

It is possible to customise the encoding to something different than the default UTF-8. For non-UTF encoders, there is a strict/lenient option to disallow/allow illegal stuff (but then you will get these in your strings).

I can show you how to do that if you want.

'/var/log/system.log' asFileReference readStreamDo: [ :in | in upToEnd ].

'/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
(ZnCharacterReadStream on: in encoding: #ascii) upToEnd ].

'/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
(ZnCharacterReadStream on: in encoding: ZnCharacterEncoder ascii beLenient) upToEnd ].

There is also readStreamEncoded:[do:], which is a bit more concise but does the same :)

HTH

Sven

On 20 Jul 2021, at 10:31, Tim Mackinnon tim@testit.works wrote:

Hi - I’m doing a bit of log file processing with Pharo - and I’ve hit an unexpected error and am wondering what the best way to approach it is.

It seems that I have a log file that has unexpected characters, and so my readStream loop that reads lines gets an error: "ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding”.

For some reason this file (unlike my others) seems to contain characters that it shouldn’t - but what is the best way for me to continue processing? Should I be opening my files in a different way - or can I resume the error somehow- I’m not familiar with this area of Pharo and am after a bit of advice.

My code is like this (and I get the error when doing nextLine)

parseStream: aFileStream with: aBlock
    | line items |
    [ (line := aFileStream nextLine) isNil ]
        whileFalse: [
            items := $/ split: line.
            items size = 3 ifTrue: [ aBlock value: items ] ]

My stream is created like this:

firmEfs := (pathName , '/' , firmName , '_files') asFileReference.
details parseStream: firmEfs readStream.

Should I be opening the stream a bit differently - or can I catch that encoding error and resume it with some safe character?

Thanks for any help.

Tim
