[Pharo-users] [From StackOverflow] How to parse ndjson in Pharo with NeoJSON

Sven Van Caekenberghe sven at stfx.eu
Fri Jan 22 10:49:28 EST 2016


> On 22 Jan 2016, at 16:13, MartinW <wm at fastmail.fm> wrote:
> 
> Thank you, Sven! (I asked the question on StackOverflow)
> 
> And also let me thank you for NeoJSON, NeoCSV and Zinc, which I use a lot
> and which are a joy to use! Also the documentation is very good and helps a
> lot.

Thanks, Martin.

> Your code works well and I save a bit of memory by avoiding intermediary
> data structures, but still this operation uses a lot more memory than I had
> expected (the example file I use is 80 MB).

Well, it is quite a bit of data (I didn't look too deeply): 50,000 records of structured/nested data with quite a lot of strings. If each record is 1 KB, that makes 50 MB.

How do you measure your memory consumption? What did you expect?
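As a rough first measurement, you could try something like the following. If I remember correctly, Smalltalk garbageCollect answers the number of free bytes after a full GC, so the difference before and after building the result approximates what the result retains. A sketch only, not a precise profiler:

| free1 free2 result |
free1 := Smalltalk garbageCollect. "free bytes after a full GC"
result := NeoJSONReader fromString: '{"smalltalk": "cool"}'.
free2 := Smalltalk garbageCollect. "result is still referenced here, so it survives"
Transcript show: 'Retained roughly ', (free1 - free2) printString, ' bytes'; cr.

For your 80 MB file you would parse the whole stream between the two measurements instead of a single literal, of course.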

Right now, your JSON is parsed and the result is a combination of lists (Array) and maps (Dictionary). If you know/understand well what is inside it, and it is regular enough, you could try to build your own specialised/optimised data/domain model for it. NeoJSON can also parse directly to your objects, instead of the general ones (a process called mapping). This is some work, of course, and it might not be worth it, YMMV.
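For example, assuming you defined a class for the records (NYPLItem here is hypothetical, just for illustration), #mapInstVarsFor: maps JSON keys to identically named instance variables, so the variable names have to match the keys:

"Hypothetical class, for illustration:
Object subclass: #NYPLItem
 instanceVariableNames: 'title date uuid'
 classVariableNames: ''
 package: 'NYPL'"

| reader items |
reader := NeoJSONReader on: data readStream.
reader mapInstVarsFor: NYPLItem.
items := Array streamContents: [ :out |
 [ reader atEnd ] whileFalse: [ out nextPut: (reader nextAs: NYPLItem) ] ].

Instances then hold their data in instance variable slots instead of Dictionaries full of copied key strings, which is usually noticeably smaller.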

Sven  

> I tried to parse with
> PetitParser but the results were similar. I guess I have to learn to find
> out where all the memory goes.
> 
> Best regards,
> Martin.
> 
> 
> 
> Sven Van Caekenberghe-2 wrote
>> (I don't do StackOverflow)
>> 
>> Reading the 'format' is easy, just keep on doing #next for each JSON
>> expression (whitespace is ignored).
>> 
>> | data reader |
>> data := '{"smalltalk": "cool"}
>> {"pharo": "cooler"}'.
>> reader := NeoJSONReader on: data readStream.
>> Array streamContents: [ :out |
>>  [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].
>> 
>> Preventing intermediary data structures is easy too, use streaming.
>> 
>> | client reader data networkStream |
>> (client := ZnClient new)
>>  streaming: true;
>>  url:
>> 'https://github.com/NYPL-publicdomain/data-and-utilities/blob/master/items/pd_items_1.ndjson?raw=true';
>>  get.
>> networkStream := ZnCharacterReadStream on: client contents.
>> reader := NeoJSONReader on: networkStream.
>> data := Array streamContents: [ :out |
>>  [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].
>> client close.
>> data.
>> 
>> It took a couple of seconds, it is 80MB+ over the network for 50K items
>> after all.
>> 
>> 
>> 
>> HTH,
>> 
>> Sven 
>> 
>> 
>>> On 21 Jan 2016, at 12:02, Esteban Lorenzano <
> 
>> estebanlm@
> 
>> > wrote:
>>> 
>>> Hi, 
>>> 
>>> there is a question I don’t know how to answer.
>>> 
>>> http://stackoverflow.com/questions/34904337/how-to-parse-ndjson-in-pharo-with-neojson
>>> 
>>> Transcript: 
>>> 
>>> I want to parse ndjson (newline delimited json) data with NeoJSON on
>>> Pharo Smalltalk.
>>> 
>>> ndjson data looks like this:
>>> 
>>> {"smalltalk": "cool"}
>>> {"pharo": "cooler"}
>>> At the moment I convert my file stream to a string, split it on newline
>>> and then parse the single parts using NeoJSON. This seems to use an
>>> unnecessary (and extremely huge) amount of memory and time, probably
>>> because of converting streams to strings and vice-versa all the time.
>>> What would be an efficient way to do this task?
>>> 
>>> 
>>> Takers?
>>> Esteban
>> 
>> 
>> 
> 
> --
> View this message in context: http://forum.world.st/From-StackOverflow-How-to-parse-ndjson-in-Pharo-with-NeoJSON-tp4873097p4873385.html
> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.





More information about the Pharo-users mailing list