[Pharo-project] Fuel - a fast object deployment tool

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Fri Jun 17 17:39:19 EDT 2011


2011/6/17 Eliot Miranda <eliot.miranda at gmail.com>:
>
>
> On Fri, Jun 17, 2011 at 1:26 AM, Martin Dias <tinchodias at gmail.com> wrote:
>>
>> Hi Eliot,
>> I am very happy to read your mail.
>>
>> On Wed, Jun 15, 2011 at 3:29 PM, Eliot Miranda <eliot.miranda at gmail.com>
>> wrote:
>>>
>>> Hi Martin & Mariano,
>>>     regarding filtering.  Yesterday my colleague Yaron and I successfully
>>> finished our port of Fuel to Newspeak and are successfully using it to save
>>> and restore our data sets; thank you, it's a cool framework.  We had to
>>> implement two extensions, the first of which is the ability to save and restore
>>> Newspeak classes, which is complex because these are instantiated classes
>>> inside instantiated Newspeak modules, not static Smalltalk classes in the
>>> Smalltalk dictionary.  The second extension is the ability to map specific
>>> objects to nil, to prune objects on the way out.  I want to discuss this
>>> latter extension.
>>> In our data set we have a set of references to objects that are logically
>>> not persistent and hence not to be saved.  I'm sure that this will be a
>>> common case.  The requirement is for the pickling system to prune certain
>>> objects, typically by arranging that when an object graph is pickled,
>>> references to the pruned objects are replaced by references to nil.  One way
>>> of doing this is as described below, by specifying per-class lists of
>>> instance variables whose referents should not be saved.  But this can be
>>> clumsy; there may be references to objects one wants to prune from e.g. more
>>> than one class, in which case one may have to provide multiple lists of the
>>> relevant inst vars; there may be references to objects one wants to prune
>>> from e.g. collections (e.g. sets and dictionaries) in which case the
>>> instance variable list approach just doesn't work.
>>> Here are two more general schemes.  First, most directly, Fuel could
>>> provide two filters, implemented in the default mapper, or the core
>>> analyser.  One is a set of classes whose instances are not to be saved.  Any
>>> reference to an instance of a class in the toBePrunedClasses set is saved as
>>> nil.  The other is a set of instances that are not to be saved, and also any
>>> reference to an instance in the toBePruned set is saved as nil.  Why have
>>> both?  It can be convenient and efficient to filter by class (in our case we
>>> had many instances of a specific class, all of which should be filtered, and
>>> finding them could be time consuming), but filtering by class can be too
>>> inflexible; there may indeed be specific instances to exclude (think for
>>> example of part of the object graph that functions as a cache; pruning the
>>> specific objects in the cache is the right thing to do; pruning all
>>> instances of classes whose instances exist in the cache may prune too much).
>>> As an example here's how we implemented pruning.  Our system is called
>>> Glue, and we start with a mapper for Glue objects, FLGlueMapper:
>>> FLMapper subclass: #FLGlueMapper
>>>     instanceVariableNames: 'prunedObjectClasses newspeakClassesCluster modelClasses'
>>>     classVariableNames: ''
>>>     poolDictionaries: ''
>>>     category: 'Fuel-Core-Mappers'
>>> It accepts newspeak objects and filters instances in the
>>> prunedObjectClasses set, and as a side-effect collects certain classes that
>>> we need in a manifest:
>>> FLGlueMapper>>accepts: anObject
>>>     "Tells if the received object is handled by this analyzer.  We want to hand-off
>>>     instantiated Newspeak classes to the newspeakClassesCluster, and we want
>>>     to record other model classes.  We want to filter-out instances of any class
>>>     in prunedObjectClasses."
>>>     ^anObject isBehavior
>>>         ifTrue:
>>>             [(self isInstantiatedNewspeakClass: anObject)
>>>                 ifTrue: [true]
>>>                 ifFalse:
>>>                     [(anObject inheritsFrom: GlueDataObject) ifTrue:
>>>                         [modelClasses add: anObject].
>>>                     false]]
>>>         ifFalse:
>>>             [prunedObjectClasses includes: anObject class]
>>> It prunes by mapping instances of the prunedObjectClasses to a special
>>> cluster.  It can do this in visitObject: since any newspeak objects it is
>>> accepting will be visited in its visitClassOrTrait: method (i.e. it's
>>> implicit that all arguments to visitObject: are instances of classes in
>>> the prunedObjectClasses set).
>>> FLGlueMapper>>visitObject: anObject
>>>     analyzer
>>>         mapAndTrace: anObject
>>>         to: FLPrunedObjectsCluster instance
>>>         into: analyzer clustersWithBaselevelObjects
>>> FLPrunedObjectsCluster is a specialization of the nil,true,false cluster
>>> that maps its objects to nil:
>>> FLNilTrueFalseCluster subclass: #FLPrunedObjectsCluster
>>>     instanceVariableNames: ''
>>>     classVariableNames: ''
>>>     poolDictionaries: ''
>>>     category: 'Fuel-Core-Clusters'
>>> FLPrunedObjectsCluster>>serialize: aPrunedObject on: aWriteStream
>>>     super serialize: nil on: aWriteStream
>>>
>>> So this would generalize by the analyser having e.g. an FLPruningMapper
>>> as the first mapper, and this having a prunedObjects and a
>>> prunedObjectClasses set and going something like this:
>>> FLPruningMapper>>accepts: anObject
>>>     ^(prunedObjects includes: anObject)
>>>         or: [prunedObjectClasses includes: anObject class]
>>> FLPruningMapper>>visitObject: anObject
>>>     analyzer
>>>         mapAndTrace: anObject
>>>         to: FLPrunedObjectsCluster instance
>>>         into: analyzer clustersWithBaselevelObjects
>>> and then one would provide accessors in FLSerializer and/or FLAnalyser to
>>> add objects and classes to the prunedObjects and prunedObjectClasses sets.
>>> For efficiency one could arrange that the FLPruningMapper was not added
>>> to the sequence of mappers unless and until objects or classes were added
>>> to the prunedObjects and prunedObjectClasses sets.
>>
>> Excellent. I love the botanical metaphor of pruning! Of course we can
>> include FLPruningMapper and FLPrunedObjectsCluster in Fuel.
>>
>> We are also interested in pruning objects, not necessarily replacing
>> them by nil but by other user-defined objects, for example proxies. We
>> can extend the pruning stuff to do that.
>
> That was an idea Yaron came up with: instead of
> using fuelIgnoredInstanceVariableNames one uses e.g.
> Object>>objectToSerialize
>     ^self
> and then if one wants to prune specific inst vars in MyClass one implements
> MyClass>>objectToSerialize
>     ^self shallowCopy prepareForSerialization

Hi Eliot,

I'm not convinced by the shallowCopy solution, except for simple structures.
If the object graph is complex (has shared nodes, loops, ...) then you are
going to end up with a replication problem equivalent to the one Fuel is
trying to solve.
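
A minimal workspace sketch of the concern, using only base classes. It
assumes (my assumption, not a statement about Fuel's behaviour) that a
substitution hook like objectToSerialize is evaluated once per reference
without remembering earlier results; the variable names are made up for
illustration.

```smalltalk
| shared holderA holderB sameBefore sameAfter |
"Two holders reference one shared node."
shared := OrderedCollection with: #payload.
holderA := Array with: shared.
holderB := Array with: shared.

"In the original graph both references are identical."
sameBefore := holderA first == holderB first.

"Hypothetical stand-in for objectToSerialize: one shallowCopy per reference."
sameAfter := holderA first shallowCopy == holderB first shallowCopy.

"Evaluates to an Array holding true and false: the sharing is lost."
{ sameBefore. sameAfter }
```

Restoring the sharing means remembering which copy stands for which
original, e.g. in an IdentityDictionary, and that is the bookkeeping a
serializer already has to do.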

Nicolas

> MyClass>>prepareForSerialization
>     instVarIDontWantToSerialize := nil.
>     ^self
> and for objects one doesn't want to serialize one implements
> MyNotToBeSerializedClass>>objectToSerialize
>     ^nil
> So it's more general.  But I would pass the analyser in as an argument, which
> would allow things like
> MyPerhapsNotToBeSerializedClass>>objectToSerializeIn: anFLAnalyser
>     ^(anFLAnalyser shouldPrune: self)
>         ifFalse: [self]
>         ifTrue: [nil]
> which would of course be the default in Object:
> Object>>objectToSerializeIn: anFLAnalyser
>     ^(anFLAnalyser shouldPrune: self) ifFalse: [self]
>
>>
>>
>>>
>>> I think both Yaron and I feel the Fuel framework is comprehensible and
>>> flexible.  We enjoyed using it and while we took two passes at coming up
>>> with the pruning scheme we liked (our first was based on not serializing
>>> specific inst vars and was much more complex than our second, based on
>>> pruning instances of specific classes) we got there quickly and with very
>>> little frustration along the way.  Thank you very much.
>>
>> :-) thank you!
>>
>>>
>>> Finally, a couple of things.  First, it may be more flexible to implement
>>> fuelCluster as fuelClusterIn: anFLAnalyser so that if one is trying to
>>> override certain parts of the mapping framework an implementation can access
>>> the analyser to find existing clusters, e.g.
>>> MyClass>>fuelClusterIn: anFLAnalyser
>>>     ^self shouldBeInASpecialCluster
>>>         ifTrue: [anFLAnalyser clusterWithId: MySpecialCluster id]
>>>         ifFalse: [super fuelClusterIn: anFLAnalyser]
>>> This makes it easier to find a specific unique cluster to handle a group
>>> of objects specially.
>>
>> I can't imagine a concrete example but I see that it is more flexible...
>> the cluster obtained via double dispatch can be anything polymorphic with
>> MySpecialCluster... that's the point?
>
> To be honest I'm not sure.  But passing in the analyser in things like
> fuelCluster or objectToSerialize is I think a good idea as it provides a
> convenient communication path which in turn provides considerable
> flexibility.
>>
>>
>>>
>>> Lastly, the class-side cluster ids are a bit of a pain.  It would be nice
>>> to know a) are these byte values or general integer values, i.e. can there
>>> be more than 256 types of cluster?, and b) is there any meaning to the ids?
>>>  For example, are clusters ordered by id, or is this just an integer tag?
>>>  Also, some class-side code to assign an unused id would be nice.
>>> You might think of virtualizing the id scheme.  For example, if FLCluster
>>> maintained a weak array of all its subclasses then the id of a cluster could
>>> be the index in the array, and the array could be cleaned up occasionally.
>>>  Then each fuel serialization could start with the list of cluster class
>>> names and ids, so that specific values of ids are specific to a particular
>>> serialization.
>>
>> I do agree, these ids are a heritage from the first prototypes of Fuel;
>> they should be revised. a) yes, it is encoded in only one byte; b) just an
>> integer tag, the only purpose of the id was for decoding fast: read a byte
>> and then look in a dictionary for the corresponding cluster instance. We
>> could even store the cluster class name but that's inefficient.
>
> Yes, but how inefficient?  What's the size of all the cluster names?
>     FLCluster allSubclasses inject: 0 into: [:t :c | t + c name size + 1] "670"
>
> So you'd add less than a kilobyte to the size of each serialization and get
> complete freedom from ids.  Something to think about.
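
For what it's worth, a per-serialization table could look roughly like
this. It is a workspace sketch only: the three names are taken from this
thread as plain strings, and the table layout is an assumption, not Fuel's
actual format.

```smalltalk
| clusterNames idOf nameOf |
"Hypothetical header: the cluster class names used by one serialization."
clusterNames := #('FLFixedObjectCluster' 'FLVariableObjectCluster' 'FLPrunedObjectsCluster').

"Writer side: the id is just the position of the name in the header."
idOf := Dictionary new.
clusterNames withIndexDo: [:name :index | idOf at: name put: index].

"Reader side: the header maps an id back to a class name."
nameOf := [:id | clusterNames at: id].

"Round trip: evaluates to 'FLPrunedObjectsCluster'."
nameOf value: (idOf at: 'FLPrunedObjectsCluster')
```

With ids scoped to one serialization like this, two independently written
extensions cannot collide on a number.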
>>
>> Virtualizing the id scheme is a good idea. Much more elegant and
>> extensible. The current mechanism not only limits the number of possible
>> clusters, but also "user defined" extensions can collide, for example if
>> your Glue cluster id is the same as the Moose cluster id.
>>
>> I added an issue in our tracker.
>>
>> If it makes sense, maybe the weak array you suggest can be also used to
>> avoid instantiating lots of FLObjectCluster like we are doing in Object:
>>
>> fuelCluster
>>     ^ self class isVariable
>>         ifTrue: [ FLVariableObjectCluster for: self class ]
>>         ifFalse: [ FLFixedObjectCluster for: self class ]
>>
>> the second time you send fuelCluster to an object, it can reuse the
>> cluster instance.
>
> Right.  I think that's important, and is one reason why I think passing in
> the analyser is important, because it allows certain objects to discover
> existing clusters in the analyzer and join them if they want to, instead of
> having to invent and maintain their own cluster uniquing solution.
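
A sketch of the reuse being discussed, with an IdentityDictionary keyed by
class. An Association stands in for a real cluster and the block clusterFor
is hypothetical, so this shows only the lookup pattern, not Fuel's API.

```smalltalk
| clusters clusterFor first second |
"One entry per class, created lazily on first use."
clusters := IdentityDictionary new.

"Hypothetical lookup; an Association stands in for a real cluster object."
clusterFor := [:anObject |
    clusters
        at: anObject class
        ifAbsentPut: [#cluster -> anObject class]].

first := clusterFor value: 3 @ 4.
second := clusterFor value: 5 @ 6.

"Both Points get the very same stand-in cluster: evaluates to true."
first == second
```

If the dictionary lives in the analyser, passing the analyser in gives
every object access to it.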
>>>
>>> again thanks for a great framework.
>>
>> Thanks for your words and the feedback. Is Glue published somewhere?
>
> No, and it's extremely proprietary :)  Newspeak however is available and we
> may end up maintaining a port of Fuel for Newspeak.
> best regards,
> Eliot
>
>>
>> regards
>> Martin
>>
>>
>>>
>>> best,
>>> Eliot
>>
>>
>>>
>>> On Mon, Jun 13, 2011 at 10:16 AM, Mariano Martinez Peck
>>> <marianopeck at gmail.com> wrote:
>>>>
>>>>
>>>> On Thu, Jun 9, 2011 at 3:35 AM, Eliot Miranda <eliot.miranda at gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi Martin and Mariano,
>>>>>     a couple of questions.  What's the right way to exclude certain
>>>>> objects from the serialization?  Is there a way of excluding certain inst
>>>>> vars from certain objects?
>>>>
>>>>
>>>> Eliot and the rest....Martin implemented this feature in
>>>> Fuel-MartinDias.258. For the moment, we decided to put
>>>> #fuelIgnoredInstanceVariableNames at class side.
>>>>
>>>> Behavior >> fuelIgnoredInstanceVariableNames
>>>>     "Indicates which variables have to be ignored during serialization."
>>>>
>>>>     ^#()
>>>>
>>>>
>>>> MyClass class >> fuelIgnoredInstanceVariableNames
>>>>   ^ #('instVar1')
>>>>
>>>>
>>>> There is no impact on speed, so this is good. Now... we were wondering
>>>> whether it is common for 2 different instances of the same class to need
>>>> different instVars ignored. Is this common? Do you usually need this?
>>>> We checked in SIXX and it is at instance side. Java uses the 'transient'
>>>> modifier on field declarations, so it is at class side...
>>>>
>>>> thanks
>>>>
>>>>
>>>> --
>>>> Mariano
>>>> http://marianopeck.wordpress.com
>>>>
>>>
>>
>
>



