- 
                Notifications
    You must be signed in to change notification settings 
- Fork 181
WIP refactor func globals to make external delegation possible #430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| Codecov Report
 @@            Coverage Diff             @@
##           master     #430      +/-   ##
==========================================
+ Coverage   91.99%   92.17%   +0.18%     
==========================================
  Files           4        4              
  Lines         687      703      +16     
  Branches      141      140       -1     
==========================================
+ Hits          632      648      +16     
  Misses         34       34              
  Partials       21       21              
 Continue to review full report at Codecov. 
 | 
| @staticmethod | ||
| def persistent_id(obj): | ||
| if ( | ||
| isinstance(obj, _FuncMetadataGlobals) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FuncMetadataGlobals  and _FuncCodeGlobals are still private API so the subclassing pickler would still be impacted by changes in those constructs. However it feels maybe less brittle than overriding the save_function or _dynamic_function_reduce methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is worse than my solution:
- it is less intuitive as it forces you to use private API
- it also brings unnecessary wrapping of all dicts, so you will have to maintain these wrappers forever
My solution does not bring any more APIs as it follows official python docs. It also does not complicate serialization for people who don't use persistent_id for func.__globals__.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't like your solution because it's calling persistent_id with the global dict at a point where it's not supposed to be called: the full func globals dict is never pickled when persistent_id is not overridden, only the derived base_globals and f_globals are. Therefore your solution is a breach/abuse of the persistent_id protocol as defined in the pickle module documentation.
If we don't like the wrapping, we could instead inject a specific key inside the dict with the id of the full func.__globals__. But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the full func globals dict is never pickled when persistent_id is not overridden, only the derived base_globals and f_globals are.
Are you talking about cloudpickle or pickle? Because pickle pickles full func globals dict and we want to be able to tell cloudpickle that we want the same behaviour here.
Therefore your solution is a breach/abuse of the persistent_id protocol as defined in the pickle module documentation.
I don't see how it is. When you replace cloudpickle.CloudPickler with pickle.Pickler in my example you receive the desired behavior:
import io
import pickle
__globals__ = {"a": "foo"}
class Pickler(pickle.Pickler):
    @staticmethod
    def persistent_id(obj):
        if id(obj) == id(__globals__):
            return "__globals__"
class Unpickler(pickle.Unpickler):
    @staticmethod
    def persistent_load(pid):
        return {"__globals__": __globals__}[pid]
exec("""
def get():
    return a
""", __globals__)
get = __globals__['get']
file = io.BytesIO()
Pickler(file).dump(get)
dumped = file.getvalue()
if b'foo' in dumped:
    raise Exception(repr(dumped))
get = Unpickler(io.BytesIO(dumped)).load()
if id(__globals__) != id(get.__globals__):
    raise Exception(f"id(__globals__) != id(get.__globals__)")
print(f"get() == {get()}")
__globals__['a'] = 'bar'
print(f"get() == {get()}")But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).
That's right, users don't need that info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you replace cloudpickle.CloudPickler with pickle.Pickler in my example you receive the desired behavior
Your example here works without defining persistent_id and persistent_load, so I am not sure it is relevant to why your changes are "abusing" the persistent_id protocol.
In [9]: import io
   ...: import pickle
   ...:
   ...: __globals__ = {"a": "foo"}
   ...:
   ...:
   ...: class Pickler(pickle.Pickler):
   ...:     pass
   ...:
   ...: class Unpickler(pickle.Unpickler):
   ...:     pass
   ...:
   ...: exec("""
   ...: def get():
   ...:     return a
   ...: """, __globals__)
   ...: get = __globals__['get']
   ...: file = io.BytesIO()
   ...: Pickler(file).dump(get)
   ...: dumped = file.getvalue()
   ...: if b'foo' in dumped:
   ...:     raise Exception(repr(dumped))
   ...: get = Unpickler(io.BytesIO(dumped)).load()
   ...: if id(__globals__) != id(get.__globals__):
   ...:     raise Exception(f"id(__globals__) != id(get.__globals__)")
   ...: print(f"get() == {get()}")
   ...: __globals__['a'] = 'bar'
   ...: print(f"get() == {get()}")I'll try to explain one more time why your PR does not use persistent_id as it is intended to be used. According to the pickle documentation:
To pickle objects that have an external persistent ID, the pickler must have a custom persistent_id() method that takes an object as an argument and returns either None or the persistent ID for that object. When None is returned, the pickler simply pickles the object as normal. When a persistent ID string is returned, the pickler will pickle that object, along with a marker so that the unpickler will recognize it as a persistent ID.
In other words, persistent_id(obj) only has an impact on which obj is pickled. In your PR, persistent_id(func.__globals__) has an impact on the content of two different objects that are NOT func.__globals__: f_globals and base_globals. Such a case is never advertised by the pickle docs, and thus is an abuse of the persistent_id protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because pickle pickles full func globals dict and we want to be able to tell cloudpickle that we want the same behaviour here
"pickles full globals dict" in which setting?
import pickle
global_var = 1
def f():
    return global_var
pickle.dumps(f)does not pickle any global variable, or any dictionary containing global variables. It pickles f by reference.
| from cloudpickle.cloudpickle import _make_empty_cell, cell_set | ||
| from cloudpickle.cloudpickle import _extract_class_dict, _whichmodule | ||
| from cloudpickle.cloudpickle import _lookup_module_and_qualname | ||
| from cloudpickle.cloudpickle import _FuncMetadataGlobals, _FuncCodeGlobals | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make those directly importable from the top level cloudpickle package even if they are private API?
| @ogrisel feel free to ping me when this is ready for review. | 
| 
 I am not sure if we want merge this PR. If it does not solve anybodies problem, then it's not worth the added complexity. Maybe @faucct is best addressed by just overriding  | 
| 
 I don't like this solution. I will check if overriding  | 
| Also, what about putting .globals substituion behind a flag? | 
Prototype to tackle #411 without calling
self.persistent_idon a object that would not otherwise be part of the pickle stream.