
Enum used in map functions will raise a RecursionError with dill. #2643

Open
jorgeecardona opened this issue Jul 14, 2021 · 4 comments
Labels
bug Something isn't working

Comments


jorgeecardona commented Jul 14, 2021

Describe the bug

Enums used in functions passed to `map` will fail at pickling with a maximum-recursion exception, as described here: uqfoundation/dill#250 (comment)

In my particular case, I use an enum to define an argument with fixed options, using the TrainingArguments dataclass as base class and the HfArgumentParser. In the same file I use `ds.map`, which tries to pickle the content of the module, including the definition of the enum, and runs into the dill bug described above.

Steps to reproduce the bug

```python
from datasets import load_dataset
from enum import Enum


class A(Enum):
    a = 'a'


def main():
    a = A.a

    def f(x):
        return {} if a == A.a else x

    ds = load_dataset('cnn_dailymail', '3.0.0')['test']
    ds = ds.map(f, num_proc=15)


if __name__ == "__main__":
    main()
```
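For what it's worth, the failure does not require datasets at all; it can be triggered by asking dill to pickle a closure that captures an enum member. A minimal sketch (the `Color` enum is illustrative; with dill versions affected by uqfoundation/dill#250 the `dumps` call raises RecursionError, while fixed versions succeed):

```python
import enum

import dill  # assumes dill is installed


class Color(enum.Enum):
    RED = 'red'


def make_closure():
    c = Color.RED

    def f(x):
        # The closure captures the enum member, so dill also has to
        # serialize the Color class itself.
        return {} if c == Color.RED else x

    return f


try:
    payload = dill.dumps(make_closure())
    hit_bug = False
except RecursionError:
    # The behaviour this issue reports, from dill versions affected by #250.
    hit_bug = True
```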

Expected results

The known problem with dill could be prevented as explained in the link above (see the workaround there). Since HfArgumentParser nicely uses the enum class for choices, it makes sense to also deal with this bug under the hood.
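The workaround from the linked dill issue boils down to registering a reducer that serializes Enum classes by reference (module plus qualified name) rather than letting dill rebuild them member by member, which is what recurses. A minimal sketch, assuming dill's `register` decorator and an enum reachable from the top level of its module:

```python
import enum
import sys

import dill  # assumes dill is installed


@dill.register(enum.EnumMeta)
def _save_enum_by_reference(pickler, obj):
    # Serialize the Enum class as "look up <qualname> in <module>"
    # instead of recursively by value.
    pickler.save_reduce(
        getattr, (sys.modules[obj.__module__], obj.__qualname__), obj=obj
    )


class A(enum.Enum):
    a = 'a'


# Round-trip: the unpickled object is the very same class object.
assert dill.loads(dill.dumps(A)) is A
```

Note this only covers enums reachable as a top-level attribute of their module; a nested enum class would need a `getattr` chain over its dotted `__qualname__`.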

Actual results

```
  File "/home/xxxx/miniconda3/lib/python3.8/site-packages/dill/_dill.py", line 1373, in save_type
    pickler.save_reduce(_create_type, (type(obj), obj.__name__,
  File "/home/xxxx/miniconda3/lib/python3.8/pickle.py", line 690, in save_reduce
    save(args)
  File "/home/xxxx/miniconda3/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/xxxx/miniconda3/lib/python3.8/pickle.py", line 899, in save_tuple
    save(element)
  File "/home/xxxx/miniconda3/lib/python3.8/pickle.py", line 534, in save
    self.framer.commit_frame()
  File "/home/xxxx/miniconda3/lib/python3.8/pickle.py", line 220, in commit_frame
    if f.tell() >= self._FRAME_SIZE_TARGET or force:
RecursionError: maximum recursion depth exceeded while calling a Python object
```

Environment info

  • datasets version: 1.8.0
  • Platform: Linux-5.9.0-4-amd64-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyArrow version: 3.0.0
@jorgeecardona jorgeecardona added the bug Something isn't working label Jul 14, 2021

mbforbes commented Aug 9, 2021

I'm running into this as well. (Thank you so much for reporting @jorgeecardona — was staring at this massive stack trace and unsure what exactly was wrong!)

lhoestq (Member) commented Aug 23, 2021

Hi ! Thanks for reporting :)

Until this is fixed on dill's side, we could implement custom saving in our Pickler defined in utils/py_utils.py.
There is already a suggestion in this message about how to do it:
uqfoundation/dill#250 (comment)

Let me know if such a workaround could help, and feel free to open a PR if you want to contribute !

BitcoinNLPer commented:

I have the same bug. The code is as follows:

[screenshot of code]

The error is:

[screenshot of the error]

Looking for a solution to this bug.

lhoestq (Member) commented Nov 2, 2021

Hi! I think your RecursionError comes from a different issue, @BitcoinNLPer. Could you open a separate issue please?

Also, which dataset are you using? I tried loading CodedotAI/code_clippy but I get a different error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/quentinlhoest/Desktop/hf/datasets/src/datasets/load.py", line 1615, in load_dataset
    **config_kwargs,
  File "/Users/quentinlhoest/Desktop/hf/datasets/src/datasets/load.py", line 1446, in load_dataset_builder
    builder_cls = import_main_class(dataset_module.module_path)
  File "/Users/quentinlhoest/Desktop/hf/datasets/src/datasets/load.py", line 101, in import_main_class
    module = importlib.import_module(module_path)
  File "/Users/quentinlhoest/.virtualenvs/hf-datasets/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/quentinlhoest/.cache/huggingface/modules/datasets_modules/datasets/CodedotAI___code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d/code_clippy.py", line 66, in <module>
    url_elements = results.find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
```
