Description
Bug description
I have been experiencing intermittent, hard-to-pin-down failures when Lightning tries to auto-requeue my jobs on our SLURM cluster on timeout. After some debugging (a print in the SIGUSR1 signal handler and many runs), I saw that the handler sometimes just stops running after or while saving the HPC checkpoint to disk.
Signal handlers are a pretty special environment because they can run after any Python bytecode instruction, including in the middle of other functions. This can lead to problems that are difficult to track down, and even to crashes, e.g. python/cpython#112608. The handler could also run in the middle of a backward pass and store a corrupted checkpoint. So the safe option would be to do as little work as possible in the handler and move the actual signal handling into normal code.
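To illustrate the suggestion, here is a minimal sketch of the "set a flag in the handler, act on it later" pattern. It is not Lightning's actual implementation: `RequeueFlag`, `install`, `run_loop`, `step_fn`, and `on_requeue` are hypothetical names chosen for this example, and the sketch assumes the training loop has a natural safe point after each step.

```python
import signal


class RequeueFlag:
    """Records that a signal arrived; the handler does no other work."""

    def __init__(self):
        self.received = False

    def handler(self, signum, frame):
        # Only set a flag here: no I/O, no checkpointing, no logging.
        self.received = True


def install(flag, signum=signal.SIGUSR1):
    """Register the flag's handler (must be called from the main thread)."""
    return signal.signal(signum, flag.handler)


def run_loop(flag, max_steps, step_fn, on_requeue):
    """Run step_fn for each step; stop early if a signal was flagged.

    Returns the step at which the loop stopped early, or None if it
    ran to completion.
    """
    for step in range(max_steps):
        step_fn(step)
        # Safe point: the step has finished, so state is consistent.
        if flag.received:
            on_requeue(step)  # e.g. save a checkpoint and requeue the job
            return step
    return None
```

With this shape, the checkpoint is written from normal code between steps, so it cannot be interrupted halfway through a backward pass by the handler itself. The trade-off is latency: the reaction to the signal is delayed by up to one step, which has to fit into the grace period SLURM gives the job before killing it.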
What version are you seeing the problem on?
v2.5, master
Reproduced in studio
No response
How to reproduce the bug
The issue is difficult to reproduce as it depends on signal timing and the exact environment. It never happens on one cluster but happens regularly on the other.
Error messages and logs
No error messages. The signal handler just never completes until the program gets killed by SLURM.
Environment
Current environment
- CUDA:
- GPU: None
- available: False
- version: 12.6
- Lightning:
- lightning: 2.6.0
- lightning-utilities: 0.15.2
- pytorch-lightning: 2.6.0
- torch: 2.9.1+cu126
- torch-fidelity: 0.3.0
- torchdata: 0.11.0
- torchmetrics: 1.8.2
- torchvision: 0.24.1+cu126
- Packages:
- aiohappyeyeballs: 2.6.1
- aiohttp: 3.13.2
- aiosignal: 1.4.0
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- anyio: 4.12.0
- argon2-cffi: 25.1.0
- argon2-cffi-bindings: 25.1.0
- arrow: 1.4.0
- asttokens: 3.0.1
- async-lru: 2.0.5
- attrs: 25.4.0
- autocommand: 2.2.2
- babel: 2.17.0
- backports.tarfile: 1.2.0
- beautifulsoup4: 4.14.3
- bleach: 6.3.0
- brezn: 0.1.0
- bsi: 0.1.0
- cachetools: 6.2.2
- cattrs: 25.3.0
- certifi: 2025.11.12
- cffi: 2.0.0
- charset-normalizer: 3.4.4
- click: 8.3.1
- cloudpickle: 3.1.2
- comm: 0.2.3
- contourpy: 1.3.3
- cycler: 0.12.1
- debugpy: 1.8.17
- decorator: 5.2.1
- defusedxml: 0.7.1
- einops: 0.8.1
- executing: 2.2.1
- fastjsonschema: 2.21.2
- filelock: 3.20.0
- fonttools: 4.61.0
- fqdn: 1.5.1
- frozenlist: 1.8.0
- fsspec: 2025.12.0
- gitdb: 4.0.12
- gitignorant: 0.4.0
- gitpython: 3.1.45
- h11: 0.16.0
- h5py: 3.15.1
- httpcore: 1.0.9
- httpx: 0.28.1
- hydra-core: 1.3.2
- hydra-submitit-launcher: 1.4.0.dev0
- idna: 3.11
- importlib-metadata: 8.0.0
- inflect: 7.3.1
- iniconfig: 2.3.0
- ipdb: 0.13.13
- ipykernel: 7.1.0
- ipympl: 0.9.8
- ipython: 9.8.0
- ipython-pygments-lexers: 1.1.1
- ipywidgets: 8.1.8
- isoduration: 20.11.0
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jaxtyping: 0.3.3
- jedi: 0.19.2
- jinja2: 3.1.6
- json5: 0.12.1
- jsonpointer: 3.0.0
- jsonschema: 4.25.1
- jsonschema-specifications: 2025.9.1
- jupyter-client: 8.6.3
- jupyter-core: 5.9.1
- jupyter-events: 0.12.0
- jupyter-lsp: 2.3.0
- jupyter-server: 2.17.0
- jupyter-server-terminals: 0.5.3
- jupyterlab: 4.5.0
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.28.0
- jupyterlab-widgets: 3.0.16
- kiwisolver: 1.4.9
- lark: 1.3.1
- lightning: 2.6.0
- lightning-utilities: 0.15.2
- loky: 3.5.6
- markdown-it-py: 4.0.0
- markupsafe: 3.0.3
- matplotlib: 3.10.7
- matplotlib-inline: 0.2.1
- mdurl: 0.1.2
- mistune: 3.1.4
- more-itertools: 10.3.0
- mpmath: 1.3.0
- multidict: 6.7.0
- nbclient: 0.10.2
- nbconvert: 7.16.6
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.6
- notebook-shim: 0.2.4
- numpy: 2.3.5
- nvidia-cublas-cu12: 12.6.4.1
- nvidia-cuda-cupti-cu12: 12.6.80
- nvidia-cuda-nvrtc-cu12: 12.6.77
- nvidia-cuda-runtime-cu12: 12.6.77
- nvidia-cudnn-cu12: 9.10.2.21
- nvidia-cufft-cu12: 11.3.0.4
- nvidia-cufile-cu12: 1.11.1.6
- nvidia-curand-cu12: 10.3.7.77
- nvidia-cusolver-cu12: 11.7.1.2
- nvidia-cusparse-cu12: 12.5.4.2
- nvidia-cusparselt-cu12: 0.7.1
- nvidia-nccl-cu12: 2.27.5
- nvidia-nvjitlink-cu12: 12.6.85
- nvidia-nvshmem-cu12: 3.3.20
- nvidia-nvtx-cu12: 12.6.77
- omegaconf: 2.3.0
- packaging: 25.0
- pandocfilters: 1.5.1
- parso: 0.8.5
- pexpect: 4.9.0
- pillow: 12.0.0
- platformdirs: 4.5.0
- pluggy: 1.6.0
- prometheus-client: 0.23.1
- prompt-toolkit: 3.0.52
- propcache: 0.4.1
- protobuf: 6.33.1
- psutil: 7.1.3
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pycparser: 2.23
- pydantic: 2.12.5
- pydantic-core: 2.41.5
- pygments: 2.19.2
- pyparsing: 3.2.5
- pytest: 9.0.1
- python-dateutil: 2.9.0.post0
- python-json-logger: 4.0.0
- pytorch-lightning: 2.6.0
- pyyaml: 6.0.3
- pyzmq: 27.1.0
- referencing: 0.37.0
- requests: 2.32.5
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rfc3987-syntax: 1.1.0
- rich: 14.2.0
- rpds-py: 0.30.0
- scipy: 1.16.3
- send2trash: 1.8.3
- sentry-sdk: 2.47.0
- setuptools: 80.9.0
- six: 1.17.0
- smmap: 5.0.2
- soupsieve: 2.8
- stack-data: 0.6.3
- submitit: 1.5.3
- sympy: 1.14.0
- terminado: 0.18.1
- tinycss2: 1.4.0
- toml: 0.10.2
- tomli: 2.0.1
- torch: 2.9.1+cu126
- torch-fidelity: 0.3.0
- torchdata: 0.11.0
- torchmetrics: 1.8.2
- torchvision: 0.24.1+cu126
- tornado: 6.5.2
- tqdm: 4.67.1
- traitlets: 5.14.3
- triton: 3.5.1
- typeguard: 4.3.0
- typing-extensions: 4.15.0
- typing-inspection: 0.4.2
- tzdata: 2025.2
- uri-template: 1.3.0
- urllib3: 2.5.0
- wadler-lindig: 0.1.7
- wandb: 0.23.1
- wcwidth: 0.2.14
- webcolors: 25.10.0
- webencodings: 0.5.1
- websocket-client: 1.9.0
- wheel: 0.45.1
- widgetsnbextension: 4.0.15
- yarl: 1.22.0
- zipp: 3.19.2
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.13.2
- release: 5.15.0-161-generic
- version: #171-Ubuntu SMP Sat Oct 11 08:17:01 UTC 2025
More info
No response