BUG: <Crash of Kilosort and/or computer during first clustering> #766
Comments
@ttmysd70c6f5 It says you used Kilosort versions 4.0.5 and 4.0.16. Did it crash in the same place when using the latter version? That should have generated a log file at |
Hi @jacobpennington, thank you for your response, and I am sorry for my late reply. |
Okay, thanks. Can you also attach a screenshot of what the Kilosort4 GUI looks like after loading the data? If you're using the GUI. |
Hi all, I'm encountering a similar issue. I have a GeForce RTX 3060 graphics card, an SSD, and plenty of RAM and CPU cores. My Kilosort4 installation worked well for an NP1 recording about 4500 s long, but now always crashes at the same point (Extracting spikes using cluster waveforms) for a ~6500 s recording. I've attached the log file. This should be far from maxing out my computer's capabilities. Any advice is appreciated! |
@katjaReinhard Can you please check that you're attaching the correct log file? That one is identical to the one uploaded above. Maybe try renaming it before uploading, it's possible github isn't handling the name clash correctly. |
thanks for the heads-up! I've attached it again as kilosort4_KR. |
Thanks. Just double checking, neither of you saw any error message in the GUI or terminal you were using to run Kilosort? I can see there's no error in the log file, I'm just wondering if there was any information elsewhere that didn't make it into the log for some reason. Otherwise, when it crashed, what happened exactly? Did the GUI just close on its own? |
In my case the GUI closed and the only message in the terminal was "killed", nothing else. As you can see in the log, we're loading a raw file after some preprocessing. I haven't yet tried running it directly on the bin file, but we didn't have problems with the raw files for other experiments. |
In my case, no message showed up in the terminal or the GUI except for this warning message:
though I guess this message is about something different. |
We are still getting the "killed" error, but now also a "CUDA error" when Kilosort crashes. nvidia-smi reports "gpu detected critical xid error". It seems that we have some issues with our graphics card, at least for long recordings. I'm not sure that is the only issue, though, as we didn't get the CUDA error before. |
Interesting.. that type of crash does make me think it's related to the hardware rather than Kilosort, but definitely something to check on. Are either of you able to share your data so I can try to reproduce the issue on my machine? |
Thank you for brainstorming with us! I'll have to check some options for sharing data. In any case, the possible graphics card failure really bothered me, so I'm now writing GPU and CPU details to a log file while running Kilosort. I'm quite surprised to see that GPU memory usage is consistently around 6 GB (out of 12) while my CPU RAM is constantly close to the limit (25-30 GB; I have 32 GB). From the guidelines I understood that CPU shouldn't be that critical and that 32 GB should be enough for this kind of recording (<2h). Are there any settings in Kilosort that I could adjust to reduce the CPU RAM usage? I think that this is actually what killed Kilosort (I'll know for sure once I have the final logs). I've already increased swap to 8 GB, but I'm wondering if there's anything else I can do. |
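For anyone who wants to do the same kind of logging, here is a minimal sketch of one way to record RAM and GPU memory alongside a sorting run. It is not the script used in this thread. It assumes Linux (it reads `/proc/meminfo`) and an `nvidia-smi` on the PATH; the function names and the log format are made up for illustration.

```python
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def ram_used_gib(info):
    """RAM in use in GiB, computed as MemTotal - MemAvailable (both in kB)."""
    return (info["MemTotal"] - info["MemAvailable"]) / 1024**2


def gpu_memory_used_mib():
    """Used memory of the first GPU in MiB via nvidia-smi, or None if unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return int(out.splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None


def log_usage(path, interval_s=5.0, n_lines=None):
    """Append one line per interval to `path`; n_lines=None loops until interrupted."""
    count = 0
    while n_lines is None or count < n_lines:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        stamp = time.strftime("%H:%M:%S")
        line = f"{stamp} ram_used_gib={ram_used_gib(info):.1f} gpu_used_mib={gpu_memory_used_mib()}\n"
        with open(path, "a") as log:
            log.write(line)
        count += 1
        time.sleep(interval_s)
```

Running `log_usage("usage.log")` in a second terminal while Kilosort sorts gives a timeline that can be compared against the step names in the Kilosort log.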
@jacobpennington After installing Kilosort4 from scratch on two computers running Linux as well as on one Windows computer, I can confirm based on the log files that the issue we encounter is that CPU RAM is maxed out. We have 32 GB of RAM. Usage hovers around 20-30 GB for most of the process and then, usually around the Extracting Spikes step or at the latest at the first clustering step, it exceeds 32 GB and Kilosort crashes. We have an RTX 3060 12 GB graphics card with a correctly installed Nvidia driver and CUDA. The graphics card is recognized by Kilosort and used at around 5-7 GB throughout the process. Given the computer specs that are suggested on the Kilosort site, this should not be a problem for a <2h recording with 1 probe. I also have the impression that other issues reported here might refer to the same problem. Do you have any idea why Kilosort is using more CPU RAM than anticipated? This is really a big issue for us since we're sitting on a bunch of data that we have no way to process and analyse right now. |
Yes, using more than 32GB is definitely not expected for a recording that size. There are steps of the pipeline that primarily use GPU, and other steps that primarily use CPU, so presumably there's some reason why your data, probe, or settings are causing one of those steps to use a lot more memory than usual. Unfortunately, it's hard for me to debug that from the log alone. Options to narrow things down would be:
|
I can share my data with you. How can I send my data to you? |
A google drive link has worked best. You can post the link here, or e-mail it to me at [email protected] |
I shared the google drive folder with the data and the channel map file. Here is the link of the drive: https://drive.google.com/drive/folders/1wwjdNpaIlEsRug1y0UUSkaYnu8MM9_K1?usp=sharing |
@jacobpennington We just found out that Kilosort used so much RAM in our hands because the preprocessing step we did in spikeinterface saved the data as float64, which was not apparent at all. In any case, we changed the format in which the spikeinterface output is saved, and now Kilosort runs on our data. Apologies for taking up your time. We can only hope that someone else with a similar problem will be able to figure it out thanks to this report :) |
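One quick way to catch this kind of dtype mismatch is to compare the size of the binary file on disk with the size expected for each dtype. The sketch below is only an illustration under stated assumptions: a flat binary file with no header, and example numbers (6500 s, 30 kHz, 384 channels) borrowed from this thread. The helper names are hypothetical and are not part of Kilosort or spikeinterface.

```python
# Bytes per value for a few dtypes a preprocessing step might write.
ITEMSIZE = {"int16": 2, "float32": 4, "float64": 8}


def expected_file_size(n_samples, n_chan, dtype):
    """Size in bytes of a headerless binary file holding n_samples x n_chan values."""
    return n_samples * n_chan * ITEMSIZE[dtype]


def guess_dtype(file_size, n_samples, n_chan):
    """Return the dtype names whose expected size matches file_size exactly."""
    return [d for d in ITEMSIZE
            if expected_file_size(n_samples, n_chan, d) == file_size]


# Assumed example: a 6500 s recording at 30 kHz with 384 channels.
n_samples = 6500 * 30000
n_chan = 384
size_int16 = expected_file_size(n_samples, n_chan, "int16")
size_float64 = expected_file_size(n_samples, n_chan, "float64")

print(size_int16 / 1e9, "GB if saved as int16")
print(size_float64 / 1e9, "GB if saved as float64")
print(guess_dtype(size_float64, n_samples, n_chan))
```

In practice `file_size` would come from `os.path.getsize(path)`. A file four times larger than the int16 estimate is a strong hint that the data were written as float64.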
@katjaReinhard Great, thanks for the information! Just a reminder for anyone else coming across this issue: before submitting a bug report, please try running KS4 alone without spikeinterface (or any other 3rd-party package) as an initial debugging step. |
Describe the issue:
Hi, Kilosort4 and/or the computer crashes while running the first clustering step. I observed that CPU, RAM, and GPU usage were very high (close to 100%) while Kilosort4 was running.
My data were recorded with a Neuropixels 1.0 probe with 384 channels, and the files are around 60-120 GB. However, Kilosort4 ran to completion without issue on a smaller 2 GB dataset (downloaded from http://www.kilosort.org/downloads/ZFM-02370_mini.imec0.ap.bin). So I guess my dataset is too big for Kilosort4, although I don't think it should be.
It is worth noting that my colleagues using similar hardware and datasets are not experiencing this issue. However, I still encountered crashes when I duplicated their conda environment on my hardware and ran spike sorting on my data.
I already tried (1) using an SSD, (2) reinstalling conda and recreating the conda environment, (3) applying the "Clear PyTorch Cache" option, (4) trying a different dataset with a similar recording duration, and (5) checking the CUDA version, but none of these resolved the issue.
I have attached the Kilosort4 settings that were in use when the crash occurred. Kilosort4 created no log file in the output directory.
I greatly appreciate your help in figuring out the solution for this issue. Thank you!
Reproduce the bug:
No response
Error message:
No response
Version information:
My hardware is as follows:
CPU: intel core i9 13900K 3.00GHz
OS: Windows10 64-bit
RAM: 128GB
GPU: NVIDIA GeForce RTX 4090
Kilosort 4.0.5 and 4.0.16
CUDA toolkit: 11.8
NVIDIA driver
Kilosort settings:
settings = {
'data_file_path': WindowsPath('F:/Data/Kilosort4/test/20240208_082558_merged.probe1.dat'),
'results_dir': WindowsPath('F:/Data/Kilosort4/test/kilosort4'),
'probe': '... (use print probe)',
'probe_name': 'channelmap_probe1_240208_082558.mat',
'data_dtype': 'int16',
'n_chan_bin': 384,
'fs': 30000.0,
'batch_size': 60000,
'nblocks': 1,
'Th_universal': 9.0,
'Th_learned': 8.0,
'tmin': 0.0,
'tmax': inf,
'nt': 61,
'shift': None,
'scale': None,
'artifact_threshold': inf,
'nskip': 25,
'whitening_range': 32,
'highpass_cutoff': 300.0,
'binning_depth': 5.0,
'sig_interp': 20.0,
'drift_smoothing': [0.5, 0.5, 0.5],
'nt0min': None,
'dmin': None,
'dminx': 32.0,
'min_template_size': 10.0,
'template_sizes': 5,
'nearest_chans': 10,
'nearest_templates': 100,
'max_channel_distance': None,
'templates_from_data': True,
'n_templates': 6,
'n_pcs': 6,
'Th_single_ch': 6.0,
'acg_threshold': 0.2,
'ccg_threshold': 0.25,
'cluster_downsampling': 20,
'x_centers': None,
'duplicate_spike_ms': 0.25
}