
Advice for using cutouts of images #5

Open
JulioHC00 opened this issue May 23, 2023 · 3 comments
@JulioHC00

Hi! Not really an issue with the repository, but I'd really appreciate some advice on how to approach my objective, as I'm not used to dealing with such large amounts of data.

I need to download cutouts of the full-Sun images that correspond to SHARP patches. I have these bounding boxes stored in a database such that each is identified by a timestamp (corresponding to one of the images) and an active region (so there's potentially more than one bounding box per timestamp). Each bounding box is defined by x_min, x_max, y_min, y_max, so it can be used directly on the image. My problem is that there are roughly millions of cutouts, and with 3 components of the magnetograms this scales up very quickly. So far I'd been trying to process several chunks of each year's data at the same time and then storing each cutout in an individual file, but that seems quite inefficient. Any guidance on how this could be better approached would be greatly appreciated.
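One way to avoid millions of individual files is to bundle many cutouts into a single archive per batch. A minimal sketch, assuming cutouts are keyed by a hypothetical "timestamp_harpnum" naming scheme (the key format and array sizes here are illustrative, not the actual data):

```python
import io

import numpy as np

# Two stand-in cutouts of different sizes, keyed by "<timestamp>_<harpnum>".
cutouts = {
    "20100731T000000_104": np.ones((64, 80), dtype="f4"),
    "20100731T000000_117": np.zeros((48, 48), dtype="f4"),
}

buf = io.BytesIO()  # stands in for a file on disk
np.savez_compressed(buf, **cutouts)  # one compressed archive, many arrays

buf.seek(0)
loaded = np.load(buf)
print(sorted(loaded.files))  # ['20100731T000000_104', '20100731T000000_117']
```

The same idea extends to one Zarr group per batch, which also gives chunked, parallel-friendly reads.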

@PaulJWright
Member

I would probably suggest dask or zarr, as we use here. I would also check out this notebook from @wtbarnes: https://gist.github.com/wtbarnes/8c1e8e8e39414784fa24cca3e697dfff
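The appeal of the dask/zarr route is lazy, chunk-aware slicing: indexing a dask array with a bounding box only reads the chunks the box overlaps. A minimal sketch, using an in-memory numpy array as a stand-in for the real store (in practice `da.from_zarr(<store>)` would replace `da.from_array`; the shapes and box are hypothetical):

```python
import dask.array as da
import numpy as np

# Stand-in for a (time, y, x) magnetogram stack.
data = np.arange(10 * 64 * 64, dtype="f4").reshape(10, 64, 64)
arr = da.from_array(data, chunks=(1, 32, 32))  # chunked like a typical Zarr store

# One hypothetical bounding box at a single timestamp index.
t, y0, y1, x0, x1 = 3, 10, 42, 5, 37
cutout = arr[t, y0:y1, x0:x1].compute()  # only the overlapping chunks are touched
print(cutout.shape)  # (32, 32)
```

Nothing is loaded until `.compute()`, so many such slices can be built up and evaluated together by the dask scheduler.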

@PaulJWright PaulJWright self-assigned this May 23, 2023
@JulioHC00
Author

Thanks! I've given it a try, and at the moment it takes ~2-3 min per harpnum to process. Maybe this is as fast as I can get it to go, but it doesn't feel right. For example, for HARP 104 the indices span from 9846 to 10684, which is from around 2010-07-31 to 2010-08-07 (about 7 days). I know that the data is stored in chunks, and the way I process it doesn't exploit those chunks. I've put the code that I wrote in a gist; if at any point you have some time to have a look at it, I'd really appreciate any suggestions you can make. Though I understand if you can't help with this, and that's perfectly fine!
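One way to exploit the chunking mentioned above is to group the bounding boxes by the time chunk their timestamp index falls in, so each chunk is read once and serves every box inside it, instead of re-reading chunks box by box. A minimal stdlib sketch, where the chunk length and the `(t, y0, y1, x0, x1)` box format are assumptions:

```python
from collections import defaultdict

TIME_CHUNK = 8  # assumed chunk length along the time axis


def group_by_time_chunk(boxes):
    """Map time-chunk index -> boxes whose timestamp index lies in that chunk."""
    groups = defaultdict(list)
    for box in boxes:
        t = box[0]
        groups[t // TIME_CHUNK].append(box)
    return dict(groups)


boxes = [(0, 10, 20, 30, 40), (3, 5, 15, 5, 15), (9, 0, 8, 0, 8)]
print(group_by_time_chunk(boxes))
# {0: [(0, 10, 20, 30, 40), (3, 5, 15, 5, 15)], 1: [(9, 0, 8, 0, 8)]}
```

Each group can then be processed as one unit: load (or lazily slice) that chunk's time range once, cut out every box in the group, and move on.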

download_cutouts.py

@PaulJWright
Member

Okay, I'll see if I can find time to look at this over the weekend or next week. Let me know if you come up with a solution!
