
Any way to integrate this within a pytorch dataloader/dataset? #17

Open · TheShadow29 opened this issue Jul 26, 2022 · 9 comments

@TheShadow29

Hello, is there an easy way to use the frame extraction method within a PyTorch dataset/dataloader? My understanding is that this is not trivial given the use of multiprocessing here: https://github.com/iejMac/video2numpy/blob/main/video2numpy/frame_reader.py#L39, which could conflict with the multiprocessing PyTorch's DataLoader already uses.

@iejMac (Owner) commented Jul 26, 2022

Hey, thanks for the issue. I have two thoughts about this:

  1. If your goal is to use the videos in a training loop (or any repetitive process, for that matter), you might want to split preprocessing into two steps: an initial run that decodes the mp4s/links into numpy arrays using video2numpy, and then a custom Dataset class that loads those numpy arrays, applies your more specific preprocessing, and feeds a PyTorch DataLoader (a minimal sketch of such a Dataset follows this list). We created this tool because video decoding/downloading often turns out to be the bottleneck in processing, so splitting the work this way tends to perform much better.
    As an example, check out clip-video-encode, another tool we're working on, which takes videos and encodes the frames with the CLIP image encoder. It does exactly what video2numpy does, but instead of saving numpy arrays of frames, it saves numpy arrays of frame embeddings. We used it to create a CLIP-embedding version of Kinetics700 in a format that uses PyTorch dataloaders and is rather fast. You could do the exact same thing with video2numpy.

  2. If you're suggesting using PyTorch dataloaders to make the reading itself faster, I haven't looked into that yet. There may be some merit to it; however, the biggest issue we currently have is the poor scaling with target video FPS, which I'm starting to believe is unavoidable without creating a purpose-built video codec. More details here: Performance Benchmarks for video decoding backends #5
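For concreteness, here is a minimal sketch of what the second step of option 1 could look like (the FrameDataset class, the .npy directory layout, and the transform hook are illustrative assumptions, not part of video2numpy):

import glob

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class FrameDataset(Dataset):
    # Hypothetical Dataset over the numpy arrays written by the initial video2numpy run.
    def __init__(self, npy_dir, transform=None):
        self.paths = sorted(glob.glob(f"{npy_dir}/*.npy"))
        self.transform = transform  # your more specific preprocessing goes here

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        frames = np.load(self.paths[idx])  # (n_frames, H, W, 3) uint8 array
        if self.transform is not None:
            frames = self.transform(frames)
        return torch.from_numpy(frames)

# Videos of different lengths need batch_size=1 or a custom collate_fn.
loader = DataLoader(FrameDataset("frames/"), batch_size=1, num_workers=4)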

Lmk if this answers your question, and feel free to ask more.

@TheShadow29 (Author)

@iejMac Thanks for your reply. My current use case: I'm working with the WebVid-2M dataset (https://m-bain.github.io/webvid-dataset/), which has around 2M videos of variable length, averaging about 18 seconds each.

After an initial preprocessing pass that resizes the videos to 256x256, the dataset is around 1.6 TB. Your suggestion of converting the mp4 links to numpy arrays in a first round of preprocessing would likely require a very large amount of disk space for all the videos, which is not affordable.

My current data pipeline involves a dataset (torch.utils.data.Dataset) which (i) reads the video file (no decoding), (ii) decodes only specific frames, (iii) applies preprocessing to them, and (iv) returns the corresponding tensor. I then wrap the dataset in a DataLoader.

The main bottleneck is the decoding part, though. I'm currently using decord for this, which in initial testing seems slightly faster than pyav and imageio-ffmpeg.
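For reference, a minimal sketch of a Dataset along those lines, using decord's documented API (the uniform frame-sampling policy and transform hook are placeholders, not my actual code):

import decord
import torch
from torch.utils.data import Dataset

class DecordVideoDataset(Dataset):
    # Hypothetical sketch of the pipeline described above.
    def __init__(self, video_paths, n_frames=4, transform=None):
        self.video_paths = video_paths
        self.n_frames = n_frames
        self.transform = transform

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        vr = decord.VideoReader(self.video_paths[idx])  # (i) open, no full decode
        idxs = torch.linspace(0, len(vr) - 1, self.n_frames).long().tolist()
        frames = vr.get_batch(idxs).asnumpy()  # (ii) decode only these frames
        if self.transform is not None:
            frames = self.transform(frames)  # (iii) preprocessing on the numpy array
        return torch.from_numpy(frames)  # (iv) return the tensor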

I do see your point about poor scaling with respect to target FPS, but I don't know whether that is the main bottleneck here.

The ideal case, to me, looks like your first suggestion, but applied to every mini-batch of videos at training/inference time: read all the mp4 files (no decoding) very fast, then pass them to the video decoder. Unfortunately, decoding is essentially sequential (one worker decodes one video), so the same bottleneck would likely remain.

Not sure if there is any easy way to mitigate this issue.

@iejMac (Owner) commented Jul 26, 2022

Hmmm, good point. For context: I hadn't thought about that, since we get around the storage problem by only storing lower-dimensional frame embeddings. Additionally, the reason the poor scaling with respect to target FPS is the bottleneck for us is that a target FPS of 1 is acceptable for our longer-context video understanding work, yet decoding doesn't get proportionally cheaper as the target FPS drops.

Anyway, thanks a ton for the description of your pipeline. You seem to be doing a few things we aren't, so I'll look into whether they can improve video2numpy. Could you point me to the code for this?
Also, do you have performance metrics for your pipeline, i.e., how many samples/s you retrieve at full FPS (perhaps with an estimate of the average FPS in WebVid)?

As for the ideal case, I think I agree. Like I said before, I'm getting to the point of wanting to calculate the theoretical maximum decoding speed given the way video is compressed; I don't want to waste time hyper-optimizing something if we're already close to the theoretical max. Have you looked into this yet?

Also, why do you want to integrate this within a PyTorch DataLoader? The input and output of the FrameReader class in video2numpy are exactly the same (see this example and the API section in the README). Are you getting better performance with your approach? I guess our FrameReader doesn't allow custom preprocessing functions, but that should be easy enough to add. Please let me know what video2numpy is missing for your use case.

@TheShadow29 (Author)

@iejMac I believe that even with longer-horizon videos, having access to the full videos is often useful. For instance, you can sample frames with a bit of temporal jittering (e.g., where you would have sampled frame 45, you instead sample frame 47), which can help generalization.
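For concreteness, jittering can be as simple as perturbing the sampled frame indices by a small random offset (a toy sketch; the offset range is arbitrary):

import numpy as np

def jitter_indices(indices, n_total_frames, max_offset=2):
    # Shift each sampled frame index by a small random amount, e.g. 45 -> 47.
    offsets = np.random.randint(-max_offset, max_offset + 1, size=len(indices))
    return np.clip(np.asarray(indices) + offsets, 0, n_total_frames - 1)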

I haven't looked into maximum decoding speed on my end.

The main reason to use a PyTorch DataLoader is the convenience of porting existing code. Many frameworks, such as PyTorch Lightning, expect a DataLoader to be passed to their trainer. If video2numpy could be easily instantiated as a DataLoader, that would be really helpful.

Using a DataLoader definitely improves performance, but I don't have a good benchmark against video2numpy. On my side, with the current dataloader, I get around 2 it/s on 8 A100 GPUs, with a batch size of 32 per GPU and num_workers=10 on each rank.

@iejMac (Owner) commented Jul 27, 2022

@TheShadow29 I'm a bit confused about what you mean by "full videos" and how this relates to temporal jittering. Downsampling from, say, 25 FPS to 1 FPS still gives you the full video, just more coarse-grained, which fits our method of using semantic CLIP embeddings. Because of how CLIP is trained, I don't think there would be much difference between the embeddings of frames n and n+1 at full FPS. But in general I suppose I do agree that full FPS would be useful even over longer time horizons; I just don't know how useful.

Ok, I see. In that case I think it would be useful to make our video2numpy FrameReader class work with PyTorch dataloaders, perhaps by making it easily wrappable. I'll look into this when I have some time and let you know. If you'd like to investigate sooner, feel free to give it a shot and make a PR.

What do you mean by an iteration in this context: a frame or a video?

@TheShadow29 (Author)

@iejMac I agree that for CLIP it is likely not that useful, but I am not training CLIP exactly; I'm trying a different model. Video models such as SlowFast often expect 8 or 32 frames (for the slow and fast pathways). For videos like those sourced from Kinetics, with lots of action, fps=1 wouldn't really work out.

Yeah, making the FrameReader wrappable as a DataLoader would be really useful. I highly doubt I'll have the bandwidth, but if I do find the time, I'll update you with a PR.

In this context, each batch element is a video (4 frames per video; I forgot to note that previously).

@iejMac (Owner) commented Jul 30, 2022

@TheShadow29 I made a PR with a wrapper that might work for you: #18
You can test it out on the pt_dl branch.

A simple example that should work:

from video2numpy.frame_reader import FrameReader
from video2numpy.pytorch_wrapper import fr2dl

vids = get_your_vids()  # either a list of mp4 paths or YouTube links
fr = FrameReader(vids, *other_args, workers=6)  # any other FrameReader args go before the keywords
dl = fr2dl(fr)  # wrap the FrameReader in a torch DataLoader

print(type(dl))  # <class 'torch.utils.data.dataloader.DataLoader'>
for b, info in dl:
    print(b.shape)  # (n_frames, size, size, 3)

credit to https://github.com/ClashLuke for the code
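Since fr2dl returns a regular torch DataLoader, it should drop into frameworks that expect one. For example, with PyTorch Lightning (an untested sketch, assuming dl from the snippet above and an existing LightningModule named model):

import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model, dl)  # dl is the DataLoader wrapping the FrameReader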

@rom1504 (Contributor) commented Aug 6, 2022

I think it would be nice to add this in an examples/pytorch.py

@rom1504 (Contributor) commented Aug 6, 2022

An examples/jax.py and an examples/tf.py definitely wouldn't hurt either.
