resources JS vs actors

mark

My question here is: what is the bottleneck if the CPU and GPU are far from being stressed? As there isn't any video file playing and all content is either generative or comes from video capture, it shouldn't be the flash drive (1500 Mbit/s).

The speed of the hard drive has nothing to do with this issue.

As I mentioned above about the Video Delay actor, it needs to move the image from the GPU to the CPU. Then I said "Unfortunately there is a high cost performance wise when moving images from the GPU to the CPU. That's what you're seeing when you add those video delay actors."

GPUs are designed to pull data from CPU RAM very, very quickly. But they are not designed to go in the other direction. (Why is this? Because GPUs are designed for gaming, not for video processing. A game never needs to get the image back from the GPU, so GPUs are not designed to deal with this use case.)

In any case, when you ask the GPU to give the data back to you, it causes what's called a "stall": the GPU has to finish all of its pending operations before it can hand you the image. Such a stall destroys the parallel processing (= threading) that makes the GPU so fast. Moreover, the CPU has to sit and wait until all of those pending GPU operations are complete.
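To make the cost of a stall concrete, here is a toy timing model. It is not Isadora code, and the per-frame costs are made-up numbers. When the CPU only queues work, CPU and GPU run in parallel and the slower of the two sets the frame time; a synchronous readback makes the CPU wait for the GPU to drain, so the two costs add up, plus the copy itself. The function names are hypothetical.

```python
# Toy model of CPU/GPU overlap per frame (assumed numbers, not measurements).

def frame_time_pipelined(cpu_ms: float, gpu_ms: float) -> float:
    """CPU queues commands and moves on while the GPU works in parallel,
    so in steady state the slower side determines the frame time."""
    return max(cpu_ms, gpu_ms)


def frame_time_with_readback(cpu_ms: float, gpu_ms: float, copy_ms: float) -> float:
    """A synchronous readback forces the CPU to wait until the GPU has
    finished every pending command, then copy the pixels back. CPU and
    GPU no longer overlap, so their costs are added together."""
    return cpu_ms + gpu_ms + copy_ms


if __name__ == "__main__":
    cpu, gpu, copy = 10.0, 12.0, 3.0  # assumed per-frame costs in milliseconds
    t1 = frame_time_pipelined(cpu, gpu)
    t2 = frame_time_with_readback(cpu, gpu, copy)
    print(f"pipelined:     {t1:.1f} ms -> {1000 / t1:.0f} fps")  # 12.0 ms -> 83 fps
    print(f"with readback: {t2:.1f} ms -> {1000 / t2:.0f} fps")  # 25.0 ms -> 40 fps
```

Even in this simplified model, neither the CPU nor the GPU is anywhere near 100% busy in the readback case; each spends much of the frame waiting on the other.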

It is possible that we could make a Video Delay actor that keeps all the frames on the GPU, which would make it more efficient. The problem is it's not trivial to find out how much memory is available in the GPU and to get the actor to fail gracefully if there isn't enough GPU memory.
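A minimal sketch of what "failing gracefully" could look like, assuming the amount of free memory is already known: the hypothetical `FrameDelay` class computes the buffer size up front, refuses to start if it would not fit, and otherwise behaves as a simple first-in, first-out delay line. The `available_bytes` argument stands in for a platform-specific VRAM query, which is precisely the non-trivial part mentioned above, and frames are plain byte strings rather than GPU textures.

```python
from collections import deque


class FrameDelay:
    """Hypothetical fixed-length frame delay (FIFO). Not Isadora's implementation."""

    def __init__(self, width: int, height: int, delay_frames: int,
                 available_bytes: int, bytes_per_pixel: int = 4):
        if delay_frames < 1:
            raise ValueError("delay_frames must be at least 1")
        self.frame_bytes = width * height * bytes_per_pixel
        needed = self.frame_bytes * delay_frames
        # available_bytes is an assumed input: a real GPU-side version would
        # have to query free video memory, which is platform-specific.
        if needed > available_bytes:
            # Fail up front with a clear error instead of running out mid-show.
            raise MemoryError(
                f"delay needs {needed / 1e6:.0f} MB but only "
                f"{available_bytes / 1e6:.0f} MB is available")
        self.delay_frames = delay_frames
        self.frames = deque()

    def push(self, frame: bytes):
        """Store the incoming frame and return the one from delay_frames ago
        (None while the buffer is still filling)."""
        if len(frame) != self.frame_bytes:
            raise ValueError("frame has the wrong size")
        # Pop before appending so the buffer never holds more than delay_frames.
        out = self.frames.popleft() if len(self.frames) == self.delay_frames else None
        self.frames.append(frame)
        return out


if __name__ == "__main__":
    # A 10 s delay of 1080p at 30 fps against an assumed 2 GB budget is refused.
    try:
        FrameDelay(1920, 1080, delay_frames=300, available_bytes=2_000_000_000)
    except MemoryError as err:
        print(err)  # delay needs 2488 MB but only 2000 MB is available
```

Checking the budget in the constructor means the failure happens when the delay is set up, not in the middle of a performance.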

Again, every 1920x1080 frame (at 4 bytes per pixel) consumes 8.29MB. You want a ten-second delay at 30 fps? That's 30 x 10 x 8.29MB ≈ 2.5GB. A lot of GPUs could handle this, but some would run out of memory. It was this fact that led to my decision to keep the delayed frames in CPU RAM.
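The same arithmetic spelled out, assuming 8-bit RGBA (4 bytes per pixel), which is what the 8.29MB figure implies. The function names are illustrative only.

```python
def frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one uncompressed frame; 4 bytes per pixel assumes 8-bit RGBA."""
    return width * height * bytes_per_pixel


def delay_buffer_bytes(width: int, height: int, fps: int, seconds: int,
                       bytes_per_pixel: int = 4) -> int:
    """Memory needed to hold every frame of a delay of the given length."""
    return frame_bytes(width, height, bytes_per_pixel) * fps * seconds


if __name__ == "__main__":
    per_frame = frame_bytes(1920, 1080)
    total = delay_buffer_bytes(1920, 1080, fps=30, seconds=10)
    print(f"one frame: {per_frame / 1e6:.2f} MB")                               # 8.29 MB
    print(f"10 s at 30 fps: {total / 1e9:.2f} GB ({total / 2**30:.2f} GiB)")    # 2.49 GB (2.32 GiB)
```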

Best Wishes,
Mark

DillTheKraut

@mark said:

GPU to the CPU

OK, I missed that part, as I wasn't aware that the direction of data transmission would be handled differently. So the bottleneck is not the bus, but the design of the graphics cards? In that case VRAM/shared Mem comes to mind, but maybe that is a whole other story?

Thank you for always taking the time to explain things in depth.

It shows the complexity of programming live video tools, and therefore the effort you put into Isadora.

This community, with the creator as such an active part of it, makes the Isadora project all the more one of a kind!

Thanks a lot!

DillTheKraut

@mark said:

That's 30 x 10 x 8.29MB ≈ 2.5GB

I guess that's what is copied back and forth per second? Given that a PCIe 2.0 x16 bus has 8GB/s of throughput (that's the spec of the old Mac Pro 5,1), this would only allow a maximum of three of those delays, right?

mark

@dillthekraut said:

I guess that's what is copied back and forth per second?

No, that's not the case. That figure is the total CPU RAM the delay buffer occupies, not what is transferred every second. What actually crosses the bus is 8.29 MB per frame in each direction. For each render cycle, you're grabbing the 'video in' frame from the GPU (slow, a bottleneck because of the stalls) and storing it in CPU RAM, and you are also taking one of the delayed frames from CPU RAM and shipping it to the GPU (fast). Assuming a frame rate of 30 fps, that's 2 x 8.29MB x 30 = 497.4 MB per second of transfer.
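A quick check of these numbers against the 8GB/s figure from the question (PCIe 2.0 x16, roughly 8GB/s in each direction). It assumes 4 bytes per pixel; with the unrounded frame size the total comes to 497.7 MB/s rather than 497.4 MB/s. On raw bandwidth alone the bus would have room for dozens of such transfers, which supports the point above that the stall, not the bus throughput, is what hurts.

```python
def transfer_per_second(width: int, height: int, fps: int,
                        bytes_per_pixel: int = 4) -> tuple[int, int]:
    """Bytes per second read back (GPU -> CPU) and uploaded (CPU -> GPU)
    by a CPU-side delay: one incoming frame down, one delayed frame up."""
    frame = width * height * bytes_per_pixel
    return frame * fps, frame * fps


PCIE2_X16_PER_DIRECTION = 8e9  # ~8 GB/s each way, the figure quoted in the question

if __name__ == "__main__":
    down, up = transfer_per_second(1920, 1080, 30)
    print(f"total: {(down + up) / 1e6:.1f} MB/s")                                   # 497.7 MB/s
    print(f"readback uses {down / PCIE2_X16_PER_DIRECTION:.1%} of one direction")   # 3.1%
    # Ignores stalls entirely: this is the bandwidth ceiling only.
    print(f"delays the bandwidth alone could carry: {int(PCIE2_X16_PER_DIRECTION // down)}")  # 32
```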

RE: this:

VRAM/shared Mem

Video RAM on the GPU is not shared with the CPU... at least on most existing computers with discrete graphics cards. Now, I'm actually not sure about integrated GPUs like the Intel UHD Graphics 630 (1536 MB) on my computer... maybe there is some sharing there? (I just looked it up; it's kind of complicated.) Interestingly, I know that on the new M1 chips the RAM is shared by the CPU and GPU, which might make this bottleneck go away as we transition to those machines. (Again, I'm not 100% certain about this... I'm inferring it from a few things I've read.)

Best Wishes,
Mark

liminal_andy

@mark any thoughts on the new resizable BAR technologies that have just become available on most GPUs? Would this have an impact on Isadora's capabilities?