It captures two billion pixels per second. Essentially he captures the same scene many times (presumably 921,600 times to form a full 720p picture), watching a single pixel at a time, and composites all the captures together to form frames.
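To make that concrete, here is a rough sketch of the one-pixel-at-a-time idea as I understand it. The `scene` function, the tiny 4x3 resolution, and all the names are my own stand-ins for illustration, not anything from the video; the only real number is 1280 x 720 = 921,600.

```python
WIDTH, HEIGHT = 1280, 720      # 720p frame size
RUNS_NEEDED = WIDTH * HEIGHT   # one repeat of the scene per pixel: 921,600

def scene(t, x, y):
    """Hypothetical deterministic scene: a bright edge sweeping left to right."""
    return 1 if x <= t else 0

def capture_pixel(x, y, n_samples):
    """One run of the experiment: record a single pixel at every time step."""
    return [scene(t, x, y) for t in range(n_samples)]

def composite(width, height, n_samples):
    """Repeat the scene once per pixel, then regroup the traces into frames."""
    traces = {(x, y): capture_pixel(x, y, n_samples)
              for y in range(height) for x in range(width)}
    # frames[t][y][x] is assembled from width * height separate runs
    return [[[traces[(x, y)][t] for x in range(width)] for y in range(height)]
            for t in range(n_samples)]

frames = composite(4, 3, 5)    # tiny stand-in for 1280x720
```

This only works because `scene` returns the same thing on every run, which is exactly the repeatability requirement.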
I suppose that for entirely deterministic and repeatable scenes, where you don't care too much about noise and have near-infinite time on your hands to capture 1 ms of footage, then yes, you can effectively visualize 2B frames per second! But not capture them.
I would say that everyone - you, the other commenters disagreeing with you, and the video - is technically correct here, and it really comes down to semantics and how we want to define fps. In my opinion it's not really worth debating, since the video clearly describes its methodology, but it's useful to call out the differences on HN, where people frequently go straight to the comments before watching the video.