
Ah ok. It sounds like it's important to distinguish the two by saying 2D image, but I'm sure the reason is too advanced for me.


More likely to be distinguishing 2D from 3D graphics, and just saying "image" can be ambiguous when firmware images are in play.


So, are video-files of 2D images technically 3D then?

and time-based volumetric recordings of 3D video-games are 4D?


You can render a video file as a volume. I've looked at using that to make a video compression algorithm that operated on the volume rather than on the 2D frame stream. My hunch was that shapes in the 3D volume change more predictably than surfaces do from frame to frame, because the frames describe the movement of objects in space, but projected onto a two-dimensional surface. So you get these interesting 3D shapes with fairly predictable qualities across larger spans of time than your average 2D encoder sees while encoding a video. But I never could get it to work more efficiently than existing algorithms.

Still, it was a fun project.
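
For the curious, a rough sketch of what "rendering the video as a volume" means in practice (assuming OpenCV and NumPy are available; load_video_volume and clip.mp4 are just illustrative names, not anything from the original project):

    # Stack a video's frames into one (time, height, width) volume.
    import cv2
    import numpy as np

    def load_video_volume(path):
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Collapse to grayscale so the volume has exactly three axes.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        cap.release()
        return np.stack(frames, axis=0)  # shape: (T, H, W)

    volume = load_video_volume("clip.mp4")
    # One vertical line of pixels traced over the whole clip -- the kind
    # of 3D structure a volume-based encoder could try to exploit.
    xt_slice = volume[:, :, volume.shape[2] // 2]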


Existing algorithms do a lot of compression across frame sequences, but yeah, not quite in the same way as the imputed volume.

I wonder if your idea would work for lightfield captures, or time sequences of a lightfield.


I suspect you'd need your compression to "understand" the object relationships and camera movement to do better than frame sequences, and it'd probably still be incredibly hard, because you then add a lot of extra information up front in the hope that it lets you discard more pixel data...

But the more you understand the scene, the more you can potentially outright reconstruct, and in some contexts more loss would be entirely fine if the artifacts are plausible.


That's exactly where I ended up with this: I was decomposing the scene and realized that if you could do that reliably enough, you'd be recreating a model rather than an image, and then re-rendering that model. But at that point I don't think you're looking at a compression algorithm any more, other than in the very broadest sense of the word. Boundaries between objects would start to look fuzzy otherwise. As in: you'd no longer know exactly where the table ended and the hand started unless you modeled it precisely enough, and at that point you have an object model. So you might as well use it to render the whole scene.

Note that I did this in '98 or so, when there was less of a computational budget; maybe what I couldn't hack back then is feasible today.


2D images are already 3D if you consider the colors (RGBA) a dimension, or if the image contains layers, like a GIF.
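
Concretely (a minimal NumPy sketch; the 480x640 size is made up), an RGBA image is already stored as an array with three axes, the third one being the color channel:

    import numpy as np

    # A hypothetical 480x640 RGBA image: two spatial axes plus a channel axis.
    img = np.zeros((480, 640, 4), dtype=np.uint8)
    print(img.ndim)     # 3 -> height, width, channel
    print(img[10, 20])  # the [R, G, B, A] values at one pixel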


> So, are video-files of 2D images technically 3D then?

Video files are a linked list of 2D images.

> and time-based volumetric recordings of 3D video-games are 4D?

Kind of, but it would be an oversimplification. Typically when we refer to the dimensionality of objects, we're referring to physical dimensions. Time is a temporal dimension. I think it would be more specific to say this is a linked list of 3-dimensional images, right?


Time doesn't necessarily have to be another dimension. The images are just what they are, but organized in time.


You don't have to think of it as a dimension if that obfuscates the problem or isn't a useful mental model for you, but "organizing" it is what a dimension does. You can think of an image as colors or intensities "organized by" (I would call it "indexed by") width and height. If you work with videos in machine learning, you generally work with a tensor indexed by width, height, time, and color channel (red, green, blue); the order may vary, time often comes first, and you may use grayscale instead of color to drop the channel axis.
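
A sketch of that layout (PyTorch here; the axis order and the clip size are made up, and other libraries order the axes differently):

    import torch

    # A hypothetical 16-frame RGB clip at 224x224, laid out as (T, C, H, W).
    clip = torch.rand(16, 3, 224, 224)

    # Collapse color to a single channel to shrink the channel axis from
    # 3 to 1 (a plain average here; weighted luma is also common).
    gray_clip = clip.mean(dim=1, keepdim=True)  # shape: (16, 1, 224, 224)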


> If you work with videos in machine learning, you generally work with a tensor indexed by width, height, time, and color channel (red, green, blue)

(Assuming you do work with ML+videos) - it's surprising to hear you say you work with RGB instead of YUV - can you briefly explain why that's the case? I'd have thought that using luma/chroma separation would be much easier to work with (not just with traditional video tooling, but ML/NNs/etc themselves would have an easier time consuming it).


To clarify, I don't work professionally with videos; I've hacked on some projects and read some books about it. My professional experience with ML models is in writing backends to integrate with them; the models I've designed/trained were for my own education (so far, at least). The answer to your question is probably, "I'm a dilettante who doesn't know better; you may well know more than me."

My impression is that much of the time, color doesn't provide much signal and gives your model things to overfit on, so you collapse it down to grayscale. (Which is to say, most of the time you care about shape, but you don't care about color.) But I bet there are problem spaces where your intuition holds; I'm sure there's performance to be wrung out of a model by experimenting with different color spaces whose geometry might separate samples nicely.
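
To make the color-space point concrete (a sketch using the standard BT.601 weights; nothing here is specific to any particular model or dataset):

    import numpy as np

    def rgb_to_yuv(rgb):
        # Convert an (H, W, 3) RGB image in [0, 1] to YUV (BT.601 weights).
        m = np.array([[ 0.299,    0.587,    0.114  ],
                      [-0.14713, -0.28886,  0.436  ],
                      [ 0.615,   -0.51499, -0.10001]])
        return rgb @ m.T

    # Feeding only the Y (luma) plane to a model is the "collapse to
    # grayscale" option; keeping U/V around is the "try a different
    # color space" option.
    yuv = rgb_to_yuv(np.random.rand(224, 224, 3))
    y_plane = yuv[..., 0]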

I did something similarish a few months ago where I used LDA [1] to create a boutique grayscale representation where the intensity was correlated with the classification problem at hand, rather than the luminosity of the subject. It worked better than I'd have guessed, just on its own (though I suspect it wouldn't work very well for most problems). But the idea was to preprocess the frames of the video this way and then feed them into a CNN [2]. (Why not a transformer? Because I was still wrapping my mind around simpler architectures.)

[1] https://en.wikipedia.org/wiki/Linear_discriminant_analysis

[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
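
Roughly what that LDA-then-CNN pipeline might look like (a sketch under my own assumptions about the setup: per-pixel labels, scikit-learn's LDA, and a toy PyTorch CNN; frames and labels are placeholders, not real data):

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Placeholder data: 8 RGB frames (64x64x3) with a per-pixel label map
    # for the classification problem of interest.
    frames = np.random.rand(8, 64, 64, 3)
    labels = np.random.randint(0, 2, size=(8, 64, 64))

    # Fit LDA on individual pixels so the 1D projection ("intensity") is
    # correlated with the class rather than with luminosity.
    pixels = frames.reshape(-1, 3)
    lda = LinearDiscriminantAnalysis(n_components=1)
    lda.fit(pixels, labels.reshape(-1))

    # Turn every frame into a boutique grayscale image via the LDA axis.
    gray = lda.transform(pixels).reshape(8, 1, 64, 64).astype(np.float32)

    # Feed the preprocessed frames into a small CNN.
    cnn = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
    )
    logits = cnn(torch.from_numpy(gray))  # shape: (8, 2)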



