
I'm more interested in the technical details than the publicity. Pretty much anyone these days can learn what a diffusion model is, how it's implemented, and what the control flow is. What about these new multimodal LLMs? They have no problems with text, and they generate images using tokens, but how exactly? There are no open-source implementations that I know of, and I'm struggling to find details.
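For what it's worth, the commonly described recipe is: a VQ-style tokenizer maps image patches to discrete codebook indices, and the LLM then predicts those indices autoregressively like ordinary text tokens. Here's a minimal NumPy sketch of that recipe under those assumptions — the codebook, the toy next-token predictor, and all names here are illustrative stand-ins, not any specific model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy VQ codebook: K discrete codes, each a small patch embedding.
K, D = 16, 4                          # codebook size, patch-embedding dim
codebook = rng.normal(size=(K, D))

def quantize(patches):
    """Map each patch embedding to the index of its nearest codebook entry."""
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)       # one discrete token per patch

def toy_next_token_logits(prefix):
    """Stand-in for a transformer: deterministic logits from the prefix."""
    h = np.zeros(K)
    for i, t in enumerate(prefix):
        h[(t + i) % K] += 1.0         # arbitrary but fixed 'dynamics'
    return h

def generate(n_tokens, bos=0):
    """Greedy autoregressive decoding of an image as a token sequence."""
    seq = [bos]
    for _ in range(n_tokens):
        seq.append(int(toy_next_token_logits(seq).argmax()))
    return seq[1:]

def detokenize(tokens):
    """Look tokens back up in the codebook to recover patch embeddings."""
    return codebook[np.array(tokens)]

tokens = generate(n_tokens=9)         # e.g. a 3x3 grid of patches
image_patches = detokenize(tokens)    # shape (9, D)
```

In a real system the detokenizer is a learned decoder that turns the code sequence back into pixels, but the interesting part is that image generation reduces to next-token prediction over a discrete vocabulary.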


This video is very good. https://youtu.be/EzDsrEvdgNQ?si=EWp3U1GMkwg1bMQQ

One thing I'd add is that generating the tokens at the target resolution from the start is no longer the only approach to autoregressive image generation.

Rather than predicting each patch at the target resolution right away, the model starts with the image (as patches) at a very small resolution and progressively scales up. Paper here - https://arxiv.org/abs/2404.02905
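The coarse-to-fine loop can be sketched roughly like this. Caveat: the per-scale predictor below is a toy stand-in, and in the actual paper the whole token map at each scale is predicted in one parallel step by a transformer conditioned on all coarser scales, not just the previous one:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16                                   # toy codebook size

def upsample(token_map, size):
    """Nearest-neighbour upsampling of a square 2-D token map."""
    idx = np.arange(size) * token_map.shape[0] // size
    return token_map[np.ix_(idx, idx)]

def toy_predict_scale(context, size):
    """Stand-in for the model: emit the full size x size token map in one
    step, conditioned on the upsampled coarser map."""
    coarse = upsample(context, size)
    return (coarse + 1) % K              # arbitrary but fixed 'refinement'

def generate_coarse_to_fine(scales=(1, 2, 4, 8)):
    # Start from a tiny token map and refine it scale by scale.
    token_map = rng.integers(0, K, size=(scales[0], scales[0]))
    for s in scales[1:]:
        token_map = toy_predict_scale(token_map, s)
    return token_map

final = generate_coarse_to_fine()
print(final.shape)                       # (8, 8)
```

The appeal is that each step predicts a whole resolution level at once, so the number of sequential decoding steps grows with the number of scales rather than with the number of patches.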



