
ArnavAgrawal03 · 2025-12-03T04:27:57 1764736077

you can do that with Morphik already :)

We use an embedding model that processes videos and allows you to perform RAG on them.

eurekin · 2025-12-03T11:13:05 1764760385

Would it allow me to query my library for every movie that contains the dance routine move1-move2-move3, in that order?

arresin · 2025-12-03T09:00:15 1764752415

RAG as in the content is used to generate an answer, or RAG as in searching for a video?

ArnavAgrawal03 · 2025-11-07T18:19:03 1762539543

Came here to say I love OCaml too

ArnavAgrawal03 · 2025-09-01T20:58:27 1756760307

We use a BSL for our product (https://morphik.ai) and usually stay away from calling it anything. We'd just say "repo is public at: https://github.com/morphik-org/morphik-core". I like the term fair source, though.

Is it correct to assume that software that eventually becomes open under something like Apache or MIT is fair source? Or is there more subtlety to it?

the_mitsuhiko · 2025-09-01T21:01:07 1756760467

> Is it correct to assume that software that eventually becomes open under something like Apache or MIT is fair source? Or is there more subtlety to it?

The concrete definition we came up with and published:

> Fair Source is an alternative to closed source, allowing you to safely share access to your core products. Fair Source Software (FSS):

> - is publicly available to read;

> - allows use, modification, and redistribution with minimal restrictions to protect the producer’s business model; and

> - undergoes delayed Open Source publication (DOSP).

ArnavAgrawal03 · 2025-08-29T23:09:30 1756508970

We used multi-vector models at Morphik, and I can confirm the real-world effectiveness, especially when compared with dense-vector retrieval.

bjornsing · 2025-08-30T09:03:12 1756544592

Can you share more about the multi-vector approach? Is it open source?

codingjaguar · 2025-08-30T00:00:29 1756512029

Curious, are those ColBERT-like ones?

ArnavAgrawal03 · 2025-08-15T16:17:45 1755274665

> They had known him for only 15 seconds, yet they still perceived the act of snapping him in half as violent.

This is right out of Community

WorkerBee28474 · 2025-08-15T19:24:17 1755285857

Clip from s01e01: https://www.youtube.com/watch?v=z906aLyP5fg

ArnavAgrawal03 · 2025-07-22T02:36:03 1753151763

Our argument in general is that even in the non-flattened cases, we see complex diagrams pop up in documents that won't work with a text-based approach.

In the context of RAG, the objective is to send information to the model, so LLMs are the right tool for the job.

ArnavAgrawal03 · 2025-07-22T00:14:39 1753143279

Would love to try our hand at it! We have a couple magazine use cases, but the harder it is, the more fun it is :)

ArnavAgrawal03 · 2025-07-22T00:13:32 1753143212

Would love feedback :)

ArnavAgrawal03 · 2025-07-22T00:11:56 1753143116

Yes! We have a use case in production with over a million pages. MUVERA is good for this, since it is basically akin to regular vector search + re-ranking.

In our current setup, we have the multivectors stored as .npy in S3 Express storage. We use Turbopuffer for the vector search + filtering part. Pre-warming the namespace and pre-fetching the most common vectors from S3 mean that the search latency is almost indistinguishable from regular vector search.

ColPali with binary vectors worked fine, but to be honest there have been so many specific improvements to single vectors that switching to MUVERA gave us a huge boost.

Regular multivector ColPali also suffers from a similar issue. Chamfer distance is just hard to compute at scale. PLAID is a good solution if your corpus is constant. If it isn't, using the regular multivector ColPali as a re-ranking step is a good bet.
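
To make the re-ranking idea concrete, here is a minimal sketch of late-interaction (Chamfer / MaxSim style) scoring applied to a small candidate set that a cheaper first-stage search has already returned. This is an illustration under assumptions, not our production code: the names `maxsim_score` and `rerank` are made up for the example, vectors are plain Python lists, and similarity is a plain dot product.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Chamfer-style similarity: for each query vector, take the best
    dot product against any document vector, then sum over the query."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(
    query_vecs: Sequence[Vector],
    candidates: Dict[str, Sequence[Vector]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates with the multivector score.

    `candidates` maps a hypothetical document id to its multivector.
    The cost is |query| * |doc| dot products per candidate, which is
    why this is only run on a short candidate list.
    """
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    first_stage_hits = {
        "a": [[1.0, 0.0], [0.0, 1.0]],
        "b": [[1.0, 0.0]],
        "c": [[0.5, 0.25]],
    }
    print(rerank(query, first_stage_hits, top_k=2))
```

The inner `max` over every document vector for every query vector is what makes this expensive across a whole corpus, so keeping it to a few dozen candidates per query is the point of using it as a second stage.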

ArnavAgrawal03 · 2025-07-21T22:35:11 1753137311

For HTML, in a lot of cases, using the tags to chunk things works better. However, I've found that when I'm trying to design a page, showing models the actual image of the page leads to way better debugging than just sending the code back.

1 vs I or 0 vs O are valid issues, but in practice - and there's probably selection bias here - we've seen documents with a ton of diagrams and charts (that are much simpler to deal with as images).
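
On the HTML point, a rough sketch of what chunking by tags can look like, using only Python's standard `html.parser`. The set of tags treated as chunk boundaries (`BLOCK_TAGS`) and the names `TagChunker` and `chunk_html` are assumptions for illustration; a real pipeline would likely also keep heading context and handle tables.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: these tags mark chunk boundaries; tune for your pages.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li", "section", "article"}


class TagChunker(HTMLParser):
    """Collects text and cuts a new chunk at every block-level tag."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._buf: List[str] = []

    def flush(self) -> None:
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def chunk_html(html: str) -> List[str]:
    parser = TagChunker()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    return parser.chunks


if __name__ == "__main__":
    page = "<h1>Title</h1><p>First  para.</p><p>Second <b>bold</b> para.</p>"
    print(chunk_html(page))
```

Inline tags such as `<b>` are deliberately not boundaries here, so their text stays inside the surrounding paragraph's chunk.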