At Portkey, this is a problem we deal with quite a bit. It's also the reason Datadog and the traditional observability vendors haven't worked for LLM use cases: they're not built to handle such large volumes of data.
We've handled this with a careful combination of Clickhouse and MinIO: fast retrieval of log items, plus selective retrieval from the MinIO buckets.
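For anyone who wants to picture that split, here is a minimal sketch, not Portkey's actual implementation. It assumes small metadata rows live in the query engine and full request/response bodies live in object storage, keyed from the row. To keep it runnable without any services, `sqlite3` stands in for Clickhouse and a plain dict stands in for a MinIO bucket. The table layout and the `write_log` / `find_logs` helpers are hypothetical.

```python
import json
import sqlite3

# Hypothetical stand-ins so the sketch runs anywhere:
#   sqlite3 plays the role of Clickhouse (the metadata index),
#   a dict plays the role of a MinIO bucket (object key -> payload bytes).
bucket = {}
index = sqlite3.connect(":memory:")
index.execute(
    "CREATE TABLE logs ("
    "id TEXT PRIMARY KEY, ts INTEGER, model TEXT, task TEXT, object_key TEXT)"
)


def write_log(log_id, ts, model, task, request, response):
    """Store the bulky payload in the bucket and only a small row in the index."""
    key = f"logs/{log_id}.json"
    bucket[key] = json.dumps({"request": request, "response": response}).encode()
    index.execute(
        "INSERT INTO logs VALUES (?, ?, ?, ?, ?)", (log_id, ts, model, task, key)
    )


def find_logs(task, since_ts):
    """Filter on metadata first, then fetch only the matching payloads."""
    rows = index.execute(
        "SELECT id, object_key FROM logs WHERE task = ? AND ts >= ? ORDER BY ts",
        (task, since_ts),
    ).fetchall()
    return [(log_id, json.loads(bucket[key])) for log_id, key in rows]
```

In this layout the index never sees the bodies, so filtering stays cheap, but searching inside request or response text means fetching objects one by one.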
Cost becomes a very big factor when managing, filtering and searching through TBs of data even for fairly small use cases.
One thing we lost in the process is full-text search over the request and response pairs. We try to intelligently add metadata to requests to make searching easier, but it isn't the complete experience yet. It's still a WIP problem for us, and maybe the last missing piece here. Any suggestions?
Clickhouse has text and vector indexes, so that may be available natively, though we have never used them, and I find vector indexes tricky to scale with other DBs. Text alone, or neither, may be enough in practice though, as we mostly only care about searching on metadata dimensions like task.
We are thinking about sampled hot data for ops staff in otel DBs + UIs, and long-term full data in S3/Clickhouse for custom tooling. It'd be cool if we could send historical otel sessions from Clickhouse to Grafana etc. on demand, but that's likely a bridge too far...
I think you can (pretty) easily set this up with an otel collector and something that replays data from S3 - there's a native implementation that converts otel to Clickhouse.
Our scenario would be more like using Clickhouse / a dwh for session cohort/workflow filtering and then populating otel tools for viz goodies. Interestingly, to your point, the otel python exporter libs are pretty simple, so SQL results -> otel spans -> Grafana temp storage should be simple!
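A rough sketch of that "SQL results -> otel spans" step, with stated assumptions. Instead of the otel Python exporter libs it builds the OTLP/JSON payload directly with the standard library, so it runs without dependencies. The row shape (`session_id`, `span_name`, `start_ns`, `end_ns`, `attrs`), the `rows_to_otlp` helper, and the service and scope names are all hypothetical. IDs are derived by hashing so that replaying the same rows gives the same trace, and every span is emitted flat, with no parent links.

```python
import hashlib
import json


def _hex_id(seed, n_bytes):
    """Deterministic hex ID: OTLP wants 16-byte trace IDs and 8-byte span IDs."""
    return hashlib.sha256(seed.encode()).hexdigest()[: n_bytes * 2]


def rows_to_otlp(rows, service_name="llm-replay"):
    """Turn query-result rows (hypothetical shape) into one OTLP/JSON trace payload."""
    spans = []
    for row in rows:
        span_seed = f"{row['session_id']}:{row['span_name']}:{row['start_ns']}"
        spans.append({
            "traceId": _hex_id(row["session_id"], 16),  # one trace per session
            "spanId": _hex_id(span_seed, 8),
            "name": row["span_name"],
            "kind": 1,  # SPAN_KIND_INTERNAL
            # 64-bit integers are carried as strings in the JSON encoding
            "startTimeUnixNano": str(row["start_ns"]),
            "endTimeUnixNano": str(row["end_ns"]),
            "attributes": [
                {"key": k, "value": {"stringValue": str(v)}}
                for k, v in sorted(row.get("attrs", {}).items())
            ],
        })
    resource_attrs = [{"key": "service.name", "value": {"stringValue": service_name}}]
    return {
        "resourceSpans": [{
            "resource": {"attributes": resource_attrs},
            "scopeSpans": [{"scope": {"name": "clickhouse-replay"}, "spans": spans}],
        }]
    }


example_rows = [
    {"session_id": "s1", "span_name": "llm.call", "start_ns": 1700000000000000000,
     "end_ns": 1700000000500000000, "attrs": {"task": "summarize"}},
    {"session_id": "s1", "span_name": "tool.call", "start_ns": 1700000000600000000,
     "end_ns": 1700000000700000000, "attrs": {"task": "summarize"}},
]
payload_json = json.dumps(rows_to_otlp(example_rows))
```

The resulting JSON could then be POSTed to an OTLP/HTTP traces endpoint (conventionally `/v1/traces` on a collector) with `Content-Type: application/json`. Whether the backend behind it accepts spans with old timestamps depends on its configuration.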