Running >100 Kubernetes clusters with a team is already quite complex. Teams already struggle with Kubernetes core concepts, and we have to make sure the time spent does not grow because of nice-to-have features that are not business relevant. Adding more complexity adds a lot of cost, and most people who look into features like Istio might find that it provides less value than the advertisements promise.
Instead of a service mesh we use a simple and flexible ingress controller, which can also be used as an API gateway: https://opensource.zalando.com/skipper/. With simple annotations (a core Kubernetes concept) we can do blue/green deployments, A/B tests and shadow traffic. You can build HTTP routing as complex as you wish and change anything in the HTTP request or response path.
IMHO there is not much value left that service meshes provide: mTLS and network policy maybe. Everything else skipper provides.
Blue/green with a simple kubectl plugin: https://github.com/szuecs/kubectl-plugins
More advanced automated blue/green deployments are possible via https://github.com/zalando-incubator/stackset-controller.
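To make the blue/green idea concrete, here is a minimal Python sketch of weighted traffic switching. It is not skipper or stackset-controller code: the function names, the `weights` dictionary and the step size are made up for illustration, and the only assumption is that each backend carries a numeric traffic weight.

```python
import random


def pick_backend(weights, rng):
    """Pick a backend with probability proportional to its weight.

    `weights` maps a (hypothetical) backend name to a non-negative number.
    Backends with weight 0 never receive traffic.
    """
    candidates = [(name, w) for name, w in weights.items() if w > 0]
    if not candidates:
        raise ValueError("at least one backend needs a positive weight")
    names = [name for name, _ in candidates]
    positive = [w for _, w in candidates]
    return rng.choices(names, weights=positive, k=1)[0]


def shift_traffic(weights, source, target, step):
    """Return new weights with up to `step` moved from `source` to `target`."""
    moved = min(step, weights[source])
    shifted = dict(weights)  # do not mutate the caller's dictionary
    shifted[source] -= moved
    shifted[target] += moved
    return shifted
```

A rollout would call `shift_traffic` repeatedly, for example from `{"blue": 100, "green": 0}` in steps of 20, and watch error rates between steps before moving on.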
Skipper also supports several openapi providers and is used by several production users.
We add DNS names for cluster-internal communication to this API gateway through CoreDNS templates: https://opensource.zalando.com/skipper/kubernetes/east-west-...
You can adopt it step by step and stop wherever you like, without the big all-in approach that service mesh vendors promote.
If that is not enough, you can get even more advanced and add HPA scaling by requests per second.
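For readers who have not used request-based autoscaling: the sketch below shows the replica calculation that the Kubernetes HPA documentation describes, `ceil(currentReplicas * currentMetric / targetMetric)`, applied to requests per second per pod. The function name, the parameter names and the default limits are my own choices for illustration; how the RPS metric reaches the HPA is not shown here.

```python
import math


def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Compute the replica count an HPA-style scaler would aim for.

    The tolerance mirrors the HPA idea of not scaling when the metric is
    already close to the target; 0.1 is used here as an assumed default.
    """
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    ratio = current_rps_per_pod / target_rps_per_pod
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, do nothing
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 pods each seeing 150 requests per second and a target of 100 per pod, this asks for 6 pods.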
We use all of these in >50 production clusters with regular shop and order traffic far beyond 10k requests per second. We are interested in features, but even more in stability. We are interested in the community and will fix most of the issues you show us, which makes errors less likely to happen for us, too.
We don't sell software or support, that's why we have fewer advertisements and don't show up at every KubeCon.
We skipper maintainers don't really want to comment on that, because it is very likely that if you do your own benchmarks, you show only the good parts and not the bad parts.
In our benchmarks skipper outperforms nginx in the pure routing case, while nginx outperforms skipper in the pure sendfile case. We did not test HAProxy, but I would bet it is not different from the nginx case. IMO: in general, use skipper as a microservice router and nginx as a streaming router that serves a lot of pictures or videos.
I would be happy to help in case you see performance or scaling issues with skipper.
If you like, join our #skipper channel in the Gophers Slack at gophers.slack.com: https://invite.slack.golangbridge.org/
Can you share more information with us in a gh issue https://github.com/zalando/skipper/issues?
I am interested in getting details: payload size, resource configuration and access patterns in your test case.
In the past we used Icinga at Zalando, and it scaled for us to 40k checks; after that we got huge latency problems. We now use ZMON https://github.com/zalando/zmon/, which is really great because it scales the checks. The database behind the graphs is KairosDB (a time series database) on top of Cassandra, which also scales. Even creating alerts can be automated, and development teams can add alerts themselves. You can easily build team dashboards, reuse checks/alerts and filter them to your entities.
InfluxDB was a nice try, but clustering was very unstable in the beginning (we tried with 0.7 and 0.8). If you don't want to be the monitoring configurator for your organization (application monitoring also has to be created and maintained), I highly recommend using ZMON (maybe Prometheus can also help). There is also a ZMON check to query Prometheus.
I am one of the maintainers of the https://opensource.zalando.com/skipper HTTP proxy library, which can support similar cases. We use it at Zalando https://www.zalando.com/ in Kubernetes and allow developers to connect to different kinds of data applications, including chat-based LLMs and notebooks. Of course we have OTel/OpenTracing support: https://opensource.zalando.com/skipper/operation/operation/#....
Comparing with the round robin and least connections LB algorithms is likely not a fair choice. It would be better to compare with consistent hashing, which naturally does stateful load balancing. In skipper you can tune the behavior per route with filters: https://opensource.zalando.com/skipper/reference/filters/#co... and https://opensource.zalando.com/skipper/reference/filters/#co...
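If consistent hashing is new to you, this is roughly the idea, as a small Python sketch of a hash ring with virtual nodes. It is a generic textbook version and not skipper's implementation; the class name, the `replicas` parameter and the choice of SHA-256 are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, backends, replicas=100):
        # Each backend is placed on the ring `replicas` times (virtual nodes)
        # so that keys spread more evenly across backends.
        ring = []
        for backend in backends:
            for i in range(replicas):
                ring.append((_hash(f"{backend}#{i}"), backend))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    def pick(self, key):
        """Return the backend owning the first ring point after the key's hash."""
        if not self._ring:
            raise ValueError("ring has no backends")
        # The modulo wraps around to the first point at the end of the ring.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The same key (for example a session or user id) always lands on the same backend, and when a backend is removed only the keys that lived on it move. That is what makes it a fairer baseline for stateful workloads than round robin.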
You don't want auto scaling? You can also limit concurrent requests to a route with queue support and make sure backends are not overloaded using scheduler filters https://opensource.zalando.com/skipper/reference/filters/#sc....
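As an illustration of what such a limit does, here is a minimal, single-threaded Python sketch of admission control with a bounded queue. It only models the bookkeeping; it is not how skipper implements its scheduler filters. The class name, the method names and the FIFO queue order are assumptions made for this example, and queue timeouts are left out.

```python
from collections import deque


class RouteLimiter:
    """Allow at most `max_concurrency` running requests and queue the overflow."""

    def __init__(self, max_concurrency, max_queue_size):
        self.max_concurrency = max_concurrency
        self.max_queue_size = max_queue_size
        self.active = 0          # requests currently sent to the backend
        self.queue = deque()     # waiting request ids, oldest first

    def arrive(self, request_id):
        """Return 'run', 'queued' or 'rejected' for a new request."""
        if self.active < self.max_concurrency:
            self.active += 1
            return "run"
        if len(self.queue) < self.max_queue_size:
            self.queue.append(request_id)
            return "queued"
        return "rejected"  # a proxy would answer with an error response here

    def finish(self):
        """Mark one running request as done.

        Returns the id of the queued request that takes over the free slot,
        or None if nothing was waiting.
        """
        if self.active == 0:
            raise RuntimeError("no running request to finish")
        if self.queue:
            return self.queue.popleft()  # slot is handed over, active unchanged
        self.active -= 1
        return None
```

The backend never sees more than `max_concurrency` requests at once, short bursts wait in the queue, and everything beyond that fails fast instead of piling up.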
If you need more, you can also help yourself and use Lua filters to influence these options: https://opensource.zalando.com/skipper/reference/scripts/ .
We are happy to hear from you, Sandor