
Hello (again) from the Gemma team! We are quite excited to push this release out and happy to answer any questions!

Opinions are our own and not of Google DeepMind.



It's fairly easy to pay OpenAI or Mistral money to use their APIs. Figuring out how Google Cloud Vertex works and how it's billed is more complicated. Azure and AWS are similar in how complex they are to use for this. Could Google Cloud please provide an OpenAI-compatible API and service? I know it's a different department. But it'd make using your models way easier. It often feels like Google Cloud has no UX or end-user testing done on it at all (not true for aistudio.google.com - that is better than before, for sure!).


Gemini models on Vertex AI can be called via a preview OpenAI-compatible endpoint [1], but plugging it into existing tooling where you don't have programmatic control over the API key, and where that key is expected to be long-lived, is non-trivial, because GCP uses short-lived access tokens (and long-lived ones are not great security-wise).
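A rough sketch of what that looks like from Python with the OpenAI client (the base_url format and model name below are from memory of the preview docs, so treat them as assumptions; PROJECT_ID and LOCATION are placeholders):

    # Sketch: calling the Vertex AI OpenAI-compatible preview endpoint with a
    # short-lived GCP access token. base_url/model naming are assumptions.
    import google.auth
    import google.auth.transport.requests
    from openai import OpenAI

    PROJECT_ID = "my-project"   # placeholder
    LOCATION = "us-central1"    # placeholder

    # Short-lived token (~1h) from Application Default Credentials; this is
    # exactly why tooling that expects a long-lived API key is awkward here.
    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"])
    creds.refresh(google.auth.transport.requests.Request())

    client = OpenAI(
        base_url=(f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
                  f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"),
        api_key=creds.token,
    )

    resp = client.chat.completions.create(
        model="google/gemini-1.5-flash-001",  # assumed naming for the preview API
        messages=[{"role": "user", "content": "Hello from the OpenAI client"}],
    )
    print(resp.choices[0].message.content)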

Billing for the Gemini models on Vertex AI (the Generative Language API variant still charges by tokens) is, I would argue, simpler than every other provider's, simply because you're charged per character/image/video-second/audio-second. You don't need to run a tokenizer (if one is even available, cough Claude 3 and Gemini), figure out what the chat template is to calculate the token cost per message [2], or figure out how tokens are counted for an image [3] just to get a cost estimate before actually submitting the request and getting usage info back.

[1]: https://cloud.google.com/vertex-ai/generative-ai/docs/multim...

[2]: https://platform.openai.com/docs/guides/text-generation/mana...

[3]: https://platform.openai.com/docs/guides/vision/calculating-c...
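To make the character-based billing point concrete, a toy estimator with made-up prices (the real per-character rates are on the Vertex AI pricing page):

    # Toy cost estimate for character-based billing; the prices below are made
    # up for illustration, not actual Vertex AI rates.
    INPUT_PRICE_PER_1K_CHARS = 0.000125
    OUTPUT_PRICE_PER_1K_CHARS = 0.000375

    def estimate_cost(prompt: str, expected_output_chars: int) -> float:
        # No tokenizer or chat template needed: just count characters.
        return (len(prompt) / 1000 * INPUT_PRICE_PER_1K_CHARS
                + expected_output_chars / 1000 * OUTPUT_PRICE_PER_1K_CHARS)

    print(f"${estimate_cost('Summarize this paragraph ...', 2000):.6f}")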


Good to know about this API preview. Hopefully the billing problem and UI maze of Vertex AI can be sorted too?


Google does plenty of ux studies on gcp. I took part in at least 3 of them.

I'm also not sure I understand your problem with pricing. Depending on what you do with it, Vertex AI is not just an LLM service; it actually predates LLMs.

Image classification and the other features are priced as completely different products from the LLM.


They should do a whole lot more of them, then! Ideally the studies would have real impact. GCP is a busy mess. If they want to compete well, they need to do much better with UX design, especially for onboarding. Compare how easy it is to set up a Mistral account against what it takes on GCP to call a generative LLM from a Python script. GCP is a maze. Did you make an account just to reply to this? I'm curious what you do with GCP? Are you a heavy user?


I create new accounts because I use hn too much.

I use GCP professionally every day and have always found it quite intuitive.

Did plenty of image classification with vertex ai too


Why would you make new accounts because you use HN too much? That doesn't make sense to me. Anyhow, if you use GCP every day, you're going to have learned its weird, clunky behaviour. GCP's main problem is that it has steadily become a sprawling mess of complexity, which is in big contrast to quite a few LLM-specific cloud services that are happy to take people's money without the extra complexity.


Not being logged in feels like a bigger hurdle to comment and check if someone responded to it.

It's a shitty solution to a stupid problem ;)

But I did mention that vertex AI is more than just hosting llms though


If you're an individual developer and not an enterprise, just go straight to Google AI Studio / the Gemini API instead: https://aistudio.google.com/app/apikey. It's dead simple to get an API key and call it with a REST client.
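Something like this, assuming a GEMINI_API_KEY environment variable (the generateContent REST shape is the documented one; the model name is just an example):

    # Minimal sketch of calling the Gemini API with an AI Studio key over plain REST.
    import os
    import requests

    API_KEY = os.environ["GEMINI_API_KEY"]
    MODEL = "gemini-1.5-flash"  # example model name

    url = (f"https://generativelanguage.googleapis.com/v1beta/"
           f"models/{MODEL}:generateContent?key={API_KEY}")
    body = {"contents": [{"parts": [{"text": "Write a haiku about API keys."}]}]}

    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])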


Interesting, but when I tried it, I couldn't figure out the billing model because it's all connected to Google projects, and each of them can have its own billing setup.

Each step seems to involve a bunch of clicks to set up that startup LLM providers don't hassle people with. They're more likely to just let you sign in with some generic third-party OAuth, slap on Stripe billing, let you generate keys, show you some usage stats, and give you getting-started docs with example queries, a prompt playground, etc.

What about the Vertex models though? Are they all actually available via Google AI Studio?


Sadly, while gemma-2-27b-it is available (as a Preview model) on the AI Studio playground, it didn't show up via API on list_models() for me.
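For reference, this is roughly the check I mean, via the google-generativeai package (the API key placeholder is yours to fill in):

    # Listing models visible to an AI Studio key and filtering for Gemma.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_AI_STUDIO_KEY")
    for m in genai.list_models():
        if "gemma" in m.name.lower():
            print(m.name, m.supported_generation_methods)
    # Nothing gemma-2 showed up here for me at the time of writing.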


I have to agree with all of this. I tried switching to Gemini, but the lack of clear billing/quotas, horrible documentation, and even poor implementation of status codes on failed requests have led me to stick with OpenAI.

I don't know who writes Google's documentation or does the copyediting for their console, but it is hard to adapt to. I have spent hours troubleshooting, only to find out the documentation was referring to the same thing by two different names. Also, it's 2024; I shouldn't be seeing print statements without parentheses.


We are working hard to improve this across ai.google.dev (Gemini API). Hang tight!


I plan on downloading a Q5 or Q6 quant of the 27B for my 3090 once someone puts quants on HF, loading it in LM Studio, and starting the API server so I can call it from my scripts via the OpenAI API. Hopefully it's better at code gen than Llama 3 8B.
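The calling side should be trivial once LM Studio's local server is running; a sketch along these lines (port 1234 is LM Studio's default, and the model name is a placeholder for whatever the server reports):

    # Sketch: calling a local LM Studio server through the OpenAI client.
    # LM Studio's local server defaults to http://localhost:1234/v1 and does
    # not check the API key.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="gemma-2-27b-it-Q5_K_M",  # placeholder; use the name LM Studio shows
        messages=[{"role": "user",
                   "content": "Write a Python function that merges two sorted lists."}],
        temperature=0.2,
    )
    print(resp.choices[0].message.content)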


Happy to pass on any feedback to our Google Cloud friends. :)


I also hate the billing. It feels like configuring AWS more than calling APIs.


Thank you!


I also work at Google and on Gemma (so same disclaimers)

You can try the 27B at https://aistudio.google.com. Send in your favorite prompts, and we hope you like the responses.


Why is AI Studio not available in Ukraine? I have no problem using the Gemini web UI or other LLM providers from Ukraine, but this Google API constraint is strange.


Will gemma2 be available through gemma.cpp? https://github.com/google/gemma.cpp


This is in the works in the dev branch (thanks pchx :)

https://github.com/google/gemma.cpp/pull/274


:) Confirmed working. We've just pushed the dev branch to main.


Awesome, I love this .cpp trend! Thanks for your work!!


The 4k sliding window context seems like a controversial choice after Mistral 7B mostly failed at showing any benefits from it. What was the rationale behind that instead of just going for full 8k or 16k?


This is mostly about inference speed, while maintaining long context performance.
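For anyone unfamiliar with the mechanism: with a sliding window, each token only attends to the previous W tokens instead of the whole prefix, which bounds per-layer attention cost. A toy mask illustration (not the actual Gemma implementation):

    # Toy causal sliding-window attention mask: token i may attend to tokens j
    # with i - window < j <= i, so per-token cost is O(window), not O(seq_len).
    import numpy as np

    def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)

    print(sliding_window_mask(6, 3).astype(int))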


Thanks for your work on this; excited to try it out!

The Google API models support 1M+ tokens, but these are just 8K. Is there a fundamental architecture difference, training set, something else?


No question. Thanks for thinking of 27B.


Given the goal of mitigating self-proliferation risks, have you observed a decrease in the model's ability to do things like help a user set up a local LLM with local or cloud software?

How much is pre-training dataset changes, how much is tuning?

How do you think about this problem, how do you solve it?

Seems tricky to me.


To quote Ludovic Peran, our amazing safety lead:

Literature has identified self-proliferation as a dangerous capability of models, and details about how to define it and examples of the forms it can take have been openly discussed by GDM (https://arxiv.org/pdf/2403.13793).

Current Gemma 2 models' success rate on end-to-end challenges is zero (0 out of 10), so the capabilities to perform such tasks are currently limited.


That's an interesting paper. `Install Mistral 7B on a GCP instance and use it to answer a simple question`. Some hosting providers and inference software might be easier to set up, for now. ;) But do you have to make it less capable by being careful about what it's trained on? E.g.: banning certain topics (like how to use Llamafile/llama.cpp, knowing which hosting providers have free trials, learning about ways to jailbreak web apps, free inference providers, etc.)?

Or does the model have to later be finetuned, to not be good at certain tasks?

Or are we not at that stage yet?

Is something like tree-of-thought used, to get the best of the models for these tasks?


Turns out LLM alignment is super easy, barely an inconvenience.


Alignment is tight!


One should not confuse alignment and current incapability.


Wow wow wow.... wow.


The paper suggests on the one hand that Gemma is on the same Pareto curve as Llama 3, while on the other hand it seems to suggest Gemma has exceeded Llama 3's efficiency.

Is this a contradiction or am I misunderstanding something?

Btw overall very impressive work great job.


I think it makes sense to compare models trained with the same recipe on token count - usually more tokens will give you a better model.

However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. There are many other variables at play - the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc. that will have an influence on training efficiency.


Any gemma-2-9b or 27b 4 bit GGUF's on HuggingFace yet? Thanks!


Actually for the 9B model, this has 4-bit quantised weights (and others): https://huggingface.co/bartowski/gemma-2-9b-it-GGUF

Still no 27B 4-bit GGUF quants on HF yet!

I'm monitoring this search: https://huggingface.co/models?library=gguf&sort=trending&sea...
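In the meantime the 9B quant above runs fine locally; a rough sketch with llama-cpp-python (the exact GGUF filename is an assumption, check the repo's file list):

    # Pull a GGUF from the repo above and run it with llama-cpp-python.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    path = hf_hub_download(
        repo_id="bartowski/gemma-2-9b-it-GGUF",
        filename="gemma-2-9b-it-Q4_K_M.gguf",  # assumed name, verify in the repo
    )

    llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])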



I'm curious about the quantization-quality claims in the table there. Is this a Gemma 2 specific thing (more subtlety in the weights somehow)? In my testing, and testing I've seen elsewhere, at least for Llama 3 8B (and some less rigorous testing with other models), q8_0 -> q4_K_M are basically indistinguishable from one another.


Yes, PPL and certain benchmarks do not detect differences from quantization. But recent work gives cause for concern, e.g., https://arxiv.org/pdf/2310.01382, https://arxiv.org/pdf/2405.18137.


The first paper is a good critique of the performance of quantised models: it points out that 40-50% 'compression' typically results in only slight loss for RAG tasks relying on in-context learning, but for factual tasks relying on stored knowledge, performance drops off very quickly. They looked at Vicuna, one of the earlier models, so I wonder how applicable it is to recent models like the Phi 3 range. I don't think deliberate, clever adversarial attacks like those of the 2nd paper are a sensible worry for most, but it is fun. Thanks for the links @janwas.


It's on HuggingFace already: https://huggingface.co/google/gemma-2-9b


I know the safetensors are there, but I said GGUF 4-bit quantised, which is kinda the standard for useful local applications, a typical balanced sweet spot of performance and quality. It makes it much easier to use and works in more places, be it personal devices or a server, etc.


If you are still looking for it, I just made it available on an app [1] that I am working on, with Gemma 2 support.

https://msty.app


Are you saying you put a 4-bit GGUF on HuggingFace?


How is Gemma-2 licensed?


The terms of use remain the same as Gemma 1 - https://ai.google.dev/gemma/terms.


Are Gemma-2 models available via API yet? Looks to me like they're not yet on Vertex AI.



Do you run Gemma 2 on your Google phone?



