I built a tmux clone in Rust:

https://github.com/TrevorS/rmux


Released uncensored versions of all four Gemma 4 models. bf16 + GGUF for each.

Collection: https://huggingface.co/collections/TrevorJS/gemma-4-uncensor...

Code: https://github.com/TrevorS/gemma-4-abliteration

Results

Refusal rates are measured on 686 prompts across 4 datasets (JailbreakBench, tulu-harmbench, NousResearch, mlabonne). I manually audited the results: most flagged refusals are actually the model complying with a disclaimer attached.

  E2B (2.3B): 98% → 0.4%, KL Div 0.346
  E4B (4.5B): 99% → 0.7%, KL Div 0.068
  26B MoE:    98% → 0.7%, KL Div 0.090
  31B:       100% → 3.2%, KL Div 0.124
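
To illustrate why marker-based refusal counts need a manual audit, here is a minimal sketch of a substring detector. The marker list and sample responses are hypothetical and are not taken from the repo; the point is only that a response which complies but opens with a disclaimer still gets flagged.

```python
# Hypothetical marker list; the markers actually used in the repo are not given here.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_flagged(response: str) -> bool:
    """Return True if any refusal marker appears in the response (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged by the marker check."""
    if not responses:
        return 0.0
    return sum(is_flagged(r) for r in responses) / len(responses)


# Made-up sample responses for illustration only.
responses = [
    "I cannot help with that request.",                             # genuine refusal
    "I'm sorry if this is sensitive, but here are the steps: ...",  # complies; false positive
    "Here are the steps: ...",                                      # complies
    "Sure. First, ...",                                             # complies
]

print(f"flagged: {refusal_rate(responses):.0%}")
```

In this toy set only one response is a real refusal, yet two are flagged, which is the kind of gap a manual audit catches.
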

26B MoE

Standard abliteration only touches the dense layers, which takes the MoE from 98% → 29%. The remaining refusals live in the expert weights. I used Expert-Granular Abliteration (EGA, a concept from OBLITERATUS [1]) with norm-preserving biprojection [2] on each of the 128 expert slices per layer. That gets it to 3%.

[1] https://github.com/elder-plinius/OBLITERATUS

[2] https://huggingface.co/blog/grimjim/abliteration-biprojectio...
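
As a rough illustration of per-expert ablation, the sketch below removes a single direction from each expert's weight matrix and then rescales to restore the original norms. It is a simplified stand-in, not the biprojection method from [2] and not the code in the repo. The function names, the choice to preserve column norms, and the tiny 2x2 "experts" are all assumptions made for the example.

```python
import math


def ablate_preserving_norm(W, r):
    """Project direction r out of every column of W, then rescale each column
    back to its original L2 norm. Rows of W index output dimensions.
    (Assumption: column norms are what gets preserved in this sketch.)"""
    r_norm = math.sqrt(sum(x * x for x in r))
    r = [x / r_norm for x in r]  # unit-length direction
    rows, cols = len(W), len(W[0])
    out = [row[:] for row in W]
    for j in range(cols):
        col = [W[i][j] for i in range(rows)]
        dot = sum(c * ri for c, ri in zip(col, r))
        projected = [c - dot * ri for c, ri in zip(col, r)]  # component along r removed
        old_norm = math.sqrt(sum(c * c for c in col))
        new_norm = math.sqrt(sum(c * c for c in projected))
        # A column lying entirely along r collapses to zero and cannot be rescaled.
        scale = old_norm / new_norm if new_norm > 0 else 0.0
        for i in range(rows):
            out[i][j] = projected[i] * scale
    return out


def ablate_experts(experts, r):
    """Apply the same ablation independently to every expert slice."""
    return [ablate_preserving_norm(W, r) for W in experts]


# Two toy "experts" and a made-up direction, for illustration only.
experts = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [2.0, 1.0]],
]
r = [1.0, 0.0]
ablated = ablate_experts(experts, r)
print(ablated[0])
```

Rescaling whole columns keeps each output vector orthogonal to the removed direction, which is why this sketch rescales columns rather than rows.
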

How it was built

I set up an automated research loop: an AI agent reads the current results and idea backlog, picks the next experiment, runs it on the GPU, records the results, and repeats. It ran 22 experiments across the 4 models, discovered the false-positive problem in standard refusal markers, built the cross-dataset evaluation, and implemented the MoE expert abliteration when dense-only wasn't enough.

Full experiment history and code in the repo.

Downloads

Each model has bf16 safetensors + GGUF (Q4_K_M, Q8_0):

  E2B bf16: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored
  E2B GGUF: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored-GGUF
  E4B bf16: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored
  E4B GGUF: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored-GGUF
  26B bf16: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored
  26B GGUF: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF
  31B bf16: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored
  31B GGUF: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored-GGUF

Quick start:

  llama-server -hf TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF -c 8192


What about the sampling parameters? You can't just run llama-server with no CLI arguments (other than a uselessly-small context size) and expect useful results.


True :)

After some performance improvements, it runs in real time on my DGX Spark with an RTF of 0.416, now getting ~19.5 tokens per second. Check it out and see if it's better for you.
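
For anyone unfamiliar with RTF (real-time factor), here is a small sketch using the usual definition, processing time divided by audio duration, where anything below 1.0 is faster than real time. The 10-second clip and 4.16-second processing time are illustrative numbers chosen to match the ratio above, not measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio transcribed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only: a 10 s clip transcribed in 4.16 s.
rtf = real_time_factor(4.16, 10.0)
print(f"RTF: {rtf:.3f}")
print("faster than real time:", rtf < 1.0)
```
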


I'm curious whether you're able to run the model from the CLI now.


The cubecl-wgpu changes were only needed to reduce the number of kernel workgroups; otherwise I was getting errors in WASM.


This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)


Cool! Thanks for the response, I'll give it a shot again sometime


Please try again. The model weights are unchanged, but the inference code is improved.


This should be fixed.


Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM, and also improve stability.

Here are the latest benchmarks running on DGX Spark:

https://github.com/TrevorS/voxtral-mini-realtime-rs#benchmar...


Hello, I pushed up and merged a PR that greatly improves performance on CUDA, Metal, and in WASM.

On capable hardware, the model now runs in real time (it transcribes audio faster than the audio's duration).

