More

adefa · 2026-06-08T21:51:36 1780955496

I built a tmux clone in Rust:

adefa · 2026-04-05T16:46:00 1775407560

Released uncensored versions of all four Gemma 4 models. bf16 + GGUF for each.

Collection: https://huggingface.co/collections/TrevorJS/gemma-4-uncensor...

Code: https://github.com/TrevorS/gemma-4-abliteration

Results

Refusal rates from 686 prompts across 4 datasets (JailbreakBench, tulu-harmbench, NousResearch, mlabonne). Manually audited — most flagged refusals are actually the model complying with a disclaimer attached.

  E2B (2.3B): 98% → 0.4%, KL Div 0.346
  E4B (4.5B): 99% → 0.7%, KL Div 0.068
  26B MoE:    98% → 0.7%, KL Div 0.090
  31B:       100% → 3.2%, KL Div 0.124

26B MoE

Standard abliteration only touches dense layers, which gets you from 98% -> 29% on the MoE. The remaining refusals are in the expert weights. Used Expert-Granular Abliteration (EGA, concept from OBLITERATUS [1]) with norm-preserving biprojection [2] on each of the 128 expert slices per layer. That gets it to 3%.

[1] https://github.com/elder-plinius/OBLITERATUS

[2] https://huggingface.co/blog/grimjim/abliteration-biprojectio...

How it was built

Set up an automated research loop -- an AI agent reads the current results and idea backlog, picks the next experiment, runs it on the GPU, records results, and repeats. It ran 22 experiments across the 4 models, discovered the false-positive problem in standard refusal markers, built the cross-dataset evaluation, and implemented the MoE expert abliteration when dense-only wasn't enough.

Full experiment history and code in the repo.

Downloads

Each model has bf16 safetensors + GGUF (Q4_K_M, Q8_0):

  E2B bf16: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored
  E2B GGUF: https://huggingface.co/TrevorJS/gemma-4-E2B-it-uncensored-GGUF
  E4B bf16: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored
  E4B GGUF: https://huggingface.co/TrevorJS/gemma-4-E4B-it-uncensored-GGUF
  26B bf16: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored
  26B GGUF: https://huggingface.co/TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF
  31B bf16: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored
  31B GGUF: https://huggingface.co/TrevorJS/gemma-4-31B-it-uncensored-GGUF

Quick start:

  llama-server -hf TrevorJS/gemma-4-26B-A4B-it-uncensored-GGUF -c 8192

CamperBob2 · 2026-04-05T18:23:51 1775413431

What about the sampling parameters? You can't just run llama-server with no CLI arguments (other than a uselessly-small context size) and expect useful results.

adefa · 2026-02-12T01:15:03 1770858903

True :)

After some performance improvements, it is realtime on my DGX Spark with an RTF of .416 -- now getting ~19.5 tokens per second. Check it out, see if it's better for you.

adefa · 2026-02-12T01:12:52 1770858772

I'm curious to see if you are able to run the model now from the CLI?

adefa · 2026-02-12T01:11:31 1770858691

The cubecl-wgpu were only needed to reduce the number of kernel workgroups, otherwise I was getting errors in WASM.

adefa · 2026-02-12T01:08:59 1770858539

This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)

mikebelanger · 2026-02-20T23:37:52 1771630672

Cool! Thanks for the response, I'll give it a shot again sometime

adefa · 2026-02-12T01:08:12 1770858492

Please try again. The model weights are unchanged, but the inference code is improved.

adefa · 2026-02-12T01:07:38 1770858458

this should be fixed

adefa · 2026-02-12T01:06:40 1770858400

Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM as well as improve stability.

Here are the latest benchmarks running on DGX Spark:

https://github.com/TrevorS/voxtral-mini-realtime-rs#benchmar...

adefa · 2026-02-12T01:05:18 1770858318

Hello, I pushed up and merged a PR that greatly improves performance on CUDA, Metal, and in WASM.

Depending on your hardware, the model is definitely real time (able to transcribe audio faster than the length of the audio).