# Carnice Qwen3.6 MoE 35B-A3B — Hermes-Focused Agentic Model
QLoRA fine-tune of Qwen3.6-35B-A3B (MoE, ~3B active parameters), optimized for agentic workflows and the Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.
This is the successor to Carnice-MoE-35B-A3B (based on Qwen3.5), retrained on the newer Qwen3.6 base which brings improved agentic coding, extended context (262K native, up to 1M with RoPE scaling), and native multimodal support.
## Credits
Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.
## Available Formats
| Format | Size | Location | Use Case |
|---|---|---|---|
| BF16 SafeTensors | 67 GB | Root | Full precision, Transformers / vLLM |
| FP8 Dynamic | 34 GB | fp8/ | vLLM optimized, ~2x faster inference |
| GGUF | 19-65 GB | GGUF repo | llama.cpp, Ollama, LM Studio |
## FP8 Usage (vLLM)

```shell
# Serve with FP8 quantization applied at load time; alternatively, clone the
# repo and point vLLM at the fp8/ subfolder to use the pre-quantized weights
vllm serve samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B --quantization fp8 --dtype auto
```
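Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client sketch using only the standard library (the URL and port are vLLM's defaults; the payload shape is the standard OpenAI chat format):

```python
import json
import urllib.request

# vLLM exposes an OpenAI-compatible API; these are its default host and route
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B"

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Standard OpenAI-style chat payload for the served model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST to the running vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarize the repo layout.")  # requires the server above to be running
```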
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Native Context Length | 262,144 tokens |
| Thinking Modes | Thinking / Non-thinking (native Qwen3.6) |
## What Makes This Different
Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:
- Executes terminal commands and processes output
- Performs file editing operations
- Chains multi-step tool calls with results feeding back
- Uses browser-assisted workflows
- Makes decisions based on environmental feedback
This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.
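As an illustration, one such multi-step interaction rendered as chat messages might look like the following. This is a hypothetical sketch of the pattern (tool call, tool result fed back, grounded answer), not the actual Hermes Agent trace format:

```python
# Hypothetical sketch of an agentic execution trace as chat messages —
# illustrates the pattern (tool call -> tool result -> follow-up),
# not the actual Hermes Agent wire format.
trace = [
    {"role": "user", "content": "How much disk space is free on this machine?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {"name": "terminal", "arguments": {"command": "df -h /"}}
        ],
    },
    # The environment feeds the command output back as a tool message...
    {"role": "tool", "content": "/dev/sda1  512G  210G  302G  41%"},
    # ...and the model grounds its final answer in that observation.
    {"role": "assistant", "content": "About 302 GB free (41% of 512 GB used)."},
]

roles = [m["role"] for m in trace]
```

Training on many traces of this shape teaches the turn-taking and feedback pattern directly, rather than leaving the model to infer it from generic instruction data.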
## Training Details

### Two-Stage Approach

#### Stage A — Reasoning Repair (1 epoch)
- Strengthens base model reasoning before agent-specific training
- Loss: 0.4281
| Dataset | Examples |
|---|---|
| bespokelabs/Bespoke-Stratos-17k | 16,710 |
| AI-MO/NuminaMath-CoT | 17,000 (capped) |
#### Stage B — Hermes Traces (2 epochs)
- Agent-specific behavioral training on real execution traces
- Loss: 0.3045
| Dataset | Examples |
|---|---|
| kai-os/carnice-glm5-hermes-traces | 1,627 (high quality) |
| open-thoughts/OpenThoughts-Agent-v1-SFT | 15,209 |
### Training Configuration
| Parameter | Stage A | Stage B |
|---|---|---|
| LoRA Rank | 64 | 64 |
| LoRA Alpha | 64 | 64 |
| LoRA Targets | q, k, v, o projections | q, k, v, o projections |
| Learning Rate | 2e-5 (linear) | 1e-5 (cosine) |
| Epochs | 1 | 2 |
| Effective Batch | 12 | 12 |
| Context Length | 4096 | 4096 |
| Precision | 4-bit QLoRA + BF16 adapters | Same |
| GPU | RTX PRO 6000 Blackwell (98GB) | Same |
| Total Training Time | ~55 hours (both stages) | |
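The table above maps onto a LoRA/QLoRA setup roughly like the following sketch, written as plain dicts rather than any specific trainer's config class; field names are illustrative, not tied to the actual training framework:

```python
# Sketch of the two stages' hyperparameters as plain config dicts.
# Field names are illustrative, not a specific framework's API.
LORA = {
    "r": 64,                 # LoRA rank
    "alpha": 64,             # LoRA alpha (scaling = alpha / r = 1.0)
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "load_in_4bit": True,    # 4-bit QLoRA base, BF16 adapters
}

STAGE_A = {**LORA, "learning_rate": 2e-5, "lr_scheduler": "linear",
           "epochs": 1, "effective_batch_size": 12, "max_seq_length": 4096}
STAGE_B = {**LORA, "learning_rate": 1e-5, "lr_scheduler": "cosine",
           "epochs": 2, "effective_batch_size": 12, "max_seq_length": 4096}
```

Note that alpha = rank keeps the adapter scaling factor at 1.0, a common choice for stable QLoRA runs.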
### Trainable Parameters

13,762,560 (~0.04% of 35.1B total)
## Usage

### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B")

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
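Qwen3-family chat templates let you toggle the thinking mode, either per call via `enable_thinking` in `apply_chat_template` or per turn via the `/no_think` soft switch in the prompt. Assuming Qwen3.6 keeps the Qwen3 convention, a sketch:

```python
# Toggling thinking mode — assumes Qwen3.6 keeps the Qwen3 chat-template
# convention (enable_thinking kwarg and the /no_think soft switch).

def build_messages(prompt: str, think: bool = True) -> list[dict]:
    """Append Qwen's soft switch to disable thinking for a single turn."""
    content = prompt if think else f"{prompt} /no_think"
    return [{"role": "user", "content": content}]

# With a loaded tokenizer, the template-level switch would look like:
# text = tokenizer.apply_chat_template(
#     build_messages("Plan the deployment steps."),
#     tokenize=False,
#     add_generation_prompt=True,
#     enable_thinking=False,  # skip the <think>...</think> block
# )
```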
### vLLM

```shell
vllm serve samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B --dtype auto --max-model-len 262144
```
### llama.cpp
For llama.cpp usage, see the GGUF repo.
## Acknowledgements
- kai-os — Carnice training methodology and Hermes traces dataset
- open-thoughts — Agent SFT dataset
- bespokelabs — Bespoke-Stratos reasoning dataset
- Unsloth — QLoRA training framework
- Qwen — Base model