CarbonAlpha Model Card

Model Summary

CarbonAlpha is a climate-aware portfolio reasoning agent for the portfolio_env OpenEnv environment. It reads one macro-news event, reasons through first-order and second-order effects, and emits a constrained PortfolioAction:

{
  "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
  "infra_commit": 0.0,
  "carbon_offset_buy": 0.0,
  "put_hedge": 0.0,
  "tech_bet": "status_quo"
}
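
As a hedged illustration of this contract, the sketch below checks a decoded action: it assumes the five weights are non-negative and sum to one and that the three intervention fields are fractions in [0, 1]; the helper name validate_action is ours, not part of portfolio_env, and the exact bounds may differ from what the environment enforces.

import json
import math

def validate_action(raw: str) -> dict:
    # Parse the model's JSON output and apply the assumed constraints:
    # five non-negative weights summing to ~1, interventions in [0, 1].
    action = json.loads(raw)
    weights = action["weights"]
    assert len(weights) == 5, "expected five asset weights"
    assert all(w >= 0.0 for w in weights), "weights must be non-negative"
    assert math.isclose(sum(weights), 1.0, abs_tol=1e-3), "weights should sum to 1"
    for key in ("infra_commit", "carbon_offset_buy", "put_hedge"):
        assert 0.0 <= action[key] <= 1.0, f"{key} out of assumed bounds"
    assert isinstance(action["tech_bet"], str), "tech_bet should be a categorical string"
    return action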

Current best research model:

77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1

Base model:

unsloth/Qwen2.5-7B-Instruct

Adapter lineage:

  1. SFT warm-start on 400 curriculum traces.
  2. GRPO Phase 1 for 100 steps.
  3. Holdout and manual macro-eval checks before promotion.

The live Space can load this adapter through the MODEL_SUBFOLDER environment variable. The Space runs at:

https://77ethers-carbonalpha-demo.hf.space/
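
A minimal loading sketch, assuming the adapter lives in the 77ethers/CarbonAlpha repo under the subfolder named above and that MODEL_SUBFOLDER selects it; the peft/transformers calls are standard, but the exact loading code in the Space may differ.

import os

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: load the private GRPO adapter on top of the Qwen2.5 base model,
# selecting the adapter directory via MODEL_SUBFOLDER as the Space does.
token = os.environ["HF_API_TOKEN"]
subfolder = os.environ.get("MODEL_SUBFOLDER", "grpo_qwen25_7b_adapter_phase1_100_v1")

base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-7B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "77ethers/CarbonAlpha", subfolder=subfolder, token=token)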

Intended Use

This model is intended for the CarbonAlpha walkthrough demo and OpenEnv evaluation. It is not a financial advisor and should not be used to make real investment decisions.

The useful behaviors to evaluate are:

  • strict <think>...</think> plus JSON formatting (a minimal check is sketched after this list);
  • valid portfolio weights and bounded interventions;
  • recognition of macro regime shifts;
  • carbon-budget awareness;
  • performance against the environment's equal-weight baseline.
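
A rough sketch of that first check, assuming a completion must contain one closed <think>...</think> block followed by a single JSON object; the regex and parsing here are ours and may differ from the checks in the training and eval scripts.

import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_completion(text: str):
    # Return (reasoning, action) when the completion follows the expected
    # <think> + JSON contract; raise ValueError otherwise.
    match = THINK_RE.search(text)
    if match is None:
        raise ValueError("missing or unclosed <think> block")
    tail = text[match.end():]
    start = tail.find("{")
    if start == -1:
        raise ValueError("no JSON action after </think>")
    action, _ = json.JSONDecoder().raw_decode(tail[start:])
    return match.group(1).strip(), action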

Training Data

The Qwen2.5 SFT warm-start used:

sft_traces/curriculum_400_e80_m160_h160.jsonl

Trace mix:

  • 80 easy traces;
  • 160 medium / ambiguous traces;
  • 160 hard traces.

The trace schema follows sft_traces/merged_v6_aligned.jsonl, with the same prompt and completion contract used during inference.

Training Pipeline

SFT

SFT artifact:

77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1

Training script:

scripts/hf_sft_qwen25_7b.py

Configuration (a hedged sketch follows the list):

  • QLoRA over unsloth/Qwen2.5-7B-Instruct;
  • LoRA rank 16;
  • lora_alpha=16;
  • 220 SFT steps;
  • effective batch size 4;
  • Hugging Face Jobs L40S.
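
A hedged mapping of those settings onto peft/TRL objects; this is a sketch of the hyperparameters, not the contents of scripts/hf_sft_qwen25_7b.py, and the quantization details, target modules, and batch split are assumptions.

from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# QLoRA: 4-bit base weights with a rank-16 LoRA adapter on top (quant settings assumed).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

sft_config = SFTConfig(
    max_steps=220,
    per_device_train_batch_size=1,    # assumed split of the
    gradient_accumulation_steps=4,    # effective batch size of 4
    output_dir="sft_qwen25_7b_curriculum400_v1",
)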

SFT result:

  • generation sanity: 5/5 valid actions;
  • holdout: 5/5 valid;
  • mean holdout regret: +0.02796;
  • beats baseline on 3/5 holdout seeds.

GRPO

Best GRPO artifact:

77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1

Training script:

scripts/hf_grpo_qwen25_adapter.py

GRPO configuration (a TRL sketch follows the list):

  • warm-start from sft_qwen25_7b_curriculum400_v1;
  • use_vllm=False;
  • 100 GRPO steps;
  • 128 generated Phase-1 prompts;
  • 2 generations per prompt;
  • batch size 2;
  • learning rate 2e-6;
  • loss_type="dapo";
  • KL beta 0.02.
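
For readers unfamiliar with TRL's GRPO knobs, those settings correspond roughly to the configuration below; it is a sketch, not the exact contents of scripts/hf_grpo_qwen25_adapter.py.

from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="grpo_qwen25_7b_adapter_phase1_100_v1",
    use_vllm=False,                   # plain-Transformers rollouts (see the note below)
    max_steps=100,
    num_generations=2,                # 2 generations per prompt
    per_device_train_batch_size=2,
    learning_rate=2e-6,
    loss_type="dapo",
    beta=0.02,                        # KL coefficient
)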

Reward functions (the format reward is sketched after this list):

  • format reward;
  • action-contract reward;
  • reasoning-shape reward;
  • Phase-1 simulator regret reward;
  • carbon-guard reward.
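
TRL's GRPOTrainer accepts reward callables that take the prompts and completions and return one score per completion; a minimal sketch of the format reward in that shape is shown below, with the 0/1 scoring chosen for illustration rather than taken from the training script.

import re

THINK_AND_JSON_RE = re.compile(r"<think>.*?</think>\s*\{.*\}", re.DOTALL)

def format_reward(prompts, completions, **kwargs):
    # 1.0 for completions that close <think> and then emit a JSON object,
    # 0.0 otherwise; the real reward shaping may be more graded than this.
    return [1.0 if THINK_AND_JSON_RE.search(c) else 0.0 for c in completions]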

Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The plain-Transformers path was slower but healthier and easier to debug.

Evidence of Training

The 100-step GRPO run was launched as a Hugging Face Job:

https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3

Raw evidence committed in this repo:

training_logs/qwen25_grpo_phase1_100_v1.log
training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl

The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
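
Those curves can be regenerated from the parsed rows with a few lines of pandas/matplotlib; the column names used here (step, loss, reward) are assumptions about the JSONL schema.

import json

import matplotlib.pyplot as plt
import pandas as pd

# Sketch: rebuild the loss and reward curves from the parsed GRPO metric rows.
with open("training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl") as f:
    df = pd.DataFrame(json.loads(line) for line in f)

fig, (ax_loss, ax_reward) = plt.subplots(1, 2, figsize=(10, 4))
df.plot(x="step", y="loss", ax=ax_loss, title="Qwen2.5 GRPO loss")       # column names assumed
df.plot(x="step", y="reward", ax=ax_reward, title="Qwen2.5 GRPO reward")
fig.tight_layout()
fig.savefig("qwen25_grpo_phase1_curves.png")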

Loss and reward plots generated from those rows:

  • Qwen2.5 GRPO loss curve;
  • Qwen2.5 GRPO reward curve.

Additional rollout-health plot:

  • Qwen2.5 GRPO completion length health.

The completion-length plot is included because one-token rollout collapse was the main failure mode in earlier GRPO attempts. In this successful run, completion lengths stayed well above the smoke threshold throughout training.

Evaluation

Holdout

Holdout seeds:

100, 200, 300, 400, 500

Best GRPO holdout results:

Metric                             Value
Valid completions                  5/5
Mean holdout regret                +0.1058
Beats baseline                     5/5
Previous v6 SFT mean regret bar    +0.034

Per-seed holdout:

Seed   Shock                        Regret
100    hard_rare_earth_rotation     +0.0755
200    easy_tech_earnings           +0.1210
300    easy_tech_earnings           +0.1442
400    hard_deflation_pulse         +0.1527
500    ambig_ai_efficiency          +0.0358

Manual Macro Eval

Eval set:

evals/macro_eval_10.jsonl

Report:

evals/macro_eval_10_grpo_report.json

Summary:

  • GRPO adapter: 10/10 valid JSON actions;
  • GRPO adapter: 10/10 closed <think>;
  • base model: 9/10 valid JSON actions;
  • GRPO was stronger on rare-earth export controls, global deflation pulse, and yen carry unwind.

Known weaknesses:

  • q02_oil_chokepoint_inflation: the model understood the inflation regime and hedged, but underweighted OIL despite the direct supply shock.
  • q04_ai_efficiency_paradox: the model correctly liked TECH and cut REAL_ESTATE, but gave GREEN too much weight despite lower data-center power demand expectations.

These are targeted follow-up items, not hidden failures.

Comparison With Qwen3 Base Branch

We also tested an isolated Qwen3-4B-Base branch:

77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2

Result:

  • smoke gate passed mechanically;
  • no one-token collapse;
  • completions were too long, often near the 400-token cap;
  • holdout: 4/5 valid;
  • mean holdout regret: -0.0229;
  • did not beat the Qwen2.5 GRPO model.

Conclusion: Qwen3 Base is a viable research branch, but the current production candidate remains Qwen2.5-7B SFT plus GRPO.

Limitations

  • The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator reward optimization.
  • The model still has known second-order reasoning weaknesses in specific macro setups.
  • The reward environment is synthetic and should be interpreted as a benchmark, not a market simulator.
  • The model is private on Hugging Face and requires HF_API_TOKEN for loading.

Reproducibility

Final notebook:

notebooks/carbonalpha_final_pipeline.ipynb

Colab link:

https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb

The notebook verifies artifacts, loads metrics from Hugging Face, runs an environment smoke test, shows the manual eval set, and includes opt-in cells to relaunch the exact HF Jobs training runs.
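
One verification step, confirming the promoted adapter actually exists in the private model repo, can also be reproduced outside the notebook with huggingface_hub; the subfolder prefix checked here matches the adapter name above, but the repo file layout is otherwise assumed.

import os

from huggingface_hub import HfApi

# Sketch: list the private repo and confirm the promoted GRPO adapter subfolder is present.
api = HfApi(token=os.environ["HF_API_TOKEN"])
files = api.list_repo_files("77ethers/CarbonAlpha")
adapter_files = [f for f in files if f.startswith("grpo_qwen25_7b_adapter_phase1_100_v1/")]
assert adapter_files, "promoted adapter subfolder not found in the repo"
print("\n".join(sorted(adapter_files)))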
