# CarbonAlpha Model Card

## Model Summary
CarbonAlpha is a climate-aware portfolio reasoning agent for the
`portfolio_env` OpenEnv environment. It reads one macro-news event, reasons
through first-order and second-order effects, and emits a constrained
`PortfolioAction`:

```json
{
  "weights": [w_tech, w_oil, w_green, w_real_estate, w_bonds],
  "infra_commit": 0.0,
  "carbon_offset_buy": 0.0,
  "put_hedge": 0.0,
  "tech_bet": "status_quo"
}
```
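A minimal sketch of validating an emitted action against this contract. The weights-sum-to-one constraint and the `[0, 1]` bounds on the scalar interventions are assumptions inferred from the schema above, not confirmed against the environment source:

```python
import json

def validate_action(raw: str, tol: float = 1e-6) -> dict:
    """Parse and sanity-check a PortfolioAction emitted by the model.

    Assumed contract: five non-negative weights summing to ~1.0 and
    three scalar interventions bounded in [0, 1].
    """
    action = json.loads(raw)
    weights = action["weights"]
    assert len(weights) == 5, "expected [tech, oil, green, real_estate, bonds]"
    assert all(0.0 <= w <= 1.0 for w in weights), "weights must lie in [0, 1]"
    assert abs(sum(weights) - 1.0) < tol, "weights must sum to 1 (assumed)"
    for key in ("infra_commit", "carbon_offset_buy", "put_hedge"):
        assert 0.0 <= action[key] <= 1.0, f"{key} outside assumed [0, 1] bounds"
    assert isinstance(action["tech_bet"], str)
    return action
```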
Current best research model:
`77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1`

Base model:
`unsloth/Qwen2.5-7B-Instruct`

Adapter lineage:

- SFT warm-start on 400 curriculum traces.
- GRPO Phase 1 for 100 steps.
- Holdout and manual macro-eval checks before promotion.

The live Space can load this adapter through the `MODEL_SUBFOLDER`
environment variable:
https://77ethers-carbonalpha-demo.hf.space/
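A minimal loading sketch, assuming the adapter lives in a subfolder of the private `77ethers/CarbonAlpha` repo and that PEFT's `subfolder`/`token` download arguments resolve it (the exact repo layout is not verified here):

```python
import os

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "unsloth/Qwen2.5-7B-Instruct"
REPO = "77ethers/CarbonAlpha"
# Same knob the Space uses to pick an adapter inside the repo.
SUBFOLDER = os.environ.get("MODEL_SUBFOLDER", "grpo_qwen25_7b_adapter_phase1_100_v1")
TOKEN = os.environ["HF_API_TOKEN"]  # the repo is private

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, REPO, subfolder=SUBFOLDER, token=TOKEN)
```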
## Intended Use
This model is intended for the CarbonAlpha walkthrough demo and OpenEnv evaluation. It is not a financial advisor and should not be used to make real investment decisions.
The useful behaviors to evaluate are:

- strict `<think>...</think>` plus JSON formatting (a format-check sketch follows this list);
- valid portfolio weights and bounded interventions;
- recognition of macro regime shifts;
- carbon-budget awareness;
- performance against the environment's equal-weight baseline.
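A minimal format check, assuming a single closed `<think>` block followed by exactly one JSON object (the actual prompt template is not reproduced in this card):

```python
import json
import re

THINK_THEN_JSON = re.compile(r"^<think>(.*?)</think>\s*(\{.*\})\s*$", re.DOTALL)

def check_format(completion: str) -> bool:
    """True if the completion is <think>...</think> followed by valid JSON."""
    match = THINK_THEN_JSON.match(completion.strip())
    if match is None:
        return False
    try:
        json.loads(match.group(2))
    except json.JSONDecodeError:
        return False
    return True
```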
## Training Data

The Qwen2.5 SFT warm-start used:
`sft_traces/curriculum_400_e80_m160_h160.jsonl`

Trace mix:

- 80 easy traces;
- 160 medium / ambiguous traces;
- 160 hard traces.

The trace schema follows `sft_traces/merged_v6_aligned.jsonl`, with the same
prompt and completion contract used during inference.
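A minimal sketch of inspecting the trace file, assuming each JSONL row carries `prompt` and `completion` fields per the contract above (any other fields are undocumented here):

```python
import json

with open("sft_traces/curriculum_400_e80_m160_h160.jsonl") as f:
    traces = [json.loads(line) for line in f]

assert len(traces) == 400  # 80 easy + 160 medium + 160 hard
example = traces[0]
print(example["prompt"][:200])      # macro-news event plus instructions
print(example["completion"][:200])  # <think>...</think> plus JSON action
```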
## Training Pipeline

### SFT

SFT artifact:
`77ethers/CarbonAlpha/sft_qwen25_7b_curriculum400_v1`

Training script:
`scripts/hf_sft_qwen25_7b.py`

Configuration (sketched below):

- QLoRA over `unsloth/Qwen2.5-7B-Instruct`;
- LoRA rank 16, `lora_alpha=16`;
- 220 SFT steps;
- effective batch size 4;
- trained on a Hugging Face Jobs L40S GPU.
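A minimal configuration sketch in the spirit of `scripts/hf_sft_qwen25_7b.py` (the actual script is not reproduced here; TRL/PEFT argument names follow current library conventions, and anything not in the list above is an assumption):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset(
    "json", data_files="sft_traces/curriculum_400_e80_m160_h160.jsonl", split="train"
)

lora_config = LoraConfig(
    r=16,             # LoRA rank 16
    lora_alpha=16,
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir="sft_qwen25_7b_curriculum400_v1",
    max_steps=220,                  # 220 SFT steps
    per_device_train_batch_size=4,  # effective batch size 4
)

trainer = SFTTrainer(
    model="unsloth/Qwen2.5-7B-Instruct",  # the real script loads it 4-bit for QLoRA
    train_dataset=train_dataset,
    args=sft_config,
    peft_config=lora_config,
)
trainer.train()
```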
SFT result:

- generation sanity: 5/5 valid actions;
- holdout: 5/5 valid;
- mean holdout regret: +0.02796;
- beats baseline on 3/5 holdout seeds.
### GRPO

Best GRPO artifact:
`77ethers/CarbonAlpha/grpo_qwen25_7b_adapter_phase1_100_v1`

Training script:
`scripts/hf_grpo_qwen25_adapter.py`

GRPO configuration (sketched below):

- warm-start from `sft_qwen25_7b_curriculum400_v1`;
- `use_vllm=False`;
- 100 GRPO steps;
- 128 generated Phase-1 prompts;
- 2 generations per prompt;
- batch size 2;
- learning rate `2e-6`;
- `loss_type="dapo"`;
- KL beta `0.02`.
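A minimal `GRPOConfig` sketch mapping the list above onto TRL's argument names; the real `scripts/hf_grpo_qwen25_adapter.py` may differ in details such as scheduler or gradient accumulation:

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="grpo_qwen25_7b_adapter_phase1_100_v1",
    max_steps=100,                  # 100 GRPO steps
    num_generations=2,              # 2 generations per prompt
    per_device_train_batch_size=2,  # batch size 2
    learning_rate=2e-6,
    loss_type="dapo",
    beta=0.02,                      # KL beta
    use_vllm=False,                 # plain-Transformers rollouts (see note below)
)
```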
Reward functions (a signature sketch follows this list):

- format reward;
- action-contract reward;
- reasoning-shape reward;
- Phase-1 simulator regret reward;
- carbon-guard reward.
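A sketch of the shape these rewards take under TRL's GRPO reward-function interface, assuming string completions and reusing the `check_format` and `validate_action` sketches from earlier in this card; the scoring logic here is illustrative, not the repo's actual implementation:

```python
def format_reward(completions, **kwargs):
    """1.0 for a closed <think> block followed by valid JSON, else 0.0."""
    return [1.0 if check_format(c) else 0.0 for c in completions]

def action_contract_reward(completions, **kwargs):
    """1.0 if the trailing JSON validates against the PortfolioAction contract."""
    scores = []
    for c in completions:
        try:
            validate_action(c.split("</think>")[-1])
            scores.append(1.0)
        except Exception:
            scores.append(0.0)
    return scores
```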
Important engineering choice: we avoided vLLM for the Qwen2.5 GRPO run because earlier vLLM-based Qwen3 rollouts collapsed to one-token completions. The plain-Transformers path was slower but healthier and easier to debug.
## Evidence of Training

The 100-step GRPO run was launched as a Hugging Face Job:
https://huggingface.co/jobs/77ethers/69ed1ce0d70108f37acdeea3

Raw evidence committed in this repo:

- `training_logs/qwen25_grpo_phase1_100_v1.log`
- `training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl`

The parsed JSONL contains 100 real GRPO metric rows extracted from the job log.
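A minimal sketch for re-plotting from the committed rows; the metric keys here are assumptions based on standard TRL trainer logging, so adjust them to the actual row schema:

```python
import json

import matplotlib.pyplot as plt

with open("training_logs/qwen25_grpo_phase1_100_v1_rows.jsonl") as f:
    rows = [json.loads(line) for line in f]

steps = range(1, len(rows) + 1)
# "loss" and "reward" are typical TRL log keys; verify against the file.
plt.plot(steps, [r["loss"] for r in rows], label="loss")
plt.plot(steps, [r["reward"] for r in rows], label="reward")
plt.xlabel("GRPO step")
plt.legend()
plt.savefig("qwen25_grpo_phase1_100_v1_metrics.png")
```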
Loss and reward plots generated from those rows, plus an additional rollout-health (completion-length) plot, accompany this card.
The completion-length plot is included because one-token rollout collapse was the main failure mode in earlier GRPO attempts. In this successful run, completion lengths stayed well above the smoke threshold throughout training.
## Evaluation

### Holdout

Holdout seeds: 100, 200, 300, 400, 500

Best GRPO holdout results:

| Metric | Value |
|---|---|
| Valid completions | 5/5 |
| Mean holdout regret | +0.1058 |
| Beats baseline | 5/5 |
| Previous v6 SFT mean regret bar | +0.034 |
Per-seed holdout:

| Seed | Shock | Regret |
|---|---|---|
| 100 | hard_rare_earth_rotation | +0.0755 |
| 200 | easy_tech_earnings | +0.1210 |
| 300 | easy_tech_earnings | +0.1442 |
| 400 | hard_deflation_pulse | +0.1527 |
| 500 | ambig_ai_efficiency | +0.0358 |
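As a quick consistency check, the headline mean regret is simply the average of the per-seed values:

```python
regrets = [0.0755, 0.1210, 0.1442, 0.1527, 0.0358]
print(sum(regrets) / len(regrets))  # 0.10584 -> the +0.1058 reported above
```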
### Manual Macro Eval

Eval set:
`evals/macro_eval_10.jsonl`

Report:
`evals/macro_eval_10_grpo_report.json`

Summary:

- GRPO adapter: 10/10 valid JSON actions;
- GRPO adapter: 10/10 closed `<think>` blocks;
- base model: 9/10 valid JSON actions;
- GRPO was stronger on rare-earth export controls, global deflation pulse, and yen carry unwind.
Known weaknesses:

- `q02_oil_chokepoint_inflation`: the model understood the inflation regime and hedged, but underweighted OIL despite the direct supply shock.
- `q04_ai_efficiency_paradox`: the model correctly liked TECH and cut REAL_ESTATE, but gave GREEN too much weight despite lower data-center power demand expectations.
These are targeted follow-up items, not hidden failures.
## Comparison With Qwen3 Base Branch

We also tested an isolated Qwen3-4B-Base branch:
`77ethers/CarbonAlpha/grpo_qwen3_4b_base_smoke_v2`

Result:

- smoke gate passed mechanically;
- no one-token collapse;
- completions were too long, often near the 400-token cap;
- holdout: 4/5 valid;
- mean holdout regret: -0.0229;
- did not beat the Qwen2.5 GRPO model.
Conclusion: Qwen3 Base is a viable research branch, but the current production candidate remains Qwen2.5-7B SFT plus GRPO.
## Limitations

- The GRPO run is Phase 1 only, so it is strongest on easy-shock simulator reward optimization.
- The model still has known second-order reasoning weaknesses in specific macro setups.
- The reward environment is synthetic and should be interpreted as a benchmark, not a market simulator.
- The model is private on Hugging Face and requires an `HF_API_TOKEN` for loading.
## Reproducibility

Final notebook:
`notebooks/carbonalpha_final_pipeline.ipynb`

Colab link:
https://colab.research.google.com/github/capabl-machines/gridops/blob/round-2/notebooks/carbonalpha_final_pipeline.ipynb

The notebook verifies artifacts, loads metrics from Hugging Face, runs an environment smoke test, shows the manual eval set, and includes opt-in cells to relaunch the exact HF Jobs training runs.
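A minimal sketch in the spirit of the notebook's environment smoke test. The import path, constructor, and Gym-style `reset`/`step` interface are assumptions; the actual `portfolio_env` OpenEnv API may differ:

```python
from portfolio_env import PortfolioEnv  # hypothetical import path

env = PortfolioEnv(seed=100)
observation = env.reset()  # one macro-news event
equal_weight_action = {   # the environment's equal-weight baseline, assumed shape
    "weights": [0.2, 0.2, 0.2, 0.2, 0.2],
    "infra_commit": 0.0,
    "carbon_offset_buy": 0.0,
    "put_hedge": 0.0,
    "tech_bet": "status_quo",
}
result = env.step(equal_weight_action)
print(result)  # baseline reward on this seed
```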