Releases and updates to Axolotl AI on GitHub
Release Announcements
v0.7.0
Axolotl v0.7.0 is out!
GRPO support
Process Reward Model support
KD Training from offline top-k logprobs
Multi-GPU LoRA kernels
Deploy your training and evaluation workloads straight to Modal from the axolotl CLI
Sweeps
Chat template parsing improvements
Improved Mac OS support
Dependency upgrades & assorted fixes
Process Reward Models
Take your test-time scaling to new heights by training your own Process Reward Models (PRM)! Thanks to PRM training support in @huggingface TRL we’ve streamlined fine-tuning and configuration of PRMs, which can be used as powerful step-by-step verifiers for reasoning models. We’ve also open-sourced several datasets which you can use out-of-the-box with our trainer, and a cookbook to help you evaluate your trained PRMs. Check out our blogpost below for more details.
Blog Post: https://axolotlai.substack.com/p/Process-Reward-Models
Cookbook: https://github.com/axolotl-ai-cloud/axolotl-cookbook/prm
PRM 🤗 Collection: https://huggingface.co/collections/axolotl-ai-co/process-reward-models-67b4b4355da4e1fe6ba44875
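As a rough illustration of the TRL layer that Axolotl builds on here, the sketch below fine-tunes a PRM with TRL's PRMTrainer on a toy stepwise-supervision dataset. The base model, hyperparameters, and data are placeholders for illustration, not an Axolotl config or our exact training setup.

```python
# Minimal PRM fine-tuning sketch using TRL's PRMTrainer.
# Model name, config values, and the toy dataset are placeholders.
from datasets import Dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A PRM scores each reasoning step, so it uses a token-classification head.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

# Stepwise supervision: one correctness label per reasoning step.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2 * 3?"],
    "completions": [["First compute 2 * 3 = 6.", "Then 2 + 6 = 8."]],
    "labels": [[True, True]],
})

trainer = PRMTrainer(
    model=model,
    args=PRMConfig(output_dir="prm-sketch", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because each step carries its own label, the trained model can be queried step by step at inference time to verify partial reasoning traces, which is what makes PRMs useful for test-time scaling.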
KD Training from offline top-k logprobs
The software stack for knowledge distillation from teacher models is now much simpler: we leverage top-k logprobs (instead of full logits) from off-the-shelf inference engines like @vllm_project. Online top-k KD is on our roadmap for a future release.
Many thanks to Charles Goddard, Fernando and Lucas from @arcee_ai for their guidance on this.
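To make the mechanism concrete, here is an illustrative, self-contained sketch of a distillation loss computed from a teacher's offline top-k logprobs; it is not Axolotl's internal implementation, just the general shape of the computation once an engine like vLLM has dumped top-k token ids and log-probabilities for each position.

```python
# Illustrative top-k knowledge-distillation loss (not Axolotl's implementation).
import torch
import torch.nn.functional as F

def topk_kd_loss(student_logits, teacher_topk_ids, teacher_topk_logprobs):
    """Forward-KL distillation restricted to the teacher's top-k support.

    student_logits:        (batch, seq, vocab)
    teacher_topk_ids:      (batch, seq, k) token ids of the teacher's top-k
    teacher_topk_logprobs: (batch, seq, k) matching log-probabilities
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # Student log-probs at the teacher's top-k token positions.
    student_topk = torch.gather(student_logprobs, -1, teacher_topk_ids)
    # Renormalize the teacher's truncated distribution over its top-k support.
    teacher_logprobs = F.log_softmax(teacher_topk_logprobs, dim=-1)
    teacher_probs = teacher_logprobs.exp()
    # KL(teacher || student) summed over the truncated support, averaged over tokens.
    kl = (teacher_probs * (teacher_logprobs - student_topk)).sum(dim=-1)
    return kl.mean()
```

Restricting the loss to the top-k support is what removes the need to store or transfer full-vocabulary teacher logits, which is the main simplification this release leans on.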
Modal Deployment
Deploying your workloads to @modal_labs is now simpler through your local axolotl CLI. Just configure your cloud resources in a YAML file, and our CLI takes care of everything else.
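For a sense of what the CLI is abstracting away, the sketch below hand-writes the kind of Modal job the deployment path replaces, using Modal's public Python SDK directly. The image contents, GPU type, timeout, and config path are all placeholder assumptions; this is not the CLI's internal implementation.

```python
# Hand-rolled Modal job roughly equivalent to what the axolotl CLI now sets up
# for you; all specifics here (deps, GPU, paths) are illustrative assumptions.
import subprocess
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("axolotl[flash-attn,deepspeed]")      # placeholder dependency set
    .add_local_file("config.yml", "/root/config.yml")  # placeholder training config
)
app = modal.App("axolotl-train-sketch", image=image)

@app.function(gpu="H100", timeout=60 * 60)
def train():
    # Run the normal axolotl entrypoint inside the remote container.
    subprocess.run(["axolotl", "train", "/root/config.yml"], check=True)

@app.local_entrypoint()
def main():
    train.remote()
```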
Multi-GPU LoRA kernels
Accelerate your LoRA and QLoRA post-training runs using our newly implemented Triton kernels and custom autograd functions! Inspired by Unsloth, these optimizations can be patched into common LLM architectures to speed up model forward and backward passes by ~25-50% and reduce peak VRAM usage by ~25-40%. Check out our forthcoming blog post for more details.
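The real gains come from the fused Triton kernels, but the custom-autograd half of the idea can be sketched in plain PyTorch: compute the frozen base projection and the low-rank update together, and recompute the cheap rank-r intermediate during backward instead of caching it. The sketch below is conceptual, assumes a 2D input, and is not the shipped implementation.

```python
# Conceptual sketch (NOT Axolotl's Triton kernels) of a fused LoRA linear with
# a custom autograd function that trades a little recompute for lower memory.
import torch

class LoRALinearFn(torch.autograd.Function):
    """y = x @ W.T + scale * (x @ A.T) @ B.T for 2D x of shape (tokens, in)."""

    @staticmethod
    def forward(ctx, x, W, A, B, scale):
        ctx.save_for_backward(x, W, A, B)
        ctx.scale = scale
        return x @ W.t() + scale * (x @ A.t()) @ B.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, W, A, B = ctx.saved_tensors
        scale = ctx.scale
        # Recompute the rank-r intermediate rather than storing it in forward.
        xa = x @ A.t()                                   # (tokens, r)
        grad_x = grad_out @ W + scale * (grad_out @ B) @ A
        grad_A = scale * (grad_out @ B).t() @ x          # matches A's (r, in) shape
        grad_B = scale * grad_out.t() @ xa               # matches B's (out, r) shape
        # The base weight W is frozen in (Q)LoRA, so it receives no gradient.
        return grad_x, None, grad_A, grad_B, None

# Usage sketch: x (tokens, in), W (out, in), A (r, in), B (out, r)
x = torch.randn(8, 64)
W = torch.randn(32, 64)                     # frozen base weight
A = torch.randn(4, 64, requires_grad=True)  # LoRA down-projection
B = torch.zeros(32, 4, requires_grad=True)  # LoRA up-projection (zero init)
y = LoRALinearFn.apply(x, W, A, B, 0.5)
y.sum().backward()
```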
GRPO
Post-train your models using the latest SOTA RL technique pioneered by DeepSeek. We make it easier to configure GRPO workloads on top of @huggingface TRL, and we've upstreamed our PEFT + vLLM support to TRL to improve the efficiency of GRPO post-training.
Cookbook: https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo
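For a feel of the TRL interface that Axolotl configures for you, here is a minimal GRPO sketch with a toy prompt and reward function; the model name, dataset, and reward are placeholders rather than a recommended setup.

```python
# Minimal GRPO sketch via TRL's GRPOTrainer; model, data, and reward are toys.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions near 50 characters (illustrative only).
    return [-abs(len(completion) - 50) for completion in completions]

train_dataset = Dataset.from_dict({"prompt": ["Explain GRPO in one sentence."]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder base model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=train_dataset,
)
trainer.train()
```

Most of the problem-specific work lives in the reward functions: GRPO samples a group of completions per prompt and normalizes rewards within the group, so rewards only need to rank completions sensibly relative to one another.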
Release Notes: https://github.com/axolotl-ai-cloud/axolotl/releases/tag/v0.7.0
See All that Changed
fix build w pyproject to respect installed torch version by @winglian in #2168
evaluation_strategy was fully deprecated in recent release by @winglian in #2169
parity for nightly ci - make sure to install setuptools by @winglian in #2176
Add hub model id config options to all example yml files. by @bursteratom in #2196
move the setting of PYTORCH_CUDA_ALLOC_CONF to the cli rather than train module by @winglian in #2183
use axolotl contribs for fix_untrained_tokens by @winglian in #2194
fix: use apply_chat_template to find turn boundaries and allow tool_calling field by @NanoCode012 in #2179
use DataCollatorWithFlattening when not sample packing by @winglian in #2167
adding test_datasets compat with pretraining_dataset (streaming) by @djsaunde in #2206
move the dataset loading from remote/disk to a shared function so we can re-use for RL by @winglian in #2204
add deepspeed example with torch compile enabled by @winglian in #2212
inference - don't default w accelerate, fix base model by @winglian in #2216
fix untrained tokens if specified explicitly from a list by @winglian in #2210
fix: allow trainer builder to use custom jinja chat template by @NJordan72 in #2219
make sure padding is labeled as -100 for pretraining by @winglian in #2227
Fixing OSX installation by @SalmanMohammadi in #2231
fix: mistral nemo does not recognize token_type_ids in forward by @NanoCode012 in #2233
feat: use SequentialSampler if curriculum_sampling is enabled with sample_packing by @v-dicicco in #2235
feat: add support for data_files in pretraining by @NanoCode012 in #2238
rename liger test so it properly runs in ci by @winglian in #2246
use 2.5.1 docker images as latest tag as it seems stable by @winglian in #2198
add helper to verify the correct model output file exists by @winglian in #2245
assume empty lora dropout means 0.0 and add tests by @winglian in #2243
rename references to dpo dataset prep to pref data by @winglian in #2258
fix: use text_column even when not packing for pretraining by @NanoCode012 in #2254
fix for indexing error inside torch.embeddings caused by num embeddings > num tokens in tokenizer by @jwongTensora in #2257
option to not concatenate during pretraining by @winglian in #2263
Add 5000 line history limit to tmux for docker cloud by @adi-kmt in #2268
use the extracted field_messages to parse the role fields by @winglian in #2265
support for latest transformers release 4.48.1 by @winglian in #2256
chore(doc): fix explanation on gcs creds retrieval by @NanoCode012 in #2272
Take split param from config in all load_dataset instances by @mashdragon in #2281
chore(doc): improve explanation for *_steps and *_strategy by @NanoCode012 in #2270
support for custom lr groups for non-embedding modules by @winglian in #2213
chore: refactor SaveModelCallback to stop handling fractional save_steps by @NanoCode012 in #2291
Num epochs float by @mashdragon in #2282
Removing torch 2.3.1 by @SalmanMohammadi in #2294
Process reward models by @SalmanMohammadi in #2241
Ray Train Axolotl Integration by @erictang000 in #2251
native support for modal cloud from CLI by @winglian in #2237
Defaulting to fused=True AdamW by @SalmanMohammadi in #2293
match the cuda version for 2.4.1 build w/o tmux by @winglian in #2299
make save_safetensors: true the default by @winglian in #2292
refactor README; hardcode links to quarto docs; add additional quarto doc pages by @djsaunde in #2295
fix: add warning for invalid eval_steps or save_steps by @NanoCode012 in #2298
set MODAL_IMAGE_BUILDER_VERSION to 2024.10 to test latest builder by @winglian in #2302
better handling of multipack dataset length by @winglian in #2296
fix: drop long seq even if not sample packing by @NanoCode012 in #2211
Torch 2.6 support for base docker image by @winglian in #2312
feat: add torch2.6 to ci by @NanoCode012 in #2311
feat: update FA to 2.7.4.post1 which includes torch2.6 binary by @NanoCode012 in #2315
chore: remove redundant py310 from tests by @NanoCode012 in #2316
fix(config): missing config not being documented and fix model_ override by @NanoCode012 in #2317
feat(doc): Add multi-node torchrun info by @NanoCode012 in #2304
Update faq.qmd by @bursteratom in #2319
disable ray tests for latest torch release by @winglian in #2328
[Fixing #2149] load_from_disk for RL-type training by @leeparkuky in #2193
feat(doc): Improve guide to dataset types with better examples by @NanoCode012 in #2286
feat(doc): add tensorboard config to docs by @NanoCode012 in #2329
Add bos_token and add_generation_prompt to the alpaca chat template by @minpeter in #2322
fix: add missing shards_idx, preprocess_shards to docs and validator by @NanoCode012 in #2331
add support for include_tokens_per_second in training args by @winglian in #2269
Select input_ids explicitly after panda conversion by @seungduk-yanolja in #2335
Activation function Triton kernels, LoRA custom autograd functions by @djsaunde in #2324
feat: add config for optional parameters in a chat message by @NJordan72 in #2260
chore: cleanup deprecated config elements by @NJordan72 in #2309
Join us on the Axolotl AI Discord.