Distributed-Model-Training-on-Frontier

This repository contains a supervised fine-tuning (SFT) walkthrough for running LoRA-based training of Meta-Llama-3 on ORNL's Frontier. The instructions below ...
This tutorial assumes that the cluster is configured to accept a single job on each Graphics Processing Unit (GPU). Users can submit job arrays, which is the default way to submit jobs for distributed ...