General Solution for Notebook-Based Model Fine-tuning and Training
TOC
- Background
- Scope
- Preparation
- LLM Model Fine-tuning Steps
  - Creating a Notebook/VSCode Instance
  - Preparing the Model
  - Preparing the Model Output Location
  - Preparing the Dataset
    - Huggingface dataset format
    - LLaMA-Factory Format
  - Preparing the Fine-tuning and Training Runtime Image
  - Creating the Fine-tuning VolcanoJob Task
  - Viewing and Managing Task Status
  - Experiment Tracking and Comparison
  - Launching the Inference Service Using the Fine-tuned Model
- Adapting to Non-NVIDIA GPUs
  - Preparation
  - Verifying the Original Vendor Solution (Optional)
  - Converting the Vendor Solution to Run as a Kubernetes Job/Deployment (Optional)
  - Modifying the Vendor Solution to Run as a VolcanoJob
  - Experiment Tracking and Comparison
- Summary

Background
Model fine-tuning and training often require adapting to different model structures, hardware devices, and appropriate parallel training methods. Alauda AI Notebook provides a comprehensive approach, from model development to training task submission and management, and experiment tracking, helping model and algorithm engineers quickly adapt and complete the entire model fine-tuning and training process.
Alauda AI Notebook creates a Notebook/VSCode (CodeServer) container environment for development and debugging in a user namespace. Multiple Notebook/VSCode instances can be created within a namespace to preserve separate environments for different users and development tasks. A Notebook can request only CPU resources and be used for development and cluster task submission, with the submitted tasks running on the cluster's GPU resources. GPU resources can also be requested for a Notebook, allowing tasks such as training and fine-tuning to be completed directly within the Notebook, regardless of the distribution mode.
In addition, you can use the platform's built-in MLFlow to record various metrics for each model fine-tuning training session, making it easier to compare multiple experiments and select the final model.
We use VolcanoJob, a Kubernetes-native batch job resource from the Volcano project, to submit cluster tasks from Notebooks. The Volcano scheduler supports queues, priorities, and various scheduling policies, enabling more efficient cluster task scheduling and improving resource utilization.
This solution uses the LLaMA-Factory tool to launch fine-tuning and training tasks. However, for larger-scale model fine-tuning and training scenarios requiring parallel methods like Tensor Parallelism, Context Parallelism, and Expert Parallelism to train larger models, it may be necessary to use other tools, build custom fine-tuning runtime images, and modify the task launch script to adapt to different tools and models. For more detailed LLaMA-Factory usage and parameter configuration, please refer to: https://llamafactory.readthedocs.io/en/latest/index.html
Scope
- This solution is applicable to Alauda AI 1.4 and later.
- This solution is applicable to x86/64 CPU and NVIDIA GPU scenarios.
- Fine-tuning and training of LLM models. Training other types of models (such as YOLOv5) requires different images, startup scripts, datasets, etc.
- NPU scenarios require building a compatible runtime image based on this solution.
Preparation
- You must first deploy the Kubeflow plugin to enable Notebook support.
- Turn on the "experimental" feature, or install the MLFlow plugin.
LLM Model Fine-tuning Steps
Creating a Notebook/VSCode Instance
Note: In Alauda AI >= 1.4, you can create a Notebook instance using Workbench in the left navigation.
From the navigation bar, go to Workbench and create a new workbench instance. Note that it is recommended that the Notebook only use CPU resources. Submitting a cluster task from within the Notebook will request GPU resources within the cluster to improve resource utilization.
- Click Workbench in the left navigation to enter the Workbench list page.
- Find the Create button and click Create to enter the creation page.
- Configure the Notebook instance:
- Name
- Image: You can start directly using the built-in Notebook image. You can also build a custom image based on the base Notebook image provided by Alauda. Select "Custom Image" and enter the image address.
- Container CPU and memory requirements. Expand "Advanced Options" to configure higher CPU and memory limits.
- GPU: Select the GPU resources to use. You can specify a full GPU or virtual GPU solution.
- Workspace Volume: The default storage volume (PVC) used for the Notebook directory. If not specified, a storage volume is automatically created for the current notebook. You can also click the drop-down button to configure the storage volume information.
- Data Volume: Mount one or more additional storage volumes within the Notebook. For example, if your dataset or model is stored on another storage volume, you can mount additional volumes.
- Configuration Item: You can leave this option unselected.
- Shared Memory: Enable this option if you want to use features such as multi-GPU communication within the Notebook. Otherwise, do not enable it.
Preparing the Model
Refer to the Alauda AI online documentation for detailed steps on how to upload a model using the notebook.
Preparing the Model Output Location
Create an empty model in the model repository to store the output model. When configuring the fine-tuning output location, enter the model's Git repository URL.
Preparing the Dataset
Download and push the sample identity dataset to the dataset repository. This dataset is used to fine-tune the LLM to answer user questions such as "Who are you?"
- First, create an empty dataset repository under "Datasets" - "Dataset Repository".
- Upload the zip file to the notebook, unzip it, then navigate to the dataset directory. Use git lfs to push the dataset to the dataset repository's Git URL (see the command sketch after this list). The steps are similar to uploading the model; for details, refer to the Alauda AI online documentation.
- After the push is complete, refresh the dataset page and you should see that the file has been successfully uploaded in the "File Management" tab.
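The following is a minimal command sketch for this push, assuming the data was unpacked into ./identity-alauda and the dataset repository's Git URL was copied from the repository created in the first step:

```bash
git clone <dataset repository Git URL> identity-dataset
cp -r identity-alauda/* identity-dataset/
cd identity-dataset
git lfs install
git lfs track "*.json"          # track large data files with Git LFS
git add .gitattributes .
git commit -m "import identity dataset"
git push origin HEAD
```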
If you wish to import a dataset in a different format, you must save the dataset in a format compatible with Huggingface datasets (see: https://huggingface.co/docs/datasets/repository_structure, https://huggingface.co/docs/datasets/create_dataset). Then, modify the README.md file in the dataset repository to provide a metadata description for the dataset. For example:
Sample README.md
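For reference, a hedged sketch of the YAML front matter such a README.md might contain; the task category, feature names, and data file path are illustrative only:

```yaml
---
task_categories:
  - text-generation
dataset_info:
  features:
    - name: instruction
      dtype: string
    - name: input
      dtype: string
    - name: output
      dtype: string
configs:
  - config_name: default
    data_files:
      - split: train
        path: identity.json
---
```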
Where:
- task_categories: Specifies the fine-tuning and training task types for this dataset.
- dataset_info: Configures the dataset's feature columns, label columns, and other information.
- configs: Configures one or more "configs." Each config specifies how the dataset is split and other details when that config is used.
Note: The dataset format must be correctly recognized and read by the fine-tuning framework to be used in subsequent fine-tuning tasks. The following examples illustrate two common LLM fine-tuning dataset formats:
Huggingface dataset format
You can use the following code to check whether the dataset directory format can be correctly loaded by datasets:
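A minimal sketch, assuming the dataset repository has been cloned into the local directory ./identity-alauda:

```python
from datasets import load_dataset

# resolves splits and features from the README.md metadata and the repository structure
ds = load_dataset("./identity-alauda")
print(ds)              # show splits and feature columns
print(ds["train"][0])  # inspect the first sample
```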
LLaMA-Factory Format
If you use the LLaMA-Factory tool in the examples to complete training, the dataset format must conform to the LLaMA-Factory format. Reference: https://llamafactory.readthedocs.io/en/latest/getting_started/data_preparation.html
Preparing the Fine-tuning and Training Runtime Image
Use the following Dockerfile to build the training image. If you wish to use a different training framework, such as YOLOv5, you may need to customize the image and install the required dependencies within it.
After building the image, you need to upload it to the Docker registry of the Alauda AI platform cluster and configure it in the following tasks.
Note: The git lfs command is required within the image to download and upload the model and dataset files.
Dockerfile
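A hedged sketch of such a Dockerfile; the base image and package versions are assumptions, so align the CUDA version with the GPU drivers available in your cluster:

```dockerfile
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# git and git lfs are required to download/upload models and datasets from the repositories
RUN apt-get update && \
    apt-get install -y --no-install-recommends git git-lfs && \
    git lfs install && \
    rm -rf /var/lib/apt/lists/*

# install LLaMA-Factory (provides llamafactory-cli used in the task script) and mlflow
RUN pip install --no-cache-dir llamafactory mlflow

WORKDIR /workspace
```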
Creating the Fine-tuning VolcanoJob Task
In Notebook, create the YAML file for the task submission. Refer to the following example:
VolcanoJob YAML File
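A minimal, hedged sketch of such a VolcanoJob follows; save it as vcjob_sft.yaml for the submit command later in this section. The image address, Git URLs, queue, PVC name, resource sizes, and LLaMA-Factory hyperparameters are placeholders, and the LoRA merge step is abbreviated to a comment:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-sft
spec:
  schedulerName: volcano
  minAvailable: 1
  queue: default
  tasks:
    - name: trainer
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: <registry>/llm-finetune-runtime:latest   # image built from the Dockerfile above
              env:
                - name: BASE_MODEL_URL
                  value: "<Git URL of the prepared base model>"
                - name: DATASET_URL
                  value: "<Git URL of the identity-alauda dataset>"
                - name: OUTPUT_MODEL_URL
                  value: "<Git URL of the empty output model>"
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -ex
                  # cache the base model and dataset from their Git repositories into the workspace PVC
                  git clone "$BASE_MODEL_URL" /workspace/base-model
                  git clone "$DATASET_URL" /workspace/dataset
                  # task hyperparameters are written directly in the startup script
                  cat > /workspace/sft_lora.yaml <<'EOF'
                  model_name_or_path: /workspace/base-model
                  stage: sft
                  do_train: true
                  finetuning_type: lora
                  template: default   # set to the chat template matching the base model
                  dataset_dir: /workspace/dataset   # 'identity' must be defined in its dataset_info.json
                  dataset: identity
                  output_dir: /workspace/output
                  per_device_train_batch_size: 1
                  gradient_accumulation_steps: 8
                  learning_rate: 1.0e-4
                  num_train_epochs: 3.0
                  report_to: mlflow
                  EOF
                  llamafactory-cli train /workspace/sft_lora.yaml
                  # merge the LoRA adapter into the base model before uploading
                  # (e.g. with llamafactory-cli export; omitted in this sketch)
                  # push the output model to the output repository on a time-stamped branch
                  git clone "$OUTPUT_MODEL_URL" /workspace/output-repo
                  cp -r /workspace/output/* /workspace/output-repo/
                  cd /workspace/output-repo
                  git lfs track "*.safetensors" "*.bin"
                  git add .gitattributes .
                  git -c user.email=job@local -c user.name=vcjob commit -m "sft output"
                  git push origin HEAD:train-$(date +%Y%m%d%H%M%S)
              resources:
                limits:
                  cpu: "8"
                  memory: 64Gi
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: sft-workspace   # manually created PVC; use a temporary PVC to auto-clean the workspace
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 4Gi
```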
In the YAML file for the above task, modify the following content to correctly submit the task in the environment.
- Task image: Contains the dependencies required for task execution.
- Locations of the original model, dataset, and output model for the task:
  - BASE_MODEL_URL: Change to the Git URL of the prepared model.
  - DATASET_URL: Change to the Git URL of the prepared dataset identity-alauda.
  - OUTPUT_MODEL_URL: Create an empty model in the model repository to store the output model, and then enter the Git URL of this model.
- Required resources for the task, including:
  - Workspace PVC: Used to store the original model (a pretrained model is not required when training from scratch), the dataset, and training checkpoints.
    - Manually specified PVC: Retained after the task finishes. Useful if you want to keep the workspace and, for example, reuse the original model in the next task or inspect checkpoints.
    - Temporary PVC: Automatically deleted after the task finishes to free up space.
  - Shared Memory: For multi-GPU/distributed training tasks, it is recommended to allocate at least 4 Gi of shared memory.
  - CPU, memory, and GPU resources required for the task (based on the GPU device plugin deployed in the cluster).
- Task Execution Script:
  - The example script includes caching the model from the model repository to the PVC, caching the training dataset to the PVC, and pushing the model to the new model repository after fine-tuning. If you modify the execution script, keep these steps as well.
  - The example script uses the LLaMA-Factory tool to launch the fine-tuning task, which covers most LLM fine-tuning and training scenarios.
- Task Hyperparameters: In the example above, the task hyperparameters are defined directly in the startup script. You can also read frequently adjusted hyperparameters from environment variables, making it easier to run and reconfigure the task multiple times.
After completing the configuration, open a terminal in Notebook and execute: kubectl create -f vcjob_sft.yaml to submit the VolcanoJob task to the cluster.
Viewing and Managing Task Status
In the Notebook terminal:
- Run kubectl get vcjob to view the task list, then kubectl get vcjob <task name> to view the status of the VolcanoJob task.
- Run kubectl get pod to view the pod status, and kubectl logs <pod name> to view the task logs. Note that for distributed tasks, multiple pods may exist.
- If the pod is not created, run kubectl describe vcjob <task name> or kubectl get podgroups to view the Volcano podgroup. You can also check the Volcano scheduling information to determine whether the scheduling issue is due to insufficient resources, an inability to mount a PVC, or another scheduling problem.
- After the task executes successfully, the fine-tuned model is automatically pushed to the model repository. Note that the task automatically generates a repository branch for the push based on the time; when using the output model, be sure to select the correct version.
Run kubectl delete vcjob <task name> to delete the task.
Experiment Tracking and Comparison
In the fine-tuning example task above, we used the LLaMA-Factory tool to launch the fine-tuning task and added report_to: mlflow to the task configuration. This automatically outputs training metrics to the mlflow server. After the task completes, we can find the experiment tracking records under Alauda AI - "Advanced" - "MLFlow" and compare multiple executions. For example, we can compare the loss convergence of multiple experiments.
Launching the Inference Service Using the Fine-tuned Model
After the fine-tuning task completes, the model is automatically pushed to the model repository. You can use the fine-tuned model to launch the inference service and access it.
Note: In the example task above, the LoRA fine-tuning method was used, and the LoRA adapter was merged with the original model before uploading. This allows the output model to be published directly to the inference service. _Direct publishing is not currently supported on the platform if only the LoRA adapter is available._
The specific steps are as follows:
- Go to AI > Model Repository, find the fine-tuned output model, go to Model Details > File Management > Modify Source Data, select "Text Classification" for Task Type, and "Transformers" for Framework.
- After completing the first step, click the "Publish Inference Service" button.
- On the Publish Inference Service page, configure the inference service to use the vllm inference runtime (select the CUDA version based on the drivers supported in the cluster), complete the remaining PVC, resource, and GPU configurations, and click "Publish."
- After the inference service starts, click the "Experience" button in the upper-right corner of the inference service page to chat with the model. (Note: Only models that include the chat_template configuration have conversational capabilities.)
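As a hedged sketch of accessing the service programmatically: the vllm runtime exposes an OpenAI-compatible API, so once the service is running it can be called as below. The service address and model name are placeholders that depend on your deployment.

```bash
curl -s http://<inference-service-address>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<output model name>",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```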
Adapting to Non-NVIDIA GPUs
When using a non-NVIDIA GPU environment, you can follow the common steps below to fine-tune models, launch training tasks, and manage them in the Alauda AI Notebook.
Note: The following methods can also be reused for scenarios such as large model pre-training and small model training. These are general steps for converting a vendor solution to Notebook + VolcanoJob.
Preparation
- Prerequisite: The vendor GPU driver and Kubernetes device plugin have been deployed in the cluster. The devices can be accessed within the pod created by Kubernetes.
- Note: You will need to know the vendor GPU resource name and the total number of device resources in the cluster to facilitate subsequent task submission.
- For example, for Huawei NPUs, you can request an NPU card using huawei.com/Ascend910: 1.
- Obtain the vendor-provided solution documentation and materials for fine-tuning on the current vendor's GPU. This typically includes:
  - Solution documentation and steps. The solution may run on Kubernetes or in a container started with docker run.
  - Image to run the fine-tuning. For example, the vendor provides a fine-tuning solution based on LLaMA-Factory and a corresponding image (LLaMA-Factory may already be installed in the image).
  - Model to run the fine-tuning. Vendor devices typically support a range of models; use a model that the device supports or one provided in the vendor solution.
  - Training data. Use the sample data provided in the vendor solution documentation or construct your own dataset in the same format.
  - Task launch command and parameters. For example, the LLaMA-Factory framework fine-tuning solution uses the llamafactory-cli command to launch the fine-tuning task and configures its parameters, including task hyperparameters, in a YAML file.
Verifying the Original Vendor Solution (Optional)
To ensure the correct execution of the vendor solution and reduce subsequent troubleshooting, you can first run it completely according to the vendor solution to verify that it works correctly.
This step can be skipped. However, if issues arise during later task execution, you can return to this step to check whether the original vendor solution itself is the problem.
Converting the Vendor Solution to Run as a Kubernetes Job/Deployment (Optional)
If the vendor solution is already running as a Kubernetes job/deployment/pod, you can skip this step.
If the vendor solution uses a container execution method, such as docker run, you can first use a simple Kubernetes job to verify that the solution runs correctly in a Kubernetes environment where the vendor device plugin is deployed.
Note: Running the solution as a plain Kubernetes Job first verifies the vendor device separately from Volcano, helping rule out issues such as a VolcanoJob being unable to schedule vendor GPU devices.
Reference:
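Below is a hedged sketch of such a verification Job; the image, startup command, PVC name, and the huawei.com/Ascend910 resource name (taken from the Huawei NPU example above) are placeholders to replace with the vendor solution's values:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vendor-finetune-verify
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: finetune
          image: <vendor fine-tuning image>
          # replace with the launch command from the vendor's docker run instructions
          command: ["/bin/bash", "-c", "llamafactory-cli train /workspace/vendor_sft.yaml"]
          resources:
            limits:
              huawei.com/Ascend910: "1"   # vendor GPU/NPU resource name
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: vendor-workspace   # holds the model, dataset, and config from the vendor solution
```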
Modifying the Vendor Solution to Run as a VolcanoJob
Refer to the following YAML definition:
VolcanoJob YAML File
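A hedged sketch follows; it mirrors the fine-tuning VolcanoJob shown earlier, with the vendor image and GPU resource name substituted (the Huawei NPU resource is used as the example):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vendor-llm-sft
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - name: trainer
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: <vendor fine-tuning image>
              command: ["/bin/bash", "-c", "llamafactory-cli train /workspace/vendor_sft.yaml"]
              resources:
                limits:
                  huawei.com/Ascend910: "1"   # vendor GPU/NPU resource name
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: vendor-workspace
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 4Gi
```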
Experiment Tracking and Comparison
Some fine-tuning/training frameworks automatically record experiment progress to various experiment tracking services. For example, the LLaMA-Factory and Transformers frameworks can specify recording of experiment progress to services such as mlflow and wandb. Depending on your deployment, you can configure the following environment variables:
- MLFLOW_TRACKING_URI: The URL of the mlflow tracking server.
- MLFLOW_EXPERIMENT_NAME: The experiment name, typically the namespace name, which distinguishes a group of tasks.
The framework also specifies the recording destination. For example, LLaMA-Factory requires specifying report_to: mlflow in the task parameter configuration YAML file.
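For example, a hedged sketch of the env section of the training container; the tracking server address is a placeholder to replace with your MLFlow service URL:

```yaml
env:
  - name: MLFLOW_TRACKING_URI
    value: "http://<mlflow-tracking-server>:5000"
  - name: MLFLOW_EXPERIMENT_NAME
    value: "<namespace name>"
```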
After a training task begins, you can find the corresponding task in the Alauda AI - "Advanced" - MLFlow interface and view the curves of each recorded metric in "Metrics" or the parameter configuration for each execution. You can also compare multiple experiments.
Summary
Using the Alauda AI Notebook development environment, you can quickly submit fine-tuning and training tasks to a cluster using YAML and command-line tools, and manage the execution status of these tasks. This approach allows you to quickly develop and customize model fine-tuning and training steps, enabling operations such as LLM SFT, preference alignment, traditional model training, and multiple experimental comparisons.