{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "21111b1f-7cce-4e8b-8337-8f0cdab5804e",
   "metadata": {},
   "source": [
    "# AutoTrain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd09a9fd-4b90-48f3-b61c-d2349eb7f43e",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52543575-f92e-4038-ad13-30967f47eb7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "\n",
    "import yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74987944-abfb-44f8-9331-ffbb2f7698bb",
   "metadata": {},
   "source": [
    "## Config"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97c25070-775a-4fb1-9694-4579250686a6",
   "metadata": {},
   "source": [
    "### Template\n",
    "I'm creating a template so we can iterate through each of our experiments.\n",
    "\n",
    "Here you can see a few design decisions:\n",
    "- We leave `project_name` and `text_column` empty to overwrite later per experiment\n",
    "- We log to TensorBoard. You can use wandb instead, but you would need to install it in the AutoTrain env that runs on Spaces, which gets complex\n",
    "- I choose an `l4x1` from [these options](https://github.com/huggingface/autotrain-advanced/blob/2d787b2033414d06f1e9be2ea0caacad3097f5e8/src/autotrain/backends/base.py#L21)\n",
    "    - This is a [well priced](https://huggingface.co/pricing#spaces) way of training a 7B model\n",
    "    - It's very efficient as well, with 24GB of VRAM\n",
    "- It's becoming less common to use a `valid_split`, so we leave it as `None`\n",
    "- I run 2 epochs as the loss still decreases steadily, but some say that for LoRAs you should run just 1\n",
    "- It's a good idea to use `all-linear` for `target_modules` when using LoRA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc2a8514-51c1-404b-8cfa-6637cc810668",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Base config\n",
    "config_template = {\n",
    "    \"task\": \"llm-sft\",\n",
    "    \"base_model\": \"mistralai/Mistral-7B-Instruct-v0.3\",\n",
    "    \"project_name\": \"\",\n",
    "    \"log\": \"tensorboard\",\n",
    "    \"backend\": \"spaces-l4x1\",\n",
    "    \"data\": {\n",
    "        \"path\": \"derek-thomas/labeled-multiple-choice-explained-mistral-tokenized\",\n",
    "        \"train_split\": \"train\",\n",
    "        \"valid_split\": None,\n",
    "        \"chat_template\": \"none\",\n",
    "        \"column_mapping\": {\n",
    "            \"text_column\": \"\"\n",
    "            },\n",
    "        },\n",
    "    \"params\": {\n",
    "        \"block_size\": 1024,\n",
    "        \"model_max_length\": 1024,\n",
    "        \"epochs\": 2,\n",
    "        \"batch_size\": 1,\n",
    "        \"lr\": 3e-5,\n",
    "        \"peft\": True,\n",
    "        \"quantization\": \"int4\",\n",
    "        \"target_modules\": \"all-linear\",\n",
    "        \"padding\": \"left\",\n",
    "        \"optimizer\": \"adamw_torch\",\n",
    "        \"scheduler\": \"linear\",\n",
    "        \"gradient_accumulation\": 8,\n",
    "        \"mixed_precision\": \"bf16\",\n",
    "        },\n",
    "    \"hub\": {\n",
    "        \"username\": \"derek-thomas\",\n",
    "        \"token\": os.getenv('HF_TOKEN'),\n",
    "        \"push_to_hub\": True,\n",
    "        },\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22eb3d3a-0ab0-4f79-98c2-513a34ce1b6d",
   "metadata": {},
   "source": [
    "### Experiments\n",
    "Here we choose the `project_name` and `text_column` for each experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "957eb2b7-feec-422f-ba46-b293d9a77c1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "project_suffixes = [\"RFA-gpt3-5\", \"RFA-mistral\", \"FAR-gpt3-5\", \"FAR-mistral\", \"FA\"]\n",
    "text_columns = [\"conversation_RFA_gpt3_5\", \"conversation_RFA_mistral\", \"conversation_FAR_gpt3_5\",\n",
    "                \"conversation_FAR_mistral\", \"conversation_FA\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5913085-83c9-4133-a90d-318fd13cc14e",
   "metadata": {},
   "source": [
    "Directory to store generated configs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b86702bf-f494-4951-863e-be5b8462fbd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "output_dir = \"./autotrain_configs\"\n",
    "os.makedirs(output_dir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3053d1e1-ca40-460c-8999-0787a1751d00",
   "metadata": {},
   "source": [
    "## AutoTrain for each Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "025ccd2f-de54-4ac2-9f36-f606876dcd3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import copy\n",
    "\n",
    "# Generate configs and run commands\n",
    "for project_suffix, text_column in zip(project_suffixes, text_columns):\n",
    "    # Deep-copy the template: dict.copy() is shallow, so editing the nested\n",
    "    # \"data\" dict would otherwise mutate config_template itself\n",
    "    config = copy.deepcopy(config_template)\n",
    "    config[\"project_name\"] = f\"mistral-v03-poe-{project_suffix}\"\n",
    "    config[\"data\"][\"column_mapping\"][\"text_column\"] = text_column\n",
    "\n",
    "    # Save the config to a YAML file\n",
    "    config_path = os.path.join(output_dir, f\"{text_column}.yml\")\n",
    "    with open(config_path, \"w\") as f:\n",
    "        yaml.dump(config, f)\n",
    "\n",
    "    # Run the command; check=True raises CalledProcessError if autotrain exits with an error\n",
    "    print(f\"Running autotrain with config: {config_path}\")\n",
    "    subprocess.run([\"autotrain\", \"--config\", config_path], check=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}