
Continuing Pre-Training on Raw Text

This blog post builds upon a community notebook from Unsloth titled Mistral 7B Text Completion - Raw Text Training Full Example.

I went through the original in one of my “Weekly Walkthrough” sessions, learned more about Continued Pre-Training (CPT) in the process, and decided to create a post from it with more code comments and to share the insights we gathered.

The code in the notebook remains largely unchanged, just with more comments and explanation.

by Chris McCormick

Open In Colab


Introduction

Objective & Dataset

The goal of the pre-training in this Notebook is to have the LLM write in the style of the TinyStories dataset, created by Ronen Eldan at Microsoft Research.

From the abstract of the paper here, this dataset was actually designed to train tiny language models (e.g. under 10M parameters… For comparison, even BERT is 110M parameters!).

It consists of 2.5M very short stories that were generated by GPT and use the vocabulary of a 4-year-old.

Here’s an example:

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, “Mom, I found this needle. Can you share it with me and sew my shirt?” Her mom smiled and said, “Yes, Lily, we can share the needle and fix your shirt.”

Together, they shared the needle and sewed the button on Lily’s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.

It’s on the HuggingFace datasets repository here.

Continued Pretraining (CPT)

Pre-training is done with a “self-supervised” objective: given some text, predict what token comes next (“next-token prediction”). It’s “self-supervised” because all we need is raw text, no other labeling required!

Companies like Meta and Mistral perform this step on datasets of trillions of tokens to create and share the base models we use like Llama 3 8b and Mistral 7b.

Side Note: Pre-training a base model is an incredibly resource-intensive process. The 8-billion parameter version of Llama 3 was trained for 1.3M GPU hours (from here)! It’s only about 52 hours of training, though, if you divide that by the size of their compute cluster… 25,000 H100 GPUs 🤯 (from here).

CPT is where we take a base model and further train it using this same “next token prediction” task, but on new text.

i. Where CPT is Used

To understand why you might want to do this, let’s look at an example from the legal domain.

I poked around online for some ugly looking legal text, and ended up on the Affordable Care Act (here, plain text here).

Check out this excerpt:

SEC. 2713. <<NOTE: 42 USC 300gg-13.>> COVERAGE OF PREVENTIVE HEALTH
SERVICES.

    (a) In General.--A group health plan and a health insurance issuer
offering group or individual health insurance coverage shall, at a
minimum provide coverage for and shall not impose any cost sharing
requirements for--
            (1) evidence-based items or services that have in effect a
        rating of `A' or `B' in the current recommendations of the
        United States Preventive Services Task Force;
            (2) immunizations that have in effect a recommendation
        from the Advisory Committee on Immunization Practices of the
        Centers for Disease Control and Prevention with respect to the
        individual involved; and
        ...

This excerpt demonstrates two kinds of “knowledge” that a base model might be lacking:

Domain Knowledge

In order to understand terminology like:

  1. “evidence-based items”,
  2. “rating of ‘A’ or ‘B’”, and
  3. “the United States Preventive Services Task Force”,

the model will need to be familiar with US healthcare policy and relevant government organizations.

So domain knowledge includes things like learning about new “entities” (people, organizations, projects, …), and new “jargon”.

Side Note: Since medical and legal text do exist all over the internet, and huge models like GPT-4 seem quite knowledgeable, I wonder if a more interesting use case would be getting the model to learn about the projects, acronyms, and terminology that are only used internally within a company?

Out-of-Distribution (OOD) Text

The text clearly follows some strict formatting conventions, like the section header SEC. 2713 and the legal citation <<NOTE: 42 USC 300gg-13.>>

It also includes some specific phrasing conventions, e.g., “shall, at a minimum provide coverage for”. The model understands all of those words, but it’s written in a unique style that you may want to teach the model to be better at.

Base models are trained to predict the next token by outputting a probability distribution over their entire vocabulary. You can combine the probabilities for each token in a paragraph to estimate how likely the model thinks the text is overall. When the format or style of the text is something the model hasn’t seen much of during training, it might assign low probabilities to the tokens. This makes the text “out of distribution” because it doesn’t match the patterns the model saw during training.
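
To make this concrete, here’s a rough sketch of scoring a snippet by the model’s average next-token loss (i.e., its perplexity). I’m using gpt2 as a small stand-in model just so the sketch is cheap to run; the idea is the same for Mistral 7b.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model so this is cheap to run--swap in any causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "SEC. 2713. <<NOTE: 42 USC 300gg-13.>> COVERAGE OF PREVENTIVE HEALTH SERVICES."

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average
    # next-token cross-entropy over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Exponentiating the average loss gives perplexity--higher means the text
# looks more "out of distribution" to the model.
print("Perplexity: {:.1f}".format(torch.exp(loss).item()))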

Summary

Overall, I think new domains are about knowledge, and OOD domains are about new formats and writing styles.

ii. Training on Raw Text

One of the best things about CPT is that it’s “self-supervised”, meaning no additional human labeling is required.

All you need is raw text from your domain, and every token in the text becomes a training sample (i.e., for each token, the prior text is the input and the next token is the label).
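
Here’s a tiny sketch of that idea, using a whitespace split as a stand-in for a real tokenizer:

# Every position in the raw text becomes a (context --> next token) training pair.
text = "A group health plan shall provide coverage"

tokens = text.split()  # stand-in for a real tokenizer

for i in range(1, len(tokens)):
    context = tokens[:i]   # everything seen so far is the input
    label   = tokens[i]    # the next token is the label
    print("{:45} --> {:}".format(" ".join(context), label))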

If you have a big repository of legal documents containing 100 million tokens, then you have a training set with 100 million samples.

Side Note: The Affordable Care Act is ~60k lines long and ~400k “words”, so 100M tokens doesn’t seem like too big of a stretch!

iii. CPT vs. Fine-Tuning

While CPT can use raw text to pick up new knowledge, formats, and styles, Fine-Tuning requires labeled data.

CPT is Self-Supervised and Fine-Tuning is Supervised, and lately I’ve been seeing it more explicitly named as “Supervised Fine-Tuning” (SFT).

Side Note: The name “fine-tuning” just implies that you’re doing a much smaller training run than what was used to create the base model, and you could do this with labeled data or raw text, so I think it makes sense to use the explicit “SFT” name when referring to training with labeled data.

Specializing

Essentially, I think SFT is about improving the model’s performance on tasks and domains that it’s already familiar with (either from its original pre-training or from additional CPT that you’ve done).

The mathematics of supervised training guarantee that it will improve the model’s performance on the training set. The question is, how badly did it “overfit” the task and the knowledge?

Overfitting the task means the model has lost performance on other tasks, and overfitting the knowledge means the model has forgotten other things.

We can apply techniques such as LoRA to help minimize this problem, but I think it’s probably safe to assume that the model’s getting worse at something.

From what I’ve gathered, here’s where I think you might use CPT and/or SFT:

| Reason for Training | CPT | SFT |
|---|---|---|
| Learn new style or format | x | |
| Learn new knowledge | x | |
| Specialize on style or format | x | x |
| Specialize on knowledge | x | x |
| Specialize on task | | x |

We’ll see in this example that CPT on a small amount of data is enough to get it to specialize on a particular writing style.

GPT’s Explanation

I asked GPT to explain the differences and it feels like a really solid summary that I’m not sure I could improve on much, so just note that the remainder of this section is written by GPT.


**1. Fine-Tuning**

Fine-tuning involves training the model on a smaller, task-specific dataset, often with supervised labels or targeted examples.

**Advantages:**

  • Task-Specific Adaptation: Fine-tuning is excellent for making the model highly specialized in a task, such as sentiment analysis, summarization, or medical question-answering.
  • Data Efficiency: Fine-tuning can work well even with relatively small datasets compared to pretraining.
  • Precision: It allows the model to focus narrowly on the task or domain of interest.

**Limitations:**

  • Limited Generalization: Fine-tuning typically focuses on a specific task or dataset, which might lead to overfitting. The model may struggle to generalize to broader contexts within the domain.
  • Less Broad Knowledge Acquisition: Fine-tuning does not expose the model to large amounts of diverse data in the new domain. If the domain is vast and heterogeneous, the model’s understanding might remain incomplete.

**2. Continued Pretraining on Raw Text**

This involves training the model further using its original pretraining objective (e.g., next-token prediction) on raw text data from the new domain or OOD domain.

**Advantages:**

  • Broader Knowledge Acquisition: By training on raw text, the model absorbs a wide range of linguistic patterns, facts, and context from the new domain.
  • Improved Generalization: This method helps the model adapt not just to specific tasks but also to general use cases in the new domain or OOD data. It can perform better across various tasks without task-specific labels.
  • Alignment with Pretraining Objective: Continued pretraining aligns with the original self-supervised learning objective, making it efficient for improving foundational knowledge in the new domain.

**Limitations:**

  • Resource Intensive: Continued pretraining often requires more data, computational resources, and time than fine-tuning.
  • Less Task-Specific: It doesn’t directly optimize for a specific task or goal; additional fine-tuning might still be required for high performance on specific tasks.

**Which to Choose?**

  • For Adapting to New Domains:
    • Use continued pretraining on raw text if you need the model to acquire broad, unsupervised domain knowledge.
    • Use fine-tuning if the goal is to achieve high performance on specific tasks within the domain and you already have task-specific datasets.
  • For Adapting to OOD Domains:
    • Continued pretraining is usually better for OOD domains because it allows the model to adjust to the style, structure, and context of the new data.
    • Fine-tuning can still help but might require careful dataset curation to avoid overfitting or missing the broader linguistic shifts.

**Hybrid Approach**

In many cases, a combination of the two methods works best:

  1. Continued pretraining on raw text from the domain or OOD data for foundational adaptation.
  2. Fine-tuning on a task-specific dataset for targeted performance improvements.

This hybrid strategy leverages the strengths of both approaches: broad knowledge acquisition from pretraining and task-specific optimization from fine-tuning.

iv. CPT Considerations

1. Base Models vs. Instruction-Tuned

The “instruct” versions of models (e.g., “Meta-Llama-3-8B-Instruct” vs. “Meta-Llama-3-8B”) have essentially had additional training (instruction tuning) run on them to change their writing style to that of a chatbot, like ChatGPT.

Any kind of CPT we do of our own is going to erase that behavior (unless the raw text we’re using is the same style?), so it makes sense to start from the base model rather than the “instruct” version.

(Insight from here).

2. Learning Rate on Embeddings

The Vocabulary Needs Delicate Handling

An LLM’s vocabulary embeddings store knowledge about the meaning and relationship of words.

Since all of the model’s complex functionality has been learned around this vocabulary, I think it makes some intuitive sense that modifying these embeddings too much could have an out-sized impact on the overall performance.

I think this is why the embedding layer is often “frozen” during fine-tuning, meaning we don’t make any changes to it at all.

For CPT, updating the embeddings makes more sense in order to teach the model new words, or to emphasize the meaning that a word has in our particular context.

One of the ways that we adjust the impact of our training (i.e., how much we change vs. preserve the model) is via the learning rate.

Aside: How Learning Rates Work

When training neural networks, the “learning rate” is how we throttle (speed up or slow down) the impact of each batch of samples on the model.

The learning rate parameter is a tiny fraction, like 1e-4 (which is 1 / 10,000), that we set.

Side Note: Why are learning rates so tiny? It’s because they’re relative to the magnitude of the weight values, which also tend to be tiny fractions.

Learning rates follow a “schedule” which gradually decreases the learning rate to zero over the course of the training run.

The learning rate we specify, such as 1e-4, is actually the peak value, and it just gets smaller from there.
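
For example, here’s a toy sketch of a cosine schedule (ignoring warmup), where the value we specify is the peak and the rate then decays toward zero:

import math

peak_lr = 1e-4    # the value we "set"--really the peak of the schedule
total_steps = 100

for step in range(0, total_steps + 1, 20):
    # Cosine decay from peak_lr down to zero over the course of training.
    lr = peak_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))
    print("step {:3}: lr = {:.2e}".format(step, lr))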

Reducing the Impact to the Vocabulary

Unsloth supports setting a different learning rate for the embedding matrix versus the rest of the model as a way to decrease the impact of our changes to the embeddings relative to the decoder layers.

The notes suggest that we typically want to set it 2-10x smaller for CPT, and in this example it’s set to 1/10th of the learning rate used on the decoder layers.

For further research: I wonder if it reduces the learning rate on the LM head by the same amount, since the vocabulary and LM head are similar / closely related?

3. LoRA

LoRA is a fine-tuning technique which serves two main purposes:

  1. It substantially reduces the impact our training has on the model, which helps prevent overfitting.
  2. It’s the only way to do any fine-tuning if we’re using quantization to compress the model (which is typically a requirement if we’re training on a single GPU).

With LoRA, we add on a small number of additional weights “alongside” the existing ones, and only update those additional weights.

Side Note: This is easily misunderstood as implying that updating a small fraction of the weights means it will only require a fraction of the memory and compute. The reality is that we still have to compute and store all of the model activations, and backpropagate the error through all of the model weights. It does mean that we only have to store a fraction of the optimizer state. However, when it comes to memory, what really matters is our sequence length. Once we get to 1,024 tokens or more, the memory savings from LoRA aren’t very meaningful.

Quantization + LoRA are a requirement in order for this example to fit within the memory of a free T4 in Colab.

I’m not sure LoRA is a good idea, though, if you’re trying to add substantially to the knowledge of the LLM–it seems too limiting.

It seems to be fine for this example, though, where we’re just directing the model to write in the style of a children’s story.

If you are going to / have to use LoRA, you can allow the model to learn more by increasing the number of LoRA weights, which is determined by the “rank” parameter, r.
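
To get a feel for how r scales the number of trainable weights, here’s a rough sketch for a single square projection matrix (the 4,096 dimension is just illustrative):

# For a frozen weight matrix of shape (d_out, d_in), LoRA adds two small
# matrices: A of shape (r, d_in) and B of shape (d_out, r).
d_in, d_out = 4096, 4096  # illustrative size, e.g., one attention projection

for r in [8, 16, 32, 64, 128]:
    lora_params = r * (d_in + d_out)
    print("r = {:3}: {:,} trainable weights for this one matrix".format(r, lora_params))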

▂▂▂▂▂▂▂▂▂▂▂▂

Example Code

Time for the actual example!

Unsloth largely follows the huggingface transformers paradigm, but does add some new parameters and options.

The introductory fine-tuning example in the Unsloth docs, here, seems like a solid reference if you’re curious about anything I don’t cover.

S1. Installation

Install Unsloth

%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

GPU Memory

gpu_mem_used

This function uses the “NVIDIA System Management Interface” nvidia-smi command line tool to retrieve the current memory usage.

There’s a function in PyTorch, torch.cuda.memory_allocated(), but it seems to severely under-report. 🤷‍♂️

import os
import torch

def gpu_mem_used():
    """
    Returns the current GPU memory usage as a string, e.g., "5.02 GB"
    """

    # This approach doesn't work, because PyTorch only tracks its own memory
    # usage, not the total memory consumption of the GPU.
    #gpu_bytes_used = torch.cuda.memory_allocated()

    # Run the nvidia-smi command line tool to get memory used in megabytes.
    buf = os.popen('nvidia-smi --query-gpu=memory.used, --format=csv,noheader,nounits')

    # It returns an unformatted integer number of "MiB" (2^20 bytes).
    gpu_mb_used = float(buf.read())

    # Divide that by 1024 to get GB.
    mem_used = gpu_mb_used / float(1024)

    return ("{0:.2f} GB".format(mem_used))

print("GPU memory used: {:}".format(gpu_mem_used()))
GPU memory used: 0.00 GB

S2. Download Model

The FastLanguageModel class, which we’ll see below, is one of the key places that we’re picking up the unsloth-specific stuff.

Otherwise, we’ll see that it largely matches the HuggingFace transformers interface.

From the output of the import below, it seems like the library actually “patches” transformers–I think that means replacing some of the existing code in the huggingface library?

from unsloth import FastLanguageModel
import torch
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!

4-bit Quantization

Quantization is a technique where we compress the model before loading it onto the GPU in order to save space.

The underlying model is still 16-bit–with quantization, we have to decompress the matrices back into 16-bit whenever we want to use them.

It also means that the model weights can’t be updated (without breaking the compression scheme), so we must use LoRA in order to fine-tune the model.

load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
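
As an aside, here’s a toy sketch of the compress/decompress round trip–just simple absmax rounding to 4-bit levels, not the actual NF4 scheme that bitsandbytes uses:

import torch

w = torch.randn(8, dtype=torch.float16)      # pretend these are model weights

scale = w.abs().max() / 7                    # map the range onto 15 signed 4-bit levels
w_q = torch.round(w / scale).to(torch.int8)  # the compressed copy that gets stored
w_dq = w_q.to(torch.float16) * scale         # decompressed back to 16-bit for the matmul

print("original :", w)
print("recovered:", w_dq)  # close, but not exact--quantization is lossy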

Pre-Quantized Models

Unsloth releases pre-quantized versions of popular models in order to speed up the download. Quantization is “deterministic”–for a given pre-trained model, the quantized version will always be the same. They’re saving us a step (though it’s not compute-intensive–really we’re just saving on download time).

These models are hosted in Hugging Face’s model repository: https://huggingface.co/unsloth

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

RoPE Scaling

Positional Encoding Vectors

There’s nothing about the self-attention equations that inherently indicates what order the words are in (The order of the rows in the matrix doesn’t matter!).

To indicate the word order, we add these special Positional Encoding (PE) vectors to each of the token embeddings, and the LLM is able to recognize the pattern.

RoPE: Rotary Position Embeddings

There have been different schemes for defining the PE vectors, but the one that’s gained prominence lately is RoPE.

The key detail is that the RoPE vectors are all actually the same vector, just rotated different amounts to reflect the different positions in the sequence.

(Note: I haven’t studied RoPE thoroughly, so I may be missing some subtle details).

RoPE Scaling

When the base model was trained by Meta, Mistral, etc., they trained it with a specific context window length–a specific number of RoPE embeddings.

Let’s say the model was trained with 2,048 position embeddings.

It’s been found that we can increase this number to, e.g., 4,096 by simply inserting additional RoPE embeddings in between the existing ones (i.e., at an angle that falls between).

Something I’m not 100% clear on is whether this technique makes these new embeddings immediately usable, or if we have to do at least a little bit of additional training in order for the model to adjust its understanding of the PE embeddings.
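
Here’s a toy sketch of the interpolation idea as I understand it (not Unsloth’s actual implementation): the new, longer range of positions gets squeezed back into the range the model was trained on, so each new position lands “in between” the original ones.

import numpy as np

original_ctx = 2048             # positions the model was trained with
new_ctx = 4096                  # positions we want
scale = original_ctx / new_ctx  # 0.5

positions = np.arange(new_ctx)
interpolated = positions * scale  # position 3 now behaves like position 1.5, etc.

print(interpolated[:8])  # [0.  0.5 1.  1.5 2.  2.5 3.  3.5]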

Unsloth Support

RoPE scaling can be applied to existing pre-trained models, so long as they used RoPE as their positional encoding scheme.

The unsloth comment says that it’s supported “internally”, which I assume means they take your desired context size, compare it to what the model was trained with, and then add the appropriate number of interpolated RoPE embeddings.

Questions

  • What does the code do if you try applying this to an older model that didn’t use RoPE? Does it throw an error?
  • Can you do RoPE scaling and immediately use the model for inference, or does it have to be trained more first?
# From Unsloth: Choose any! We auto support RoPE Scaling internally!
# I'm guessing that if you specify this to be larger than what the model was
# pre-trained with, then unsloth will infer how much scaling is needed to
# accommodate it.
#
# For Mistral 7b, it was trained with a context of 8,192 tokens, so a maximum
# sequence length of 2,048 doesn't require any scaling.
max_seq_length = 2048

Data Type

More recent GPUs implement a 16-bit data type called “BFloat16”, which allocates the available precision in a way that’s better tailored to the needs of deep learning.

The “B” comes from Google Brain, who created it.

The bfloat16 data type can help prevent issues that occur due to “numerical underflow and overflow”, where a calculation results in a number that’s too small for the data type to represent, or too large.

dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
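
To see the overflow issue concretely, here’s a tiny sketch: float16 runs out of range around 65,504, while bfloat16 keeps float32’s exponent range (at the cost of some precision).

import torch

big = torch.tensor(70000.0)

print(big.to(torch.float16))   # tensor(inf, dtype=torch.float16) -- overflow!
print(big.to(torch.bfloat16))  # tensor(70144., dtype=torch.bfloat16) -- representable, just rounded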

Download the Model

It looks like with Unsloth, as a convenience feature, from_pretrained returns both the model and the tokenizer. (The normal hf paradigm is to load these separately–but it’s always the same step, so it makes sense to combine them).

Note that, in order to download Mistral, you’ll need to:

  1. Have a Hugging Face account
  2. Accept Mistral’s user license
  3. Create a huggingface token to link this Notebook to your account (so they can verify that your account has accepted the license).

It looks like if you add your Hugging Face token to your Colab Secrets (the key-shaped icon in the panel on the left), and name it “HF_TOKEN”, the code will find it automatically and handle the authorization step.

model, tokenizer = FastLanguageModel.from_pretrained(
    # Mistral, version 3, 7b parameters
    model_name = "unsloth/mistral-7b-v0.3", # "unsloth/mistral-7b" for 16bit loading

    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,

    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
==((====))==  Unsloth 2025.1.5: Fast Mistral patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Let’s check the GPU memory usage after loading the (quantized!) model.

gpu_mem_model = gpu_mem_used()

print("GPU memory used: {:}".format(gpu_mem_used()))
GPU memory used: 4.78 GB

S3. Data Prep

From the original Notebook:

We now use the Tiny Stories dataset from https://huggingface.co/datasets/roneneldan/TinyStories. We only sample the first 2500 rows to speed training up. We must add EOS_TOKEN or tokenizer.eos_token or else the model’s generation will go on forever.

If you want to use the ChatML template for ShareGPT datasets, try our conversational notebook.

from datasets import load_dataset

# Take 2500 samples from the TinyStories dataset
dataset = load_dataset("roneneldan/TinyStories", split = "train[:2500]")

EOS_TOKEN = tokenizer.eos_token

Format the dataset as below–apply the formatting function to all of the examples (i.e., add the end-of-sequence token to each one).

def formatting_prompts_func(examples):
    # Wrap each sample as a Dictionary with one key--"text"
    # Also add the EOS_TOKEN to the end of each sample.
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }

# Apply the formatting to all of the samples in the dataset.
dataset = dataset.map(formatting_prompts_func, batched = True,)

Print out 5 stories from Tiny Stories

import textwrap

wrapper = textwrap.TextWrapper(width=100)

# For each of the first 5 examples...
for row in dataset[:5]["text"]:

    # Print the example, and wrap lines at 100 characters.
    print("\n=========================")
    print(wrapper.fill(row))

=========================
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with
it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on
her shirt.  Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and
sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."
Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them
because they were sharing and helping each other. After they finished, Lily thanked her mom for
sharing the needle and fixing her shirt. They both felt happy because they had shared and worked
together.</s>

=========================
Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep
was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.  One day,
Beep was driving in the park when he saw a big tree. The tree had many leaves that were falling.
Beep liked how the leaves fall and wanted to play with them. Beep drove under the tree and watched
the leaves fall on him. He laughed and beeped his horn.  Beep played with the falling leaves all
day. When it was time to go home, Beep knew he needed more fuel. He went to the fuel place and got
more healthy fuel. Now, Beep was ready to go fast and play again the next day. And Beep lived
happily ever after.</s>

=========================
One day, a little fish named Fin was swimming near the shore. He saw a big crab and wanted to be
friends. "Hi, I am Fin. Do you want to play?" asked the little fish. The crab looked at Fin and
said, "No, I don't want to play. I am cold and I don't feel fine."  Fin felt sad but wanted to help
the crab feel better. He swam away and thought of a plan. He remembered that the sun could make
things warm. So, Fin swam to the top of the water and called to the sun, "Please, sun, help my new
friend feel fine and not freeze!"  The sun heard Fin's call and shone its warm light on the shore.
The crab started to feel better and not so cold. He saw Fin and said, "Thank you, little fish, for
making me feel fine. I don't feel like I will freeze now. Let's play together!" And so, Fin and the
crab played and became good friends.</s>

=========================
Once upon a time, in a land full of trees, there was a little cherry tree. The cherry tree was very
sad because it did not have any friends. All the other trees were big and strong, but the cherry
tree was small and weak. The cherry tree was envious of the big trees.  One day, the cherry tree
felt a tickle in its branches. It was a little spring wind. The wind told the cherry tree not to be
sad. The wind said, "You are special because you have sweet cherries that everyone loves." The
cherry tree started to feel a little better.  As time went on, the cherry tree grew more and more
cherries. All the animals in the land came to eat the cherries and play under the cherry tree. The
cherry tree was happy because it had many friends now. The cherry tree learned that being different
can be a good thing. And they all lived happily ever after.</s>

=========================
Once upon a time, there was a little girl named Lily. Lily liked to pretend she was a popular
princess. She lived in a big castle with her best friends, a cat and a dog.  One day, while playing
in the castle, Lily found a big cobweb. The cobweb was in the way of her fun game. She wanted to get
rid of it, but she was scared of the spider that lived there.  Lily asked her friends, the cat and
the dog, to help her. They all worked together to clean the cobweb. The spider was sad, but it found
a new home outside. Lily, the cat, and the dog were happy they could play without the cobweb in the
way. And they all lived happily ever after.</s>

S4. Inference Prior to Training

Let’s see what the model generates before we do any CPT.

We’ll prompt it with “Once upon a time, in a galaxy, far far away,”

I copied the existing generation code from later in the Notebook, and asked GPT to add comments.

Record GPU Memory

Before we do any inferencing, let’s report how much memory the model is consuming.

(Below is Unsloth’s code, which uses the torch.cuda functions for analyzing memory use.)

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

# Display the currently connected GPU and its total memory.
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")

# Display how much memory we have prior to any training--this is memory consumed
# by our model weights.
print(f"{start_gpu_memory} GB of memory reserved.")
GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
4.363 GB of memory reserved.

Input Text

Invoking the tokenizer will split the text into tokens and replace them with their token IDs.

# Tokenize and encode the text and move it to the GPU.
inputs = tokenizer(
    "Once upon a time, in a galaxy, far far away,",
    return_tensors = "pt"  # Returning the tokenized inputs as PyTorch tensors.
)

# Move the inputs to the GPU--don't forget this step!
inputs.to("cuda")

print(f"The inputs are type:\n  {type(inputs)}")
print(f"\nThe input tokens are size:\n  {inputs['input_ids'].shape}")
The inputs are type:
  <class 'transformers.tokenization_utils_base.BatchEncoding'>

The input tokens are size:
  torch.Size([1, 14])

Streaming Text

A nice feature to have when generating text is to be able to see the words printed out in real time as the model generates them (especially since this can be a little slow).

To do this, here’s my rough understanding:

We need the Colab Notebook user interface to not be blocked by the generation code. So we run the generation in a separate thread, and use the enumerate paradigm to print out each token as it’s yielded by the generation thread.

I’m not sure of the exact interaction between the components, but it involves creating a TextIteratorStreamer around the tokenizer (presumably to decode the output of the model).

# Importing the TextIteratorStreamer from the Hugging Face Transformers library.
from transformers import TextIteratorStreamer

# Initializing the TextIteratorStreamer with the tokenizer.
# This is used to stream generated text from the model in real-time.
text_streamer = TextIteratorStreamer(tokenizer)

Generation Parameters

We define the keyword arguments (kwargs) for the text generation process.

# Creating a dictionary to hold the arguments for text generation.
generation_kwargs = dict(
    inputs,  # The tokenized inputs to the model.
    streamer=text_streamer,  # The text streamer to process generated text in real-time.
    max_new_tokens=256,  # The maximum number of tokens to generate.
    use_cache=True,  # Enables caching to improve efficiency during generation.
)

Inference vs. Training Mode

# Put the model into inference mode--a required step for generating text.
FastLanguageModel.for_inference(model)

# IMPORTANT: We'll need to put it back into training mode further down.

print("Model now in inference mode.")
Model now in inference mode.

Launching Text Generation in a Separate Thread

The text generation process is run on a separate thread to allow real-time streaming of the output.

# Importing the Thread class for running tasks in parallel.
from threading import Thread

# Creating a new thread to run the model's generate function.
# This allows the main program to process streamed output in real-time while
# the model generates text.
thread = Thread(
    target = model.generate, # Specify the function to be run in the Thread.
    kwargs = generation_kwargs # The dictionary of arguments that will be
                               # passed to `generate`
)

# Starting the thread to begin text generation (i.e., invoke `model.generate`)
thread.start()

Streaming and Printing Generated Text

Printing out the text one word at a time elegantly with wrapping is a little tricky.

Approach #1: Just Print

The simplest approach is to simply print the tokens out as they come. If we set end="" then print won’t add a newline after each piece of output, so we can keep appending to the same line.

# Looping through the streamed text output.
for j, new_text in enumerate(text_streamer):
    print(new_text, end="")

This outputs everything on a single line–not very convenient to read.

<s> Once upon a time, in a galaxy, far far away, there was a young man who was a huge fan of Star Wars. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan that

Approach #2: Wrap by token count

To apply some rough wrapping, we could try adding a new line every, e.g., 20 pieces of text:


# Looping through the streamed text output.
for j, new_text in enumerate(text_streamer):

    # Append the new text to the existing output.
    print(new_text, end="")

    # Add a newline every xx tokens.
    if ((j + 1) % 20 == 0):
        print()

This works fairly well, but one problem is that the first new_text yielded is actually our input text, not a single word, so the first line ends up as more than 20 words.

<s> Once upon a time, in a galaxy, far far away, there was a young man who was a huge fan of Star Wars. He was so much
of a fan that he decided to make a movie of his own. He was so much of a
fan that he decided to make a movie of his own. He was so much of a fan that
he decided to make a movie of his own. He was so much of a fan that he decided
to make a movie of his own. He was so much of a fan that he decided to make
a movie of his own. He was so much of a fan that he decided to make a movie
of his own. He was so much of a fan that he decided to make a movie of his
own. He was so much of a fan that he decided to make a movie of his own.
He was so much of a fan that he decided to make a movie of his own. He was
so much of a fan that he decided to make a movie of his own. He was so much
of a fan that he decided to make a movie of his own. He was so much of a
fan that he decided to make a movie of his own. He was so much of a fan that
he decided to make a movie of his own. He was so much of a fan that

Approach #3: Wrap by character count

The version from the Unsloth notebook gets pretty fancy, wrapping to max 100 characters.

# Importing textwrap for formatting output to a fixed width.
import textwrap

# Setting the maximum width for printed text.
max_print_width = 100

# We'll track the character count of the current line.
line_length = 0

# Looping through the streamed text output.
for j, new_text in enumerate(text_streamer):

    # The first `new_text` is actually just our input text.
    # For this example, it's '<s> Once upon a time, in a galaxy, far far '
    if j == 0:
        # Use `textwrap` to split the input text into multiple lines.
        # It returns a list of strings (one per line)
        lines = textwrap.wrap(
            new_text,
            width = max_print_width,
            drop_whitespace = False # Make sure it doesn't strip the space off
                                    # the end of the last line.
        )

        # Store the length of the final line.
        line_length = len(lines[-1])

        # Combine the list of strings into a single one by adding newlines
        # in between.
        wrapped_text = '\n'.join(lines)

        # Print out the input text. Set end="" so that we can continue printing
        # right after the end of the input.
        print(wrapped_text, end="")

    # Subsequent pieces of new_text:
    #  - Sometimes empty string
    #  - Only single words?
    #  - Have any punctuation attached.
    # For example:
    # '', '','10 ', 'years ', 'old ', 'when ', 'the ', ..., 'came ', '', 'out. '
    else:
        # If adding `new_text` would exceed the maximum width...
        if (line_length + len(new_text)) >= max_print_width:
            print()  # Print a newline to end this line.
            print(new_text, end="")
            line_length = len(new_text)  # Reset the line length.

        else:
            # Print the new text chunk without adding a newline at the end.
            print(new_text, end="")

            # Update the current line length.
            line_length += len(new_text)

    pass  # Explicit pass statement for clarity (optional).

pass  # Explicit pass statement for clarity (optional).

<s> Once upon a time, in a galaxy, far far away, there was a young man who was a huge fan of Star 
Wars. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan 
that he decided to make a movie of his own. He was so much of a fan that he decided to make a 
movie of his own. He was so much of a fan that he decided to make a movie of his own. He was so 
much of a fan that he decided to make a movie of his own. He was so much of a fan that he decided 
to make a movie of his own. He was so much of a fan that he decided to make a movie of his own. He 
was so much of a fan that he decided to make a movie of his own. He was so much of a fan that he 
decided to make a movie of his own. He was so much of a fan that he decided to make a movie of his 
own. He was so much of a fan that he decided to make a movie of his own. He was so much of a fan 
that he decided to make a movie of his own. He was so much of a fan that he decided to make a 
movie of his own. He was so much of a fan that
gpu_mem_forward_pass = gpu_mem_used()

print("GPU Memory used after forward pass:", gpu_mem_used())
GPU Memory used after forward pass: 5.03 GB


Put the model back into training mode…

# IMPORTANT: Make sure to do this before attempting training... This was missing
# in the original example code.
FastLanguageModel.for_training(model)

print("Model now in training mode.")
Model now in training mode.

S5. Add LoRA Weights

The Hugging Face paradigm for applying LoRA, which is followed here as well, is to do it as a separate step by calling get_peft_model.

“peft” stands for “Parameter-Efficient Fine-Tuning”, which is the general name for techniques like LoRA, but LoRA is the dominant approach.

Refer back to the “CPT Considerations” section for some reflections on the use of LoRA for CPT. LoRA is a requirement when using quantization, and can help avoid overfitting (particularly with smaller training datasets?). But if you’re trying to make big changes to the model’s knowledge or writing style, it may be too limiting?

Choosing Targets

We can choose which parts of the model we want to add LoRA weights to, but it’s best to apply it to ~everything.

The typical exceptions are:

  • The normalization layers
  • For fine-tuning, most examples don’t add LoRA to the vocabulary embeddings or to the “Language Modeling (LM) Head” (which is also a vocabulary of embeddings!).

Side Note: Many fine-tuning examples only apply LoRA to two of the attention matrices, because this was what the original authors did, but it turns out that applying it “everywhere” makes a significant improvement with minimal impact on the memory and compute requirements.

For CPT, it makes more sense to allow these input and output vocabularies to be modified by the training. See the “CPT Considerations” section for more.

Rank, r

You can think of r as how many additional neurons we want to add to each component of the model. Adding more means we can make bigger changes to the model’s knowledge and behavior, but also requires more training data to avoid overfitting.

If you want to try playing with the rank, the following approach makes sense to me:

  1. Initial Rank: Start with a small rank, like 8, to avoid over-fitting. Leave alpha at 32 and don’t mess with it.
  2. Tune Learning Rate: Before playing with the rank, tune the batch size and learning rate to find a good combo.
  3. Tune Rank: Play with different values of r, but leave alpha alone–its purpose is to allow you to try different values for r without having to re-tune the learning rate (see the sketch after this list).
  4. Re-Tune Learning Rate: Once you’ve found a good r value, re-tune the learning rate to see if the ideal value has changed.
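
Here’s a quick illustration of why alpha and the learning rate are redundant knobs: the LoRA update gets multiplied by a fixed scale of alpha / r (or alpha / sqrt(r) with rank-stabilized LoRA, which we’ll see below), so changing alpha just rescales every update–the same thing the learning rate does.

import math

alpha = 32

for r in [8, 32, 128]:
    standard_scale = alpha / r           # classic LoRA scaling
    rslora_scale = alpha / math.sqrt(r)  # rank-stabilized LoRA (rsLoRA) scaling
    print("r = {:3}:  alpha/r = {:.3f}   alpha/sqrt(r) = {:.3f}".format(r, standard_scale, rslora_scale))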

Apply LoRA!

# get_peft_model = Add LoRA matrices and freeze the main model weights.
model = FastLanguageModel.get_peft_model(

    model,

    # Larger r values add more trainable parameters to the model, allowing
    # you to have a bigger impact on its behavior.
    # Larger values of r make sense for CPT on large datasets.
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128

    # The value of 'alpha' kinda doesn't matter--just pick a value and stick
    # with it. Tuning alpha and tuning the learning rate are redundant.
    lora_alpha = 32,

    # "Rank stabilized" LoRA changes the scaling behavior (from alpha) such that
    # higher values of r (like 128 or 256) don't have their gradients scaled
    # down too much.
    use_rslora = True,  # We support rank stabilized LoRA

    # Which parts of the model to apply LoRA to (i.e., define new matrices and
    # freeze the originals.)
    # If something is not mentioned in this list, then it's left unfrozen
    # (trainable).
    # See the markdown commentary for more, but the main thing to note here
    # is that most fine-tuning examples don't apply LoRA to the input embeddings
    # or the LM Head, but it makes sense to do so for CPT.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", # Add LoRA to all of the attention matrices
                      "gate_proj", "up_proj", "down_proj", # Add LoRA to all of the FFN matrices

                      "embed_tokens", "lm_head",], # Add for continual pretraining

    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized

    # Gradient checkpointing is a very significant consideration--it tosses
    # intermediate calculations in order to save space, but it means that we
    # have to redo that math later.
    # This can save a lot of memory but also really slow down training, so only
    # use it if you have to.
    #
    # According to this Unsloth comment, it sounds like they've improved on the
    # implementation: [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch
    # sizes!
    #
    # Note: In regular HuggingFace, this is passed to the TrainingArguments.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context

    random_state = 3407,

    loftq_config = None, # And LoftQ
)
Unsloth: Offloading input_embeddings to disk to save VRAM

/usr/local/lib/python3.11/dist-packages/unsloth/models/_utils.py:748: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)

Unsloth: Offloading output_embeddings to disk to save VRAM

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.

Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM

Unsloth Outputs

There are some interesting details in the output of the previous cell…

Offloading Embeddings

It mentions removing the input and output (LM head) embeddings from the GPU:

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM

Clever trick!

Impact on Forward Pass

  • The input embeddings are just a look-up table, so the step of retrieving those isn’t compute heavy.
  • We do need to do a vector-matrix multiply on the output embeddings–that step seems a little more intense, but perhaps it’s still small enough that it’s worth the memory savings?

Impact on Backprop

  • As far as weight updates, I imagine that for a given training sample we are only calculating the gradients for:
    • The output embedding for the target word.
    • The input embeddings for the tokens in our text.

LoRA Summary

This line shows how many parts of the model we’re applying LoRA weights to. Mistral 7b has 32 layers, so the numbers make sense. It doesn’t mention the input or output embeddings, though?

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.

Mixed Precision

I’m not familiar with this concept… I’d understand a little better if it meant a mix of 32-bit and 16-bit, but our model is 16-bit, so…?

Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM

LoRA Memory Use

Adding the LoRA parameters typically takes a small amount of additional memory, but a rank of 128 is actually pretty large, and the weights are adding another ~1.2 GB.

gpu_mem_lora = gpu_mem_used()

print("Total GPU memory used after adding LoRA weights: {:}".format(gpu_mem_used()))
GPU memory used: 6.17 GB

S6. Run Continued Pretraining

6.1. Create Trainer

The UnslothTrainer and UnslothTrainingArguments classes follow the paradigm set by the HuggingFace “TRL SFT”.

  • TRL - Transformers Reinforcement Learning - While the title emphasizes RL, it also seems to be the preferred library for fine-tuning text-generation models.
    • SFT - Supervised Fine-Tuning - Specifically, their SFT classes help with this.
      • The SFT docs are here, and they even include a section on Unsloth.

Documentation

I think the Unsloth classes here must largely overlap the SFT ones, so the HuggingFace documentation serves as the main documentation source?

Also, I mentioned this in the model load section as well, but the unsloth fine-tuning example here also seems like a good reference.

Training Time

This training code takes about 10 minutes to run on an A100.

Training Parameters

  • Note that the Training Arguments class is nestled into the parameter list.
  • I’ve added on a little bit of commentary.
  • Most of these appear to be standard arguments, but I wonder which ones are unsloth specific? Maybe embedding_learning_rate?
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    # Model
    model = model,

    # Dataset--looks like tokenization happens on the fly.
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",

    # Truncate training samples to 2,048 tokens.
    max_seq_length = max_seq_length,

    # How many threads to use for tokenization?
    dataset_num_proc = 8,

    # This parallels the TrainingArguments class in HuggingFace TRL.
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2, # GPU Batch Size
        gradient_accumulation_steps = 8, # How many GPU batches to perform before
                                         # stepping the optimizer.
                                         # actual_batch_size = 16

        # We'll train for one epoch over our dataset.
        num_train_epochs = 1,

        # Set the learning rate(s).
        learning_rate = 5e-5,    # This looks like a pretty small lr?
        embedding_learning_rate = 5e-6, # They've set this to 10x smaller.
        lr_scheduler_type = "cosine",
        warmup_ratio = 0.1, # Have the scheduler do warmup steps before starting
                            # its normal schedule.

        # Data type.
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),

        # The 8-bit version of Adam quantizes the optimizer state to save
        # memory.
        optim = "adamw_8bit",

        weight_decay = 0.00,

        # We'll see the current training loss after every batch.
        logging_steps = 1,
        report_to = "none", # Use this for WandB etc

        seed = 3407,
        output_dir = "outputs",
    ),
)

6.2. Run Training

Run the training!

Steps

Each “step” refers to training on one batch of samples (in this case, 16 samples).

  • Unsloth prints some details at the top of the output which convey how much training we’re going to do. (i.e., total samples, batch size, number of batches).

Training Loss

The Training Loss is displayed as a way to ensure that the model is learning successfully. The Loss can be erratic, but it should be trending downward. If not, there’s something wrong with the setup.

trainer_stats = trainer.train()
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 156
 "-____-"     Number of trainable parameters = 603,979,776

Report training time–it’s captured in the trainer_stats object.

#print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
8.94 minutes used for training.

The below code is from the original Notebook.

Note how significant the additional memory use is for the training step, compared to just storing the model.

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak reserved memory = {used_memory} GB.")

used_percentage = round(used_memory         /max_memory*100, 3)
print(f"Peak reserved memory % of max memory = {used_percentage} %.")

print("\n----\n")

used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")

lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Peak reserved memory = 9.332 GB.
Peak reserved memory for training = 4.969 GB.
Peak reserved memory % of max memory = 23.587 %.
Peak reserved memory for training % of max memory = 12.559 %.

I’ve found that torch.cuda.max_memory_reserved() always under-reports; I think it can only report the memory used by PyTorch itself.

The NVIDIA SMI tool gives the true total.

gpu_mem_train = gpu_mem_used()

print("Total GPU memory used after training: {:}".format(gpu_mem_used()))
GPU memory used: 9.85 GB

S7. Inference After Training

I’ve repeated the code from S4 below, but cut down on the commentary, so see that section for more detail.

Inference Mode

# Put the model into inference mode--a required step for generating text.
FastLanguageModel.for_inference(model)

print("Model now in inference mode.")
Model now in inference mode.

Input Text

Specify our prompt and set everything up for generation.

# Tokenize and encode the text and move it to the GPU.
inputs = tokenizer(
    "Once upon a time, in a galaxy, far far away,",
    return_tensors = "pt"  # Returning the tokenized inputs as PyTorch tensors.
)

# Move the inputs to the GPU--don't forget this step!
inputs.to("cuda")

print(f"The inputs are type:\n  {type(inputs)}")
print(f"\nThe input tokens are size:\n  {inputs['input_ids'].shape}")

# Initializing the TextIteratorStreamer with the tokenizer.
# This is used to stream generated text from the model in real-time.
text_streamer = TextIteratorStreamer(tokenizer)

# Creating a dictionary to hold the arguments for text generation.
generation_kwargs = dict(
    inputs,  # The tokenized inputs to the model.
    streamer=text_streamer,  # The text streamer to process generated text in real-time.
    max_new_tokens=512,  # The maximum number of tokens to generate.
    use_cache=True,  # Enables caching to improve efficiency during generation.
)

# Creating a new thread to run the model's generate function.
# This allows the main program to process streamed output in real-time while
# the model generates text.
thread = Thread(
    target = model.generate, # Specify the function to be run in the Thread.
    kwargs = generation_kwargs # The dictionary of arguments that will be
                               # passed to `generate`
)
The inputs are type:
  <class 'transformers.tokenization_utils_base.BatchEncoding'>

The input tokens are size:
  torch.Size([1, 13])

Generate Output

Kick off the generation thread and then print out the generated text as it comes.

# Setting the maximum width for printed text.
max_print_width = 80

# We'll track the character count of the current line.
line_length = 0

# Starting the thread to begin text generation (i.e., invoke `model.generate`)
thread.start()

# Looping through the streamed text output.
for j, new_text in enumerate(text_streamer):

    # The first `new_text` is actually just our input text.
    # For this example, it's '<s> Once upon a time, in a galaxy, far far '
    if j == 0:
        # Use `textwrap` to split the input text into multiple lines.
        # It returns a list of strings (one per line)
        lines = textwrap.wrap(
            new_text,
            width = max_print_width,
            drop_whitespace = False # Make sure it doesn't strip the space off
                                    # the end of the last line.
        )

        # Store the length of the final line.
        line_length = len(lines[-1])

        # Combine the list of strings into a single one by adding newlines
        # in between.
        wrapped_text = '\n'.join(lines)

        # Print out the input text. Set end="" so that we can continue printing
        # right after the end of the input.
        print(wrapped_text, end="")

    # Subsequent pieces of `new_text`:
    #   - May occasionally be an empty string.
    #   - Appear to be single words.
    #   - Include any attached punctuation and trailing whitespace.
    # For example:
    # '', '', '10 ', 'years ', 'old ', 'when ', 'the ', ..., 'came ', '', 'out. '
    else:
        # If adding `new_text` would exceed the maximum width...
        if (line_length + len(new_text)) >= max_print_width:
            print()  # Print a newline to end this line.
            print(new_text, end="")
            line_length = len(new_text)  # Reset the line length.

        else:
            # Print the new text chunk without adding a newline at the end.
            print(new_text, end="")

            # The model may print out a newline itself, in which case we need
            # to reset the length tracking.
            if ('\n' in new_text):
                line_length = 0

            else:
                # Update the current line length.
                line_length += len(new_text)

    pass  # Explicit pass statement for clarity (optional).

pass  # Explicit pass statement for clarity (optional).

<s> Once upon a time, in a galaxy, far far away, there was a little girl named 
Lily. She loved to play with her toys and explore the universe. One day, she 
found a big, shiny rock. She picked it up and it felt heavy in her hands.

Lily's mom saw her and said, "Lily, that rock is too heavy for you to carry. 
You should put it down and play with something else." But Lily didn't want to 
put it down. She held it tight and said, "No, I want to keep it. It's mine!"

Her mom smiled and said, "Okay, Lily. But be careful with it. It's very heavy 
and you don't want to hurt yourself." Lily nodded and went back to playing 
with her toys. She was happy to have found something so special and heavy.</s>

A couple more example outputs:

<s> Once upon a time, in a galaxy, far faraway, there was a little girl named Lily. She loved to
play with her toys and explore the universe. One day, she found a big, shiny rock. She picked it up and
it felt heavy in her hands.

Lily's mom saw her with the rock and said, "Lily, that rock is too heavy
for you to carry. You should put it back where you found it." Lily didn't want to put it back, so she
held onto it tightly.

Later that day, Lily's dad came home from work and saw the rock. He said, "Lily,
that rock is too heavy for you to carry. You should put it back where you found it." Lily still didn't
want to put it back, so she held onto it even tighter.

Lily's mom and dad were worried that she would
hurt herself with the heavy rock, so they decided to take it away from her. They put it back where they
found it and told Lily that it was too heavy for her to carry. Lily was sad, but she understood that it
was for her own safety.</s>
<s> Once upon a time, in a galaxy, far far away, there was a little girl named
Lily. She loved to play with her toys and explore the universe. One day, she
found a big, shiny rock. She picked it up and it was very heavy.

Lily's mom said, "Lily, that rock is too heavy for you to carry. You need to
put it down." But Lily didn't want to put it down. She wanted to keep it with
her.

Lily's dad said, "Lily, that rock is too heavy for you to carry. You need to
put it down and play with your toys." But Lily didn't want to put it down. She
wanted to keep it with her.

Lily's brother said, "Lily, that rock is too heavy for you to carry. You need
to put it down and play with your toys." But Lily didn't want to put it down.
She wanted to keep it with her.

Lily's friends said, "Lily, that rock is too heavy for you to carry. You need
to put it down and play with your toys." But Lily didn't want to put it down.
She wanted to keep it with her.

Finally, Lily's teacher said, "Lily, that rock is too heavy for you to carry.
You need to put it down and play with your toys." And Lily listened. She put
the rock down and played with her toys.</s>

One of the dataset examples for comparison:

=========================
One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with
it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on
her shirt.  Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and
sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."
Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them
because they were sharing and helping each other. After they finished, Lily thanked her mom for
sharing the needle and fixing her shirt. They both felt happy because they had shared and worked
together.</s>

Overall

Our relatively small training run seems to have been very successful in adapting the model to write in the style of the dataset!

It seems like a pretty simple objective, though, so it would be interesting to try this on something that feels more challenging.


S8. Unsloth

(Below are the Unsloth promotions from the original Notebook; I wanted to make sure to preserve these.)

Unsloth Discord

If you have any questions about Unsloth, we have a Discord channel! If you find any bugs, want to stay up to date on the latest LLM developments, need help, or want to join projects, feel free to join our Discord!

Additional Unsloth Notebooks

  1. Zephyr DPO 2x faster free Colab
  2. Llama 7b 2x faster free Colab
  3. TinyLlama 4x faster full Alpaca 52K in 1 hour free Colab
  4. CodeLlama 34b 2x faster A100 on Colab
  5. Mistral 7b free Kaggle version
  6. We also did a blog with 🤗 HuggingFace, and we’re in the TRL docs!
  7. ChatML for ShareGPT datasets, conversational notebook
  8. Gemma 6 trillion tokens is 2.5x faster! free Colab

Local Installation

To install Unsloth on your own computer, follow the installation instructions on our Github page here.

Unsloth Features

  • We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
  • We also support Yi, Qwen, Deepseek, and all Llama- and Mistral-derived architectures.
  • We support 16-bit LoRA or 4-bit QLoRA. Both are 2x faster.
  • max_seq_length can be set to anything, since we do automatic RoPE Scaling via kaiokendev’s method.
  • [NEW] We make Llama-3 (trained on 15 trillion tokens) 2x faster! See our Llama-3 notebook.

Support our work if you can! Thanks!

Star us on Github

▂▂▂▂▂▂▂▂▂▂▂▂

Appendix

Memory Use

Here are the memory statistics captured after each step.

# Record the final total memory use (after running inference again,
# post-training).
final_gpu_memory = gpu_mem_used()

"""
print("Total memory useage over the course of the notebook:")
print("1. Loading the model:", gpu_mem_model)
print("2. After running a forward pass:", gpu_mem_forward_pass)
print("3. After adding LoRA weights:", gpu_mem_lora)
print("4. After training:", gpu_mem_train)
print("5. After running inference again:", final_gpu_memory)
"""

# Define a consistent padding length for descriptions
pad = 40

print("Total memory usage over the course of the notebook:")
print(f"1. Loading the model:".ljust(pad), gpu_mem_model)
print(f"2. After running a forward pass:".ljust(pad), gpu_mem_forward_pass)
print(f"3. After adding LoRA weights:".ljust(pad), gpu_mem_lora)
print(f"4. After training:".ljust(pad), gpu_mem_train)
print(f"5. After running inference again:".ljust(pad), final_gpu_memory)
The recorded values were:

  gpu_mem_model        = "4.78 GB"
  gpu_mem_forward_pass = "4.78 GB"
  gpu_mem_lora         = "6.11 GB"
  gpu_mem_train        = "9.85 GB"

The final total given by NVIDIA SMI aligns with what’s shown in the Colab resources monitor:

Screenshot of Colab GPU resource monitor

Bar plot of the memory use broken down by step:

Breakdown of GPU Memory consumption