Combining Categorical and Numerical Features with Text in BERT

In this tutorial we’ll look at classifying text with BERT when we also have additional numerical or categorical features that we want to use to improve our predictions.

To help motivate our discussion, we’ll be working with a dataset of about 23k clothing reviews. For each review, we have the review text, but also additional information such as:

  • The age of the reviewer (numerical feature)
  • The number of upvotes on the review (numerical feature)
  • The department and category of the clothing item (categorical features)

For each review, we also have a binary label, which is whether or not the reviewer ultimately recommends the item. This is what we are trying to predict.

This dataset was scraped from an (unspecified) e-commerce website by Nick Brooks and made available on Kaggle here.

In Section 2 of this Notebook, I’ve implemented four different “baseline” strategies which score fairly well, but which don’t incorporate all of the features together.

Then, in Section 3, I’ve implemented a simple strategy to combine everything and feed it through BERT. Specifically, I make text out of the additional features, and prepend this text to the review.
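As a rough illustration of the features-to-text idea, the sketch below builds one combined string for a single review. The helper name `features_to_text`, the sentence template, and the choice of which features to include are all assumptions made for this example; the wording actually used in Section 3 may differ.

```python
def features_to_text(row):
    """Turn a few extra features into a sentence and prepend it to the review.

    `row` is a plain dict whose keys mirror the dataset's column names. The
    sentence template below is a made-up example, not the post's exact wording.
    """
    prefix = (
        "This item comes from the {} department and {} division, "
        "and is classified under {}. "
        "I am {} years old. I rate this item {} out of 5 stars. "
    ).format(
        row["Department Name"],
        row["Division Name"],
        row["Class Name"],
        row["Age"],
        row["Rating"],
    )
    return prefix + row["Review Text"]


# Example values copied from one of the rows previewed in Section 1.1.
example_row = {
    "Age": 47,
    "Rating": 5,
    "Division Name": "General",
    "Department Name": "Tops",
    "Class Name": "Blouses",
    "Review Text": "This shirt is very flattering to all due to th...",
}

combined_text = features_to_text(example_row)
print(combined_text)
```

The combined string is then tokenized like any other piece of text, so BERT itself needs no architectural changes.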

There is a GitHub project called the Multimodal-Toolkit which is how I learned about this clothing review dataset. The toolkit implements a number of more complicated techniques, but their benchmarks (as well as our results below) show that this simple features-to-text strategy works best for this dataset.

In our weekly discussion group, I talked through this Notebook and we also met with Ken Gu, the author of the Multimodal-Toolkit! You can watch the recording here.

You can find the Colab Notebook version of this post here.

S1. Clothing Review Dataset

1.1. Download & Parse

Retrieve the .csv file for the dataset.

import gdown

print('Downloading dataset...\n')
# Download the file (fill in the dataset's download URL as the first argument).
gdown.download('',
               'Womens Clothing E-Commerce Reviews.csv',
               quiet=False)
Downloading dataset...

To: /content/Womens Clothing E-Commerce Reviews.csv
8.48MB [00:00, 48.7MB/s]


Parse the dataset csv file into a pandas DataFrame.

import pandas as pd

data_df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', index_col=0)

|   | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


“Recommended IND” is the label we are trying to predict for this dataset. “1” means the reviewer recommended the product and “0” means they did not.

The following are categorical features:

  • Division Name
  • Department Name
  • Class Name
  • Clothing ID

And the following are numerical features:

  • Age
  • Rating
  • Positive Feedback Count

Feature Analysis

There is an excellent Notebook on Kaggle here which does some thorough analysis on each of the features in this dataset.

Note that, in addition to the “Recommended” label, there is also a “Rating” column where the reviewer rates the product from 1 - 5. The analysis in the above Notebook shows that almost all items rated 3 - 5 are recommended, and almost all rated 1 - 2 are not recommended. We’ll see in our second baseline classifier that you can get a very high accuracy with this feature alone. However, it is still possible to do better by incorporating the other features!

1.2. Train-Validation-Test Split

I want to use the same training, validation, and test splits for all of the approaches we try so that it’s a fair comparison.

However, different approaches are going to require different transformations on the data, and for simplicity I want to apply those transformations before splitting the dataset.

To solve this, we’re going to create lists of indices for each of the three portions. That way, for a given classification approach, we can load the whole dataset, apply our transformations, and then split it according to these pre-determined indices.

import random
import numpy as np

# First, calculate the split sizes. 80% training, 10% validation, 10% test.
train_size = int(0.8 * len(data_df))
val_size = int(0.1 * len(data_df))
test_size = len(data_df) - (train_size + val_size)

# Sanity check the sizes.
assert((train_size + val_size + test_size) == len(data_df))

# Create a list of indices for all of the samples in the dataset.
indeces = np.arange(0, len(data_df))

# Shuffle the indices randomly.
np.random.shuffle(indeces)

# Get a list of indices for each of the splits.
train_idx = indeces[0:train_size]
val_idx = indeces[train_size:(train_size + val_size)]
test_idx = indeces[(train_size + val_size):]

# Sanity check
assert(len(train_idx) == train_size)
assert(len(test_idx) == test_size)

# With these lists, we can now select the corresponding dataframe rows using, 
# e.g., train_df = data_df.iloc[train_idx] 

print('  Training size: {:,}'.format(train_size))
print('Validation size: {:,}'.format(val_size))
print('      Test size: {:,}'.format(test_size))
  Training size: 18,788
Validation size: 2,348
      Test size: 2,350

S2. Baseline Strategies

The following are some alternative approaches to classifying this dataset, none of which uses all of the features together. I’ve included these baselines to ensure that our final BERT-based approach outperforms them!

2.1. Always Recommend

This dataset is heavily imbalanced, with a little over 80% of the reviews recommending the product. If we just predict “recommend” for every test sample, how would we do?

from sklearn.metrics import f1_score

# Select the test set samples.
test_df = data_df.iloc[test_idx]

# Create a list of all 1s to use as our predictions.
predictions = [1]*len(test_df)

# Calculate the F1 score.
f1 = f1_score(y_true=test_df["Recommended IND"], y_pred=predictions)

print('If we always recommend the product...')
print('\nF1: %.3f' % f1)
If we always recommend the product...

F1: 0.906
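To see where this number comes from: when every prediction is “recommend”, recall is 1.0 and precision equals the fraction of positive samples p, so F1 = 2p / (1 + p). The sketch below just works through that arithmetic; the function names are made up for this example.

```python
def f1_always_positive(positive_fraction):
    """F1 score when every sample is predicted positive."""
    precision = positive_fraction  # Every prediction is 1, so precision = share of 1s.
    recall = 1.0                   # No positive sample is ever missed.
    return 2 * precision * recall / (precision + recall)


def positive_fraction_from_f1(f1):
    """Invert F1 = 2p / (1 + p) to recover the positive fraction p."""
    return f1 / (2 - f1)


# Fraction of test reviews that must be "recommended" to give F1 = 0.906.
implied_fraction = positive_fraction_from_f1(0.906)
print('Implied positive fraction: %.3f' % implied_fraction)
```

Plugging in the F1 of 0.906 implies that roughly 83% of the test reviews are recommendations, which is why this trivial baseline already looks strong on F1.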

We’ll keep a running table of our results:

| Strategy | F1 Score |
|---|---|
| Always predict “recommended” | 0.906 |

2.2. Threshold on Rating

As I mentioned earlier, the “Rating” is a very strong indicator of whether the reviewer recommended the product or not. The ideal threshold is a rating of 3.

from sklearn.metrics import f1_score

# Predict whether it's recommended based on whether the rating was 3 or higher.
predictions = test_df["Rating"] >= 3

# Calculate the F1 score.
f1 = f1_score(y_true=test_df["Recommended IND"], y_pred=predictions)

print('Recommend if rating >= 3...')
print('\nF1: %.3f' % f1)
Recommend if rating >= 3...

F1: 0.953

That’s very high! We can still do better, though :)

| Strategy | F1 Score |
|---|---|
| Always predict “recommended” | 0.906 |
| Predict “recommended” if rating >= 3 | 0.953 |

2.3. XGBoost

When dealing with mixed data types like this, decision trees are the standard solution, with gradient boosted decision trees (as implemented by the XGBoost library) being the most common choice.

In a decision tree, the classification decision is broken up into many smaller decisions (at each node of the tree). Each of these smaller decisions can operate on a different data type, enabling a mix of data types in the classifier.

However, a decision tree can’t take in raw text. So let’s use a decision tree to do our predictions based only on the non-text features in this dataset.
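To make the “different data type at each node” point concrete, here is a tiny hand-written tree. It is not XGBoost and nothing in it is learned from the data: the thresholds, the chosen departments, and the function name `toy_tree_predict` are invented purely for illustration.

```python
def toy_tree_predict(sample):
    """Classify one review with a hand-built, three-node decision tree.

    `sample` is a dict keyed by column name. Returns 1 (recommended) or 0.
    """
    # Node 1: a test on a numerical feature.
    if sample["Rating"] >= 4:
        return 1

    # Node 2: a test on a categorical feature.
    if sample["Department Name"] in {"Dresses", "Tops"}:
        # Node 3: a test on a different numerical feature.
        return 1 if sample["Positive Feedback Count"] < 5 else 0

    return 0


samples = [
    {"Rating": 5, "Department Name": "Bottoms", "Positive Feedback Count": 0},
    {"Rating": 3, "Department Name": "Dresses", "Positive Feedback Count": 2},
    {"Rating": 3, "Department Name": "Dresses", "Positive Feedback Count": 9},
    {"Rating": 2, "Department Name": "Intimate", "Positive Feedback Count": 0},
]

toy_predictions = [toy_tree_predict(s) for s in samples]
print(toy_predictions)
```

A gradient boosted model learns many such trees automatically and combines their outputs, but each individual split still looks like one of the `if` tests above.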

Install xgboost

Load the tokenizer.

from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
Loading BERT tokenizer...

Load the BERT Classification model.

from transformers import BertForSequenceClassification

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
)

# Tell pytorch to run this model on the GPU.
desc = model.cuda()

3.3. Training Parameters

Let’s define all of our key training parameters in one section.

For the purposes of fine-tuning, the BERT authors recommend choosing from the following values (from Appendix A.3 of the BERT paper):

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4
# Larger batch sizes tend to be better, and we can fit this in memory.
batch_size = 32

# I used a smaller learning rate to combat over-fitting that I was seeing in the
# validation loss. I could probably try even smaller.
learning_rate = 1e-5

# Number of training epochs. 
epochs = 4

Another key parameter is our “maximum sequence length”, which we will truncate or pad all of our samples to. Setting this to a higher value requires more memory and slows down training, so we want to see how short we can get away with.

We’ll run a pass over the dataset to find the longest sequence and use this to inform our choice.

max_len = 0

# For every sentence...
for sent in sen_w_feats:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)
Max sentence length:  204
# Let's use a maximum length of 200.
max_len = 200
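
To show what “truncate or pad to the maximum sequence length” means in practice, here is a plain-Python sketch of the operation. The helper `pad_or_truncate` is hypothetical (the real work is done by the tokenizer in the next section), and the example token ids are arbitrary; the special-token ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids in the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # Special-token ids in bert-base-uncased.


def pad_or_truncate(token_ids, max_len):
    """Wrap `token_ids` in [CLS] ... [SEP], then pad or truncate to `max_len`.

    Returns the final id list plus an attention mask with 1 for real tokens
    and 0 for padding.
    """
    # Leave room for the two special tokens, dropping any overflow.
    body = token_ids[: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]

    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    ids = ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return ids, attention_mask


# A short sequence is padded; a long one loses its tail.
short_ids, short_mask = pad_or_truncate([2023, 2003, 2307], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With a maximum length of 200 and a longest sample of 204 tokens, only the last few tokens of the very longest reviews are cut off.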

3.3. Tokenize & Encode

Now we can do the real tokenization and encoding.

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

print('Encoding all reviews in the dataset...')

# For every sentence...
for sent in sen_w_feats:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,      # Pad & truncate all sentences.
                        truncation = True,
                        padding = 'max_length',
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )

    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])

    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

Encoding all reviews in the dataset...

Now that it’s done, we can divide up the samples into the three splits.

from torch.utils.data import TensorDataset

# Split the samples, and create TensorDatasets for each split. 
train_dataset = TensorDataset(input_ids[train_idx], attention_masks[train_idx], labels[train_idx])
val_dataset = TensorDataset(input_ids[val_idx], attention_masks[val_idx], labels[val_idx])
test_dataset = TensorDataset(input_ids[test_idx], attention_masks[test_idx], labels[test_idx])

3.4. Training

Now that our input data is properly formatted, it’s time to fine tune the BERT model.

3.4.1. Setup

We’ll also create an iterator for our dataset using the torch DataLoader class. The DataLoader is responsible for randomly selecting our training batches for us.

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

Next we need to create the optimizer, passing it the weights from our BERT model.

The epsilon parameter eps = 1e-8 is “a very small number to prevent any division by zero in the implementation” (from here).

from transformers import AdamW

# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
# I believe the 'W' stands for 'Weight Decay fix'
optimizer = AdamW(model.parameters(),
                  lr = learning_rate, 
                  eps = 1e-8 
                )
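
To show where `eps` enters the update, here is a single-parameter sketch of a textbook AdamW step. This is not the library's implementation: the function name `adamw_step`, the default betas, and the way weight decay is applied are assumptions based on the standard formulation of the algorithm.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """Apply one textbook AdamW update to a single scalar parameter."""
    # Running averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction for the zero-initialized averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)

    # Decoupled weight decay, applied directly to the parameter.
    param = param - lr * weight_decay * param

    # `eps` keeps the denominator non-zero when the gradient history is all 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, sqrt(v_hat) is 0, so only `eps` prevents 0 / 0.
p_zero, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)

# With a non-zero gradient, the first step moves the parameter by roughly lr.
p_one, _, _ = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)

print(p_zero, p_one)
```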

The learning rate scheduler will implement learning rate decay for us.

from transformers import get_linear_schedule_with_warmup

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples!)
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in
                                            num_training_steps = total_steps)
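
For intuition, the schedule can be thought of as a multiplier applied to the base learning rate at every step: it ramps up linearly during warmup, then decays linearly to zero. The sketch below is my own restatement of that rule in plain Python, not the library code, and the step count assumes 18,788 training samples with a batch size of 32.

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for a linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: ramp from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
steps_per_epoch = math.ceil(18788 / 32)  # Number of training batches per epoch.
total_steps_sketch = steps_per_epoch * 4  # Four epochs.

for step in (0, total_steps_sketch // 2, total_steps_sketch):
    multiplier = linear_schedule_multiplier(step, 0, total_steps_sketch)
    print('step {:>5,}: lr = {:.2e}'.format(step, base_lr * multiplier))
```

With zero warmup steps, as configured above, the learning rate starts at its full value and simply falls in a straight line to zero by the end of the last epoch.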

Define a helper function for calculating simple accuracy.

import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

Helper function for formatting elapsed times as hh:mm:ss

import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

3.4.2. Training Loop

We’re ready to kick off the training!

Below is our training loop. There’s a lot going on, but fundamentally for each pass in our loop we have a training phase and a validation phase.

Thank you to Stas Bekman for contributing the insights and code for using validation loss to detect over-fitting!


Training:

  • Unpack our data inputs and labels
  • Load data onto the GPU for acceleration
  • Clear out the gradients calculated in the previous pass.
    • In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
  • Forward pass (feed input data through the network)
  • Backward pass (backpropagation)
  • Tell the network to update parameters with optimizer.step()
  • Track variables for monitoring progress


Validation:

  • Unpack our data inputs and labels
  • Load data onto the GPU for acceleration
  • Forward pass (feed input data through the network)
  • Compute loss on our validation data and track variables for monitoring progress

Pytorch hides all of the detailed calculations from us, but we’ve commented the code to point out which of the above steps are happening on each line.

PyTorch also has some beginner tutorials which you may also find helpful.

import random
import numpy as np

# This training code is based on the `` script here:

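# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)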


# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.

    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source:
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source:
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        # In PyTorch, calling `model` will in turn call the model's `forward` 
        # function and pass down the arguments. The `forward` function is 
        # documented here: 
        # The results are returned in a results object, documented here:
        # Specifically, we'll get the loss (because we provided labels) and the
        # "logits"--the model outputs prior to activation.
        result = model(b_input_ids, 
                       token_type_ids=None, 
                       attention_mask=b_input_mask, 
                       labels=b_labels,
                       return_dict=True)

        loss = result.loss
        logits = result.logits

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Unpack this validation batch from our dataloader. 
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            result = model(b_input_ids, 
                           token_type_ids=None, 
                           attention_mask=b_input_mask,
                           labels=b_labels,
                           return_dict=True)

        # Get the loss and "logits" output by the model. The "logits" are the 
        # output values prior to applying an activation function like the 
        # softmax.
        loss = result.loss
        logits = result.logits
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
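
One step in the loop above that is easy to skim past is the gradient clipping. The plain-Python sketch below shows the idea on a flat list of gradient values: if the overall L2 norm exceeds the limit, every value is scaled down by the same factor. The helper name `clip_grad_norm` and the small `eps` term are assumptions for this illustration, not the PyTorch source.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale `grads` so that their combined L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the coefficient is capped at 1.0.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm


# A large gradient (norm 5.0) is shrunk to norm ~1.0, keeping its direction.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)

# A small gradient (norm 0.5) passes through unchanged.
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, clipped, untouched)
```
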
3.4.3. Training Results

Let’s view the summary of the training process.

import pandas as pd

# Display floats with two decimal places.
pd.set_option('display.precision', 2)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap (doesn't seem to work in Colab).
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

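# Display the table.
df_stats

| epoch | Training Loss | Valid. Loss | Valid. Accur. | Training Time | Validation Time |
|---|---|---|---|---|---|
| 1 | 0.18 | 0.13 | 0.95 | 0:05:51 | 0:00:15 |
| 2 | 0.11 | 0.13 | 0.95 | 0:05:51 | 0:00:14 |
| 3 | 0.10 | 0.13 | 0.95 | 0:05:51 | 0:00:15 |
| 4 | 0.08 | 0.15 | 0.95 | 0:05:51 | 0:00:15 |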

We can plot the training loss and validation loss to check for over-fitting.

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.xticks([1, 2, 3, 4])

plt.show()


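There does appear to be some over-fitting here: the training loss keeps falling through epoch 4 (0.10 to 0.08), while the validation loss rises from 0.13 to 0.15. If you really wanted to go for the best accuracy, you could try saving a model checkpoint after each epoch, and see if the third checkpoint does better on the test set.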

Why Validation Loss, not Accuracy?

Validation loss is a more precise measure than validation accuracy, because with accuracy we don’t care about the exact output value, but just which side of a threshold it falls on.

If we are predicting the correct answer, but with less confidence, then validation loss will catch this, while accuracy will not.
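
Here is a small numeric sketch of that difference. The two lists hold the probability each hypothetical model assigns to the correct class for four samples; the values are invented for illustration. Both models get the same three samples right, so their accuracy is identical, but the less confident one has a noticeably higher loss.

```python
import math


def mean_cross_entropy(probs_for_true_class):
    """Average negative log-likelihood of the correct class."""
    return -sum(math.log(p) for p in probs_for_true_class) / len(probs_for_true_class)


def accuracy(probs_for_true_class, threshold=0.5):
    """In binary classification a sample is correct when the probability
    assigned to its true class is above the threshold."""
    correct = sum(1 for p in probs_for_true_class if p > threshold)
    return correct / len(probs_for_true_class)


# Probability given to the true class for the same four samples.
confident_model = [0.95, 0.90, 0.97, 0.40]
hesitant_model = [0.65, 0.60, 0.70, 0.40]

for name, probs in (('confident', confident_model), ('hesitant', hesitant_model)):
    print('{:>9}: accuracy = {:.2f}, loss = {:.3f}'.format(
        name, accuracy(probs), mean_cross_entropy(probs)))
```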

3.5. Test

Now we’re ready to score our trained model against the test set!

The below cell will generate all of the predictions.

# Create a DataLoader to batch our test samples for us. We'll use a sequential
# sampler this time--don't need this to be random!
prediction_sampler = SequentialSampler(test_dataset)
prediction_dataloader = DataLoader(test_dataset, sampler=prediction_sampler, batch_size=batch_size)

print('Predicting labels for {:,} test sentences...'.format(len(test_dataset)))

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions.
      result = model(b_input_ids, 
                     token_type_ids=None, 
                     attention_mask=b_input_mask,
                     return_dict=True)

  logits = result.logits

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

print('    DONE.')
Predicting labels for 2,350 test sentences...

Because the test samples were processed in batches, there’s a little re-arranging required to get the results back down to a simple list.

Also, the predictions are currently pairs of floating point scores (one per class), but we need to turn these into binary labels (0 or 1).

# Combine the results across all batches. 
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)
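
Note that we take the argmax of the raw scores without applying a softmax first. The sketch below, using made-up logit values and plain Python rather than numpy, shows why that is safe: softmax preserves the ordering of the scores, so the winning class is the same either way.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Three hypothetical samples, each with a score for class 0 and class 1.
batch_logits = [[-1.2, 2.3], [0.8, -0.5], [0.1, 0.4]]

labels_from_logits = [argmax(row) for row in batch_logits]
labels_from_probs = [argmax(softmax(row)) for row in batch_logits]

print(labels_from_logits, labels_from_probs)
```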

Now we can score the results!

from sklearn.metrics import f1_score

# Calculate the F1
f1 = f1_score(flat_true_labels, flat_predictions)

print('F1 Score: %.3f' % f1)
F1 Score: 0.968

Here are the final scores:

| Strategy | F1 Score |
|---|---|
| Always predict “recommended” | 0.906 |
| Predict “recommended” if rating >= 3 | 0.953 |
| XGBoost | 0.965 |
| BERT on review text | 0.945 |
| BERT, all features to text | 0.968 |

We managed to outperform the other strategies!

UPDATE: One of our discussion group attendees, Jon, pointed out that (1) MCC is probably a better metric for this dataset due to the class imbalance, and (2) that it’s good practice to verify that our randomly selected test set has the same class balance as the training set (this probably happened naturally, though, given that the test set has ~2k samples).
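
To illustrate the first point, the sketch below computes both F1 and MCC (Matthews correlation coefficient) from a confusion matrix for a toy, imbalanced label set. The 83/17 split and the helper names are assumptions for this example, and returning 0 when the MCC denominator is zero is a common convention rather than part of the definition.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_from_counts(tp, tn, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def mcc_from_counts(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # Convention: undefined MCC is reported as 0.
    return (tp * tn - fp * fn) / denom


# Toy labels: 83 positives and 17 negatives.
toy_true = [1] * 83 + [0] * 17
always_one = [1] * 100

counts = confusion_counts(toy_true, always_one)
baseline_f1 = f1_from_counts(*counts)
baseline_mcc = mcc_from_counts(*counts)
perfect_mcc = mcc_from_counts(*confusion_counts(toy_true, toy_true))

print('Always-positive baseline: F1 = %.3f, MCC = %.3f' % (baseline_f1, baseline_mcc))
```

The always-positive baseline scores above 0.9 on F1 but 0 on MCC, which makes MCC a harsher and arguably more informative yardstick for a dataset this imbalanced.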


Simply converting the extra features to text seems to be a great solution for this dataset. I suspect that this is because the categorical features in this dataset can be easily converted into meaningful text that BERT can leverage. Even the rating, which is technically a numerical feature, is pretty understandable since there are only 5 possible values.

The Multimodal-Toolkit also includes two other datasets where this approach (referred to as “unimodal” in their benchmark tables) does not get the highest score. You could imagine how, if there are continuous-valued (floating point) features in the dataset that are highly meaningful, it could be difficult for BERT to make sense of these as text.

So, if you just have text and categorical features, you may not need to look any further than the simple strategy implemented in this Notebook!