Jon Karrer

NLP Classifier Introduction

U.S Patent classifier challenge

Textbook Notebook

By following these steps, you'll have a robust process for converting raw house features into a structured data object suitable for feeding into machine learning models, ensuring consistency and accuracy in your predictions.

Overview

We want to train a machine learning model to predict the score for a patent entry. This score is based on the similarity between the phrases in the patent.

Data

US Patent Classifier

Model

We will roll our own model for this task.

Main Libraries

Burn Tokenizers Csv

General Steps

  1. Prep data: Convert the dataset into a format that is suitable for training. In this case it is already in CSV form.

  2. Tokenization: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)

  3. Numericalization: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab

  4. Language model data loader creation: Creating a dependent variable that is offset from the independent variable by one token. Shuffle the training data in such a way that the dependent and independent variables maintain their structure as required.

  5. Training:: Train the model on the training data.

Data Prep

First we need to gather and split the data into training and validation sets. Using the Csv library, we can get each row and tokenize them all in one step. The score is our label for each row.

    let path = std::path::Path::new("dataset/train.csv");
    let mut reader = csv::ReaderBuilder::new().from_path(path)?;

    let rows = reader.deserialize();
    let mut classified_data = Vec::new();

    for r in rows {
        let record: PatentRecord = r?;
        let text = format!(
            "TEXT1: {}; TEXT2: {}; ANC1: {};",
            record.context, record.target, record.anchor
        );

        classified_data.push(ClassifiedText {
            text,
            label: record.score,
        })
    }

Now we just need to load this data into memory or a sqlite database. We will use the Burn crates InMemDataset for this.

    let dataset = InMemDataset::new(classified_data);

Burn puts inference as a pillar in their core values, so as we build the training pipeline, we will also build the inference pipeline. The first step here is assigning labels to human friendly strings.

    pub fn class_name(label: f32) -> String {
        match label {
            0.0 => "Unrelated",
            0.25 => "Somewhat Related",
            0.5 => "Different Meaning Synonym",
            0.75 => "Close Synonym",
            1.0 => "Very Close Match",
            _ => panic!("Invalid label"),
        }
        .to_string()
    }

Batching

Batching is the process of creating batches of data that can be fed into the model. This helps the GPU to process the data in a more efficient manner. During the batching, we will tokenize the text into a list of tokens. We also generate padding masks for each batch. Padding masks are used to pad the tokens to the same length.

We will set the exact size of each batch later in the pipeline, but this is how we can define what a single batch consists of.

    let mut tokens = Vec::new();
    let mut labels = Vec::new();

    for item in items {
        tokens.push(self.tokenizer.encode(&item.text)); // Tokenize each string
        labels.push(Tensor::from_data(
            Data::from([(item.label as i64).elem::<B::IntElem>()]),
            &self.device,
        ));
    }

    // Generate padding mask for tokenized text
    let mask = generate_padding_mask(
        self.tokenizer.pad_token(),
        tokens,
        Some(self.max_seq_length),
        &self.device,
    );

    TrainingBatch {
        tokens: mask.tensor,
        labels: Tensor::cat(labels, 0),
        mask_pad: mask.mask,
    }

Model Making

Define our model

We could fine tune a pertained model, like bert, but we will roll our own model for fun.

  1. Define our neural net architecture: Transformers are a popular architecture for NLP.

  2. Create our embedding tokens and embedding position layers: These are used to give context and positional information to the model.

  3. Create our output layer: The linear layer that will classify our data.

pub fn build<B: Backend>(&self, device: &B::Device) -> Model<B> {
    let transformer = self.transformer.init(device);
    let embedding_token =
        EmbeddingConfig::new(self.vocab_size, self.transformer.d_model).init(device);
    let embedding_pos =
        EmbeddingConfig::new(self.max_seq_length, self.transformer.d_model).init(device);
    let output = LinearConfig::new(self.transformer.d_model, self.n_classes).init(device);

    Model {
        transformer,
        embedding_token,
        embedding_pos,
        output,
        n_classes: self.n_classes,
        max_seq_length: self.max_seq_length,
    }
}

Forward pass

Now we need to forward pass the data through our model. This is where the prediction happens and the loss is calculated.

    pub fn forward(&self, item: TrainingBatch<B>) -> ClassificationOutput<B> {
        // Get batch and sequence length and device
        let [batch_size, seq_length] = item.tokens.dims();
        let device = &self.embedding_token.devices()[0];

        // Move tensors to device
        let tokens = item.tokens.to_device(device);
        let labels = item.labels.to_device(device);
        let mask_pad = item.mask_pad.to_device(device);

        // Calculate token and position embeddings, then combine them
        let index_positions = Tensor::arange(0..seq_length as i64, device)
            .reshape([1, seq_length])
            .repeat(0, batch_size);
        let embedding_positions = self.embedding_pos.forward(index_positions);
        let embedding_tokens = self.embedding_token.forward(tokens);
        let embedding = (embedding_positions + embedding_tokens) / 2;

        // Perform transformer encoding, calculate output and loss
        let encoded = self
            .transformer
            .forward(TransformerEncoderInput::new(embedding).mask_pad(mask_pad));
        let output = self.output.forward(encoded);

        let output_classification = output
            .slice([0..batch_size, 0..1])
            .reshape([batch_size, self.n_classes]);

        let loss = CrossEntropyLossConfig::new()
            .init(&output_classification.device())
            .forward(output_classification.clone(), labels.clone());

        ClassificationOutput {
            loss,
            output: output_classification,
            targets: labels,
        }
    }

The forward pass takes our training batch as an input.