How to Speed Up Inference of INT4 ONNX Version of Llama 2 on Google Colab

  • This article covers a problem, and its solution, related to inference of the INT4 ONNX version of Llama 2 on Google Colab.

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. These models can produce realistic text in response to natural language prompts. Llama 2 models are available on Hugging Face in both FP32 and INT4 ONNX formats. The INT4 ONNX models are generated with Intel® Neural Compressor, a tool that quantizes the model weights to reduce the model size and improve inference speed.

However, some users have reported that inference with the INT4 ONNX version of Llama 2 is very slow on Google Colab, a cloud-based platform that provides free access to GPUs and TPUs. In this article, we explain the likely reasons for this issue and provide some solutions to speed up the inference.

Why is the INT4 ONNX version of Llama 2 slow on Google Colab?

There are two main factors that affect the inference performance of the INT4 ONNX version of Llama 2 on Google Colab:

  • The hardware configuration of the Colab instance
  • The software configuration of the Colab environment

Hardware configuration

The hardware configuration of the Colab instance depends on the type of accelerator (CPU, GPU, or TPU) that is selected. By default, Colab assigns a CPU-only instance, which may not be optimal for running large-scale generative AI models. To change the accelerator type, go to Runtime -> Change runtime type and select GPU or TPU from the dropdown menu. Note that the availability and type of GPU or TPU may vary depending on demand and your usage quota. For the CPU-optimized INT4 model discussed here, however, the generation of the CPU matters more than the presence of a GPU or TPU, as explained below.
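
To see which hardware Colab actually assigned to your session, you can inspect the CPU model and, if you requested a GPU, query it directly (nvidia-smi simply prints an error on CPU-only instances):

!lscpu | grep "Model name"
!nvidia-smi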

The INT4 ONNX version of Llama 2 is optimized for Intel® CPUs, which support low-precision arithmetic operations using Vector Neural Network Instructions (VNNI). VNNI can accelerate the matrix multiplication operations that are common in deep learning models. However, not all CPUs support VNNI, and some Colab instances may have older CPUs that do not have this feature. To check if your Colab instance has a VNNI-enabled CPU, run the following command in a code cell:

!cat /proc/cpuinfo | grep -i avx512_vnni

If the output is empty, your CPU does not support VNNI. If the output lists avx512_vnni among the CPU flags, your CPU supports VNNI.
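
The same check can be done from Python, for example if you want to branch your notebook logic on it:

import pathlib

# Read the CPU flags and look for the VNNI feature bit.
cpu_flags = pathlib.Path("/proc/cpuinfo").read_text()
print("VNNI supported:", "avx512_vnni" in cpu_flags)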

Software configuration

The software configuration of the Colab environment includes the versions of ONNX Runtime, Intel® Extension for Transformers, and Intel® Neural Compressor. These are the main software components that are required to run the INT4 ONNX version of Llama 2.

ONNX Runtime is a cross-platform engine that can execute ONNX models with high performance and efficiency. However, not all versions of ONNX Runtime support the custom operators that are used in the INT4 ONNX version of Llama 2. For example, the 13B and 70B models use a custom operator called MatMulWithQuantWeight, which is only supported by a specific branch of ONNX Runtime. To install this branch, run the following command in a code cell:

!pip install --upgrade --pre --extra-index-url https://test.pypi.org/simple/ ort-nightly-cpu

The 7B model uses a custom operator called MatMulFpQ4, which is supported by ONNX Runtime version 1.16 or higher. To install this version, run the following command in a code cell:

!pip install --upgrade "onnxruntime>=1.16.0"
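
Whichever of the two installs above applies to your model, it is worth confirming that the environment picks up the expected build before going further:

import onnxruntime as ort

# Print the installed version and the execution providers available in this build.
print(ort.__version__)
print(ort.get_available_providers())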

Intel® Extension for Transformers is a library that provides optimized implementations of transformer-based models using Intel® Deep Learning Boost technology. It also provides APIs for model conversion, quantization, evaluation, and inference. To install this library, run the following command in a code cell:

!pip install intel-extension-for-transformers

Intel® Neural Compressor is a tool that can quantize deep learning models to reduce their size and improve their performance. It also provides APIs for model tuning, calibration, and deployment. To install this tool, run the following command in a code cell:

!pip install neural-compressor
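
After installing all three packages, a quick way to confirm which versions ended up in the Colab environment:

!pip list | grep -Ei "onnxruntime|ort-nightly|intel-extension|neural-compressor"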

How to speed up inference of the INT4 ONNX version of Llama 2 on Google Colab?

After ensuring that your Colab instance has a suitable hardware and software configuration, you can speed up inference of the INT4 ONNX version of Llama 2 by following these steps:

Step 1: Download the INT4 ONNX model from Hugging Face using wget or curl commands. For example, to download the 13B model, run the following command in a code cell:

!wget https://huggingface.co/Intel/Llama-2-13b-hf-onnx-int4/resolve/main/decoder_model.onnx
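
Large ONNX models often store their weights in separate external-data files next to the .onnx graph, so check the repository's file list on Hugging Face and confirm that everything you downloaded arrived intact, for example:

!ls -lh decoder_model.onnx*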

Step 2: Import the necessary modules from Intel® Extension for Transformers and ONNX Runtime. For example, to import the InferenceSession and LLMInference classes, run the following command in a code cell:

from intel_extension_for_transformers.llm.inference.llm_inference import InferenceSession, LLMInference
import onnxruntime as ort

Step 3: Create an InferenceSession object with the path to the downloaded ONNX model and the device type (CPU or GPU). For example, to create an InferenceSession object for the 13B model on CPU, run the following command in a code cell:

session = InferenceSession("decoder_model.onnx", device="cpu")
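
The wrapper above handles session creation for you. If you drop down to a plain ONNX Runtime session instead, or if the wrapper exposes session options, the thread count and graph optimization level are the usual CPU performance knobs. A minimal sketch, assuming the decoder_model.onnx file from Step 1:

import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2  # match the number of physical cores on the Colab instance
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Create a CPU session with the tuned options.
raw_session = ort.InferenceSession("decoder_model.onnx", so, providers=["CPUExecutionProvider"])
print(raw_session.get_providers())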

Step 4: Create an LLMInference object with the InferenceSession object and the tokenizer name. For example, to create an LLMInference object for the 13B model with the Intel tokenizer, run the following command in a code cell:

llm_infer = LLMInference(session, "Intel/Llama-2-13b-hf-onnx-int4")
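
The second argument in this example is the Hugging Face repository ID, which is used to resolve the tokenizer. If you need the tokenizer separately, it can usually be loaded with the transformers library; this assumes the Intel repository ships tokenizer files, which is worth verifying on the model page (otherwise use the original meta-llama/Llama-2-13b-hf repository, which requires access approval):

from transformers import AutoTokenizer

# Assumption: tokenizer files are present in the Intel ONNX repository.
tokenizer = AutoTokenizer.from_pretrained("Intel/Llama-2-13b-hf-onnx-int4")
print(tokenizer("a quick token count check")["input_ids"])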

Step 5: Use the generate method of the LLMInference object to generate text from a natural language prompt. For example, to continue the prompt “Once upon a time”, run the following command in a code cell:

llm_infer.generate("Once upon a time")

The generated text will be displayed in the Colab notebook.
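
To confirm that your configuration changes actually help, it is worth timing a generation call. Here is a minimal sketch using Python's standard library; it assumes the generate method from Step 5 returns the generated text as a string:

import time

start = time.perf_counter()
output = llm_infer.generate("Tell me a short story about a robot learning to paint.")  # assumed to return a string
print(output)
print(f"Generation took {time.perf_counter() - start:.1f} seconds")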

Frequently Asked Questions (FAQs)

Question: What is Llama 2?

Answer: Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. These models can produce realistic text in response to natural language prompts.

Question: What is INT4 ONNX?

Answer: INT4 ONNX is a format that quantizes the model weights to 4-bit integers to reduce the model size and improve the inference speed. ONNX stands for Open Neural Network Exchange, which is a standard format for representing deep learning models.
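
As a rough back-of-the-envelope illustration of the size reduction for the 7B model (weights only, ignoring activations and the KV cache):

params = 7e9  # Llama 2 7B parameters
print(f"FP32 weights: ~{params * 4 / 1e9:.0f} GB")   # 4 bytes per parameter
print(f"INT4 weights: ~{params * 0.5 / 1e9:.1f} GB")  # 0.5 bytes per parameter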

Question: What is Google Colab?

Answer: Google Colab is a cloud-based platform that provides free access to GPUs and TPUs for running machine learning and data science experiments. Colab notebooks are interactive documents that can contain code, text, images, and videos.

Question: What are Intel® Neural Compressor and Intel® Extension for Transformers?

Answer: Intel® Neural Compressor is a tool that can quantize deep learning models to reduce their size and improve their performance. Intel® Extension for Transformers is a library that provides optimized implementations of transformer-based models using Intel® Deep Learning Boost technology.

Summary

In this article, we have explained how to optimize the inference performance of the INT4 ONNX version of Llama 2, a generative AI model, on Google Colab using Intel® Neural Compressor and Intel® Extension for Transformers. We have also provided examples of how to download, load, and generate text from the INT4 ONNX models using natural language prompts. We hope this article has been helpful for you to explore the possibilities of generative AI on Google Colab.

Disclaimer: This article is for informational purposes only and does not constitute professional advice. The results may vary depending on the hardware and software configuration of your Colab instance. Please refer to the official documentation of Google Colab, Hugging Face, ONNX Runtime, Intel® Neural Compressor, and Intel® Extension for Transformers for more details and support.