Use case
The popularity of projects like PrivateGPT, llama.cpp, Ollama, GPT4All, llamafile, and others underscores the demand to run LLMs locally (on your own device).
This has at least two important benefits:
Privacy: Your data is not sent to a third party, and it is not subject to the terms of service of a commercial service.
Cost: There is no inference fee, which is important for token-intensive applications (e.g., long-running simulations, summarization).
Overview
Running an LLM locally requires a few things:
Open-source LLM: An open-source LLM that can be freely modified and shared.
Inference: The ability to run this LLM on your device with acceptable latency.
Open-source LLMs
Users can now gain access to a rapidly growing set of open-source LLMs.
These LLMs can be assessed across at least two dimensions:
Base model: What is the base model, and how was it trained?
Fine-tuning approach: Was the base model fine-tuned, and, if so, what instructions were used?
The relative performance of these models can be compared using several public leaderboards.
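Once a model is chosen, its weights need to be downloaded. As a minimal sketch (the repository name and filename below are placeholders, not taken from this post), a quantized GGUF weight file can be fetched from the Hugging Face Hub:

```python
# Hypothetical sketch: download a quantized open-source model file from the
# Hugging Face Hub. repo_id and filename are placeholders -- substitute the
# open-source model and quantization level you have chosen.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",   # placeholder model repository
    filename="llama-2-7b.Q4_K_M.gguf",    # placeholder 4-bit quantized weights
)
print(f"Model weights saved to: {model_path}")
```

The resulting local file can then be loaded by whichever inference framework you pick in the next section.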
Inference
A few frameworks have emerged to support inference of open-source LLMs on various devices:
llama.cpp: C++ implementation of llama inference code with weight optimization/quantization.
gpt4all: Optimized C backend for inference.
Ollama: Bundles model weights and environment into an app that runs on a device and serves the LLM (a minimal example of querying it follows this list).
llamafile: Bundles model weights and everything needed to run the model in a single file, allowing you to run the LLM locally from that file without any additional installation steps.
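As a rough sketch of what local serving looks like in practice, the following assumes Ollama is installed, its local server is running, and a model has already been pulled (e.g., with `ollama pull llama2`); the model name and prompt are illustrative:

```python
# Minimal sketch: query a locally served model through Ollama's REST API.
# Assumes the Ollama server is running on its default local port and the
# named model has already been pulled.
import json
import urllib.request

payload = {
    "model": "llama2",  # placeholder: any locally pulled model
    "prompt": "Explain quantization in one sentence.",
    "stream": False,    # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read().decode("utf-8"))

print(result["response"])  # the generated completion
```

Because everything runs on localhost, no prompt or completion ever leaves the device, which is exactly the privacy benefit described above.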
In general, these frameworks will do a few things:
Quantization: Reduce the memory footprint of the raw model weights.
Efficient implementation for inference: Support inference on consumer hardware (e.g., CPU or laptop GPU).
In particular, see this excellent post on the importance of quantization.
With less precision, we radically decrease the memory needed to hold the LLM's weights, as the rough calculation below shows.
In addition, GPU memory bandwidth matters: a Mac M2 Max is 5-6x faster than an M1 for inference because of its larger GPU memory bandwidth.
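To make the quantization point concrete, here is a back-of-the-envelope sketch (the 7B parameter count is illustrative) of the memory needed just to hold the model weights at different precisions:

```python
# Back-of-the-envelope sketch of why quantization matters: approximate memory
# needed just to store the weights of a 7B-parameter model at different
# precisions (ignores activations, KV cache, and runtime overhead).
PARAMS = 7e9  # illustrative parameter count (7B)

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{label:>5}: ~{gib:.1f} GiB of weights")
```

At 4-bit precision, the same 7B model that needs roughly 26 GiB of weights in fp32 fits in about 3-4 GiB, which is what makes inference on a laptop CPU or GPU feasible.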