LLMs: Ollama, vLLM, Hugging Face, LangChain, and Open WebUI
Ollama, vLLM, and llama.cpp are all tools for running large language models (LLMs) locally on one's own computer.
1. Ollama
- Ollama (/ˈɒlˌlæmə/) is a user-friendly, higher-level interface for running various LLMs, including Llama, Qwen, Phi, and others.
- It provides a streamlined workflow for downloading models, configuring settings, and interacting with LLMs through a command-line interface (CLI) or Python API (a minimal Python sketch follows the API examples below).
- Ollama acts as a central hub for managing and running multiple LLM models from different providers, and integrates with underlying tools like llama.cpp for efficient execution.
- To pull a model checkpoint and run the model, use the ollama run command.
- Install Ollama on Linux:
curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
######################################################################## 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
For more installation instructions, see https://github.com/ollama/ollama.
- Keep the Ollama service running whenever using ollama:
$ systemctl status ollama.service
○ ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
     Active: inactive (dead)
$ ollama run phi3:mini
Error: could not connect to ollama app, is it running?
To run systemd inside Windows Subsystem for Linux (WSL) distros:
- Add these lines to /etc/wsl.conf to ensure systemd starts up on boot:
[boot]
systemd=true
- Run wsl.exe --shutdown from PowerShell to restart the WSL instances.
- Start and check the Ollama service status:
$ sudo systemctl start ollama.service
$ systemctl status ollama.service
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
     Active: active (running) since Wed 2024-06-12 15:21:39 CST; 5min ago
   Main PID: 914 (ollama)
      Tasks: 15 (limit: 9340)
     Memory: 576.9M
     CGroup: /system.slice/ollama.service
             └─914 /usr/local/bin/ollama serve

$ sudo ss -ntlp
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port  Process
LISTEN  0       4096    127.0.0.1:11434      0.0.0.0:*          users:(("ollama",pid=914,fd=3))
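With the service active, the API endpoint can be probed from Python as a quick sanity check; a minimal sketch, assuming the default 127.0.0.1:11434 binding shown above:
import urllib.request

# Query the root endpoint of the local Ollama server (default binding).
with urllib.request.urlopen("http://127.0.0.1:11434/") as resp:
    print(resp.read().decode())  # expected output: "Ollama is running"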
- Ollama has its own library to pull models, which are stored in the home directory of the user (i.e., ollama) that runs the ollama service:
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\%username%\.ollama\models
If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory. To get the home directory of the user ollama, run getent passwd ollama | cut -d: -f6.
- The ollama service can also be accessed via its OpenAI-compatible API once a model checkpoint has been pulled.
$ ollama serve --help
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
  OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
  OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
  OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
  OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models (default 1)
  OLLAMA_MAX_QUEUE           Maximum number of queued requests
  OLLAMA_MODELS              The path to the models directory
  OLLAMA_NUM_PARALLEL        Maximum number of parallel requests (default 1)
  OLLAMA_NOPRUNE             Do not prune model blobs on startup
  OLLAMA_ORIGINS             A comma separated list of allowed origins
  OLLAMA_TMPDIR              Location for temporary files
  OLLAMA_FLASH_ATTENTION     Enabled flash attention
  OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
  OLLAMA_MAX_VRAM            Maximum VRAM
# Ensure that the model checkpoint is prepared.
$ ollama list
NAME        ID              SIZE    MODIFIED
phi3:mini   64c1188f2485    2.4 GB  17 minutes ago
curl
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Say this is a test"}],"model":"phi3:mini"}'
Python
pip install openai
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='phi3:mini',
)
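The call above returns an OpenAI-style ChatCompletion object; printing the reply is a one-line addition (not part of the original snippet):
# Print the assistant's reply from the completion object above.
print(chat_completion.choices[0].message.content)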
C#/.NET
# The official .NET library for the OpenAI API
dotnet add package OpenAI --prerelease
using OpenAI.Chat;

ChatClient client = new(
    model: "phi3:mini",
    credential: "EMPTY_OPENAI_API_KEY",
    options: new OpenAI.OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1/") });

ChatCompletion completion = client.CompleteChat("Say 'this is a test.'");
Console.WriteLine($"[ASSISTANT]: {completion}");
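Besides the OpenAI-compatible endpoint shown above, Ollama also ships an official Python client, which covers the "Python API" mentioned at the start of this section. A minimal sketch, assuming the client is installed with pip install ollama and the service is running with phi3:mini pulled:
import ollama

# Chat with a locally pulled model through the running Ollama service.
response = ollama.chat(
    model='phi3:mini',
    messages=[{'role': 'user', 'content': 'Say this is a test'}],
)
print(response['message']['content'])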
2. vLLM
- vLLM primarily focuses on deploying LLMs as low-latency inference servers.
- It prioritizes speed and efficiency, making it suitable for serving LLMs to multiple users in real-time applications.
- vLLM offers APIs that allow developers to integrate LLM functionality into their applications (see the offline-inference sketch after the requirements list below). While it can be used locally, server deployment is its main strength.
- vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries, with the following requirements:
- OS: Linux
- Python: 3.8 – 3.11
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
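Beyond serving, vLLM also exposes an offline batch-inference API in Python; a minimal sketch, assuming a supported GPU and the Qwen/Qwen2-0.5B-Instruct checkpoint used later in this section:
from vllm import LLM, SamplingParams

# Load the model once, then generate completions for a batch of prompts offline.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct")
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Say this is a test"], sampling)
for output in outputs:
    print(output.outputs[0].text)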
- To deploy a model as an OpenAI-compatible service:
pip install vllm
$ pip list | egrep 'vllm|transformers'
transformers      4.41.2
vllm              0.5.0
vllm-flash-attn   2.5.9
$ python -m vllm.entrypoints.openai.api_server --help
vLLM OpenAI-Compatible RESTful API server.

options:
  --host HOST           host name
  --port PORT           port number
  --api-key API_KEY     If provided, the server will require this key to be presented in the header.
  --model MODEL         Name or path of the huggingface model to use.
  --max-model-len MAX_MODEL_LEN
                        Model context length. If unspecified, will be automatically derived from the model config.
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        The fraction of GPU memory to be used for the model executor, which can range from 0 to 1.
                        For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use
                        the default value of 0.9.
  --served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]
                        The model name(s) used in the API. If multiple names are provided, the server will respond
                        to any of the provided names. The model name in the model field of a response will be the
                        first name in this list. If not specified, the model name will be the same as the `--model`
                        argument. Note that this name(s) will also be used in the `model_name` tag content of
                        prometheus metrics; if multiple names are provided, the metrics tag will take the first one.
# Start an OpenAI-compatible API service
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct
If the connection to https://huggingface.co/ fails, try:
HF_ENDPOINT=https://hf-mirror.com python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct
Run in a firewalled or offline environment with locally cached files by setting the environment variable TRANSFORMERS_OFFLINE=1.
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
HF_ENDPOINT=https://hf-mirror.com \
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-0.5B-Instruct \
    --max-model-len 4096
vLLM requires an NVIDIA GPU on the host system; the --device cpu flag doesn't work.
$ python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct --device cpu
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
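Once the vLLM server from the steps above is up, it can be queried with the same OpenAI Python client used for Ollama; a minimal sketch, assuming the server listens on the default port 8000 and serves Qwen/Qwen2-0.5B-Instruct:
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

chat_completion = client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Say this is a test'}],
    model='Qwen/Qwen2-0.5B-Instruct',
)
print(chat_completion.choices[0].message.content)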
llama.cpp:
- llama.cpp is a C++ library that serves as a core inference engine, providing the functionality for running LLMs on CPUs and GPUs.
- It’s designed to efficiently execute LLM models for tasks like text generation and translation.
- Ollama and other tools like Hugging Face Transformers can use llama.cpp as the underlying engine for running LLM models locally.
Think of Ollama as a user-friendly car with a dashboard and controls that simplifies running different LLM models (like choosing a destination). vLLM is more like a high-performance racing engine focused on speed and efficiency, which is optimized for serving LLMs to many users (like a racing car on a track). llama.cpp is the core engine that does the actual work of moving the car (like the internal combustion engine), and other tools can utilize it for different purposes.
- Use Ollama for a simple and user-friendly experience running different LLM models locally.
- Consider vLLM if the focus is on deploying a low-latency LLM server for real-time applications.
- llama.cpp is a low-level library that serves as the core engine for other tools to run LLMs efficiently (a Python-bindings sketch follows).
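llama.cpp itself is usually driven from C/C++ or its bundled CLI, but it can also be used from Python through the community llama-cpp-python bindings; a minimal sketch, assuming the package is installed with pip install llama-cpp-python and a GGUF checkpoint exists at the hypothetical path ./phi-3-mini.gguf:
from llama_cpp import Llama

# Load a local GGUF model file (hypothetical path) and run a single completion.
llm = Llama(model_path="./phi-3-mini.gguf")
out = llm("Q: Say this is a test. A:", max_tokens=32)
print(out["choices"][0]["text"])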
3. Hugging Face
- Hugging Face is a popular open-source community and platform focused on advancing natural language processing (NLP) research and development. It is best known for the Transformers library, a widely used open-source Python framework that provides tools for training, fine-tuning, and deploying various NLP models, including LLMs (a short pipeline sketch follows the comparison table below).
- Hugging Face maintains a Model Hub, a vast repository of pre-trained NLP models, including LLMs like Qwen, Llama, and many others, which can be downloaded and used with the Transformers library or other compatible tools.
- ModelScope is a platform that focuses on model access and aims to democratize access to a wide range of machine learning models, including LLMs. It goes beyond NLP models and encompasses various domains like computer vision, audio processing, and more. It acts as a model hosting service, allowing developers to access and utilize pre-trained models through APIs or a cloud-based environment.
- While ModelScope has its own model repository, it also collaborates with Hugging Face. Some models from the Hugging Face Model Hub are also available on ModelScope, providing users with additional access options.
- Here’s a table summarizing the key differences:
| Feature | Hugging Face | ModelScope |
| --- | --- | --- |
| Focus | Open-source community, NLP research & development | Model access across various domains (including NLP) |
| Core strength | Transformers library, Model Hub | Model hosting service, API access |
| Scope of models | Primarily NLP, but expanding | Wide range of machine learning models |
| Community focus | Strong community focus, education, collaboration | Less emphasis on community, more on commercial aspects |
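As referenced above, models from the Model Hub can be loaded directly with the Transformers library; a minimal text-generation sketch, assuming transformers and a PyTorch backend are installed and reusing the Qwen/Qwen2-0.5B-Instruct checkpoint from the vLLM section:
from transformers import pipeline

# Download (or load from cache) the model and tokenizer, then generate text.
generator = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")
result = generator("Say this is a test", max_new_tokens=32)
print(result[0]["generated_text"])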
- Command-line interface (CLI)
The huggingface_hub Python package comes with a built-in CLI called huggingface-cli that can be used to interact with the Hugging Face Hub directly from a terminal.
pip install -U "huggingface_hub[cli]"
In the snippet above, the [cli] extra dependencies are also installed to improve the user experience, especially when using the delete-cache command.
To download a model from the Hub, provide the repo_id (and, to fetch a single file, a filename) as follows:
# If the connection to https://huggingface.co/ fails, uncomment the following line:
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download sentence-transformers/all-MiniLM-L6-v2
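The same download can also be done from Python with the huggingface_hub library; a minimal sketch, assuming network access to the Hub (or an HF_ENDPOINT mirror as above):
from huggingface_hub import hf_hub_download, snapshot_download

# Download the whole repository (returns the local cache path).
repo_path = snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")

# Or download a single file from the repository.
config_path = hf_hub_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    filename="config.json",
)
print(repo_path, config_path)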
4. LangChain
LangChain is a framework for developing applications powered by large language models (LLMs).
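A minimal sketch of wiring LangChain to the local Ollama service from the earlier sections, assuming the langchain-community package is installed (pip install langchain-community) and the phi3:mini model has been pulled:
from langchain_community.llms import Ollama

# Use the local Ollama service as the LLM backend for a LangChain application.
llm = Ollama(model="phi3:mini")
print(llm.invoke("Say this is a test"))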
5. Open WebUI
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.