Ollama, vLLM, and llama.cpp are all tools for running large language models (LLMs) locally on your own computer.

1. Ollama

  • Ollama (/ˈɒlˌlæmə/) is a user-friendly, higher-level interface for running various LLMs, including Llama, Qwen, Phi, Mistral, and others.

  • It provides a streamlined workflow for downloading models, configuring settings, and interacting with LLMs through a command-line interface (CLI) or Python API.

  • Ollama acts as a central hub for managing and running multiple LLM models from different providers, and integrates with underlying tools like llama.cpp for efficient execution.

  • To pull a model checkpoint and run the model, use the ollama run command (see the example after the service setup below).

    • Install Ollama on Linux:

      curl -fsSL https://ollama.com/install.sh | sh
      >>> Downloading ollama...
      ######################################################################## 100.0%
      >>> Installing ollama to /usr/local/bin...
      >>> Creating ollama user...
      >>> Adding ollama user to render group...
      >>> Adding ollama user to video group...
      >>> Adding current user to ollama group...
      >>> Creating ollama systemd service...
      >>> Enabling and starting ollama service...
      Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
      >>> The Ollama API is now available at 127.0.0.1:11434.
      >>> Install complete. Run "ollama" from the command line.
      WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
      For more install instructions, see https://github.com/ollama/ollama.
    • Keep the Ollama service running whenever using ollama; otherwise commands such as ollama run fail:

      $ systemctl status ollama.service
      ○ ollama.service - Ollama Service
           Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
           Active: inactive (dead)
      $ ollama run phi3:mini
      Error: could not connect to ollama app, is it running?
      1. On WSL 2, add these lines to /etc/wsl.conf to ensure systemd starts on boot.

        [boot]
        systemd=true
      2. Run wsl.exe --shutdown from PowerShell to restart the WSL instances.

      3. Start and check the Ollama service status.

        $ sudo systemctl start ollama.service
        $ systemctl status ollama.service
        ● ollama.service - Ollama Service
             Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
             Active: active (running) since Wed 2024-06-12 15:21:39 CST; 5min ago
           Main PID: 914 (ollama)
              Tasks: 15 (limit: 9340)
             Memory: 576.9M
             CGroup: /system.slice/ollama.service
                     └─914 /usr/local/bin/ollama serve
        $ sudo ss -ntlp
        State     Recv-Q    Send-Q    Local Address:Port     Peer Address:Port    Process
        LISTEN    0         4096          127.0.0.1:11434         0.0.0.0:*        users:(("ollama",pid=914,fd=3))
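    • With the service active, pull a model checkpoint and run it (phi3:mini is used as an example):

      $ ollama pull phi3:mini
      $ ollama run phi3:mini "Say this is a test"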
  • Ollama pulls models from its own library and stores them in the home directory of the user (i.e., ollama) that runs the ollama service:

    • macOS: ~/.ollama/models

    • Linux: /usr/share/ollama/.ollama/models

    • Windows: C:\Users\%username%\.ollama\models

    If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.

    To get the home directory of the user ollama, run getent passwd ollama | cut -d: -f6.
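
    For example, on a systemd-managed Linux install, OLLAMA_MODELS can be set with a drop-in override (the directory below is only an illustration):

    $ sudo systemctl edit ollama.service
    # In the editor, add:
    #   [Service]
    #   Environment="OLLAMA_MODELS=/data/ollama/models"
    $ sudo systemctl daemon-reload
    $ sudo systemctl restart ollama.service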
  • The ollama service can also be accessed via its OpenAI-compatible API once a model checkpoint has been pulled.

    $ ollama serve --help
    Start ollama
    
    Usage:
      ollama serve [flags]
    
    Aliases:
      serve, start
    
    Flags:
      -h, --help   help for serve
    
    Environment Variables:
          OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
          OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
          OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
          OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models (default 1)
          OLLAMA_MAX_QUEUE           Maximum number of queued requests
          OLLAMA_MODELS              The path to the models directory
          OLLAMA_NUM_PARALLEL        Maximum number of parallel requests (default 1)
          OLLAMA_NOPRUNE             Do not prune model blobs on startup
          OLLAMA_ORIGINS             A comma separated list of allowed origins
          OLLAMA_TMPDIR              Location for temporary files
          OLLAMA_FLASH_ATTENTION     Enabled flash attention
          OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
          OLLAMA_MAX_VRAM            Maximum VRAM
    # Ensure that the model checkpoint has been pulled.
    $ ollama list
    NAME                    ID              SIZE    MODIFIED
    phi3:mini               64c1188f2485    2.4 GB  17 minutes ago

    curl

    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"Say this is a test"}],"model":"phi3:mini"}'

    Python

    pip install openai
    from openai import OpenAI
    client = OpenAI(
        base_url='http://localhost:11434/v1/',
        api_key='ollama',  # required but ignored
    )
    chat_completion = client.chat.completions.create(
        messages=[
            {
                'role': 'user',
                'content': 'Say this is a test',
            }
        ],
        model='phi3:mini',
    )
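    # Print the assistant's reply from the OpenAI-style response object.
    print(chat_completion.choices[0].message.content)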

    C#/.NET

    # The official .NET library for the OpenAI API
    dotnet add package OpenAI --prerelease
    using System.ClientModel;
    using OpenAI.Chat;
    
    // Point the official OpenAI .NET client at the local Ollama endpoint.
    ChatClient client = new(
        model: "phi3:mini",
        credential: new ApiKeyCredential("EMPTY_OPENAI_API_KEY"),  // required but ignored
        options: new OpenAI.OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1/") });
    
    ChatCompletion completion = client.CompleteChat("Say 'this is a test.'");
    
    Console.WriteLine($"[ASSISTANT]: {completion.Content[0].Text}");

2. vLLM

  • vLLM is a fast, memory-efficient inference and serving engine for LLMs (built around the PagedAttention algorithm) that primarily focuses on deploying LLMs as low-latency inference servers.

  • It prioritizes speed and efficiency, making it suitable for serving LLMs to multiple users in real-time applications.

  • vLLM offers APIs that allow developers to integrate LLM functionality into their applications. While it can be used locally, server deployment is its main strength.

  • vLLM is a Python library that also ships pre-compiled C++ and CUDA (12.1) binaries, with the following requirements:

    • OS: Linux

    • Python: 3.8 – 3.11

    • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

  • To deploy a model as an OpenAI-compatible service:

    pip install vllm
    $ pip list | egrep 'vllm|transformers'
    transformers                      4.41.2
    vllm                              0.5.0
    vllm-flash-attn                   2.5.9
    $ python -m vllm.entrypoints.openai.api_server --help
    vLLM OpenAI-Compatible RESTful API server.
    
    options:
      --host HOST           host name
      --port PORT           port number
      --api-key API_KEY     If provided, the server will require this key to be presented in the header.
      --model MODEL         Name or path of the huggingface model to use.
      --max-model-len MAX_MODEL_LEN
                            Model context length. If unspecified, will be automatically derived from the model config.
      --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                            The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use
                            the default value of 0.9.
      --served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]
                            The model name(s) used in the API. If multiple names are provided, the server will respond to any of the provided names. The model name in the model field of a response will be the
                            first name in this list. If not specified, the model name will be the same as the `--model` argument. Noted that this name(s)will also be used in `model_name` tag content of
                            prometheus metrics, if multiple names provided, metricstag will take the first one.
    # Start an OpenAI-compatible API service
    python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct

    If the connection to https://huggingface.co/ fails, try:

    HF_ENDPOINT=https://hf-mirror.com python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct

    To run in a firewalled or offline environment with locally cached files, set the environment variable TRANSFORMERS_OFFLINE=1:

    HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
        HF_ENDPOINT=https://hf-mirror.com \
        python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen2-0.5B-Instruct \
        --max-model-len 4096

    vLLM's default (CUDA) build requires an NVIDIA GPU on the host system; the --device cpu option does not work with it:

    $ python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct --device cpu
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
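
    Once the server is up, it can be queried with the same OpenAI client used for Ollama above; a minimal sketch, assuming the default listen address http://localhost:8000/v1 and the model started above:

    from openai import OpenAI

    # vLLM's OpenAI-compatible server listens on port 8000 by default.
    client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
    completion = client.chat.completions.create(
        model='Qwen/Qwen2-0.5B-Instruct',
        messages=[{'role': 'user', 'content': 'Say this is a test'}],
    )
    print(completion.choices[0].message.content)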

3. llama.cpp

  • llama.cpp is a C/C++ library that serves as a core inference engine, providing the functionality for running LLMs efficiently on CPUs and GPUs.

  • It’s designed to efficiently execute LLM models for tasks like text generation and translation.

  • Ollama and other higher-level tools (such as LM Studio and GPT4All) use llama.cpp as the underlying engine for running LLMs locally.
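
  • llama.cpp can also be used directly from Python through the llama-cpp-python bindings; a minimal sketch, assuming a GGUF checkpoint has already been downloaded to ./phi-3-mini.gguf (illustrative path):

    pip install llama-cpp-python
    from llama_cpp import Llama

    # Load a local GGUF model; n_ctx sets the context window.
    llm = Llama(model_path="./phi-3-mini.gguf", n_ctx=4096)

    # Chat-style completion mirroring the OpenAI message format.
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say this is a test"}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])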

Think of Ollama as a user-friendly car with a dashboard and controls that simplifies running different LLM models (like choosing a destination). vLLM is more like a high-performance racing engine focused on speed and efficiency, which is optimized for serving LLMs to many users (like a racing car on a track). llama.cpp is the core engine that does the actual work of moving the car (like the internal combustion engine), and other tools can utilize it for different purposes.

  • Use Ollama for a simple and user-friendly experience running different LLM models locally.

  • Consider vLLM if the focus is on deploying a low-latency LLM server for real-time applications.

  • llama.cpp is a low-level library that serves as the core engine for other tools to run LLMs efficiently.

4. Hugging Face

  • Hugging Face is a popular open-source community and platform focused on advancing natural language processing (NLP) research and development. It is best known for the Transformers library, a widely used open-source Python framework that provides tools for training, fine-tuning, and deploying various NLP models, including LLMs.

  • Hugging Face maintains the Model Hub, a vast repository of pre-trained models, including LLMs like Llama, Qwen, Mistral, and many others, which can be downloaded and used with the Transformers library or other compatible tools.

  • ModelScope is a platform that focuses on model access and aims to democratize access to a wide range of machine learning models, including LLMs. It goes beyond NLP and covers domains such as computer vision, audio processing, and more. It acts as a model hosting service, allowing developers to access and use pre-trained models through APIs or a cloud-based environment.

  • While ModelScope has its own model repository, some models from the Hugging Face Model Hub are also available on ModelScope, giving users additional access options.

  • Here’s a table summarizing the key differences:

    Feature          | Hugging Face                                       | ModelScope
    Focus            | Open-source community, NLP research & development  | Model access across various domains (including NLP)
    Core Strength    | Transformers library, Model Hub                    | Model hosting service, API access
    Scope            | Primarily NLP, but expanding                       | Wide range of machine learning models
    Community Focus  | Strong community focus, education, collaboration   | Less emphasis on community, more on commercial aspects

  • Command line interface (CLI)

    The huggingface_hub Python package comes with a built-in CLI called huggingface-cli that can be used to interact with the Hugging Face Hub directly from a terminal.

    pip install -U "huggingface_hub[cli]"
    In the snippet above, the [cli] extra dependencies are also installed to make the user experience better, especially when using the delete-cache command.

    To download files from a repo, provide the repo_id (and optionally a specific filename) as follows:

    # If the connection to https://huggingface.co/ fails, uncomment the following line:
    # export HF_ENDPOINT=https://hf-mirror.com
    
    huggingface-cli download sentence-transformers/all-MiniLM-L6-v2
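
    The same downloads can be scripted with the huggingface_hub Python API; a minimal sketch using hf_hub_download and snapshot_download:

    from huggingface_hub import hf_hub_download, snapshot_download

    # Download a single file from a repo (cached under ~/.cache/huggingface/hub by default).
    config_path = hf_hub_download(
        repo_id="sentence-transformers/all-MiniLM-L6-v2",
        filename="config.json",
    )

    # Download an entire repo snapshot and return the local directory.
    model_dir = snapshot_download(repo_id="sentence-transformers/all-MiniLM-L6-v2")
    print(config_path, model_dir)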

5. LangChain

LangChain is a framework for developing applications powered by large language models (LLMs).
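
A minimal sketch of using LangChain on top of the local Ollama endpoint from section 1, assuming the langchain-openai package and the phi3:mini model pulled earlier:

  pip install langchain-openai
  from langchain_openai import ChatOpenAI

  # Chat model backed by the local Ollama server's OpenAI-compatible API.
  llm = ChatOpenAI(
      model="phi3:mini",
      base_url="http://localhost:11434/v1",
      api_key="ollama",  # required but ignored
  )

  response = llm.invoke("Say this is a test")
  print(response.content)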

6. Open WebUI

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.
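
A quick way to try it is the Docker image from the project's README; a sketch assuming Docker is installed and Ollama is already listening on the host:

  # Serve Open WebUI on http://localhost:3000, keeping its data in a named volume.
  docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:main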