Ollama, vLLM, and llama.cpp are all tools for running large language models (LLMs) locally on one's own computer.

1. Ollama

  • Ollama (/ˈɒlˌlæmə/) is a user-friendly, higher-level interface for running various LLMs, including Llama, Qwen, Mistral, and others.

  • It provides a streamlined workflow for downloading models, configuring settings, and interacting with LLMs through a command-line interface (CLI) or Python API.

  • Ollama acts as a central hub for managing and running multiple LLM models from different providers, and integrates with underlying tools like llama.cpp for efficient execution.

  • To pull a model checkpoint and run the model, use the ollama run command.

    • Install Ollama on Linux:

      curl -fsSL https://ollama.com/install.sh | sh
      >>> Downloading ollama...
      ######################################################################## 100.0%
      >>> Installing ollama to /usr/local/bin...
      >>> Creating ollama user...
      >>> Adding ollama user to render group...
      >>> Adding ollama user to video group...
      >>> Adding current user to ollama group...
      >>> Creating ollama systemd service...
      >>> Enabling and starting ollama service...
      Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
      >>> The Ollama API is now available at 127.0.0.1:11434.
      >>> Install complete. Run "ollama" from the command line.
      WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
      For more install instructions, see https://github.com/ollama/ollama.
    • Keep the Ollama service running whenever using ollama; otherwise commands fail:

      $ systemctl status ollama.service
      ○ ollama.service - Ollama Service
           Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
           Active: inactive (dead)
      $ ollama run phi3:mini
      Error: could not connect to ollama app, is it running?
      1. Add these lines to /etc/wsl.conf to ensure systemd starts on boot.

        [boot]
        systemd=true
      2. Run wsl.exe --shutdown from PowerShell to restart the WSL instances.

      3. Start and check the Ollama service status.

        $ sudo systemctl start ollama.service
        $ systemctl status ollama.service
        ● ollama.service - Ollama Service
             Loaded: loaded (/etc/systemd/system/ollama.service; disabled; preset: enabled)
             Active: active (running) since Wed 2024-06-12 15:21:39 CST; 5min ago
           Main PID: 914 (ollama)
              Tasks: 15 (limit: 9340)
             Memory: 576.9M
             CGroup: /system.slice/ollama.service
                     └─914 /usr/local/bin/ollama serve
        $ sudo ss -ntlp
        State     Recv-Q    Send-Q    Local Address:Port     Peer Address:Port    Process
        LISTEN    0         4096          127.0.0.1:11434         0.0.0.0:*        users:(("ollama",pid=914,fd=3))
  • Ollama has its own model library from which it pulls models, and stores them in the home directory of the user (i.e., ollama) running the ollama service:

    • macOS: ~/.ollama/models

    • Linux: /usr/share/ollama/.ollama/models

    • Windows: C:\Users\%username%\.ollama\models

    If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.

    To get the home directory of the user ollama, run getent passwd ollama | cut -d: -f6.
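The lookup above can be sketched in Python. A minimal sketch, assuming only the documented defaults and that OLLAMA_MODELS takes precedence; the function name is illustrative:

```python
import os
import platform
from pathlib import Path

def ollama_models_dir():
    """Resolve the directory where Ollama stores pulled models.

    Mirrors the documented per-platform defaults; OLLAMA_MODELS,
    when set, overrides them all.
    """
    override = os.environ.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    if platform.system() == "Linux":
        # Home directory of the dedicated `ollama` service user on Linux
        return Path("/usr/share/ollama/.ollama/models")
    # macOS and Windows: under the current user's home directory
    return Path.home() / ".ollama" / "models"
```

Note that on Linux the directory belongs to the service user, not the logged-in user, which is why `getent passwd ollama` is the reliable way to find it.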
  • Once a model checkpoint has been pulled, the ollama service can also be accessed via its OpenAI-compatible API.

    $ ollama serve --help
    Start ollama
    
    Usage:
      ollama serve [flags]
    
    Aliases:
      serve, start
    
    Flags:
      -h, --help   help for serve
    
    Environment Variables:
          OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
          OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
          OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
          OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models (default 1)
          OLLAMA_MAX_QUEUE           Maximum number of queued requests
          OLLAMA_MODELS              The path to the models directory
          OLLAMA_NUM_PARALLEL        Maximum number of parallel requests (default 1)
          OLLAMA_NOPRUNE             Do not prune model blobs on startup
          OLLAMA_ORIGINS             A comma separated list of allowed origins
          OLLAMA_TMPDIR              Location for temporary files
          OLLAMA_FLASH_ATTENTION     Enabled flash attention
          OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
          OLLAMA_MAX_VRAM            Maximum VRAM
    # Ensure that the model checkpoint is prepared.
    $ ollama list
    NAME                    ID              SIZE    MODIFIED
    phi3:mini               64c1188f2485    2.4 GB  17 minutes ago

    curl

    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"Say this is a test"}],"model":"phi3:mini"}'

    Python

    pip install openai
    from openai import OpenAI
    client = OpenAI(
        base_url='http://localhost:11434/v1/',
        api_key='ollama',  # required but ignored
    )
    chat_completion = client.chat.completions.create(
        messages=[
            {
                'role': 'user',
                'content': 'Say this is a test',
            }
        ],
        model='phi3:mini',
    )
    print(chat_completion.choices[0].message.content)

    C#/.NET

    # The official .NET library for the OpenAI API
    dotnet add package OpenAI --prerelease
    using System.ClientModel;
    using OpenAI.Chat;
    
    ChatClient client = new(
        model: "phi3:mini",
        credential: new ApiKeyCredential("EMPTY_OPENAI_API_KEY"),  // required but ignored
        options: new OpenAI.OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1/") });
    
    ChatCompletion completion = client.CompleteChat("Say 'this is a test.'");
    
    Console.WriteLine($"[ASSISTANT]: {completion}");

2. vLLM

  • vLLM is a high-throughput, memory-efficient inference engine that primarily focuses on deploying LLMs as low-latency inference servers.

  • It prioritizes speed and efficiency, making it suitable for serving LLMs to multiple users in real-time applications.

  • vLLM offers APIs that allow developers to integrate LLM functionality into their applications. While it can be used locally, server deployment is its main strength.

  • vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries, with the following requirements:

    • OS: Linux

    • Python: 3.8 – 3.11

    • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

  • To deploy a model as an OpenAI-compatible service:

    pip install vllm
    $ pip list | egrep 'vllm|transformers'
    transformers                      4.41.2
    vllm                              0.5.0
    vllm-flash-attn                   2.5.9
    $ python -m vllm.entrypoints.openai.api_server --help
    vLLM OpenAI-Compatible RESTful API server.
    
    options:
      --host HOST           host name
      --port PORT           port number
      --api-key API_KEY     If provided, the server will require this key to be presented in the header.
      --model MODEL         Name or path of the huggingface model to use.
      --max-model-len MAX_MODEL_LEN
                            Model context length. If unspecified, will be automatically derived from the model config.
      --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                            The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use
                            the default value of 0.9.
      --served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]
                            The model name(s) used in the API. If multiple names are provided, the server will respond to any of the provided names. The model name in the model field of a response will be the
                            first name in this list. If not specified, the model name will be the same as the `--model` argument. Noted that this name(s) will also be used in `model_name` tag content of
                            prometheus metrics, if multiple names provided, metrics tag will take the first one.
    # Start an OpenAI-compatible API service
    python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct

    If the connection to https://huggingface.co/ fails, try:

    HF_ENDPOINT=https://hf-mirror.com python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct

    To run in a firewalled or offline environment with locally cached files, set the environment variable TRANSFORMERS_OFFLINE=1:

    HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
        HF_ENDPOINT=https://hf-mirror.com \
        python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen2-0.5B-Instruct \
        --max-model-len 4096

    vLLM requires an NVIDIA GPU on the host system; the --device cpu option does not work with the pre-compiled CUDA build.

    $ python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-0.5B-Instruct --device cpu
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
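Like Ollama, a running vLLM server speaks the OpenAI chat-completions protocol, on port 8000 by default rather than 11434. A minimal client sketch using only the standard library; the helper names and defaults are illustrative, and actually sending the request requires a live server:

```python
import json
import urllib.request

def build_chat_request(prompt, model="Qwen/Qwen2-0.5B-Instruct",
                       base_url="http://localhost:8000/v1"):
    """Build an OpenAI-style chat completion request for a vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt, **kwargs):
    # Only succeeds against a running server started as shown above.
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the request body is identical to the one used with Ollama earlier, the same client code works against either backend once the base URL is changed.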

3. llama.cpp

  • llama.cpp is a C++ library that serves as a core inference engine, providing the functionality for running LLMs on CPUs and GPUs.

  • It’s designed to efficiently execute LLM models for tasks like text generation and translation.

  • Ollama and other tools like Hugging Face Transformers can use llama.cpp as the underlying engine for running LLM models locally.

Think of Ollama as a user-friendly car with a dashboard and controls that simplifies running different LLM models (like choosing a destination). vLLM is more like a high-performance racing engine focused on speed and efficiency, optimized for serving LLMs to many users (like a racing car on a track). llama.cpp is the core engine that does the actual work of moving the car (like the internal combustion engine), and other tools can utilize it for different purposes.

  • Use Ollama for a simple and user-friendly experience running different LLM models locally.

  • Consider vLLM if the focus is on deploying a low-latency LLM server for real-time applications.

  • llama.cpp is a low-level library that serves as the core engine for other tools to run LLMs efficiently.

4. Hugging Face

  • Hugging Face is a popular open-source community and platform focused on advancing natural language processing (NLP) research and development. It is best known for the Transformers library, a widely used open-source Python framework that provides tools for training, fine-tuning, and deploying various NLP models, including LLMs.

  • Hugging Face maintains a Model Hub, a vast repository of pre-trained models, including LLMs like Llama, Qwen, and many others, which can be downloaded and used with the Transformers library or other compatible tools.

  • ModelScope is a platform that focuses on model access and aims to democratize access to a wide range of machine learning models, including LLMs. It goes beyond NLP and encompasses domains like computer vision, audio processing, and more. It acts as a model hosting service, allowing developers to access and utilize pre-trained models through APIs or a cloud-based environment.

  • While ModelScope has its own model repository, it also collaborates with Hugging Face. Some models from the Hugging Face Model Hub are also available on ModelScope, providing users with additional access options.

  • Here’s a table summarizing the key differences:

    Feature            Hugging Face                                        ModelScope
    ---------------    ------------------------------------------------    -----------------------------------------------------
    Focus              Open-source community, NLP research & development   Model access across various domains (including NLP)
    Core Strength      Transformers library, Model Hub                     Model hosting service, API access
    Scope              Primarily NLP, but expanding                        Wide range of machine learning models
    Community Focus    Strong community focus, education, collaboration    Less emphasis on community, more on commercial aspect

  • Command line interface (CLI)

    The huggingface_hub Python package comes with a built-in CLI called huggingface-cli that can be used to interact with the Hugging Face Hub directly from a terminal.

    pip install -U "huggingface_hub[cli]"
    In the snippet above, the [cli] extra dependencies are also installed to improve the user experience, especially when using the delete-cache command.

    To download a model repo, simply provide the repo_id as follows (append a filename to download a single file):

    # If the connection to https://huggingface.co/ fails, uncomment the following line:
    # export HF_ENDPOINT=https://hf-mirror.com
    
    huggingface-cli download sentence-transformers/all-MiniLM-L6-v2
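Under the hood, the CLI resolves each file in a repo to a predictable URL on the Hub, honoring HF_ENDPOINT. A minimal sketch of that /resolve URL scheme; the helper name is illustrative:

```python
import os

def hub_file_url(repo_id, filename, revision="main"):
    """URL of a single file in a Hub repo, honoring HF_ENDPOINT.

    Follows the Hub's <endpoint>/<repo_id>/resolve/<revision>/<filename>
    URL scheme used for direct file downloads.
    """
    endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co")
    return f"{endpoint.rstrip('/')}/{repo_id}/resolve/{revision}/{filename}"
```

For example, hub_file_url("sentence-transformers/all-MiniLM-L6-v2", "config.json") points at the same file the CLI fetches, and switching HF_ENDPOINT to a mirror redirects it transparently.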