LLM Deployment: Using Open-source LLMs
Introduction/overview
Key ideas
- Before adopting an open-source LLM, learn how to choose the right model for your needs, understand how to customize and scale it, and be aware of the technical challenges and security considerations involved.
- Whether deploying open-source LLMs or using LLM APIs, cost considerations must be balanced with performance and output quality.
- Choosing between the two involves selecting the right model and configuring its parameters efficiently, and understanding the trade-offs between resource utilization, computational demands, and the quality or creativity of the generated content.
An open-source LLM allows anyone to access its source code and modify, distribute, or use it in their projects without paying a license fee.
Because it is so adaptable, you can use it for various purposes, from academic research to commercial deployment. You can also fine-tune and customize it to fit your specific needs (data, tech stack, tasks, and applications).
On the other hand, a closed-source LLM, like OpenAI's GPT series (which you saw in the last lesson), stays proprietary: only the company that owns it can access and modify it. This exclusivity usually comes with licensing fees and restrictions on how you can use the product.
In Course 1, we mentioned some of the best open-source LLMs and their architectural makeup. Here’s a recap of notable open-source LLMs:
- Llama 2: Llama 2 is an efficient transformer model that uses innovative normalization and embedding schemes. Meta trained it on openly accessible web data. It is free for research and commercial use, but you may need to fine-tune it for specific tasks.
- Mixtral 8x7B: Recent advances in sparse mixture-of-experts (MoE) models have led to the development of Mixtral 8x7B. With its open weights and Apache 2.0 licensing, Mixtral outperforms existing models like Llama 2 70B and GPT-3.5 in speed, efficiency, and scale.
- Falcon: Falcon is an autoregressive LLM trained on the high-quality RefinedWeb dataset and licensed under the Apache License 2.0. It is free and available in several sizes (7B, 40B, and 180B) to suit different computational and application requirements (a minimal loading sketch follows this list).
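To make the "modify and use it in your projects" point concrete, here is a minimal sketch (separate from the walkthrough below) of loading one of these open models locally with Hugging Face transformers. It assumes `transformers`, `torch`, and `accelerate` are installed and that a GPU with enough memory for the weights in `float16` is available; the prompt and generation settings are purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # any of the open models above can be substituted

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # place layers on the available GPU(s); needs accelerate
)

inputs = tokenizer("Open-source LLMs are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))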
Pros of using an open-source LLM
- Cost-efficiency: Open-source LLMs are a less expensive alternative because there are no licensing fees. The only costs are the infrastructure and the operational overhead of running the software.
- Customizability: You can change the code in many ways because it is open. You can use your data to fine-tune the LLM to fit your needs and applications.
- Transparency and trust: Access to the full code builds trust and makes ethical review possible. You can also identify possible biases and understand the LLM’s underlying architecture.
- Full control: You are in charge of the model and the data it consumes during fine-tuning or in production. This is essential for sensitive applications that require data privacy.
Cons of using an open-source LLM
- Technical needs: Setting up and maintaining an open-source LLM requires extensive technical know-how, such as training models, managing infrastructure, and creating APIs.
- Security management: If you deploy on-premise, you are responsible for security. You must keep the model and data safe from people who should not have access to them.
- Performance adaptability: Performance can change depending on the open-source model you choose and the available resources (infrastructure, data, etc.), which could impact results and the user experience.
- Scalability considerations: Scaling the open-source LLM in production will require scaling your infrastructure, which can be complex and expensive.
Open-source LLM case study walkthrough: Q&A system
Let's use OpenLLM, one of the deployment tools you learned about in the previous lesson, to deploy Falcon 7B to a local endpoint where you can send requests and receive responses.
We successfully ran this walkthrough in a Colab notebook connected to a hosted A100 runtime. Please use the same or a similar-grade runtime that supports the `float16` data type (typically GPUs with compute capability of at least 8.0).
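If you are unsure whether your runtime meets that bar, a quick check like the sketch below (assuming a CUDA GPU is attached and PyTorch is installed) prints the device name and compute capability:
import torch

# Fail fast if no GPU is attached to the runtime
assert torch.cuda.is_available(), "No CUDA GPU detected in this runtime"
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{name}: compute capability {major}.{minor}")  # an A100 reports 8.0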
Step 1: Install the required libraries
In production, the LLM needs a suitable runtime implementation to avoid unexpected behavior, such as underutilized compute, compatibility issues with existing production runtime environments, or significant performance variations.
Most models are compatible with a PyTorch runtime environment or vLLM, and OpenLLM supports both. vLLM generally provides better optimization and performance than the PyTorch runtime in production. Here’s how to install OpenLLM with the vLLM backend:
!pip install "openllm[vllm]"
🚨 To get the most up-to-date, high-performance version of the Falcon model code, upgrade to the latest version of transformers:
!pip install --upgrade transformers
Step 2: Spin up an API endpoint for Falcon 7B
Next, let’s start the server! The command below launches an OpenLLM model server instance in the background (with `nohup`) to serve the `tiiuae/falcon-7b` model on port 8001, using the `float16` data type for computation and the `vllm` backend, with all output and errors redirected to a log file.
!nohup openllm start tiiuae/falcon-7b --port 8001 --dtype float16 --backend vllm > openllm.log 2>&1 &
This keeps the server running, with its output logged, even if the terminal session ends.
Before you interact with the OpenLLM server, it's crucial to ensure that it is up and running:
! curl -i http://127.0.0.1:8001/readyz
The curl command output should start with `HTTP/1.1 200 OK`, which means everything is in order; move on to the next step. If it says `curl: (7) Failed to connect to localhost…`, check `./openllm.log`: the server has likely failed to start or is still starting. If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting; wait a bit and retry.
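If you prefer not to re-run `curl` by hand, a small polling loop like this sketch waits until the server reports ready. It uses the `requests` package and the same `/readyz` route and port as above; the retry count and interval are arbitrary choices.
import time
import requests

for attempt in range(30):
    try:
        resp = requests.get("http://127.0.0.1:8001/readyz", timeout=5)
        if resp.status_code == 200:
            print("Server is ready")
            break
        print(f"Not ready yet (HTTP {resp.status_code}), retrying...")
    except requests.exceptions.ConnectionError:
        print("Server not reachable yet, retrying...")
    time.sleep(10)
else:
    # Loop finished without a 200 response
    print("Server did not become ready; check openllm.log")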
Step 3: Make an API call to the endpoint
You can make a synchronous or asynchronous call to the endpoint to interact with it.
• In a synchronous call, the client waits for the endpoint to return a response before continuing, then immediately returns the LLM’s response to the application user (client).
• In an asynchronous call, the client sends the request and continues with other work, handling the LLM’s response whenever it arrives instead of blocking while it waits.
In this example, you will make an asynchronous call with OpenLLM's built-in Python client, because it ties up fewer resources on the client or server side while waiting for a response:
import openllm
# Async API
async_client = openllm.AsyncHTTPClient("http://127.0.0.1:8001", timeout=120)
res = await async_client.generate("What are some strategies for effective marketing in the technology industry?", max_new_tokens=8192)
print(res.outputs[0].text)
# Sync API (alternative): the client blocks until the full response is returned
# client = openllm.HTTPClient('http://127.0.0.1:8001', timeout=120)
# res = client.generate("What are some strategies for effective marketing in the technology industry?", max_new_tokens=8192)
Here’s an example of the output:
#### OUTPUT; DO NOT COPY ####
The technology industry is a broad and complex sector that encompasses many different types of companies and products. As a result, there is no one-size-fits-all strategy for marketing in this industry. However, there are a few general principles that can be followed to help ensure success.
One important factor to consider when marketing in the technology industry is the target audience. This is because... (Clipped output)
You can also use cURL to send a synchronous request to the endpoint from a client; this example uses the streaming route, so the response arrives as server-sent events:
!curl -k -X 'POST' -N \
'http://127.0.0.1:8001/v1/generate_stream' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{"prompt":"What is LLM deployment?", "llm_config": {"max_new_tokens": 256}}'
#### ONE OF MANY LINES OF OUTPUT; DO NOT COPY ####
data: {"prompt":"What are some strategies for effective marketing in the technology industry?","finished":false,"outputs":[{"index":0,"text":" proposition","token_ids":[23381],"cumulative_logprob":-32.47161287814379,"logprobs":null,"finish_reason":null}],"prompt_token_ids":[1562,362,596,6000,312,3227,3273,272,248,2533,2354,42],"prompt_logprobs":null,"request_id":"openllm-c788f94c35004947a491adbd912aa38e"}
...
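If you would rather consume the same stream from Python, here is a sketch using the `requests` package. It reuses the `/v1/generate_stream` route and JSON payload from the cURL example and assumes each event has the `data: {...}` shape shown in the output above, printing each text chunk as it arrives.
import json
import requests

payload = {"prompt": "What is LLM deployment?", "llm_config": {"max_new_tokens": 256}}
with requests.post(
    "http://127.0.0.1:8001/v1/generate_stream",
    json=payload,
    headers={"accept": "text/event-stream"},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        chunk = json.loads(line[len("data:"):])
        print(chunk["outputs"][0]["text"], end="", flush=True)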
Great! You can run this on a cloud server or deploy it with BentoCloud, the OpenLLM-compatible platform from BentoML.
OpenLLM also supports OpenAI-compatible endpoints. See how to serve an OpenAI-compatible API in this notebook.
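As a rough illustration of what that looks like (assuming the server exposes the standard `/v1` routes and the `openai` Python package v1+ is installed; the API key is a dummy value because the local server does not check it), you could point the official OpenAI client at the same OpenLLM server:
from openai import OpenAI

# Reuse the local OpenLLM server started above as an OpenAI-compatible backend
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="na")

completion = client.completions.create(
    model="tiiuae/falcon-7b",
    prompt="What is LLM deployment?",
    max_tokens=128,
)
print(completion.choices[0].text)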
Recap: Open-source LLMs vs LLM API providers
To recap, weigh open-source LLMs against LLM API providers (closed-source LLMs) on cost, customizability, transparency, control, technical overhead, security, and scalability to decide which option best suits your project.
Phew! The last two lessons have been lengthy, and we admire your bravery in completing them. In the next lesson, you will learn how to monitor the LLMs you deploy to production to troubleshoot functional problems (poor-quality responses, etc.) and operational (server) problems. See you there!