LLM Deployment: Using Open-source LLMs
Introduction/overview
Key ideas
- Before adopting an open-source LLM, learn how to choose the right model for your needs, understand how to customize and scale it, and be aware of the technical challenges and security considerations involved.
- Whether deploying open-source LLMs or using LLM APIs, cost considerations must be balanced with performance and output quality.
- Choosing between the two involves selecting the right model and configuring its parameters efficiently, and understanding the trade-offs between resource utilization, computational demands, and the quality or creativity of the generated content.
An open-source LLM allows anyone to access its source code and modify, distribute, or use it in their projects without paying a license fee.
Because it is so adaptable, you can use it for various purposes, from academic research to commercial deployment. You can also fine-tune and customize it to fit your specific needs (data, tech stack, tasks, and applications).
On the other hand, a closed-source LLM, like OpenAI's GPT series (which you saw in the last lesson), stays proprietary: only the company that owns it can access and modify it. This exclusivity usually comes with licensing fees and restrictions on how you can use the product.
In Course 1, we mentioned some of the best open-source LLMs and their architectural makeup. Here’s a recap of notable open-source LLMs:
- Llama 2: Llama 2 is an efficient transformer model that uses innovative normalization and embedding schemes. Meta trained it on openly accessible web data. It is free for research and commercial use, but you may need to fine-tune it for specific tasks.
- Mixtral 8x7B: Recent advances in sparse mixture-of-experts (MoE) models have led to the development of Mixtral 8x7B. With its open weights and Apache 2.0 licensing, Mixtral outperforms existing models like Llama 2 70B and GPT-3.5 in speed, efficiency, and scale.
- Falcon: Falcon is an autoregressive LLM trained on the high-quality RefinedWeb dataset and licensed under the Apache License 2.0. It is free and available in several sizes (7B, 40B, and 180B) to suit different computational and application requirements (a minimal loading sketch follows this list).
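To make the "modify and use it in your projects" point concrete, here is a minimal sketch (separate from the walkthrough below) of loading one of these open models locally with Hugging Face transformers. It assumes `transformers`, `torch`, and `accelerate` are installed and that a GPU with enough memory for the weights in `float16` is available; the prompt and generation settings are purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"  # any of the open models above can be substituted

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # place layers on the available GPU(s); needs accelerate
)

inputs = tokenizer("Open-source LLMs are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))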
Pros of using an open-source LLM
- Cost-efficiency: Open-source LLMs are a less expensive alternative because there are no licensing fees. The only costs are the infrastructure and the operational overhead of running the software.
- Customizability: You can change the code in many ways because it is open. You can use your data to fine-tune the LLM to fit your needs and applications.
- Transparency and trust: Access to the full code builds trust and makes ethical review possible. You can also identify possible biases and understand the LLM’s underlying architecture.
- Full control: You are in charge of the model and the data it consumes during fine-tuning or in production. This is essential for sensitive applications that require data privacy.
Cons of using an open-source LLM
- Technical needs: Setting up and maintaining an open-source LLM requires extensive technical know-how, such as training models, managing infrastructure, and creating APIs.
- Security management: If you deploy on-premise, you are responsible for security. You must keep the model and data safe from people who should not have access to them.
- Performance adaptability: Performance can change depending on the open-source model you choose and the available resources (infrastructure, data, etc.), which could impact results and the user experience.
- Scalability considerations: Scaling the open-source LLM in production will require scaling your infrastructure, which can be complex and expensive.
Open-source LLM case study walkthrough: Q&A system
Let's use OpenLLM, one of the deployment tools you learned about in the previous lesson, to deploy Falcon 7B to a local endpoint where you can send requests and receive responses.
We successfully ran this walkthrough in a Colab notebook connected to a hosted A100 runtime. Please use the same or a similar-grade runtime that supports the `float16` data type (typically GPUs with compute capability of at least 8.0).
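If you are unsure whether your runtime meets that bar, a quick check like the sketch below (assuming a CUDA GPU is attached and PyTorch is installed) prints the device name and compute capability:
import torch

# Fail fast if no GPU is attached to the runtime
assert torch.cuda.is_available(), "No CUDA GPU detected in this runtime"
name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"{name}: compute capability {major}.{minor}")  # an A100 reports 8.0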
Step 1: Install the required libraries
In production, the LLM needs a suitable runtime implementation to avoid unexpected behavior, such as underutilized compute, compatibility issues with existing production runtime environments, or significant performance variations.
Most models are compatible with a PyTorch runtime environment or vLLM, and OpenLLM supports both. vLLM generally provides better optimization and performance than the PyTorch runtime in production. Here’s how to install OpenLLM with the vLLM backend:
!pip install "openllm[vllm]"
🚨 To get the most up-to-date, high-performance version of the Falcon model code, upgrade to the latest version of transformers:
!pip install --upgrade transformers
Step 2: Spin up an API endpoint for Falcon 7B
Next, let’s start the server! The command below launches an OpenLLM model server instance in the background (with `nohup`) to serve the `tiiuae/falcon-7b` model on port 8001, using the `float16` data type for computation and the `vllm` backend, with all output and errors redirected to a log file.
!nohup openllm start tiiuae/falcon-7b --port 8001 --dtype float16 --backend vllm > openllm.log 2>&1 &
This keeps the server running, with its output logged, even if the terminal session ends.
Before you interact with the OpenLLM server, it's crucial to ensure that it is up and running:
! curl -i http://127.0.0.1:8001/readyz
The curl command output should start with `HTTP/1.1 200 OK`, which means everything is in order; move on to the next step. If it says `curl: (7) Failed to connect to localhost…`, check `./openllm.log`: the server has likely failed to start or is still starting. If it says `HTTP/1.1 503 Service Unavailable`, the server is still starting; wait a bit and retry.
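If you prefer not to re-run `curl` by hand, a small polling loop like this sketch waits until the server reports ready. It uses the `requests` package and the same `/readyz` route and port as above; the retry count and interval are arbitrary choices.
import time
import requests

for attempt in range(30):
    try:
        resp = requests.get("http://127.0.0.1:8001/readyz", timeout=5)
        if resp.status_code == 200:
            print("Server is ready")
            break
        print(f"Not ready yet (HTTP {resp.status_code}), retrying...")
    except requests.exceptions.ConnectionError:
        print("Server not reachable yet, retrying...")
    time.sleep(10)
else:
    # Loop finished without a 200 response
    print("Server did not become ready; check openllm.log")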
Step 3: Make an API call to the endpoint
You can make a synchronous or asynchronous call to the endpoint to interact with it.
• In a synchronous call, the client waits for the endpoint to return a response before continuing, then immediately returns the LLM’s response to the application user (client).
• In an asynchronous call, the client sends the request and continues with other work, handling the LLM’s response whenever it arrives instead of blocking while it waits.
In this example, you will make an asynchronous call with OpenLLM's built-in Python client, because it ties up fewer resources on the client or server side while waiting for a response:
import openllm
# Async API
async_client = openllm.AsyncHTTPClient("http://127.0.0.1:8001", timeout=120)
res = await async_client.generate("What are some strategies for effective marketing in the technology industry?", max_new_tokens=8192)
print(res.outputs[0].text)
# Sync API (alternative): the client blocks until the full response is returned
# client = openllm.HTTPClient('http://127.0.0.1:8001', timeout=120)
# res = client.generate("What are some strategies for effective marketing in the technology industry?", max_new_tokens=8192)
Here’s an example of the output:
#### OUTPUT; DO NOT COPY ####
The technology industry is a broad and complex sector that encompasses many different types of companies and products. As a result, there is no one-size-fits-all strategy for marketing in this industry. However, there are a few general principles that can be followed to help ensure success.
One important factor to consider when marketing in the technology industry is the target audience. This is because... (Clipped output)
You can also use cURL to send a synchronous request to the endpoint from a client; this example uses the streaming route, so the response arrives as server-sent events:
!curl -k -X 'POST' -N \
'http://127.0.0.1:8001/v1/generate_stream' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{"prompt":"What is LLM deployment?", "llm_config": {"max_new_tokens": 256}}'
#### ONE OF MANY LINES OF OUTPUT; DO NOT COPY ####
data: {"prompt":"What are some strategies for effective marketing in the technology industry?","finished":false,"outputs":[{"index":0,"text":" proposition","token_ids":[23381],"cumulative_logprob":-32.47161287814379,"logprobs":null,"finish_reason":null}],"prompt_token_ids":[1562,362,596,6000,312,3227,3273,272,248,2533,2354,42],"prompt_logprobs":null,"request_id":"openllm-c788f94c35004947a491adbd912aa38e"}
...
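If you would rather consume the same stream from Python, here is a sketch using the `requests` package. It reuses the `/v1/generate_stream` route and JSON payload from the cURL example and assumes each event has the `data: {...}` shape shown in the output above, printing each text chunk as it arrives.
import json
import requests

payload = {"prompt": "What is LLM deployment?", "llm_config": {"max_new_tokens": 256}}
with requests.post(
    "http://127.0.0.1:8001/v1/generate_stream",
    json=payload,
    headers={"accept": "text/event-stream"},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        chunk = json.loads(line[len("data:"):])
        print(chunk["outputs"][0]["text"], end="", flush=True)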
Great! You can run this on a cloud server or deploy it with BentoCloud, the OpenLLM-compatible platform from BentoML.
OpenLLM also supports OpenAI-compatible endpoints. See how to serve an OpenAI-compatible API in this notebook.
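As a rough illustration of what that looks like (assuming the server exposes the standard `/v1` routes and the `openai` Python package v1+ is installed; the API key is a dummy value because the local server does not check it), you could point the official OpenAI client at the same OpenLLM server:
from openai import OpenAI

# Reuse the local OpenLLM server started above as an OpenAI-compatible backend
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="na")

completion = client.completions.create(
    model="tiiuae/falcon-7b",
    prompt="What is LLM deployment?",
    max_tokens=128,
)
print(completion.choices[0].text)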
Recap: Open-source LLMs vs LLM API providers
To recap, weigh open-source LLMs against LLM API providers (closed-source LLMs) on cost, customizability, transparency, control, technical overhead, security, and scalability to decide which option best suits your project.
Phew! The last two lessons have been lengthy, and we admire your bravery in completing them. In the next lesson, you will learn how to monitor the LLMs you deploy to production to troubleshoot functional problems (poor-quality responses, etc.) and operational (server) problems. See you there!