Glassdoor Decreases Latency Overhead and Improves Data Monitoring with WhyLabs
Aug 17, 2023
This blog was written by Lanqi Fei, Senior ML Scientist at Glassdoor, Jamie Broomall, Senior Software Engineer at WhyLabs, and Natalia Skaczkowska-Drabczyk, Customer Success Data Scientist at WhyLabs.
The challenge of integration latency
Consider a scenario where we want to integrate a new tool into an existing service that operates in real time, possibly behind a user interface. We need to make sure that the latency of the service in production remains acceptable after the integration, while keeping overall maintenance costs low. There are trade-offs to be made, and the right choice depends on the individual characteristics of the service and of the newly integrated function. Simplifying that function is a common way to gain a significant advantage in this optimization game.
This blog, written in collaboration between Glassdoor and WhyLabs, describes a real-world instance of an integration latency challenge and gives a detailed walk-through of the changes applied within whylogs (an open-source data logging library maintained by WhyLabs) to mitigate it.
What are the best options for reducing latency?
There are a few options for reducing latency when integrating a new function into an existing service.
1. Restructuring your service or architecture to allow an early response to the caller before doing the additional work. This may be as simple as using an async call pattern with a log statement or as complex as adopting a DAG framework. You can think of your service as a graph: its nodes are the latency-critical tasks, and ideally those should be executed, instrumented, and tested independently.
- Using a DAG can be a good way to scale a service out to a large number of new features and integrations while maintaining latency requirements and managing the complexity of the critical path to a high-quality user response. The downside is the added complexity and the associated development and maintenance costs.
- Starting with an async call pattern is often sufficient, is the appropriate first step for simple scenarios, and is much cheaper to set up (see the sketch after this list).
2. Not restructuring your service and paying the additional cost in latency or computational resources. If your service is relatively simple, you can integrate the new function in the same linear execution path as the service request. The drawback is that each new function call adds to the overall latency and cost of running the service. Still, this can be the most cost-effective route if your service is already fast enough to afford the additional latency, or if you can scale up hardware resources to accommodate the new function's runtime.
3. Optimizing the integrated function. If the function you're integrating is open source, analyzing its implementation can open up optimization avenues. You may re-scope the function to essential operations only, or customize the data flow in a way that better serves your use case. This case study covers exactly this approach.
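For the async call pattern mentioned in the first option, a minimal sketch (with hypothetical handle_request and log_features functions, not Glassdoor's actual service code) could look like this:

from concurrent.futures import ThreadPoolExecutor

# a single background worker so that logging never blocks the request thread
_executor = ThreadPoolExecutor(max_workers=1)

def log_features(features: dict) -> None:
    # placeholder for the additional work, e.g. profiling the row with whylogs
    pass

def handle_request(features: dict) -> dict:
    prediction = {"salary_mean": 100_000.0}   # placeholder for the real model call
    _executor.submit(log_features, features)  # fire and forget: the response is not delayed
    return prediction

The trade-off is that errors and backpressure in the background work now need their own handling, which is part of the maintenance cost discussed above.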
Glassdoor case study
Glassdoor is investing in improving its ML model monitoring capabilities. Specifically, the team has started using the WhyLabs Observability Platform and is onboarding Glassdoor's ML models onto it. Given that Glassdoor is recognized for its salary transparency, one of the most important models to monitor is the salary estimate model.
Glassdoor’s salary estimate model is a regression model that predicts the mean and the range of salary given a job title, location, and employer. Integrating whylogs (the open-source component of the WhyLabs observability stack) into this service will enable capturing potential data drift in predictions, detecting anomalies in model inputs, and understanding user traffic better.
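For context, profiling a single prediction with whylogs takes only a few lines. The snippet below is a minimal sketch with made-up feature names, separate from the production integration shown in the appendix:

import whylogs as why

row = {"job_title": "Data Scientist", "location": "Seattle, WA",
       "employer": "ExampleCorp", "predicted_mean_salary": 125000.0}
results = why.log(row)                    # build a statistical profile of this row
print(results.view().to_pandas().head())  # inspect the tracked metrics per column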
One of the challenges Glassdoor faced while integrating with WhyLabs was meeting the strict latency requirements imposed by the real-time nature of the salary estimation service. More specifically, the latency increase due to the whylogs integration could not exceed 30 milliseconds; in the early phase of the integration, however, the team measured an increase of about 60 milliseconds. After considering a few different solutions, they teamed up with the WhyLabs engineers in a joint effort to customize the whylogs code for their specific use case.
Solution summary
The whylogs open-source data logging library was originally optimized for batch processing efficiency, not for single-row latency. In Glassdoor's scenario, restructuring the service or adding async processing was not practical on a short timescale, yet there was clear value in integrating whylogs profiling if it could be done in-process within the service.
The WhyLabs team accepted the challenge and applied two changes in whylogs:
- Using direct type inspection over pandas Series masks
- Optimizing the code for the single row scenario by processing the data as rows of scalars instead of single item pandas Series
These changes successfully decreased the latency overhead caused by whylogs profiling from 68 ms to 8 ms! In the next section, you’ll find a detailed description of that optimization process.
How to optimize a Python function call in an open source library
First, it's important to establish the criteria for success. In this case, we focused on the specific scenario Glassdoor raised: logging a single row in-process in a service, and keeping the latency overhead of profiling the inputs with whylogs below 30 ms per call, starting from roughly 60 ms per call.
Second, it's important to have a cheap and reproducible way to measure latency. To achieve that, we made sure our test environment had similar specifications to Glassdoor's. Since latency here is scoped to a single thread of execution, we didn't need to worry about core counts or parallelism; we focused on a reasonable approximate performance test for the single-threaded latency of processing a single row of data similar in size and type to Glassdoor's latency tests. From this, we created a simple whylogs test, test_rolling_logger_latency_row_benchmark, and added it to whylogs along with some runtime profiling.
We used cProfile to gather runtime performance data for this scenario and looked for bottlenecks. To keep the arithmetic simple, the test logs a single row of mocked data 1000 times, so that the overall time in seconds approximates the time in milliseconds per function call. The test runs in less than a minute.
From the cProfile stats you can look at the output tables sorted by function call cumulative time or visualize the data in the form of a flame graph (see below).
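A stripped-down version of such a measurement might look like the sketch below. This is not the actual test_rolling_logger_latency_row_benchmark: the mock row is invented, and why.log is called directly for simplicity instead of the rolling logger used in the real integration.

import cProfile
import pstats

import whylogs as why

mock_row = {"job_title": "Data Scientist", "location": "Seattle, WA", "predicted_mean": 125000.0}

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):    # 1000 single-row calls, so total seconds roughly equals ms per call
    why.log(mock_row)    # profile one row per call, as in the service scenario
profiler.disable()

# sort by cumulative time to surface the most expensive call paths
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)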
From the test output we saw that the whylogs preprocessing functions accounted for a surprisingly large portion of the overall time to log the inputs (see the _pandas_split function in the flame graph and the red underlined record in the runtime profile above). The whylogs preprocessing step inspects the inputs and splits the values by type into batches so that the Apache DataSketches algorithms, which calculate statistical summaries of the inputs, can perform efficient array-based updates. This splitting took a disproportionately long time when logging a single row per whylogs call; when a single call operates over thousands of rows, the preprocessing cost is amortized across them and becomes small.
Looking at the most expensive functions in whylogs preprocessing, we saw that two configurable experimental features, checking for and counting the special numeric values inf and NaN, took about half of the preprocessing time. The overall time spent performing various isinstance checks was a surprisingly large portion of the time to profile a row, even though whylogs treated this part of the call stack as the "relatively small preprocessing" step before the math-heavy datasketches that compute numeric distributions and cardinality estimates.
We realized we could turn off NaN and inf counting by default (making it an opt-in feature), which should roughly halve the latency, theoretically bringing us very close to the 30 ms target after starting at around 60 ms in the baseline test.
The second candidate change was a new preprocessing method optimized for the special case of processing single rows. In this single-row preprocessing, we would replace the flat list of type-check filters with a logically equivalent tree for categorizing and inferring the type of an input value: instead of N checks per value, we expected roughly log N. The idea is sketched below.
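The sketch below illustrates the idea on a toy set of types (a simplification for illustration, not the actual whylogs implementation):

from fractions import Fraction

# Flat chain of checks: in the worst case every value walks all N branches.
def categorize_flat(value):
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, Fraction):
        return "fraction"
    if isinstance(value, str):
        return "string"
    return "other"

# Tree of checks: branch on a broad category first, then refine,
# so a value passes through roughly log N checks instead of N.
def categorize_tree(value):
    if isinstance(value, (bool, int, float, Fraction)):  # numeric branch
        if isinstance(value, (bool, int)):
            return "bool" if isinstance(value, bool) else "int"
        return "float" if isinstance(value, float) else "fraction"
    return "string" if isinstance(value, str) else "other"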
These two changes were applied and then we reran the simple performance test on 1000 rows of mock data to see how close we were to reaching our latency goals.
After measuring this, we saw a larger-than-expected improvement: isinstance() checks on individual scalar values turn out to be much cheaper than pandas Series masking on an array of length one. We went from approximately 60 ms per call to under 10 ms per call to profile the mock data, comfortably beating the 30 ms per call target.
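A rough way to see that gap for yourself is the micro-benchmark below (illustrative only, not the exact operations performed inside whylogs preprocessing):

import timeit

import numpy as np
import pandas as pd

value = 123000.0

def check_scalar():
    # direct type inspection of a single scalar value
    return isinstance(value, float) and not np.isnan(value)

def check_series_mask():
    # build a one-element Series and apply vectorized masks to it
    s = pd.Series([value])
    mask = s.apply(lambda v: isinstance(v, float)) & ~s.isna()
    return bool(mask.iloc[0])

print("scalar checks:", timeit.timeit(check_scalar, number=10_000))
print("series masks: ", timeit.timeit(check_series_mask, number=10_000))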
The final step was testing the resulting candidate build in Glassdoor's environment to see whether the gains observed in the simple whylogs test translated to the more realistic setting. They did, as presented in the next section.
Results
The table below compares the latency of the salary estimation service for a single prediction, as reported by Glassdoor. The optimized whylogs build satisfied the strict latency requirements of Glassdoor's real-time service.

| Configuration | Median latency |
| --- | --- |
| Model inference, no monitoring | 32 ms |
| Monitored service, default whylogs | 100 ms |
| Monitored service, optimized whylogs | 40 ms |
Comparing the execution profiles of the test script using the default versus the optimized version of whylogs, we also see a dramatic decrease in execution time, from 58 seconds to 2 seconds, and a substantial drop in the number of function calls, from 10 million to 4 million. The optimized code, when profiled and visualized as a flame graph, shows an entirely different, much simpler structure than the flame graph generated from the unoptimized code.
Conclusion
This case study demonstrates two points.
Firstly, it is a worthwhile exercise to look into the implementation of the function being integrated into your service. There may be tweaks available that speed the function up enough to spare you the pain of restructuring your service or to save you some compute budget.
Secondly, it shows that ML observability can be lightweight: despite the rich statistical profiles calculated by whylogs, the overhead introduced by profiling was just 8 ms per inference, with the actual inference taking 32 ms. This seems like a fair price for the visibility and insights that WhyLabs provides for Glassdoor's ML systems.
Appendix
# import necessary packages
import sys
from typing import Dict, List

import mlflow.pyfunc
import pandas as pd

import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
from whylogs.core import DatasetSchema


class MLflowModelPrediction(mlflow.pyfunc.PythonModel):
    why_logger = None  # shared rolling logger, created once in load_context
    ...

    def load_context(self, context: mlflow.pyfunc.PythonModelContext) -> None:
        # create a rolling logger that rotates profiles every 15 minutes
        self.why_logger = why.logger(mode="rolling",
                                     interval=15,
                                     schema=DatasetSchema(cache_size=1024 * 32),
                                     fork=True,
                                     when="M",
                                     base_name="test_inference")
        # send the rolled profiles to the WhyLabs platform
        self.why_logger.append_writer(writer=WhyLabsWriter())
        ...

    def predict(self, context: mlflow.pyfunc.PythonModelContext,
                model_input: pd.DataFrame) -> List[Dict[str, float]]:
        ...
        # compute predictions and store them in a list of data dictionaries
        # (model-specific code elided)
        dict_predictions = compute_predictions()
        if self.why_logger is not None:
            try:
                # profile each prediction row; failures must never break inference
                for p in dict_predictions:
                    self.why_logger.log(p)
            except Exception:
                print("Whylogs Error: {}".format(sys.exc_info()))
        ...

    def __del__(self):
        # close the rolling logger so pending profiles are flushed and written
        if self.why_logger is not None:
            self.why_logger.close()