BigQuery Data Monitoring with WhyLabs
Jan 17, 2023
You can now monitor the quality of your data in Google BigQuery with whylogs without writing any code. This is the first truly no-code data monitoring solution that WhyLabs offers, and we started with BigQuery because of its popularity, managed infrastructure, and integration options. Data quality monitoring is a key process for ensuring that the data your analytics and machine learning applications rely on is sound. This whylogs integration is a good fit for anyone who stores data in BigQuery and wants to monitor its quality on an ongoing basis without writing any code.
The core of the integration is an Apache Beam template that we publish to a public GCS bucket. The template can be used to create a Dataflow job that consumes from BigQuery in a few different ways, depending on how you configure it.
How to use it
Before starting, you’ll need to head over to WhyLabs and create a free account to get your organization ID, model ID, and API key. API keys can be generated from the settings menu after you log in. You’ll supply these parameters to the Dataflow job below.
To use the integration, you'll need a GCP account that has access to the BigQuery and Dataflow services. This section will have examples that use the Google Cloud console. Start by opening the Dataflow service and creating a job from a template.
Next, select the Custom Template option.
For the template location, enter whylabs-dataflow-templates/batch_bigquery_template/latest/batch_bigquery_template.json. You'll see the form automatically expand to highlight additional parameters you have to supply.
In this example we'll profile one of the public datasets hosted by Google using the following configuration options.
- Output GCS path - gs://template_test_bucket/my_job/profile. Pick a bucket you own here. This determines where the whylogs profiles are written.
- Input Mode - BIGQUERY_TABLE. This tells the template to consume an entire BigQuery table.
- Date column - block_timestamp. This is the column in the dataset that should be used for time; it should have a type of TIMESTAMP in the BigQuery schema. The dataset we'll be using happens to use this name. It will be different for your data.
- Organization ID - Something like org-abc123. This is the organization ID of your WhyLabs account. You can get one for free by signing up at hub.whylabsapp.com.
- Model ID - The model ID that you'll upload these whylogs profiles to. You can create one for free by signing up at hub.whylabsapp.com.
- API Key - An API key from your WhyLabs account. You can generate one from the settings menu.
There are a few additional parameters to set as well. You'll find them under the optional parameters section. They're considered optional because they're conditionally required depending on which input mode you select.
We only need to supply two more, but the image shows all of the optional parameters as well.
- Input BigQuery Table - bigquery-public-data.crypto_bitcoin_cash.transactions. It's a reasonably sized free dataset with a time column.
- Pandas Grouper frequency - Y. The job uses pandas to split the data by time under the hood; this tells it to split by year. In this example we'll generate roughly 15 whylogs profiles, one for each year of data.
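The yearly splitting that the Grouper frequency controls can be sketched with a few lines of pandas. This is an illustration with synthetic data, not the template's actual code; the block_timestamp name mirrors the date column used above.

```python
import pandas as pd

# Synthetic stand-in for the BigQuery rows; block_timestamp mirrors the
# dataset's TIMESTAMP column.
df = pd.DataFrame({
    "block_timestamp": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2020-01-02", "2021-11-30"]
    ),
    "value": [1, 2, 3, 4],
})

# pd.Grouper with freq="Y" buckets rows by calendar year -- the same kind of
# split the job performs before profiling each group with whylogs.
groups = df.groupby(pd.Grouper(key="block_timestamp", freq="Y"))
print(len(groups))  # 3 -- one group per year present in the data
```

Each resulting group would become one whylogs profile, which is why a yearly frequency over roughly 15 years of data yields roughly 15 profiles.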
Create the job and you'll see a graph like the following.
Once the job finishes, you'll see the profiles in GCS and your WhyLabs account. We created profiles for several years of data here, but the normal use case would be to create a profile for every day or hour as you receive new data.
All of this could have been run from the gcloud command line too with a command like the following.
gcloud dataflow flex-template run "my-job" \
  --template-file-gcs-location gs://whylabs-dataflow-templates/batch_bigquery_template/latest/batch_bigquery_template.json \
  --parameters input-mode=BIGQUERY_TABLE \
  --parameters input-bigquery-table='bigquery-public-data.crypto_bitcoin_cash.transactions' \
  --parameters date-column=block_timestamp \
  --parameters date-grouping-frequency=Y \
  --parameters org-id=MY_ORG_ID \
  --parameters dataset-id=MY_MODEL_ID \
  --parameters output=gs://my-bucket/dataset-timestamp-test/dataset_profile \
  --parameters api-key=MY_KEY \
  --region us-central1
How it works
The job is designed to generate a handful of whylogs profiles. For a model that you're monitoring on a daily basis, the job produces one profile per day of data. As the number of days covered by the job grows, so does the time it takes to generate the profiles. While you could use this job to generate a profile for every day of several years of data, it would not perform well.
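To get a feel for the scale involved, this small pandas sketch (synthetic dates, not the real dataset) counts how many profiles a daily versus a yearly grouping frequency would produce over three years of data:

```python
import pandas as pd

# Three years of synthetic daily timestamps standing in for a date column.
df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", "2022-12-31", freq="D"),
})

# The job emits one whylogs profile per group, so the group count is the
# profile count for a given grouping frequency.
daily_profiles = df.groupby(pd.Grouper(key="ts", freq="D")).ngroups
yearly_profiles = df.groupby(pd.Grouper(key="ts", freq="Y")).ngroups

print(daily_profiles)   # 1096 -- one profile per day, including a leap day
print(yearly_profiles)  # 3
```

A daily frequency over a multi-year table means over a thousand profiles in a single job, which is the regime where this template's runtime starts to suffer.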
If you have use cases that would benefit from generating many profiles per job then reach out to us. We have several Dataflow pipeline configurations that we didn't publish as templates and one of them might suit your needs.
Once everything is in place, the next step is to set up monitors on your data.
We'll be adding features to this integration over time. The most exciting one is support for streaming mode, which will allow real-time profiling of Dataflow-based data pipelines.
We'll also be adding additional data sources. This is technically a Dataflow integration that happens to support BigQuery as an input right now, but Dataflow can consume more services than just BigQuery.
Get started by creating a free WhyLabs account.