Prerequisites

To use data observability, complete the following steps:

  • Create a Soda Cloud account.
  • Set up a Soda Agent (optional).
  • Connect a data source.

Create a Soda Cloud account

If you don’t have a Soda Cloud account, book a demo. You’ll get a free trial to explore and test Soda.

Soda Agent (Optional)

This step is optional. Soda creates a Soda-hosted Agent with every account. You can think of an Agent as the bridge between your data sources and Soda Cloud. A Soda-hosted Agent runs in Soda’s cloud and securely connects to your data sources to scan for data quality issues.

If you are an admin and prefer to deploy your own agent, you can configure a self-hosted agent:

  • In Soda Cloud, go to your avatar > Agents
  • Click New Soda Agent and follow the setup instructions

Soda Agent Basics

There are two types of Soda Agents:

  1. Soda-hosted Agent: This is an out-of-the-box, ready-to-use agent that Soda provides and manages for you. It’s the quickest way to get started with Soda as it requires no installation or deployment. It supports connections to specific data sources like BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, and Snowflake.
  2. Self-hosted Agent: This is a version of the agent that you deploy in your own Kubernetes cluster within your cloud environment (like AWS, Azure, or Google Cloud). It gives you more control and supports a wider range of data sources.

A Soda Agent is essentially Soda Library (the core scanning technology) packaged as a containerized application that runs in Kubernetes. It acts as the bridge between your data sources and Soda Cloud, allowing users to:

  • Connect to data sources securely
  • Run scans to check data quality
  • Create and manage no-code checks directly in the Soda Cloud interface

The agent only sends metadata (not your actual data) to Soda Cloud, keeping your data secure within your environment.

Connect a Data Source

  1. In Soda Cloud, go to your avatar > Data Sources.
  2. Click New Data Source, then follow the guided steps to create the connection. Use the table below to understand what each field means and how to complete it:

Attributes

| Field or Label | Guidance |
| --- | --- |
| Data Source Label | Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan. |
| Agent | Select the Soda-hosted agent, or the name of a Soda Agent that you have previously set up in your secure environment. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan. |
| Check Schedule | Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression. |
| Starting At (UTC) | Select the time of day to run the scan. The default value is midnight. |
| Custom Cron Expression (Optional) | Write your own cron expression to define the schedule Soda Cloud uses to run scans. |
| Column Profiling Scan Schedule | Specify the time of day at which Soda runs the Automation scan. |
| Automation Scan Schedule | Specify the time of day at which Soda runs the daily anomaly dashboard scan. |
| Partition column suggestion (Optional) | Add any number of partition column suggestions. If a suggested column name fully matches a column discovered during metric monitoring or profiling, Soda uses that column as the partition column. The order of the suggestions matters: Soda checks them sequentially from top to bottom until it finds a match. If no match is found, Soda applies heuristics to determine the partition column. You can change the partition column at any time in the dataset settings. |

  3. Complete the connection configuration. These settings are specific to each data source (PostgreSQL, MySQL, Snowflake, etc.) and usually include connection details such as host, port, credentials, and database name.
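
The matching order described for the Partition column suggestion field can be illustrated with a short Python sketch. This is not Soda's implementation: the function name `pick_partition_column`, the sample column names, and the case-sensitive comparison are all assumptions made for illustration. The sketch shows only the documented behavior that suggestions are checked from top to bottom and that only a full name match counts.

```python
from typing import Optional, Sequence


def pick_partition_column(
    suggestions: Sequence[str],
    discovered_columns: Sequence[str],
) -> Optional[str]:
    """Return the first suggestion that fully matches a discovered column.

    Hypothetical helper, not part of any Soda API. Returns None when no
    suggestion matches, which is the point where Soda would fall back to
    its own heuristics (not modeled here).
    """
    discovered = set(discovered_columns)
    for name in suggestions:  # order matters: checked top to bottom
        if name in discovered:  # full-name match only (case-sensitive: an assumption)
            return name
    return None


# Illustrative column names, not taken from a real dataset.
columns = ["id", "created_at", "updated_at", "amount"]

# "event_time" is not in the dataset, so the next suggestion is tried.
print(pick_partition_column(["event_time", "updated_at", "created_at"], columns))

# No suggestion matches, so the result is None (heuristics would apply).
print(pick_partition_column(["event_time"], columns))
```

In this sketch, placing `updated_at` above `created_at` in the suggestion list is what makes it win, even though both columns exist in the dataset.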

Supported databases for data observability

Soda supports metric monitoring for multiple databases. Soda leverages metadata history when available. If metadata history isn’t available for your data source, Soda builds history gradually as scans occur.

Metric monitoring support

What’s Next?