Virtual Datasets
Author SQL-backed datasets in your semantic domain using the SQL Dataset mode
A virtual dataset is a semantic dataset defined by SQL instead of a physical table. The SQL is stored directly in the domain template and bound to a single data connection. At query time, Semaphor embeds it as a subquery in the generated SQL.
Reach for a virtual dataset when you want to:
- Expose a pre-aggregated slice of a large table (e.g. monthly revenue).
- Reshape data or encode complex joins once and reuse them across dashboards.
- Present a tailored view of a source table without asking users to learn the underlying schema.
Creating a virtual dataset
- In the domain editor, on the Datasets tab, click Add Dataset.
- Switch to the SQL Dataset mode.
- Choose the owning Connection. This becomes the dataset's runtime connection and is locked after creation.
- Enter a Name (a simple identifier like
monthly_revenue-- no dots or qualified paths) and a user-friendly Label. - Write your query in the SQL editor.
- Click Preview Columns to introspect the query. Semaphor runs your SQL against the selected connection and returns the discovered columns, or shows SQL errors inline.
- Once Preview succeeds, click Add Dataset to finish.
SQL examples
All examples assume a connection bound to a single database. Fully qualify object references -- SQL Dataset mode does not apply a default schema.
Simple pre-aggregation
A rollup of orders by month:
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM analytics.orders
GROUP BY DATE_TRUNC('month', order_date)Subquery
Filter completed orders in an inner query, then join to customers:
SELECT
o.order_id,
o.order_date,
o.amount,
c.customer_id,
c.segment
FROM (
SELECT order_id, order_date, customer_id, amount
FROM analytics.orders
WHERE status = 'completed'
) AS o
JOIN analytics.customers AS c
ON c.customer_id = o.customer_idCTE (WITH ... AS)
CTEs work because Semaphor wraps your SQL as a subquery at query time, and every supported dialect allows CTEs inside a subquery. The example below builds monthly revenue per customer and then segments customers by their 90-day trailing spend:
WITH monthly_customer_revenue AS (
SELECT
customer_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM analytics.orders
WHERE status = 'completed'
GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
customer_trailing_90 AS (
SELECT
customer_id,
SUM(revenue) AS revenue_90d
FROM monthly_customer_revenue
WHERE month >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '3 months'
GROUP BY customer_id
)
SELECT
c.customer_id,
c.segment,
COALESCE(t.revenue_90d, 0) AS revenue_90d,
CASE
WHEN t.revenue_90d > 10000 THEN 'High'
WHEN t.revenue_90d > 1000 THEN 'Mid'
ELSE 'Low'
END AS spend_tier
FROM analytics.customers AS c
LEFT JOIN customer_trailing_90 AS t
ON t.customer_id = c.customer_idWindow function inside a CTE
Running totals and ranks compose the same way:
WITH daily_sales AS (
SELECT
order_date,
SUM(amount) AS revenue
FROM analytics.orders
GROUP BY order_date
)
SELECT
order_date,
revenue,
SUM(revenue) OVER (ORDER BY order_date) AS revenue_running_total,
RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
FROM daily_salesTemplating virtual-dataset SQL
Virtual-dataset SQL supports the same template expressions as custom SQL cards. You can branch on dashboard control values, expand multi-select controls into IN lists, and gate optional predicates on filter presence.
For example, a virtual dataset that switches its grouping grain based on a dashboard control:
SELECT
{% if param('grain').value == 'day' %}
DATE_TRUNC('day', order_date)
{% elif param('grain').value == 'week' %}
DATE_TRUNC('week', order_date)
{% else %}
DATE_TRUNC('month', order_date)
{% endif %} AS bucket,
SUM(amount) AS revenue
FROM analytics.orders
GROUP BY 1See Template Expressions for the full syntax and resolved-SQL examples.
Template expressions are supported on database-backed connections only. Virtual datasets built on cached connections -- S3, S3 Tables, Google Sheets, and uploaded files -- reject template syntax at save and run time.
Authoring rules
- Fully qualify tables. Use
schema.tableordatabase.schema.tabledepending on your dialect (e.g.analytics.orders,my_project.analytics.orders). SQL Dataset mode does not apply asearch_pathor session schema. - No trailing semicolons. Your SQL is embedded as a subquery, so a trailing
;will fail preview and runtime. - Simple dataset names. The dataset Name must be a semantic-safe identifier. Dotted or qualified names are rejected.
- One owning connection. The connection chosen at creation is locked. Virtual datasets are single-connection only.
- Keep dialect-valid SQL. Use syntax your target database supports.
- Template expressions are database-only. Use
{{ param(...) }},{{ filter(...) }}, and{% if %}blocks freely on database connections. Cached-source connections (S3, S3 Tables, Google Sheets, uploaded files) reject template syntax.
Editing a virtual dataset
- Open the dataset from the Datasets tab and click Edit.
- Update the SQL. The owning connection stays read-only.
- Click Preview Columns to reintrospect against your changed SQL.
- Reconcile fields inline in the same dialog:
- Newly returned columns -- select the ones you want to add.
- Still-resolvable fields -- kept automatically; only source-derived attributes like data type refresh. Your labels, descriptions, tags, hidden state, bucket placement, and format settings are preserved.
- No-longer-returned fields -- shown as unresolved. Select the ones you want to remove.
- Click Apply to write the dataset change into the unsaved domain draft.
- Click Save on the domain to persist.
Semaphor uses a two-step save model for all semantic edits:
- Apply writes the current dataset draft into the in-memory domain draft. Nothing is persisted yet.
- Save on the domain is the only action that writes the template to the database.
Nothing auto-adds or auto-deletes fields. Every add and every remove is an explicit choice you make in the reconciliation UI.
Virtual datasets at query time
From an end user's perspective, virtual datasets are indistinguishable from physical datasets. They show up in the field list the same way, auto-joins resolve identically, and filters apply the same. See the User Guide for how users work with fields across datasets.
Under the hood, Semaphor embeds your saved SQL as a subquery (for example, FROM (SELECT ...) AS t0) inside the generated query, then applies relationships, filters, and aggregations from the card configuration.
SQL Dataset vs. Data Model
Semaphor's Add Dataset dialog has three modes: Tables, Data Model, and SQL Dataset.
- SQL Dataset is the recommended way to author new SQL-backed virtual datasets. The SQL lives directly on the dataset in your domain template, creation and edit flows share the same Preview-based authoring experience, and the dataset can be maintained inline without a separate object.
- Data Model is a still-supported path that references a reusable Data Model by ID. Use it if it already fits your team's workflow, or if you want the same SQL reused across multiple domains.
Both paths execute through the same semantic query pipeline.
Next steps
- Datasets & Fields -- reference the full dataset and field schema.
- Troubleshooting -- resolve common SQL and drift errors.
- User Guide -- how end users query across datasets.