Datasets & Fields
Define data sources and their field structure in semantic domain templates
Datasets
Datasets represent your data sources. There are two types:
Physical Datasets
A physical dataset maps directly to a database table or view:
datasets:
- name: orders
label: "Orders"
description: "Customer order transactions"
type: physical
analyticsRole: fact
connectionId: "conn_abc123"
connectionType: PostgreSQL
schema: public
table: orders
primaryKey:
- order_id
fields:
identifiers: []
dimensions: []
metrics: []Virtual Datasets
A virtual dataset uses a custom SQL query instead of a table:
You can author SQL-backed virtual datasets entirely from the UI via Add Dataset → SQL Dataset -- no hand-editing YAML required. See Virtual Datasets for the walkthrough.
datasets:
- name: monthly_revenue
label: "Monthly Revenue"
type: virtual
analyticsRole: snapshot
connectionId: "conn_abc123"
connectionType: PostgreSQL
sql: |
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
fields:
dimensions:
- name: month
label: "Month"
dataType: date
metrics:
- name: revenue
label: "Revenue"
dataType: number
sourceField: revenue
aggregate: SUM
- name: order_count
label: "Order Count"
dataType: number
sourceField: order_count
aggregate: SUMUse virtual datasets for pre-aggregated data, complex transformations, or combining tables with custom logic.
Dataset Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Unique identifier (snake_case) |
label | string | Yes | User-friendly display name |
description | string | No | Help text for users |
type | physical | virtual | Yes | Table-backed or SQL-backed |
analyticsRole | enum | No | Semantic role used by query planning and relationship diagnostics |
connectionId | string | Yes | ID of the data connection |
connectionType | enum | Yes | Database type |
catalog | string | No | Database catalog (BigQuery, Snowflake) |
schema | string | No | Database schema name |
table | string | Conditional | Table name (required for physical) |
sql | string | Conditional | SQL query (required for virtual) |
primaryKey | string[] | No | Primary key columns (important for fan-out detection) |
fields | object | Yes | Contains identifiers, dimensions, metrics arrays |
Supported connectionType values: PostgreSQL, MySQL, MSSQL, BigQuery, Redshift, Snowflake, clickhouse, S3, S3Tables, GoogleSheets, FileUpload, API
Dataset Analytics Role
analyticsRole tells Semaphor what kind of dataset this is. It is not a permission setting and it does not create joins by itself. It helps the semantic query planner, MCP tools, and relationship diagnostics choose safer defaults.
| Value | Use for | Example |
|---|---|---|
fact | Transaction or line-level tables with additive business measures | orders, invoice_lines, scale_tickets |
dimension | Descriptive entity tables used to group or filter facts | customers, suppliers, products |
event | Timestamped activity or log tables | page_events, machine_events |
snapshot | Point-in-time state tables | daily_inventory_snapshot, account_balances |
lookup | Small reference or mapping tables | status_codes, stage_lookup |
bridge | Associative tables connecting entities | user_groups, product_categories |
unknown | Insufficient information | temporary or newly imported tables |
Use fact for the dataset that owns the metric grain, and dimension or lookup for datasets joined to enrich that fact. For example, a supplier concentration analysis should usually start from fact_purchase_line and join dim_supplier, not start from every supplier and join purchases.
analyticsRole is especially useful for MCP and AI-assisted analysis. It lets Semaphor expose relationship-aware fields, prefer fact-rooted queries for metrics, and warn or refuse unsafe multi-fact joins that could inflate numbers.
Primary Date Field
Each dataset can declare one date dimension as its primary date field. This is the dataset's canonical date — the one dashboard-level features use when they need a "default" date to filter on.
Today it powers the Global Date Filter: a single date range set once on a dashboard that applies to every eligible card, without per-card setup.
Declare it with primaryDateField on the dataset, referencing a dimension by name:
datasets:
- name: orders
label: "Orders"
type: physical
primaryDateField: order_date
fields:
dimensions:
- name: order_date
label: "Order Date"
dataType: date
- name: ship_date
label: "Ship Date"
dataType: datetimeIn the semantic domain editor UI, open a dataset's dimensions and toggle Set as primary date field on the column you want. The toggle only appears on date or datetime dimensions.
Rules
A few things to keep in mind when setting a primary date field:
- The field you reference must exist on the dataset.
- It must be a
dateordatetimetype — other data types aren't eligible, and the toggle won't appear on them. - Renames are safe. If you rename the field later, the
primaryDateFieldreference updates automatically — no dashboard re-wiring needed.
Each dataset has one primary date field at a time. To switch to a different field, toggle Set as primary date field on the new date dimension (the previous one clears automatically), or update primaryDateField directly in YAML.
See Global Date Filter for the dashboard-side feature this unlocks.
Fields
Every dataset has three field buckets: identifiers, dimensions, and metrics. Each serves a distinct role in query generation.
Identifiers
Identifiers are primary and foreign key fields used for joins:
identifiers:
- name: order_id
label: "Order ID"
dataType: number
type: primary
- name: customer_id
label: "Customer ID"
dataType: number
type: foreign| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Column name in the database |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | string, number, boolean, date, datetime, json, geo |
type | primary | foreign | Yes | Primary keys enable fan-out detection |
sourceField | string | No | Physical column name when it differs from name |
hidden | boolean | No | Hide from end users |
tags | string[] | No | Categorization tags |
Always mark primary keys with type: primary. This enables Semaphor's fan-out prevention, which avoids metric inflation when joining one-to-many relationships.
Dimensions
Dimensions are groupable fields that appear in GROUP BY clauses:
dimensions:
- name: order_date
label: "Order Date"
dataType: date
granularity: day
- name: status
label: "Status"
dataType: string
- name: region
label: "Region"
dataType: string
isDisplayField: true
- name: order_month
label: "Order Month"
dataType: date
expression: "DATE_TRUNC('month', order_date)"| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Column name in the database |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | string, number, boolean, date, datetime, json, geo |
granularity | enum | No | Time granularity: day, week, month, quarter, year, hour, minute, second |
isDisplayField | boolean | No | Mark as the default display label for this dataset |
dateFormat | string | No | Custom date display format |
customFormat | string | No | Custom format string |
hidden | boolean | No | Hide from end users |
tags | string[] | No | Categorization tags |
sourceField | string | No | Physical column name. Required if no expression and name doesn't match a column |
expression | string | No | SQL expression for an inline dimension transform. Required if no sourceField. See Inline Expressions |
When using expression, the dimension is evaluated as a row-level SQL transform. Aggregate functions (SUM, COUNT, etc.) are not allowed in dimension expressions.
The expression field supports inline SQL transforms for dimensions. See Inline Expressions for syntax details and examples.
Metrics
Metrics are aggregatable fields that produce numeric results:
metrics:
- name: amount
label: "Revenue"
dataType: number
sourceField: amount
aggregate: SUM
- name: quantity
label: "Quantity"
dataType: number
sourceField: quantity
aggregate: SUM
- name: unit_price
label: "Unit Price"
dataType: number
sourceField: unit_price
aggregate: AVG
- name: discounted_total
label: "Discounted Total"
dataType: number
expression: "amount * (1 - discount_rate)"
aggregate: SUM| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Unique metric name |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | Usually number |
sourceField | string | Conditional | Physical column name. Required if no expression |
expression | string | Conditional | SQL expression. Required if no sourceField |
aggregate | enum | Recommended | Default aggregate for this metric: SUM, COUNT, AVG, MIN, MAX, MEDIAN, or DISTINCT |
format | string | object | No | Display format specification |
unit | string | No | Unit label (e.g., "USD", "kg") |
isDefault | boolean | No | Mark as the default metric for this dataset |
hidden | boolean | No | Hide from end users |
filters | array | No | Default filters applied to this metric |
tags | string[] | No | Categorization tags |
Every metric must have either sourceField or expression. If neither is provided, validation will fail.
Metric aggregate is the semantic default
aggregate defines how Semaphor rolls up the metric when a dashboard, MCP call, Data App, or agent asks for that metric without specifying a different aggregate. Legacy templates without aggregate are treated as SUM for compatibility, but new semantic domains should author it explicitly.
Choose the aggregate that matches the business meaning:
| Metric shape | Recommended aggregate | Example |
|---|---|---|
| Additive amount, value, cost, revenue, margin, quantity, duration, or weight | SUM | total revenue, net tons, gross margin value |
| Count metric or numeric indicator that represents occurrences | COUNT | order count, ticket count |
| Rate, ratio, score, price, or per-unit measurement | AVG | conversion rate, reliability score, unit cost per ton |
| Latest/highest/lowest observed value | MAX or MIN | latest snapshot version, lowest inventory level |
| Median-sensitive distribution metric | MEDIAN | median handle time |
| Unique-value count | DISTINCT | distinct customers |
Dashboards and MCP can still request a different aggregate for a specific analysis. For example, a metric modeled as aggregate: SUM can be analyzed with an explicit AVG override when the user asks for average margin instead of total margin. That override is query-specific; it does not change the semantic model.
The expression and format fields support inline SQL transforms and display formatting. See Inline Expressions for syntax details, aggregation behavior, and examples.
Complete Fields Example
Putting it all together for an orders dataset:
datasets:
- name: orders
label: "Orders"
type: physical
analyticsRole: fact
connectionId: "conn_abc123"
connectionType: PostgreSQL
schema: public
table: orders
primaryKey: [order_id]
fields:
identifiers:
- name: order_id
label: "Order ID"
dataType: number
type: primary
- name: customer_id
label: "Customer ID"
dataType: number
type: foreign
dimensions:
- name: order_date
label: "Order Date"
dataType: date
granularity: day
- name: status
label: "Status"
dataType: string
- name: channel
label: "Sales Channel"
dataType: string
- name: order_quarter
label: "Order Quarter"
dataType: date
expression: "DATE_TRUNC('quarter', order_date)"
metrics:
- name: amount
label: "Revenue"
dataType: number
sourceField: amount
aggregate: SUM
- name: quantity
label: "Quantity"
dataType: number
sourceField: quantity
aggregate: SUM