Datasets & Fields
Define data sources and their field structure in semantic domain templates
Datasets
Datasets represent your data sources. There are two types:
Physical Datasets
A physical dataset maps directly to a database table or view:
datasets:
- name: orders
label: "Orders"
description: "Customer order transactions"
type: physical
connectionId: "conn_abc123"
connectionType: PostgreSQL
schema: public
table: orders
primaryKey:
- order_id
fields:
identifiers: []
dimensions: []
metrics: []Virtual Datasets
A virtual dataset uses a custom SQL query instead of a table:
datasets:
- name: monthly_revenue
label: "Monthly Revenue"
type: virtual
connectionId: "conn_abc123"
connectionType: PostgreSQL
sql: |
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
fields:
dimensions:
- name: month
label: "Month"
dataType: date
metrics:
- name: revenue
label: "Revenue"
dataType: number
sourceField: revenue
- name: order_count
label: "Order Count"
dataType: number
sourceField: order_countUse virtual datasets for pre-aggregated data, complex transformations, or combining tables with custom logic.
Dataset Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Unique identifier (snake_case) |
label | string | Yes | User-friendly display name |
description | string | No | Help text for users |
type | physical | virtual | Yes | Table-backed or SQL-backed |
connectionId | string | Yes | ID of the data connection |
connectionType | enum | Yes | Database type |
catalog | string | No | Database catalog (BigQuery, Snowflake) |
schema | string | No | Database schema name |
table | string | Conditional | Table name (required for physical) |
sql | string | Conditional | SQL query (required for virtual) |
primaryKey | string[] | No | Primary key columns (important for fan-out detection) |
fields | object | Yes | Contains identifiers, dimensions, metrics arrays |
Supported connectionType values: PostgreSQL, MySQL, MSSQL, BigQuery, Redshift, Snowflake, clickhouse, S3, S3Tables, GoogleSheets, FileUpload, API
Fields
Every dataset has three field buckets: identifiers, dimensions, and metrics. Each serves a distinct role in query generation.
Identifiers
Identifiers are primary and foreign key fields used for joins:
identifiers:
- name: order_id
label: "Order ID"
dataType: number
type: primary
- name: customer_id
label: "Customer ID"
dataType: number
type: foreign| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Column name in the database |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | string, number, boolean, date, datetime, json, geo |
type | primary | foreign | Yes | Primary keys enable fan-out detection |
hidden | boolean | No | Hide from end users |
tags | string[] | No | Categorization tags |
Always mark primary keys with type: primary. This enables Semaphor's fan-out prevention, which avoids metric inflation when joining one-to-many relationships.
Dimensions
Dimensions are groupable fields that appear in GROUP BY clauses:
dimensions:
- name: order_date
label: "Order Date"
dataType: date
granularity: day
- name: status
label: "Status"
dataType: string
- name: region
label: "Region"
dataType: string
isDisplayField: true| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Column name in the database |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | string, number, boolean, date, datetime, json, geo |
granularity | enum | No | Time granularity: day, week, month, quarter, year, hour, minute, second |
isDisplayField | boolean | No | Mark as the default display label for this dataset |
dateFormat | string | No | Custom date display format |
customFormat | string | No | Custom format string |
hidden | boolean | No | Hide from end users |
tags | string[] | No | Categorization tags |
Metrics
Metrics are aggregatable fields that produce numeric results:
metrics:
- name: amount
label: "Revenue"
dataType: number
sourceField: amount
- name: quantity
label: "Quantity"
dataType: number
sourceField: quantity
- name: discounted_total
label: "Discounted Total"
dataType: number
expression: "amount * (1 - discount_rate)"| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Unique metric name |
label | string | No | Display name |
description | string | No | Help text |
dataType | enum | Yes | Usually number |
sourceField | string | Conditional | Physical column name. Required if no expression |
expression | string | Conditional | SQL expression. Required if no sourceField |
format | string | object | No | Display format specification |
unit | string | No | Unit label (e.g., "USD", "kg") |
isDefault | boolean | No | Mark as the default metric for this dataset |
hidden | boolean | No | Hide from end users |
filters | array | No | Default filters applied to this metric |
tags | string[] | No | Categorization tags |
Every metric must have either sourceField or expression. If neither is provided, validation will fail.
The expression and format fields support inline SQL transforms and display formatting. See Metric Expressions & Formatting for syntax details, aggregation behavior, and examples.
Complete Fields Example
Putting it all together for an orders dataset:
datasets:
- name: orders
label: "Orders"
type: physical
connectionId: "conn_abc123"
connectionType: PostgreSQL
schema: public
table: orders
primaryKey: [order_id]
fields:
identifiers:
- name: order_id
label: "Order ID"
dataType: number
type: primary
- name: customer_id
label: "Customer ID"
dataType: number
type: foreign
dimensions:
- name: order_date
label: "Order Date"
dataType: date
granularity: day
- name: status
label: "Status"
dataType: string
- name: channel
label: "Sales Channel"
dataType: string
metrics:
- name: amount
label: "Revenue"
dataType: number
sourceField: amount
- name: quantity
label: "Quantity"
dataType: number
sourceField: quantity