Open Source BI for Modern Data Stacks: Snowflake, BigQuery & Beyond
DATA ENGINEERING

Open Source BI for Modern Data Stacks: Snowflake, BigQuery & Beyond

Preset Team
9 min read
1,653 words

The modern data stack consolidated around a clear pattern: store everything in a cloud warehouse — Snowflake, BigQuery, Databricks, Redshift — model it with dbt, and serve it to humans and applications from there. The BI tool sits at the very end of that pipeline, and which one you pick has outsized impact on whether the rest of the stack delivers on its promises. A BI tool that fights the warehouse instead of leveraging it can quietly ruin a perfectly good data platform.

This guide covers what good warehouse integration looks like in a BI platform, why it matters for performance and scale, and how the leading open source BI options (Apache Superset, Metabase, Lightdash, and Redash) compare on the dimensions that actually move the needle: query pushdown, semantic layers, dbt awareness, customer-facing concurrency, and the modern-stack workflows you're already running.

Why warehouse integration is the fulcrum

Cloud warehouses changed BI economics in two specific ways:

  • Compute and storage decouple. You don't pay to keep data warm in a separate analytics database; the warehouse is the analytics database, and you pay only for the queries that run against it.
  • Query engines got fast enough. Snowflake, BigQuery, and Databricks can return interactive-latency answers over billions of rows, with the right modeling and partitioning.

That pair of facts upended a generation of BI tools whose architecture assumed they'd be running their own query engine on extracts of warehouse data. The new pattern is direct query: every chart hits the warehouse, and the warehouse does the heavy lifting. A modern BI tool's job is to compose good SQL, push it down, cache thoughtfully, and stay out of the way.

When that works, the BI experience inherits everything the warehouse already does well — concurrency, governance, lineage, cost controls. When it doesn't — when the BI tool insists on extracting data, building a side-cache that goes stale, or rewriting your queries badly — you've added a second source of truth and a second set of operational problems.

What "good warehouse integration" actually looks like

A short checklist that separates BI tools that respect the warehouse from those that work around it:

  • Native connector with full SQL-dialect support. Native means the tool understands the warehouse's specific SQL dialect — Snowflake's QUALIFY, BigQuery's UNNEST, Databricks' Photon hints — not a lowest-common-denominator subset. Tools that go through a generic adapter end up fighting the warehouse on edge cases.
  • Query pushdown. Filters, aggregations, joins, and window functions should execute in the warehouse, not in the BI tool. If the BI tool pulls a million rows and aggregates them locally, you've lost the warehouse's performance advantage.
  • Direct-query first, caching second. A modern BI tool defaults to live queries against the warehouse and uses caching as an optimization, not a workaround. The cache layer should be aware of warehouse query results (with TTLs and tag-based invalidation) rather than maintaining its own parallel copy of the data.
  • Semantic layer that pushes down. Datasets and metrics defined in the BI tool should compile to SQL that runs in the warehouse — not be evaluated in-memory after the fact. This is what makes a BI tool's semantic layer composable with dbt rather than a parallel modeling track.
  • dbt awareness. At minimum, the BI tool should be able to read your dbt manifest so descriptions, tags, and exposures flow through. The leading platforms surface dbt models, lineage, and freshness directly in the BI UI.
  • Concurrency that scales horizontally. For customer-facing analytics or large internal teams, the BI tool should add capacity by adding stateless workers, not by being a single bottleneck in front of the warehouse.

The platforms that get this right tend to win at scale. The ones that don't — typically older products that grew up before the cloud-warehouse era — end up replicating data, getting out of sync, or becoming the slow link in an otherwise fast stack.

The open source shortlist

Four projects are credible options for warehouse-native open source BI today:

Apache Superset

Apache Superset has the broadest warehouse coverage of any open source BI platform: native connectors for Snowflake, BigQuery, Databricks, Redshift, Postgres, MySQL, Trino, Presto, ClickHouse, Apache Druid, Pinot, plus 40+ more via SQLAlchemy. Every connector supports direct query — Superset composes SQL in the warehouse's dialect, pushes down filters and aggregations, and caches results in a configurable layer (Redis, Memcached, or the database itself).

The semantic layer is built around datasets (tables or virtual SQL queries) and metrics (aggregations defined once and reused). Definitions compile to SQL that runs in the warehouse, so the semantic layer scales with the warehouse rather than imposing its own ceiling. Superset can also be configured to read dbt manifests, surfacing dbt models, descriptions, and exposures in its UI so the analytics team's modeling work flows directly through to consumers.

For customer-facing concurrency, Superset's stateless web tier scales horizontally — managed offerings like Preset routinely run multi-region clusters in front of Snowflake and BigQuery for embedded use cases with thousands of concurrent users.

Metabase

Metabase has solid coverage of the major cloud warehouses — Snowflake, BigQuery, Redshift, Databricks, and the standard relational/OLAP databases. The connector quality is good, and direct-query is the default. The thinner spot is the semantic layer: Metabase's segments and metrics are useful for ad-hoc analysis but don't compose into a modeling layer the way Superset's datasets or Lightdash's dbt-native models do. dbt support exists via integrations but isn't first-class.

Best fit when the warehouse is one of the major cloud players and the modeling layer lives elsewhere — typically in dbt, with Metabase reading the resulting tables as a relatively flat surface.

Lightdash

Lightdash is the most opinionated of the four about modern data stack integration: your dbt project is your semantic layer. Lightdash reads your dbt models directly, exposes their measures and dimensions in the BI UI, and pushes every query through to the warehouse via dbt's connection profile. There's no parallel modeling layer to maintain.

Warehouse coverage matches what dbt supports — Snowflake, BigQuery, Databricks, Redshift, Postgres, plus a long tail. The tradeoff is exactly what you'd expect: if you're not on dbt, Lightdash's value drops sharply, because the dbt project is the product.

Best fit for dbt-first analytics teams who want their BI layer to track their dbt models without any translation step.

Redash

Redash supports 35+ data sources, including all the major warehouses, and its query model is genuinely warehouse-native — every chart is backed by a SQL query that runs against the source. There's no semantic layer to speak of, and dbt integration is limited.

Best fit when your audience is SQL-fluent and you want a fast query/dashboard surface in front of warehouse-resident data. Less aligned with the modern-stack workflows that depend on a real semantic layer.

Comparison

Capability Apache Superset Metabase Lightdash Redash
Native warehouse connectors 40+ databases 25+ databases dbt-supported set 35+ databases
Query pushdown Yes Yes Yes (via dbt) Yes
Semantic layer Datasets + metrics Limited Native to dbt None
dbt awareness Manifest integration Integration via plugins First-class Limited
Caching layer Redis / Memcached / DB Built-in Warehouse-side Built-in
Customer-facing concurrency Horizontal scaling Limited (paid tier) Limited Limited
Snowflake / BigQuery / Databricks support Native Native Native Native
Active contributor community Very active Active Growing Limited

A note on the proprietary alternatives — Looker, Power BI, and Tableau all integrate well with Snowflake and BigQuery, and Looker in particular has a strong semantic-layer story (LookML) that pushes down. The tradeoffs are familiar: per-viewer licensing that scales linearly with your audience, lock-in to the vendor's modeling language, and limits on embedding into your own product. For internal-only teams the math sometimes works; for customer-facing analytics it usually doesn't.

API-driven ML and downstream workflows

A few of the audience questions on this topic ask specifically about ML pipelines and API-driven workflows. The honest answer: BI tools are not where ML models get deployed — that's the warehouse's job (Snowflake's Snowpark, BigQuery ML, Databricks model serving) or a dedicated MLOps stack. What a BI tool should do is expose model output cleanly:

  • Read scored tables, with the warehouse's compute doing the work, and surface predictions alongside the input features.
  • Trigger refreshes via API, so dashboards can be updated when an upstream model finishes a batch run.
  • Provide programmatic access, via REST or SQL, so other systems can pull the same metrics the dashboards display.

Apache Superset has a documented REST API covering datasets, charts, dashboards, and SQL Lab, plus the embedded SDK for surfacing visualizations elsewhere. Metabase's API is also solid. Lightdash exposes dbt-modeled metrics through its API and supports dbt run triggers. Redash has a query-results API that's useful for simple integrations.

How to choose

  • You're on Snowflake / BigQuery / Databricks at non-trivial scale, want a real semantic layer in the BI tool, broad connector coverage for the long tail, and a path to customer-facing embedded analytics. Apache Superset, ideally via a managed offering like Preset.
  • Your team already lives in dbt and you want zero translation between modeling and BI. Lightdash.
  • You're standing up internal BI quickly on a single major warehouse and the modeling lives in dbt outside the BI tool. Metabase.
  • Your audience is SQL-first and you want a fast warehouse-backed query surface. Redash.

For most modern data stacks where the BI tool needs to scale alongside the warehouse — broad connector coverage, semantic-layer pushdown, dbt awareness, customer-facing concurrency — Apache Superset is the most complete open source answer today.

Where Preset fits

Preset is a managed Apache Superset platform with native connectors for every major cloud warehouse, dbt manifest integration, horizontal scaling for customer-facing concurrency, and SOC 2 / HIPAA-eligible deployments. Teams running on Snowflake, BigQuery, or Databricks typically use Preset to skip the operational lift of running Superset themselves.

If you're scoping a BI rollout on top of a modern data stack and want to talk through the architecture, the team is happy to walk through it. For related angles, our companion guides on open source embedded analytics platforms and self-service BI for non-technical teams cover the use cases where warehouse-native open source BI most often gets evaluated.

Subscribe to our blog updates

Receive a weekly digest of new blog posts

Close