
Scaling Open Source BI: High Performance and Enterprise Readiness
A BI platform that performs well in a demo will not necessarily perform well at 10,000 users, 100 million rows, and a dashboard refresh storm at 9:01 AM on a Monday. Most BI evaluations test the happy path. The decisions that bite later (sometimes years later) come from architectural choices made before anyone asked how the platform would behave under real enterprise load.
This guide covers what BI scalability actually looks like in production, why some open source platforms are built to handle it and others aren't, and how to evaluate the leading options — Apache Superset, Metabase, Lightdash, and Redash — for high-concurrency, regulated, customer-facing, and large-internal use cases.
What "scale" means for a BI platform
Scalability in BI isn't one number. It's a set of distinct properties, and the platforms that win at enterprise scale get all of them right at once:
- Concurrent users. Hundreds of analysts, thousands of internal employees, or millions of customer-facing viewers — different orders of magnitude with different architectural implications.
- Query throughput. How many queries per second the platform can compose, dispatch to the warehouse, and stream results back without becoming the bottleneck.
- Tail latency. P50 is a marketing number. What users actually feel is P95 and P99, especially during a 9:01 AM refresh storm.
- Data volume. Whether a dashboard against a 10-billion-row table is interactive or a slideshow.
- Multi-region availability. Are your users in Frankfurt waiting on a database in us-east-1?
- Failure recovery. When an availability zone fails, how fast does the system recover? The RPO, the RTO, and whether the on-call engineer gets paged all depend on choices made at design time.
A BI tool that's fast enough in a small demo can fall apart on any of these dimensions independently. The architectural choices that determine how well it scales are made years before you hit the limits.
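To illustrate why the median hides what users feel, here is a minimal Python sketch that computes P50, P95, and P99 from latency samples. The sample values are hypothetical, chosen only to show the gap, and `latency_percentiles` is an illustrative helper rather than part of any BI platform.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return P50, P95, and P99 from latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical refresh storm: most requests are fast, a minority are very slow.
samples = [200.0] * 90 + [1500.0] * 8 + [6000.0] * 2
result = latency_percentiles(samples)
print(result)  # {'p50': 200.0, 'p95': 1500.0, 'p99': 6000.0}
```

In this made-up sample the median looks healthy at 200 ms, while one request in twenty takes 1.5 seconds or longer.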
The architecture that scales
BI platforms that handle enterprise load consistently share a recognizable shape:
Stateless, horizontally scaled web tier
The user-facing service should be stateless — no in-memory session state, no local caches that drift between nodes, no "scale up the box" path. Adding capacity means adding workers behind a load balancer. Apache Superset's Flask web tier and async workers, Metabase's stateless app servers, and most other production-credible platforms follow this pattern. Older or single-binary BI tools that keep state in-process don't scale this way.
Async query execution
Long-running queries should run in a separate worker pool — typically Celery, an in-process async runtime, or a dedicated query service — so a slow dashboard doesn't tie up a web request and starve everyone else. The web tier responds quickly with a job ID; the worker streams results back when ready. The platforms that don't do this end up with timeouts cascading into more timeouts under load.
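The job-ID pattern can be sketched in a few lines of Python. This is a simplified illustration, not how any particular platform implements it: a `ThreadPoolExecutor` stands in for a real worker pool such as Celery, `run_query` is a hypothetical slow warehouse call, and the in-process `_jobs` dict stands in for a shared results backend (a real stateless web tier would not keep job state in memory).

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for a separate worker pool (Celery or similar in a real deployment).
_workers = ThreadPoolExecutor(max_workers=4)
# Stand-in for a shared results backend; kept in-process only for this sketch.
_jobs: dict[str, Future] = {}

def run_query(sql: str) -> list[tuple]:
    """Hypothetical slow warehouse call; sleeps to simulate query time."""
    time.sleep(0.1)
    return [("rows for", sql)]

def submit_query(sql: str) -> str:
    """Web tier: enqueue the query and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _workers.submit(run_query, sql)
    return job_id

def poll(job_id: str) -> dict:
    """Client: report status, plus the rows once they are ready."""
    future = _jobs[job_id]
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "rows": future.result()}

job_id = submit_query("SELECT 1")
first = poll(job_id)        # usually still pending at this point
_jobs[job_id].result()      # block until the worker finishes (demo only)
final = poll(job_id)
```

The web request that calls `submit_query` returns in microseconds regardless of how long the query runs, which is what keeps one slow dashboard from starving other users.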
Multi-tier caching
There's no single "the cache." A production BI deployment has at least three tiers:
- Query result caching (Redis, Memcached, or database-backed): the same query against the same data shouldn't hit the warehouse twice if a sibling dashboard ran it five minutes ago.
- CDN-level caching (static assets at the edge): JavaScript bundles, fonts, and images should come from edge nodes rather than from the BI app on every request.
- Warehouse-level caching (Snowflake's result cache, BigQuery's cached results, Databricks' Delta caching): a BI tool that rewrites queries in ways the warehouse doesn't recognize will defeat this layer.
Apache Superset, Metabase, and Lightdash all implement multi-tier caching with TTLs and tag-based invalidation. The maturity of the cache layer is one of the bigger differentiators between the platforms at scale.
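As a rough illustration of the query result tier, the sketch below keys cached rows on a hash of the database name and whitespace-normalized SQL, with a TTL. It is an in-memory stand-in for Redis or Memcached, and the class name, key scheme, and normalization are assumptions made for this example, not any platform's actual implementation.

```python
import hashlib
import time

class QueryResultCache:
    """In-memory stand-in for a Redis- or Memcached-backed result cache."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, rows)

    @staticmethod
    def key_for(database, sql):
        # Collapse whitespace so trivially different SQL text shares one entry.
        # (Simplification: this would also alter whitespace inside string literals.)
        normalized = " ".join(sql.split())
        return hashlib.sha256(f"{database}:{normalized}".encode()).hexdigest()

    def get_or_run(self, database, sql, run):
        key = self.key_for(database, sql)
        hit = self._store.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]                  # fresh entry: skip the warehouse
        rows = run(sql)                    # miss or expired: query the warehouse
        self._store[key] = (self.clock() + self.ttl, rows)
        return rows

calls = []

def fake_warehouse(sql):
    """Hypothetical warehouse call that records how often it is hit."""
    calls.append(sql)
    return [(42,)]

cache = QueryResultCache(ttl_seconds=300)
rows_a = cache.get_or_run("analytics", "SELECT  count(*) FROM orders", fake_warehouse)
rows_b = cache.get_or_run("analytics", "SELECT count(*)\nFROM orders", fake_warehouse)
```

Both calls return the same rows, but `fake_warehouse` runs only once: the second dashboard is served from the cache until the five-minute TTL expires.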
Connection pool and warehouse-aware throttling
Every BI tool that hits a warehouse needs a connection pool. The non-trivial detail: under load, the pool should queue rather than fail open and saturate the warehouse. A BI tool that fires 10,000 concurrent queries at Snowflake when 10,000 users hit refresh simultaneously is a BI tool that bills its customer for an unnecessarily expensive afternoon.
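One simple way to get the queue-instead-of-saturate behavior is a concurrency gate in front of the warehouse. The sketch below uses an `asyncio.Semaphore`; the limit of 8 and the simulated query are assumptions for illustration, and a production pool would also need timeouts and a bound on queue length.

```python
import asyncio

MAX_WAREHOUSE_CONCURRENCY = 8  # assumed limit; tune to the warehouse's capacity

async def run_on_warehouse(sql, gate, stats):
    # Requests beyond the limit wait here instead of hitting the warehouse.
    async with gate:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # hypothetical stand-in for the real query
        stats["in_flight"] -= 1
        return f"result of {sql}"

async def refresh_storm(n_requests):
    gate = asyncio.Semaphore(MAX_WAREHOUSE_CONCURRENCY)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_on_warehouse(f"SELECT {i}", gate, stats) for i in range(n_requests))
    )
    return results, stats["peak"]

results, peak = asyncio.run(refresh_storm(200))
```

All 200 simulated requests complete, but `peak` never exceeds the configured limit, so the warehouse sees a steady stream rather than a spike.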
Multi-region deployment
For globally distributed audiences, the BI tier needs to live close to its users. A managed offering with multi-region deployments (or a self-hosted setup with regional replicas) is the difference between 50ms dashboards in Singapore and 800ms ones. Most open source projects support multi-region deployments architecturally; the operational lift to actually run them is non-trivial without a managed offering doing it for you.
High availability and disaster recovery
The basic shape of HA is a stateless web tier, a replicated metadata database, and a worker pool that can lose any single instance without an outage. Realistic enterprise deployments target 99.9%+ uptime, RPO measured in minutes, and RTO measured in low single-digit hours. The platforms with managed offerings (Preset for Superset, Metabase Cloud, Lightdash Cloud) inherit these targets from the offering's SLA. Self-hosted teams have to build toward those targets themselves.
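To make the uptime target concrete, the short calculation below converts an availability percentage into an annual downtime budget. It is plain arithmetic, and the helper name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760, ignoring leap years

def downtime_budget_hours(availability_pct):
    """Hours of allowed downtime per year at a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

budgets = {pct: round(downtime_budget_hours(pct), 2) for pct in (99.0, 99.9, 99.99)}
print(budgets)  # {99.0: 87.6, 99.9: 8.76, 99.99: 0.88}
```

At 99.9% the budget is about 8.76 hours per year, so a single incident with a three-hour recovery would use roughly a third of it.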
Enterprise governance is part of "scaling"
The scalability conversation that data and platform engineers have is usually about throughput. The scalability conversation enterprise procurement has is about governance. A platform that fails the governance conversation never reaches the throughput one.
The non-negotiables for enterprise rollouts, especially in regulated industries (financial services, healthcare, government):
- SOC 2 Type II for the platform (and its managed offering, if applicable). Most enterprise procurement teams won't proceed without this.
- HIPAA-eligible deployments for healthcare, often with BAAs.
- SSO + SCIM so identities are managed centrally, not in the BI tool's user table.
- Row-level security enforced at query time, against the warehouse, with attribute-based rules that can express "show this user only their region's data."
- Audit logging at sufficient detail to answer "who saw this dashboard last quarter" two years later.
- Customer-managed encryption keys (CMEK / BYOK) for data sovereignty cases.
- VPC-native or private-cloud deployment options for environments where the data can't leave the customer's network.
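The row-level security requirement above can be sketched as a query wrapper that adds a filter from the user's attributes before the SQL reaches the warehouse. Everything here is hypothetical: the `RLS_RULES` table, the `User` type, and the `%s` placeholder style (DB-API "format" paramstyle) are assumptions for illustration, not how Superset or any other platform implements it.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    username: str
    # Attributes such as {"region": "EMEA"}, typically synced from the IdP.
    attributes: dict = field(default_factory=dict)

# Hypothetical rule table: dataset -> (column, user attribute that must match).
RLS_RULES = {"sales": ("region", "region")}

def apply_rls(dataset, base_sql, user):
    """Wrap a query so the warehouse only returns rows the user may see."""
    rule = RLS_RULES.get(dataset)
    if rule is None:
        return base_sql, ()
    column, attribute = rule
    if attribute not in user.attributes:
        # Fail closed: a user without the attribute sees no rows.
        return f"SELECT * FROM ({base_sql}) AS q WHERE 1 = 0", ()
    # The attribute value travels as a bind parameter, never interpolated.
    sql = f"SELECT * FROM ({base_sql}) AS q WHERE q.{column} = %s"
    return sql, (user.attributes[attribute],)

base = "SELECT region, amount FROM sales"
sql, params = apply_rls("sales", base, User("ana", {"region": "EMEA"}))
blocked_sql, blocked_params = apply_rls("sales", base, User("guest"))
```

Because the filter is applied at query time, the same dashboard serves every region, and a user with no `region` attribute gets an empty result rather than everyone's data.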
Apache Superset, paired with a managed offering like Preset, checks all of these boxes. Self-hosted deployments can hit them too, but the operational effort to maintain SOC 2 evidence, audit logging at scale, and CMEK across upgrades is a significant ongoing investment. Metabase covers SSO and basic RBAC in the open source edition; deeper enterprise features sit in Metabase Pro / Enterprise. Lightdash and Redash are less commonly deployed in regulated enterprise environments.
For financial services specifically, the questions that decide BI rollouts are usually "does it meet our compliance bar without us having to build it ourselves?" and "can our SREs sleep at night running it?", not the feature checklist. Managed open source offerings with enterprise guarantees are typically the path of least resistance.
How the open source shortlist compares at scale
| Capability | Apache Superset | Metabase | Lightdash | Redash |
|---|---|---|---|---|
| Stateless web tier | Yes | Yes | Yes | Yes |
| Async query execution | Yes (Celery workers) | Yes | Yes (via dbt + warehouse) | Yes |
| Multi-tier caching | Yes (Redis / Memcached / DB + warehouse-aware) | Yes | Yes | Yes |
| Customer-facing concurrency | High (horizontal scaling) | Limited (paid tier deeper) | Growing | Limited |
| Multi-region deployment | Self-host or via managed | Self-host or Metabase Cloud | Self-host or Lightdash Cloud | Self-host |
| SSO + SCIM | Yes (via OAuth/OIDC; SCIM via managed) | Yes (paid tier) | Yes (paid tier) | Limited |
| Row-level security | Yes | Yes (paid tier deeper) | Yes | Limited |
| Audit logging | Yes | Yes (paid tier) | Yes | Limited |
| SOC 2 / HIPAA via managed offering | Preset (SOC 2 Type II, HIPAA-eligible) | Metabase Cloud (SOC 2) | Lightdash Cloud | No primary managed offering |
| Production track record at FAANG-scale | Yes (origin: Airbnb; widely adopted) | Yes (very large internal-BI userbase) | Growing | Yes (pre-Databricks era) |
A note on the proprietary alternatives: Looker, Power BI, and Tableau all scale to enterprise and have mature governance stories, but they bring per-viewer licensing that scales linearly with audience, restrictions on embedding into your own product, and lock-in to the vendor's modeling language. For internal-only enterprise rollouts the operational simplicity sometimes justifies the cost. For customer-facing or large-team deployments, the math typically pushes back to open source.
How to choose
A short decision tree for the scaling/enterprise lens:
- You need to handle high customer-facing load (millions of viewers, embedded into your product) with bounded operational cost and enterprise governance. Apache Superset via a managed offering. Preset is the most mature option here and absorbs the SOC 2 / HIPAA / multi-region operational burden.
- You're a regulated enterprise (financial services, healthcare) running internal BI at scale. Apache Superset via Preset, or proprietary if the per-user math works at your audience size and embedding isn't on the roadmap.
- You're mid-market with growing internal BI needs and a small data team. Metabase Cloud or Preset for managed; self-hosted Superset if you have a platform engineer who can absorb the operational work.
- You're a dbt-first team scaling internal analytics. Lightdash or Lightdash Cloud, with the understanding that embedding and customer-facing scale are less mature than Superset's.
Across the concerns covered here (high concurrency, enterprise compliance, customer-facing load, regulated industries), Apache Superset is the most complete open source answer, especially when paired with a managed offering that handles the enterprise guarantees. The architecture is built for horizontal scaling, the governance story is mature, and it has been deployed at FAANG scale for the better part of a decade.
Where Preset fits
Preset is a managed Apache Superset platform built for enterprise scale: multi-region deployments, SOC 2 Type II compliance, HIPAA-eligible deployments, SSO + SCIM, audit logging, customer-managed encryption keys for Managed Private Cloud deployments, and horizontal scaling for customer-facing embedded analytics. The teams running open source BI at the scales described in this post (millions of viewers, regulated industries, multi-region SaaS) are typically either running Superset themselves with a dedicated platform team or running it on Preset with a smaller team than self-hosting would require.
If you're scoping a BI rollout at enterprise scale and want to talk through the architecture, the team is happy to walk through it. For related angles, our companion guides cover open source embedded analytics platforms, self-service BI for non-technical teams, warehouse-native BI on the modern data stack, and BI total cost of ownership across company sizes.