
Using Superset to Understand Superset Usage

Stephan Claus

This post was originally published on the HomeToGo Engineering blog.

Introduction

This article walks you through a potential approach to monitoring your Superset usage directly within Superset by leveraging the internal metadata database.

At HomeToGo, we are always looking for opportunities to further streamline our Data Tech Stack. Naturally, one of the most important components is the reporting environment that allows our teams to easily explore, interact with and create insights from our data. Historically, we were using multiple tools for this purpose, namely Tableau, Redash and QlikView, which:

  • have different usage workflows and prerequisites:
    • For Redash, one needs to be able to write SQL to produce the initial data to explore
    • For Tableau, the workflow is more user-friendly for non-technical users, but it is harder to maintain programmatically at a flexible, granular level
  • cause an overhead to manage and keep the environments aligned

Over the summer of 2021, we looked at different solutions to unify these environments and ended up running a Proof of Concept with Apache Superset. We ultimately decided to use it as our main reporting tool.

If you monitor the reporting space closely, you have probably already heard of Apache Superset. If you haven’t, we highly recommend checking out the Preset blog to see how it matured in 2021. While there are still areas for improvement, the following Superset advantages stood out for us:

  • The data exploring process is more visual for non-technical users but also provides capability for custom SQL (via SQLLab)
  • All business logic is available in SQL
  • API-first approach to manage most object types (datasets, metrics, access, etc.)
  • An RBAC model for granular data access
  • Open source license with active community and experience in building strong OSS
  • Advanced chart types with the option to easily contribute your own charts
  • Advanced chart functionality, e.g. time sliders, a date grain selector and out-of-the-box YoY calculations
  • Python post-processing like Facebook's Prophet forecasting, advanced filters and Jinja templating functionality (a small Jinja sketch follows this list)
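
To give a flavour of the last point, below is a minimal sketch of Jinja templating inside a virtual dataset. The table and column names (bookings, user_name) are made up for illustration; current_username() is one of Superset's built-in Jinja macros.

-- minimal Jinja templating sketch for a Superset virtual dataset
-- (bookings and user_name are hypothetical names, current_username() is a built-in Superset macro)
select
date_trunc('day', booking_date) as booking_day,
count(1) as bookings_count
from bookings
where 1=1
-- only show rows belonging to the user currently viewing the chart
and user_name = '{{ current_username() }}'
group by 1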

Two weeks ago, we made our self-hosted Superset environment public to our stakeholders and started our migration from Redash.

To closely monitor our user adoption rate and general usage patterns of Superset, we made this a project for our monthly hackathon events and looked into Superset's internal PostgreSQL metadata database.

Keep reading for our step-by-step approach and what we’ve learned so far. On top of that, we released all the bits and bytes you need to replicate the results of this article here.

Superset Dashboard

Step 1 - Connect to the PostgreSQL Metadata Database

For the sake of the hackathon, we went with a simple approach by connecting the Superset metadata database directly to Superset — reporting in real time has never been easier. 🙂 And of course, insert your favorite inception joke here.

Superset Inception

Fun aside, there are other ways to integrate the metadata into Superset, especially if you want to:

  • Avoid exposing the production database to Superset directly
  • Connect it to other data sources (e.g. the query performance statistics provided by your DWH)
  • Start parsing out information from the json fields in the different tables

A more reliable solution could incorporate the following steps:

  1. Import the content of the metadata database via Airbyte into your data warehouse
  2. Create your dbt model(s) (a minimal sketch follows this list)
  3. Add the datasets via your usual data warehouse connection
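
As a rough illustration of step 2, a staging model for the logs table could look like the sketch below. The source name superset_metadata is an assumption and would need to match your Airbyte destination.

-- stg_superset__logs.sql: minimal dbt staging model sketch
-- (the source name 'superset_metadata' is an assumption and has to match your Airbyte destination)
select
id as log_id,
user_id,
dashboard_id,
slice_id,
lower(action) as action,
duration_ms,
dttm as logged_at
from {{ source('superset_metadata', 'logs') }}
where action != 'log'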

Step 2 - Explore the Data Model

Once connected, you are greeted by 55 tables that are (more or less) used by Superset internally. As there is not much documentation around this topic (yet), we started to explore and “reverse engineer” our way into a better understanding by drawing a UML diagram first.
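
If you want a quick inventory of those tables yourself, a plain PostgreSQL catalog query (nothing Superset-specific) is enough:

-- list all tables of the Superset metadata database together with their column counts
select
t.table_name,
count(c.column_name) as column_count
from information_schema.tables as t
left join information_schema.columns as c
  on c.table_schema = t.table_schema
  and c.table_name = t.table_name
where t.table_schema = 'public'
and t.table_type = 'BASE TABLE'
group by 1
order by 1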

Superset Metadata Database Schema UML

From there it was quite easy to derive the tables that are of interest for our use case, namely:

Superset Dataset

As you can see, we created two custom datasets on top that:

  1. Connect master data from dashboards, charts, etc. to the log data itself
  2. Aggregate different master data into one view

For reference, you can find the SQL code for the virtualized data sets at the bottom of this article or in this repository.

Step 3 - Know What You Want to Accomplish

After gaining a good understanding of the base tables, we defined the areas where we wanted to derive insights. For the scope of the hackathon, we focused on the following questions:

User Adoption Rate

We treat our internal stakeholders as customers — we are always interested in master data (e.g. how many registered users we have) but are more focused on:

  • How many active users we have
  • Which dashboards/datasets are used the most

To surface these findings, we had to connect the master data to the log table, which we accomplished with a virtualized dataset. The reasons for this are manifold — just envision a view of dashboards/datasets that are not used at all, which could provide valuable feedback for the data owners.
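
To sketch both angles, the two queries below show a simple weekly active user count and a list of published dashboards without a single logged view. They rely only on the logs and dashboards tables also used in the appendix and are meant as a starting point, not as our final definitions.

-- weekly active users, based on the raw logs table
select
date_trunc('week', dttm) as activity_week,
count(distinct user_id) as active_users
from logs
where action != 'log'
group by 1
order by 1;

-- published dashboards without any logged activity (candidates for owner feedback)
select
d.id,
d.dashboard_title
from dashboards as d
left join logs as l on l.dashboard_id = d.id
where d.published = true
and l.id is null
order by d.dashboard_title;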

Chart Types

Knowing which chart types are used the most (a sample count query follows this list) provides insights into:

  • Whether we need special testing before updating Superset in case a chart type was changed
  • Which chart types to encourage users towards for feature parity (e.g. use the ECharts time-series line chart instead of the simple line chart)
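
A simple way to get this distribution is to count the viz_type column of the slices table, roughly like this:

-- distribution of chart types across all saved charts
select
viz_type,
count(1) as chart_count
from slices
group by 1
order by 2 desc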

Entity Naming Conventions

Unfortunately, Superset does not support proper tagging of elements like datasets, charts or dashboards. In our opinion, this makes navigating the content a struggle once more than 50 users are producing content. To avoid this, we ask our users to follow a very simple naming convention that makes their content easily searchable and allows us to parse tags going forward:

[tag1_tag2_…_tagn].[element name]

Where the tag section would follow the snake_case guidelines. Examples are marketing_sem.Performance Dashboard or monitoring.Superset Dashboard.

As naming things is hard, we wanted to build a simple monitor showing us the naming convention adoption rate. We are now able to give a friendly ping to users whom we don’t see following the naming convention. 😉
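
As a small sketch of how that tag parsing could work in PostgreSQL, the query below splits the tag section of a dashboard title into individual tags:

-- split the tag section of a dashboard title into individual tags (PostgreSQL)
-- e.g. 'marketing_sem.Performance Dashboard' yields the tags 'marketing' and 'sem'
select
d.dashboard_title,
unnest(string_to_array(split_part(d.dashboard_title, '.', 1), '_')) as tag
from dashboards as d
where d.dashboard_title like '%.%'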

Development of Virtualized Datasets

Virtualized datasets are a great way to allow customers to connect different data sources and build customisations. However, we want to be mindful about which use cases are valid, as extensive use could potentially clutter the curated datasets coming out of dbt.

Valid use cases for virtualized datasets would include joins, unions and more complex logic with window functions that is not available via charts directly.

Use cases we want to learn from are those where people use only one dataset and merely add CASE WHEN logic or rename columns. This feedback loop allows us to improve our dbt data models directly and keep our environments tidy.
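
A minimal sketch for this feedback loop is a count of virtualized datasets per creator, based on the same tables and is_sqllab_view flag used in the appendix:

-- number of virtualized (SQL Lab) datasets per creator
select
u.username as owner,
count(1) as virtualized_dataset_count
from tables as t
left join ab_user as u on u.id = t.created_by_fk
where t.is_sqllab_view = true
group by 1
order by 2 desc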

Step 4 - Build the Dashboard

Now on to the fun part! Once the datasets were ready for action, we jumped directly into creating charts and gathering them into a dashboard with minimal filter settings (only Time Range, Time Grain and Dashboard Status). If you want to follow our approach, you will soon be able to download our exported dashboard here (link to be updated in CW 8); otherwise you can find some examples below.

User development and user activity

Superset Monitoring

Datasets development and showing users with the most virtualized datasets

Monitoring Superset

Chart development and most prominent selection (using a treemap to make it look fancy)

Monitoring Superset

Our internal naming conventions check — the gauge chart is a great way to incentivize

Monitoring Superset

What is Next

While we are already pleased with the insights we derived from the first iteration of the dashboard, there are a couple of topics we want to continue working on:

Further Classification of Activity

There is definitely room for improvement in classifying the events that are logged in the action column to get the most out of the logs table. At the moment, there are 104 different actions defined in our database, including things like:

  • welcome (logged when a user sees the front page)
  • annotation* (if users are working with annotation layers)
  • several get/list API calls
  • internal actions like explore_json

There are also several other actions you may not want to count at all (as they happen indirectly in the background), in contrast to actions that reflect the creation/changing/viewing of Superset elements, such as:

  • queries (if a query was executed from SQLLab)
  • ChartRestApi.data (if data was loaded by a chart component)
  • count (if a dashboard got viewed)

We assume that parsing the json entries for action = 'log' will provide you with even more granular information for further action classification.
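
If you want to start your own classification, a simple inventory of the logged actions is a good first step:

-- inventory of logged actions as a starting point for classification
-- (as noted above, the json column carries the more granular payload for action = 'log')
select
lower(action) as action,
count(1) as action_count,
max(dttm) as last_seen
from logs
group by 1
order by 2 desc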

Add Performance Perspective

As of now, we have not found a great way to create a performance view that includes query execution time as well as the impact of the internal Superset cache. While the former can be derived by parsing the data warehouse tables (e.g. on Snowflake via SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY), this does not include the cache perspective. We experimented with the duration_ms column of the log table, but frankly we are not sure about its reliability.
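
For what it's worth, a rough latency view based on that (possibly unreliable) duration_ms column could look like this:

-- rough latency percentiles per action, based on the duration_ms column of the logs table
select
lower(action) as action,
count(1) as events,
percentile_cont(0.5) within group (order by duration_ms) as p50_duration_ms,
percentile_cont(0.95) within group (order by duration_ms) as p95_duration_ms
from logs
where duration_ms is not null
group by 1
order by 2 desc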

If you have more detailed knowledge on the above topics, or simply want to join us on this effort to build the next iterations of this dashboard, please reach out directly in the Superset Slack Community.

A big thank you goes out to our Superset team Agustin Figueroa and Donatas Jocas, as well as Ilayda Tangor and Sung Pham, for working on this as their hackathon project. In case you are interested in other areas we are working on, you can find more data tech stack related articles in our HomeToGo publication.

Appendix

SQL code for monitoring.superset_action_log dataset:

select
date_trunc('day', l.dttm) as action_date,
case 
 when lower(l.action) = 'queries' then 'query from sqllab'
 when lower(l.action) = 'chartrestapi.data' then 'query from charts'
 when lower(l.action) = 'count' then 'dashboard view'
 when lower(l.action) like 'annotation%' then 'annotations'
 when lower(l.action) like 'css%' then 'css'
 else lower(l.action)
end as action,
l.user_id,
u.username as user_name,
u.created_on as user_registration_date,
l.dashboard_id,
d.dashboard_title,
case 
 when d.published = true then 'published'
 else 'draft'
end as dashboard_status,
l.slice_id,
s.slice_name,
s.datasource_type,
s.datasource_name,
s.datasource_id,
count(1) as action_count
from logs as l
left join ab_user as u on u.id = l.user_id
left join dashboards as d on d.id = l.dashboard_id
left join slices as s on s.id = l.slice_id
where 1=1
and l.action != 'log'
group by 1,2,3,4,5,6,7,8,9,10,11,12,13

SQL code for monitoring.superset_naming_conventions dataset:

with cte_names as (
 select
 table_name as name,
 'virtualized dataset' as source,
 'published' as status,
 created_by_fk as created_by_user_id
 from tables
 where 1=1
 and is_sqllab_view = true

 union 

 select
 dashboard_title as name,
 'dashboard' as source,
 case 
 when d.published = true then 'published'
 else 'draft'
 end as status,
 du.user_id as created_by_user_id
 from dashboards as d
 left join dashboard_user as du on d.id=du.dashboard_id 

 union

 select
 slice_name as name,
 'charts' as source,
 'published' as status,
 su.user_id as created_by_user_id
 from slices as s
 left join slice_user as su on s.id=su.slice_id 
),
cte_names_with_properties as (
select 
name,
source,
u.username as owner,
status,
case 
 when lower(split_part(name,'.', 1)) = split_part(name,'.', 1) then true
 else false
end as domain_is_lowercase,
case 
 when lower(name) like '%.%' then true
 else false
end as has_dot,
case 
 when lower(name) like 'archived%' then true
 else false
end as is_archived,
split_part(name, '.', 1) as domains,
case 
 when lower(name) like '%untitled%' then true
 when lower(name) ~ '\d{2}\/\d{2}\/\d{4}' then true
 else false
end as has_bad_phrases
from cte_names as tmp
left join ab_user as u on tmp.created_by_user_id = u.id
)
select 
*,
case 
 when domain_is_lowercase is true 
 and has_dot is true 
 and has_bad_phrases is false 
 and is_archived is false
 then true
 else false
end as is_valid
from cte_names_with_properties
