Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer

Maxime Beauchemin

Being a data engineer is a lot like being a plumber. Both professions involve critical, behind-the-scenes work that people rely on every day, but rarely notice—until something breaks. The job is essential, but it’s also mind-numbingly repetitive. Every house needs plumbing, and plumbers spend their careers laying pipes, fixing leaks, and installing the same fixtures. Over time, it’s the same work, house after house.

Data engineering feels much the same way. We build pipelines, transform data, and maintain systems—work that is intricately repetitive. Each company requires similar solutions, yet we find ourselves rebuilding variations of the same pipelines time and time again. But here’s the difference: as software engineers, we don’t have to be stuck in this cycle. We have tools that allow us to break free from this repetition, to build reusable abstractions and scalable systems that could be shared and replicated instantly across industries.

Yet despite this potential, we’re still reinventing the wheel. I’ve seen it across hundreds of teams: data engineers writing slightly different versions of the same SQL scripts and pipelines over and over. We’re all solving the same problems but without leveraging the reusability that software engineering should enable.

When I open-sourced Apache Airflow back in 2015, I thought we would see data engineers leveraging these tools to create reusable, high-level constructs. I imagined pipelines that could be shared across companies—like software libraries—where teams could build once and use everywhere. But that hasn’t happened. Even at the Airflow Summit, where we saw huge advancements in orchestration, no one was talking about reusing the actual logic within these pipelines.

So, What’s the Problem?

At first glance, it seems like companies—especially those in the same verticals—should be able to share pipelines. SaaS companies, for instance, have nearly identical core entities like users, organizations, and subscriptions. But as you dig deeper, you realize that no two companies calculate their metrics in exactly the same way. Whether it’s the way they define engagement or measure churn, every business has its own quirks. These small but significant differences prevent the kind of standardization and reuse that should be possible in software engineering.

The intuition is clear: every tech company needs to compute engagement and growth metrics from event data. It seems logical that a generic event pipeline could reliably produce common metrics—active users (DAU, WAU, MAU), actions, actors, and growth metrics like new, retained, or churned users. Such a pipeline could even support cohort analysis and user segmentation. Extend that further, and standardized pipelines could even enable reusable data visualizations and dashboards. So why doesn’t this exist?

This blog post dives into that question: why haven’t we seen the rise of unified data models and reusable computations? Is this a missed opportunity, or is something deeper at play?

Intricately Similar, Yet Intricately Different

When looking at a specific subject area, it’s striking how similar the core entities and attributes are across different businesses. The foundational metrics that companies aim to compute have been established over decades of business performance management, leading to clear commonalities. As mentioned earlier, SaaS companies are similar to one another: users have core attributes like user ID, email, and signup date; organizations typically track attributes such as organization name, industry, and plan tier. The metrics they care about are equally uniform—Monthly Recurring Revenue (MRR), Churn Rate, Customer Lifetime Value (CLV), and Net Revenue Retention (NRR) are standard KPIs in the SaaS industry. Whether you’re tracking user engagement (e.g., DAU/WAU/MAU) or financial health (e.g., ARR, LTV), the formulaic approach to computing these metrics is well-established and consistent across SaaS companies.

But while the similarities are evident, the differences are just as critical. Each business operates with its own unique set of priorities, meaning that key entity attributes, business rules, and pricing strategies can vary widely. For example, while the concept of a user might seem universal, different companies might care about vastly different sets of user attributes. Beyond common demographics like age or location, some businesses may focus on user engagement frequency, while others prioritize attributes like feature adoption or subscription tier changes.

Moreover, the user journey for each company can differ significantly. One SaaS company might offer freemium models, while another may have complex tiered pricing. These pricing models can heavily influence how metrics like LTV or churn are calculated. For instance, a company that offers annual subscriptions might have a very different approach to measuring churn compared to a company focused on monthly plans. Business rules around KPIs are also subject to unique tweaks. A company might calculate Net Revenue Retention (NRR) differently based on how they handle upgrades, downgrades, or customer discounts.

In short, while the foundational models are similar, the real challenge comes in the layers of customization and nuance that each business requires to reflect its unique reality.

Introducing Parametric Pipelines on Unified Models

At the heart of addressing these challenges lies the concept of a “parametric pipeline” on “unified models.” The core idea is simple: instead of reinventing the wheel for every data transformation, we could define (1) a unified data model to normalize your data into, and (2) a “parametric” data pipeline, dynamically generated from parameters exposed through a pre-defined interface, that transforms that data to support common analytics use cases.

A Unified Data Model

The first step in this concept is defining a unified model—a standard data model that normalizes data across different sources and systems. This model provides the foundation, bringing structure to common business entities like users, organizations, and activities. The goal is to create a flexible schema that can accommodate data from multiple sources, such as CRM systems (e.g., Salesforce, HubSpot) or subscription platforms, without losing the ability to compute core metrics like customer lifetime value (CLV), churn, and net revenue retention (NRR).

For instance, the unified model for a SaaS business would include tables for users, organizations, and subscriptions, where users and organizations are linked to activity logs (e.g., actions taken by users). This normalization ensures that, no matter the source of the data—be it a CRM system or a homegrown product—the model maintains a common structure.
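
To make this concrete, here is a hedged sketch of what such a unified SaaS model could look like. The table and column names are illustrative assumptions rather than a published standard, and the DDL loosely follows Spark SQL syntax (the MAP type carries custom, source-specific fields, a point discussed further down):

    CREATE TABLE unified_organizations (
        organization_id   STRING,
        organization_name STRING,
        industry          STRING,
        plan_tier         STRING,
        created_at        TIMESTAMP
    );

    CREATE TABLE unified_users (
        user_id         STRING,
        organization_id STRING,               -- links users to their organization
        email           STRING,
        signup_date     DATE,
        attributes      MAP<STRING, STRING>   -- custom, source-specific fields
    );

    CREATE TABLE unified_subscriptions (
        subscription_id STRING,
        organization_id STRING,
        plan_tier       STRING,
        mrr_amount      DECIMAL(18, 2),
        discount_amount DECIMAL(18, 2),
        period_start    DATE,
        period_end      DATE
    );

    CREATE TABLE unified_events (
        event_id   STRING,
        user_id    STRING,
        action     STRING,                    -- e.g. 'login', 'invite_sent'
        event_ts   TIMESTAMP,
        attributes MAP<STRING, STRING>        -- source-specific extras
    );

Connectors from a CRM, a billing system, or a homegrown product would each be responsible for mapping their source schemas into these shared tables.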

Another benefit of having a unified data model—whether it serves as the input or output of a data pipeline—is that it opens up opportunities for data integration across business units or following mergers and acquisitions (M&A). It also allows for the sharing of semantic layers that often live within Business Intelligence tools, creating a consistent layer for reporting and analysis across the organization. This kind of model enables smoother transitions, better data interoperability, and improved decision-making across different business entities.

Importantly, this model must be flexible enough to allow businesses to map their unique attributes and behaviors into it, accommodating various custom fields, actions, or metadata from different systems.

The real power of a unified model lies in its ability to bridge the gap between disparate data sources and provide a consistent framework for analytics. This allows for simplified and standardized transformations downstream, regardless of the complexities in upstream data sources.

Clearly, transforming various source data into this unified model represents a significant amount of effort. This could include building connectors for multiple systems and mapping data fields. However, when compared to the manual effort it takes for every business to continuously reinvent the downstream transformations in a silo, this initial investment becomes far more justified. By normalizing data into a unified model upfront, businesses can avoid repetitive work, unlock the benefits of standardized pipelines, and open the door to reusable analytics and visualizations downstream. In the long run, the time and resources saved on maintenance, debugging, and cross-team alignment far outweigh the initial transformation effort.

A Parametric Data Pipeline

Once the data is normalized into the unified model, the next component is the parametric pipeline. This pipeline is designed to transform the normalized data into datasets that enable meaningful insights, metrics, and KPIs. The term “parametric” refers to the pipeline’s ability to adapt to various configurations and customizations through parameters that adjust the underlying logic.

For example, a parametric pipeline could compute common SaaS metrics like MRR, churn, and user engagement. However, rather than hard-coding the logic for each use case, the pipeline would allow for parameters such as timeframes, custom segments, or business-specific rules (e.g., handling customer upgrades/downgrades differently). This ensures that while the pipeline is standardized, it can also be flexible enough to accommodate a wide range of business scenarios and analytics needs.
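
As a hedged illustration of what those parameters could look like in practice, here is a dbt-style, Jinja-templated model that computes MRR from the unified_subscriptions table sketched earlier; revenue_column and include_discounts are hypothetical parameters invented for this example:

    {% set revenue_column    = var('revenue_column', 'mrr_amount') %}
    {% set include_discounts = var('include_discounts', true) %}

    SELECT
        organization_id,
        DATE_TRUNC('month', period_start) AS mrr_month,
        SUM(
            {{ revenue_column }}
            {% if include_discounts %} - COALESCE(discount_amount, 0) {% endif %}
        ) AS mrr
    FROM {{ ref('unified_subscriptions') }}
    GROUP BY 1, 2

The same model definition can then serve two companies with different discount policies by flipping a configuration value, which is the essence of the parametric idea.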

Parametric pipelines also offer scalability. Instead of reinventing metrics or transformations each time the data model or source changes, businesses can rely on the same pipelines, adjusting parameters to account for their specific requirements. For instance, you might use the same pipeline to compute retention metrics for different product lines or segments of users by simply changing the input parameters.

All in All

The promise of such a system is significant: it could enable organizations to leverage reusable code and computations for analytics while maintaining the flexibility needed to address their unique requirements. However, this approach also raises critical questions around where to draw the line between flexibility and structure, and how to balance simplicity with customization.

This concept of parametric pipelines is aimed at solving the very problem we've outlined: businesses are too often forced to rewrite logic that could, in theory, be standardized and shared across organizations that have analytics needs in common. But the key challenge remains in determining the right balance between what is standardized and what is customizable—a theme that recurs throughout this exploration.

Unclogging Reusability Down the Pipeline

One of the most powerful benefits of adopting unified data models and parametric pipelines is their potential to enable reusability not just in data transformations but all the way to the data consumption and visualization layer. By creating predictable, standardized output schemas, these pipelines allow for common patterns of analysis, making it easier for data professionals to work with familiar, pre-defined outputs.

1. Creating Common Patterns for Analysis

With standardized outputs, the guesswork is removed from data analysis. Data professionals no longer need to decipher how metrics were computed or deal with inconsistencies across organizations. Unified models can produce consistent outputs, ensuring that everyone is using the same definitions for core metrics like MRR, churn, or user engagement. This predictability makes analysis smoother and faster, fostering more trust in the data and the process itself.

2. Enabling Reusable Data Visualizations

The real potential lies in how this predictability can extend to reusable visualizations. Once a unified model is established, we can easily build parametric dashboards and charts that adjust based on input parameters, much like the data pipelines themselves. With solutions like Superset and Preset, we already have the mechanics to define assets as code, allowing for fully customizable, reusable dashboards. However, the challenge today is that different teams often work with unique models, making it difficult to standardize the production of these visual assets.

The Superset/Preset Advantage

In Apache Superset and Preset, we already have the ability to define visualizations as code, which means that once the output from a unified model becomes standardized, creating reusable dashboards becomes straightforward. If the parametric pipeline produces consistent, reusable datasets, these can be easily fed into pre-built dashboards and visualizations—removing the friction of building new charts from scratch for every project or dataset.

By standardizing the data transformation process through unified models, we could unlock the ability to reuse logic, code, and even visualization assets across multiple teams and organizations. It’s not just about code reuse in the transformation layer—this approach could lead to reusability across the entire data stack, from ingestion to final insights.

The Right Foundation

To SQL or not SQL?

When architecting a solution for unified data models and computations, there are clear constraints to consider. First off, you need a significant amount of storage and compute power to support all this data processing. With the rise of cloud data warehouses and the prevalence of SQL, it might seem like a SQL-based approach is the ideal solution. However, SQL comes with its own set of challenges.

  1. Limited Dynamism: SQL isn’t the most dynamic language out there, which often leads teams to rely on “templated SQL” as a workaround (see the short sketch after this list). While tools like dbt or SQLMesh can facilitate this, the resulting code can become messy when trying to generate dynamic SQL.
  2. Multiple SQL Dialects: Supporting different SQL dialects adds another layer of complexity. What works in one environment might break in another, making it harder to maintain a consistent codebase.
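
To illustrate the dynamism point above, here is a small Jinja-templated sketch that pivots a configurable list of actions into columns; the action_list variable and the unified_events model are illustrative assumptions:

    {% set action_list = var('action_list', ['login', 'invite_sent', 'export']) %}

    SELECT
        user_id,
        CAST(event_ts AS DATE) AS activity_date
        {% for action in action_list %}
        , SUM(CASE WHEN action = '{{ action }}' THEN 1 ELSE 0 END) AS {{ action }}_count
        {% endfor %}
    FROM {{ ref('unified_events') }}
    GROUP BY 1, 2

It works, but once loops, conditionals, and dialect-specific branches start nesting, the template quickly becomes harder to read than the SQL it generates, which is exactly the messiness described above.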

Considering a non-SQL approach, Spark presents a compelling alternative. It allows for more dynamic schemas and configurations, sidestepping some of the templated-SQL issues. However, the trade-off is that it requires a Spark cluster, which can be daunting for data engineers and analysts who often prefer a SQL-oriented approach.

Parametric vs. template-oriented

Another key question around the foundation is whether to adopt a parametric approach or a template-and-fork strategy. A purely parametric approach can create a complex black box that’s hard to modify. On the other hand, a template-oriented approach means that once you fork, you’re largely on your own. The best solution might be a blend of both: a solid foundation with parametric capabilities, while allowing for forking when needed.

Fixed or dynamic schema?

We also need to consider how to manage table structures. When a new custom user attribute is created, should we columnize it? My stance is that managing schemas dynamically can become a logistical nightmare. A fixed schema with complex data types like "maps" would be a more effective approach. This allows for dynamic attributes while keeping the core structure stable. The contents of those maps can be dynamic, supported by metadata that lets the compute engine handle them efficiently. The consumption layer can then virtually columnize those maps using views or materializations, making it easier to work with.
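
As a sketch of that approach, assuming a warehouse with map support (the bracket syntax below follows Spark SQL and Trino) and the illustrative unified_users table from earlier, a view can expose selected map entries as ordinary columns:

    CREATE VIEW users_wide AS
    SELECT
        user_id,
        organization_id,
        email,
        signup_date,
        attributes['customer_segment']  AS customer_segment,    -- custom attribute surfaced as a column
        attributes['compliance_status'] AS compliance_status    -- another map entry, no schema change needed
    FROM unified_users;

The physical schema stays fixed when a new custom attribute shows up; only the view, or a metadata-driven materialization, needs to change.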

Picking the Right ETL Framework

Choosing the right ETL framework is crucial for effectively managing unified data models and computations. You need a solution that can express and run all these jobs seamlessly. If you’re leaning towards a templated SQL approach, there are several tools available, but it’s essential to consider a few key factors:

  • Templated SQL: This is a must for generating and managing SQL dynamically.
  • Dynamic DAGs of Tasks: The framework should support building directed acyclic graphs (DAGs) of tasks dynamically to define and orchestrate complex workflows.
  • Support for Incremental Load and Schema Management: The ability to handle incremental data loads and manage schema changes is vital for maintaining efficiency (a dbt-style sketch follows this list).
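
To make the incremental-load point concrete, here is a hedged, dbt-style sketch; config, is_incremental, ref, and this are standard dbt constructs, while the stg_events model name is illustrative:

    {{ config(materialized='incremental', unique_key='event_id') }}

    SELECT
        event_id,
        user_id,
        action,
        event_ts,
        attributes
    FROM {{ ref('stg_events') }}    -- hypothetical staging model
    {% if is_incremental() %}
    -- only pull events newer than what the target table already holds
    WHERE event_ts > (SELECT MAX(event_ts) FROM {{ this }})
    {% endif %}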

Many tools could fit this bill, including dbt, Airflow, SQLMesh, and Dagster, just to name a few. However, the choice of framework should also consider the future owners of these transformations. Here are some critical factors to weigh:

  • Framework Fit: dbt shines in SQL-only environments and has strong support for templated SQL and Jinja. However, if there’s a chance you’ll need to run Python code or handle more complex logic, you might want to look elsewhere.
  • Multi-Dialect Support: SQLMesh might be better for handling multiple SQL dialects, which is crucial given the varying environments in which teams operate. Yet, its adoption isn’t as widespread as dbt’s, which could be a consideration for team familiarity.
  • Complex Data Types: Both dbt and Airflow handle SQL and Jinja well, but they can struggle with the multi-dialect challenge, potentially forcing teams to stick to a common SQL subset supplemented by a dialect compatibility layer where necessary. This matters especially if you plan to use advanced data types like maps, whose syntax varies considerably across engines.

Ultimately, the success of the chosen framework hinges not only on selecting the right tool for the job but also on ensuring that it aligns with the preferences and skills of the people who will be maintaining this logic in the long run.

Existentialism - WHERE MIN(flexibility) > MAX(code_reuse)

When it comes to building parametric data pipelines on unified data models, there’s a delicate balance between flexibility and code reuse. The minimum flexibility requirements often add up to a set of variations so large that effective code reuse becomes hard to support. Let’s explore the areas where flexibility is required and see how these needs compound.

  1. Core Entities: Core entities themselves might not even be a safe assumption. For instance, if your use case aims to accommodate extra entities on top of the common ones defined in the framework, this can lead to excessive dynamism that’s hard to account for at the framework level (more on “Compartmentalizing Dynamism” later in this post). For example, your SaaS company might have a more intricate hierarchy that goes beyond the typical User→Organization model, while requiring reporting at the team or business unit level. Flexibility in managing extra entities that can’t be represented as attributes of the ones defined by the framework may make the code hard to write, understand, manage, and evolve.
  2. Core Entity Attributes: While we can standardize core attributes like user ID and organization name, many businesses require unique attributes that reflect their specific needs—like customer_segment or compliance_status. This divergence complicates a unified model.
  3. Preferred Computation Engine and Related SQL Dialect: Different teams may prefer specific computation engines (like Snowflake, Redshift, or Spark) that come with their own SQL dialects. Supporting these variations adds another layer of complexity, making it challenging to maintain a consistent codebase.
  4. Preferred ETL Tool/Orchestrator: The choice of ETL tools (like dbt, Airflow, or SQLMesh) also influences how data transformations are managed. Each tool has its strengths and weaknesses, and accommodating preferences can lead to fragmentation in the code and processes.
  5. Action Types: We’ve established common actions (e.g., registering, logging in), but each business may have unique actions that need tracking, such as feature_requests or customer_feedback. The sheer variety can lead to a bloated model that’s hard to manage.
  6. Custom Metrics: Different organizations often have distinct ways of measuring success. While some may focus on user engagement metrics, others might prioritize revenue growth. Custom metrics necessitate additional logic that can clash with a reusable framework.
  7. Data Sources: As businesses evolve, the number of data sources they integrate can expand. Each source may introduce new data types and structures that require specific handling, further complicating the model.
  8. Diverging business logic: The transformations applied to data can vary widely from one organization to another. A standardized approach might not be flexible enough to accommodate unique business logic, leading to additional customization.
  9. Data Modeling Approaches: Different teams may be opinionated about data modeling, and different database engines work better with different paradigms.

As we can see, these flexibility requirements compound, and the range of variations to accommodate becomes so broad that effective code reuse gets challenging. Clearly, catering to all of these preferences is impossible. The proposed system will have to make hard decisions about where to impose structure and where to offer extensibility and flexibility. For instance, if you go the SQL route, the solution will need to pick an ETL tool to express these transformations—say dbt—and select a subset of supported SQL dialects, like Snowflake or BigQuery. As these decisions are made, they may cater to different subsegments of organizations, further complicating the landscape. Shoot too broad and the system doesn’t fit anyone’s specific needs; shoot too narrow and it’s only useful to a single organization.

At the core of this "parametric pipeline" idea is the need to clearly define what is flexible (the parameters) and what is hard-shelled (the pipeline itself). If, over time, the parameters begin to outnumber the core pipeline structure, it points toward the need for custom solutions rather than reusable code. In other words, if it becomes harder to parameterize the system than it is to write custom code, the framework's value completely disappears. The real challenge, from my perspective, isn’t just finding the right ratio of flexibility to structure—it’s defining what should be flexible and what should remain fixed. As with any framework, the key is balancing flexibility and constraints in a way that operators find productive.

Now, when looking at the viability of these unified computation models, and when seeking hard common ground across a wide variety of organizations and use cases, it may be that the common denominator is simply too small to justify the constraints imposed by the framework. Perhaps the reason these unified frameworks haven’t emerged is that the compounding need for flexibility across these areas undermines the structure necessary for a solid foundation.

The Not-So-Competitive Landscape

The fact that I haven't come across a widely adopted solution or framework that tackles unified data models and computations head-on might suggest there’s a gap in the market. I haven't done exhaustive research, but after working in the space for decades, I would expect to have heard about something significant by now if it existed. Perhaps I'm missing something, and if so, I encourage readers to comment or reach out with solutions they’ve come across that fit the bill—along with their thoughts on whether these tools truly fulfill the promise of reusable data models and computations across organizations.

It’s worth noting that multiple BI vendors have tried to offer pre-built templates over the years, aiming to simplify analytics for specific verticals. However, these solutions have generally not gained much traction. For instance, Looker’s reusable LookML models were intended to allow organizations to reuse analytics components, but widespread adoption of these templates has been limited. Similarly, dbt is often hailed for its modularity and reusable SQL, yet there’s little evidence, as far as I know, of a successful push toward industry-specific, shareable models within its community. I believe that the Microsoft ecosystem has also released vertical-specific templates over the years, particularly for industries like retail or finance, but even these offerings haven’t made a significant impact or achieved mass adoption.

Worth mentioning is Microsoft’s work around their Common Data Model (CDM). While CDM is primarily aimed at data integration rather than analytics, it could theoretically serve as a foundation for implementing parametric pipelines. Similarly, Fivetran’s source-oriented standardized models, along with their open-source dbt packages, could be leveraged to support analytics use cases. Yet, despite the availability of these unified models, we haven’t seen the emergence of widely adopted open-source packages that help transform this data for analytics purposes.

This lack of widespread adoption suggests that while the tools may offer certain reusable elements, they fail to address the more complex, dynamic needs of organizations across different industries. This might support the idea that creating truly reusable, unified data models faces significant barriers—whether it's the need for flexibility, the inherent differences between businesses, or the difficulty of balancing standardization with customization.

Vertical Opportunities for Parametric Pipelines

There are many areas of business where a parametric pipeline solution could create immense value. These solutions would allow organizations to standardize and automate common computations while maintaining enough flexibility to adapt to unique business needs. Below are a few verticals and use cases that could benefit greatly from such a framework:

User Activity / Growth & Engagement

Almost every business needs a way to analyze user behavior over time. By building a parametric pipeline around user activity, it becomes easy to compute metrics like Daily Active Users (DAU), Weekly Active Users (WAU), and Monthly Active Users (MAU). These pipelines could also handle growth accounting metrics such as new, churned, resurrected, retained, or stale users. Additionally, organizations could answer more sophisticated user-journey-type questions around behavior, frequency of use, and deeper segmentation of usage patterns.
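
As a rough sketch of what such a pipeline might compute, the query below derives DAU and classifies each active user per day as new, retained, or resurrected from the illustrative unified_events table (ANSI-ish SQL; date arithmetic and positional GROUP BY vary by dialect, and churned users would be counted by checking for absence on the following day):

    WITH daily_activity AS (
        SELECT user_id, CAST(event_ts AS DATE) AS activity_date
        FROM unified_events
        GROUP BY 1, 2
    ),
    first_seen AS (
        SELECT user_id, MIN(activity_date) AS first_date
        FROM daily_activity
        GROUP BY 1
    ),
    labeled AS (
        SELECT
            d.user_id,
            d.activity_date,
            CASE
                WHEN d.activity_date = f.first_date THEN 'new'
                WHEN LAG(d.activity_date) OVER (
                         PARTITION BY d.user_id ORDER BY d.activity_date
                     ) = d.activity_date - INTERVAL '1' DAY THEN 'retained'
                ELSE 'resurrected'
            END AS growth_state
        FROM daily_activity d
        JOIN first_seen f USING (user_id)
    )
    SELECT
        activity_date,
        COUNT(*)                                                       AS dau,
        SUM(CASE WHEN growth_state = 'new'         THEN 1 ELSE 0 END)  AS new_users,
        SUM(CASE WHEN growth_state = 'retained'    THEN 1 ELSE 0 END)  AS retained_users,
        SUM(CASE WHEN growth_state = 'resurrected' THEN 1 ELSE 0 END)  AS resurrected_users
    FROM labeled
    GROUP BY 1;

A parametric version of this query would expose the activity granularity (daily, weekly, monthly) and the qualifying actions as parameters.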

SaaS Metrics

For SaaS companies, revenue and retention-related metrics are critical. A unified parametric pipeline could automate key performance indicators (KPIs) like Monthly Recurring Revenue (MRR), Customer Lifetime Value (CLV), or churn rates. By building on top of the user activity framework, SaaS metrics could incorporate a customer dimension and deliver revenue-focused insights on a per-customer basis, while also answering questions about subscription upgrades, downgrades, and user engagement.

Cohort Analysis Framework

Cohort analysis is essential for businesses looking to track customer behavior over time and compare different groups of users. Building a parametric pipeline for cohort analysis would allow businesses to define specific cohorts based on user actions or characteristics and then automatically compute performance metrics, retention rates, and engagement trends for each cohort. This solution would enable deeper insights into how different groups evolve over time.
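
As a minimal sketch, and reusing the daily_activity shape from the growth-accounting example above, a cohort-retention query could look like this (date_diff here is written Trino-style and varies by dialect):

    WITH cohorts AS (
        SELECT user_id, DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
        FROM daily_activity
        GROUP BY 1
    )
    SELECT
        c.cohort_month,
        DATE_DIFF('month', c.cohort_month,
                  DATE_TRUNC('month', d.activity_date)) AS months_since_first_use,
        COUNT(DISTINCT d.user_id)                        AS active_users
    FROM daily_activity d
    JOIN cohorts c USING (user_id)
    GROUP BY 1, 2
    ORDER BY 1, 2;

The cohort definition itself (signup month, first purchase, plan tier at signup) is exactly the kind of parameter such a pipeline would expose.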

Retail

Retail businesses constantly analyze customer purchasing behavior, product performance, and inventory needs. A parametric pipeline could help standardize the computation of metrics like Average Order Value (AOV), customer lifetime value, and inventory turnover ratios. Retailers could also benefit from automated sales forecasting, segmentation analysis, and promotions performance tracking, all tailored to their specific product lines and customer bases.

Healthcare

In healthcare, parametric pipelines could be used to manage patient data, clinical outcomes, and operational efficiency. For example, hospitals and healthcare providers often need to track patient visits, readmission rates, and the effectiveness of treatment protocols over time. A unified pipeline could standardize these metrics while still allowing for adjustments based on specific medical departments, patient demographics, or regulatory requirements like HIPAA.

Manufacturing

Manufacturing businesses require a deep understanding of production efficiency, inventory levels, and supply chain performance. A parametric pipeline could automate the calculation of metrics like Overall Equipment Effectiveness (OEE), defect rates, and supply chain lead times. These pipelines could also be adapted to track real-time production data, monitor machine performance, and optimize resource allocation across different factories.

E-Commerce

In e-commerce, customer behavior and purchasing trends are crucial. A parametric pipeline could help automatically generate metrics around cart abandonment, product popularity, or conversion rates across different marketing channels. It could also integrate with user activity data to identify patterns in customer purchasing frequency, preferences, and product return rates.


Clearly, the opportunities for parametric pipelines are vast, spanning nearly every analytics vertical you can think of. The more common and standardized a subject area becomes, the more viable and useful a parametric solution will be. Areas that are well-defined and widely adopted can leverage simpler models, making the potential for reusable, standardized computations even greater. Whether it's user behavior, SaaS metrics, or industry-specific KPIs, the sky's the limit when it comes to the impact these pipelines could have across businesses.

The Challenge of Composability

Another reason why parametric pipelines haven’t gained traction may lie in the complexity of fitting multiple business areas together in a cohesive and flexible framework. While it’s relatively straightforward to standardize computations in isolated areas like user activity or SaaS metrics, it becomes much harder when trying to compose and integrate multiple subject areas across an entire organization in a cohesive way.

The difficulty here is in balancing scope and flexibility. Each vertical, whether it’s retail, healthcare, or manufacturing, comes with its own nuances, making it challenging to create modules that are robust enough to cover their specific needs while still being flexible enough to integrate into a larger, unified system. In an ideal world, these solutions would be built as modular components that could be composed together, allowing organizations to orchestrate multiple subject areas centrally, without sacrificing the guarantees and standards required for each module to function independently.

Composability would allow businesses to pick and choose the modules most relevant to their needs—whether that's user activity, cohort analysis, or revenue metrics—while ensuring they all fit together in a central computation framework. However, designing these modules in a way that ensures they are both interoperable and independent remains a significant technical challenge.

Compartmentalizing Dynamism: Flexibility Without Sacrificing Structure

A key challenge in building reusable pipelines is balancing flexibility with structure. To enable parametric pipelines, we need a system that can adapt to different business needs while keeping the core structure intact. The solution lies in compartmentalizing dynamic elements within a mostly static schema.

This means defining a stable core for entities like users, organizations, or subscriptions—elements that don’t change much across companies—while using more flexible data types like maps or JSON fields for custom dimensions or unique business logic. This allows for customization without disrupting the overall structure.

By relying on metadata to process dynamic fields, the pipeline can adapt to business-specific logic without rewriting the entire system. This balance ensures the framework remains stable and efficient while still accommodating business-specific needs.
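
As a hedged sketch of that metadata-driven idea, reusing the illustrative unified_users table and its attributes map from earlier, a templated model could read the list of custom attributes to surface from configuration instead of hard-coding them:

    {% set custom_user_attributes = var('custom_user_attributes',
                                        ['customer_segment', 'compliance_status']) %}

    SELECT
        user_id,
        organization_id,
        email,
        signup_date
        {% for attr in custom_user_attributes %}
        , attributes['{{ attr }}'] AS {{ attr }}    -- surfaced from the map, per configuration
        {% endfor %}
    FROM {{ ref('unified_users') }}

Adding a new custom field then becomes a configuration change rather than a schema migration or a pipeline rewrite.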

Subject-Oriented vs. Source-Oriented: The Data Integration Conundrum

One of the biggest questions when building parametric pipelines is whether the solution should be subject-oriented or source-oriented. Should the focus be on covering specific subject areas—like CRM—and providing a unified model that can accommodate data from any CRM system (Salesforce, HubSpot, etc.)? Or should the approach be more source-specific, tightly integrating with each individual system’s unique architecture?

Take the CRM example. Different CRMs share many similarities in structure—both have concepts like leads, accounts, and opportunities—but they also differ in subtle yet important ways. Salesforce, HubSpot, and others allow for different levels of customization, extensibility, and even data architecture, meaning a universal CRM model would need to balance these inherent differences. A subject-oriented approach would strive to standardize these differences into a common model, translating from each CRM’s unique schema into a unified, shared structure.

On the other hand, a source-oriented approach might provide more flexibility by focusing on tightly integrating with each system's native architecture. While this allows for greater accuracy and utilization of system-specific features, it would mean building different models for each source system, reducing the reusability and scalability of the overall framework.

Ultimately, both subject-oriented and source-oriented approaches are valid and necessary, but the decision should be evaluated on a case-by-case basis. The key factors to consider are how complex and tailored the source data model is, and how large the user base for that particular source might be.

For instance, Salesforce and HubSpot are good examples where a source-oriented approach might be more suitable. Both systems have a wide population of users, but their underlying data models are intricate and intricately different, making it challenging to impose a one-size-fits-all solution.

On the flip side, a more generic user-action framework that computes engagement and growth metrics (e.g., tracking users performing actions over time) can benefit from a subject-oriented approach. With a simple input schema—such as user, action, and time—this framework is flexible enough to be built on top of various systems, supporting multiple use cases while maintaining high reusability. This makes it ideal for scenarios where the input data remains relatively uniform across different sources.

Both approaches have their trade-offs. Subject-oriented models are great for scalability and reusability, but they can struggle with the nuances and customization that each source system requires. Meanwhile, source-oriented models allow for precision but may lack the generalization needed for true scalability across different tools or systems.

Conclusion

As data engineers, we shouldn’t be content with simply laying the same pipes again and again, like a plumber constrained by the limitations of physical work. Unlike plumbing, data engineering offers us the chance to abstract and automate—to build solutions that evolve and scale with every use. The real opportunity is in stepping beyond those repetitive, manual tasks and embracing the power of software to create systems that adapt, replicate, and scale at a level that physical work never could. The future lies not in repeating what we’ve done but in elevating our work to new heights with tools that let us focus on innovation instead of maintenance.

Despite the clear need for reusable, unified data models and parametric pipelines, these solutions haven’t gained much traction. Even though many companies across industries share similar needs, the challenges around flexibility, composability, and the balance between subject- and source-oriented models make it tough to create a one-size-fits-all solution.

The potential is huge, across all verticals and areas of business. But building a system that works across different businesses without locking them into rigid models is still an unsolved problem.

I wrote this post because, much like in open-source software, I believe ideas are meant to be shared, iterated on, and improved. As I described in "The Downfall of the Data Engineer," data engineering is still in its infancy, and there’s a lot that doesn’t feel quite right yet. We’re all trying to figure things out, and sharing ideas is a big part of that process. If we can make progress toward something like reusable parametric pipelines, it could solve many of the current pain points in the field. The future of data engineering could be so much brighter if we elevate ourselves, fellow plumbers, and start building these parametric pipelines. That future may be just over the horizon—let’s go build it.

What’s your experience? I’d love to hear from readers who have worked with any tools or platforms that attempt to tackle this challenge. Does any product or framework deliver on this vision of reusable models and computations? How well do they work—or what stops them from fulfilling their potential? Reach out to Preset or @mistercrunch on 𝕏.
