Modern Data Stack Playbook

What we run in production for lean startups

Here's the modern data stack I build for tech startups. It ingests marketing, product and finance data, cleans and links it, and provides a full, tested, trustworthy view of business activity.

I'll walk through the physical infrastructure from sources to destinations, and the logical modelling layer that transforms raw data into structured, usable datasets.

For each component, I choose tools I already run in production for clients so they're quick to set up.

My non-negotiables for each tool choice:

  • Lean team friendly: one engineer can manage it
  • Open source: if the pricing jumps, it can be self hosted
  • Good developer experience: easy for any data engineer to pick up quickly
  • Low entry-level cost: starter tier under £500 per month

Loader

Platforms use a loader to pull data from sources into a warehouse. Sources are usually third-party SaaS products like Mixpanel and HubSpot, internal databases like your product's transactional database, and semi-structured files like Google Sheets and Excel spreadsheets.

There is no single best loader. Choosing one means trading off support for your data sources, service quality (loading is "just a few API calls", but connectors break often), extensibility, enterprise security and compliance requirements, and cost.

Loaders start at hundreds of dollars per month and can easily run into the thousands. With row- or record-based billing, just a few sub-optimally configured sources can add up fast.
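To make that concrete, here's a back-of-envelope sketch of row-based billing. The per-million-row rate and the source volumes are invented for illustration, not any vendor's actual pricing.

```python
# Back-of-envelope estimate of row-based loader billing; the rate and
# volumes below are invented for illustration, not vendor pricing.
def monthly_loader_cost(rows_per_source: dict[str, int],
                        price_per_million_rows: float) -> float:
    """Total monthly cost when every synced row is billed."""
    total_rows = sum(rows_per_source.values())
    return total_rows / 1_000_000 * price_per_million_rows

sources = {
    "hubspot_contacts": 200_000,
    "stripe_invoices": 50_000,
    "events_table": 40_000_000,  # a full-refresh sync resent daily adds up
}
cost = monthly_loader_cost(sources, price_per_million_rows=15.0)
# cost == 603.75: one misconfigured source dominates the bill
```

One chatty events table synced as a full refresh can dwarf every other source combined, which is why loader configuration deserves as much attention as loader choice.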

For seed to series A startups I choose Airbyte because it's competitively priced, has a good range of source connectors, is open source and extensible, and is growing fast. Fivetran is often considered the best quality and supports the most sources, but usually costs more. Stitch is lower priced than both, well established, and has a good range of connectors, but the core product doesn't seem to have been updated since its 2018 acquisition.

| Loader | Positives | Deal-breaker for us |
| --- | --- | --- |
| Airbyte (chosen) | Open source, can be self-hosted, 550+ connectors, managed service offering at a fair price | None |
| Fivetran | Biggest connector library, brilliant developer experience, rock-solid SLA | Expensive. Pricing is complex and usually higher than other products, even at modest volumes. No open source fallback. |
| Stitch | Open source, can be self-hosted, good connector library, fair price | Complex codebase, difficult to write new connectors. |

Data Warehouse

The warehouse is the central source of truth. It needs to be fully managed, easy to use, and able to scale without painful migrations.

Because most startups are already on AWS or GCP, I use the default warehouse in each cloud: Redshift on AWS, BigQuery on GCP.

This keeps credentials, VPC peering, and cost dashboards inside the same cloud console the dev-ops team already uses, and avoids cross-cloud egress fees.

Alternatives like Postgres with Citus, or ClickHouse, are cheap to run, but the extra ops work they create outweighs the savings. They're brilliant technologies, but they have smaller communities and it's harder to hire data engineers who know them.

Engines like AWS Athena, which query Parquet or CSV files directly in object storage, can be considerably cheaper than Redshift, but the lack of a fixed cost creates risk. If you're ready to manage the complexity of Athena, it's a great option to migrate to, but I wouldn't recommend it for a first data platform.
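A rough sketch of that cost risk, assuming the commonly cited $5 per TB scanned for Athena and a hypothetical fixed warehouse bill:

```python
# Rough fixed-cost vs pay-per-scan comparison. $5/TB is Athena's commonly
# cited scan rate; the fixed warehouse figure is a hypothetical assumption.
def scan_cost(tb_scanned: float, price_per_tb: float = 5.0) -> float:
    """Monthly bill under pay-per-scan pricing."""
    return tb_scanned * price_per_tb

fixed_monthly = 250.0                # hypothetical small warehouse node
breakeven_tb = fixed_monthly / 5.0   # scan more than this, pay-per-scan loses
# breakeven_tb == 50.0: a dashboard repeatedly running unpartitioned
# full-table scans can blow past this, with no ceiling on the bill.
```

The point isn't the exact numbers; it's that a fixed-cost warehouse caps your downside, while per-scan pricing makes one badly written query everyone's problem.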


Data Modelling

Once source data lands in the warehouse, it needs cleaning, sometimes fixing, integration with other sources, and finally structuring into reporting marts.

I use dbt (Data Build Tool) for this. It's a combination of a framework for modelling and testing data, a way to link and run SQL models in sequence, plus useful macros and best practice guides. It's hugely popular and has become a de-facto standard in the modern data stack.

There are credible challengers, like SQLMesh, and Airflow DAGs can run modelling alongside much broader orchestration. I stick with dbt because its maturity, community size, hiring pool and built-in tests make it quick and easy to use, and they cover most essentials very well.

| Modelling tool | Positives | Deal-breaker for us |
| --- | --- | --- |
| dbt (chosen) | Very widely used. Version-controlled SQL. Built-in tests and docs. | None |
| SQLMesh | Powerful, easy to use, cheaper to run than dbt | Still quite new, with limited adoption. |
| Airflow | Totally open-ended: a full orchestration tool, not just data modelling. Cheaper to run than dbt. | Lots of work needed to reinvent core dbt features. Fragile and not developer-friendly in many places. |

Model layers

I build warehouses with four layers of models:

| Layer | Use case |
| --- | --- |
| Raw | Raw data from each source, standardised and lightly cleaned (e.g. text fields converted to dates) |
| Fixes | Patches for edge cases and inconsistencies in the data (e.g. adding or fixing values) |
| Intermediary | Unified business entities (e.g. user, subscription, purchase), combining views of each entity from all sources into a single model |
| Reporting Marts | Business-ready metrics (e.g. MRR, conversion rate, churn), produced by aggregating and calculating values from the intermediary models |

The mart layer powers BI and reporting tools, but analysts and data scientists might use all layers for deep dives and ad hoc analysis.
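In dbt these layers are SQL models, but the flow is easier to see language-agnostically. Here's a minimal Python sketch of the four layers turning raw subscription rows into an MRR mart; the field names, the fix rule and the plan prices are invented for illustration, not a real client schema.

```python
# Minimal sketch of the four model layers applied to subscription rows.
# Field names, the "fixes" rule and plan prices are illustrative only.
from datetime import date

def raw_layer(rows):
    # Standardise and lightly clean: parse text dates, normalise casing.
    return [{**r,
             "started_at": date.fromisoformat(r["started_at"]),
             "plan": r["plan"].strip().lower()} for r in rows]

def fixes_layer(rows):
    # Patch a known edge case: a legacy plan name in one source.
    return [{**r, "plan": "pro" if r["plan"] == "professional" else r["plan"]}
            for r in rows]

def intermediary_layer(rows, plan_prices):
    # Unified subscription entity, with a monthly price joined on.
    return [{**r, "monthly_price": plan_prices[r["plan"]]} for r in rows]

def reporting_mart(rows):
    # Business-ready metric: MRR across active subscriptions.
    return {"mrr": sum(r["monthly_price"] for r in rows if r["active"])}

rows = [
    {"started_at": "2024-01-05", "plan": "Professional", "active": True},
    {"started_at": "2024-02-10", "plan": "basic", "active": True},
    {"started_at": "2023-11-01", "plan": "basic", "active": False},
]
mart = reporting_mart(
    intermediary_layer(fixes_layer(raw_layer(rows)),
                       plan_prices={"pro": 50, "basic": 10}))
# mart == {"mrr": 60}
```

Each function maps to one dbt layer, and each layer only reads from the one before it, which is what keeps the lineage easy to test and reason about.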

Orchestration and Observability

Airbyte and dbt's managed services include some observability and orchestration features, sending alerts if sources fail to load, models don't build, or tests on marts don't pass. Alerts plus automated retries mean the most common errors can be handled with minimal setup and no extra cost.
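The retry-then-alert behaviour described above can be sketched as follows; the `alert` callable is a placeholder for whatever notifier you wire in (a Slack webhook, email), not an Airbyte or dbt API.

```python
# Sketch of retry-then-alert handling; `alert` is a placeholder notifier,
# not an Airbyte or dbt API.
import time

def run_with_retries(job, attempts: int = 3, delay_s: float = 0.0,
                     alert=print):
    """Run a job, retrying on failure; alert only if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                alert(f"job failed after {attempts} attempts: {exc}")
                raise
            time.sleep(delay_s)

calls = {"n": 0}
def flaky_sync():
    # Simulates a source connector that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "loaded"

result = run_with_retries(flaky_sync)
# result == "loaded" after two retried failures, with no alert sent
```

The useful property is that transient failures are absorbed silently, and a human is only paged once retries are exhausted.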

As dependencies grow, it's worth adding dedicated orchestration and observability tools for more precise and robust failure handling: when the platform starts powering AI systems and reverse ETL into other tools, when you rely on less reliable source connectors in the loader, or when critical reports need high-uptime SLAs.

Governance and Security

Data governance has to stand up to GDPR requirements and audits. Here are some key features of the stack that make it easy to stay compliant.

Each dbt model is version controlled in git, documented in dbt Docs, and columns are tagged as PII (personally identifiable information), GDPR sensitive, or financial. This is surfaced in the data catalog so analysts can see which fields need to be masked.

Both Redshift and BigQuery support column-level ACLs and can mask data. By exposing only the clean mart tables to Metabase and using service roles with read only credentials, dashboards are kept secure and viewers can't access sensitive data.
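As an illustration of what masking means in practice, here's a minimal sketch of redacting tagged columns before rows reach a BI tool. The tag set and the replacement token are assumptions for the example; in production this is done by the warehouse's own column-level ACLs and masking policies rather than application code.

```python
# Illustrative masking of tagged PII columns before rows reach a BI tool;
# the tag set and "***" token are assumptions, not a warehouse feature.
PII_COLUMNS = {"email", "full_name"}

def mask_row(row: dict, masked: set = PII_COLUMNS) -> dict:
    """Replace values in tagged columns, leaving other fields untouched."""
    return {k: ("***" if k in masked and v is not None else v)
            for k, v in row.items()}

user = {"id": 7, "email": "ada@example.com",
        "full_name": "Ada L", "plan": "pro"}
safe = mask_row(user)
# safe == {"id": 7, "email": "***", "full_name": "***", "plan": "pro"}
```

Because the PII tags live on the dbt models, the same column list drives both the data catalog and the masking policy, so the two can't silently drift apart.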

If dbt data quality or freshness tests fail, dashboards are blocked from updating and an alert is sent. Your reports show when they were last updated, so it's clear if you're looking at stale data.

Reporting Outputs

There's no single best reporting product. Here the tradeoffs are between ecosystem integration (e.g. Microsoft's Power BI and Google's Looker), analyst-driven tools vs. self-serve tools for non-technical users, depth of customisation, governance features like audit trails, and total cost.

I choose Metabase for most startups because it's low cost, quick to set up, and makes it easy for teams to explore data on their own. There's enough flexibility for analysts to build custom dashboards, and a good interface for non-technical users. Looker is a very strong product, but usually costs much more, and since the Google acquisition it's unclear how well it will support systems outside the Google ecosystem. Amazon QuickSight has an incredibly low price, but its features are more limited and it needs a lot of technical setup.

Metabase can also produce CSV and Excel exports for data scientists and analysts, and data can be exported directly from the reporting marts for AI and ML systems, so the same clean, integrated data can be used across an organisation and sent to other systems.

| BI tool | Positives | Deal-breaker for us |
| --- | --- | --- |
| Metabase (chosen) | Open source, good for analysts and non-technical users | None |
| Amazon QuickSight | Very low cost | Very limited customisation and no self-serve for non-technical users |
| Looker | Sophisticated modelling and query interface, good for analysts and non-technical users | Not open source |

Conclusion

The right data platform gives the whole company clarity on what's happening across marketing, product and finance combined. It brings a level of understanding that otherwise takes hours of tedious, manual, error prone reporting work.

As the tools and designs mature, these platforms are becoming easier and cheaper to set up and operate. With Scale Direction, I'm drawing on a decade's experience across countless startups, turning that into an accessible, low-cost option.

If your analyst spends Fridays stitching CSVs together, this platform can give you that day back without spending £80k and three months hiring a data engineer. Let's jump on a call and see how we can get a data platform working for you today.

Let's fix your metrics headaches

Book a call