The Best Ways to Build Data Pipelines, According to the Experts

Written by Adam Calica
Published on Mar. 25, 2020

Before SpotHero was founded in 2011, finding a good parking spot meant crossing fingers and circling the parking garage. Today, SpotHero operates in parking garages (more than 1,000 in Chicago alone), airports and stadiums nationwide.

And those thousands of parking spots mean one thing for Director of Data Science Long Hei: terabytes of data.

With a $50 million Series D funding secured in August 2019, SpotHero began expanding its digital platform and deepening its technology stack to optimize parking throughout North America. The company also invested in hiring new talent and adding features to its single protocol software. But Hei still had to interpret the data and scale for SpotHero’s future. 

“To facilitate this,” Hei explained, “we have moved our raw data out of Redshift into S3, which allows us to scale the amount of data almost infinitely.”

We asked Hei and Mastery Logistics’ Lead Machine Learning Engineer Jessie Daubner about which tools and technologies they use to build data pipelines and what steps they’re taking to ensure those data pipelines continue to scale with the business. Because whether these companies are making parking more seamless or reimagining freight technology, as Mastery Logistics is, one thing is certain: data is king.

Data Pipeline Architecture

Data pipeline architecture is the system that captures, organizes and then sorts data for actionable insights. It's the system that takes billions of raw data points and turns them into real, readable analysis. Companies must ensure that their data pipeline architecture is clean and organized at all times to get the most out of their datasets.

 


Mastery Logistics Systems

Mastery Logistics Systems helps freight companies reduce waste by arming them with software that makes it more efficient to move goods from one place to another. Lead Machine Learning Engineer Jessie Daubner explained how her team harnesses Snowflake to deliver insights to their customers faster.

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

Since we’re an early-stage startup with a small team, we had a greenfield opportunity to evaluate the latest tools to build a modern data stack. As a result, we’ve built our analytics layer and initial data pipelines in Snowflake using an ELT pattern. 

This has enabled us to deliver insights to our customers faster by using Snowflake to directly consume data from Kafka topics and empower our data science team to focus on delivering insights rather than the DBA work typical of building new data infrastructure. 

We also use Fivetran, a managed data ingestion service, to sync data from our SaaS application and other third-party data sources like Salesforce, so that new transaction data is available for analysis across the organization with as little as a 10-minute delay. Lastly, our team uses dbt (data build tool) to transform data for analysis and visualization, which has enabled our team of engineers and analysts to bring software engineering best practices into our analytics workflow.
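For readers unfamiliar with the ELT pattern Daubner describes, the sketch below shows the general idea in Python: raw data lands in the warehouse first and is transformed there with SQL. It uses the snowflake-connector-python library, and the credentials, tables and columns are invented for illustration; Mastery's actual pipeline relies on the Snowflake Kafka connector, Fivetran and dbt rather than a hand-rolled script like this.

```python
# A minimal, hypothetical sketch of the ELT pattern: raw data lands in Snowflake
# first and is transformed in-warehouse with SQL. Credentials, table and column
# names are placeholders; the real pipeline uses the Snowflake Kafka connector,
# Fivetran and dbt rather than a hand-rolled script like this.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYTICS_USER",        # placeholder credentials
    password="********",
    account="my_account",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="STAGING",
)

# The "T" of ELT happens inside the warehouse: aggregate raw events that have
# already landed in a staging table into an analytics-ready table.
TRANSFORM_SQL = """
CREATE OR REPLACE TABLE ANALYTICS.MARTS.SHIPMENT_DAILY AS
SELECT
    DATE_TRUNC('DAY', EVENT_TS) AS EVENT_DAY,
    COUNT(*)                    AS SHIPMENT_EVENTS
FROM ANALYTICS.STAGING.RAW_SHIPMENT_EVENTS
GROUP BY 1
"""

cur = conn.cursor()
try:
    cur.execute(TRANSFORM_SQL)
finally:
    cur.close()
    conn.close()
```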

"Since we’re a small team, we had the opportunity to evaluate the latest tools to build a modern data stack.’’ 

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

We’re confident in Snowflake’s ability to scale as the number and size of customers using our transportation management system (TMS) grows. However, with a near real-time service level agreement of 10 minutes or less required for some data sources and machine-learning-driven services, we expect to outgrow Fivetran.

Thankfully, our architecture team has implemented messaging and stream processing using Kafka, AVRO and Confluent Schema Registry, so we already have a single asynchronous messaging protocol in place to meet our SLAs as our data volume increases.

As a Python-oriented team, we’ve also committed to using Faust, an open-source Python library with functionality similar to Kafka Streams, as our default stream processing framework; it provides AVRO codec and schema registry support.
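Faust is an open-source library, and a minimal worker looks roughly like the sketch below. The topic, record fields and table are hypothetical, and the JSON serializer stands in for the Avro codec and Confluent Schema Registry integration that the production setup Daubner describes would use.

```python
# An illustrative Faust worker (not Mastery's actual code): consume load events
# from a Kafka topic and keep a running count per status. The topic, fields and
# table are hypothetical, and the JSON serializer stands in for the Avro codec
# and Confluent Schema Registry integration a production setup would use.
import faust


class LoadEvent(faust.Record, serializer="json"):
    load_id: str
    status: str


app = faust.App("load-event-processor", broker="kafka://localhost:9092")

load_events = app.topic("load-events", value_type=LoadEvent)
status_counts = app.Table("status-counts", default=int)


@app.agent(load_events)
async def process(events):
    # Each event increments the count for its status, e.g. "booked" or "delivered".
    async for event in events:
        status_counts[event.status] += 1


if __name__ == "__main__":
    app.main()  # start a worker with: python this_file.py worker -l info
```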

 


 


SpotHero

Facilitating parking reservations nationwide through an app requires processing large amounts of data. Director of Data Science Long Hei explains why he uses Apache Airflow to build the data pipeline at SpotHero.

What technologies or tools are you currently using to build your data pipeline, and why did you choose them?

We mainly use Apache Airflow to build our data pipeline. It’s an open-source solution and has a great and active community. It comes with a number of supported operators that we utilize heavily, such as the Redshift and Postgres operators.

At SpotHero, we have extended our pipeline with our PipeGen functionality, which takes YAML files and generates DAGs that serve as internal self-serve ETLs.
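PipeGen itself is internal to SpotHero, but the sketch below illustrates the general YAML-to-DAG pattern in Airflow that Hei describes; the YAML layout, DAG ID, connection ID and SQL are invented for illustration.

```python
# An illustrative sketch of the YAML-to-DAG idea behind PipeGen, which is internal
# to SpotHero; the YAML layout, DAG ID, connection ID and SQL here are invented.
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

PIPELINE_YAML = """
dag_id: demand_daily_rollup
schedule: "0 6 * * *"
tasks:
  - task_id: build_daily_rollup
    sql: >
      INSERT INTO analytics.demand_daily
      SELECT booking_date, COUNT(*) FROM raw.bookings GROUP BY booking_date
"""

spec = yaml.safe_load(PIPELINE_YAML)

# One DAG per YAML spec; in a real setup the spec would come from a file checked
# into the repo rather than an inline string.
dag = DAG(
    dag_id=spec["dag_id"],
    schedule_interval=spec["schedule"],
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# Generate one task per entry in the YAML spec.
for task in spec["tasks"]:
    PostgresOperator(
        task_id=task["task_id"],
        postgres_conn_id="redshift_default",  # Redshift speaks the Postgres protocol
        sql=task["sql"],
        dag=dag,
    )
```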

"As the business scales, the volume of our data also scales tremendously.’’

 

What steps are you taking to ensure your data pipeline continues to scale with the business?

As a culture, we always encourage and help other business teams build their own ETL processes using Airflow and PipeGen. As the business scales, the volume of our data also scales tremendously. There are more people in the business who need access to the data. As a result, we have outgrown Redshift as a catch-all for our data. 

To facilitate scale, we have moved our raw data out of Redshift and into S3, which allows us to scale the amount of data almost infinitely. We can query and explore that data through Presto. 

Once a data set is locked down, the ETL-ready insights can be graduated into Redshift via PipeGen. This has significantly helped the rapid scaling of the data and the data pipeline.
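As a rough illustration of the query layer Hei mentions, the sketch below uses the PyHive client to run a Presto query over raw data that lives in S3 behind external tables; the coordinator host, catalog, schema and table names are placeholders, not SpotHero's actual configuration.

```python
# A rough sketch of querying raw data in S3 through Presto with the PyHive client;
# the coordinator host, catalog, schema and table names are placeholders.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.internal",  # hypothetical Presto coordinator
    port=8080,
    catalog="hive",   # Hive catalog whose external tables point at S3
    schema="raw",
)

cursor = conn.cursor()
cursor.execute(
    "SELECT facility_id, COUNT(*) AS reservations "
    "FROM reservations "
    "GROUP BY facility_id "
    "ORDER BY reservations DESC "
    "LIMIT 10"
)

for facility_id, reservations in cursor.fetchall():
    print(facility_id, reservations)
```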

Top Tools For Building A Data Pipeline

  • Snowflake
  • Apache Airflow
  • Stitch
  • PySpark
  • dbt
  • BigQuery
  • PostgreSQL
  • Python

 


Harry's

Before Harry’s was founded in 2013, most men purchased shaving razors at their local convenience stores. By adopting a direct-to-consumer business model, Harry’s lets consumers buy its German-engineered razors at affordable prices.

But the company doesn’t operate on sharp edges alone. With a $112 million Series D funding round in December 2017, the e-commerce company expanded its offerings into skin care and shaving cream. Still, each new razor that goes down the factory line in Germany translates to only one thing for Head of Analytics Pooja Modi: high volumes of data.

To address scaling needs, Modi said, “We’re spending a lot of energy on data validation and measuring data quality at every single step, with robust monitoring and alerting on the health of our data. We are also focused on data testing and documentation, enabling us to better communicate context and expectations across team members.”

Ensuring that the data pipeline continues to scale with the business means starting with the right tools, whether that means turning to trusted programming languages like Python or harnessing new technologies like Snowflake. Read on to hear how Modi and Cherre Senior Data Scientist John Maiden process data with cutting-edge tools.

 

Pooja Modi

Head of Analytics

As Harry’s e-commerce business expands from men’s razors to encompass shaving products and skincare, Head of Analytics Pooja Modi said, “Scalability is definitely top of mind.” To ensure Harry’s data pipeline can scale to support higher volumes of data, her team is focused on measuring data quality at every step.

"We are focused on data testing and documentation.’’

 

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

This question is perfectly timed as we are in the middle of piloting several different tools while getting smarter on our decision criteria. Historically, we have relied on Redshift and Looker, enriched with a set of in-house capabilities to support data ingestion and orchestration (e.g. our open-source tool Arthur).  

We’re now in the midst of piloting several new technologies (e.g. Snowflake, Stitch and dbt), broadly optimizing for usability, reliability, cost and feature richness. We like to work with technologies that come with a high level of customer support from the vendors and user communities. We also appreciate tools that provide turnkey integrations, which lets us refocus our bandwidth on solving complexities specific to the business.

 

As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

Scalability is definitely top of mind for Harry’s as we are in the middle of rapidly spinning up new brands and scaling to several more retailers this year. We need to ensure that our data pipeline can scale to support a higher volume and variety of data. Secondly, the data pipeline needs to continue to be manageable for the current and future team. To address these needs, we’re spending a lot of energy on data validation and measuring data quality at every single step, with robust monitoring and alerting on the health of our data. We are also focused on data testing and documentation, enabling us to better communicate context and expectations across team members.
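As a loose illustration of the kind of validation Modi describes (and not Harry's actual tooling), the sketch below checks a freshly loaded extract against a few simple expectations and logs a warning when something drifts; the thresholds, table and column names are hypothetical.

```python
# An illustrative data-quality check in the spirit described above (not Harry's
# actual tooling): validate a freshly loaded extract against a few expectations
# and log a warning when something drifts. Thresholds, table and column names
# are hypothetical; in production the warnings would feed monitoring and alerting.
import logging
from typing import List

import pandas as pd

logger = logging.getLogger("data_quality")


def check_orders(df: pd.DataFrame) -> List[str]:
    """Return a list of human-readable data-quality failures."""
    failures = []
    if df.empty:
        failures.append("orders extract is empty")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values found")
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0.01:
        failures.append(f"customer_id null rate {null_rate:.2%} exceeds 1%")
    return failures


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    orders = pd.DataFrame({"order_id": [1, 2, 2], "customer_id": [10, None, 12]})
    for problem in check_orders(orders):
        logger.warning(problem)
```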

 


Cherre

Cherre provides investors, insurers, real estate advisors and other large enterprises with a platform to collect, resolve and augment real estate data from hundreds of thousands of public, private and internal sources. Senior Data Scientist John Maiden explains why he relies on standard Google Cloud tools to manage that ever-changing data.

"Python is a mature language with great library support for ML and AI applications.’’

 

John Maiden

Senior Data Scientist

What technologies or tools are you currently using to build your data pipeline, and why did you choose those technologies specifically?

We're a Google Cloud shop, so we use many of the standard ETL tools like Airflow, BigQuery and PostgreSQL to get data ready for analysis. Once the data is ready, we use Python and the usual suspects such as Pandas and scikit-learn for small data sets, and Spark (PySpark) when we need to scale. Python is a mature language with great library support for ML and AI applications, and PySpark allows us to extend that functionality to large data sets. Spark GraphFrames is a critical technology for us when it comes to graph processing since we're handling hundreds of millions of rows.
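GraphFrames runs graph algorithms directly on Spark DataFrames, which is what makes it practical at the row counts Maiden mentions. The sketch below is illustrative only; the records, edges and entity-resolution framing are invented, not Cherre's actual pipeline.

```python
# A minimal PySpark + GraphFrames sketch of graph processing on DataFrames
# (illustrative only; the records, edges and entity-resolution framing are
# invented). Launch Spark with the graphframes package, e.g.:
#   spark-submit --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 app.py
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()
# connectedComponents() requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

# Hypothetical property records and "same address" links between them.
vertices = spark.createDataFrame(
    [("a", "123 Main St"), ("b", "123 Main Street"), ("c", "9 Oak Ave")],
    ["id", "address"],
)
edges = spark.createDataFrame(
    [("a", "b", "matched_address")],
    ["src", "dst", "relationship"],
)

graph = GraphFrame(vertices, edges)

# Connected components group records that resolve to the same real-world entity.
components = graph.connectedComponents()
components.select("id", "component").show()
```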


As your company — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

There are two parts to making our work scale. We first build reusable pipelines and a growth-focused architecture that can develop alongside growing client demand. The second is cultivating a team with strong domain knowledge that understands the type and quality of data currently available, and what we need to add or improve to better support our products.

 

Responses have been edited for length and clarity. Images via listed companies.