Want to unleash chaos in data science?
Build unreliable workflows.
As Going Senior Data Scientist Nick Lowery explained, “Inevitably, the team will get so bogged down fighting fires and struggling to maintain existing systems, the ability to produce new work will go to zero.”
Luckily, there are plenty of tools and practices teams can use to avoid these setbacks and succeed in workflow development. Senior Manager of Enterprise Data Nikki Marinsek and her peers at Evidation have developed a solid process, which includes defining inputs and desired outputs and using dbt and GitHub Actions workflows to build and validate data each day.
According to her, following these steps isn’t just helpful for her team; it’s essential for the success of the business.
“Data is critical both internally at Evidation and for our clients,” Marinsek said.
There are many reasons why teams need strong data workflows. For CompanyCam Data Scientist Wyatt McLeod, one of the biggest ones is knowing when something is wrong with data quality. He believes that, when data scientists know more about the data itself, they’ll have a better understanding of where to apply it.
“In data-oriented professions, more so than many others, knowledge is power,” McLeod said.
Below, Lowery, Marinsek and McLeod share more about how — and why — their teams build successful data workflows and the advice they’d give to others in the field who wish to improve their approach to workflow development.
CompanyCam’s platform enables commercial and home-services contractors to visually document progress in real time, organize project information, collaborate with crew members and manage marketing and sales efforts.
Describe your team’s process for building successful data workflows. What tools and best practices does your team rely on?
Building a successful data workflow starts with understanding who will need the data and how they will use it. There’s little use in data if practical application is difficult. Truly understanding the questions that stakeholders will want to answer is key to developing a good data workflow.
Beyond this, I truly believe that having a clear style guide with peer review for any changes to production data is necessary for a successful data workflow. In the same vein as keeping data practical, getting a second set of eyes on anything you create is a good way to answer the question, “Is this genuinely useful?”
Lastly, document everything, and in multiple places. Use schema files, inline code comments and Notion documents; if something exists in your workflow, people will eventually need to understand how it works. No one wants to try to understand your 500-line SQL query or machine learning model by reading raw code or playing “guess-and-check.” We specifically use dbt schema files to document every model column with tests and comments.
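As a rough illustration of how that rule can be enforced, here is a minimal sketch, not CompanyCam’s actual tooling: a pytest check that every column declared in a dbt schema file has a description and at least one test. It assumes a conventional dbt layout with schema files under models/ and uses PyYAML.

```python
# Minimal sketch (not CompanyCam's tooling): fail CI if any documented dbt model
# has a column without a description or without at least one test.
import pathlib

import pytest
import yaml  # PyYAML

SCHEMA_FILES = sorted(pathlib.Path("models").rglob("*.yml"))


@pytest.mark.parametrize("schema_path", SCHEMA_FILES, ids=str)
def test_every_column_is_documented_and_tested(schema_path):
    schema = yaml.safe_load(schema_path.read_text()) or {}
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            label = f"{model['name']}.{column['name']}"
            assert column.get("description"), f"{label} is missing a description"
            assert column.get("tests") or column.get("data_tests"), (
                f"{label} has no tests defined"
            )
```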
Why is it important for your team to create reliable data workflows, and how does it bolster your team’s ability to accomplish tasks?
I cannot stress enough how important it is to have a reliable data workflow. Having clean, accurate data on a timely basis not only removes barriers to finishing tasks, but also increases faith in the analytics that a data team can provide. A reliable data workflow also makes it very obvious when something is wrong with data quality. In a well-oiled machine, when something breaks, it’s obvious which piece it was. I believe that this is one of the biggest benefits of having a reliable data workflow.
“Having clean, accurate data on a timely basis not only removes barriers to finishing tasks, but also increases faith in the analytics that a data team can provide.”
What advice would you give to other data scientists interested in improving their approach to building data workflows?
The main lesson that I can share is to ask everyone in your company as many questions as you can about everything. In data-oriented professions, more so than many others, knowledge is power. When you understand your company, you know what data is needed and where it will need to go.
Individuals use Evidation’s platform to track activities, such as walking and sleeping, and gain insights from this information, while organizations leverage this data for healthcare research purposes.
Describe your team’s process for building successful data workflows. What tools and best practices does your team rely on?
First, we define the inputs and desired outputs of our workflow. We gather schema specifications, explore the data or consult with internal teams to understand data inputs. Determining the desired outputs often involves meeting with clients or internal stakeholders to understand their goals and constraints and then working backward to determine what data outputs are needed to build the right data exports, dashboards and analyses.
Once we establish the inputs and outputs, we define intermediate data transformations. The team has built internal tooling that creates data models at each stage of processing given the schema of source data, a modeling specification that defines the transformations to be applied at each stage, and a library of Jinja templates and dbt macros. We use dbt and GitHub Actions workflows to build and validate the data each day.
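Evidation’s internal tooling is not public, but the general pattern can be sketched: a small modeling specification plus a Jinja template render a staging model, which a daily dbt build (triggered by GitHub Actions) would then materialize and test. The table and column names below are hypothetical.

```python
# Illustrative sketch: render staging SQL from a spec and a Jinja template.
from jinja2 import Template

SPEC = {
    "source_table": "raw.daily_activity",
    "columns": {
        "participant_id": "participant_id",
        "step_count": "cast(step_count as integer)",
        "recorded_at": "convert_timezone('UTC', recorded_at)",
    },
}

STAGING_TEMPLATE = Template(
    """\
select
{%- for alias, expression in columns.items() %}
    {{ expression }} as {{ alias }}{{ "," if not loop.last }}
{%- endfor %}
from {{ source_table }}
"""
)

if __name__ == "__main__":
    # Prints the SQL for a stg_daily_activity-style model.
    print(STAGING_TEMPLATE.render(**SPEC))
```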
Our success hinges on robust monitoring and validation across our workflows. For example, we validate the schema of incoming data to catch schema drift, we receive alerts about model build failures through Slack, we use Anomalo to detect data anomalies and we use dbt tests to validate our data export schema against our data contract.
“Our success hinges on robust monitoring and validation across our workflows.”
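A stripped-down sketch of the first two checks described above, not Evidation’s code: compare an incoming file’s columns against an expected schema and post to a Slack incoming webhook when something drifts. The column names and the SLACK_WEBHOOK_URL environment variable are assumptions.

```python
# Sketch: detect schema drift in an incoming file and alert via Slack webhook.
import os

import pandas as pd
import requests

EXPECTED_COLUMNS = {"participant_id", "device_id", "step_count", "recorded_at"}
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical incoming webhook


def check_schema_drift(path: str) -> None:
    incoming = set(pd.read_csv(path, nrows=0).columns)  # read the header only
    missing = EXPECTED_COLUMNS - incoming
    unexpected = incoming - EXPECTED_COLUMNS
    if missing or unexpected:
        message = (
            f"Schema drift detected in {path}: "
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        raise ValueError(message)


if __name__ == "__main__":
    check_schema_drift("incoming/daily_activity.csv")
```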
Why is it important for your team to create reliable data workflows, and how does it bolster your team’s ability to accomplish tasks?
Data is critical both internally at Evidation and for our clients. Unreliable data workflows can create a myriad of downstream problems. In some of our more complex research studies, data workflows underlie key checkpoints of the participant journey. For example, eligibility may be gated on wearable data density or a range of lab test results. In these cases, a data workflow is set up to combine the necessary data from third-party sources, determine eligibility and trigger either the disqualification or advancement of the participant in the study. An issue in these workflows will increase wait times for participants and result in a poorer experience.
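To make the shape of such a checkpoint concrete, here is a hypothetical sketch; the thresholds, field names and return values are invented for illustration and are not Evidation’s actual eligibility criteria.

```python
# Hypothetical eligibility gate: wearable data density plus a lab value decide
# whether a participant advances or is disqualified.
from dataclasses import dataclass


@dataclass
class Participant:
    participant_id: str
    wearable_days_with_data: int  # wearable data density over the screening window
    lab_result: float             # a screening lab value from a third-party source


def determine_eligibility(participant: Participant) -> str:
    """Return the next step in the participant journey: advance or disqualify."""
    enough_wearable_data = participant.wearable_days_with_data >= 20
    lab_in_range = 0.5 <= participant.lab_result <= 1.5
    return "advance" if enough_wearable_data and lab_in_range else "disqualify"


if __name__ == "__main__":
    print(determine_eligibility(Participant("p-001", 24, 0.9)))  # advance
```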
Data workflows are also used to support monitoring systems and dashboards, which are used by both clients and internal Evidation teams to evaluate how a program is going and intervene if needed. For example, we may send out reminder emails to boost engagement or change an onboarding flow to reduce dropoff. Finally, the data collected in our studies and programs is used to answer important research questions. Robust data flows and monitoring systems ensure that the data we collect is accurate, complete and privacy-centric.
What advice would you give to other data scientists interested in improving their approach to building data workflows?
Based on many lessons learned along the way, I have three pieces of advice to offer. First, improve monitoring with a priori knowledge, that is, knowledge of a fact that an individual holds without any evidence from experience. Earlier on, we discovered that our high-level monitoring, which includes survey completion rate trends, wasn’t detecting critical edge-case issues. We pivoted our strategy and used the schedule of events in the study protocol to build expectations about what each participant should be doing and when, and then monitored for deviations. Doing so allowed us to uncover additional edge cases that were silently impacting participants’ experience. Pre-mortems, unit tests and schema validations are other ways we build strong expectations to test or monitor against.
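A toy version of that expectation-based monitoring, with an invented three-event schedule standing in for a real study protocol’s schedule of events: derive what each participant should have completed by a given date and flag the gap.

```python
# Sketch: build expectations from a (hypothetical) schedule of events and
# report expected-but-missing events per participant.
from datetime import date, timedelta

SCHEDULE_OF_EVENTS = {  # event name -> days after enrollment it is due
    "baseline_survey": 0,
    "lab_kit_return": 7,
    "followup_survey": 14,
}


def expected_events(enrolled_on: date, as_of: date) -> set[str]:
    """Everything a participant should have completed by the as_of date."""
    return {
        event
        for event, offset_days in SCHEDULE_OF_EVENTS.items()
        if enrolled_on + timedelta(days=offset_days) <= as_of
    }


def deviations(enrolled_on: date, completed: set[str], as_of: date) -> set[str]:
    """Expected-but-missing events, i.e., what monitoring should flag."""
    return expected_events(enrolled_on, as_of) - completed


if __name__ == "__main__":
    print(deviations(date(2024, 1, 1), {"baseline_survey"}, date(2024, 1, 10)))
    # {'lab_kit_return'}
```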
Second, move failure points upstream. We improved the reliability of our pipelines by moving data validations upstream, from exports to data builds, and enhanced the quality of our pipelines by also moving personally identifiable information definitions upstream, into the data builds and the backend.
Lastly, continually optimize. We routinely assess unreliable or costly points in our workflows and work to improve them. We’ve done this by reducing build frequency, optimizing partitioning for querying, and standardizing on and taking advantage of Snowflake’s incremental and dynamic tables.
Going’s app connects travelers with affordable international and U.S. flight deals.
Describe your team’s process for building successful data workflows. What tools does your team rely on?
For us, a successful workflow is reliable and interoperable. Reliability centers around automated testing. For analytics workflows, we leverage the unit testing functionality in dbt; for data science and engineering workflows, we encapsulate business logic and transformation in R and Python libraries, which we test using well-known frameworks, such as “testthat” and “pytest.”
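As a small illustration of that pattern, business logic lives in a plain library function and pytest pins down its behavior, including the failure mode. The function below is hypothetical, not code from Going’s libraries.

```python
# Sketch: encapsulated business logic plus pytest tests that document it.
import pytest


def fare_per_mile(fare_usd: float, distance_miles: float) -> float:
    """Return price per mile, guarding against degenerate inputs."""
    if distance_miles <= 0:
        raise ValueError("distance_miles must be positive")
    return round(fare_usd / distance_miles, 4)


def test_fare_per_mile():
    assert fare_per_mile(350.0, 3500.0) == 0.1


def test_fare_per_mile_rejects_zero_distance():
    with pytest.raises(ValueError):
        fare_per_mile(350.0, 0.0)
```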
We encourage developers to use a test-driven development approach when writing code, and we automate testing with GitHub as part of our peer review process. We also use Elementary Data to reactively monitor data quality; it’s most commonly used to test cases not easily captured in unit tests or to identify infrastructure or orchestration issues.
To facilitate interoperability, Iceberg serves as our universal data storage format. We chose Iceberg because it supports our analytics and ML use cases at scale and is widely supported across enterprise and open-source tools. This allows datasets created in any platform to be consumed by any other in our ecosystem, helps prevent vendor lock-in by relying on open formats and facilitates experimentation with any tool or technology that supports the format.
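For a sense of what that interoperability looks like from the Python side, here is a hedged sketch using pyiceberg. The catalog name (“default”) and table identifier (“deals.flight_prices”) are assumptions; the same table could just as well be read or written from Spark, Trino or Snowflake.

```python
# Sketch: one small consumer reading a (hypothetical) Iceberg table via pyiceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # connection details come from .pyiceberg.yaml
table = catalog.load_table("deals.flight_prices")

# Push a filter and column selection down to the table scan, then hand off to pandas.
df = table.scan(
    row_filter="departure_date >= '2025-01-01'",
    selected_fields=("origin", "destination", "price_usd"),
).to_pandas()
print(df.head())
```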
Why is it important for your team to create reliable data workflows, and how does it bolster your team’s ability to accomplish tasks?
Unreliable workflows breed chaos. Inevitably, the team will get so bogged down fighting fires and struggling to maintain existing systems, the ability to produce new work will go to zero. By prioritizing reliability, we’re investing in our team’s ability to consistently deliver value long-term.
While testing is our most visible reliability practice, checking that tests pass before deployment is a relatively minor contribution to our overall approach. The real value of our tests is as a communication tool. Well-constructed and documented tests enable our team to step into an unfamiliar codebase and immediately get a decent understanding of the intent and expectations. We achieve reliability by leveraging expertise from the entire team for every project, rather than encouraging many single points of failure by keeping project ownership narrow. Taking a test-driven approach also helps us keep our iteration cycles quick, which is the essential core of consistently delivering business value.
“By prioritizing reliability, we’re investing in our team’s ability to consistently deliver value long-term.”
What advice would you give to other data scientists interested in improving their approach to building data workflows?
Test everything — your code, your queries and your data. Every requirement should manifest as a test. Turn a comment explaining something into a test. Making all of these checks explicit enables better understanding of your codebase, reduces mental overhead and gives team members more confidence making changes. Write your tests first. This will keep you on track to build the things you need and not the things you don’t need.
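A tiny, hypothetical example of turning a comment into a test: instead of writing “a zero fare means the deal was booked entirely with points” as a comment, the expectation becomes an executable check.

```python
# Sketch: the requirement lives as tests rather than as a comment.
def is_booked_with_points(fare_usd: float) -> bool:
    return fare_usd == 0


def test_zero_fare_means_booked_with_points():
    assert is_booked_with_points(0)


def test_positive_fare_is_a_cash_booking():
    assert not is_booked_with_points(129.0)
```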
Take time to reflect on bottlenecks in your workflows as both an individual and a team, and work on ways to speed them up. For example, tests involving data can be slow, especially when you try to query data sources as part of a test. Instead, capture sample data as a test fixture and use that for development. We’ll often combine production data samples with synthetic data that captures specific intricacies in test fixtures.
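A sketch of that fixture pattern with pandas and pytest, assuming an anonymized production sample checked in at tests/fixtures/deals_sample.csv; the synthetic rows add edge cases the sample rarely contains.

```python
# Sketch: combine a checked-in production sample with synthetic edge cases so
# tests never have to query live data sources.
import pandas as pd
import pytest


@pytest.fixture
def deals_fixture() -> pd.DataFrame:
    sample = pd.read_csv("tests/fixtures/deals_sample.csv")  # anonymized sample
    synthetic = pd.DataFrame(
        [
            {"origin": "JFK", "destination": "JFK", "price_usd": 0.0},   # degenerate route
            {"origin": "LAX", "destination": "NRT", "price_usd": -1.0},  # bad upstream value
        ]
    )
    return pd.concat([sample, synthetic], ignore_index=True)


def test_cleaning_drops_negative_prices(deals_fixture):
    cleaned = deals_fixture[deals_fixture["price_usd"] >= 0]
    assert len(cleaned) == len(deals_fixture) - 1  # only the synthetic bad row is removed
```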
Look for opportunities to create shared resources for your team. Miles McBain has a good talk on this called “Really Useful Engines.” The idea is that creating a layer of shared libraries reflecting core team capabilities will improve productivity by reducing duplicated efforts, encourage testing and documenting code, and build consistency into your outputs.