How to Build a Successful Data Science Workflow

Written by Adam Calica
Published on May 20, 2020

In a less mature industry like data science, there aren’t always textbook answers to problems. When undertaking new data science projects, data scientists must consider the specificities of the project, past experiences and personal preferences when setting up the source data, modeling, monitoring, reporting and more.

While there’s no one-size-fits-all method for data science workflows, there are some best practices, like taking the time to set up auto-documentation processes and always conducting post-mortems after projects are completed to find areas ripe for improvement.

Stefon Kern, manager of data science at The Marketing Store, said he focuses on evaluating the utility and business value of what his team has built.

“More important than any specific tool or technology is proactively, consciously adopting a mindset that’s focused on continual evaluation and optimization,” Kern said.

 Data Science Coach Ben Oren of Flatiron School agreed on the importance of assessing each project after the fact. Improving data science workflows often occurs at the consolidation step, so after every project, he documents what was done, considers where the problems and inefficiencies crept in, and imagines ways to improve processes for the future. 

Data scientists tweak their data science workflows to align with what works best for their teams and businesses. What other best practices are they using to optimize their data workflows? Code reviews, collaboration between data scientists and data engineering teams and agile environments are just a few examples.

Data Science Workflow: Tips for Building a Successful Workflow

  • Perform fundamental data preparation and profiling
  • Implement an auto-documentation process
  • Perform constant code and design reviews
  • Build up libraries of common tasks
  • Prioritize unit testing

The Marketing Store

Stefon Kern

MANAGER, DATA SCIENCE

When data science projects are finished, the post-mortem phase begins, Kern said. This assessment period allows the team to identify areas for improvement rather than putting completed work aside. As potential problems arise in the future, Kern’s team will save time by already being aware of its weak spots.

 

Tell us a bit about your technical process for building data science workflows.

Our data science projects begin with a dedicated R&D phase. Most likely, whatever you’re building has aspects that have already been developed by others in our field. It’s important to be aware of the latest and greatest methodologies, tools and resources out there. Then, you can strategically decide whether you need to reinvent the wheel or not. 

By the end of this initial R&D phase, we have our objectives locked down and we have identified the approach and resources needed to achieve our end goal, including staffing, tech resources and data requirements. Next, we begin building a “minimally working” version of the product. For this stage, we use real data, following development best practices and building a viable workflow. 

Once we’re satisfied with the initial build, we enter a phase dedicated to scaling, testing and optimizing. While we favor Python and Apache Spark when it comes to enterprise-wide, large-scale data science workflows, we have an agnostic, needs-based approach to technology, and often leverage many of the specialized statistical packages developed by the research community.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

I strongly recommend taking the time to set up an auto-documentation process: Python’s Sphinx package, for instance, will automatically extract information from docstrings to create a perfectly formatted, easy-to-navigate HTML-based documentation page. 
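
For teams setting this up for the first time, a minimal sketch of what that looks like follows; the module, function and docstring below are hypothetical examples, and the conf.py fragment assumes Sphinx’s autodoc and napoleon extensions are installed.

    # features.py -- a hypothetical module whose docstrings Sphinx can extract
    def winsorize(series, lower=0.01, upper=0.99):
        """Clip a numeric series to the given quantile bounds.

        Args:
            series (pandas.Series): Raw numeric column.
            lower (float): Lower quantile used as the floor.
            upper (float): Upper quantile used as the ceiling.

        Returns:
            pandas.Series: The clipped series.
        """
        lo, hi = series.quantile(lower), series.quantile(upper)
        return series.clip(lo, hi)

    # docs/conf.py -- the relevant Sphinx settings (conf.py is itself Python)
    extensions = [
        "sphinx.ext.autodoc",   # pull documentation out of docstrings
        "sphinx.ext.napoleon",  # understand Google/NumPy docstring styles
    ]
    # An .rst page then only needs:
    #     .. automodule:: features
    #        :members: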

Other important best practices include setting up self-contained virtual environments, appropriate project structure, modularized code, unit tests and quality-control checks. Moreover, once you’ve developed your own best practices and process guardrails, it’s critical to provide your team with training and periodic reminders to ensure that the best practices truly become your team’s practices. 

 

I strongly recommend taking the time to set up an auto-documentation process.

 

What advice do you have for other data scientists looking to improve how they build their workflows?

More important than any specific tool or technology is proactively, consciously adopting a mindset that’s focused on continual evaluation and optimization. When a project is complete, there can be a tendency to set it aside and not look at it again until a problem arises. To avoid that trap, I recommend conducting post-mortems and assessments, with a focus on evaluating the utility and business value of what you’ve built, and identifying things that can be improved.


 

Integral Ad Science

Rene Haase

VICE PRESIDENT OF DATA ENGINEERING

When tackling a new project, Vice President of Data Engineering Rene Haase integrates the data science and data engineering pods in order to get as many perspectives as possible. Together, they educate each other and brainstorm on how to overcome development and deployment challenges with their machine learning models. At Integral Ad Science, this collaboration starts at recruiting, where members from both teams interview potential candidates.

 

Tell us a bit about your technical process for building data science workflows. 

In order to understand IAS’s data science platform setup, it’s important to understand the scale we are operating at: Our systems are required to process about 4 billion events per hour. Those volume requirements influence the type of tools IAS leverages when developing data science workflows. 

For developing models, we leverage H2O Driverless AI quite extensively and leverage Spark ML for training data pipelines. We are also experimenting with TensorFlow and are building prototypes with Neo4j for GraphDB. The latter we are using to build a fake news machine learning model. We use Airflow for orchestration, Hive and Impala for data exploration, and Jupyter and Zeppelin notebooks for analytics. 
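
As a rough illustration of the orchestration layer, a stripped-down Airflow DAG for a daily training pipeline might look like the sketch below. The DAG name and task functions are placeholders rather than IAS’s pipeline, and the imports assume Airflow 2.x.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_training_data(**context):
        # Placeholder: pull the latest labeled events from the warehouse.
        pass


    def train_model(**context):
        # Placeholder: fit the model and persist the artifact.
        pass


    def evaluate_model(**context):
        # Placeholder: compute evaluation metrics and push them to monitoring.
        pass


    with DAG(
        dag_id="train_example_model",        # hypothetical pipeline name
        start_date=datetime(2020, 5, 1),
        schedule_interval="@daily",
        default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_training_data)
        train = PythonOperator(task_id="train", python_callable=train_model)
        evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

        extract >> train >> evaluate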

In our latest project, we’re building streaming data pipelines with Apache Flink and are experimenting with building H2O Driverless AI RESTful APIs, which our streaming pipelines will leverage. This creates a great challenge to build not just accurate, but also operationally efficient models in order to support our hourly throughput requirements. We are partnering with experts in the area of streaming and with AWS, which represent great learning opportunities for our teams. 
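
The details of those model-serving APIs aren’t public, but conceptually the scoring call from a streaming job reduces to a low-latency HTTP request per event or micro-batch. A hedged Python sketch, with a made-up endpoint and response schema:

    import requests

    SCORING_URL = "http://model-service.internal/score"   # hypothetical endpoint


    def score_event(event: dict, timeout: float = 0.05) -> float:
        """Send one event to the model service and return its score.

        The tight timeout reflects the per-event latency budget a
        high-throughput streaming pipeline has to respect.
        """
        response = requests.post(SCORING_URL, json=event, timeout=timeout)
        response.raise_for_status()
        return response.json()["score"]        # assumed response schema


    if __name__ == "__main__":
        print(score_event({"impression_id": "abc123", "features": [0.2, 1.7, 0.0]}))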

 

It’s important to build a strong data science community in your company.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

We are not just handling large amounts of data; we are also maintaining several hundred machine learning models. Managing that many models requires automating the data science SDLC. As such, we persist models in ModelDBs, maintain feature development code in GitHub, run automatic monitoring, and create plenty of dashboards describing the health of the models and systems.

We’ve also found it highly beneficial to have a designated data engineering group support the data science organization. The members of that group are mostly aspiring data scientists who have a good understanding of data science concepts and are honing their skills in the production of data science models. Our data science engineering group partners closely with our data scientists, educating each other and brainstorming on how to overcome common challenges when developing and deploying machine learning models. At IAS, this collaboration starts at recruiting, where members from both teams interview potential candidates.

Lastly, it is very important to “know” your data, which means exporting model evaluation metrics, monitoring training data, developing ways to detect and respond to shifts in your data, and retaining training data for as long as possible.
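
Detecting shifts in your data can start with simple distributional checks between the data a model was trained on and the data it currently sees. One reasonable sketch (not necessarily IAS’s approach) uses a two-sample Kolmogorov-Smirnov test:

    import numpy as np
    from scipy.stats import ks_2samp


    def detect_drift(train_col: np.ndarray, live_col: np.ndarray, alpha: float = 0.01) -> bool:
        """Flag drift if the live distribution differs significantly from training.

        A two-sample Kolmogorov-Smirnov test is a simple, model-agnostic check;
        real pipelines usually combine several such tests per feature.
        """
        statistic, p_value = ks_2samp(train_col, live_col)
        return p_value < alpha


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        train = rng.normal(loc=0.0, scale=1.0, size=10_000)
        live = rng.normal(loc=0.4, scale=1.0, size=10_000)   # shifted on purpose
        print("drift detected:", detect_drift(train, live))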

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It’s important to build a strong data science community in your company. That means that members need to be comfortable receiving and providing constructive feedback. This will allow your data scientists to continue to grow and build cutting-edge solutions.

Software and data science models need to be maintained and enhanced wherever possible. We strive as a team to constantly evolve. Failing to maintain models for too long will mean that you fall behind and accumulate technical debt. A mature data science organization continually keeps up with new technologies and finds ways to improve both the software and the models.


 

Analytics8

Matt Levy

SENIOR CONSULTANT AND DATA SCIENCE PRACTICE LEAD

Not every client is ready for data science implementation, according to Matt Levy, senior consultant and data science practice lead at Analytics8. In order to produce real business value, Levy recommends spending more time understanding the business problems at hand and what you really want to predict before diving into a solution. When projects are data-science ready, clients’ needs and preferences dictate the technologies used on a case-by-case basis.

 

Tell us a bit about your technical process for building data science workflows. 

Analytics8 practices what we call “ethical data science.” We know that data science can result in quick failure without proper planning and preparation. We spend time upfront ensuring our clients are data-science ready, and that their projects will bring value. We help them understand the ramifications of machine learning-based decisions, and avoid bias when building models. 

We are not afraid to tell our customers that they are not ready for data science implementation if we can’t say with integrity that they have the right level of data maturity, or have identified a project that will bring business value.

While we use traditional and more modern tools to prepare, profile and model the data, our clients’ needs and preferences dictate the technologies we use on an individual basis. Our favorite implementations are those where machine learning models are deployed into customers’ existing BI reporting platform, so users gain richer analysis and more insight inside the tools they already know.

 

We are not afraid to tell our customers that they are not ready for data science implementation. 

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Everyone wants to implement the “buzzworthy” part of data science: namely, building and deploying machine learning and AI in their organization. However, not everyone realizes how critical data preparation and data profiling are to data science success. If your organization is not truly data-science ready, you are doing a huge disservice to outcomes and your bottom line. Invest time in organizing, cleansing and democratizing your data into “one true source” that provides accurate and reliable information.

Once this is done, make sure to go beyond basic exploratory data analysis and get a better understanding of what your data is telling you, and what features you might be able to unlock with some upfront discovery.
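
As a concrete, hedged example of profiling beyond the basics, a few lines of pandas cover types, missingness, cardinality and simple relationships; the file and column names below are invented.

    import pandas as pd

    df = pd.read_csv("customers.csv")          # hypothetical source file

    # Basic profile: types, missingness, cardinality and summary statistics.
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": df.isna().mean().round(3),
        "n_unique": df.nunique(),
    })
    print(profile)
    print(df.describe(include="all").T)

    # One step beyond the basics: inspect relationships that might become features.
    print(df.groupby("segment")["lifetime_value"].agg(["mean", "median", "count"]))
    print(df.select_dtypes("number").corr()["lifetime_value"].sort_values(ascending=False))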

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Spend more time understanding the business problems at hand and what you really want to predict before diving into a solution. 

Creating a perfect model for a question that might not need to be answered, or for which the information quality is low, won’t bring you the most business value and will likely result in failure. The more you emphasize solving true business problems, the better off you will be downstream in your workflow.


 

Nordstrom Trunk Club

Rossella Blatt Vital

DIRECTOR OF DATA SCIENCE AND ANALYTICS

Rossella Blatt Vital said her team at Nordstrom Trunk Club strives to build models that outperform each previous build. To accomplish this, she said, maintaining an agile and collaborative environment is crucial. Researching technical topics, code reviews and design reviews also help maximize the stability of the team’s models.

 

Tell us a bit about your technical process for building data science workflows.

It’s important to understand the business problem and frame it in a data science context. We work closely with business stakeholders and the data engineering team to identify, collect and create the data needed. This often requires the use of multiple tools depending on the nature of the data (e.g., SQL, Python, Amazon S3, HDFS cluster and Cloud). 

The next phase is data processing and exploratory data analysis (EDA). This is where we explore the available data to gain relevant insights and best approaches moving forward. The insights collected during this phase are then used for model building and tuning where we use a variety of ML frameworks and Python packages (Spark, Scikit-learn and TensorFlow). 

We like to start with simpler modeling approaches, add complexity in subsequent iterations, and evaluate the model’s performance at each step. We consider this evaluation paramount; we perform it throughout the workflow and double down on it once the final model has been selected. The result is an iteratively refined model we deploy into production and leverage for various products. 
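
A hedged sketch of that start-simple-then-add-complexity loop using scikit-learn, one of the frameworks mentioned above; the synthetic dataset stands in for real training data.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data; in practice this comes from SQL, S3 or HDFS as described above.
    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

    # Iteration 1: a simple, interpretable baseline.
    baseline = LogisticRegression(max_iter=1000)
    baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc").mean()

    # Iteration 2: more complexity, kept only if it clearly beats the baseline.
    challenger = GradientBoostingClassifier(random_state=0)
    challenger_score = cross_val_score(challenger, X, y, cv=5, scoring="roc_auc").mean()

    print(f"baseline AUC:   {baseline_score:.3f}")
    print(f"challenger AUC: {challenger_score:.3f}")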

 

It is crucial to maintain an agile and collaborative environment when looking at your workflow.

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

When building a data product, we strive to focus on building the right thing and building it right. We invest in cross-functional collaboration and embrace iterative data development. We believe that adopting a champion challenger approach during deployment, model development and production makes a huge difference. 
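
In production, a champion-challenger setup often means the current model keeps serving while a candidate scores a sample of the same traffic in the shadow for comparison. A hedged Python sketch of that routing-and-logging idea, with hypothetical model objects:

    import logging
    import random

    logger = logging.getLogger("champion_challenger")


    def predict(features, champion, challenger, shadow_rate=0.1):
        """Serve the champion's prediction; shadow-score a sample with the challenger.

        `champion` and `challenger` can be any objects exposing .predict();
        the challenger's predictions are only logged, never returned.
        """
        served = champion.predict([features])[0]
        if random.random() < shadow_rate:
            shadowed = challenger.predict([features])[0]
            logger.info("champion=%s challenger=%s features=%s", served, shadowed, features)
        return served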

We strive to build a better model that outperforms the previous model built. We start this process by researching these topics to expand our technical horizons. Code reviews and design reviews are done throughout the process to maximize the stability of our models. 

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It is crucial to maintain an agile and collaborative environment when looking at your workflow, and to replace hacky solutions with robust, reproducible ones every sprint. The key is to always stay curious, be open to learning, and be able to bounce ideas off cross-functional teams. Doing so boosts the quality and velocity of the data product workflow. 

Our advice to machine learning leaders is to establish strong communication within your teams and to make it clear that mistakes are opportunities to learn, which is what builds more experienced data scientists. 


 

Flatiron School

Ben Oren

DATA SCIENCE COACH

Every data science workflow begins with the repo at Flatiron School, Oren said, specifically using the Cookiecutter Data Science tool on GitHub. Cookiecutter generates directories tailored to any given project so all engineers can be on the same page. From there, Luigi helps with workflow management, and tools such as Tableau, along with open-source options like Plotly and Flask, help create visualizations.

 

Tell us a bit about your technical process for building data science workflows. 

It all starts with the repo. I always start with Cookiecutter Data Science, a scalable tool that can generate a repo structure tailored to a given project while automatically including necessary directories and conforming to a general type. For both big data and local work, Luigi is a flexible workflow management system that can combine simple tasks into complex ETL and model-building processes. 
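
As a rough illustration of the Luigi piece, a two-task pipeline in which a cleaning step feeds a training step might look like the sketch below; the file paths and the transformations are placeholders.

    import luigi
    import pandas as pd


    class CleanData(luigi.Task):
        """Read the raw extract and write a cleaned version to disk."""

        def output(self):
            return luigi.LocalTarget("data/interim/clean.csv")

        def run(self):
            df = pd.read_csv("data/raw/events.csv")      # hypothetical raw file
            df = df.dropna(subset=["user_id"]).drop_duplicates()
            with self.output().open("w") as f:
                df.to_csv(f, index=False)


    class TrainModel(luigi.Task):
        """Depend on CleanData, then fit and persist a model."""

        def requires(self):
            return CleanData()

        def output(self):
            return luigi.LocalTarget("models/model.txt")

        def run(self):
            df = pd.read_csv(self.input().path)
            # Placeholder for real model fitting; write a marker so Luigi sees output.
            with self.output().open("w") as f:
                f.write(f"trained on {len(df)} rows\n")


    if __name__ == "__main__":
        luigi.build([TrainModel()], local_scheduler=True)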

To round things out with visualization, enterprise tools like Tableau feature easy end-user deployment, dynamic capabilities and integration with coding languages, but might not be worth investing in over stable open-source tools like combining Plotly and Flask (or ggplot2 with a little elbow grease).  
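
A hedged sketch of that open-source combination: a single Flask route that renders a Plotly figure as embeddable HTML. The route and data are made up.

    import plotly.express as px
    from flask import Flask

    app = Flask(__name__)


    @app.route("/")
    def dashboard():
        # Placeholder data; a real app would query a database or feature store.
        fig = px.line(
            x=["2020-01", "2020-02", "2020-03", "2020-04"],
            y=[112, 134, 129, 151],
            labels={"x": "month", "y": "signups"},
            title="Monthly signups (sample data)",
        )
        # Plotly can serialize a figure straight to an embeddable HTML page.
        return fig.to_html(full_html=True, include_plotlyjs="cdn")


    if __name__ == "__main__":
        app.run(debug=True)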

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Data scientists are generally terrible at and fearful of unit testing, but it’s the backbone of reproducibility and stability. Ensuring that the inputs and outputs for the processes you've built are what you expect them to be is as important for building models as building websites. Beyond making robust unit tests, what I’ve found most effective for stability and reproducibility is building up libraries of common tasks. When the wheel isn’t constantly being reinvented, data cleaning, feature engineering, cross-validation and model tuning go faster and processes between collaborators become standardized.
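
To make the unit-testing point concrete, a minimal pytest example for a shared cleaning helper (the helper itself is hypothetical) checks that inputs and outputs behave the way downstream steps expect:

    import pandas as pd


    def impute_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Example of a 'common task' worth keeping in a shared library."""
        out = df.copy()
        out[column] = out[column].fillna(out[column].median())
        return out


    def test_impute_median_fills_all_nulls():
        df = pd.DataFrame({"age": [10.0, None, 30.0]})
        result = impute_median(df, "age")
        assert result["age"].isna().sum() == 0
        assert result.loc[1, "age"] == 20.0      # median of 10 and 30


    def test_impute_median_does_not_mutate_input():
        df = pd.DataFrame({"age": [1.0, None]})
        impute_median(df, "age")
        assert df["age"].isna().sum() == 1       # original frame untouched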

 

Data scientists are generally terrible at and fearful of unit testing, but it’s the backbone of reproducibility and stability.

 

What advice do you have for other data scientists looking to improve how they build their workflows?

Remember to document and iterate the workflow, the same way as everything else. There can be a tendency on data science projects to focus on the expansion step of engineering a project: thinking through different approaches to a data set, creatively solving problems that arise, etc. But improving the process across projects, especially at a workflow level, often comes at the consolidation step: documenting what was done, considering where the problems and inefficiencies cropped up and imagining ways to improve for the future. 

Take the time to write out workflow steps as they’re happening, highlight inefficient moments and return to them after the project is done to think through alternatives.


 


In a relatively new and dynamic field like data science, adaptability is key. 

“There’s no one workflow that suits everyone and every project,” said CB Insights Senior Data Scientist Rongyao Huang. “Identify the primary focus of each project phase and pick tools accordingly.” Since the field evolves fast, Huang is constantly on the lookout for new tools as they emerge.

Huang and other data scientists across New York agree: selecting a data science workflow depends on the particulars of the project, which should be discussed early in the planning phase and include stakeholders across the business, including product and engineering.

“At the beginning of the process, I ask partners about the problem we are trying to solve and the metrics that will be most useful,” Natalie Goldman, data analyst at theSkimm, said. “I use this question as a tool to help focus on the user and determine what we can actually find using data.”

Defining metrics and documenting progress throughout a project’s pipeline help build repeatable processes and track down errors. When an established metric of success is missed, teams can retrace their steps, determine what went wrong, and iterate quickly to ensure future progress. 

“Build pipelines to iterate as fast as possible while testing robustly,” said Orchard Lead Data Engineer Greg Svigruha. “Don’t get too wedded to any specific tool, framework or language. Use anything that provides the most value.”

 

CB Insights

Rongyao Huang

SENIOR DATA SCIENTIST

Huang said data science projects for her team are broken into three phases: exploration, refinement and productionization. Each stage has a different primary focus, and an effective workflow should aim to speed up each one toward its goal.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

Jupyter Notebook is the best environment for the exploration stage. Its interactive nature and its ability to combine code with documentation and visuals make it efficient for fast prototyping and collaboration. Docker and virtualenv isolate project environments, making them replicable and portable, a must-have when doing remote development or collaborating with others. And we use Project Log to keep a record for each campaign.

A good editor or integrated development environment makes refactoring in the refinement stage a lot easier. Our favorites include PyCharm, Atom, Visual Studio Code and Sublime Text. We use Google Sheets for error analysis and performance tracking, and we check TensorBoard for deep learning projects.

In the productionization stage, we’ve developed templates for jobs and services. A data science solution implements standard interfaces for how data should be fetched, prepared, processed and saved, and the remaining work of configuration, logging and deployment is abstracted away. We’re also continuously adding to a utils library where common functionalities like database reads and writes are standardized.
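
A hedged sketch of the kind of utils function that standardizes database reads and writes; the connection string is a placeholder, and the snippet assumes SQLAlchemy plus an appropriate database driver are installed.

    from typing import Optional

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string; in practice it comes from config or a secrets store.
    ENGINE = create_engine("postgresql://user:password@warehouse.internal:5432/analytics")


    def read_table(query: str, params: Optional[dict] = None) -> pd.DataFrame:
        """Standardized read: every project uses the same engine and parameter handling."""
        return pd.read_sql(query, ENGINE, params=params)


    def write_table(df: pd.DataFrame, table: str) -> None:
        """Standardized write with one agreed-upon append policy."""
        df.to_sql(table, ENGINE, if_exists="append", index=False)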

DATA AND MODEL VERSIONING IN THE REFINEMENT STAGE

Huang said it's helpful to standardize how data and model versions are named, annotated and saved. Her process includes automating functions like seeding random components, adding Unix timestamps to standardize file and folder names, saving metadata alongside data and models, and backing everything up on Amazon S3 after hitting “save.”
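
A hedged sketch of that naming-plus-metadata convention; the bucket name is a placeholder, and the S3 upload assumes boto3 credentials are already configured.

    import json
    import pickle
    import time
    from pathlib import Path

    import boto3


    def save_model(model, name: str, metadata: dict, bucket: str = "my-ds-artifacts"):
        """Save a model under a Unix-timestamped name, with metadata and an S3 backup."""
        version = f"{name}_{int(time.time())}"       # e.g. churn_model_1589976000
        out_dir = Path("models") / version
        out_dir.mkdir(parents=True, exist_ok=True)

        with open(out_dir / "model.pkl", "wb") as f:
            pickle.dump(model, f)
        with open(out_dir / "metadata.json", "w") as f:
            json.dump(metadata, f, indent=2)

        # Back everything up to S3 under the versioned prefix.
        s3 = boto3.client("s3")
        for path in out_dir.iterdir():
            s3.upload_file(str(path), bucket, f"{version}/{path.name}")
        return version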

What are some of your best practices for creating reproducible and stable workflows?

A good reproducibility framework is PyTorch Lightning, a lightweight ML wrapper that helps to organize PyTorch code to decouple the data science from engineering and automate training pipelines. The Lightning template implements standardized interfaces of practices like model configuration, data preparation and others. 

However, I think this should be a cautious investment, especially for small teams. It comes with a flexibility-versus-automation trade-off, a learning curve for users and a maintenance cost for developers. It’s worth adopting when it helps the team across a broad range of tasks, automates heavy components of the workflow and the team is committed to maintaining it in the long run.
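
For readers new to the framework, a minimal LightningModule looks roughly like the generic sketch below (not CB Insights’ code); the network and data are placeholders.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class TinyClassifier(pl.LightningModule):
        """The science lives in the module; Lightning handles the training loop."""

        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, x):
            return self.net(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = self.loss_fn(self(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    if __name__ == "__main__":
        # Placeholder data standing in for a real training set.
        X = torch.randn(1024, 20)
        y = torch.randint(0, 2, (1024,))
        loader = DataLoader(TensorDataset(X, y), batch_size=64)
        pl.Trainer(max_epochs=1, logger=False).fit(TinyClassifier(), loader)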


 

Flatiron Health

Christine Hung

VP OF DATA INSIGHTS ENGINEERING

Hung’s team approaches data science experiments with the idea in mind that they should be easily replicated. Using tools like Jupyter Notebook for experimenting or Blocks for pipeline-building, they prioritize traceability and repeatability so they can track errors easily. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

Our data science team uses the right tool for the specific project at hand. Depending on who the team is collaborating with, we may end up using different technology sets, but within each project, we always have reproducibility top of mind. 

Some of our favorite tools include the Jupyter Notebook for experimenting and prototyping, scikit-learn for model development, and an in-house, multi-language ETL system called Blocks for building end-to-end data science pipelines.

 

What are some of your best practices for creating reproducible and stable workflows?

We are believers in versioning not only code and models, but data as well. To reproduce the results of an analysis, it’s essential that we are able to go back to the original data easily. We also spend a lot of time ensuring our workflows have the right level of continuous evaluation and monitoring, so that it’s easy to spot regressions and stability issues as soon as they surface. 

Lastly, we take the time to educate our teams on reproducibility and implementing processes that help ensure we are following best practices. A little bit of work up front goes a long way.

 

What advice do you have for other data scientists looking to improve their workflows?

It’s important to design workflows that are inclusive of the tools and technologies commonly used by the team members involved in a project. Sometimes, when multiple teams are involved in a project, there are ways to integrate multiple languages together to allow domain experts to work in their preferred toolset while still enabling an efficient, stable and reproducible workflow.

It’s also important to treat every analysis and experiment as something that’s going to be revisited. Be very thoughtful about monitoring and documenting how data is generated, transformed and analyzed. Design a process that requires minimal additional effort, but allows for every analysis to be easily recreated.


 

The Trade Desk

Harry Shore

DATA ANALYST

Shore said the data and engineering teams work very closely together at The Trade Desk. Communication creates transparency and encourages the sharing of different perspectives. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

We’ve long used Vertica as our core data store and analytics platform, which is great for data exploration and rapid prototyping. It’s easy to pull data and export it to another application. When I’m uncovering relationships in a new data set, I’ll often do some filtering and aggregation in Vertica, then throw the result set into Tableau to build exploratory plots.

However, as our modeling work got more sophisticated, we moved from Vertica to Spark because some algorithms just aren’t feasible to implement in SQL. Zeppelin is a great tool for rapid Spark development, and our engineering team has worked to make our core data sets available in Parquet format via S3.
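
A hedged sketch of that Spark-plus-Parquet pattern; the bucket path and columns are invented, and reading S3 paths assumes the cluster has the appropriate Hadoop and AWS libraries configured.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("exploration").getOrCreate()

    # Core data sets the engineering team publishes as Parquet on S3 (path is hypothetical).
    impressions = spark.read.parquet("s3://example-bucket/core/impressions/")

    daily = (
        impressions
        .filter(F.col("event_date") >= "2020-05-01")
        .groupBy("event_date", "campaign_id")
        .agg(F.count("*").alias("impressions"), F.avg("bid_price").alias("avg_bid"))
    )
    daily.show(20)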

 

What are some of your best practices for creating reproducible and stable workflows?

The move from Vertica to Spark had the potential to complicate workflows; it was easy to schedule execution of a Vertica query and copy the results table into a production system. But models in Spark can have a wide array of designs, dependencies and outputs.

One of our engineers had a great idea: What if we had a single Spark project that could be used to both develop and run models? So, we worked closely with engineering as they developed a shared library, where each model was its own class, and each class could be run on a schedule using Airflow. The killer feature was that the library could also be loaded into Zeppelin. So as each data scientist was doing data exploration and prototyping, the code they were writing used the exact same helper functions and data interfaces available in production. This methodology made for a close to seamless transition from prototyping to production.
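
The shared library itself isn’t public, but its shape is roughly a common base class that both the scheduled Airflow jobs and the Zeppelin notebooks import. A hedged Python sketch with a hypothetical concrete model:

    from abc import ABC, abstractmethod

    from pyspark.sql import DataFrame, SparkSession


    class BaseModel(ABC):
        """Each model is its own class; Airflow jobs and Zeppelin notebooks share it."""

        def __init__(self, spark: SparkSession):
            self.spark = spark

        def load_features(self) -> DataFrame:
            # The same data interface is used in prototyping and in production.
            return self.spark.read.parquet(self.feature_path)

        @property
        @abstractmethod
        def feature_path(self) -> str:
            ...

        @abstractmethod
        def train(self, features: DataFrame) -> None:
            ...

        def run(self) -> None:
            """Entry point the scheduler calls; notebooks can call the pieces directly."""
            self.train(self.load_features())


    class BidFloorModel(BaseModel):
        """Hypothetical concrete model living in the shared library."""

        feature_path = "s3://example-bucket/features/bid_floor/"   # placeholder path

        def train(self, features: DataFrame) -> None:
            # Placeholder for the real Spark ML training code.
            print(f"training on {features.count()} rows")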

 

What advice do you have for other data scientists looking to improve their workflows?

We have a close working relationship with our engineers. From the first design conversations through productization and release, both data scientists and engineers are part of the conversation. This benefits everyone; early data science prototypes can be informed by production considerations, and the engineers working on the final stages of deployment have an understanding of how the model is supposed to work. Different people have different perspectives, and hearing them all at various stages of development can be helpful.

Also, don’t add too much structure before you need it. The pipeline we built has been helpful, but we’ve also made a point of keeping the structure pretty loose. Each model accesses data and is scheduled to run the same way. But beyond that, design is driven by the requirements of the project and what the data scientist building it thinks is appropriate.


 

theSkimm

Natalie Goldman

DATA ANALYST

Context is crucial for Goldman and her team. While numbers and raw data are important, Goldman said they can be deceiving when they lack circumstantial information. So it’s key to spend time analyzing qualitative data and user research.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

At a high level, my workflow is as follows: align on success metrics; find, validate, clean and analyze the data; apply models; communicate results; make recommendations and continue to monitor results. 

At the beginning of the process, I ask partners about the problem we are trying to solve and the metrics that will be most useful. I use this question as a “tool” to help focus on the user and determine what we can actually find using data. 

 

What are some of your best practices for creating reproducible and stable workflows?

I have found building informative, customizable dashboards to be the most effective, especially during long testing periods. I often share my dashboard with a partner in the company to test its effectiveness and then iterate depending on whether or not it is successfully interpreted without my assistance. Practices I have found helpful include using colors, usually green and red, as indicators of “good” and “bad,” and using text boxes on the dashboard to aid in interpretation. I also build in filters or editable fields to zero in on key data without changing anything on the back end.

 

What advice do you have for other data scientists looking to improve their workflows?

Documentation, organization and effective dashboarding are three tools to improve workflow. I also recommend using a planned file structure, setting calendar reminders and systematically collecting results and storing them in one place.

Additionally, as data professionals, we often trust numbers as the holy grail for everything we do. However, it’s important to recognize that numbers and metrics can sometimes be misleading without the proper context. Qualitative data and user research can provide invaluable insight into how users interact with products, and why they do what they do. Integrating research data such as surveys, Net Promoter Scores or brand studies into our workflows can help us put things into perspective.


 

Reonomy

Maureen Teyssier

CHIEF DATA SCIENTIST

“Data metrics are necessary before and after feature generation, and on the output from the model,” Teyssier said. 

Clear metrics — defined early alongside key stakeholders — make output changes easy to monitor. 

 

What are some of your favorite tools that you’ve used to build your data science workflow?

At Reonomy, we begin data science projects with a discovery phase that includes stakeholders from the product, data science and engineering teams. Having this collaboration upfront greatly increases the success rate of the projects. It gives our data scientists enough context to feel confident in their decisions and allows them to feel the importance of their work.

Machine learning projects are only successful when high-signal data is fed into models. Our data scientists create this signal by doing visualizations and analysis in Databricks, which allows them to extract data from many points in our Spark-Scala production pipelines. Using an interactive Spark environment also allows them to write code that is easier to transition into our production pipelines, which is important with cross-functional teams.

 

What are some of your best practices for creating reproducible and stable workflows?

When there’s machine learning embedded in production pipelines, it’s essential to create actionable metrics at several locations within the pipeline. “Actionable” means the metrics have enough breadth to capture changes in the data, but are not so general that it’s hard to understand what is going wrong. 

Data metrics are necessary before and after feature generation, and on the output from the model. With these metrics in place, when the output changes it’s possible to quickly identify whether the change is acceptable; if it isn’t, the metrics indicate where the fix belongs. We have also chosen not to dynamically train the models because, for a growing company, it adds a lot of uncertainty for marginal lift.
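
A hedged sketch of what metrics before and after feature generation, and on model output, can look like in practice; the specific metrics and tolerance are illustrative, not Reonomy’s.

    import pandas as pd


    def checkpoint_metrics(df: pd.DataFrame, stage: str) -> dict:
        """Compute a small, comparable set of metrics at one pipeline checkpoint."""
        return {
            "stage": stage,
            "rows": len(df),
            "null_pct": round(float(df.isna().mean().mean()), 4),
            "numeric_means": df.select_dtypes("number").mean().round(3).to_dict(),
        }


    def row_count_alerts(current: dict, baseline: dict, tolerance: float = 0.2) -> list:
        """Flag a checkpoint whose row count drifted past the tolerance (illustrative only)."""
        alerts = []
        if abs(current["rows"] - baseline["rows"]) > tolerance * baseline["rows"]:
            alerts.append(
                f"{current['stage']}: rows moved from {baseline['rows']} to {current['rows']}"
            )
        return alerts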

 

What advice do you have for other data scientists looking to improve how they build their workflows?

It’s important for a data scientist to consider a few key things: a clear idea of what needs to be built; ready access to the data needed to test features and models; the right tools that allow for quick iteration; clear communication with the people that will be implementing the model in production; and a way to surface technical performance metrics to stakeholders in the company. 


 

Orchard

Greg Svigruha

LEAD DATA ENGINEER

Svigruha’s data team relies heavily on testing and redeployment, since its work helping users buy and sell homes at fair prices depends on the ever-changing housing market. Redeployments happen multiple times a week, and each change is backtested by simulating how the modeling algorithm would have performed in the past.

 

What are some of your favorite tools that you’ve used to build your data science workflow?

At Orchard, our data science models change continually with the housing market, so they need to be redeployed regularly. Near real-time transactional data feeds into our system, accounting for movements in the market, and we need the models and their predictions to reflect those changes. On top of that, improvements to the models themselves are deployed multiple times a week.

We retrain our production model every night to make sure it has the latest market data. We use Airflow to perform a number of functions on AWS, like executing the latest algorithm to create model files, building a Docker image from the prediction service’s codebase, deploying, and performing walk-forward testing.

 

What are some of your best practices for creating reproducible and stable workflows?

Every change to the modeling algorithm needs to be backtested before it’s deployed to production, which is the most challenging part. Ideally, our backtests simulate how the modeling algorithm would have performed over the last year, had we deployed it one year ago. 

The evaluation data has to be large enough to be statistically significant and to counter seasonal effects, but we cannot use the same model to predict for an entire year because we would have trained new ones during that time. So, we repeatedly create new versions of the model by applying a shifting window to a historical dataset. One key difference compared to reality is that we simulate weekly rather than daily retraining, for cost and capacity reasons. This workflow is also orchestrated with Airflow and AWS.
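
A hedged, much-simplified sketch of that shifting-window simulation using pandas and scikit-learn; the synthetic history, features and error metric are placeholders rather than Orchard’s actual setup, and the 52 parallel machines are replaced by a simple loop.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Placeholder history: one row per sale with a date, a feature and a price.
    rng = np.random.default_rng(0)
    dates = pd.date_range("2018-01-01", "2020-01-01", freq="D")
    history = pd.DataFrame({
        "date": rng.choice(dates, size=20_000),
        "sqft": rng.uniform(500, 4000, size=20_000),
    })
    history["price"] = 150 * history["sqft"] + rng.normal(0, 20_000, size=20_000)

    residuals = []
    # One simulated retrain per week over the final year, not per day, for cost reasons.
    for week_start in pd.date_range("2019-01-07", "2019-12-30", freq="7D"):
        train = history[history["date"] < week_start]
        week_end = week_start + pd.Timedelta(days=7)
        test = history[(history["date"] >= week_start) & (history["date"] < week_end)]
        if test.empty:
            continue
        model = LinearRegression().fit(train[["sqft"]], train["price"])
        residuals.append(test["price"] - model.predict(test[["sqft"]]))

    all_residuals = pd.concat(residuals)
    print("median abs error:", float(all_residuals.abs().median()))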

STEPS IN ORCHARD’S BACKTEST SIMULATIONS

  • Adding new features to the model algorithm and regenerating training data
  • Launching 52 EC2 machines for the 52 weeks in a year
  • Training models on shifted versions of 10 years of historical data
  • Aggregating residuals and computing statistics of the model’s expected performance
  • Comparing performance to baseline and deciding on a proposed change
