SD Solutions

NDA Databricks-native platform | Data Engineer

Posted 2 Hours Ago

Hiring Remotely in Argentina

Remote

Senior level

HR Tech • Information Technology • Professional Services

The Role

Design and build scalable batch and streaming data pipelines on the Databricks Lakehouse to support entity resolution, MDM, data quality, and real-time analytics. Optimize Spark/PySpark workloads, implement CDC/CDF patterns and Delta Live Tables, create reliable idempotent pipelines, design data models (Kimball, SCDs), and collaborate with AI/ML Engineers and Data Scientists while promoting best practices (testing, version control, monitoring).

Summary Generated by Built In

On behalf of NDA, a Databricks-native platform, SD Solutions is looking for an experienced Data Engineer to design and build the data infrastructure that powers our Master Data Management platform, built natively on the Databricks Data Intelligence Platform.

In this role, you will develop scalable, high-performance data pipelines that support entity resolution, data quality, and real-time analytics.

You will take a hands-on role in building and optimizing batch and streaming pipelines using modern Lakehouse technologies, including Delta Live Tables and Change Data Capture patterns. This includes ensuring data reliability, consistency, and performance through robust pipeline design, testing, and optimization of Spark workloads.

Working closely with AI/ML Engineers, Product Managers, and Data Scientists, you will translate data requirements into efficient data models and pipelines that enable intelligent features and analytics. You will also contribute to best practices across data engineering, including monitoring, version control, and automated testing.

This is a highly self-directed role suited for someone who thrives in a fast-paced environment, where building scalable data systems and ensuring data quality at scale are central to success.

SD Solutions is a staffing company operating globally. Contact us to get more details about the benefits we offer.

Responsibilities:

Design and Develop Scalable Data Pipelines: Lead the design, development, and optimization of robust, high-performance data pipelines within the Databricks Lakehouse Platform to support LakeFusion's core functionalities, including entity resolution, data quality, and analytical reporting.
Implement Real-time Data Ingestion: Build and manage streaming data pipelines using Delta Live Tables (DLT) and other Databricks capabilities for real-time data ingestion, transformation, and processing, leveraging Change Data Capture (CDC) and Change Data Feed (CDF) patterns.
Optimize Spark Workloads: Apply advanced PySpark best practices and Spark optimization techniques to ensure efficient processing of large-scale datasets, reducing latency and cost for batch and streaming operations.
Ensure Data Reliability and Quality: Develop pipelines with a strong focus on reliability, testability, and data quality. Implement idempotent designs to guarantee data consistency and accuracy across all data flows.
Data Modeling and Architecture: Design and implement logical and physical data models for the Lakehouse, including dimensional modeling (e.g., Kimball methodology) and handling Slowly Changing Dimensions (SCDs), to support analytical and operational needs.
Collaborate on Data Solutions: Work closely with AI/ML Engineers, Product Managers, and Data Scientists to understand data requirements, integrate new data sources, and provide foundational data infrastructure for LakeFusion's intelligent features.
Promote Best Practices: Advocate for and implement best practices in data engineering, including version control, automated testing, monitoring, and alerting for data pipelines.

Requirements:

5+ years of hands-on experience as a Data Engineer or in a similar role, specifically building and managing large-scale data platforms and pipelines in a production environment.
Deep expertise with the Databricks Lakehouse Platform, including extensive experience with Delta Lake, Databricks SQL, Unity Catalog, and Databricks Workflows.
Proven proficiency in building and optimizing data pipelines using Apache Spark, particularly with PySpark for complex data transformations and processing.
Demonstrated experience with streaming data technologies and building real-time pipelines, ideally using Delta Live Tables (DLT).
Strong understanding and practical application of Change Data Capture (CDC) and Change Data Feed (CDF) patterns for incremental data loading.
Solid foundation in data modeling concepts, including dimensional modeling (Kimball) and techniques for managing Slowly Changing Dimensions (SCDs).
Experience in designing and implementing reliable, testable, and idempotent data pipelines, ensuring data quality and consistency.
Familiarity with data governance, metadata management, and data cataloging principles.
Excellent problem-solving skills and the ability to debug complex data issues across distributed systems.
Strong communication skills, capable of articulating complex technical concepts to both technical and non-technical stakeholders.

Advantages:

Specific experience with Entity Resolution or Master Data Management (MDM) systems and their underlying data structures.
Experience with cloud platforms (AWS, Azure) for data engineering deployments.
Knowledge of MLOps practices and integrating data pipelines with machine learning workflows.
Experience with CI/CD for data pipelines and infrastructure as code (e.g., Terraform).

About the company:

NDA is a Databricks-native platform that unifies master data, product data, and relationships into a single AI-ready foundation.
One platform. Three sources of trust:
• MDM (Trusted Entities): Customers, suppliers, and accounts mastered and governed across systems
• Graph (Trusted Networks): Relationships and hierarchies connected for intelligence and AI reasoning
• PIM (Trusted Products): SKUs and catalogs enriched and ready for commerce
Built entirely on Databricks, NDA eliminates data duplication and brings governance directly to where your data lives.
Powered by LLMs, we deliver explainable entity resolution, automated stewardship, and trusted golden records at scale.

By applying for this position, you agree to the terms outlined in our Privacy Policy. Please take a moment to review the policy at https://sd-solutions.breezy.hr/privacy-notice and make sure you understand its contents. If you have any questions or concerns about it, please feel free to contact us.

Skills Required

5+ years hands-on experience as a Data Engineer building and managing large-scale production data platforms and pipelines
Deep expertise with the Databricks Lakehouse Platform (Delta Lake, Databricks SQL, Unity Catalog, Databricks Workflows)
Proven proficiency building and optimizing data pipelines using Apache Spark, particularly PySpark
Experience with streaming data technologies and building real-time pipelines, ideally using Delta Live Tables (DLT)
Practical experience implementing Change Data Capture (CDC) and Change Data Feed (CDF) patterns
Strong understanding of data modeling concepts, including dimensional modeling (Kimball) and Slowly Changing Dimensions (SCDs)
Experience designing and implementing reliable, testable, and idempotent data pipelines ensuring data quality and consistency
Familiarity with data governance, metadata management, and data cataloging principles
Excellent problem-solving skills and ability to debug complex data issues across distributed systems
Strong communication skills to articulate technical concepts to technical and non-technical stakeholders
Experience with Entity Resolution or Master Data Management (MDM) systems and their data structures
Experience with cloud platforms (AWS, Azure) for data engineering deployments
Knowledge of MLOps practices and integrating data pipelines with machine learning workflows
Experience with CI/CD for data pipelines and infrastructure as code (e.g., Terraform)