XUS_IN_Data Engineer

Location: India
Experience: 3-5 years
Artificial Intelligence • Cloud • Information Technology • Software • Consulting • Data Privacy
The Role
Xebia is seeking a Data Engineer to join its team and work on IT solutions related to data, AI, and cloud. The role involves contributing to technology consulting, software engineering, digital products, and intelligent automation for leading brands worldwide.
About Xebia

Xebia is a trusted advisor in the modern era of digital transformation, serving hundreds of leading brands worldwide with end-to-end IT solutions. The company has experts specializing in technology consulting, software engineering, AI, digital products and platforms, data, cloud, intelligent automation, agile transformation, and industry digitization. In addition to providing high-quality digital consulting and state-of-the-art software development, Xebia has a host of standardized solutions that substantially reduce the time-to-market for businesses.

Xebia also offers a diverse portfolio of training courses to help support forward-thinking organizations as they look to upskill and educate their workforce to capitalize on the latest digital capabilities. The company has a strong presence across 16 countries with development centres across the US, Latin America, Western Europe, Poland, the Nordics, the Middle East, and Asia Pacific.


Responsibilities

  • Establish scalable, efficient, automated processes for data analysis, data model development, validation, and implementation.
  • Work closely with analysts/data scientists to understand the impact on downstream data models.
  • Write efficient and well-organized software to ship products in an iterative, continual release environment.
  • Contribute to and promote good software engineering practices across the team.
  • Communicate clearly and effectively to technical and non-technical audiences.


Minimum Qualifications:

  • University or advanced degree in engineering, computer science, mathematics, or a related field
  • Strong hands-on experience in Databricks using PySpark and Spark SQL (Unity Catalog, workflows, optimization techniques).
  • Experience with at least one cloud provider solution (GCP preferred)
  • Strong experience working with relational SQL databases.
  • Strong experience with an object-oriented/functional scripting language: Python.
  • Working knowledge of a data transformation tool (dbt preferred).
  • Ability to work on the Linux platform.
  • Strong knowledge of data pipeline and workflow management tools (Airflow).
  • Working knowledge of Git and GitHub.
  • Expertise in standard software engineering methodology, e.g. unit testing, code reviews, design documentation
  • Experience creating data pipelines that prepare data appropriately for ingestion and consumption.
  • Experience maintaining and optimizing databases/filesystems for production use in reporting and analytics.
  • Ability to work in a collaborative environment and interact effectively with technical and non-technical team members alike; good verbal and written communication skills.


Questionnaire

Scenario 1:

Data Pipeline Design on GCP


You are tasked with designing a data pipeline to process and analyze log data generated by a web application. The log data is stored in Google Cloud Storage (GCS) and needs to be ingested, transformed, and loaded into BigQuery for reporting and analysis.


Requirements:

Ingestion: The log data should be ingested from GCS to a staging area in BigQuery.


Transformation: Apply necessary transformations such as parsing JSON logs, filtering out irrelevant data, and aggregating metrics.


Loading: Load the transformed data into a final table in BigQuery for analysis.


Orchestration: The entire pipeline should be orchestrated to run daily.


Monitoring and Alerting: Set up monitoring and alerting to ensure the pipeline runs successfully and errors are detected promptly.



Questions:


1) Ingestion:


What GCP services would you use to ingest the log data from GCS to BigQuery, and why?

Provide an example of how you would configure this ingestion process.
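
One possible approach is a scheduled batch load job from GCS into a staging dataset using the BigQuery client library (BigQuery Data Transfer Service or a Cloud Composer task are alternatives). A minimal sketch, assuming newline-delimited JSON logs and placeholder project, bucket, and table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Batch-load newline-delimited JSON logs from GCS into a staging table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the staging schema from the logs
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-log-bucket/logs/2024-01-01/*.json",  # placeholder GCS path
    "my-project.staging.web_logs",                # placeholder staging table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(f"Loaded {load_job.output_rows} rows into staging.web_logs")
```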



2) Transformation:


Describe how you would implement the transformation step. What tools or services would you use?

Provide an example transformation you might perform on the log data.
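
As one illustrative transformation, run as a SQL step via the BigQuery client (the staging column names event_time, page, and status are assumptions about the autodetected schema), the step below filters out health-check traffic and aggregates daily request and error counts:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate staged logs into daily per-page metrics; for fields that arrive as
# raw JSON strings, JSON_VALUE()/JSON_EXTRACT_SCALAR() can be used to parse them.
transform_sql = """
CREATE OR REPLACE TABLE `my-project.analytics.daily_page_metrics` AS
SELECT
  DATE(event_time)        AS log_date,
  page,
  COUNT(*)                AS requests,
  COUNTIF(status >= 500)  AS server_errors
FROM `my-project.staging.web_logs`
WHERE page != '/healthz'  -- filter out irrelevant health-check traffic
GROUP BY log_date, page
"""
client.query(transform_sql).result()
```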


3) Loading:


How would you design the schema for the final BigQuery table to ensure efficient querying?

What considerations would you take into account when loading data into BigQuery?
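
One reasonable final-table design, sketched with the Python client: partition by the date column so daily reports scan only the partitions they need, and cluster on the most frequently filtered column. All names are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("log_date", "DATE"),
    bigquery.SchemaField("page", "STRING"),
    bigquery.SchemaField("requests", "INT64"),
    bigquery.SchemaField("server_errors", "INT64"),
]

table = bigquery.Table("my-project.analytics.daily_page_metrics", schema=schema)
# Date partitioning limits scans to the queried days; clustering on page
# prunes blocks within each partition when queries filter on page.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="log_date"
)
table.clustering_fields = ["page"]
client.create_table(table, exists_ok=True)
```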


4) Orchestration:


Which GCP service would you use to orchestrate the data pipeline, and why?

Outline a high-level workflow for the daily orchestration of the pipeline.
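
A high-level sketch of a daily Cloud Composer (Airflow) DAG wiring ingestion and transformation together; the operators, names, and placeholder SQL are illustrative rather than prescriptive:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_web_log_pipeline",  # placeholder DAG name
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    # 1) Ingest the previous day's raw logs from GCS into the staging table.
    ingest = GCSToBigQueryOperator(
        task_id="ingest_logs_to_staging",
        bucket="my-log-bucket",
        source_objects=["logs/{{ ds }}/*.json"],
        destination_project_dataset_table="my-project.staging.web_logs",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    # 2) Transform staged rows and load the final reporting table.
    transform = BigQueryInsertJobOperator(
        task_id="transform_and_load",
        configuration={
            "query": {
                "query": "SELECT 1  -- placeholder: the transformation SQL shown earlier",
                "useLegacySql": False,
            }
        },
    )

    ingest >> transform
```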


5) Monitoring and Alerting:


What strategies would you use to monitor the pipeline's performance?

How would you set up alerts to notify you of any issues?
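
Beyond Cloud Monitoring dashboards and log-based alerts on the Composer environment and BigQuery jobs, one lightweight pattern is an Airflow failure callback so any task failure immediately triggers a notification; the callback body below is a placeholder for whatever channel (email, Slack, PagerDuty) is in use:

```python
def notify_on_failure(context):
    """Placeholder callback: forward the failed task's details to an alerting channel."""
    task_instance = context["task_instance"]
    message = (
        f"Pipeline failure: {task_instance.dag_id}.{task_instance.task_id} "
        f"failed for run date {context['ds']}"
    )
    print(message)  # replace with a Slack/email/PagerDuty call

# Attach to the DAG via default_args so every task inherits the callback.
default_args = {
    "on_failure_callback": notify_on_failure,
    "retries": 1,
}
```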



Scenario 2: Optimizing BigQuery Queries


You are responsible for optimizing BigQuery queries to improve performance and reduce costs. You notice that a frequently run query is taking longer than expected and is costly.



Questions:


1) Performance Analysis:


How would you analyze the performance of a BigQuery query?

What specific metrics or logs would you look at to identify inefficiencies?
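
For instance, the INFORMATION_SCHEMA jobs views expose bytes scanned, slot time, and runtime per query, which is a common starting point for spotting expensive queries (the `region-us` qualifier is an assumption about where the project runs):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Rank the last week's queries by bytes processed (a direct proxy for on-demand cost).
sql = """
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20
"""
for row in client.query(sql):
    print(row.job_id, row.total_bytes_processed, row.total_slot_ms, row.runtime_seconds)
```

The query execution plan in the BigQuery console (stage timing, shuffle bytes, slot usage) complements these job-level metrics when drilling into a single slow query.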


2) Optimization Techniques:

List at least three techniques you would use to optimize a BigQuery query.

Explain how each technique improves performance or reduces costs.
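
As a small illustration of two such techniques together, selecting only the needed columns and filtering on the partition column, with a dry run to verify the scan size before paying for the query; the table and column names are carried over from the earlier sketches:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Avoid SELECT *: project only the required columns and prune partitions by date.
sql = """
SELECT page, SUM(requests) AS requests
FROM `my-project.analytics.daily_page_metrics`
WHERE log_date BETWEEN '2024-01-01' AND '2024-01-31'  -- partition pruning
GROUP BY page
"""

# A dry run reports how many bytes the query would scan without actually running it.
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run = client.query(sql, job_config=dry_run_config)
print(f"Estimated bytes processed: {dry_run.total_bytes_processed}")
```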



3) Partitioning and Clustering:

Describe how you would use partitioning and clustering in BigQuery to optimize query performance.

Provide an example scenario where each technique would be beneficial.
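
For instance (table and column names assumed purely for illustration), a large raw-events table queried mostly by date range and customer could be declared with date partitioning plus clustering on the frequent filter columns:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partitioning prunes whole days from the scan; clustering orders data within each
# partition so filters on customer_id/event_type read fewer blocks.
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.analytics.events`
(
  event_date   DATE,
  customer_id  STRING,
  event_type   STRING,
  payload      JSON
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type
"""
client.query(ddl).result()
```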





Scenario 3: Data Migration to GCP


Your organization is migrating its on-premises data warehouse to Google Cloud Platform. You need to design and implement a migration strategy.


Questions:

1) Planning and Assessment:


What factors would you consider when planning the migration of an on-premises data warehouse to GCP?

How would you assess the readiness of your existing data warehouse for migration?


2) Migration Strategy:


Describe the steps you would take to migrate data from an on-premises data warehouse to BigQuery.

What tools or services would you use to facilitate the migration?
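
For the bulk-load phase, one common pattern (alongside options such as the BigQuery Data Transfer Service or the BigQuery Migration Service, depending on the source warehouse) is to export tables to Parquet, stage them in GCS, and batch-load each into BigQuery; every name below is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Tables exported from the on-premises warehouse as Parquet and staged in GCS.
tables_to_migrate = ["customers", "orders", "order_items"]  # illustrative list

for table_name in tables_to_migrate:
    job = client.load_table_from_uri(
        f"gs://migration-staging-bucket/{table_name}/*.parquet",
        f"my-project.warehouse.{table_name}",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        ),
    )
    job.result()
    print(f"Migrated {table_name}: {job.output_rows} rows")
```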


3) Post-Migration Optimization:


After migrating the data, how would you optimize the new BigQuery data warehouse for performance and cost-efficiency?

What best practices would you follow to ensure the migrated data is accurate and queryable?
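
One simple accuracy check teams often run after the load is a row-count (or checksum) reconciliation against counts captured from the source at export time; the expected counts below are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Row counts captured from the on-premises warehouse at export time (placeholders).
expected_counts = {"customers": 1_200_000, "orders": 8_500_000}

for table_name, expected in expected_counts.items():
    result = client.query(
        f"SELECT COUNT(*) AS n FROM `my-project.warehouse.{table_name}`"
    ).result()
    actual = next(iter(result)).n
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{table_name}: expected {expected}, got {actual} -> {status}")
```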




Scenario 4: Real-time Data Processing on GCP


Your company requires real-time data processing to analyze streaming data from IoT devices. The data needs to be ingested, processed, and stored for further analysis.


Questions:


1) Ingestion:


What GCP service(s) would you use to ingest real-time streaming data from IoT devices?

Explain the benefits of using these services for real-time data ingestion.
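
For reference, publishing one device reading to a Pub/Sub topic looks roughly like this (topic name and payload are placeholders); in practice devices usually publish through a gateway or managed ingestion layer rather than calling the client library directly:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-device-readings")  # placeholders

reading = {"device_id": "sensor-42", "temperature_c": 21.7, "ts": "2024-01-01T00:00:00Z"}
future = publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
print(f"Published message {future.result()}")  # blocks until the server acknowledges
```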


2) Processing:


Describe how you would implement real-time data processing on GCP.

Which GCP services would you use, and why?
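
A minimal Apache Beam sketch of the processing step: read from the Pub/Sub subscription, parse each JSON payload, and stream the rows into BigQuery. Run with the Dataflow runner, this becomes a managed streaming job; the subscription, table, and field names are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_reading(message_bytes):
    """Decode one Pub/Sub message into a BigQuery-ready dict."""
    reading = json.loads(message_bytes.decode("utf-8"))
    return {
        "device_id": reading["device_id"],
        "temperature_c": float(reading["temperature_c"]),
        "event_ts": reading["ts"],
    }


# Add runner/project/region options when submitting to Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/iot-device-readings-sub")
        | "ParseJson" >> beam.Map(parse_reading)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:iot.device_readings",  # assumes the table already exists
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```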


3) Storage:


How would you store the processed real-time data for efficient querying and analysis?

What considerations would you take into account when choosing a storage solution?
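
If BigQuery is chosen as the analytical sink (one reasonable option for SQL analysis of the processed stream), the destination table can be partitioned on the event timestamp with a partition expiration so old raw readings age out automatically; the 90-day retention below is an arbitrary example:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("device_id", "STRING"),
    bigquery.SchemaField("temperature_c", "FLOAT64"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]

table = bigquery.Table("my-project.iot.device_readings", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
    expiration_ms=90 * 24 * 60 * 60 * 1000,  # keep 90 days of raw readings
)
table.clustering_fields = ["device_id"]
client.create_table(table, exists_ok=True)
```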




One-liners for GCP:


How do you secure data in Google Cloud Storage?


What is the difference between Google BigQuery and Google Cloud SQL?


How do you implement data pipeline automation in Google Cloud?


Can you explain the role of Google Cloud Pub/Sub in data processing?


What strategies do you use for cost optimization in Google Cloud?


How do you handle schema changes in Google BigQuery?
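
For instance, BigQuery supports additive schema changes in place (new NULLABLE columns); the snippet below appends a column with the Python client, with the table and column names as placeholders. More restrictive changes (type changes, renames, drops) may need ALTER TABLE statements or a table rewrite, depending on the change:

```python
from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my-project.analytics.daily_page_metrics")  # placeholder table

# Additive change: append a new NULLABLE column and push only the schema update.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("country", "STRING", mode="NULLABLE"))
table.schema = new_schema
client.update_table(table, ["schema"])
```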


What is the purpose of Google Dataflow, and when would you use it?


How do you monitor and troubleshoot performance issues in Google Cloud Dataproc?


Explain the difference between managed and unmanaged instance groups in GCP.


How would you design a data warehouse architecture on GCP?

Some useful links:

Xebia | Creating Digital Leaders.

https://www.linkedin.com/company/xebia/mycompany/

http://twitter.com/xebiaindia

https://www.instagram.com/life_at_xebia/

http://www.youtube.com/XebiaIndia


Top Skills

Data Engineering
The Company
HQ: Atlanta, GA
3,254 Employees
On-site Workplace
Year Founded: 2001

What We Do

We are a pioneering IT consultancy company, following 1 mission, 4 values, and 4 business principles.

WHO WE ARE
With over 20 years of experience, our global network of passionate technologists and pioneering craftsmen deliver cutting-edge technology and game-changing consulting to companies on the brink of transformation.

Founded in 2001, Xebia was the first Dutch organization to embrace the Agile way of working, with gurus like Jeff Sutherland. Since then, we have grown from a Java company into a full-service digital consulting company with 4500+ professionals working on a worldwide ambition.

We are organized in complementary chapters – teams with a tremendous amount of knowledge and experience within a particular field, such as Agile, DevOps, Data and AI, Cloud, Software Technology, Low Code, and Microsoft.

We help the world’s top 250 companies and category leaders overcome digital challenges, embrace innovation, adopt new technology, and implement new business models. In addition to high-quality consulting, we also provide offshoring and nearshoring services.

WHAT WE DO
★ Digital Strategy
★ DevOps and SRE
★ Agile
★ Data and AI
★ Cloud
★ Microsoft Solutions
★ Software Technology
★ Security
★ Low Code
★ Xebia Academy

HOW WE ARE ORGANIZED
Xebia has launched specific labels, like GoDataDriven, Binx, Xpirit, Qxperts, Stackstate, Instruqt, Xccelerated, and Xebia Academy. Complementing our organic growth, other specialized companies have joined our successful journey and also operate within the Xebia network under their own brand names, like Appcino, coMakeIt, g-company, Oblivion, PGS Software, and SwissQ. Together we are Xebia.

We have 17 offices across Atlanta, San Francisco, the UK, Vietnam, Canada, Amsterdam and Hilversum (the Netherlands), Belgium, Germany, Gurgaon, Jaipur, Hyderabad, Pune, Bangalore, Poland, Melbourne, Mexico, and Dubai.

✉️ [email protected]
