Data Engineering Questions & Training

Data Engineering Questions / Training

By mark smallcombe

  1. Explain the difference and the reason to choose using NoSQL {mongoDB | DynamoDB | .. } over Relational database {Postgress |MySQL} and vice versa. Give an example for a project where you had to make this choice, and walk through your reasoning.

(This question can be modified for the relevant technologies.. )

  1. Streaming vs Batch “Explain the difference and the reason to choose using Streaming over Batch and vice versa. Give an example for a project where you had to make this choice, and walk through your reasoning.”

  2. Job vs Service “Explain the difference and the reason to choose using Job over Service and vice versa. Give an example for a project where you had to make this choice, in the context of ML pipelines and walk through your reasoning.”

  3. Athena

    1. What is the engine behind athena

    2. How is presto different from Spark? How does it affect your query planning?

    3. Performance tuning - Top 10: partitioning, bucketing, compression, optimize file sizes, optimize columnar data store generation, query tuning, optimize order by, optimize group by, use approx functions, column selection. What are the tradeoffs (time vs cost)?

    4. What is the cost composed of?

    5. How can you calculate cost?

    6. How can you optimize your queries (partitions, join order, limit tricks, etc)

    7. What options do you have to limit the cost of athena?

    8. when would u use athena vs spark

  4. Spark -

    1. Join strategies #1, Join strategies #2 - how? Pros and cons. (broadcast hash, shuffle hash, shuffle sort merge, cartesian).

    2. What’s the difference between a data frame and a dataset?

    3. Sort merge vs broadcast

      1. broadcast join is 4 times faster if one of the table is small and enough to fit in memory

      2. Is broadcasting always a good solution ? Absolutely no. If you are joining two data sets both are very large broad casting any table would kill your spark cluster and fails your job.

    4. Shuffle & AQE

      1. Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan.

      2. Dynamically coalescing shuffle partitions

      3. Dynamically switching join strategies

      4. Dynamically optimizing skew joins

  5. BigQuery

    1. What is the difference in the implementation between partitions and clustering in BQ?

    2. What ways do you know to reduce query cost in BigQuery?

    3. What is the BigQuery cost composed of? How can you reduce storage cost?

    4. Did you ever encounter a memory error when running BigQuery? Why does it happen and how is it related to the Dremel implementations

    5. How can you control the access to sensitive data in BigQuery?

    6. What options do you have to limit the cost of BigQuery?

    7. When using BigQuery ML to train TF models - what happens in the background?

  6. Airflow

    1. What is airflow?

    2. How do you transfer information between tasks in airflow?

    3. Please give me a real-world example of using spark and airflow together

  7. Data Validation

    1. How can you protect yourself from bad data? Data validation, TDDA, monitoring.

    2. Tools:

      1. Type validation: typeguard

      2. Data validation pydantic

      3. Test driven: tdda

      4. Data quality: great expectations

  8. File formats

    1. Can you explain the parquet file format? https://parquet.apache.org/documentation/latest/

    2. What are the shortcomings of parquet and how is it solved by file formats like hudi, delta, iceberg? https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/

  9. What is a CDC and why do you need it, or how do you use it? - Change data capture (CDC) is the process of recognising when data has been changed in a source system so a downstream process or system can action. A common use case is to reflect (replication) the change in a different target system so that the data in the systems stay in sync.

  10. Outage handling and the differences between stream-based processing vs concurrent isolated worker-based processing using

By nielsen Ilai Malka

References:

  1. Cap theorem, 2 (which is great), 3 (isn't complete)

Last updated