Data Engineering Questions & Training
General
What are the considerations when choosing a method of ingesting data into BigQuery?
Have you worked with data science teams? What were your responsibilities?
What are the considerations when choosing Spark vs. BigQuery?
By Mark Smallcombe
CAP Theorem
How does it affect real-world applications? (In practice, high latency is indistinguishable from unavailability.)
Explain the difference and the reasons to choose a NoSQL database {MongoDB | DynamoDB | ...} over a relational database {Postgres | MySQL} and vice versa. Give an example of a project where you had to make this choice, and walk through your reasoning.
(This question can be adapted to the relevant technologies.)
Streaming vs Batch “Explain the difference and the reason to choose using Streaming over Batch and vice versa. Give an example for a project where you had to make this choice, and walk through your reasoning.”
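A toy sketch of the core trade-off in the streaming-vs-batch question above: the same daily revenue total computed by rerunning a batch job over all events vs. updating running state per event. The event shape and names are made up for illustration, not tied to any framework.

```python
def batch_total(events):
    """Batch: recompute over the full dataset on a schedule.
    Simple and easy to rerun, but freshness is bounded by the interval."""
    return sum(e["amount"] for e in events)

class StreamingTotal:
    """Streaming: maintain running state and update per event.
    Low latency, but you now own state management, ordering,
    and exactly-once concerns."""
    def __init__(self):
        self.total = 0

    def on_event(self, event):
        self.total += event["amount"]
        return self.total

events = [{"amount": 10}, {"amount": 5}, {"amount": 7}]
stream = StreamingTotal()
for e in events:
    stream.on_event(e)

# Both arrive at the same answer; they differ in when the answer is available
# and in how failures are recovered (rerun the batch vs. replay the stream).
assert batch_total(events) == stream.total == 22
```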
Job vs Service “Explain the difference and the reason to choose using Job over Service and vice versa. Give an example for a project where you had to make this choice, in the context of ML pipelines and walk through your reasoning.”
Athena
What is the engine behind Athena?
How is Presto different from Spark? How does it affect your query planning?
Performance tuning - Top 10: partitioning, bucketing, compression, optimize file sizes, optimize columnar data store generation, query tuning, optimize order by, optimize group by, use approx functions, column selection. What are the tradeoffs (time vs cost)?
What is the cost composed of?
How can you calculate cost?
How can you optimize your queries (partitions, join order, limit tricks, etc.)?
What options do you have to limit the cost of Athena?
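A back-of-the-envelope model for the cost questions above. Athena bills by data scanned; the $5-per-TB figure and the 10 MB per-query minimum below are the commonly cited on-demand numbers (an assumption here, so verify current pricing for your region). The point of the sketch is that partition pruning reduces cost proportionally to bytes skipped.

```python
PRICE_PER_TB = 5.0        # assumed on-demand rate; check current pricing
TB = 1024 ** 4

def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    # Athena rounds very small scans up to a ~10 MB minimum per query.
    billed = max(bytes_scanned, 10 * 1024 ** 2)
    return billed / TB * price_per_tb

# A full scan of a 1 TB table costs ~$5; a partition filter that cuts the
# scan to 50 GB cuts the cost proportionally (~$0.24).
full_scan = athena_query_cost(1 * TB)
pruned = athena_query_cost(50 * 1024 ** 3)
```

The same reasoning motivates columnar formats and compression from the tuning list above: fewer bytes scanned means a cheaper and usually faster query.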
When would you use Athena vs. Spark?
Spark
Several Spark articles by Sivaprasad Mandapati that can be used as candidate questions.
Join strategies #1, Join strategies #2 - how? Pros and cons. (broadcast hash, shuffle hash, shuffle sort merge, cartesian).
What’s the difference between a data frame and a dataset?
A broadcast join can be about 4 times faster when one of the tables is small enough to fit in memory.
Is broadcasting always a good solution? Absolutely not. If you are joining two very large datasets, broadcasting either table would overwhelm your Spark cluster and fail the job.
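A pure-Python sketch of what a broadcast hash join does conceptually: the small side becomes an in-memory hash map (in Spark, shipped to every executor), and the large side is streamed against it with no shuffle. Table and column names are made up for illustration.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    # Build phase: hash only the small side. This is why the small table
    # must fit in memory -- every executor holds a full copy.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: a single pass over the large side, no shuffle of its rows.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}]
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
result = broadcast_hash_join(orders, users, "user_id")
```

In PySpark itself the hint is `large_df.join(broadcast(small_df), "user_id")` with `broadcast` from `pyspark.sql.functions`; Spark also broadcasts automatically below the `spark.sql.autoBroadcastJoinThreshold` size.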
Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan.
Dynamically coalescing shuffle partitions
Dynamically switching join strategies
Dynamically optimizing skew joins
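The three AQE behaviors listed above correspond to Spark SQL configuration flags (Spark 3.x). A config sketch, assuming an existing `SparkSession` named `spark`:

```python
# Enable Adaptive Query Execution as a whole (on by default in recent Spark).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce small shuffle partitions at runtime based on actual output sizes.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed partitions so one hot key does not stall the whole join.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

AQE can also demote a planned sort-merge join to a broadcast join at runtime when post-shuffle statistics show one side is small enough.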
BigQuery
What is the difference in the implementation between partitions and clustering in BQ?
What ways do you know to reduce query cost in BigQuery?
What is the BigQuery cost composed of? How can you reduce storage cost?
Did you ever encounter a memory error when running BigQuery? Why does it happen, and how is it related to the Dremel implementation?
How can you control the access to sensitive data in BigQuery?
What options do you have to limit the cost of BigQuery?
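A toy model for the partition and cost questions above: BigQuery on-demand pricing bills by bytes scanned, so a filter on the partitioning column is the main cost lever. Partition names and sizes here are invented for illustration.

```python
# One date partition per key; values are bytes stored in that partition.
partitions = {
    "2024-01-01": 200 * 1024 ** 2,
    "2024-01-02": 180 * 1024 ** 2,
    "2024-01-03": 220 * 1024 ** 2,
}

def bytes_scanned(partitions, date_filter=None):
    """Without a filter on the partitioning column, every partition is read."""
    selected = partitions if date_filter is None else {
        d: b for d, b in partitions.items() if d in date_filter
    }
    return sum(selected.values())

full = bytes_scanned(partitions)                  # reads all partitions
pruned = bytes_scanned(partitions, {"2024-01-03"})  # reads one partition
assert pruned < full
```

As a guardrail, BigQuery also lets you cap a query's spend via a maximum-bytes-billed setting (the job fails instead of scanning more), and clustering further reduces scanned bytes within a partition.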
When using BigQuery ML to train TF models - what happens in the background?
Airflow
What is airflow?
How do you transfer information between tasks in Airflow?
Please give a real-world example of using Spark and Airflow together.
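A minimal stand-in for the Airflow answer to the task-communication question: tasks exchange small values through a keyed store (XCom) rather than calling each other directly. Real Airflow uses `ti.xcom_push`/`ti.xcom_pull` backed by its metadata database, or plain return values with TaskFlow `@task` functions; the dict below just mimics that contract.

```python
# (task_id, key) -> value; Airflow persists this in its metadata DB,
# which is why XCom is meant for small values, not datasets.
xcom_store = {}

def extract():
    # Equivalent of ti.xcom_push(key="row_count", value=42),
    # or simply `return 42` in a TaskFlow @task function.
    xcom_store[("extract", "row_count")] = 42

def report():
    # Equivalent of ti.xcom_pull(task_ids="extract", key="row_count").
    count = xcom_store[("extract", "row_count")]
    return f"extracted {count} rows"

extract()
message = report()
```

Large artifacts (e.g. a Spark job's output) should be passed by reference instead: write to object storage and push only the path through XCom.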
Data Validation
How can you protect yourself from bad data? Data validation, TDDA, monitoring.
Tools:
Type validation: typeguard
Data validation: pydantic
Test driven: tdda
Data quality: great expectations
SaaS: Superconductive (the company behind Great Expectations)
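A framework-agnostic sketch of the idea behind the tools above: declare expectations about incoming records and fail fast (or alert) before bad data propagates downstream. Field names and rules are illustrative only.

```python
def validate_row(row):
    """Return a list of violated expectations for one record."""
    errors = []
    if not isinstance(row.get("user_id"), int):
        errors.append("user_id must be an int")
    if not (0 <= row.get("age", -1) <= 120):
        errors.append("age out of range [0, 120]")
    if row.get("email") and "@" not in row["email"]:
        errors.append("email looks malformed")
    return errors

good = validate_row({"user_id": 1, "age": 30, "email": "a@b.com"})   # no errors
bad = validate_row({"user_id": "1", "age": 130, "email": "nope"})    # 3 errors
```

The listed tools differ mainly in where this runs: typeguard and pydantic validate at code boundaries, tdda infers constraints from reference data, and great expectations runs suites against batches in the pipeline with reporting.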
File formats
Can you explain the parquet file format? https://parquet.apache.org/documentation/latest/
How is this leveraged by Spark? https://databricks.com/session/spark-parquet-in-depth
What are the shortcomings of Parquet, and how do table formats like Hudi, Delta Lake, and Iceberg address them? https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
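A toy illustration of why a columnar format like Parquet helps analytical engines: values are stored column-by-column inside row groups, so a query touching two columns reads only those column chunks, and per-column min/max statistics let the reader skip whole row groups (predicate pushdown). The data below is invented for illustration.

```python
# One "row group" with per-column chunks, as Parquet lays data out on disk.
row_group = {
    "user_id": [1, 2, 3, 4],
    "amount": [10, 5, 7, 30],
    "country": ["US", "DE", "US", "FR"],
}

# Parquet stores min/max statistics per column chunk in the footer.
stats = {col: (min(v), max(v))
         for col, v in row_group.items() if col != "country"}

def can_skip(col, predicate_min):
    """Skip the whole row group if its max can't satisfy `col > predicate_min`."""
    return stats[col][1] < predicate_min

# For `WHERE amount > 100`: max(amount) is 30, so this group is never read.
skip = can_skip("amount", 101)
```

Spark exploits exactly these properties (column projection and footer statistics); the table formats in the question add what plain Parquet files lack, such as ACID updates, deletes, and snapshot isolation over many files.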
Julien Simon on AWS Glue DataBrew vs. Data Wrangler
What is CDC, why do you need it, and how do you use it? Change data capture (CDC) is the process of recognising when data has changed in a source system so that a downstream process or system can act on it. A common use case is replication: reflecting the change in a different target system so that the data in both systems stays in sync.
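A minimal snapshot-diff sketch of the CDC idea described above: compare the current source state with the last seen state and emit insert/update/delete events for a downstream consumer to replay. Real CDC tools (Debezium, for example) instead read the database's write-ahead or binary log, which avoids full snapshots and captures intermediate changes; this diff is only a conceptual model.

```python
def capture_changes(previous, current):
    """Both arguments are {primary_key: row} snapshots of a table."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("insert", key, row))
        elif previous[key] != row:
            events.append(("update", key, row))
    for key in previous:
        if key not in current:
            events.append(("delete", key, None))
    return events

before = {1: {"name": "a"}, 2: {"name": "b"}}
after = {1: {"name": "a2"}, 3: {"name": "c"}}
changes = capture_changes(before, after)
# -> update for key 1, insert for key 3, delete for key 2
```

Applying these events in order on the target keeps it in sync with the source, which is the replication use case from the definition above.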
Outage handling, and the differences between stream-based processing and concurrent isolated worker-based processing.
By Ilai Malka (Nielsen)
How would you design and implement an API rate limiter?
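One common answer to this design question is a token bucket per client: tokens refill at a fixed rate up to a burst capacity, and each request spends a token or is rejected. The single-process sketch below illustrates the algorithm; a production limiter would keep this state in a shared store such as Redis, typically updated atomically (e.g. via a Lua script).

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full
        self.last = now

    def allow(self, now):
        """Refill based on elapsed time, then try to spend one token.
        `now` is passed in to keep the sketch deterministic; real code
        would use a monotonic clock."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=2, refill_rate=1.0)
burst = [bucket.allow(0.0) for _ in range(3)]  # third call exhausts the bucket
later = bucket.allow(1.0)                      # one second later, one token back
```

Follow-ups an interviewer may probe: sliding-window vs. token-bucket trade-offs, per-user vs. per-IP keys, and what the limiter returns on rejection (HTTP 429 plus a Retry-After hint).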
References:
CAP theorem, 2 (which is great), 3 (incomplete)
Why do we need Data engineering? (podcast)
Q: You have a real-time stream. In the context of recovery from an outage, which is better: a stream-based processing system, or a worker-based system that can be triggered over different time ranges?