Comment on page
Data Engineering Questions & Training
- 1.General
- 9.What are the considerations when choosing methods of ingesting data to bigquery?
- 10.Have you worked with data science teams? What were your responsibilities?
- 11.What are the considerations of choosing spark vs bigquery?
By mark smallcombe
- 1.CAP Theorem
- 7.how it effect real world application (latency is availability in real world)
- 9.
- 10.
- 11.
- 12.
- 13.
- 2.Explain the difference and the reason to choose using NoSQL {mongoDB | DynamoDB | .. } over Relational database {Postgress |MySQL} and vice versa. Give an example for a project where you had to make this choice, and walk through your reasoning.
(This question can be modified for the relevant technologies.. )
- 1.Streaming vs Batch “Explain the difference and the reason to choose using Streaming over Batch and vice versa. Give an example for a project where you had to make this choice, and walk through your reasoning.”
- 2.Job vs Service “Explain the difference and the reason to choose using Job over Service and vice versa. Give an example for a project where you had to make this choice, in the context of ML pipelines and walk through your reasoning.”
- 3.Athena
- 1.What is the engine behind athena
- 2.How is presto different from Spark? How does it affect your query planning?
- 3.Performance tuning - Top 10: partitioning, bucketing, compression, optimize file sizes, optimize columnar data store generation, query tuning, optimize order by, optimize group by, use approx functions, column selection. What are the tradeoffs (time vs cost)?
- 4.What is the cost composed of?
- 5.How can you calculate cost?
- 6.How can you optimize your queries (partitions, join order, limit tricks, etc)
- 7.What options do you have to limit the cost of athena?
- 8.when would u use athena vs spark
- 4.Spark -
- 1.
- 2.Join strategies #1, Join strategies #2 - how? Pros and cons. (broadcast hash, shuffle hash, shuffle sort merge, cartesian).
- 3.What’s the difference between a data frame and a dataset?
- 4.
- 1.broadcast join is 4 times faster if one of the table is small and enough to fit in memory
- 2.Is broadcasting always a good solution ? Absolutely no. If you are joining two data sets both are very large broad casting any table would kill your spark cluster and fails your job.
- 5.
- 1.Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan.
- 2.Dynamically coalescing shuffle partitions
- 3.Dynamically switching join strategies
- 4.Dynamically optimizing skew joins
- 5.BigQuery
- 1.What is the difference in the implementation between partitions and clustering in BQ?
- 2.What ways do you know to reduce query cost in BigQuery?
- 3.What is the BigQuery cost composed of? How can you reduce storage cost?
- 4.Did you ever encounter a memory error when running BigQuery? Why does it happen and how is it related to the Dremel implementations
- 5.How can you control the access to sensitive data in BigQuery?
- 6.What options do you have to limit the cost of BigQuery?
- 7.When using BigQuery ML to train TF models - what happens in the background?
- 6.Airflow
- 1.What is airflow?
- 2.How do you transfer information between tasks in airflow?
- 3.Please give me a real-world example of using spark and airflow together
- 7.Data Validation
- 1.How can you protect yourself from bad data? Data validation, TDDA, monitoring.
- 2.Tools:
- 1.
- 2.
- 3.
- 4.
- 5.
- 8.File formats
- 1.
- 2.
- 3.What are the shortcomings of parquet and how is it solved by file formats like hudi, delta, iceberg? https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
- 9.
- 10.What is a CDC and why do you need it, or how do you use it? - Change data capture (CDC) is the process of recognising when data has been changed in a source system so a downstream process or system can action. A common use case is to reflect (replication) the change in a different target system so that the data in the systems stay in sync.
- 11.Outage handling and the differences between stream-based processing vs concurrent isolated worker-based processing using
Q: you have a real time stream - what is better? A stream-based processing system, or a worker-based, that can be triggered on different time ranges, in the context of recovery from outage.
By nielsen Ilai Malka
- 1.
- How would you design and implement an API rate limiter?
- 1.
References:
- 1.
- 3.
- 5.
Last modified 1yr ago