
Top 700+ Data Engineering Interview Questions & Answers (Azure, ADF, Databricks, SQL, Python, BigQuery)

In today’s fast-growing IT industry, Data Engineering has become one of the most in-demand and high-paying career paths. With companies increasingly relying on data-driven decision-making, the need for skilled professionals in technologies like Azure Data Factory (ADF), Azure Services, Databricks, SQL, Python, BigQuery, and Dataflow is growing rapidly.

However, cracking a Data Engineer interview is not easy. Top MNCs such as Deloitte, Accenture, TCS, Infosys, and Capgemini, along with many other global organizations, focus on real-time, scenario-based interview questions rather than purely theoretical knowledge. Candidates are expected to have strong hands-on experience, practical understanding, and the ability to solve real-world data problems.

That’s why at MyLearnNest Training Academy, we have gone a step further to help aspiring data engineers succeed in their careers. After extensive research and analysis of top MNC interview patterns, we have compiled a powerful collection of 700+ Data Engineering Interview Questions and Answers covering the latest technologies used in the industry.

🎯 How We Collected These Questions

Our team has carefully researched and analyzed interview experiences from candidates who attended interviews at leading companies. We focused on identifying:

  • Frequently asked interview questions
  • Real-time scenario-based questions
  • Practical use cases from live projects
  • Questions based on tools like ADF, Databricks, SQL, Python, BigQuery, and Dataflow

We didn’t just collect random questions — we gathered industry-relevant, job-oriented questions that are actually being asked in real interviews. This ensures that learners are fully prepared for the current job market.

💡 What This 700+ Interview Questions Collection Covers

This comprehensive guide includes interview questions from all key areas of Data Engineering:

✔ Azure Data Factory (ADF) – Pipelines, Triggers, Integration Runtime
✔ Azure Services – Storage, Data Lake, Synapse
✔ Databricks – PySpark, Optimization, Delta Lake
✔ SQL – Complex Queries, Joins, Performance Tuning
✔ Python – Data Processing, Scripting, Automation
✔ BigQuery – Partitioning, Clustering, Cost Optimization
✔ Dataflow – Streaming & Batch Processing

Each question is designed to help you understand real-world scenarios, making it easier to crack interviews confidently.

🔥 Why These Questions Are Important

Many candidates fail interviews not because they lack knowledge, but because they are not prepared for how questions are asked in real interviews.

👉 MNCs focus on:

  • Practical problem-solving
  • Real-time project experience
  • Optimization techniques
  • Scenario-based thinking

This 700+ question collection helps you bridge that gap by preparing you for actual interview expectations.

🏆 How MyLearnNest Helps You Succeed

At MyLearnNest Training Academy (Hyderabad), we don’t just provide content — we provide complete career transformation support.

We are committed to helping students move from learning to earning with our job-oriented training programs.

🎓 Our Training Approach

✔ Real-time project-based training
✔ Hands-on practical sessions
✔ Industry-level curriculum
✔ Expert trainers with 10+ years of experience


💼 Placement Support

✔ Resume preparation
✔ Mock interviews based on real MNC questions
✔ Interview guidance and mentorship
✔ 100% placement assistance

🚀 Job-Oriented Training

Our courses are designed to make you job-ready from day one.
We focus on:

  • Real-time use cases
  • End-to-end data pipeline development
  • Practical implementation

This ensures that you are not just learning concepts, but actually gaining the skills required to succeed in interviews and on the job.

🌍 Why Choose MyLearnNest Training Academy?

If you are searching for the best Data Engineering training institute in Hyderabad (Ameerpet), MyLearnNest is the right place.

👉 We provide:

  • Snowflake + DBT + Python + SQL training
  • Azure Data Engineer training
  • Real-time project experience
  • Career guidance and mentorship

Our goal is simple — help students get high-paying IT jobs in top companies.

📊 Who Can Use This 700+ Interview Questions Guide?

This guide is useful for:

✔ Freshers who want to start a career in Data Engineering
✔ Working professionals looking to switch to Data Engineering
✔ Developers who want to upgrade to cloud/data roles
✔ Anyone preparing for Azure / Big Data interviews

🔥 Free Resource for Students

At MyLearnNest, we believe in providing value to our students.
That’s why we are sharing these 700+ Data Engineering Interview Questions and Answers for FREE to help you prepare better and succeed in your career.

This is not just a question bank — it’s a complete interview preparation guide designed based on real industry expectations.

Azure Data Factory Interview Questions & Answers

🔷 BASICS (1–10)

1. What is Azure Data Factory? ADF is a cloud-based ETL/ELT data integration service by Microsoft Azure used to ingest, transform, and orchestrate data across various sources and destinations.

2. Main components of ADF? Pipelines, Activities, Datasets, Linked Services, Integration Runtimes, Triggers, and Data Flows.

3. What is a pipeline in ADF? A logical grouping of activities that together perform a task. It defines the workflow for data movement and transformation.

4. What is an activity in ADF? A single step in a pipeline (e.g., Copy, Lookup, ForEach). Activities can be data movement, transformation, or control flow activities.

5. What is a dataset in ADF? A named view of data that points to the data you want to use as input or output in an activity (e.g., a table, file, folder).

6. What is a linked service? A connection string definition that tells ADF how to connect to an external data source or compute resource (like a connection object).

7. Difference between dataset and linked service? Linked service = connection to the source/sink. Dataset = structure/location of the data within that connection.

8. What are Integration Runtimes? The compute infrastructure ADF uses to perform data integration activities across different network environments.

9. Types of Integration Runtime? Azure IR, Self-hosted IR, and Azure-SSIS IR.

10. What is Azure Integration Runtime? A fully managed, serverless IR hosted in Azure for cloud-to-cloud data movement and transformation.


🔷 INTEGRATION RUNTIME & LINKED SERVICES (11–16)

11. What is Self-hosted Integration Runtime? An IR installed on an on-premises machine or private network VM to connect to on-premises or private cloud data sources.

12. What is Azure SSIS Integration Runtime? A dedicated cluster in Azure to lift-and-shift and run SSIS packages natively in ADF.

13. What is Copy Activity? The primary activity in ADF to copy data from source to sink. Supports 90+ connectors.

14. Supported data sources in ADF? Azure Blob, ADLS, SQL DB, Synapse, Cosmos DB, Oracle, SAP, Salesforce, REST APIs, SFTP, Amazon S3, and many more (90+ connectors in total).

15. What is schema in dataset? Defines the column structure (name, type) of the data. Can be auto-detected or manually defined.

16. What is a pipeline trigger? A mechanism to execute a pipeline — either on a schedule, event, or window basis.


🔷 TRIGGERS (17–21)

17. Types of triggers in ADF? Schedule trigger, Tumbling Window trigger, Storage Event trigger, and Custom Event trigger.

18. What is schedule trigger? Fires pipelines at a defined wall-clock time (e.g., daily at 8 AM). Supports recurrence.

19. What is tumbling window trigger? Fires at fixed-size, non-overlapping time intervals. Supports dependency, retry, and backfill.

20. What is event-based trigger? Fires when a blob is created or deleted in Azure Blob Storage or based on custom events via Event Grid.

21. What is debug mode in ADF? Allows you to test pipelines interactively without publishing. Uses a live cluster and shows results in real time.


🔷 AUTHORING & MONITORING (22–30)

22. What is publish in ADF? Saves and deploys all authored changes to the ADF service (live mode) or commits to a Git branch.

23. Difference between debug and trigger execution? Debug = interactive test, doesn’t use triggers, uses debug settings. Trigger execution = actual run via schedule/event/manual trigger.

24. What is monitoring in ADF? The Monitor tab shows pipeline runs, activity runs, trigger runs with status, duration, errors, and rerun options.

25. What is pipeline run ID? A unique GUID assigned to every pipeline execution for tracking and debugging.

26. What is parameterization in ADF? Making pipelines/datasets/linked services dynamic by defining parameters that accept values at runtime.
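
As a quick illustration, here is a minimal sketch (assuming the azure-identity and azure-mgmt-datafactory Python SDKs; the resource group, factory, pipeline, and parameter names are hypothetical) of passing parameter values into a pipeline at runtime:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Trigger a pipeline run and supply runtime values for its parameters
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-adf",
    pipeline_name="CopySalesPipeline",
    parameters={"inputPath": "raw/sales/2024/01/", "tableName": "dbo.Sales"},
)
print(run.run_id)  # track this GUID in the Monitor tab
```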

27. Difference between parameters and variables? Parameters are passed in at runtime (read-only inside pipeline). Variables are mutable values set and updated within a pipeline run.

28. What are system variables? Built-in ADF variables like @pipeline().RunId, @pipeline().DataFactory, @pipeline().TriggerTime, @pipeline().TriggerName.

29. What is dynamic content? An expression-based value in ADF fields using @ syntax to reference parameters, variables, functions, and activity outputs.

30. What is expression language in ADF? ADF uses an expression language based on Azure Logic Apps, with functions for strings, math, dates, arrays, etc. Expressions start with @ and can be embedded in strings as @{ ... }.


🔷 EXPRESSIONS & FUNCTIONS (31–43)

31. What is @pipeline() function? Refers to pipeline-level metadata — e.g., @pipeline().RunId, @pipeline().parameters.myParam.

32. What is @activity() function? References the output of a previous activity — e.g., @activity('LookupActivity').output.firstRow.columnName.

33. What is ForEach activity? Iterates over an array and executes inner activities for each item. Supports sequential and parallel execution.

34. What is Until activity? A looping activity that repeats inner activities until a specified condition becomes true.

35. What is If Condition activity? Evaluates a boolean expression and executes either a true or false branch of activities.

36. What is Switch activity? Similar to If Condition but evaluates a value expression against multiple cases and executes the matching case branch.

37. What is Lookup activity? Reads data from a dataset and returns the result (firstRow or allRows) for use in downstream activities.

38. What is Get Metadata activity? Retrieves metadata about a dataset — e.g., file exists, column list, last modified time, item count.

39. What is Web activity? Calls an external HTTP/REST endpoint from within a pipeline. Used to trigger APIs, Azure Functions, or Logic Apps.

40. What is Set Variable activity? Sets the value of a pipeline variable at runtime.

41. What is Append Variable activity? Appends a value to an existing array-type pipeline variable.

42. What is Wait activity? Pauses pipeline execution for a specified number of seconds.

43. What is Execute Pipeline activity? Invokes a child pipeline from a parent pipeline. Can run synchronously or asynchronously.


🔷 COPY ACTIVITY ADVANCED (44–48)

44. What is Copy Activity performance tuning? Adjusting DIUs, parallel copies, partitioning, staging, and network bandwidth to maximize throughput.

45. What is data partitioning in Copy Activity? Splits source data into partitions read in parallel — physical partitions, dynamic range, or column-based.

46. What is fault tolerance in Copy Activity? Skip incompatible rows or log errors to a file while continuing the copy instead of failing the whole activity.

47. What is staging in ADF? An intermediate Blob storage area used to stage data before loading into a sink like Synapse using PolyBase/COPY command.

48. What is PolyBase in ADF? A high-performance bulk load mechanism for Azure Synapse Analytics, used via staging to load large data volumes efficiently.


🔷 DATA FLOWS (49–60)

49. What is Data Flow in ADF? A visually designed, code-free data transformation feature that runs on Spark clusters.

50. Difference between Copy Activity and Data Flow? Copy Activity = data movement (no transformation). Data Flow = complex transformations using Spark; slower to spin up but powerful.

51. What is Mapping Data Flow? A visual ETL tool in ADF for building data transformation logic that compiles to Spark and runs at scale.

52. What is Wrangling Data Flow? A Power Query-based data preparation tool for self-service, code-free data wrangling (being deprecated in favor of Mapping Data Flow).

53. What is schema drift? The ability of a data flow to handle unexpected or changing column structures in source data without failing, when the "Allow schema drift" option is enabled.

54. What is projection in Data Flow? Defines the schema (columns and types) of a source or derived dataset within a data flow.

55. What is sink transformation? The final step in a data flow that writes output to a target dataset (e.g., ADLS, SQL, Synapse).

56. What is source transformation? The starting point of a data flow that reads data from a dataset or inline data source.

57. What is derived column transformation? Creates new columns or updates existing ones using expressions — similar to SQL computed columns.

58. What is filter transformation? Filters rows based on a boolean expression — like a SQL WHERE clause.

59. What is join transformation? Joins two data streams based on a condition — supports inner, left outer, right outer, full outer, and cross joins.

60. What is aggregate transformation? Groups data and computes aggregates like SUM, COUNT, AVG, MAX, MIN — like SQL GROUP BY.


🔷 INCREMENTAL LOAD & CDC (61–65)

61. How do you handle incremental data load? Use a watermark column (e.g., ModifiedDate), store last loaded value, and filter source data greater than watermark on each run.

62. What is watermarking in ADF? Tracking the last successfully loaded timestamp or ID to fetch only new/changed records in subsequent runs.

63. How do you implement CDC in ADF? Use SQL Server Change Tracking or Change Data Capture, or use source system CDC APIs combined with ADF pipelines to capture inserts, updates, deletes.

64. How do you handle late arriving data? Use tumbling window triggers with delay settings, or design pipelines to reprocess a lookback window of data periodically.

65. How do you design metadata-driven pipelines? Store pipeline configuration (source, sink, table names, load type) in a control table (Azure SQL), then loop over it using ForEach to drive dynamic pipelines.


🔷 DESIGN PATTERNS (66–77)

66. What is parameterized pipeline design? A single pipeline handles multiple sources/sinks by accepting parameters at runtime, reducing code duplication.

67. How do you pass parameters between pipelines? Use Execute Pipeline activity and pass values in the Parameters section. Child pipeline receives via @pipeline().parameters.

68. How do you handle failures in ADF? Set failure dependencies, use Try-Catch-like patterns with If Condition, implement retry logic, and send alerts via Logic Apps or Azure Monitor.

69. What is retry logic in ADF? Each activity has a Retry count and Retry interval setting to automatically re-execute on transient failures.

70. What is dependency condition in ADF? Defines when a downstream activity runs: on Success, Failure, Completion, or Skipped of an upstream activity.

71. Success, Failure, Completion dependencies? Success = runs only if upstream succeeded. Failure = runs only if upstream failed. Completion = runs regardless of outcome.

72. How do you implement error logging? Use failure dependency path → Copy Activity or Web Activity to write error details to a SQL table, Blob, or call a logging API.

73. How do you monitor pipeline failures? Use the Monitor tab, set up Azure Monitor alerts, or use Log Analytics workspace with ADF diagnostic logs.

74. How do you send alerts from ADF? Configure Azure Monitor alert rules on ADF metrics (PipelineFailedRuns) with action groups (email, SMS, webhook).

75. How do you integrate ADF with Azure Monitor? Enable Diagnostic Settings in ADF to send logs and metrics to Log Analytics, Event Hub, or Storage Account.

76. What is concurrency in ADF pipelines? The number of simultaneous pipeline runs or ForEach iterations executing in parallel.

77. What is pipeline concurrency control? Set the Max Concurrent Runs on a pipeline to limit how many instances run simultaneously, preventing resource contention.


🔷 PERFORMANCE & SECURITY (78–85)

78. How do you optimize performance in ADF? Tune DIUs, use parallel copy, partition source data, use staging for Synapse, optimize Data Flow cluster size, and avoid unnecessary transformations.

79. What are DIUs in Copy Activity? Data Integration Units — a measure of compute power (CPU + memory + network) allocated to a Copy Activity. Range: 2–256.

80. What is parallel copy in ADF? The number of concurrent threads used to read from source or write to sink simultaneously in Copy Activity.

81. How do you secure credentials in ADF? Store secrets in Azure Key Vault and reference them in Linked Services instead of hardcoding credentials.

82. What is Azure Key Vault integration in ADF? ADF fetches secrets (passwords, connection strings) from Key Vault at runtime using a Key Vault Linked Service.

83. What is managed identity in ADF? An Azure AD identity automatically created for ADF that can authenticate to other Azure services without storing credentials.

84. Difference between managed identity and service principal? Managed identity is auto-managed by Azure (no secret rotation needed). Service principal is manually registered in AAD and requires managing secrets/certificates.

85. How do you handle sensitive data in ADF? Use Key Vault for secrets, enable managed identity, apply column-level masking, restrict access via IAM roles, and avoid logging sensitive values.


🔷 CI/CD & DEVOPS (86–91)

86. What is Git integration in ADF? Connects ADF to Azure DevOps Git or GitHub for source control, branching, collaboration, and code review on pipeline definitions.

87. What is CI/CD in ADF? Automating the build and release of ADF pipelines across environments (Dev → Test → Prod) using Azure DevOps pipelines and ARM templates.

88. How do you deploy ADF pipelines using DevOps? Use the ADF publish step to generate ARM templates from the adf_publish branch, then deploy them via Azure DevOps release pipelines to target environments.

89. What is ARM template in ADF? An Azure Resource Manager JSON template auto-generated by ADF that describes all pipeline resources for deployment.

90. What is global parameter in ADF? Parameters defined at the ADF instance level, available across all pipelines — useful for environment-specific values like base URLs or environment names.

91. What is version control in ADF? Using Git integration to track changes, manage branches, review pull requests, and roll back pipeline definitions.


🔷 ADVANCED DESIGN & BEST PRACTICES (92–100)

92. How do you handle large-scale data ingestion? Use partitioned reads, DIU scaling, parallel copy, staging, and incremental loads. Split large files and use ForEach for parallel processing.

93. How do you orchestrate pipelines across environments? Use global parameters for environment configs, deploy via CI/CD, and use Execute Pipeline for modular orchestration.

94. How do you schedule dependent pipelines? Use Execute Pipeline (synchronous) for sequential dependency, or tumbling window trigger dependencies for time-based chaining.

95. How do you build reusable pipeline frameworks? Use parameterized generic pipelines driven by a metadata/control table, with child pipelines for specific tasks called via Execute Pipeline.

96. Best practices for ADF pipeline design? Use parameterization, modular pipelines, metadata-driven design, Key Vault for secrets, Git integration, error handling on all activities, and meaningful naming conventions.

97. How do you optimize Data Flow performance? Use appropriate cluster size (compute-optimized), partition data properly, avoid unnecessary transformations, use cache sinks, and minimize row-by-row operations.

98. How do you handle schema evolution in ADF? Enable schema drift in Data Flows, use dynamic mapping in Copy Activity, and design pipelines to auto-detect and adapt to schema changes.

99. What is cost optimization in ADF? Minimize DIUs, use event triggers instead of frequent schedules, shut down SSIS IR when idle, reduce Data Flow cluster uptime, and use serverless SQL pools where possible.

100. How do you design an end-to-end ETL pipeline using ADF? Ingest raw data from source (Copy Activity) → land in ADLS raw zone → transform using Mapping Data Flow or Databricks → write to curated zone → load into Synapse/SQL → trigger downstream reporting. Use metadata-driven design, parameterization, error handling, monitoring, and CI/CD throughout.

Azure Fundamentals Interview Questions & Answers


🔷 AZURE FUNDAMENTALS (1–14)

1. What is Microsoft Azure? Azure is Microsoft’s cloud computing platform offering 200+ services including compute, storage, networking, databases, AI, and DevOps across a global infrastructure.

2. Core services offered by Azure? Compute (VMs, App Service, Functions), Storage (Blob, SQL, Cosmos DB), Networking (VNet, Load Balancer), Security (AAD, Key Vault), DevOps, AI/ML, and Analytics.

3. What is a Resource Group? A logical container that holds related Azure resources (VMs, DBs, storage) for a solution. Used for lifecycle management, access control, and billing.

4. What is a subscription in Azure? A billing and access boundary in Azure. All resources are created under a subscription, which is tied to an Azure account and a payment method.

5. What is Azure Portal? A web-based GUI at portal.azure.com for creating, managing, and monitoring Azure resources without code.

6. What is Azure CLI? A cross-platform command-line tool (az commands) to manage Azure resources programmatically from terminal/scripts.

7. What is Azure PowerShell? A set of cmdlets for managing Azure resources via PowerShell scripting — preferred in Windows-heavy environments.

8. What is Azure Resource Manager (ARM)? The deployment and management layer for Azure. All operations (portal, CLI, SDK) go through ARM, which handles authentication, authorization, and resource provisioning.

9. What is a region in Azure? A geographic area containing one or more datacenters (e.g., East US, West Europe). Resources are deployed to specific regions for latency and compliance.

10. What is an availability zone? Physically separate datacenters within a region with independent power, cooling, and networking. Protects against datacenter-level failures.

11. What is high availability in Azure? Designing systems to minimize downtime using availability zones, redundant VMs, load balancers, and auto-failover mechanisms.

12. What is scalability in Azure? The ability to increase (scale up/out) or decrease (scale down/in) resources to handle changing workloads.

13. What is elasticity in cloud computing? Automatically provisioning and de-provisioning resources in real time based on demand — paying only for what you use.

14. What is IaaS, PaaS, and SaaS? IaaS = you manage OS/apps (Azure VMs). PaaS = you manage apps/data only (App Service, ADF). SaaS = fully managed software (Microsoft 365, Dynamics).


🔷 COMPUTE & APP SERVICES (15–16)

15. What is Azure Virtual Machine? IaaS offering that provides on-demand, scalable compute with full OS control. Supports Windows and Linux. You manage OS, patches, and runtime.

16. What is Azure App Service? A PaaS platform for hosting web apps, REST APIs, and mobile backends. Supports .NET, Java, Python, Node.js with built-in scaling, CI/CD, and SSL.


🔷 STORAGE (17–22)

17. What is Azure Storage Account? A top-level namespace for Azure Storage services — Blob, Table, Queue, and File. Provides a unique endpoint and access keys.

18. Types of storage in Azure? Blob Storage, Table Storage, Queue Storage, File Storage, Disk Storage (managed disks for VMs), and Data Lake Storage Gen2.

19. What is Azure Blob Storage? Object storage for unstructured data (images, videos, logs, backups). Supports Hot, Cool, Cold, and Archive access tiers.
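
A minimal Python sketch of working with Blob Storage (assuming the azure-storage-blob SDK; the account, container, and blob names are hypothetical):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://mystorageacct.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("raw-logs")

# Upload a local file as a block blob (overwriting any existing blob with that name)
with open("app.log", "rb") as f:
    container.upload_blob(name="2024/01/app.log", data=f, overwrite=True)
```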

20. What is Azure Table Storage? NoSQL key-value store for structured, non-relational data. Ideal for storing large amounts of semi-structured data cheaply.

21. What is Azure Queue Storage? A message queue service for asynchronous communication between application components. Stores millions of messages accessible via HTTP/HTTPS.

22. What is Azure File Storage? Managed file shares in the cloud accessible via SMB and NFS protocols. Can be mounted on Windows, Linux, or macOS like a network drive.


🔷 DATABASES (23–25)

23. What is Azure SQL Database? A fully managed relational PaaS database based on SQL Server. Handles patching, backups, HA automatically. Supports serverless and hyperscale tiers.

24. What is Azure Cosmos DB? A globally distributed, multi-model NoSQL database. Supports SQL, MongoDB, Cassandra, Gremlin, and Table APIs with single-digit millisecond latency.

25. What is Azure Data Factory? A cloud ETL/ELT orchestration service for data ingestion, transformation, and pipeline scheduling across 90+ connectors.


🔷 NETWORKING (26–35)

26. What is Azure Virtual Network (VNet)? A private, isolated network in Azure where you deploy resources. Enables secure communication between VMs, subnets, and on-premises networks.

27. What is a subnet in Azure? A segmented address range within a VNet used to organize and isolate resources. NSGs are typically applied at the subnet level.

28. What is Network Security Group (NSG)? A firewall with inbound/outbound rules that controls traffic flow to/from Azure resources based on IP, port, and protocol.

29. What is Azure Load Balancer? A Layer 4 (TCP/UDP) load balancer that distributes incoming traffic across multiple VMs for high availability and scalability.

30. What is Azure Application Gateway? A Layer 7 (HTTP/HTTPS) load balancer with WAF, SSL termination, URL-based routing, and session affinity for web applications.

31. What is Azure Traffic Manager? A DNS-based global traffic routing service that directs users to the best endpoint based on routing methods (performance, geographic, weighted, priority).

32. What is Azure Front Door? A global CDN + Layer 7 load balancer with WAF, SSL offload, caching, and intelligent routing for globally distributed web apps.

33. What is Azure DNS? A hosting service for DNS domains using Azure infrastructure. Provides fast, reliable DNS resolution integrated with other Azure services.

34. What is Azure VPN Gateway? Sends encrypted traffic between Azure VNets and on-premises networks over the public internet via IPsec/IKE VPN tunnels.

35. What is ExpressRoute? A private, dedicated, high-bandwidth connection between on-premises infrastructure and Azure — bypasses the public internet entirely.


🔷 IDENTITY & SECURITY (36–40)

36. What is Azure Active Directory (AAD)? Microsoft’s cloud-based identity and access management service. Provides SSO, MFA, B2B/B2C, and conditional access for apps and users.

37. What is RBAC in Azure? Role-Based Access Control assigns roles (Owner, Contributor, Reader, custom) to users/groups/service principals on specific Azure scopes (subscription, RG, resource).

38. Difference between RBAC and ACL? RBAC assigns roles to identities at resource scope (coarse-grained). ACL assigns permissions at file/object level (fine-grained, e.g., ADLS Gen2 file permissions).

39. What is Managed Identity? An automatically managed Azure AD identity for Azure services to authenticate to other services (e.g., ADF accessing Key Vault) without storing credentials.

40. What is Azure Key Vault? A secure store for secrets, encryption keys, and SSL certificates. Centralizes credential management with access control and audit logging.
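
For example, a short sketch of reading a secret at runtime (assuming the azure-keyvault-secrets SDK; the vault URL and secret name are hypothetical):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://my-keyvault.vault.azure.net",
    credential=DefaultAzureCredential(),  # can resolve to a managed identity on Azure compute
)
db_password = client.get_secret("sql-db-password").value  # never hardcode this in code or config
```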


🔷 MONITORING (41–43)

41. What is Azure Monitor? A full-stack monitoring platform for collecting, analyzing, and acting on telemetry (metrics, logs, traces) from Azure and on-premises resources.

42. What is Azure Log Analytics? A workspace within Azure Monitor for ingesting and querying logs using KQL (Kusto Query Language) from various Azure services.

43. What is Azure Alerts? Rules in Azure Monitor that trigger notifications or automated actions (email, SMS, runbook, webhook) when metrics or log conditions are met.


🔷 DEVOPS (44–48)

44. What is Azure DevOps? A suite of developer tools: Repos (Git), Pipelines (CI/CD), Boards (Agile), Test Plans, and Artifacts for end-to-end software delivery.

45. What is CI/CD in Azure? Continuous Integration (auto-build/test on code commit) + Continuous Delivery/Deployment (auto-deploy to environments) using Azure Pipelines.

46. What is Azure Repos? A Git or TFVC source control service in Azure DevOps for storing and versioning code with PR workflows.

47. What is Azure Pipelines? A CI/CD service that builds, tests, and deploys code to any platform/cloud. Supports YAML-based and classic pipelines.

48. What is Azure Artifacts? A package management service for hosting NuGet, npm, Maven, Python, and Universal packages with feed management and versioning.


🔷 SERVERLESS & INTEGRATION (49–54)

49. What is Azure Functions? A serverless compute service that runs event-triggered code without managing infrastructure. Supports HTTP, timer, queue, Blob, Cosmos DB triggers.

50. What is Azure Logic Apps? A low-code workflow automation service with 400+ connectors for integrating apps, data, and services without writing much code.

51. Difference between Azure Functions and Logic Apps? Functions = code-first serverless compute for custom logic. Logic Apps = low-code workflow orchestration for integration scenarios. Often used together.

52. What is Azure Event Hub? A big data streaming platform and event ingestion service capable of receiving millions of events per second. Used for telemetry, clickstream, IoT data.

53. What is Azure Service Bus? An enterprise-grade message broker supporting queues and topics/subscriptions for reliable, ordered, transactional async messaging between apps.

54. Difference between Event Hub and Service Bus? Event Hub = high-volume event streaming (telemetry, logs). Service Bus = reliable messaging with ordering, dead-lettering, transactions for app integration.


🔷 DATA & ANALYTICS (55–59)

55. What is Azure Data Lake Storage (ADLS)? A massively scalable, hierarchical, Hadoop-compatible storage built on Blob Storage. Gen2 supports ACLs, fine-grained security, and analytics workloads.

56. What is Azure Synapse Analytics? A unified analytics platform combining data warehousing (dedicated SQL pools), big data (Spark), data integration (ADF-like pipelines), and BI in one workspace.

57. What is Azure Databricks? A managed Apache Spark platform optimized for Azure. Used for big data processing, ML, and collaborative data engineering notebooks.

58. What is Azure HDInsight? A managed open-source analytics service running Hadoop, Spark, Hive, HBase, Kafka, and Storm clusters on Azure.

59. What is Azure Machine Learning? An end-to-end MLOps platform for building, training, deploying, and monitoring ML models with support for AutoML, pipelines, and model registry.


🔷 CONTAINERS (60, 78–79)

60. What is Azure Kubernetes Service (AKS)? A managed Kubernetes service that simplifies deploying, scaling, and operating containerized applications. Azure manages the control plane.

78. What is Azure Container Instances (ACI)? Serverless containers that run without managing VMs or orchestrators. Ideal for short-lived, burst, or simple containerized workloads.

79. Difference between AKS and ACI? AKS = full Kubernetes orchestration for complex, long-running microservices. ACI = simple, fast, serverless container runs without orchestration overhead.


🔷 HIGH AVAILABILITY & DR (61–66)

61. How do you design a highly available architecture? Deploy across availability zones, use load balancers, autoscaling, geo-redundant storage, active-active or active-passive setups, and health probes.

62. How do you implement disaster recovery? Use Azure Site Recovery for VM replication, geo-redundant databases, multi-region active-passive deployments, and RTO/RPO-aligned failover plans.

63. What is Azure Site Recovery? A DRaaS solution that replicates on-premises or Azure VMs to a secondary region and enables orchestrated failover/failback.

64. What is Azure Backup? A centralized cloud backup service for VMs, SQL DBs, Blob, Files, and on-premises workloads with retention policies and restore capabilities.

65. What is geo-redundancy in Azure Storage? GRS (Geo-Redundant Storage) replicates data to a secondary region hundreds of miles away, providing protection against regional disasters.

66. What is zone-redundant storage? ZRS replicates data synchronously across 3 availability zones within a single region, protecting against datacenter-level failures.


🔷 SECURITY & GOVERNANCE (67–74)

67. How do you secure Azure resources? Use RBAC, NSGs, Private Endpoints, Key Vault, Managed Identity, Azure AD Conditional Access, Defender for Cloud, and encryption at rest/in transit.

68. What is Zero Trust model in Azure? Never trust, always verify. Authenticate every user/device, apply least-privilege access, assume breach, and verify explicitly using AAD, MFA, and conditional access.

69. What is Azure Security Center (Defender for Cloud)? A unified security management and threat protection platform that provides security posture scores, threat detection, and compliance assessments.

70. What is Azure Policy? A governance service that enforces organizational standards by defining, assigning, and evaluating policy rules on Azure resources (e.g., enforce tagging, restrict regions).

71. What is Azure Blueprints? Packages Azure Policies, RBAC, ARM templates, and resource groups into a repeatable definition for deploying compliant environments at scale.

72. How do you implement governance in Azure? Use Management Groups, Subscriptions hierarchy, Azure Policy, RBAC, Blueprints, tagging standards, and Cost Management budgets.

73. What is tagging in Azure? Metadata key-value pairs applied to resources for cost allocation, environment identification, automation, and governance (e.g., Environment: Production).

74. How do you optimize Azure costs? Right-size VMs, use Reserved Instances, spot VMs, autoscaling, delete unused resources, apply lifecycle policies on storage, and monitor with Cost Management.


🔷 COST & SCALING (75–77)

75. What is Azure Cost Management? A built-in tool for monitoring, allocating, and optimizing Azure spending with budgets, alerts, and cost analysis dashboards.

76. What is autoscaling in Azure? Automatically adds or removes compute instances based on demand metrics (CPU, memory, queue length) to maintain performance and control costs.

77. What is a scale set in Azure? Azure Virtual Machine Scale Sets (VMSS) create and manage a group of identical, load-balanced VMs that scale in/out automatically.


🔷 INFRASTRUCTURE AS CODE (80–83)

80. What is Infrastructure as Code (IaC) in Azure? Defining and provisioning Azure infrastructure using code/templates (ARM, Bicep, Terraform) for repeatability, version control, and automation.

81. What is ARM template? A JSON file describing Azure resources declaratively. Submitted to ARM for deployment — supports parameters, variables, dependencies, and outputs.

82. What is Bicep in Azure? A domain-specific language (DSL) that simplifies ARM template authoring. More readable, less verbose, compiles to ARM JSON under the hood.

83. Difference between ARM and Bicep? ARM = verbose JSON syntax, harder to read/write. Bicep = cleaner, concise syntax that compiles to ARM. Same capabilities, better developer experience with Bicep.


🔷 CI/CD & DEPLOYMENT PATTERNS (84–87)

84. What is CI/CD pipeline design in Azure DevOps? Source (Repos) → Build pipeline (compile, test, package) → Release pipeline (deploy to Dev → Test → Prod) with approvals, gates, and rollback.

85. How do you deploy multi-environment pipelines? Use YAML pipeline stages with environment-specific variable groups, service connections, and approval gates for Dev, Test, UAT, and Prod stages.

86. What is blue-green deployment? Maintain two identical environments (blue = current, green = new). Switch traffic to green after validation. Enables zero-downtime deployments and instant rollback.

87. What is canary deployment? Gradually route a small percentage of traffic (e.g., 5%) to the new version, monitor, and progressively increase rollout. Reduces risk of full deployments.


🔷 API & DATA ARCHITECTURE (88–94)

88. What is Azure API Management? A fully managed API gateway for publishing, securing, throttling, and monitoring APIs. Provides developer portal, policies, and analytics.

89. What is API Gateway pattern? A single entry point for all clients that handles routing, auth, rate limiting, SSL termination, and response transformation — implemented via APIM in Azure.

90. What is data partitioning in Azure? Dividing data across multiple storage nodes/partitions to improve scalability and performance. Used in Cosmos DB (partition key), SQL (table partitioning), and ADLS.

91. What is data sharding? Horizontal partitioning where each shard contains a subset of data rows — distributes load across multiple databases/nodes for scale-out.

92. What is caching in Azure? Storing frequently accessed data in a fast in-memory layer to reduce latency and database load. Azure provides Redis Cache and CDN for caching.

93. What is Azure Redis Cache? A managed in-memory data store based on open-source Redis. Used for session management, output caching, real-time leaderboards, and pub/sub messaging.

94. What is hybrid cloud architecture in Azure? Combining on-premises infrastructure with Azure cloud using ExpressRoute/VPN, Azure Arc, Azure Stack, and consistent identity/security across both.


🔷 ADVANCED & BEST PRACTICES (95–100)

95. What is multi-cloud strategy? Using services from multiple cloud providers (Azure + AWS + GCP) for redundancy, best-of-breed services, and avoiding vendor lock-in.

96. How do you handle big data processing in Azure? Ingest via Event Hub/ADF → store in ADLS Gen2 → process with Databricks/Synapse Spark → serve via Synapse SQL/Power BI. Use Delta Lake for reliability.

97. How do you design ETL pipelines in Azure? Use ADF for orchestration → ADLS as raw/curated zones → Databricks/Synapse for transformation → SQL DB/Synapse DW as serving layer. Apply medallion architecture (Bronze → Silver → Gold).

98. How do you handle streaming data in Azure? Ingest with Event Hub or IoT Hub → process with Azure Stream Analytics or Databricks Structured Streaming → sink to Cosmos DB, SQL, or Power BI for real-time dashboards.

99. Best practices for Azure architecture? Follow the Azure Well-Architected Framework: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency. Use IaC, least-privilege, autoscaling, monitoring, and tagging.

100. How do you design end-to-end cloud solutions using Azure? Define requirements → choose services (compute, storage, network, data) → design for HA/DR across zones/regions → secure with AAD, RBAC, Key Vault → automate with IaC and CI/CD → monitor with Azure Monitor + Log Analytics → optimize costs with autoscaling and Reserved Instances → iterate using DevOps practices.

Azure Databricks Interview Questions & Answers

🔷 FUNDAMENTALS (1–12)

1. What is Azure Databricks? A managed Apache Spark platform on Azure, optimized for big data processing, ML, and collaborative data engineering. Combines Spark with Delta Lake, MLflow, and Unity Catalog in a unified workspace.

2. What is Apache Spark? An open-source, distributed computing engine for large-scale data processing. Supports batch, streaming, SQL, ML, and graph workloads in-memory across a cluster.

3. Key features of Databricks? Managed Spark clusters, Delta Lake, MLflow, Unity Catalog, collaborative notebooks, Delta Live Tables, AutoML, serverless compute, Git integration, and REST APIs.

4. What is a Databricks workspace? A cloud environment where teams collaborate on notebooks, clusters, jobs, and data. Provides a unified UI for data engineering, science, and analytics.

5. What is a cluster in Databricks? A set of computation resources (driver + worker nodes) that execute Spark workloads. Clusters run notebooks, jobs, and queries.

6. Types of clusters in Databricks? All-purpose clusters (interactive, shared, persistent) and Job clusters (auto-created for specific job runs, auto-terminated after). Also Serverless SQL Warehouses for SQL analytics.

7. What is a notebook in Databricks? A web-based interactive document containing code cells (Python, SQL, Scala, R), markdown, and visualizations. Supports real-time collaboration and version history.

8. Languages supported in Databricks? Python, SQL, Scala, R, and shell commands (%sh). Each notebook has a default language; cells can switch using magic commands (%python, %sql, %scala).

9. What is DBFS? Databricks File System — a distributed file system abstraction layer over cloud storage (ADLS, S3, GCS). Provides a unified path (dbfs:/) for accessing files across the cluster.

10. What is a job in Databricks? A scheduled or triggered execution of a notebook, JAR, Python script, or Delta Live Tables pipeline. Managed via the Jobs UI or REST API.

11. What is a library in Databricks? A package (PyPI, Maven, CRAN, or custom JAR/egg/wheel) installed on a cluster to make external code available to notebooks and jobs.

12. What is autoscaling in Databricks clusters? Automatically adds or removes worker nodes based on workload demand. Saves cost when idle and scales out during peak processing without manual intervention.

🔷 SPARK ARCHITECTURE (13–25)

13. What is driver node in Spark? The master node that runs the SparkContext, coordinates task scheduling, maintains DAG, and collects results. Runs your main application code.

14. What is worker node in Spark? Nodes that run executors, which execute tasks assigned by the driver. Each worker has CPU, memory, and local storage for processing data partitions.

15. What is RDD in Spark? Resilient Distributed Dataset — the fundamental, immutable, fault-tolerant distributed data structure in Spark. Low-level API with no schema; manually parallelized across partitions.

16. What is DataFrame in Spark? A distributed collection of data organized into named columns (like a SQL table). Built on top of RDDs with schema, Catalyst optimization, and higher-level API.

17. Difference between RDD and DataFrame? RDD = unstructured, no optimization, functional API, verbose. DataFrame = structured with schema, optimized by Catalyst, SQL-friendly, much faster and preferred.

18. What is Dataset API? A strongly typed (compile-time safe) version of DataFrame available in Scala and Java. Combines RDD’s type safety with DataFrame’s Catalyst optimization. Python uses DataFrames only (there is no Dataset API in PySpark).

19. What is lazy evaluation in Spark? Transformations are not executed immediately — Spark builds a DAG of operations and executes only when an action is called. Enables optimization before execution.

20. What are transformations in Spark? Operations that return a new DataFrame/RDD without executing immediately (lazy). Examples: filter, select, groupBy, join, withColumn, map, flatMap.

21. What are actions in Spark? Operations that trigger actual execution of the DAG and return results to driver or write to storage. Examples: count, collect, show, write, save, take.

22. What is SparkSession? The unified entry point for Spark applications in Spark 2.0+. Encapsulates SparkContext, SQLContext, and HiveContext. Created via SparkSession.builder.

23. What is caching in Spark? Storing a DataFrame/RDD in memory (or disk) to avoid recomputation on subsequent actions. Use .cache() or .persist() for iterative algorithms and repeated use.
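
A short PySpark sketch tying these ideas together (lazy transformations, an action that triggers execution, and caching); the Parquet path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook `spark` already exists; getOrCreate() simply returns it
spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.parquet("/mnt/raw/orders")

# Transformations are lazy: Spark only builds the DAG here, nothing runs yet
high_value = (df.filter(F.col("amount") > 1000)
                .withColumn("year", F.year("order_date")))

high_value.cache()         # mark for caching; materialized on the first action
print(high_value.count())  # action: triggers execution and fills the cache
high_value.show(5)         # reuses the cached data instead of re-reading Parquet
```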

24. What is a partition in Spark? A chunk of data distributed across worker nodes. Spark processes each partition in parallel. Default partition count is based on HDFS block size or spark.default.parallelism.

25. What is a Spark job? A parallel computation triggered by an action. Each job is divided into stages, stages into tasks. One action = one job.

🔷 DELTA LAKE (26–34)

26. What is Delta Lake in Databricks? An open-source storage layer on top of Parquet that adds ACID transactions, schema enforcement, time travel, DML operations, and scalable metadata to data lakes.

27. What is ACID transaction in Delta Lake? Atomicity (all or nothing), Consistency (valid state always), Isolation (concurrent transactions don’t interfere), Durability (committed data persists) — guaranteed via Delta transaction log.

28. What is schema enforcement? Delta Lake rejects writes that don’t match the table schema by default — prevents corrupt or unexpected data from entering the table (write-time validation).

29. What is schema evolution? Allows adding new columns or changing schema over time using mergeSchema or overwriteSchema options. Enables pipelines to adapt without breaking.

30. What is time travel in Delta Lake? Querying historical versions of a Delta table using VERSION AS OF or TIMESTAMP AS OF. Powered by the Delta transaction log. Useful for auditing and rollbacks.

31. What is OPTIMIZE in Delta Lake? Compacts many small Parquet files into larger, optimal-sized files to improve query performance and reduce file overhead. Run periodically on large tables.

32. What is Z-ORDER in Databricks? A co-locality optimization technique used with OPTIMIZE that sorts and co-locates related data in the same files based on specified columns, speeding up filter queries.

33. What is VACUUM in Delta Lake? Removes old Parquet files no longer referenced by the Delta transaction log. Default retention is 7 days. Frees up storage but removes time travel capability beyond retention.
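
A small sketch of time travel and table maintenance from a Databricks notebook (the table and column names are hypothetical; `spark` is the notebook's SparkSession):

```python
# Query an older snapshot of the table (time travel)
v3 = spark.sql("SELECT * FROM sales_orders VERSION AS OF 3")

# Compact small files and co-locate rows on a frequently filtered column
spark.sql("OPTIMIZE sales_orders ZORDER BY (customer_id)")

# Physically delete files no longer referenced by the transaction log (default 7-day retention)
spark.sql("VACUUM sales_orders")
```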

34. What is checkpointing in Spark? Saves the RDD/DataFrame state or streaming query progress to reliable storage (DBFS/ADLS) to truncate DAG lineage and enable recovery from failures.

🔷 JOINS & SHUFFLES (35–40)

35. What is broadcast join? A join optimization where a small DataFrame is broadcast (copied) to all worker nodes, avoiding expensive shuffle. Triggered automatically or via broadcast() hint.
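
A minimal PySpark example of the broadcast hint (the orders and countries DataFrames are hypothetical, with countries being the small dimension table):

```python
from pyspark.sql.functions import broadcast

# The small table is copied to every executor, so the join needs no shuffle
joined = orders.join(broadcast(countries), on="country_code", how="left")
```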

36. What is shuffle in Spark? Redistribution of data across partitions/nodes required by wide transformations (groupBy, join, distinct). Expensive operation — involves disk I/O and network transfer.

37. What is wide vs narrow transformation? Narrow = each input partition maps to one output partition (filter, map, select) — no shuffle. Wide = multiple input partitions needed for output (groupBy, join, distinct) — causes shuffle.

38. What is repartition vs coalesce? repartition(n) = full shuffle, increases or decreases partitions, evenly distributed. coalesce(n) = no shuffle, only decreases partitions by merging — faster but potentially uneven.

39. What is skew in Spark? Data skew occurs when some partitions have significantly more data than others, causing a few tasks to take much longer and become bottlenecks.

40. How do you handle data skew? Use salting (add random key prefix), AQE skew join optimization, repartitioning, broadcast joins for small tables, or split skewed keys manually.
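
A sketch of the salting technique (hypothetical facts/dims DataFrames joined on customer_id, where a few customer_id values dominate the data):

```python
from pyspark.sql import functions as F

NUM_SALTS = 10  # how many sub-keys each skewed key is spread across

# Add a random salt to the large, skewed side so one hot key lands in many partitions
facts_salted = facts.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Duplicate the small side once per salt value so every (key, salt) pair finds a match
dims_salted = dims.crossJoin(
    spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
)

joined = facts_salted.join(dims_salted, on=["customer_id", "salt"], how="inner")
```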

🔷 OPTIMIZATION INTERNALS (41–45)

41. What is Spark UI? A web interface showing jobs, stages, tasks, DAG visualizations, executor stats, storage, SQL plans, and environment details — essential for performance debugging.

42. What is DAG in Spark? Directed Acyclic Graph — a logical execution plan of transformations. Spark builds the DAG at planning time and optimizes it before execution. Visualized in Spark UI.

43. What is Catalyst optimizer? Spark SQL’s query optimization engine. Applies rule-based and cost-based optimizations: logical plan → optimized logical plan → physical plan → code generation.

44. What is Tungsten engine? Spark’s execution engine for physical memory management and code generation. Uses off-heap memory, binary format, cache-aware computation, and whole-stage code generation for speed.

45. What is cluster manager in Spark? Manages resource allocation across the cluster. Options: Databricks-managed (default), YARN, Mesos, Kubernetes, or standalone Spark cluster manager.

🔷 CLUSTER MANAGEMENT (46–48)

46. What is job scheduling in Databricks? Assigning compute resources and execution order to jobs. Databricks uses FIFO by default; supports fair scheduling for concurrent workloads.

47. What is job cluster vs all-purpose cluster? Job cluster = auto-created for one job run, isolated, cost-efficient, auto-terminated. All-purpose cluster = persistent, shared, interactive — better for development.

48. What is autoscaling behavior in clusters? Databricks monitors pending tasks; scales out by adding workers when backlogged, scales in by removing idle workers. Optimized autoscaling checks every 15 seconds.

🔷 GOVERNANCE & SECURITY (49–53)

49. What is Unity Catalog? A centralized governance solution for Databricks that provides fine-grained access control, data lineage, auditing, and unified metadata management across workspaces.

50. What is data governance in Databricks? Controlling who can access what data using Unity Catalog, RBAC, column-level security, row filters, data masking, audit logs, and data lineage tracking.

51. What is secret scope in Databricks? A secure store for credentials and secrets. Backed by Databricks or Azure Key Vault. Secrets accessed via dbutils.secrets.get() — never displayed in plaintext in notebooks.
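
For example, reading a JDBC password from a secret scope inside a notebook (the scope, secret, server, and table names are hypothetical):

```python
# dbutils is available by default in Databricks notebooks
password = dbutils.secrets.get(scope="kv-backed-scope", key="sql-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales")
      .option("dbtable", "dbo.orders")
      .option("user", "etl_user")
      .option("password", password)   # value is redacted in notebook output
      .load())
```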

52. What is mounting in Databricks? Attaching external cloud storage (ADLS, S3) to DBFS using dbutils.fs.mount() so it appears as a local path. Being replaced by Unity Catalog external locations.

53. What is DBFS vs ADLS? DBFS = virtual filesystem abstraction layer built into Databricks, limited governance. ADLS Gen2 = actual Azure storage with fine-grained ACLs, Unity Catalog integration, and production-grade governance.

🔷 FILE FORMATS (54–60)

54. File formats supported in Databricks? Parquet (default/recommended), Delta, ORC, JSON, CSV, Avro, Text, Binary, XML (with libraries).

55. What is Parquet format? Columnar, compressed, open-source file format. Stores data column-by-column enabling efficient reads for analytical queries. Supports schema, nested types, and predicate pushdown.

56. What is ORC format? Optimized Row Columnar — Hive-native columnar format similar to Parquet. Good for Hive/Hadoop ecosystems. Parquet generally preferred in Spark/Databricks.

57. What is JSON vs CSV difference in Spark? JSON = semi-structured, supports nested/complex types, larger file size. CSV = flat, simple, no type info, schema inferred. Both are row-based and less efficient than Parquet for analytics.

58. What is partition pruning? Spark skips reading entire partitions that don’t match filter conditions. Requires data partitioned on the filter column. Drastically reduces I/O for large tables.

59. What is predicate pushdown? Pushing filter conditions down to the storage/file layer (Parquet, Delta) so only matching rows are read — reduces data scanned before it reaches Spark engine.
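
A short sketch showing both ideas together (hypothetical paths and columns): the table is written partitioned on the filter column, so a later filtered read scans only the matching partitions and pushes the predicate down to the file format:

```python
# Write partitioned by a commonly filtered column
(df.write.format("delta")
   .partitionBy("order_date")
   .mode("overwrite")
   .save("/mnt/curated/orders"))

# Only matching partition folders are read (pruning), and the filter is also
# pushed down to the Delta/Parquet scan instead of being applied afterwards in Spark
jan_15 = (spark.read.format("delta").load("/mnt/curated/orders")
          .filter("order_date = '2024-01-15'"))
```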

60. What is caching vs persistence? .cache() = persist with the default storage level (MEMORY_AND_DISK for DataFrames in Databricks). .persist(StorageLevel) = allows choosing the storage level (memory only, disk only, memory+disk, serialized, replicated).

🔷 PIPELINE DESIGN (61–66)

61. How do you design a data pipeline in Databricks? Ingest raw data to Bronze (ADLS/Delta) → clean/validate to Silver → aggregate/enrich to Gold → serve via SQL Warehouse or Power BI. Orchestrate with Jobs or Delta Live Tables.

62. How do you handle incremental loads? Use watermark columns with MERGE, Auto Loader for new file detection, Structured Streaming with checkpoints, or Delta Lake’s Change Data Feed for incremental reads.
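
For new-file detection, a minimal Auto Loader sketch (hypothetical landing, schema, and checkpoint paths; availableNow processes the current backlog once and stops):

```python
bronze = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/chk/orders_schema")
          .load("/mnt/landing/orders/"))

(bronze.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/chk/orders_bronze")
       .outputMode("append")
       .trigger(availableNow=True)
       .toTable("bronze_orders"))
```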

63. What is CDC in Databricks? Change Data Capture — tracking inserts, updates, and deletes from source systems. Implemented using Delta MERGE, Change Data Feed (delta.enableChangeDataFeed), or Debezium + Event Hub streaming.

64. How do you implement MERGE in Delta Lake?

```sql
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE
```

Handles upserts and deletes atomically.

65. What is upsert in Delta Lake? Update existing records and insert new ones in a single atomic MERGE operation — the most common pattern for CDC and incremental loads in Delta.
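
The same upsert can be expressed in Python through the DeltaTable API (a sketch; the table name and the updates_df source DataFrame are hypothetical):

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver_customers")

(target.alias("t")
       .merge(updates_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()       # update existing rows
       .whenNotMatchedInsertAll()    # insert new rows
       .execute())                   # runs as one atomic transaction
```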

66. How do you handle late arriving data? Use watermarking in Structured Streaming to define how late data can arrive. For batch, reprocess affected partitions. Delta MERGE handles late updates gracefully.

🔷 STREAMING (67–72)

67. What is Structured Streaming? Spark’s scalable, fault-tolerant stream processing engine built on DataFrames/SQL. Treats streaming data as an unbounded table and processes it as micro-batches or continuously.

68. What is micro-batch processing? Default Structured Streaming mode that collects data for a small interval (trigger interval) and processes it as a batch. Balances latency and throughput.

69. What is continuous processing in Spark? An experimental mode in Structured Streaming that processes each record as it arrives with ~1ms latency. Limited operations supported compared to micro-batch.

70. What is watermarking in streaming? Defines how late data is tolerated in streaming aggregations. withWatermark("timestamp", "10 minutes") allows 10 minutes of late data before the state is dropped.

71. What is windowing in Spark streaming? Aggregating streaming data over time windows — tumbling (fixed, non-overlapping), sliding (overlapping), and session windows (activity-based gaps).
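
A compact Structured Streaming sketch combining a watermark with a tumbling window (hypothetical table, column, and checkpoint names):

```python
from pyspark.sql import functions as F

events = spark.readStream.table("bronze_clicks")

# Tolerate events up to 10 minutes late, then count clicks per page in 5-minute windows
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "page")
          .count())

(counts.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/chk/click_counts")
       .toTable("gold_click_counts"))
```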

72. What is stateful vs stateless processing? Stateless = each micro-batch processed independently (filter, map). Stateful = maintains state across batches (aggregations, joins, deduplication) — uses checkpointing.

🔷 PERFORMANCE TUNING (73–82)

73. How do you optimize Spark jobs? Minimize shuffles, use broadcast joins, cache intermediate results, tune partition count, use columnar formats (Parquet/Delta), enable AQE, push predicates early, and right-size clusters.

74. Best practices for performance tuning? Partition data on high-cardinality filter columns, use Z-ORDER for Delta, avoid collect() on large data, use explain() to analyze plans, avoid UDFs where possible, enable AQE.

75. What is memory management in Spark? Spark divides executor memory into execution memory (shuffles, joins, sorts) and storage memory (caching). Unified memory model shares these dynamically since Spark 1.6.

76. What is spill in Spark? When execution memory is insufficient, Spark spills data to disk — slows down jobs significantly. Indicates need for more memory, fewer partitions, or cluster resize.

77. What is executor memory vs driver memory? Executor memory = heap size for worker tasks (transformations, caching). Driver memory = heap size for the driver program (collecting results, DAG planning). Both configurable separately.

78. What is Adaptive Query Execution (AQE)? A Spark 3.0+ feature that re-optimizes query plans at runtime based on actual data statistics — handles skew joins, coalesces shuffle partitions, and converts sort-merge joins to broadcast joins dynamically.

79. What is dynamic partition pruning? A Spark 3.0+ optimization that pushes filter conditions from fact tables to dimension table scans at runtime, significantly reducing I/O in star-schema queries.

80. What is cost-based optimizer (CBO)? Uses table statistics (row counts, column cardinality) to choose the most efficient join order and strategy. Enable with ANALYZE TABLE and spark.sql.cbo.enabled=true.

81. What is cluster sizing strategy? Choose cluster size based on data volume, transformations, and SLA. Start with memory-optimized nodes for caching-heavy workloads, compute-optimized for CPU-heavy transformations. Use autoscaling for variable loads.

82. What is autoscaling optimization? Enable optimized autoscaling (Databricks-enhanced), set appropriate min/max workers, use spot instances for workers, and monitor cluster utilization to avoid over/under-provisioning.


🔷 MONITORING & DEBUGGING (83–87)

83. How do you monitor Databricks jobs? Use Jobs UI for run history and status, Spark UI for task-level metrics, Azure Monitor integration for alerts, Log Analytics for querying logs, and Ganglia for cluster metrics.

84. What is Ganglia in Databricks? An open-source cluster monitoring tool integrated in Databricks showing real-time CPU, memory, network, and disk metrics per node — accessible from the cluster UI.

85. What is logging in Databricks? Capture logs using log4j (driver/executor logs), configure cluster log delivery to DBFS/ADLS, use dbutils.notebook.exit() for notebook outputs, and send custom logs to Azure Monitor.

86. How do you debug failed jobs? Check event logs in Jobs UI → inspect Spark UI for failed stage/task → review executor logs and exception messages → use explain() on DataFrames → add intermediate .show() or count() checkpoints.

87. What is Databricks REST API? A comprehensive HTTP API to programmatically manage clusters, jobs, notebooks, DBFS, secrets, and permissions. Used for CI/CD automation and integration with external tools.

🔷 CI/CD & MLFLOW (88–93)

88. What is CI/CD in Databricks? Automating notebook/code testing and deployment across environments (Dev → Test → Prod) using Azure DevOps or GitHub Actions with Databricks CLI, REST API, and bundle configurations.

89. How do you deploy notebooks to production? Use Git integration (Repos) → branch-based development → PR review → merge to main → CI pipeline runs tests (nutter/pytest) → CD pipeline deploys using Databricks CLI or dbx/bundle deploy.

90. What is MLflow in Databricks? An open-source ML lifecycle platform built into Databricks for experiment tracking, model packaging, model registry, and deployment. Native integration with Spark and AutoML.

91. What is experiment tracking? Logging ML run parameters, metrics, tags, and artifacts (models, plots) using mlflow.log_param(), mlflow.log_metric() to compare runs and reproduce results.
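
A minimal tracking sketch (the run name, parameter, metric, and file names are illustrative):

```python
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("max_depth", 5)          # hyperparameter for this run
    mlflow.log_metric("rmse", 0.42)           # evaluation metric
    mlflow.log_artifact("roc_curve.png")      # any local file (plot, report, model summary)
```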

92. What is model registry? A centralized MLflow component for versioning, staging (Staging → Production → Archived), and managing ML models across teams with approval workflows.

93. What is feature store in Databricks? A centralized repository to create, store, discover, and reuse ML features. Ensures consistency between training and serving, supports point-in-time lookups for time-series features.


🔷 DELTA LIVE TABLES & DATA QUALITY (94–98)

94. What is Delta Live Tables (DLT)? A declarative ETL framework in Databricks for building reliable, maintainable data pipelines. Define transformations as SQL/Python; DLT manages orchestration, retries, lineage, and data quality.

95. What is data quality in Databricks? Enforcing expectations on data using DLT constraints (@dlt.expect, @dlt.expect_or_drop, @dlt.expect_or_fail), Great Expectations library, or custom validation notebooks.
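
A small DLT sketch using expectations (the table, column, and rule names are illustrative):

```python
import dlt

@dlt.table(comment="Cleaned orders")
@dlt.expect("positive_amount", "amount > 0")                    # warn: keep the row, record the violation
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # drop rows that fail the rule
def silver_orders():
    return dlt.read_stream("bronze_orders")                     # read from an upstream DLT table
```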

96. How do you implement data validation checks? Use DLT expectations for pipeline-level constraints, write assertion-based checks using assert or custom functions, log bad records to quarantine tables, and monitor quality metrics in DLT UI.

97. What is medallion architecture? A layered data design pattern: Bronze (raw ingested data, as-is) → Silver (cleaned, validated, deduplicated) → Gold (aggregated, business-ready). Each layer progressively improves data quality.

98. How do you handle schema evolution in production? Enable mergeSchema for additive changes, use Delta’s schema evolution options, version schemas in a registry, test schema changes in lower environments, and use DLT’s schema drift handling.


🔷 COST & END-TO-END (99–100)

99. What is cost optimization in Databricks? Use job clusters instead of all-purpose for production, enable autoscaling, use spot/preemptible instances for workers, right-size clusters, schedule jobs during off-peak, use serverless SQL for ad-hoc queries, and OPTIMIZE + VACUUM to reduce storage costs.

100. How do you design an end-to-end real-time pipeline in Databricks? Ingest streaming data via Event Hub → Auto Loader or Structured Streaming reads into Bronze Delta table → streaming transformations write to Silver (deduplication, cleaning) → aggregations write to Gold → serve via Databricks SQL / Power BI for real-time dashboards. Use DLT for declarative pipeline management, Unity Catalog for governance, MLflow for model serving if needed, and Azure Monitor for end-to-end observability.
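
A minimal Bronze-layer ingestion sketch with Auto Loader for the first hop of such a pipeline (paths and table names are illustrative):

```python
bronze = (spark.readStream.format("cloudFiles")                  # Auto Loader source
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/events")
          .load("/mnt/raw/events"))

(bronze.writeStream
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")
    .outputMode("append")
    .toTable("bronze_events"))                                   # Bronze Delta table
```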

SQL Interview Questions & Answers

Here are concise answers to all 100 SQL questions:


🔷 BASICS (1–20)

1. What is SQL? Structured Query Language — a standard language for managing, querying, and manipulating relational databases. Used to create, read, update, and delete data.

2. Types of SQL commands? DDL (CREATE, ALTER, DROP), DML (SELECT, INSERT, UPDATE, DELETE), DCL (GRANT, REVOKE), TCL (COMMIT, ROLLBACK, SAVEPOINT), DQL (SELECT).

3. Difference between WHERE and HAVING? WHERE filters rows before grouping (works on raw data). HAVING filters groups after GROUP BY (works on aggregated results). WHERE cannot use aggregate functions; HAVING can.

4. What is a primary key? A column (or combination) that uniquely identifies each row in a table. Cannot be NULL, must be unique, only one per table. Automatically creates a clustered index.

5. What is a foreign key? A column in one table that references the primary key of another table. Enforces referential integrity — prevents orphan records.

6. INNER JOIN vs LEFT JOIN? INNER JOIN returns only matching rows from both tables. LEFT JOIN returns all rows from the left table and matching rows from the right; unmatched right rows return NULL.

7. What is NULL in SQL? Represents missing, unknown, or inapplicable data. NULL ≠ 0 or empty string. Use IS NULL / IS NOT NULL to check — NULL compared with anything returns NULL (unknown).

8. COUNT(*) vs COUNT(column)? COUNT(*) counts all rows including NULLs. COUNT(column) counts only non-NULL values in that column.

9. What is DISTINCT? Removes duplicate rows from the result set. SELECT DISTINCT col returns unique values. Can be used with multiple columns — combination must be unique.

10. DELETE vs TRUNCATE vs DROP? DELETE = removes specific rows (filterable, logged, rollbackable). TRUNCATE = removes all rows fast (minimal logging, resets identity, rollbackable in some DBs). DROP = removes entire table including structure.

11. What is a constraint? Types? Rules enforced on columns to maintain data integrity. Types: PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT.

12. What is ORDER BY? Sorts the result set by one or more columns in ascending (ASC, default) or descending (DESC) order. Applied last in query execution after SELECT.

13. What is GROUP BY? Groups rows with the same values in specified columns into summary rows. Used with aggregate functions (SUM, COUNT, AVG, MAX, MIN).

14. What is a default value? A value automatically assigned to a column when no value is provided during INSERT. Defined with DEFAULT constraint.

15. What is LIMIT or TOP? Restricts number of rows returned. LIMIT n (MySQL/PostgreSQL), TOP n (SQL Server), ROWNUM / FETCH FIRST n ROWS (Oracle).

16. What is IN operator? Tests whether a value matches any value in a list or subquery. WHERE dept IN ('HR', 'IT') — cleaner alternative to multiple OR conditions.

17. BETWEEN vs IN? BETWEEN filters a range of values (inclusive on both ends): salary BETWEEN 50000 AND 80000. IN filters specific discrete values: dept IN ('HR', 'Finance').

18. What is aliasing in SQL? Assigning a temporary name to a column or table using AS keyword for readability. SELECT salary * 12 AS annual_salary or FROM employees e.

19. What is LIKE operator? Used for pattern matching in string columns. WHERE name LIKE 'A%' (starts with A). Works with wildcards % and _.

20. What are wildcards? % = matches zero or more characters. _ = matches exactly one character. Used with LIKE: '%son' (ends with son), 'J_n' (Jan, Jon, etc.).

🔷 DATA TYPES & STRUCTURE (21–25)

21. CHAR vs VARCHAR? CHAR = fixed-length, pads with spaces, faster for fixed-size data. VARCHAR = variable-length, stores only actual characters, more storage-efficient for variable data.

22. What is a schema? A logical container/namespace within a database that groups related objects (tables, views, procedures). E.g., in dbo.employees, dbo is the schema.

23. What is a table? A structured storage object organized in rows (records) and columns (fields). The fundamental unit of data storage in relational databases.

24. What is normalization? Process of organizing a database to reduce redundancy and improve integrity by dividing large tables into smaller ones and defining relationships. Follows normal forms (1NF, 2NF, 3NF, BCNF).

25. What is denormalization? Intentionally introducing redundancy into a database by combining tables to improve read performance. Used in data warehouses and reporting systems.

🔷 JOINS & SUBQUERIES (26–35)

26. What is a subquery? A query nested inside another query (SELECT, INSERT, UPDATE, DELETE). Can be in SELECT, FROM, or WHERE clause. Executes first and passes result to outer query.

27. Types of joins in SQL? INNER JOIN, LEFT JOIN (LEFT OUTER), RIGHT JOIN (RIGHT OUTER), FULL OUTER JOIN, CROSS JOIN, SELF JOIN, and NATURAL JOIN.

28. What is self join? A table joined with itself using aliases. Used to compare rows within the same table — e.g., finding employees and their managers from the same employees table.

29. What is a correlated subquery? A subquery that references columns from the outer query. Executes once per row of the outer query — slower than regular subqueries. Used for row-by-row comparisons.

30. UNION vs UNION ALL? UNION combines result sets and removes duplicates (sorts internally — slower). UNION ALL combines all rows including duplicates (faster). Column count and types must match in both.

31. What is CASE statement? Conditional logic in SQL — like if-else. CASE WHEN condition THEN result ELSE default END. Used in SELECT, ORDER BY, GROUP BY, and WHERE clauses.

32. What is COALESCE? Returns the first non-NULL value from a list of expressions. COALESCE(col1, col2, 'default') — commonly used to handle NULLs with fallback values.

33. What is NULLIF? Returns NULL if two expressions are equal, otherwise returns the first expression. NULLIF(col, 0) — used to avoid division-by-zero errors.

34. What are aggregate functions? Functions that compute a single result from multiple rows: SUM, COUNT, AVG, MAX, MIN, STDEV, VAR. Used with GROUP BY or as window functions.

35. WHERE vs ON in joins? ON defines the join condition between tables (filters during join). WHERE filters rows after the join. For INNER JOIN they’re equivalent; for OUTER JOINs they differ — ON preserves unmatched rows, WHERE eliminates them.

🔷 VIEWS & INDEXES (36–45)

36. What is a view? A virtual table defined by a SELECT query. Doesn’t store data physically (usually). Simplifies complex queries, provides security by hiding columns, and abstracts underlying schema.

37. What is a materialized view? A view that physically stores the query result and is periodically refreshed. Faster to query than regular views but requires storage and refresh strategy. Used in data warehouses.

38. View vs table? Table = physical storage of data. View = virtual, no storage (unless materialized), always reflects latest data, can’t always be updated directly.

39. What is indexing? A database optimization structure that allows faster data retrieval. Like a book index — points to row locations without scanning the entire table.

40. Types of indexes? Clustered, Non-clustered, Unique, Composite, Full-text, Filtered, Covering, Bitmap (Oracle), Columnstore (SQL Server for analytics).

41. What is clustered index? Physically sorts and stores table rows based on the indexed column. Only one per table. Primary key creates clustered index by default. Table IS the index.

42. What is non-clustered index? A separate structure that holds key values and pointers to actual rows. Multiple allowed per table. Faster lookups without reordering physical data.

43. What is composite index? An index on two or more columns. Column order matters — index is used when the leading columns are in the WHERE clause ((dept, salary) helps WHERE dept = 'IT').

44. What is unique index? Enforces uniqueness on the indexed column(s). Similar to UNIQUE constraint. Can be on non-primary key columns. NULL handling varies by database: SQL Server allows only one NULL, while PostgreSQL and MySQL allow multiple NULLs.

45. EXISTS vs IN? EXISTS = checks if subquery returns any rows (stops at first match — efficient for large datasets). IN = checks if value matches list from subquery (evaluates all). EXISTS preferred with correlated subqueries.

🔷 CTEs & WINDOW FUNCTIONS (46–60)

46. What is a CTE? Common Table Expression — a named temporary result set defined with WITH clause, usable in the main query. Improves readability and supports recursion.

47. CTE vs subquery? CTE = reusable within the same query, readable, supports recursion, defined once and referenced multiple times. Subquery = inline, can’t be reused, less readable for complex logic.

48. What is recursion in SQL? A CTE that references itself to process hierarchical data (org charts, file systems). Requires an anchor member and recursive member with termination condition.

```sql
WITH cte AS (  -- use WITH RECURSIVE in PostgreSQL/MySQL
  SELECT id, name, manager_id FROM emp WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, e.manager_id FROM emp e JOIN cte c ON e.manager_id = c.id
)
SELECT * FROM cte;
```

49. What is ROW_NUMBER()? Assigns a unique sequential integer to each row within a partition. No ties — each row gets a distinct number. Used for deduplication and pagination.

50. What is RANK()? Assigns rank to rows within a partition. Ties get the same rank but the next rank skips (1,1,3,4). Gaps exist in ranking sequence.

51. What is DENSE_RANK()? Like RANK() but no gaps — ties get same rank and next rank is consecutive (1,1,2,3). Used when skipping ranks is undesirable.

52. ROW_NUMBER vs RANK vs DENSE_RANK?

```
Score: 100, 100, 90, 80
ROW_NUMBER:  1, 2, 3, 4  (unique always)
RANK:        1, 1, 3, 4  (gaps after ties)
DENSE_RANK:  1, 1, 2, 3  (no gaps)
```

53. What are window functions? Functions that perform calculations across a set of rows related to the current row without collapsing rows (unlike GROUP BY). Use OVER(PARTITION BY ... ORDER BY ...) clause.

54. What is partition in window functions? PARTITION BY divides rows into groups for the window function to operate on independently — like GROUP BY but without collapsing rows. Each partition is processed separately.

55. What is running total? Cumulative sum of values up to the current row.

```sql
SELECT date, sales,
  SUM(sales) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_total
FROM sales;
```

56. What is moving average? Average of a sliding window of N rows.

```sql
SELECT date, sales,
  AVG(sales) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3
FROM sales;
```

57. What is LAG() and LEAD()? LAG(col, n) = accesses value n rows before current row. LEAD(col, n) = accesses value n rows after current row. Used for comparing current row with previous/next row values.

58. What is pivoting? Transforming distinct row values into column headers to create a cross-tabular report. Done using CASE+GROUP BY or PIVOT keyword (SQL Server).

59. What is unpivoting? Transforming columns back into rows — the reverse of pivot. Done using UNPIVOT keyword or UNION ALL approach.

60. What is a derived table? A subquery in the FROM clause treated as a temporary table. Like a CTE but defined inline. SELECT * FROM (SELECT col FROM table WHERE ...) AS derived.

🔷 PERFORMANCE & INTERNALS (61–72)

61. What is query optimization? The process of rewriting queries and designing schemas/indexes to minimize execution time and resource usage. Involves analyzing execution plans, adding indexes, and rewriting logic.

62. What is execution plan? A step-by-step road map the database engine uses to execute a query. Shows operations (scan, seek, join type, sort) and their costs. Use EXPLAIN (MySQL/PG) or SET SHOWPLAN (SQL Server).

63. What is indexing strategy? Choosing which columns to index based on query patterns: index WHERE clause columns, JOIN keys, ORDER BY columns, and high-selectivity columns. Avoid over-indexing (slows DML).

64. What is data skew? Unequal distribution of data values in a column, leading to uneven partition sizes and slow queries. Affects performance in parallel databases and Spark.

65. Normalization forms (1NF, 2NF, 3NF)? 1NF = atomic values, no repeating groups. 2NF = 1NF + no partial dependency (non-key columns depend on whole PK). 3NF = 2NF + no transitive dependency (non-key columns depend only on PK).

66. What is ACID? Atomicity (all or nothing), Consistency (valid state always maintained), Isolation (concurrent transactions don’t interfere), Durability (committed data persists after crash).

67. What is a transaction? A sequence of SQL operations treated as a single unit. Starts with BEGIN, ends with COMMIT (save) or ROLLBACK (undo). Ensures ACID properties.

68. What is deadlock? Two or more transactions block each other by holding locks the other needs — circular wait. Database detects and kills one transaction (deadlock victim) to resolve.

69. What is locking? Mechanism to control concurrent access to data. Ensures transactions don’t interfere with each other. Locks are acquired on rows, pages, or tables.

70. Types of locks? Shared (S) lock = read, multiple allowed. Exclusive (X) lock = write, blocks others. Update (U) lock = intent to update. Intent locks = signal intent to lock lower-level objects.

71. What is isolation level? Defines how much a transaction is isolated from other concurrent transactions. Controls the trade-off between consistency and concurrency.

72. Types of isolation levels? Read Uncommitted (dirty reads allowed), Read Committed (default, no dirty reads), Repeatable Read (no dirty/non-repeatable reads), Serializable (strictest, no phantom reads). Also Snapshot Isolation.

🔷 CURSORS & STORED OBJECTS (73–81)

73. What is a cursor? A database object to retrieve and process rows one at a time from a result set. Useful for row-by-row operations but generally slow and resource-intensive.

74. When to avoid cursors? Almost always — replace with set-based operations (joins, window functions, CTEs) which are far more efficient. Use only when row-by-row logic is unavoidable.

75. What is a temporary table? A table created in tempdb (SQL Server) or session scope that exists for the duration of a session or stored procedure. #temp (local) or ##temp (global).

76. Temp table vs CTE? Temp table = physical storage in tempdb, persists across multiple statements, indexable. CTE = virtual, exists only for one query, not indexable, better for readability.

77. What is a stored procedure? A precompiled set of SQL statements stored in the database and executed by name. Supports parameters, variables, control flow, error handling, and transactions.

78. What is a function? A reusable SQL routine that returns a value (scalar) or table. Called within SQL statements. Scalar functions return one value; table-valued functions return a result set.

79. Procedure vs function? Function = must return a value, can be used in SELECT, no DML (in most DBs), no error handling. Procedure = may not return value, can’t be used in SELECT, supports DML, transactions, and error handling.

80. What is a trigger? A stored procedure that automatically fires in response to DML events (INSERT, UPDATE, DELETE) on a table. Used for auditing, validation, and cascading actions.

81. Types of triggers? BEFORE/AFTER triggers (MySQL), INSTEAD OF triggers (SQL Server views), DML triggers (row-level, statement-level), DDL triggers (on schema changes), Logon triggers.

🔷 PARTITIONING & ARCHITECTURE (82–85)

82. What is partitioning? Dividing a large table into smaller, manageable pieces (partitions) based on column values. Improves query performance through partition pruning and parallel processing.

83. Types of partitioning? Range (date ranges), List (specific values), Hash (hash of column value for even distribution), Composite (combination of methods).

84. What is sharding? Horizontal partitioning across multiple database servers/instances. Each shard holds a subset of rows. Scales write performance across servers — used in distributed systems.

85. Data warehouse schema types? Star schema (fact table + dimension tables, denormalized, fast queries), Snowflake schema (normalized dimensions, complex joins), Data Vault (hub-satellite model for flexibility and history).

🔷 PRACTICAL SQL QUERIES (86–100)

86. Find duplicate records:

```sql
SELECT col, COUNT(*) AS cnt
FROM table
GROUP BY col
HAVING COUNT(*) > 1;
```

87. Remove duplicates (keep one):

```sql
-- SQL Server / CTE approach
WITH cte AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY col ORDER BY id) AS rn
  FROM table
)
DELETE FROM cte WHERE rn > 1;
```

88. Second highest salary:

```sql
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
-- OR
SELECT salary FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

89. Nth highest salary:

```sql
SELECT salary FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
) t
WHERE rnk = N;
```

90. Employees earning more than their manager:

```sql
SELECT e.name AS employee, e.salary, m.name AS manager, m.salary AS mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```

91. Find missing numbers in a sequence:

```sql
SELECT n.num AS missing
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS num
      FROM master..spt_values) n  -- or generate_series in PG
WHERE n.num BETWEEN 1 AND (SELECT MAX(id) FROM table)
  AND n.num NOT IN (SELECT id FROM table);
-- PostgreSQL:
SELECT generate_series(MIN(id), MAX(id)) AS missing
FROM table
EXCEPT SELECT id FROM table;
```

92. Top 3 salaries per department:

```sql
SELECT * FROM (
  SELECT name, dept, salary,
    DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
  FROM employees
) t
WHERE rnk <= 3;
```

93. Running total of sales:

```sql
SELECT date, amount,
  SUM(amount) OVER (ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
```

94. Customers with no orders:

```sql
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
-- OR using NOT EXISTS:
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);
```

95. Employees who joined in last 30 days:

```sql
SELECT * FROM employees
WHERE hire_date >= DATEADD(DAY, -30, GETDATE());  -- SQL Server
-- WHERE hire_date >= CURRENT_DATE - INTERVAL '30 days';  -- PostgreSQL
```

96. Consecutive duplicate records:

```sql
SELECT * FROM (
  SELECT *,
    LAG(status) OVER (ORDER BY id) AS prev_status
  FROM table
) t
WHERE status = prev_status;
```

97. Find gaps in dates:

```sql
SELECT DATEADD(DAY, 1, date) AS gap_start,
       DATEADD(DAY, -1, next_date) AS gap_end
FROM (SELECT date, LEAD(date) OVER (ORDER BY date) AS next_date FROM dates_table) t
WHERE DATEDIFF(DAY, date, next_date) > 1;
```

98. Most frequent value in a column:

```sql
SELECT TOP 1 col, COUNT(*) AS freq
FROM table
GROUP BY col
ORDER BY COUNT(*) DESC;
-- Or with LIMIT in MySQL/PG
```

99. Convert rows to columns (pivot):

```sql
-- SQL Server PIVOT:
SELECT * FROM (
  SELECT emp_id, month, sales FROM sales_table
) src
PIVOT (
  SUM(sales) FOR month IN ([Jan], [Feb], [Mar])
) pvt;
-- Generic CASE approach:
SELECT emp_id,
  SUM(CASE WHEN month='Jan' THEN sales END) AS Jan,
  SUM(CASE WHEN month='Feb' THEN sales END) AS Feb
FROM sales_table GROUP BY emp_id;
```

100. Split comma-separated values into rows:

```sql
-- SQL Server (STRING_SPLIT):
SELECT id, value
FROM table
CROSS APPLY STRING_SPLIT(csv_column, ',');

-- PostgreSQL:
SELECT id, unnest(string_to_array(csv_column, ',')) AS value
FROM table;

-- MySQL 8+:
SELECT id, SUBSTRING_INDEX(SUBSTRING_INDEX(csv_col, ',', n), ',', -1) AS val
FROM table JOIN numbers ON n <= LENGTH(csv_col) - LENGTH(REPLACE(csv_col,',','')) + 1;
```

Python Interview Questions & Answers

🔷 BASICS (1–25)

1. What is Python? Key features? Python is a high-level, interpreted, general-purpose programming language. Key features: simple readable syntax, dynamic typing, interpreted, object-oriented, large standard library, cross-platform, and supports multiple paradigms (OOP, functional, procedural).

2. Python data types? int, float, complex, str, bool, list, tuple, set, frozenset, dict, bytes, bytearray, NoneType.

3. List vs tuple vs set vs dictionary? List = ordered, mutable, allows duplicates [1,2,3]. Tuple = ordered, immutable, allows duplicates (1,2,3). Set = unordered, mutable, no duplicates {1,2,3}. Dict = key-value pairs, ordered (3.7+), mutable, unique keys {"a":1}.

4. Mutable vs immutable? Mutable = can be changed after creation (list, dict, set, bytearray). Immutable = cannot be changed (int, float, str, tuple, frozenset). Immutable objects are hashable and safe as dict keys.

5. What is a variable? A named reference to a memory location storing a value. Python variables are dynamically typed — no need to declare type. x = 10 creates a variable x pointing to integer 10.

6. What are keywords? Reserved words with special meaning in Python. Cannot be used as identifiers. Examples: if, else, for, while, def, class, return, import, True, False, None, and, or, not, in, is, lambda, yield, with.

7. What is indentation? Python uses indentation (spaces/tabs) to define code blocks instead of braces {}. Standard is 4 spaces. Inconsistent indentation raises IndentationError.

8. What is type casting? Converting one data type to another. Explicit: int("10"), str(10), float("3.14"), list((1,2,3)). Implicit: Python automatically converts int + float → float.

9. What is dynamic typing? Variable types are determined at runtime, not compile time. Same variable can hold different types at different times: x = 10; x = "hello" — both valid.

10. What is id() function? Returns the unique memory address (identity) of an object. id(x) — useful to check if two variables point to the same object. Related to is operator comparison.

11. What is len() function? Returns the number of items in an object (string, list, tuple, dict, set). len("hello") → 5, len([1,2,3]) → 3.

12. What is slicing? Extracting a portion of a sequence using [start:stop:step]. lst[1:4] = elements at index 1,2,3. lst[::-1] = reverse. Works on strings, lists, tuples.

13. What is indexing? Accessing individual elements using position. lst[0] = first element. lst[-1] = last element. Python supports negative indexing (from end).

14. What is a string? An immutable sequence of Unicode characters enclosed in single, double, or triple quotes. "hello", 'world', """multiline""". Strings are iterable and support slicing.

15. String methods? upper(), lower(), strip(), split(), join(), replace(), find(), count(), startswith(), endswith(), format(), encode(), isdigit(), isalpha(), zfill(), ljust(), rjust(), title(), center().

16. What is input() function? Reads user input from console as a string. name = input("Enter name: "). Always returns string — cast if needed: age = int(input("Age: ")).

17. What is print() function? Outputs data to console. print("hello", end="\n", sep=" "). Supports multiple arguments, sep for separator, end for line ending, file for output destination.

18. What are comments? Non-executable lines for documentation. Single-line: # comment. Multi-line: use multiple # or triple-quoted strings """...""" (technically a string, used as docstring too).

19. What is None? Python’s null value representing absence of a value. Type is NoneType. Default return value of functions with no return. Check with if x is None (not == None).

20. is vs ==? == compares values (equality). is compares identity (same memory address). [1,2] == [1,2] → True. [1,2] is [1,2] → False (different objects). Use is only for None, True, False comparisons.

21. What is boolean type? bool — subclass of int. Only two values: True (1) and False (0). Result of comparisons and logical operations. Falsy values: 0, None, "", [], {}, (), set().

22. What is type() function? Returns the type/class of an object. type(10) → <class 'int'>. Use isinstance(obj, type) for type checking in production code (handles inheritance).

23. What are escape characters? Special characters in strings prefixed with \. \n (newline), \t (tab), \\ (backslash), \' (single quote), \" (double quote), \r (carriage return), \0 (null).

24. What is Python interpreter? A program that executes Python code line-by-line (CPython is the default). Converts source code → bytecode → machine execution. Other interpreters: PyPy, Jython, IronPython.

25. What is PEP 8? Python Enhancement Proposal 8 — the official style guide for Python code. Covers indentation (4 spaces), line length (79 chars), naming conventions (snake_case for variables, PascalCase for classes), and documentation.

🔷 CONTROL FLOW (26–30)

26. What is if-else? Conditional execution based on boolean expression.

```python
if score >= 90:
    print("A")
elif score >= 80:
    print("B")
else:
    print("C")
```

27. What is for loop? Iterates over a sequence (list, string, range, dict, etc.).

```python
for i in range(5):       # 0,1,2,3,4
    print(i)
for item in my_list:
    print(item)
```

28. What is while loop? Repeats as long as condition is True.

```python
n = 0
while n < 5:
    print(n)
    n += 1
```

Risk of infinite loop if condition never becomes False.

29. What is break and continue? break = exits the loop immediately. continue = skips current iteration and moves to next. Both work in for and while loops.

30. What is pass? A null statement — does nothing. Used as a placeholder where syntax requires a statement but no logic is needed yet (empty functions, classes, loops).

```python
def todo_later():
    pass
```

🔷 FUNCTIONS (31–45)

31. What are functions? Reusable blocks of code defined with def keyword. Encapsulate logic, reduce repetition, improve readability. Can accept parameters and return values.

32. Built-in vs user-defined functions? Built-in = provided by Python (len, print, range, type, int, str, sorted, zip, map, filter). User-defined = created by programmer using def. Both are first-class objects in Python.

33. What is recursion? A function calling itself to solve smaller subproblems. Requires a base case to stop. Python default recursion limit is 1000 (sys.setrecursionlimit()).

34. Factorial using recursion:

```python
def factorial(n):
    if n <= 1:       # base case
        return 1
    return n * factorial(n - 1)  # recursive case

print(factorial(5))  # 120
```

35. What are arguments in functions? Values passed to a function when calling it. Parameters = placeholders in definition. Arguments = actual values passed. Python is pass-by-object-reference.

36. Types of arguments? Positional (order matters), Keyword (name=value, order flexible), Default (preset value), *args (variable positional), **kwargs (variable keyword). Order: positional → *args → keyword → **kwargs.

37. What is default argument? A parameter with a preset value used when no argument is provided.

```python
def greet(name, msg="Hello"):
    print(f"{msg}, {name}")
greet("Alice")          # Hello, Alice
greet("Bob", "Hi")      # Hi, Bob
```

38. What is *args? Allows a function to accept any number of positional arguments as a tuple.

```python
def add(*args):
    return sum(args)
add(1, 2, 3, 4)  # 10
```

39. What is **kwargs? Allows a function to accept any number of keyword arguments as a dictionary.

```python
def info(**kwargs):
    for k, v in kwargs.items():
        print(f"{k}: {v}")
info(name="Alice", age=30)
```

40. What is lambda function? An anonymous single-expression function. lambda arguments: expression. Used for short, throwaway functions — commonly with map, filter, sorted.

```python
square = lambda x: x ** 2
sorted(lst, key=lambda x: x[1])
```

41. What are anonymous functions? Functions without a name — created using lambda in Python. Used inline where a full def is unnecessary. Limited to single expression (no statements, no return keyword).

42. What is return statement? Exits a function and optionally sends a value back to the caller. A function without return returns None. Can return multiple values as a tuple: return x, y.

43. Return vs print? print() displays to console — output only, no value passed back. return passes a value back to the caller for further use. Functions used in expressions must return, not print.

44. What is docstring? A string literal at the start of a function/class/module documenting its purpose. Accessed via __doc__ attribute or help(). Triple-quoted: """This function does X.""".

45. What is scope of variables? The region where a variable is accessible. LEGB rule: Local (function) → Enclosing (nested function) → Global (module) → Built-in (Python built-ins). Use global or nonlocal keywords to modify outer scope variables.

🔷 DATA STRUCTURES (46–65)

46. What is a list? An ordered, mutable, indexed collection allowing duplicates. Defined with []. Most versatile Python data structure. Supports heterogeneous types.

47. List methods? append(x) add end, extend(iter) add multiple, insert(i,x) at index, remove(x) first occurrence, pop(i) remove+return, sort() in-place, reverse(), index(x), count(x), clear(), copy().

48. What is list comprehension? Concise way to create lists using a single line with optional filtering.

```python
squares = [x**2 for x in range(10)]
evens = [x for x in range(20) if x % 2 == 0]
matrix = [[i*j for j in range(3)] for i in range(3)]
```

49. What is a tuple? An ordered, immutable sequence. Defined with (). Faster than lists, hashable (usable as dict key), used for fixed data like coordinates, RGB values, database records.

50. Why is tuple immutable? Designed for fixed collections — immutability provides hashability (usable as dict keys/set elements), thread safety, and slight performance benefit. Signals “this data shouldn’t change.”

51. What is a set? An unordered collection of unique, hashable elements. Defined with {} or set(). No indexing. O(1) average lookup. Used for deduplication and membership tests.

52. Set operations?

```python
a | b   # union — all elements
a & b   # intersection — common elements
a - b   # difference — in a not in b
a ^ b   # symmetric difference — in a or b but not both
a <= b  # subset check
```

53. What is a dictionary? An ordered (Python 3.7+), mutable collection of key-value pairs. Keys must be unique and hashable. O(1) average lookup. {"name": "Alice", "age": 30}.

54. Dictionary methods? get(k, default), keys(), values(), items(), update(), pop(k), popitem(), setdefault(), clear(), copy(), fromkeys(). Dict comprehension: {k:v for k,v in ...}.

55. What is key-value pair? The fundamental unit of a dictionary. Key = unique identifier (immutable), Value = associated data (any type). Accessed via dict[key] or dict.get(key).

56. What is nested dictionary? A dictionary where values are themselves dictionaries. Used for hierarchical data.

```python
students = {
    "Alice": {"age": 20, "grade": "A"},
    "Bob": {"age": 22, "grade": "B"}
}
students["Alice"]["grade"]  # "A"
```

57. Shallow copy vs deep copy? Shallow copy = new object but nested objects are still referenced (shared). Deep copy = completely independent copy including all nested objects. Matters for mutable nested structures.

58. copy() vs deepcopy()?

```python
import copy
lst = [[1,2], [3,4]]
shallow = copy.copy(lst)      # or lst.copy() or lst[:]
deep = copy.deepcopy(lst)
# Modify nested: shallow copy affected, deep copy independent
```

59. What is unpacking? Extracting values from sequences/iterables into variables.

```python
a, b, c = [1, 2, 3]
first, *rest = [1, 2, 3, 4]    # first=1, rest=[2,3,4]
x, y = (10, 20)
a, b = b, a                     # swap
```

60. What is zip()? Combines multiple iterables element-wise into tuples. Stops at shortest iterable.

```python
names = ["Alice", "Bob"]
scores = [90, 85]
list(zip(names, scores))  # [("Alice",90), ("Bob",85)]
dict(zip(names, scores))  # {"Alice":90, "Bob":85}
```

61. What is enumerate()? Adds a counter to an iterable, returning (index, value) pairs.

```python
for i, val in enumerate(["a","b","c"], start=1):
    print(i, val)  # 1 a, 2 b, 3 c
```

62. sorted() vs sort()? sort() = in-place list method, modifies original, returns None. sorted() = built-in function, returns new sorted list, works on any iterable. Both accept key and reverse parameters.
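
A quick illustration of the difference:

```python
nums = [3, 1, 2]
sorted(nums, reverse=True)           # new list [3, 2, 1]; nums unchanged
nums.sort()                          # in place; returns None, nums is now [1, 2, 3]

people = [("Alice", 30), ("Bob", 25)]
sorted(people, key=lambda p: p[1])   # sort any iterable by a key
```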

63. What is heap / priority queue? A binary tree-based data structure where parent is always smaller (min-heap) than children. Python’s heapq module implements min-heap. heapq.heappush(), heapq.heappop() for O(log n) operations.
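
A short heapq example:

```python
import heapq

h = []
for value in [5, 1, 4, 2]:
    heapq.heappush(h, value)          # O(log n) insert; min-heap invariant maintained

heapq.heappop(h)                      # 1; always removes the smallest element
heapq.nsmallest(2, [5, 1, 4, 2])      # [1, 2] without fully sorting
```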

64. What is stack and queue in Python? Stack (LIFO) = use list with append() and pop(). Queue (FIFO) = use collections.deque with append() and popleft(), or queue.Queue for thread-safe operations.

65. What is deque? Double-ended queue from collections module. O(1) append/pop from both ends. appendleft(), popleft(), rotate(). More efficient than list for queue operations.
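
A short sketch of the stack and queue idioms from the last two answers:

```python
from collections import deque

stack = []                       # LIFO with a plain list
stack.append("a"); stack.append("b")
stack.pop()                      # "b"

queue = deque()                  # FIFO with deque: O(1) appends/pops at both ends
queue.append("job1"); queue.append("job2")
queue.popleft()                  # "job1"
```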

🔷 FILE HANDLING (66–71)

66. How to read a file?

```python
# Each call below is an alternative (read() consumes the whole file):
with open("file.txt", "r") as f:
    content = f.read()        # entire file as string
    # lines = f.readlines()   # list of lines
    # line = f.readline()     # one line at a time
```

67. How to write to a file?

```python
with open("file.txt", "w") as f:
    f.write("Hello\n")
with open("file.txt", "a") as f:
    f.write("Appended line\n")
```

68. File modes? r = read (default), w = write (overwrites), a = append, x = create (fails if exists), b = binary mode (rb, wb), + = read+write (r+, w+).

69. What is with open()? Context manager for file operations. Automatically closes file when block exits — even if exception occurs. Cleaner than manual try-finally with f.close().

70. CSV file handling?

```python
import csv
# Read
with open("file.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"])
# Write
with open("out.csv","w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name","age"])
    writer.writeheader()
    writer.writerow({"name":"Alice","age":30})
```

71. JSON file handling?

```python
import json
# Read JSON
with open("data.json") as f:
    data = json.load(f)              # file → dict
obj = json.loads('{"key":"val"}')    # string → dict
# Write JSON
with open("out.json","w") as f:
    json.dump(data, f, indent=2)     # dict → file
s = json.dumps(data)                 # dict → string
```

🔷 MODULES & PACKAGES (72–75)

72. What are modules? A Python file (.py) containing functions, classes, and variables. Promotes code reuse and organization. Standard library has 200+ built-in modules (os, sys, math, datetime, re, json).

73. What is import? Loads a module into current namespace. import math, from math import sqrt, from math import *, import numpy as np. Python caches modules — imported once per session.

74. What is __name__ == "__main__"? __name__ is "__main__" when script is run directly, but the module name when imported. The guard if __name__ == "__main__": ensures code only runs when script is executed directly, not when imported.
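
A typical layout (the module name is illustrative):

```python
# my_module.py
def main():
    print("Running as a script")

if __name__ == "__main__":   # True only when executed directly, not when imported
    main()
```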

75. What are packages? A directory of modules with an __init__.py file (Python 3.3+ also supports implicit namespace packages). Enables hierarchical module organization. import mypackage.mymodule.

🔷 OOP (76–85)

76. What is OOP? Object-Oriented Programming — a paradigm organizing code around objects (data + behavior). Four pillars: Encapsulation, Inheritance, Polymorphism, Abstraction. Python supports OOP natively.

77. What is a class? A blueprint/template for creating objects. Defines attributes (data) and methods (behavior). Created with class keyword.

```python
class Dog:
    species = "Canis"          # class attribute
    def __init__(self, name):
        self.name = name       # instance attribute
```

78. What is an object? An instance of a class — a concrete realization of the blueprint with its own attribute values. dog = Dog("Rex") creates an object. Objects have identity, state, and behavior.

79. What is constructor __init__? A special method called automatically when an object is created. Initializes instance attributes. self refers to the current instance.

```python
def __init__(self, name, age):
    self.name = name
    self.age = age
```

80. What is inheritance? A child class inheriting attributes and methods from a parent class. Promotes code reuse. Child can override parent methods. super() calls parent’s method.

```python
class Animal:
    def speak(self): return "..."
class Dog(Animal):
    def speak(self): return "Woof"
```

81. Types of inheritance? Single (one parent), Multiple (multiple parents class C(A, B)), Multilevel (A→B→C), Hierarchical (one parent, multiple children), Hybrid (combination). Python supports all via MRO (Method Resolution Order).

82. What is polymorphism? Same interface, different behavior. A method with the same name behaves differently for different classes. Achieved via method overriding and duck typing in Python.

```python
for animal in [Dog(), Cat(), Bird()]:
    animal.speak()  # each calls its own speak()
```

83. What is encapsulation? Bundling data and methods together and restricting direct access to internal state. Python uses naming conventions: _protected (convention), __private (name mangling to _ClassName__attr).
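
A small example of both conventions (class and attribute names are illustrative):

```python
class Account:
    def __init__(self, owner, balance):
        self._owner = owner          # protected by convention only
        self.__balance = balance     # name-mangled to _Account__balance

    def deposit(self, amount):
        if amount > 0:
            self.__balance += amount

    def get_balance(self):
        return self.__balance

acct = Account("Alice", 100)
acct.deposit(50)
acct.get_balance()     # 150; acct.__balance raises AttributeError
```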

84. What is abstraction? Hiding implementation details and exposing only necessary interfaces. Achieved using abstract base classes (ABC) from abc module.

```python
from abc import ABC, abstractmethod
class Shape(ABC):
    @abstractmethod
    def area(self): pass
```

85. Method overloading vs overriding? Overloading = same method name, different parameters (Python doesn’t support natively — use default args or *args). Overriding = child class redefines parent’s method with same signature — true polymorphism.
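
A brief illustration of both ideas (names are illustrative):

```python
class Shape:
    def area(self):
        return 0

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):                          # overriding: same name and signature, new behavior
        return 3.14159 * self.r ** 2

# "Overloading" is simulated with default and variable arguments:
def greet(name, greeting="Hello", *extras):
    return f"{greeting}, {name}" + "!" * len(extras)
```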


🔷 ADVANCED PYTHON (86–100)

86. What is exception handling? Mechanism to gracefully handle runtime errors without crashing.

```python
try:
    result = 10 / 0
except ZeroDivisionError as e:
    print(f"Error: {e}")
except (TypeError, ValueError):
    pass
else:
    print("No error")     # runs if no exception
finally:
    print("Always runs")  # cleanup
```

87. Types of exceptions? ZeroDivisionError, TypeError, ValueError, IndexError, KeyError, AttributeError, NameError, FileNotFoundError, ImportError, StopIteration, OverflowError, MemoryError, RecursionError. All inherit from BaseException.

88. try-except-finally? try = code that might raise exception. except = handle specific exceptions. else = runs if no exception. finally = always executes (cleanup: closing files, DB connections). Can raise custom exceptions with raise.

89. What is a generator? A function that yields values one at a time using yield instead of returning all at once. Memory-efficient for large sequences. Returns a generator object (lazy evaluation).

```python
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
```

90. Generator vs iterator? All generators are iterators but not vice versa. Generator = function with yield, auto-implements __iter__ and __next__. Iterator = any object with __iter__() and __next__(). Generator is simpler to create.

91. What is iterator? An object implementing __iter__() (returns self) and __next__() (returns next value, raises StopIteration when exhausted). Lists, dicts, strings are iterables but not iterators — call iter() on them.

92. What is yield? A keyword that pauses a generator function, returns a value, and saves state. Next call resumes from where it paused. Unlike return, function state is preserved between calls.

```python
def counter(n):
    for i in range(n):
        yield i          # pauses here each time
```

93. What is a decorator? A function that wraps another function to extend its behavior without modifying its code. Uses @ syntax. Fundamental to Python frameworks (Flask routes, Django views).

```python
import time

def timer(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"Time: {time.time()-start:.4f}s")
        return result
    return wrapper

@timer
def my_function():
    pass
```

94. What is closure? A function that remembers variables from its enclosing scope even after that scope has finished executing. Requires a nested function that references outer variables.

```python
def multiplier(n):
    def multiply(x):
        return x * n    # n from enclosing scope
    return multiply
double = multiplier(2)
double(5)   # 10
```

95. What is multithreading? Running multiple threads (lightweight, shared memory) concurrently in the same process. Python’s threading module. Good for I/O-bound tasks (file, network) — limited for CPU-bound due to GIL.

```python
import threading
t = threading.Thread(target=my_function, args=(arg1,))
t.start(); t.join()
```

96. What is multiprocessing? Running multiple processes with separate memory spaces to achieve true parallelism. Python’s multiprocessing module. Bypasses GIL — ideal for CPU-bound tasks.

```python
from multiprocessing import Pool
with Pool(4) as p:
    results = p.map(func, data_list)
```

97. What is GIL? Global Interpreter Lock — a mutex in CPython that allows only one thread to execute Python bytecode at a time. Prevents race conditions but limits true multi-core CPU parallelism. GIL doesn’t affect multiprocessing or I/O-bound threads.

98. What is memory management in Python? Python uses private heap for all objects. Memory manager handles allocation. Reference counting is primary mechanism. Cyclic garbage collector handles circular references. Memory pooling via pymalloc for small objects.

99. What is garbage collection? Automatic memory reclamation. Python uses reference counting (object deleted when count = 0) + cyclic GC (gc module) for circular references. gc.collect() triggers manual collection.

```python
import gc
gc.collect()           # force collection
gc.get_count()         # current collection counts
```

100. What is virtual environment? An isolated Python environment with its own interpreter and packages — prevents dependency conflicts between projects.

```bash
python -m venv myenv          # create
source myenv/bin/activate     # activate (Linux/Mac)
myenv\Scripts\activate        # activate (Windows)
pip install package           # install in isolation
deactivate                    # exit venv
```

Tools: venv (built-in), virtualenv, conda, pipenv, poetry.

DataFlow Interview Questions & Answers

Here are concise answers to all Dataflow questions:


🔷 FUNDAMENTALS (1–25)

1. What is Google Cloud Dataflow? A fully managed, serverless data processing service on GCP for executing Apache Beam pipelines. Handles provisioning, scaling, and fault tolerance automatically. Supports both batch and streaming workloads.

2. What is Apache Beam? An open-source unified programming model for defining both batch and streaming data processing pipelines. Write once, run anywhere — on Dataflow, Spark, Flink, or local runners.

3. Difference between Dataflow and Beam? Apache Beam = the programming model/SDK used to write pipelines (the code). Dataflow = the managed execution engine/runner on GCP that runs Beam pipelines. Beam is portable; Dataflow is one of its runners.

4. What is a pipeline in Dataflow? A directed acyclic graph (DAG) of data processing steps. Reads from a source, applies transformations, and writes to a sink. Defined using Beam SDK and executed on a runner.

5. What is PCollection? Pipeline Collection — the fundamental data abstraction in Beam. An immutable, distributed dataset of any type. Can be bounded (batch) or unbounded (streaming). Every transform consumes and produces PCollections.

6. What is PTransform? Pipeline Transform — an operation applied to a PCollection to produce one or more PCollections. Examples: ParDo, GroupByKey, Combine, Flatten, Partition. The building blocks of a Beam pipeline.
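
A minimal Beam pipeline sketch showing PCollections flowing through PTransforms (runs on the local DirectRunner; the element values and output path are illustrative):

```python
import apache_beam as beam

with beam.Pipeline() as p:                                       # DirectRunner by default
    (
        p
        | "Create" >> beam.Create(["alice,30", "bob,25"])        # source -> PCollection
        | "Parse"  >> beam.Map(lambda line: line.split(","))     # PTransform -> new PCollection
        | "Adults" >> beam.Filter(lambda rec: int(rec[1]) >= 18)
        | "Write"  >> beam.io.WriteToText("/tmp/people")         # sink
    )
```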

7. What is a pipeline runner? The execution engine that runs a Beam pipeline. Options: DirectRunner (local testing), DataflowRunner (GCP), SparkRunner (Apache Spark), FlinkRunner (Apache Flink). Runner is specified in pipeline options.

8. What is batch processing? Processing a finite, bounded dataset — all data is available upfront. Pipeline reads, transforms, and writes the complete dataset. Typically used for historical data processing, ETL jobs.

9. What is stream processing? Processing an infinite, unbounded dataset in real time as data arrives. Pipeline runs continuously reading from sources like Pub/Sub. Requires windowing and triggers to produce results.

10. What is the unified model in Beam? Beam’s key innovation — the same pipeline code handles both batch and streaming by abstracting data as PCollections (bounded or unbounded) with the same transforms, windowing, and triggers applied uniformly.

11. What are sources and sinks? Source = where data enters the pipeline (Pub/Sub, GCS, BigQuery, Kafka). Sink = where data is written after processing (BigQuery, GCS, Bigtable, Spanner). Defined using ReadFromX and WriteToX transforms.

12. What is a pipeline graph? The DAG visualization of a Beam pipeline showing all transforms, PCollections, and their dependencies. Visible in Dataflow UI — helps understand data flow and identify bottlenecks.

13. What is DoFn? A “Do Function” — a user-defined class containing the processing logic applied to each element in a PCollection. The core of ParDo transforms. Override process() method to define per-element logic.

14. What is ParDo? Parallel Do — the most general and widely used transform. Applies a DoFn to each element in a PCollection in parallel across workers. Equivalent to a distributed map operation with side inputs/outputs support.
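
A short ParDo/DoFn sketch (the element values are illustrative):

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    def process(self, element):        # called once per element; may yield zero or more outputs
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    words = (
        p
        | beam.Create(["hello world", "beam on dataflow"])
        | beam.ParDo(SplitWords())
    )
```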

15. What is Map vs FlatMap? Map = applies a function to each element, produces exactly one output per input (1-to-1). FlatMap = applies a function that returns zero or more outputs per input (1-to-many). Both are simplified wrappers over ParDo.

16. What is filter transform? Applies a boolean function to each element — keeps elements where function returns True, discards others. beam.Filter(lambda x: x > 0). Equivalent to a ParDo that conditionally yields elements.

17. What is GroupByKey? Groups all values with the same key in a key-value PCollection. Input: (key, value) pairs → Output: (key, Iterable[values]). Triggers a shuffle operation — expensive, use Combine where possible.

18. What is Combine? An efficient aggregation transform that reduces elements using an associative, commutative function. More efficient than GroupByKey+map because it uses combiner lifting (partial aggregation before shuffle). Examples: sum, count, max.

19. What is pipeline options? Configuration parameters for a pipeline execution: runner type, project ID, region, temp location, job name, worker type, max workers, service account. Passed via command-line or PipelineOptions class.

20. What is a Dataflow job? A unit of execution on Dataflow — one run of a Beam pipeline. Has a unique job ID, status (Running, Succeeded, Failed, Cancelled), execution graph, metrics, and logs viewable in the Dataflow UI.

21. What is SDK in Beam? Software Development Kit — libraries for writing Beam pipelines in a specific language. Provides PCollection, PTransform, DoFn, windowing APIs. Available for Python, Java, and Go.

22. Supported languages in Beam? Java (most mature), Python (most popular for data engineering), Go (growing support). Each has its own SDK with equivalent APIs. Cross-language transforms allow using Java transforms from Python pipelines.

23. What is pipeline lifecycle? Define pipeline → Apply transforms (build DAG) → Submit to runner → Runner optimizes (fusion, combiner lifting) → Workers execute → Results written to sink → Job completes or runs continuously (streaming).

24. What is a worker in Dataflow? A Compute Engine VM that executes pipeline tasks. Workers receive work bundles from the Dataflow service, process data, and return results. Dataflow manages worker lifecycle, autoscaling, and fault tolerance.

25. What is autoscaling? Dataflow automatically adds or removes workers based on pipeline backlog and throughput. Streaming autoscaling adjusts continuously. Batch autoscaling provisions optimal workers for job completion. No manual intervention needed.

🔷 INTERMEDIATE CONCEPTS (26–50)

26. What is windowing in Dataflow? Dividing an unbounded PCollection into finite, logical groups (windows) based on timestamps for aggregation. Essential for streaming — allows computing results over time intervals rather than all data at once.

27. Types of windows? Fixed (tumbling) windows, Sliding windows, Session windows, Global window (default — all data in one window). Custom windows also supported via WindowFn interface.

28. What is fixed window? Non-overlapping, equal-duration time intervals. Each element belongs to exactly one window. beam.WindowInto(FixedWindows(60)) = 1-minute windows. Simple, most common for periodic aggregations.

29. What is sliding window? Overlapping windows defined by size and slide interval. An element can belong to multiple windows. SlidingWindows(size=60, period=30) = 60-sec windows every 30 sec. Used for moving averages.

30. What is session window? Dynamic windows based on activity gaps — groups events with no gap longer than a specified timeout. Window closes after inactivity. Sessions(gap_size=300) = closes after 5 min of inactivity. Used for user session analysis.
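
A sketch of the three window types applied to an existing PCollection `pcoll` (assumed to carry event timestamps; durations are in seconds):

```python
import apache_beam as beam
from apache_beam import window

fixed    = pcoll | beam.WindowInto(window.FixedWindows(60))                    # 1-minute tumbling windows
sliding  = pcoll | beam.WindowInto(window.SlidingWindows(size=60, period=30))  # 60s windows every 30s
sessions = pcoll | beam.WindowInto(window.Sessions(gap_size=300))              # close after 5 min inactivity
```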

31. What is a trigger? A mechanism that determines when to emit results for a window. By default, Beam fires when the watermark passes the window end. Triggers allow firing early (for low latency) or late (for completeness).

32. Types of triggers? AfterWatermark (default — fires at window end), AfterProcessingTime (fires after processing time delay), AfterCount (fires after N elements), Repeatedly (fires multiple times), AfterAny/AfterAll (composite triggers).

33. What is watermark? An estimate of how far behind real-time the event data is. Represents “all data up to this timestamp has been received.” Dataflow advances the watermark as it processes data. When watermark passes window end → window is considered complete.

34. What is event time vs processing time? Event time = when the event actually occurred (embedded in data). Processing time = when the event is processed by the pipeline. The difference is skew/latency. Beam windowing operates on event time for correctness.

35. What is allowed lateness? The duration after a window closes that late-arriving data is still accepted and used to update results. beam.WindowInto(..., allowed_lateness=Duration(seconds=3600)). Data arriving after this is dropped.

36. What is accumulation mode? How results are updated when a window fires multiple times (with repeated triggers). ACCUMULATING = new pane contains all data seen so far (cumulative). DISCARDING = new pane contains only data since last firing (delta).

37. What is side input? Additional data injected into a DoFn from another PCollection — useful for lookup tables or reference data. Read as a singleton or iterable inside process(). Updated dynamically in streaming if the side input changes.

38. What is side output? A mechanism to emit additional output PCollections from a single DoFn beyond the main output. Use TupleTag to tag and separate outputs — e.g., routing valid records to main output and errors to error output.
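
A sketch of tagged side outputs in the Python SDK (`records` and the field names are illustrative assumptions):

```python
import apache_beam as beam
from apache_beam import pvalue

class ValidateRecord(beam.DoFn):
    def process(self, record):
        if record.get("id") is not None:
            yield record                                    # main output: valid records
        else:
            yield pvalue.TaggedOutput("errors", record)     # side output: bad records

results = records | beam.ParDo(ValidateRecord()).with_outputs("errors", main="valid")
valid, errors = results.valid, results.errors
```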

39. What is CoGroupByKey? Joins multiple key-value PCollections by key — like a SQL JOIN. Takes two or more PCollections of (k, v) pairs and outputs (k, CoGbkResult) containing iterables of values from each input for that key.

40. What is Flatten transform? Merges multiple PCollections of the same type into a single PCollection. Like a SQL UNION ALL — combines all elements from input collections without deduplication. (pc1, pc2, pc3) | beam.Flatten().

41. What is Partition transform? Splits a single PCollection into multiple PCollections based on a partitioning function. Each element is assigned to exactly one partition. beam.Partition(partition_fn, num_partitions).

42. What is Reshuffle? A transform that redistributes data evenly across workers, breaking fusion and enabling checkpoint recovery. Used to improve parallelism after a bottleneck step or prevent fusion when checkpointing is desired.

43. What is fusion in Dataflow? An optimization where Dataflow merges consecutive transforms into a single execution stage to minimize data serialization and network transfer. Can cause issues if an intermediate step needs checkpointing — use Reshuffle to break fusion.

44. What is pipeline optimization? Dataflow automatically applies: combiner lifting, fusion, work rebalancing, and streaming engine optimizations. Manual: minimize GroupByKey, use Combine, avoid large side inputs, prefer primitives over complex DoFns.

45. What is dynamic work rebalancing? Dataflow splits in-progress work bundles at runtime and redistributes them to idle workers. Prevents slow workers (stragglers) from delaying job completion. Unique to Dataflow — significantly improves batch job performance.

46. What is the hot key problem? When data is heavily skewed around one or few keys, GroupByKey or Combine assigns disproportionately large workloads to single workers — causing bottlenecks. Detected in Dataflow UI as hot key warnings.

47. What is combiner lifting? An optimization where partial aggregation (combining) happens locally on each worker before the shuffle, reducing data transferred during GroupByKey. Only works with associative+commutative combine functions.

48. What is parallelism in Dataflow? The degree to which pipeline steps execute concurrently across workers. Controlled by number of workers, key distribution, and partition count. GroupByKey limits parallelism to number of distinct keys.

49. What is data skew? Unequal distribution of data across keys or partitions causing some workers to process much more data than others. Results in slow jobs and hot worker nodes. Mitigated by salting keys, hot key fanout, or custom partitioning.

50. What is shuffle operation? Data redistribution across workers by key, required by GroupByKey and CoGroupByKey. Involves serialization, network transfer, and deserialization — expensive. Dataflow Shuffle service offloads this to managed infrastructure.

🔷 ADVANCED CONCEPTS (51–75)

51. What is Streaming Engine? A Dataflow backend optimization that offloads windowing state and timer management from worker VMs to a managed streaming service. Reduces worker memory usage, improves autoscaling, and lowers cost. Enable with --enable_streaming_engine.

52. What is Dataflow Shuffle? A managed, backend shuffle service for batch pipelines that moves shuffle data off worker VMs to Google’s infrastructure. Reduces worker disk/memory needs, speeds up jobs, and improves autoscaling. Enable with --experiments=shuffle_mode=service.

53. What is exactly-once processing? Guarantees each record is processed and written exactly once — no duplicates, no data loss. Dataflow provides exactly-once semantics for Pub/Sub-to-BigQuery streaming using deduplication and checkpointing.

54. What is at-least-once processing? Guarantees each record is processed at least once but may be processed multiple times (duplicates possible). Faster than exactly-once. Requires idempotent sinks or downstream deduplication.

55. What is checkpointing? Saving pipeline state and processing progress to durable storage so pipeline can resume from last checkpoint after worker failure. Essential for fault tolerance in streaming pipelines. Beam checkpoints at bundle boundaries.

56. What is stateful processing? Maintaining per-key state across elements within a window. Using Beam’s State API (ValueState, BagState, MapState, CombiningState) in a DoFn to accumulate, track, or deduplicate data per key.

57. What is Timer in Beam? A mechanism in stateful DoFns to trigger processing at a specific event time or processing time. @on_timer decorator fires logic at a scheduled time — used for timeouts, session expiry, or delayed aggregations.

58. What is State API? Beam’s per-key, per-window state storage within DoFns. Types: ValueState (single value), BagState (list of values), MapState (key-value map), CombiningState (accumulated value). Enables complex stateful streaming logic.
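
A minimal stateful DoFn sketch (Python SDK) that deduplicates elements per key using ReadModifyWriteStateSpec; in production you would typically also register a timer (@on_timer) to clear the state after a TTL. Names and sample data are illustrative.

```python
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class DedupPerKey(beam.DoFn):
    """Emits only the first element seen for each key (per window)."""
    SEEN = ReadModifyWriteStateSpec("seen", BooleanCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        if not seen.read():      # state is None the first time this key appears
            seen.write(True)
            yield element        # later duplicates of this key are dropped

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("id-1", "a"), ("id-1", "a"), ("id-2", "b")])
        | beam.ParDo(DedupPerKey())
        | beam.Map(print)        # ('id-1', 'a') and ('id-2', 'b'), each once
    )
```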

59. What is Splittable DoFn (SDF)? A DoFn that can process a large input element by splitting it into smaller work units (restrictions) that are processed in parallel. Enables scalable I/O connectors — used in file reads, Kafka, database connectors.

60. What is dynamic splitting? During execution, Dataflow splits a work bundle being processed by a slow worker and reassigns the remaining work to another worker. Enabled by SDF — improves utilization and reduces straggler impact.

61. What is drain vs cancel job? Drain = graceful shutdown — stops reading new input, finishes processing in-flight data, flushes windows and writes results. Cancel = immediate termination — discards in-flight data. Use Drain for streaming to avoid data loss.

62. What is update job? Replacing a running streaming Dataflow job with a new version without stopping it — preserves state and processing position. Job must be compatible (same pipeline shape). --update --job_name=existing_job_name.

63. What is a template in Dataflow? A packaged, reusable Dataflow job stored in GCS. Allows running pipelines without a development environment — execute via Dataflow UI, REST API, or Cloud Composer. Two types: Classic and Flex templates.

64. What is Flex Template? A modern Dataflow template packaged as a Docker container image stored in Artifact Registry. Supports dynamic pipeline construction, custom dependencies, and runtime parameter injection. More flexible than classic templates.

65. What is a Classic Template? A legacy Dataflow template where the pipeline graph is pre-built and stored as a JSON file in GCS. Parameters are injected at runtime but the DAG is fixed at template creation time. Being superseded by Flex Templates.

66. What is worker harness? The process on each Dataflow worker VM that manages SDK execution, receives work bundles from the Dataflow service, and coordinates with the SDK harness. Acts as the bridge between Dataflow infrastructure and user code.

67. What is SDK harness? The process running user-defined Beam code (DoFns, transforms) on workers. Communicates with the worker harness via gRPC. In containerized execution, runs in a separate Docker container from infrastructure code.

68. What is container-based execution? Each Dataflow worker runs user code in a Docker container (custom or Google-provided). Allows custom dependencies, libraries, and runtime environments. Essential for Flex Templates and cross-language pipelines.

69. What is Dataflow Prime? Next-generation Dataflow execution environment with improved autoscaling (vertical and horizontal), right-fitting (automatically selects optimal machine type), and better performance for complex pipelines. Enable with --dataflow_service_options=enable_prime.

70. What is resource autoscaling? Dataflow Prime’s ability to scale both the number of workers (horizontal) and machine size (vertical) dynamically based on actual workload requirements — reducing over-provisioning and cost.

71. What is horizontal vs vertical scaling? Horizontal = adding more worker VMs (more parallelism). Vertical = using larger VMs with more CPU/memory (handles skewed or memory-intensive tasks). Dataflow Prime supports both automatically.

72. What is cost optimization in Dataflow? Use streaming engine and Dataflow Shuffle (reduces VM resources), right-size workers, use preemptible/spot VMs for batch, minimize GroupByKey (use Combine), enable autoscaling, use Flex RS (flexible resource scheduling) for non-urgent batch.

73. What is worker type? The Compute Engine machine type used for Dataflow workers. Defaults depend on the job: n1-standard-1 for batch, n1-standard-2 for streaming with Streaming Engine, n1-standard-4 for streaming without it. Choose based on workload: memory-optimized (n1-highmem) for large side inputs, compute-optimized for CPU-heavy transforms.

74. What is custom machine type? Specifying exact vCPU and memory configuration for Dataflow workers beyond standard machine types. Allows right-sizing for workload-specific needs using --machine_type=n2-custom-8-16384.

75. What is service account in Dataflow? The identity Dataflow workers use to access GCP resources (GCS, BigQuery, Pub/Sub). Should follow least-privilege principle. Separate controller service account (Dataflow service) and worker service account (user code) recommended.

🔷 GCP INTEGRATIONS (76–90)

76. How does Dataflow integrate with Google Cloud Storage? GCS is used as the temp/staging location for job files, as a source (ReadFromText, ReadFromAvro) and sink (WriteToText, WriteToParquet). Also stores templates and worker boot images. Native, high-throughput I/O via the GCS connector.

77. How does Dataflow read from BigQuery? Using the ReadFromBigQuery transform — supports direct read (BigQuery Storage Read API, fast parallel reads), export (exports to GCS then reads), and SQL queries. The Storage Read API is preferred for large-scale reads.

78. How does Dataflow write to BigQuery? Using the WriteToBigQuery transform — supports streaming inserts (real-time, higher cost, at-least-once with best-effort deduplication), batch load (via GCS staging, lower cost), and the Storage Write API (high-throughput, exactly-once for streaming).

79. What is Pub/Sub? Google Cloud’s managed message streaming service. Decouples producers and consumers. Supports push/pull delivery. Used as the primary unbounded streaming source for Dataflow streaming pipelines.

80. How does Dataflow read streaming data from Pub/Sub? Using ReadFromPubSub(topic=...) or ReadFromPubSub(subscription=...). Returns an unbounded PCollection of messages. Dataflow handles checkpointing and acknowledgment automatically with exactly-once semantics.

81. What is a Pub/Sub subscription? A named resource representing a stream of messages from a Pub/Sub topic. Dataflow attaches to a subscription to consume messages. Reading from a topic makes Dataflow create its own temporary subscription; reading from an existing subscription gives control over retention, seek, and reuse across job restarts.

82. What is Pub/Sub acknowledgment? Confirming to Pub/Sub that a message has been successfully processed so it’s not redelivered. Dataflow acknowledges messages after they are committed to sinks. Unacknowledged messages are redelivered after ack deadline.

83. What is dead-letter topic? A secondary Pub/Sub topic where unprocessable or failed messages are routed instead of blocking the pipeline. Implemented using side outputs in DoFns — failed records go to DLT for investigation and reprocessing.

84. What is BigQuery streaming insert? Inserting rows into BigQuery in real time using the insertAll API (tabledata.insertAll). Low latency (seconds), but higher cost and has quota limits. Use Storage Write API for higher throughput and exactly-once.

85. What is BigQuery batch load? Loading data into BigQuery via bulk import from GCS (CSV, JSON, Avro, Parquet). Lower cost than streaming inserts, higher latency. Dataflow uses this via WriteToBigQuery with FILE_LOADS method and GCS staging.

86. What is schema handling in Dataflow? Defining the structure of data (column names, types) when reading/writing to structured sources. For BigQuery, schema can be auto-detected, passed as JSON string, or defined as TableSchema object in the pipeline.

87. What is schema evolution? Handling changes to data structure over time (new fields added, types changed). In Dataflow: use flexible schema options in BigQuery (ignore_unknown_values=True), handle in DoFns with try/except, or use schema registry with Avro/Protobuf.

88. How to handle JSON data in Dataflow? Read as strings from GCS/Pub/Sub → parse with json.loads() in a DoFn → extract fields → process. Use beam.Map(json.loads) for simple parsing. Handle malformed JSON in try/except and route to dead-letter.

89. How to handle Avro/Parquet files? Use ReadFromAvro / WriteToAvro or ReadFromParquet / WriteToParquet transforms. Schema is embedded in Avro files. For Parquet, specify schema via pyarrow. Both support efficient columnar reads with predicate pushdown.

90. How to connect Dataflow with Cloud Composer? Use DataflowCreatePythonJobOperator or DataflowTemplatedJobStartOperator in Airflow DAGs. Composer triggers Dataflow jobs, monitors status using sensors (DataflowJobStatusSensor), and handles dependencies with other GCP services.

🔷 SCENARIO-BASED / PRACTICAL (91–100)

91. How do you design a real-time streaming pipeline? Pub/Sub → ReadFromPubSub → parse/validate DoFn (route errors to dead-letter topic) → apply windowing (fixed/session) → aggregate/enrich → WriteToBigQuery (Storage Write API) + WriteToGCS for archival. Enable Streaming Engine, use exactly-once mode.
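
A skeleton of that design in the Python SDK. The project, subscription, topic, table, and field names are placeholders; the wiring (parse with dead-letter routing, windowing, aggregation, BigQuery sink) follows the flow described above.

```python
import json
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def parse(msg):
    try:
        yield json.loads(msg.decode("utf-8"))
    except ValueError:
        yield pvalue.TaggedOutput("errors", msg)   # keep the raw bytes for the DLQ

def run():
    opts = PipelineOptions(streaming=True)         # plus --runner, --project, --region, ...
    with beam.Pipeline(options=opts) as p:
        parsed = (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/MY_PROJECT/subscriptions/MY_SUB")
            | "Parse" >> beam.FlatMap(parse).with_outputs("errors", main="valid")
        )
        (
            parsed.valid
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
            | "KeyByUser" >> beam.Map(lambda r: (r["user_id"], 1))
            | "CountPerUser" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "events": kv[1]})
            | "WriteBQ" >> beam.io.WriteToBigQuery(   # the `method` arg can select the Storage Write API
                "MY_PROJECT:analytics.events_per_minute",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )
        parsed.errors | "DLQ" >> beam.io.WriteToPubSub(
            topic="projects/MY_PROJECT/topics/dead-letter")

if __name__ == "__main__":
    run()
```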

92. How do you handle late arriving data? Set allowed_lateness on windows to accept late data. Use AfterWatermark trigger with AfterProcessingTime as late trigger. Choose ACCUMULATING mode to include late data in updated results. Monitor watermark lag in Dataflow UI.

93. How do you handle duplicate records in streaming? Use stateful DoFn with SetState or BagState to track seen message IDs per key. Or use Pub/Sub message IDs with BigQuery’s insertId for deduplication. Storage Write API with exactly-once mode eliminates duplicates at the sink.

94. How do you optimize Dataflow job cost? Enable Dataflow Shuffle and Streaming Engine, use preemptible VMs for batch workers, enable autoscaling with appropriate min/max workers, use Flex RS for flexible scheduling, minimize GroupByKey, use Combine, and right-size machine types.

95. How do you debug a failed Dataflow job? Check Dataflow UI job graph for failed step (red node) → view step logs for error message → check Stackdriver/Cloud Logging for worker errors → reproduce locally with DirectRunner → add logging in DoFns → check for data type mismatches or NullPointerExceptions.

96. How do you monitor Dataflow pipelines? Use Dataflow UI (job graph, metrics, logs), Cloud Monitoring (custom dashboards, pipeline metrics), Cloud Logging (worker/job logs via logging module), set up alerts on job state changes and lag metrics, and use beam_metrics for custom counters.

97. How do you handle data skew in GroupByKey? Add random salt to keys before GroupByKey to distribute load, then remove salt after aggregation (two-phase aggregation). Use CombinePerKey with combiner lifting instead. Use hot key fanout feature in Dataflow for known hot keys.
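
A sketch of the two-phase (salted) aggregation described above. CombinePerKey already performs combiner lifting, so salting matters most for GroupByKey-style aggregations; it is used here only to keep the example short. NUM_SALTS and the sample data are illustrative.

```python
import random
import apache_beam as beam

NUM_SALTS = 10   # fan-out factor; tune to the observed skew

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("hot_key", 1)] * 1000 + [("cold_key", 1)] * 5)
        # Phase 1: spread each key across NUM_SALTS sub-keys so no single
        # worker receives all values for the hot key.
        | "Salt" >> beam.Map(lambda kv: ((kv[0], random.randrange(NUM_SALTS)), kv[1]))
        | "PartialSum" >> beam.CombinePerKey(sum)
        # Phase 2: strip the salt and combine the (much smaller) partial results.
        | "Unsalt" >> beam.Map(lambda kv: (kv[0][0], kv[1]))
        | "FinalSum" >> beam.CombinePerKey(sum)
        | beam.Map(print)    # ('hot_key', 1000), ('cold_key', 5)
    )
```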

98. How do you design ETL using Dataflow? Extract: ReadFromBigQuery/ReadFromGCS/ReadFromPubSub → Transform: validate, clean, enrich, join (CoGroupByKey with reference data via side inputs), aggregate → Load: WriteToBigQuery/WriteToGCS/WriteToSpanner. Parameterize via pipeline options and use Flex Templates for reusability.

99. How do you migrate batch pipeline to streaming? Replace bounded source with unbounded (ReadFromPubSub instead of ReadFromGCS) → add windowing strategy (FixedWindows) → replace batch triggers with streaming triggers (AfterWatermark) → switch sink to streaming-compatible (WriteToBigQuery with streaming) → enable Streaming Engine → test watermark and late data handling.

100. How do you ensure fault tolerance in Dataflow? Dataflow automatically retries failed bundles on worker failures and restarts failed workers. For streaming: enable checkpointing, use exactly-once mode, implement dead-letter queues for poison messages. For batch: enable Dataflow Shuffle for faster recovery. Use drain (not cancel) for graceful shutdown.

BigQuery Interview Questions & Answers



🔷 FUNDAMENTALS (1–25)

1. What is Google BigQuery? A fully managed, serverless, cloud-native enterprise data warehouse on GCP. Designed for large-scale analytical queries using SQL. Separates compute from storage, scales automatically, and charges per query/slot usage.

2. What type of database is BigQuery? An OLAP (Online Analytical Processing) columnar data warehouse — not a transactional (OLTP) database. Optimized for fast analytical reads over massive datasets, not for frequent small writes or row-level updates.

3. What is serverless architecture in BigQuery? No infrastructure to provision, manage, or scale. BigQuery automatically allocates compute resources (slots) per query, scales to petabytes, and handles all maintenance. Pay for queries run, not idle servers.

4. What is a dataset in BigQuery? A logical container for tables, views, and functions within a GCP project. Defines the geographic location of data. Access control is managed at dataset level. Equivalent to a schema/database in traditional RDBMS.

5. What is a table in BigQuery? A structured collection of rows and columns with a defined schema. Can be native (BigQuery-managed storage), external, views, or materialized views. Addressed as project.dataset.table.

6. Difference between dataset and project? Project = top-level GCP billing and access boundary (contains multiple datasets). Dataset = logical grouping of tables within a project (has a location, access controls). Hierarchy: Project → Dataset → Table.

7. What is schema in BigQuery? Defines the structure of a table — column names, data types, and modes (NULLABLE, REQUIRED, REPEATED). Can be specified manually, auto-detected, or inferred from source files.

8. Supported data types? INT64, FLOAT64, NUMERIC, BIGNUMERIC, STRING, BYTES, BOOL, DATE, TIME, DATETIME, TIMESTAMP, GEOGRAPHY, JSON, ARRAY, STRUCT (RECORD). Also INTERVAL and RANGE types.

9. What is nested and repeated fields? Nested = STRUCT fields (records within records, one-to-one). Repeated = ARRAY fields (one-to-many, REPEATED mode). Combination allows storing denormalized, hierarchical data avoiding expensive JOINs — ideal for analytics.

10. What is STRUCT in BigQuery? A container of ordered fields with defined names and types — like a record or object. STRUCT<name STRING, age INT64>. Used to represent nested data. Access with dot notation: record.field.

11. What is ARRAY in BigQuery? An ordered list of zero or more values of the same type. ARRAY<STRING>. Stored as REPEATED fields. Cannot have ARRAYs of ARRAYs directly. Use UNNEST to expand arrays into rows for querying.

12. What is UNNEST? A function that flattens an ARRAY into individual rows. Used in FROM clause with a JOIN or cross join to expand repeated fields for row-level analysis.

 
```sql
SELECT name, item
FROM orders, UNNEST(items) AS item
```

13. Standard SQL vs Legacy SQL? Standard SQL = ANSI-compliant, supports arrays/structs, CTEs, window functions, DML — recommended. Legacy SQL = older BigQuery-specific dialect, uses [project:dataset.table] syntax, limited features. Always use Standard SQL.

14. BigQuery pricing model? Two dimensions: Storage (active: ~$0.02/GB/month, long-term: ~$0.01/GB/month) and Compute (on-demand: roughly $5–6.25 per TiB scanned depending on region and pricing revisions, or capacity/flat-rate: fixed slot commitments). Also charges for streaming inserts and certain other operations.

15. What is on-demand pricing? Pay per query based on bytes scanned (roughly $5–6.25 per TiB, region-dependent). No upfront commitment. Best for variable, unpredictable workloads. Each project gets 1 TiB of query processing free per month. Control cost by limiting bytes scanned (partition pruning, column selection).

16. What is flat-rate pricing? Purchase dedicated slot capacity (in 100-slot increments) for a fixed monthly/annual price. Unlimited queries within the purchased capacity. Best for large, predictable workloads. Supports workload management and slot reservations. For new purchases it has largely been superseded by BigQuery Editions capacity-based pricing.

17. What is a partitioned table? A table divided into segments (partitions) based on a partition column (DATE, TIMESTAMP, INTEGER range) or ingestion time. Queries filtering on partition column scan only relevant partitions — reduces cost and improves speed.

18. What is a clustered table? A table whose data is automatically sorted and co-located based on specified cluster columns (up to 4). Queries filtering or aggregating on cluster columns skip irrelevant blocks — improves performance and reduces bytes scanned.

19. Difference between partitioning and clustering? Partitioning = coarse-grained, divides table into distinct segments, works on one column, enables partition pruning, reduces cost directly. Clustering = fine-grained, sorts data within partitions, works on up to 4 columns, improves scan efficiency within partitions.
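
A short Python sketch (google-cloud-bigquery client) that creates a table partitioned by date and clustered by customer, then runs a query whose filters allow both partition and cluster pruning. Project, dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()   # assumes application-default credentials

table = bigquery.Table(
    "my-project.analytics.events",                        # placeholder table ID
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date")
table.clustering_fields = ["customer_id"]
client.create_table(table)

# Filtering on the partition column limits the scan to one partition;
# filtering on the cluster column lets BigQuery skip blocks inside it.
sql = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.analytics.events`
    WHERE event_date = '2024-01-15' AND customer_id = 'C123'
    GROUP BY customer_id
"""
for row in client.query(sql).result():
    print(row.customer_id, row.total)
```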

20. What is ingestion-time partitioning? Automatically partitions data based on when it was loaded into BigQuery using _PARTITIONTIME pseudo-column. No explicit partition column needed. Good for append-only streaming data where load time ≈ event time.

21. What is columnar storage? BigQuery stores data column-by-column instead of row-by-row. Analytical queries typically access few columns — columnar storage reads only required columns, dramatically reducing I/O and bytes scanned.

22. What is table expiration? Automatic deletion of a table or partition after a specified time. Set at dataset level (default table expiration) or table level. Useful for temporary tables and managing storage costs.

23. What is dataset location? The geographic region where a dataset’s data is stored and processed (e.g., US, EU, asia-northeast1). Location is set at dataset creation and cannot be changed. All tables in a dataset share the same location.

24. What is BigQuery UI? The web-based console in GCP for running queries, exploring datasets, managing tables, viewing job history, and monitoring costs. Also accessible via bq CLI, REST API, client libraries (Python, Java, Go), and Looker Studio.

25. What is a query job? A BigQuery job that executes a SQL query. Has a unique job ID, project, location, status, bytes processed, and slot usage. Jobs are asynchronous — submit and poll for completion. Viewable in Job History.


🔷 ARCHITECTURE & PERFORMANCE (26–45)

26. What is a slot in BigQuery? A unit of computational capacity (CPU, memory, network) used to execute queries. On-demand queries use shared slots automatically. Flat-rate customers purchase dedicated slots. Complex queries use more slots in parallel.

27. What is query execution model? BigQuery decomposes SQL into a distributed execution plan → allocates slots → workers read columnar data from Capacitor storage in parallel → shuffle intermediate results → aggregate → return results. Fully managed, no user configuration needed.

28. What is Dremel architecture? BigQuery’s underlying execution engine. Uses a tree-shaped serving architecture: root server → intermediate servers → leaf servers (read storage). Enables massively parallel SQL execution across thousands of nodes for petabyte-scale queries in seconds.

29. What is shuffle in BigQuery? Data redistribution between query stages — when intermediate results must be regrouped (like after GROUP BY or JOIN). BigQuery’s shuffle is a managed, in-memory distributed shuffle service, much faster than disk-based alternatives.

30. What is broadcast join? A join optimization where a small table is copied (broadcast) to all workers processing the large table — avoids expensive shuffle. BigQuery automatically uses broadcast join when one table is small enough (typically under a few hundred MB).

31. What is hash join? The default join strategy in BigQuery for larger tables. Both tables are hashed by join key and matching rows are co-located on the same worker for joining. More scalable than broadcast but involves shuffle.

32. What is query optimization in BigQuery? Best practices: SELECT only needed columns (avoid SELECT *), filter on partition/cluster columns, use approximate functions, avoid self-joins, pre-aggregate with materialized views, use proper data types, avoid JavaScript UDFs in hot paths.

33. What is a materialized view? A precomputed, cached query result stored as a physical table that auto-refreshes when base table changes. Queries can transparently use materialized views for faster results. Reduces redundant computation for frequent aggregations.

34. What is a view in BigQuery? A virtual table defined by a SQL query — no data stored. Query the view to execute the underlying SQL dynamically. Used for abstraction, access control, and simplifying complex queries. Authorized views share data across projects.

35. Difference between view and materialized view? View = virtual, no storage, always fresh, runs query on access. Materialized view = physical storage, precomputed, auto-refreshed on base table changes, faster to query but slight staleness possible. Choose materialized for frequently run expensive queries.
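
A quick sketch of both DDL statements run through the Python client; dataset and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Logical view: no storage, the SELECT runs every time the view is queried.
client.query("""
    CREATE OR REPLACE VIEW `my-project.analytics.daily_orders_v` AS
    SELECT order_date, COUNT(*) AS orders
    FROM `my-project.analytics.orders`
    GROUP BY order_date
""").result()

# Materialized view: precomputed and incrementally refreshed by BigQuery.
client.query("""
    CREATE MATERIALIZED VIEW `my-project.analytics.daily_orders_mv` AS
    SELECT order_date, COUNT(*) AS orders
    FROM `my-project.analytics.orders`
    GROUP BY order_date
""").result()
```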

36. What is a temporary table? A table created within a script or session that is automatically deleted after 24 hours. Created without a dataset: CREATE TEMP TABLE name AS SELECT .... Used for intermediate results within multi-statement scripts.

37. What is CTE in BigQuery? Common Table Expression — defined with WITH clause, creates a named temporary result set for use within the query. Improves readability and enables recursive queries. BigQuery supports multiple CTEs and recursive CTEs.

38. What is a window function in BigQuery? A function that computes results across a set of rows related to the current row without collapsing the result. Uses OVER(PARTITION BY ... ORDER BY ...). Examples: ROW_NUMBER, RANK, LAG, LEAD, SUM, AVG.

39. What is an analytic function? BigQuery’s term for window functions — compute values across a window of rows. Categories: navigation (LAG, LEAD, FIRST_VALUE, LAST_VALUE), ranking (RANK, DENSE_RANK, ROW_NUMBER), aggregate (SUM, AVG, COUNT with OVER).
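
An illustrative analytic-function query (table and column names are made up), run via the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      dept,
      name,
      salary,
      RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank,
      LAG(salary) OVER (PARTITION BY name ORDER BY month) AS prev_month_salary
    FROM `my-project.hr.salaries`
"""
for row in client.query(sql).result():
    print(row.dept, row.name, row.salary_rank, row.prev_month_salary)
```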

40. What is partition pruning? BigQuery skips reading partitions that don’t match the WHERE clause filter on the partition column. Only scans relevant partitions — dramatically reduces bytes processed and cost. Requires filter to be on the exact partition column.

41. What is clustering pruning? Within partitions, BigQuery skips blocks whose metadata shows they don’t contain values matching the WHERE filter on cluster columns. Reduces bytes scanned within partitions. Effectiveness increases as data volume grows.

42. What is a wildcard table? A way to query multiple tables that share a name prefix using the * wildcard. FROM `project.dataset.table_*` queries all tables whose names start with table_. Use _TABLE_SUFFIX to filter: WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'.

43. What is a table decorator? A syntax for querying a snapshot of a table at a specific time or range (Legacy SQL only). [table@timestamp] or [table@time1-time2]. Standard SQL uses FOR SYSTEM_TIME AS OF for time travel queries instead.

44. What is a federated query? A query that reads data directly from external sources (Cloud SQL, Cloud Spanner, Google Sheets) without loading into BigQuery. Uses EXTERNAL_QUERY() function. Useful for joining BigQuery data with operational databases.

45. What is an external table? A table whose data resides in external storage (GCS, Google Drive, Cloud Bigtable) but is queryable via BigQuery SQL. Schema is defined in BigQuery but data stays in source. No storage charges in BigQuery.


🔷 STORAGE & INGESTION (46–56)

46. What is BigLake table? An evolution of external tables that supports fine-grained access control (row/column level security) on data stored in GCS or other clouds. Unified governance layer across BigQuery, Spark, and other engines via BigLake Metastore.

47. What is streaming insert? Loading data into BigQuery in real time using the insertAll API (Storage Write API recommended). Data available within seconds. Has per-row costs. Use for real-time dashboards and event-driven pipelines.

48. What is batch load? Loading data from GCS, local files, or other sources in bulk using load jobs. Free (no charge for batch loads). Supports CSV, JSON, Avro, Parquet, ORC. Higher latency than streaming but cost-effective for large volumes.
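
The two ingestion paths side by side, sketched with the Python client; the table ID and GCS path are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"                  # placeholder

# Streaming insert (legacy insertAll API): rows are queryable within seconds,
# but every insert is billed and subject to per-table quotas.
errors = client.insert_rows_json(table_id, [{"user_id": "u1", "amount": 10}])
print("streaming insert errors:", errors)

# Batch load from GCS: a free load job, higher latency, suited to bulk data.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",                    # placeholder path
    table_id,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()   # block until the load job completes
```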

49. What are the data ingestion methods? Batch load jobs (from GCS or local files), Storage Write API (streaming/batch), legacy streaming inserts, Dataflow pipelines, BigQuery Data Transfer Service (scheduled and SaaS connectors), federated and external queries, and the bq load CLI command.

50. What is schema auto-detection? BigQuery automatically infers column names and types from source data during load or query of external tables. Convenient but can misdetect types — validate and explicitly define schema in production pipelines.

51. What is schema evolution? Handling changes to table structure over time. BigQuery supports adding columns (backward compatible) and relaxing REQUIRED to NULLABLE. Changing types or removing columns requires table recreation or column aliasing workarounds.

52. What is JSON support in BigQuery? Native JSON data type (2022+) stores semi-structured data without a predefined schema. Use JSON_VALUE() / JSON_QUERY() to extract fields. Also supports loading JSON files and querying STRING columns containing JSON using the JSON functions.

53. What is Avro format? A row-based, schema-embedded binary format. Preferred for BigQuery batch loads because it preserves data types precisely, supports schema evolution, and handles nested/repeated fields natively without type ambiguity.

54. What is Parquet format? A columnar binary format with efficient compression. Excellent for BigQuery loads — preserves types, supports nested structures, and loads efficiently. Preferred when source systems use Parquet (Spark, Databricks, Dataflow).

55. What is ORC format? Optimized Row Columnar — a columnar binary format from the Hive ecosystem. Supported in BigQuery for loads. Similar benefits to Parquet. Use when data originates from Hive or Hadoop ecosystems.

56. What is data compression in BigQuery? BigQuery automatically compresses data in Capacitor storage using columnar compression. For load files, supports GZIP compression for CSV/JSON (reduces transfer size). Avro/Parquet have built-in compression (snappy, deflate).


🔷 SQL FEATURES (57–85)

57. What are approximate aggregation functions? Functions that compute results with a small error margin, much faster and cheaper than exact computation: APPROX_COUNT_DISTINCT() (vs COUNT(DISTINCT ...)), APPROX_QUANTILES(), APPROX_TOP_COUNT(). Use them for exploration and dashboards where ~1% error is acceptable.
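For illustration, a sketch against a hypothetical events table with latency_ms and user_id columns:

```sql
-- Approximate distinct users and an approximate 95th-percentile latency
SELECT
  APPROX_COUNT_DISTINCT(user_id) AS approx_users,
  APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)] AS approx_p95_latency
FROM my_dataset.events;
```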

58. What is SAFE function? A prefix that makes functions return NULL instead of raising errors on invalid input. SAFE.DIVIDE(x, 0) → NULL (vs error). SAFE.PARSE_DATE('%Y-%m-%d', 'invalid') → NULL. Use for data quality resilience.

59. What is ARRAY_AGG? An aggregate function that collects values into an ARRAY. ARRAY_AGG(col ORDER BY col LIMIT 10). Used to re-nest data after unnesting, collect related values per group, or build nested result structures.

60. What is STRUCT query usage?

 
```sql
-- Create a STRUCT
SELECT STRUCT('Alice' AS name, 30 AS age) AS person;
-- Access a STRUCT field
SELECT person.name FROM my_table;
-- Use in aggregation
SELECT ARRAY_AGG(STRUCT(name, salary)) AS employees FROM emp GROUP BY dept;
```

61. What is query execution plan? The breakdown of a query into stages (S00, S01…) showing operations, input/output rows, slot usage, and time per stage. Viewable in BigQuery UI under “Execution Details.” Essential for identifying performance bottlenecks.

62. How to analyze query performance? Check execution plan for: stages with most slot time, shuffle data volume between stages, stages reading excessive bytes, skewed input/output ratios. Use INFORMATION_SCHEMA.JOBS for historical analysis. Check partition/cluster pruning effectiveness.

63. What is slot reservation? In flat-rate pricing, allocating a specific number of slots to a reservation (logical group) and assigning projects/datasets to that reservation. Ensures workloads get guaranteed compute capacity without competing for shared slots.

64. What is workload management? Organizing slot reservations into assignments for different workloads (interactive, batch, BI). Allows setting baseline + autoscale slots, idle slot sharing between reservations, and priority-based compute allocation.

65. What is BI Engine? An in-memory analysis service that accelerates BI queries against BigQuery with sub-second response times. Works with Looker Studio and Connected Sheets. Purchase BI Engine capacity (GB of memory) per project/region.

66. What is BigQuery ML? Enables building and running ML models using SQL directly in BigQuery — no data movement to separate ML platform. Supports linear/logistic regression, k-means, DNN, XGBoost, ARIMA, and importing TensorFlow/Vertex AI models.
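A minimal sketch of training and scoring a model with SQL (table, column, and model names are hypothetical):

```sql
-- Train a logistic regression model on a labeled training table
CREATE OR REPLACE MODEL my_dataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM my_dataset.training_data;

-- Score new rows with the trained model
SELECT *
FROM ML.PREDICT(MODEL my_dataset.churn_model,
                (SELECT * FROM my_dataset.new_customers));
```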

67. What is BigQuery Omni? Extends BigQuery to run queries on data in other clouds (AWS S3, Azure Blob) without moving data. Powered by Anthos. Enables multi-cloud analytics with a single BigQuery interface and unified governance.

68. What is data security in BigQuery? Multi-layered: IAM roles for resource access, column-level security (policy tags), row-level security (row access policies), data masking (policy tags + masking rules), VPC Service Controls, CMEK encryption, and audit logging.

69. What are IAM roles in BigQuery? Predefined roles: bigquery.admin, bigquery.dataOwner, bigquery.dataEditor, bigquery.dataViewer, bigquery.jobUser, bigquery.user. Applied at project, dataset, or table level. Follow the principle of least privilege.

70. What is column-level security? Restricting access to sensitive columns using Policy Tags from Data Catalog taxonomy. Users without the Fine-Grained Reader role on a policy tag see NULL or masked values. Applied per column in table schema.

71. What is row-level security? Restricting which rows a user can see using row access policies. CREATE ROW ACCESS POLICY policy_name ON table GRANT TO ("user:alice@company.com") FILTER USING (region = 'US'). Multiple policies combined with OR logic.

72. What is data masking? Hiding sensitive column values for unauthorized users using masking rules associated with policy tags. Masking types: nullify (return NULL), default value, hash (SHA256), or email masking. Users see masked data, not raw values.

73. What is encryption in BigQuery? All data encrypted at rest (AES-256) and in transit (TLS) by default. Options: Google-managed keys (default), Customer-managed encryption keys (CMEK via Cloud KMS), or Customer-supplied encryption keys (CSEK).

74. What is audit logging? BigQuery logs all data access and admin activities to Cloud Audit Logs. Data Access logs (reads, writes, queries), Admin Activity logs (schema changes, job creation). Essential for compliance, security monitoring, and cost attribution.

75. What is data lineage? Tracking data origin, transformations, and destinations across systems. BigQuery integrates with Dataplex and Data Catalog for automatic lineage capture. Shows which tables/columns feed into other tables via jobs and queries.

76. What is partition filter requirement? Enforcing that queries on a partitioned table must include a filter on the partition column — prevents full table scans. Set with require_partition_filter = TRUE at table creation. Returns error if partition filter is missing.
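A minimal sketch (schema and names are illustrative):

```sql
-- Queries without a filter on event_date will be rejected
CREATE TABLE my_dataset.events
(
  event_date DATE,
  user_id STRING,
  revenue NUMERIC
)
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);
```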

77. What is clustering best practice? Cluster on columns frequently used in WHERE, JOIN, and GROUP BY clauses. Choose high-cardinality columns with good filtering selectivity and put the most selective column first. Combine clustering with partitioning for the best effect; BigQuery re-clusters data automatically in the background at no extra cost.
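For example, a sketch of a partitioned and clustered fact table (the staging table and its columns are hypothetical, with order_ts assumed to be a TIMESTAMP):

```sql
-- Partition by day, then cluster by the most selective filter columns
CREATE TABLE my_dataset.sales_fact
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id, region
AS SELECT * FROM my_dataset.sales_staging;
```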

78. What is cost optimization strategy? Partition and cluster tables, SELECT only needed columns, avoid SELECT *, use partition filters, use materialized views for repeated queries, use approximate functions, schedule queries during off-peak, use flat-rate for predictable workloads, monitor with INFORMATION_SCHEMA.

79. What is query caching? BigQuery automatically caches query results for 24 hours. Identical queries (same SQL, same referenced tables, no DML on tables since last run) return cached results instantly at no charge. Cache can be disabled per query.

80. What is result cache? Same as the query cache — the cached result of a previous query. Stored in a temporary table. Invalidated when referenced tables change, the query includes non-deterministic functions (CURRENT_TIMESTAMP(), RAND()), or tables use streaming inserts.

81. What is temporary function (UDF)? A user-defined function created within a query script that exists only for that query’s duration. Defined with CREATE TEMP FUNCTION. No persistence. Useful for one-time complex logic without creating permanent functions.
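A minimal SQL-based temp UDF sketch (function, table, and column names are hypothetical):

```sql
-- Temporary SQL UDF, visible only within this query script
CREATE TEMP FUNCTION clean_email(email STRING) AS (
  LOWER(TRIM(email))
);

SELECT clean_email(raw_email) AS email
FROM my_dataset.signups;
```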

82. What is JavaScript UDF? A UDF written in JavaScript executed within BigQuery. Useful for complex string manipulation, regex, or logic not expressible in SQL. Slower than SQL UDFs due to JS engine overhead. Avoid in performance-critical paths.

 
```sql
CREATE TEMP FUNCTION parseJson(json_str STRING)
RETURNS STRING LANGUAGE js AS """
  return JSON.parse(json_str).key;
""";
```

83. What is remote function? A BigQuery function that calls an external Cloud Run or Cloud Functions endpoint. Enables calling custom code, external APIs, or ML models from SQL. Configured with a connection and endpoint URL.
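A sketch of the wiring, assuming a Cloud Functions endpoint and a BigQuery connection resource already exist (all names and the URL are placeholders):

```sql
-- The function body runs outside BigQuery, at the configured endpoint
CREATE OR REPLACE FUNCTION my_dataset.enrich_address(addr STRING)
RETURNS STRING
REMOTE WITH CONNECTION `my_project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-my-project.cloudfunctions.net/enrich-address');
```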

84. What is BigQuery scripting? Support for procedural SQL: variables (DECLARE, SET), control flow (IF, WHILE, LOOP, FOR), exception handling (BEGIN...EXCEPTION), and multi-statement transactions. Enables complex ETL logic within BigQuery itself.
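A short procedural sketch combining variables, branching, and a loop:

```sql
-- Declare variables, branch on a condition, and loop over recent dates
DECLARE run_date DATE DEFAULT CURRENT_DATE();
DECLARE i INT64 DEFAULT 0;

IF EXTRACT(DAYOFWEEK FROM run_date) = 1 THEN
  SELECT 'Weekly rollup day' AS note;
END IF;

WHILE i < 3 DO
  SELECT DATE_SUB(run_date, INTERVAL i DAY) AS processed_date;
  SET i = i + 1;
END WHILE;
```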

85. What is stored procedure in BigQuery? A named, reusable procedural SQL block stored persistently. Created with CREATE PROCEDURE. Supports parameters (IN, OUT, INOUT), variables, control flow, and transactions. Called with CALL procedure_name(args).
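A minimal sketch of a daily-load procedure (procedure, table, and column names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE my_dataset.load_daily(IN run_date DATE)
BEGIN
  -- Idempotent daily load: clear the target day, then re-insert it
  DELETE FROM my_dataset.daily_sales WHERE sale_date = run_date;
  INSERT INTO my_dataset.daily_sales
  SELECT sale_date, SUM(amount) AS total
  FROM my_dataset.raw_sales
  WHERE sale_date = run_date
  GROUP BY sale_date;
END;

CALL my_dataset.load_daily(DATE '2024-01-01');
```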


🔷 SCENARIO-BASED / PRACTICAL (86–100)

86. How do you optimize a slow BigQuery query? Check execution plan for bottleneck stages → add partition filter to reduce scans → add clustering on filter columns → replace SELECT * with specific columns → use materialized views for repeated aggregations → replace exact COUNT DISTINCT with APPROX_COUNT_DISTINCT → avoid cross joins → pre-filter before joining large tables.

87. How do you reduce query cost?

 
```sql
-- Instead of:
SELECT * FROM large_table WHERE event_date = '2024-01-01';
-- Use (partition pruning + column selection):
SELECT user_id, event_type, revenue
FROM large_table
WHERE DATE(event_timestamp) = '2024-01-01'  -- partition column
  AND region = 'US';                        -- cluster column
```

Preview the bytes processed before running expensive queries by using a dry run.

88. How do you design partition strategy? Partition on the most common filter column (usually a DATE/TIMESTAMP for time-series data). Use integer-range partitioning for ID-based tables. Aim for partition sizes of roughly 1–10 GB for good performance. Enable require_partition_filter on large tables to prevent accidental full scans.
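For instance, a sketch of integer-range partitioning on a hypothetical customer_id key:

```sql
-- 100 partitions of 10,000 IDs each, covering customer_id 0–999,999
CREATE TABLE my_dataset.orders_by_customer
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000))
AS SELECT * FROM my_dataset.orders;
```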

89. How do you handle large joins efficiently? Filter both tables before joining (reduce size early), place the larger table first so the smaller one becomes a broadcast-join candidate, partition/cluster both tables on the join key, prefer integer join keys over long STRING keys, use approximate techniques for exploration, and denormalize frequently joined tables.

90. How do you implement incremental load in BigQuery?

 
```sql
-- MERGE pattern (upsert):
MERGE target_table T
USING source_table S ON T.id = S.id
WHEN MATCHED AND S.updated_at > T.updated_at
  THEN UPDATE SET value = S.value, updated_at = S.updated_at
WHEN NOT MATCHED
  THEN INSERT (id, value, updated_at) VALUES (S.id, S.value, S.updated_at);

-- Or INSERT only new records using a watermark:
INSERT INTO target
SELECT * FROM source
WHERE created_at > (SELECT MAX(created_at) FROM target);
```

91. How do you handle duplicate data?

 
```sql
-- Deduplicate using ROW_NUMBER:
CREATE OR REPLACE TABLE dataset.table AS
SELECT * EXCEPT(row_num) FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY unique_key ORDER BY updated_at DESC
  ) AS row_num
  FROM dataset.table
) WHERE row_num = 1;

-- For streaming, use insertId deduplication (best-effort, ~1 min window)
```

92. How do you design data warehouse schema in BigQuery? Use denormalized, wide tables with nested/repeated STRUCTs and ARRAYs instead of normalized joins — BigQuery is optimized for denormalization. Apply Star schema (fact + dims) for BI tools. Partition fact tables by date, cluster by dimension keys. Use materialized views for common aggregations.

93. How do you handle schema changes?

 
```sql
-- Add a nullable column (safe, backward compatible):
ALTER TABLE dataset.table ADD COLUMN new_col STRING;

-- Relax REQUIRED to NULLABLE:
ALTER TABLE dataset.table ALTER COLUMN col DROP NOT NULL;

-- Change a type or remove a column: recreate the table
-- (INT64 is just an example target type)
CREATE OR REPLACE TABLE dataset.table AS
SELECT * EXCEPT(old_col), CAST(old_col AS INT64) AS old_col
FROM dataset.table;
```

94. How do you load streaming data into BigQuery? Use Storage Write API (recommended) — supports exactly-once semantics, high throughput, lower cost than legacy streaming. Via Dataflow WriteToBigQuery with STORAGE_WRITE_API method, Pub/Sub-to-BigQuery Dataflow template, or direct API calls from applications.

95. How do you monitor BigQuery jobs? Use INFORMATION_SCHEMA for query analysis:

 
```sql
SELECT job_id, user_email, total_bytes_processed, total_slot_ms,
       creation_time, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE DATE(creation_time) = CURRENT_DATE()
ORDER BY total_slot_ms DESC
LIMIT 20;
```

Also use Cloud Monitoring dashboards, Cloud Logging for audit trails, and BigQuery Admin panel for job history.

96. How do you secure sensitive data? Apply column-level security with Policy Tags (PII columns tagged → only authorized roles can read). Add row-level security for data segregation by region/team. Enable data masking for non-privileged users. Use VPC Service Controls to prevent data exfiltration. Encrypt with CMEK for regulated data. Audit all access via Cloud Audit Logs.

97. How do you handle nested JSON data?

 
```sql
-- Parse JSON strings:
SELECT
  JSON_VALUE(payload, '$.user.id') AS user_id,
  JSON_VALUE(payload, '$.event_type') AS event_type,
  JSON_QUERY(payload, '$.items') AS items_json
FROM events_table;

-- Use the native JSON type with JSON_VALUE/JSON_QUERY,
-- or load as STRUCT/ARRAY using a schema with nested fields.

-- Unnest arrays from JSON:
SELECT id, JSON_VALUE(item, '$.name') AS item_name
FROM events_table, UNNEST(JSON_QUERY_ARRAY(payload, '$.items')) AS item;
```

98. How do you design a real-time analytics pipeline? Events → Pub/Sub → Dataflow (parse, validate, enrich, window) → BigQuery via Storage Write API (exactly-once) → partitioned/clustered table → Looker Studio / BI Engine for dashboards. Use materialized views for common aggregations. Monitor with Cloud Monitoring alerts on pipeline lag.

99. How do you integrate BigQuery with Dataflow?

 
```python
# Read from BigQuery in Dataflow:
data = p | 'ReadBQ' >> beam.io.ReadFromBigQuery(
    query='SELECT * FROM dataset.table WHERE date = "2024-01-01"',
    use_standard_sql=True)

# Write to BigQuery from Dataflow:
data | 'WriteBQ' >> beam.io.WriteToBigQuery(
    table='project:dataset.table',
    schema=schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method='STORAGE_WRITE_API')
```

Use Storage Read API for reads and Storage Write API for writes for best performance.

100. How do you handle late arriving data in BigQuery? Design ingestion to use event timestamp column (not load time) for partitioning. Allow late writes to historical partitions (BigQuery supports updating any partition). Use MERGE with watermark comparison to upsert late records. For streaming analytics, design Dataflow pipeline with allowed_lateness and write late data to correct event-time partitions. Monitor partition freshness with scheduled queries and alert on gaps.


That completes all 100 BigQuery interview questions, each answered with practical, scenario-based examples. 🚀
