About this role
• Design and implement large-scale Azure and Databricks Data Lakehouse solutions.
• Build and optimize ETL/ELT pipelines for batch and real-time streaming workloads using Azure Data Factory, Databricks, and Apache Spark.
• Develop scalable data ingestion frameworks and integrate diverse structured and unstructured data sources.
• Optimize data storage and query performance using Delta Lake, Parquet format, partitioning, and Spark performance tuning techniques.
• Implement robust data governance, security, and access control using Unity Catalog, Azure Key Vault, and least-privilege principles.
• Build and maintain data quality frameworks using Great Expectations or similar validation tools for batch and streaming data.
• Automate pipeline deployment and CI/CD processes using Azure DevOps, Git, and configuration management tools.
• Enable advanced analytics and ML workflows by preparing curated datasets and integrating with ML platforms like MLflow.
• Develop real-time streaming solutions using Kafka, Azure Event Hubs, and Databricks Structured Streaming.
• Create reusable PySpark libraries and frameworks for data curation, reconciliation, notifications, and Delta table automation.
• Migrate legacy Hive or on-prem data sources to cloud platforms while ensuring security, compliance, and performance.
• Monitor pipeline performance, troubleshoot issues, and implement alerting and logging for proactive management.
• Collaborate with cross-functional teams and stakeholders to translate business requirements into scalable data solutions.
• Establish best practices, standards, and documentation for enterprise-wide data engineering and analytics initiatives.
• Mentor junior engineers and promote knowledge sharing within the data engineering team.