Infrastructure Management:
● Design, develop, and maintain robust, scalable data pipelines to handle large datasets on both on-premises infrastructure and cloud platforms (e.g., AWS, GCP, Azure).
● Implement and manage data storage solutions, including databases and data lakes, ensuring data integrity and performance.
Data Integration:
● Integrate data from various internal and external sources such as databases, APIs, flat files, and streaming data.
● Ensure data consistency, quality, and reliability through rigorous validation and transformation processes.
ETL Development:
● Develop and implement ETL (Extract, Transform, Load) processes to automate data ingestion, transformation, and loading into data warehouses and data lakes.
● Optimize ETL workflows to ensure efficient processing and minimize data latency.
Data Quality & Governance:
● Implement data quality checks and validation processes to ensure data accuracy and completeness.
● Develop data governance frameworks and policies to manage data lifecycle, metadata, and lineage.
Collaboration and Support:
● Work closely with data scientists, AI engineers, and developers to understand their data needs and provide technical support.
● Facilitate effective communication and collaboration between the AI and data teams and other technical teams.
Continuous Improvement:
● Identify areas for improvement in data infrastructure and pipeline processes.
● Stay up to date with industry trends and emerging technologies in data engineering and big data.
Education:
● Bachelor’s degree in Computer Science, Engineering, Data Science, or a related field. A Master’s degree is a plus.
Experience:
● 3–5 years of experience in data engineering or a similar role.
● Proven experience with both on-premises and cloud platforms (AWS, GCP, Azure).
● Strong background in data integration, ETL processes, and data pipeline development.
● Experience leading the design and development of high-performance AI and data platforms, including IDEs, permission management, data pipelines, code management, and model deployment systems.
Skills:
● Proficiency in scripting and programming languages (e.g., Python, SQL, Bash).
● Strong knowledge of data storage solutions, including relational (SQL) and NoSQL databases and data lakes.
● Experience with big data technologies (e.g., Apache Spark, Hadoop).
● Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI).
● Understanding of data engineering and MLOps methodologies.
● Awareness of security best practices in data environments.
● Excellent problem-solving skills and attention to detail.
Preferred Qualifications:
● Hands-on experience managing an on-premises Spark cluster for big data processing, covering both deployment and day-to-day usage.