Senior Data Engineer Resume
PA
SUMMARY
- Around 8 years of experience in software analysis, design, development, testing, and implementation across Cloud, Big Data, BigQuery, Spark, Scala, and Hadoop, and in building and maintaining data pipelines in the role of Data Engineer.
- Expert in developing data models, pipeline architectures, and ETL solutions.
- Extensive experience designing, building, testing, and maintaining the full data management lifecycle from data ingestion through curation to provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), different file formats such as Parquet and JSON, and performance tuning of Spark applications.
- Skilled in designing, implementing, and tuning ETL architectures for better performance.
- Proficient in data processing with Hadoop MapReduce & Apache Spark.
- Extensive experience with Hadoop security requirements and data governance.
- Proficient in SQL, PostgreSQL, Python programming and DBMS concepts.
- Worked with BI services and data visualization tools such as Tableau, Amazon QuickSight, Plotly, and Matplotlib.
- Expertise in end-to-end Data Processing jobs to analyze data using MapReduce, Spark, and Hive.
- Strong experience working in Linux/Unix environments and writing shell scripts.
- Extensive experience with Apache Airflow and Bash/Python scripting for scheduling tasks and process automation. Hands-on experience with ETL and ELT tools such as Kafka, NiFi, and AWS Glue.
- Worked with Jenkins for CI/CD and New Relic dashboards for pipeline event logging.
- Worked with streaming ingestion services such as Kafka, Kinesis, Flume, and JMS, and imported and exported data between RDBMS and the Hadoop platform using Apache Sqoop.
- Designed, configured, and deployed Amazon Web Services (AWS) for a multitude of applications utilizing the AWS stack (including EC2, Glue, Lambda, SNS, S3, RDS, CloudWatch, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling.
- Experience building data models and dimensional models with star and snowflake schemas for OLAP and ODS applications.
- Experience designing and testing highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, along with Kafka.
- Developed a pipeline using Spark and Kafka to load data into Hive with automated ingestion and quality audits into the RAW layer of the data lake (a sketch of this pattern follows this summary).
- Hands-on expertise delivering Descriptive & Prescriptive analytics across several Big Data and Cloud technologies (AWS).
- Developed end-to-end analytical/predictive model applications leveraging business intelligence and insights from both structured and unstructured data in Big Data environments.
- Able to work in parallel across both GCP and AWS clouds.
- Designed, developed, and implemented architecture and migration solutions from on-premises to cloud and multi-cloud environments.
- Gathered and translated business requirements into technical designs, and developed the physical aspects of those designs by creating materialized views, views, and lookups.
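Below is a minimal PySpark sketch of the Kafka-to-Hive ingestion pattern with a basic quality audit, as referenced in the summary. It assumes the spark-sql-kafka connector is on the classpath and a Hive-enabled SparkSession; the topic name, schema, and table name are hypothetical placeholders, not the original project code.

```python
# Minimal sketch of the Kafka -> Hive ingestion pattern, with a simple quality audit.
# Topic, schema, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-hive-raw")
         .enableHiveSupport()          # needed to write managed Hive tables
         .getOrCreate())

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw Kafka stream and parse the JSON value column.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "raw_events")
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

def write_to_raw(batch_df, batch_id):
    # Quality audit: drop records missing a primary key and log how many were rejected.
    good = batch_df.filter(col("event_id").isNotNull())
    rejected = batch_df.count() - good.count()
    print(f"batch {batch_id}: rejected {rejected} records without event_id")
    good.write.mode("append").saveAsTable("raw.events")   # RAW layer of the data lake

(events.writeStream
 .foreachBatch(write_to_raw)
 .option("checkpointLocation", "/tmp/checkpoints/raw_events")
 .start()
 .awaitTermination())
```

Using foreachBatch keeps the audit logic in plain batch code, so the same checks could be reused for backfills as well as the streaming path.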
TECHNICAL SKILLS
Programming: Python, PySpark, Shell Scripting
Big Data: Apache Spark, Hadoop, HDFS, MapReduce, Hive, Oozie, HBase, Impala, Hue
Big Data Platforms: Cloudera, Hortonworks, Palantir
Database Technologies: MySQL, PostgreSQL
Data Warehousing: Amazon Redshift, Talend
Cloud Services: Amazon Web Services (AWS), GCP
Data visualization and reporting tools: Tableau, Amazon QuickSight
Scheduling tools: Apache Airflow, Linux Cron, Windows scheduler
Tools: Terraform, ETL, GitHub, JIRA, Rally, Confluence, Jenkins, Jupyter Lab, IntelliJ, Databricks
Operating Systems: Windows, Linux/macOS
PROFESSIONAL EXPERIENCE
Confidential, PA
Senior Data Engineer
Environment: Palantir, AWS, EC2, Redshift, PySpark, EMR, Jira Align, code-cloud.
Responsibilities:
- Responsible for building scalable distributed data solutions using Big Data technologies.
- Experienced in loading and transforming large sets of structured and semi-structured data; strong experience migrating other databases to Snowflake.
- Designed and developed end-to-end ETL solutions and processing applications using Spark, Scala, and Hive to perform streaming ETL.
- Handled importing of data from various data sources, performed transformations using Spark and loaded data into HDFS.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Developed Spark applications in Scala for data extraction and transformed the data per business needs.
- Familiar with troubleshooting Spark applications to make them more fault tolerant.
- Extracted transactional data from MySQL Databases and loaded the data into HIVE tables using Sqoop.
- Involved in creating Hive Tables and loaded the processed data.
- Worked on performance optimizations such as partitioning, bucketing, and map-side joins to improve performance and reduce query execution time.
- Analyzed the large data sets by performing Hive queries (HiveQL).
- Scheduled and automated workflows using Oozie.
- Responsible for identifying contacts, sources of information, and availability of data.
- Based on requirements, developed a PySpark pipeline for normalizing data into a structured format called a Business Ready Dataset (BRD); see the sketch following this list.
- Worked with PySpark to improve performance and optimize existing applications while migrating them from an EMR cluster to AWS Glue.
- Automated data acquisition and BRD creation in a production environment with checks and balances (monitoring, etc.).
- Worked on AWS Kinesis for processing huge amounts of real-time data.
- Created a Python Flask login and dashboard backed by a Neo4j graph database and executed various Cypher queries for data analytics.
- Developed Python programs and batch scripts on Windows for automation of ETL processes to AWS Redshift.
- Worked on the Palantir data platform, developing pipelines for data ingestion, transformation, curation, and processing to generate Business Ready Datasets (BRDs).
- Implemented data extraction from various distributed sources using scripting and generated BRDs used by downstream applications to derive insights.
- Optimized PySpark scripts to run in the Palantir DEEP environment for faster data processing.
- Developed a self-service platform used by customers to drive cost-savings initiatives.
- Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, employing operators such as the Bash operator, Hadoop operators, Python callables, and branching operators; deployed applications to GCP using Spinnaker (RPM based).
- Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the Dockerized application from AWS to GCP.
- Built, maintained, and tested infrastructure to aggregate critical business data into Google Cloud Platform (GCP) BigQuery and GCP Storage for analysis.
- Designed, implemented, and owned administration of multiple public cloud environments (AWS and GCP).
- Experienced in AWS cloud environment and on S3 storage and EC2 instances.
- Analyzed datasets, performed logical analysis to deep dive into the data, debugged data quality issues, cleansed and transformed data, and created reports to share findings across teams.
- Worked on generating reports from the Neo4j graph database.
- Scheduled daily jobs to build and monitor builds regularly in Data Lineage, and configured email alerts for whenever a pipeline fails.
- Contributed in an Agile team of developers focused on data ingestion across multiple sources.
- Operated in a CI/CD environment, used Code-cloud to store the code, and used Jira Align for tracking features, stories, and tasks.
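Below is a minimal PySpark sketch of the BRD normalization step referenced in the list above; the column names, cleansing rules, and S3 paths are hypothetical assumptions for illustration, not the actual project code.

```python
# Hypothetical sketch: normalize a raw feed into a Business Ready Dataset (BRD).
# Column names, cleansing rules, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("brd-normalization").getOrCreate()

raw = spark.read.parquet("s3://example-bucket/landing/customers/")  # hypothetical source

brd = (raw
       # Standardize column names to the snake_case expected by downstream consumers.
       .withColumnRenamed("CustomerID", "customer_id")
       .withColumnRenamed("CustomerName", "customer_name")
       # Trim whitespace and normalize casing on string fields.
       .withColumn("customer_name", F.initcap(F.trim(F.col("customer_name"))))
       # Parse date strings into a proper date column.
       .withColumn("signup_date", F.to_date(F.col("signup_date"), "yyyy-MM-dd"))
       # Drop records that fail basic completeness checks, then de-duplicate on the key.
       .dropna(subset=["customer_id"])
       .dropDuplicates(["customer_id"]))

# Publish the BRD for downstream applications, partitioned for efficient reads.
brd.write.mode("overwrite").partitionBy("signup_date").parquet(
    "s3://example-bucket/brd/customers/")
```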
Confidential, CA
Data Engineer
Environment: Spark, Scala, Hive, GCP, BigQuery, Pub/Sub, Oozie with Hue, Oracle, Tableau, Jira Kanban Board, Bitbucket.
Responsibilities:
- The primary focus was working with architects and other developers to design, develop, and deploy a new Hadoop-based data lake pipeline for data extraction, transformation, and processing, loading the results for visualization and analytics.
- Involved in developing a new Spark application using Scala to migrate data from traditional databases and data warehouses, and to process and transform the data per business needs.
- Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, employing operators such as the Bash operator, Hadoop operators, Python callables, and branching operators; a sketch of such a DAG follows this list.
- Developed Spark Scala script for doing multiple transformations and aggregations on the Dataset.
- Deployed Cloud Functions to integrate services such as Pub/Sub, Cloud Storage, and Datastore: processing messages on Pub/Sub in real time, processing files on Cloud Storage when a Compute Engine instance writes or modifies them, and uploading the resulting data and messages to Datastore.
- Created various pipelines to load the on-premises data into GCP BigQuery warehouse.
- Fetched live data from Oracle database using Spark Streaming and Amazon Kinesis using the feed from API Gateway REST service.
- Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Built data pipelines in Airflow on GCP for ETL-related jobs using a mix of older and newer Airflow operators.
- Worked on troubleshooting and performance optimization of Spark applications to improve overall processing time and make the pipeline more fault tolerant.
- Coordinated with the Analytics team to visualize the Hive tables.
- Created monitors, alarms, notifications and logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch.
- Wrote multiple Hive queries (HQL) on processed data for analysis.
- Automated multiple Spark scripts to run in parallel using Oozie based on dependencies in the pipeline.
- Communicated effectively with a mix of technical and non-technical people.
- Assessed industry standards for data models, incorporating new advances into internal architecture standards and following company coding standards, application development best practices, and policies.
- Used Google Cloud Platform (GCP) to build, test, and deploy applications on Google's highly scalable and reliable infrastructure for web, mobile, and backend solutions.
- Used Bitbucket to store the code, code reviews and Jira to create the stories.
- Followed Agile methodology for the entire project, with bi-weekly team meetings for tracking project milestones, technical discussion, planning and estimation, creating project design documents, and identifying technology-related risks and issues.
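Below is a minimal Airflow DAG sketch of the Cloud Composer pattern referenced above, combining Bash, Python-callable, and branching operators. The DAG id, schedule, and task logic are hypothetical, and the import paths assume Airflow 2.x (EmptyOperator requires 2.2+).

```python
# Hypothetical Airflow 2.x DAG sketch: bash, python-callable, and branching operators.
# DAG id, schedule, and task logic are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def extract_source_data(**context):
    # Placeholder for the extraction logic (e.g. pulling a daily feed).
    print("extracting data for", context["ds"])


def choose_load_path(**context):
    # Branch: full reload on Mondays, incremental load otherwise.
    return "full_load" if context["logical_date"].weekday() == 0 else "incremental_load"


with DAG(
    dag_id="onprem_to_bigquery_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_source_data)

    validate = BashOperator(
        task_id="validate",
        bash_command="python /home/airflow/gcs/dags/scripts/validate_feed.py {{ ds }}",
    )

    branch = BranchPythonOperator(task_id="branch", python_callable=choose_load_path)

    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")

    extract >> validate >> branch >> [full_load, incremental_load]
```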
Confidential, Atlanta GA
Hadoop Developer
Environment: Spark, Oracle, HBase, Python, Bitbucket, Rally, Azure, SQL Server, Hive, Kafka
Responsibilities:
- Responsible for ingesting large volumes of data into the Hadoop data lake pipeline on a daily basis.
- Developed Spark programs to Transform and Analyze the data.
- Worked on Developing Spark Jobs to transform and analyze stored data as per the business requirements.
- Involved in developing Spark programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Built data pipelines using Kafka and developed Kafka producers to stream data from external sources to Kafka topics.
- Developed Kafka consumer APIs in Scala to consume data from Kafka topics, processed the consumed data using Spark, and loaded the processed streams into Hive for analytics.
- Developed Spark applications using Scala for performing data cleansing, event enrichment, data aggregation and data preparation needed for reporting teams to consume.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, efficient joins, transformations, and other features.
- Involved in creating Hive tables, imported processed data into the tables, and performed analytics per business requirements.
- Read data from different sources such as Oracle and Hive using Apache Spark, processed the consumed data with Spark, and loaded the results into Hive and downstream applications for different purposes.
- Developed an Apache Spark application in Python to perform various validations and standardization on fields of incoming data based on defined validation rules (see the sketch following this list).
- Developed required UDFs as Python scripts and implemented them in the application.
- Used Spark RDDs, DataFrames, Datasets, and Spark SQL for faster processing of ETL workloads.
- Implemented optimization techniques such as partitioning, bucketing, and map-side joins for fast query processing.
- Implemented Spark with Python and Spark SQL for faster testing and processing of data.
- Wrote Python scripts for retrieval and analysis against NoSQL databases such as HBase.
- Used the NoSQL database HBase for storing batch information.
- Used the Rally tracking tool for task assignment and defect management.
- Responsible for web application deployments over cloud services (web and worker roles) on Azure, using Visual Studio and PowerShell.
- Worked in Agile development environment and actively involved in daily Scrum and other design related meetings.
- Used Bitbucket to share the code snippet among the team members.
- Deployed Azure IaaS virtual machines (VMs) and Cloud services (PaaS role instances) into secure VNets and subnets.
- Provided high availability for IaaS VMs and PaaS role instances for access from other services in the VNet with Azure Internal Load Balancer.
- Implemented high availability with Azure Classic and Azure Resource Manager deployment models and worked on creating Custom Azure Templates for quick deployments and advanced PowerShell scripting.
- Migrated a SQL Server 2008 database to Windows Azure SQL Database and updated the connection strings accordingly.
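Below is a minimal PySpark sketch of the rule-based field validation and standardization referenced in the list above; the rules, column names, and paths are hypothetical assumptions for illustration.

```python
# Hypothetical sketch: rule-based validation and standardization of incoming records.
# Rules, column names, and the quarantine path are illustrative assumptions.
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("field-validation").getOrCreate()

incoming = spark.read.json("hdfs:///data/incoming/orders/")  # hypothetical landing path

# Standardize field formats before validating.
standardized = (incoming
                .withColumn("order_id", F.upper(F.trim(F.col("order_id"))))
                .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
                .withColumn("order_date", F.to_date(F.col("order_date"), "MM/dd/yyyy")))

# Each rule is a boolean column expression; a record is valid only if every rule passes.
rules = {
    "order_id_present": F.col("order_id").isNotNull() & (F.length("order_id") > 0),
    "amount_positive": F.col("amount") > 0,
    "order_date_parsed": F.col("order_date").isNotNull(),
}
is_valid = reduce(lambda acc, rule: acc & rule, rules.values())

validated = standardized.withColumn("is_valid", is_valid)

# Route good records to the curated zone and quarantine the rest for review.
validated.filter(F.col("is_valid")).drop("is_valid") \
    .write.mode("append").parquet("hdfs:///data/curated/orders/")
validated.filter(~F.col("is_valid")) \
    .write.mode("append").parquet("hdfs:///data/quarantine/orders/")
```

Keeping the rules in a dictionary makes it straightforward to add or report on individual checks without rewriting the routing logic.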
Confidential
SQL Developer
Environment: Hortonworks (HDP), Hadoop, HDFS, Hive, Spark, MapReduce, Scala, Python, Kafka, Sqoop, Zookeeper, MySQL, HBase, Jira, Tableau.
Responsibilities:
- Used update strategy transformations to effectively migrate data from source to target.
- Moved the mappings from development environment to test environment.
- Designed ETL Process using Informatica to load data from Flat Files, and Excel Files to target Oracle Data Warehouse database.
- Interacted with the business community and database administrators to identify the business requirements and data realities.
- Experienced in Agile Methodologies and SCRUM Process.
- Built various graphs for business decision making using the Python Matplotlib library (a sketch follows this list).
- Worked on application development, especially in UNIX environments, and familiar with its commands.
- Implemented code in Python to retrieve and manipulate data.
- Reviewed basic SQL queries and edited inner, left, & right joins in Tableau Desktop by connecting live/dynamic and static datasets.
- Reported and created dashboards for Global Services & Technical Services using SSRS, Oracle BI, and Excel. Deployed Excel VLOOKUP, PivotTable, and Access Query functionalities to research data issues.
- Created conceptual, logical and physical relational models for integration and base layer; created logical and physical dimensional models for presentation layer and dim layer for a dimensional data warehouse in Power Designer.
- Involved in reviewing business requirements and analyzing data sources from Excel and Oracle SQL Server for design, development, testing, and production rollover of reporting and analysis projects.
- Analyzed, designed, developed, implemented, and maintained ETL jobs using IBM InfoSphere DataStage and Netezza.
- Designed Data Flow Diagrams, E/R Diagrams and enforced all referential integrity constraints.
- Extensively worked in Client-Server application development using Oracle 10g, Teradata 14, SQL, PL/SQL, Oracle Import and Export Utilities.
- Coordinated with DB2 on database build and table normalizations and de-normalizations.
- Conducted brainstorming sessions with application developers and DBAs to discuss various de-normalization, partitioning, and indexing schemes for the physical model.
- Involved in several facets of MDM implementations including Data Profiling, metadata acquisition and data migration.
- Extensively used SQL Loader to load data from the Legacy systems into Oracle databases using control files and used Oracle External Tables feature to read the data from flat files into Oracle staging tables.
- Involved in extensive data validation by writing complex SQL queries, performed back-end testing, and worked through data quality issues.
- Used SSIS to create ETL packages to validate, extract, transform, and load data into data warehouse and data mart databases, and processed SSAS cubes to store data in OLAP databases.
- Created ETL packages using OLTP data sources (SQL Server, Flat files, Excel source files, Oracle) and loaded the data into target tables by performing different kinds of transformations using SSIS.
- Migrated SQL server 2008 to SQL Server 2008 R2 in Microsoft Windows Server 2008 R2 Enterprise Edition.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team and to satisfy business rules.
- Performed data validation on the flat files that were generated in UNIX environment using UNIX commands as necessary.
- Strong understanding of data modeling (relational, dimensional, star and snowflake schemas), data analysis, and data warehousing implementations on Windows and UNIX.
- Used the Model Mart feature of Erwin for effective model management (sharing, dividing, and reusing model information) to improve productivity, and identified and tracked slowly changing dimensions and determined hierarchies in dimensions.
- Designed the high-level ETL architecture for overall data transfer using SSIS from the source server to the warehouse, defined facts and dimensions in the data mart (including factless fact tables), and designed the data mart by defining entities, attributes, and relationships between them.
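Below is a minimal Matplotlib sketch of the kind of decision-support graph mentioned above; the figures and labels are hypothetical, not real data.

```python
# Hypothetical sketch: monthly revenue-vs-target chart for business review.
# The figures and labels are illustrative placeholders, not real data.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 162, 158]   # in $K, hypothetical
target = [125, 130, 140, 145, 155, 160]    # in $K, hypothetical

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(months, revenue, label="Actual revenue", color="steelblue")
ax.plot(months, target, label="Target", color="darkorange", marker="o")

ax.set_title("Monthly Revenue vs. Target")
ax.set_ylabel("Revenue ($K)")
ax.legend()
fig.tight_layout()
fig.savefig("revenue_vs_target.png", dpi=150)
```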