Azure/Snowflake Data Engineer Resume
SUMMARY:
- Dynamic and motivated IT professional with around 7 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using cloud data engineering, data warehousing, the Hadoop ecosystem, big data analytics, data visualization, reporting, and data quality solutions.
- Hands-on experience across the Hadoop ecosystem, including extensive experience with big data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, NoSQL, Spark, Python, Scala, Sqoop, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, and Flume.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consumption (a minimal sketch follows this summary). Utilized Flume to analyze log files and write them into HDFS.
- Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for small-data-set processing and storage; experienced in maintaining Hadoop clusters on AWS EMR.
- Hands-on experience with Amazon EC2, S3, RDS (Aurora), IAM, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB, and other services of the AWS family, as well as Microsoft Azure.
- Proven expertise in deploying major software solutions for various high-end clients, meeting business requirements such as big data processing, ingestion, analytics, and cloud migration from on-premises to AWS.
- Experience working with AWS databases such as ElastiCache (Memcached and Redis) and NoSQL databases (HBase, Cassandra, and MongoDB), including database performance tuning and data modeling.
- Established connectivity from Azure to on-premises systems.
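
The following is a minimal PySpark sketch of the Kafka-to-Spark-Streaming pattern summarized above: a Structured Streaming consumer that reads a Kafka topic and lands the parsed events in HDFS as Parquet. The broker address, topic name, schema, and paths are illustrative assumptions rather than values from an actual engagement.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# write the parsed events to HDFS as Parquet. Broker, topic, schema, and
# paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("weblog-stream").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "weblogs")                       # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers bytes; cast the value and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/weblogs/")                 # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
         .outputMode("append")
         .start())
```

The checkpoint location lets the stream recover its Kafka offsets after a restart, which is what makes the pipeline restartable without reprocessing or losing events.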
PROFESSIONAL EXPERIENCE:
Confidential
Azure/SnowFlake Data Engineer
Responsibilities:
- Analyzed, developed, and built modern data solutions with Azure PaaS services to enable data visualization, and assessed the application's current production state and the impact of new installations on existing business processes.
- Worked on migrating data from an on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (a minimal transformation sketch follows this section).
- Created pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to extract, transform, and load data from many sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including write-back.
- Used Azure ML to build, test, and deploy predictive analytics solutions based on data.
- Developed Spark applications with Azure Data Factory and Spark SQL for data extraction, transformation, and aggregation from different file formats, in order to analyze and transform the data and uncover insights into customer usage patterns.
- Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
- Applied technical knowledge to architect solutions that meet business and IT needs, created roadmaps, and ensured the long-term technical viability of new deployments, infusing key analytics and AI technologies where appropriate (e.g., Azure Machine Learning, Machine Learning Server, Bot Framework, Azure Cognitive Services, Azure Databricks).
- Managed the relational database service in which Azure SQL handles reliability, scaling, and maintenance.
- Integrated data storage solutions with Spark, particularly Azure Data Lake Storage and Blob storage.
- Configured Stream Analytics and Event Hubs, and worked on managing IoT solutions with Azure.
- Successfully completed a proof of concept for Azure implementation, with the larger goal of migrating on-premises servers and data to the cloud.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster; experienced in tuning Spark applications for the proper batch interval time, parallelism level, and memory usage.
- Extensively involved in analysis, design, and modeling; worked on the Snowflake schema, data modeling and elements, source-to-target mappings, the interface matrix, and design elements.
- Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
- Helped individual teams set up their repositories in Bitbucket and maintain their code, and helped them set up jobs that make use of the CI/CD environment.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Analyzed large sets of structured data using Hive queries.
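
As referenced in the Databricks bullet above, here is a hedged PySpark sketch of the kind of curation job that might run in Azure Databricks against ADLS Gen2: read raw CSV from the lake, cleanse it, and write partitioned Parquet back. The storage account, container, and column names are hypothetical placeholders.

```python
# Illustrative Databricks-style PySpark curation job: raw CSV in ADLS Gen2 is
# cleansed and written back as partitioned Parquet. Account, container, and
# column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("adls-curation").getOrCreate()

raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/"          # placeholder
curated_path = "abfss://curated@examplelake.dfs.core.windows.net/sales/"  # placeholder

orders = (spark.read
          .option("header", "true")
          .csv(raw_path))

curated = (orders
           .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
           .filter(col("order_id").isNotNull())      # drop records missing the key
           .dropDuplicates(["order_id"]))            # de-duplicate on the key

(curated.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet(curated_path))
```

Partitioning the curated output by date keeps downstream queries (for example from Synapse or Databricks SQL) pruned to the slices they actually need.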
Confidential
AWS Data Engineer
Responsibilities:
- Developed Apache Presto and Apache Drill setups on an AWS EMR (Elastic MapReduce) cluster to combine multiple databases such as MySQL and Hive; this enables comparing results such as joins and inserts across various data sources controlled through a single platform.
- Wrote the AWS Lambda functions in Scala with cross-functional dependencies that generated custom libraries for delivering the Lambda functions in the cloud.
- Performed raw data ingestion into S3 from Kinesis Firehose, which triggered a Lambda function that put refined data into another S3 bucket and wrote to an SQS queue as Aurora topics (a Python rendering of this pattern appears after this list).
- Writing to the Glue Data Catalog allows the refined data to be queried from Athena, resulting in a serverless querying environment.
- Created PySpark DataFrames to bring data from DB2 to Amazon S3 (a hedged sketch follows the Environment line below).
- Worked on the Kafka backup index and Log4j appender, minimized logs, and pointed Ambari server logs to NAS storage.
- Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore, combining the metadata of 20 EMR clusters into a single RDS instance, which avoids data loss even when an EMR cluster is terminated.
- Used an AWS CodeCommit repository to store programming logic and scripts and restore them to new clusters.
- Spun up EMR clusters of 30 to 50 nodes on memory-optimized instances such as R2, R4, X1, and X1e with the auto-scaling feature.
- With Hive being the primary query engine of EMR, created external table schemas for the data being processed.
- Mounted a local directory file path to Amazon S3 using s3fs-fuse so that KMS encryption is enabled on the data reflected in the S3 buckets.
- Designed and implemented ETL pipelines over S3 Parquet files in the data lake using AWS Glue.
- Migrated data from the Amazon Redshift data warehouse to Snowflake.
- Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda, and built logical datasets to administer quality monitoring on Snowflake warehouses.
- Used the AWS Glue catalog with a crawler to get the data from S3 and perform SQL query operations, and a JSON schema to define table and column mappings from S3 data to Redshift.
- Applied auto-scaling techniques to scale the instances in and out based on memory utilization over time; this helped reduce the instance count when the cluster was not actively in use, while still honoring Hive's replication factor of 2 by leaving a minimum of 5 instances running.
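
As noted in the ingestion bullet above, this is a Python rendering of the S3-triggered Lambda pattern (the functions described in this role were written in Scala): the handler reads the raw Firehose object, writes a refined copy to a second bucket, and posts a message to SQS. Bucket names, the queue URL, and the refinement logic are placeholders.

```python
# Python sketch of the S3-triggered refinement Lambda: raw Firehose objects
# are cleaned, copied to a refined bucket, and announced on SQS.
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

REFINED_BUCKET = "example-refined-bucket"                                       # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"    # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw object delivered by Kinesis Firehose.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Placeholder "refinement": keep only non-empty lines.
        refined = "\n".join(line for line in body.splitlines() if line.strip())

        s3.put_object(Bucket=REFINED_BUCKET, Key=key, Body=refined.encode("utf-8"))

        # Notify downstream consumers via SQS.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": REFINED_BUCKET, "key": key}),
        )
```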
Environment: Amazon Web Services (EMR, EC2, CloudFormation, S3, Redshift), Hive, Scala, PySpark, Snowflake, shell scripting, Tableau, Kafka.
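
A hedged sketch of the DB2-to-S3 PySpark load mentioned above: read a DB2 table over JDBC and write it to S3 as Parquet. The JDBC URL, credentials, table, and bucket are placeholders, and both the DB2 JDBC driver jar and the S3A connector are assumed to be available to the Spark job.

```python
# Sketch: pull a DB2 table over JDBC into a Spark DataFrame and land it on S3
# as Parquet. Connection details and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

db2_df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")   # placeholder host/db
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .option("dbtable", "SALES.ORDERS")                     # placeholder table
          .option("user", "db2user")                             # placeholder credentials
          .option("password", "db2password")
          .load())

# Assumes the S3A connector is configured on the cluster.
(db2_df.write
 .mode("overwrite")
 .parquet("s3a://example-datalake/raw/orders/"))                 # placeholder bucket
```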
Confidential
Big Data Engineer
Responsibilities:
- Imported real-time weblogs using Kafka as the messaging system and ingested the data into Spark Streaming.
- Implemented data quality checks using Spark Streaming and flagged records as bad or passable (a minimal sketch follows this list).
- Developed business logic using Kafka and Spark Streaming and implemented business transformations.
- Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Created PL/SQL scripts to extract data from the operational database into simple flat text files using the UTL_FILE package.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Worked on loading CSV/TXT/Avro/Parquet files using Scala in the Spark framework, processing the data by creating Spark DataFrames and RDDs, saving the files in Parquet format in HDFS, and loading them into fact tables using the ORC reader.
- Involved in data loading using PL/SQL and SQL*Loader, calling UNIX scripts to download and manipulate files.
- Involved in creating data models for customer data using Cassandra Query Language.
- Performed benchmarking of the NoSQL databases Cassandra and HBase.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Wrote entities in Scala along with named queries to interact with the database.
- Configured Spark Streaming to receive ongoing information from Kafka and store the streamed information in HDFS.
- Used Kafka functionality such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds.
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers.
- Ran many performance tests using the cassandra-stress tool to measure and improve the read and write performance of the cluster.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
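
A minimal sketch of the bad/passable flagging described in this role, expressed with Spark DataFrame logic (the same expressions can run inside a streaming query). The column names and validation rules are illustrative assumptions.

```python
# Sketch: tag each weblog record as "bad" or "passable" and route the two
# sets to separate HDFS locations. Columns and rules are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("dq-flags").getOrCreate()

weblogs = spark.read.json("hdfs:///data/weblogs/raw/")            # placeholder input

flagged = weblogs.withColumn(
    "dq_flag",
    when(col("user_id").isNull() | col("url").isNull(), "bad")    # required fields missing
    .when(col("status_code").cast("int").isNull(), "bad")         # non-numeric status code
    .otherwise("passable"),
)

# Route records by flag so downstream jobs only consume clean data.
flagged.filter(col("dq_flag") == "passable").write.mode("append").parquet("hdfs:///data/weblogs/clean/")
flagged.filter(col("dq_flag") == "bad").write.mode("append").parquet("hdfs:///data/weblogs/quarantine/")
```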
Environment: Spark (RDDs, DataFrames, UDFs), Kafka, Scala, AWS S3, Oracle SQL, Cassandra, Hive, various file formats.
Confidential
Data Engineer
Responsibilities:
- Performed multiple MapReduce jobs in Hive for data cleaning and pre-processing.
- Loaded data from Teradata tables into Hive tables.
- Experience importing and exporting data with Sqoop between HDFS and RDBMS and migrating it according to the client's requirements.
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers and pushed it to HDFS.
- Developed big data solutions focused on pattern matching and predictive modeling.
- Involved in Agile methodologies, Scrum meetings, and sprint planning.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Handled resource management of the Hadoop cluster, including adding and removing cluster nodes for maintenance and capacity needs.
- Involved in loading data from the UNIX file system to HDFS.
- Partitioned the fact tables and materialized views to enhance performance.
- Implemented Hive partitioning and bucketing on the collected data in HDFS (a sketch follows this list), and integrated Hive queries into the Spark environment using Spark SQL.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Improved table performance through load testing using the cassandra-stress tool.
- Worked with the admin team to set up, configure, troubleshoot, and scale the hardware of a Cassandra cluster.
- Created data models for customer data using Cassandra Query Language (CQL) (an illustrative model appears after the Environment line).
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports per users' needs.
- Experienced in connecting Avro sink ports directly to Spark Streaming for analysis of weblogs.
- Addressed performance tuning of Hadoop ETL processes against very large data sets and worked directly with statisticians on implementing solutions involving predictive analytics.
- Performed Linux operations on the HDFS server for data lookups and job changes when commits were disabled, and rescheduled data storage jobs.
- Created data processing pipelines for data transformation and analysis by developing Spark jobs in Scala.
- Tested and validated database tables in relational databases with SQL queries, and performed data validation and data integration.
- Worked on visualizing the aggregated datasets in Tableau.
- Migrated code to version control using Git commands for future use and to ensure a smooth development workflow.
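
As referenced in the partitioning bullet above, here is a hedged Spark SQL sketch of Hive-style partitioning and bucketing via the DataFrame writer, with Hive support enabled. Database, table, and column names are placeholders.

```python
# Sketch: write a partitioned, bucketed table into the Hive metastore so
# later queries can prune by date and join efficiently on user_id.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

weblogs = spark.read.parquet("hdfs:///data/weblogs/clean/")   # placeholder input

(weblogs.write
 .partitionBy("event_date")        # coarse pruning by date
 .bucketBy(32, "user_id")          # even distribution for joins on user_id
 .sortBy("user_id")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("analytics.weblogs_curated"))

# Downstream metrics can then be computed with partition-pruned Spark SQL.
daily_hits = spark.sql("""
    SELECT event_date, COUNT(*) AS hits
    FROM analytics.weblogs_curated
    GROUP BY event_date
""")
```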
Environment: Hadoop, Spark, MapReduce, Hive, HDFS, YARN, MobaXterm, Linux, Cassandra, NoSQL databases, Python, Spark SQL, Tableau, Flume, Spark Streaming
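
An illustrative customer data model in CQL, created here through the Python cassandra-driver; the keyspace, tables, and contact point are assumptions rather than the actual production model.

```python
# Sketch: create a small query-driven customer model in Cassandra via CQL.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS customer_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")

# Primary lookup table, keyed by customer id.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_ks.customer_by_id (
        customer_id uuid PRIMARY KEY,
        first_name text,
        last_name text,
        email text,
        signup_date timestamp
    )
""")

# Second table keyed by email, because Cassandra tables are modeled per query.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_ks.customer_by_email (
        email text PRIMARY KEY,
        customer_id uuid,
        first_name text,
        last_name text
    )
""")

cluster.shutdown()
```

The second table intentionally duplicates customer attributes: in Cassandra, data is denormalized per access pattern rather than joined at read time.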