
AWS Data Engineer Resume


SUMMARY:

  • 7+ years of experience as a Big Data Engineer with expertise in the Hadoop ecosystem, including 3 years on AWS, Azure, and Snowflake.
  • Extensive experience deploying cloud-based applications using Amazon Web Services such as Amazon EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, and DynamoDB.
  • Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena (a minimal sketch appears after this summary).
  • Proven expertise in deploying major software solutions for various high-end clients, meeting business requirements such as big data processing, ingestion, analytics, and cloud migration from on-premises systems to AWS using AWS EMR, S3, and DynamoDB.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, as well as to write data back to those sources.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as controlling and granting database access and migrating on-premises databases to Azure Data Lake stores using Azure Data Factory.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake (see the Snowflake sketch after this summary).
  • Hands-on experience across the Hadoop ecosystem, with extensive experience in big data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, Flume, Kafka, Sqoop, and Spark.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume them (see the streaming sketch after this summary).
  • Experienced in improving the performance and optimizing existing algorithms in Hadoop with Spark, using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs, and worked explicitly with PySpark.
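The following is a minimal, illustrative sketch of the serverless pattern referenced in the summary: an AWS Lambda handler that starts an Athena query against a Glue-cataloged table using boto3. The database, table, and S3 output location are placeholder assumptions, not actual project resources.

```python
# Hypothetical sketch: Lambda handler that queries a Glue-cataloged table via Athena.
# "analytics_db", "learner_events", and the S3 output bucket are placeholder names.
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    """Start an Athena query and return its execution id for later polling."""
    response = athena.start_query_execution(
        QueryString=(
            "SELECT event_date, COUNT(*) AS events "
            "FROM learner_events GROUP BY event_date"
        ),
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/output/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}
```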
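Similarly, a minimal sketch of the Python-plus-SnowSQL pattern, using the snowflake-connector-python package; the warehouse, database, stage, and table names are illustrative assumptions.

```python
# Hypothetical sketch: load staged files into Snowflake and query the result.
# Credentials come from environment variables; object names are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

with conn.cursor() as cur:
    # Mirror a SnowSQL script: bulk-load from a stage, then aggregate.
    cur.execute(
        "COPY INTO stg_orders FROM @orders_stage "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
    cur.execute("SELECT order_date, SUM(amount) FROM stg_orders GROUP BY order_date")
    for row in cur.fetchall():
        print(row)

conn.close()
```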
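And a minimal sketch of a Kafka consumer built with Spark Structured Streaming; the broker address, topic, and schema are placeholders, and the job assumes the spark-sql-kafka connector package is available to the cluster.

```python
# Hypothetical sketch: consume a Kafka topic with Spark Structured Streaming.
# Requires the spark-sql-kafka connector; broker, topic, and schema are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-consumer").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the topic as a streaming DataFrame and parse the JSON payload.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "learner-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Console sink for illustration; a real pipeline would write to S3/HDFS or a table.
query = events.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```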

PROFESSIONAL EXPERIENCE:

Confidential

AWS Data Engineer

Responsibilities:

  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as Redshift, or to produce reports. Used the Spark Streaming APIs to perform on-the-fly transformations and actions for creating the common learner data model, which receives data from Kinesis in near real time.
  • Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis. With Hive as the primary query engine on EMR, built external table schemas for the data being processed. Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore, making it possible to consolidate the metadata of 20 EMR clusters into a single RDS instance and avoid data loss even if an EMR cluster was terminated. Developed a shell script that collects and stores user-generated logs in AWS S3 (Simple Storage Service) buckets; this records all user actions and serves as a security audit trail to detect cluster termination and safeguard data integrity. Implemented partitioning and bucketing in the Apache Hive tables, which improves query retrieval performance (see the table-definition sketch after these bullets).
  • Using AWS Glue, designed and deployed ETL pipelines on S3 Parquet files in a data lake. Created a CloudFormation template in JSON format to leverage content delivery with cross-region replication through Amazon Virtual Private Cloud. Used an AWS CodeCommit repository to store programming logic and scripts and replicate them to new clusters. Used multi-node Redshift to take advantage of columnar data storage, advanced compression, and massively parallel processing. Worked on migrating a quality-monitoring program from AWS EC2 to AWS Lambda, and created logical datasets to administer quality monitoring on Snowflake warehouses.
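A minimal sketch of the partitioned and bucketed external Hive table pattern described above, issued here through PySpark with Hive support enabled (the same DDL could be run in the Hive shell on EMR); the table name, columns, bucket count, and S3 location are illustrative.

```python
# Hypothetical sketch: external Hive table over S3 data with partitioning and bucketing.
# Table/column names, bucket count, and the S3 location are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

# Partition pruning (load_date) and bucketing (user_id) speed up query retrieval.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS learner_events (
        user_id STRING,
        event_type STRING,
        event_time TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS PARQUET
    LOCATION 's3://example-bucket/learner_events/'
""")

# Register partitions already present under the table location.
spark.sql("MSCK REPAIR TABLE learner_events")
```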

Environment: Amazon Web Services, Elastic MapReduce (EMR) cluster, EC2, CloudFormation, Amazon S3, Amazon Redshift, DynamoDB, CloudWatch, Hive, Scala, Python, HBase, Apache Spark, Spark SQL, Shell Scripting, Tableau, Cloudera.

Confidential

Azure Data Engineer

Responsibilities:

  • Contributed to the development of PySpark DataFrames in Azure Databricks to read data from Data Lake or Blob storage and used the Spark SQL context for transformations (see the sketch after these bullets). Experienced in creating, developing, and deploying high-performance ETL pipelines with PySpark and Azure Data Factory. Responsible for estimating cluster size and for monitoring and troubleshooting the Databricks Spark cluster. Worked on an Azure copy activity to load data from an on-premises SQL Server to an Azure SQL Data Warehouse. Worked on redesigning the existing architecture and implementing it on Azure SQL. Experienced with Azure SQL Database configuration and tuning automation, vulnerability assessment, auditing, and threat detection. Integrated data storage solutions in Spark, especially Azure Data Lake Storage and Blob storage. Improved the performance of Hive and Spark jobs. Knowledge of Kimball data modeling and dimensional modeling techniques. Worked on a cloud evaluation to identify the best cloud vendor based on a set of strict success criteria. Used Hive queries to analyze huge sets of structured, unstructured, and semi-structured data. Created Hive scripts from Teradata SQL scripts for data processing on Hadoop.
  • Developed Hive tables to hold processed findings, as well as Hive scripts to convert and aggregate heterogeneous data. Created and used sophisticated data types for storing and retrieving data in Hive with HQL. Used structured data in Hive to enhance performance with techniques such as bucketing, partitioning, and optimizing self-joins. Created a series of technology demos using the Confidential Edison Arduino shield, Azure Event Hubs, and Stream Analytics to showcase the capabilities of Azure Stream Analytics.
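A minimal sketch of the Databricks/PySpark pattern referenced in the first bullet above: read raw files from Azure storage, transform them with Spark SQL, and write curated Parquet back to the lake. The storage account, containers, columns, and paths are placeholder assumptions, and the cluster is assumed to already have access configured for the storage account.

```python
# Hypothetical sketch: read from ADLS Gen2, transform with Spark SQL, write Parquet.
# Storage account, containers, columns, and paths are placeholders; access to the
# storage account is assumed to be configured on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-transform").getOrCreate()

# Read raw CSV files from Azure Data Lake Storage Gen2 (abfss) or Blob storage (wasbs).
raw = (
    spark.read
    .option("header", "true")
    .csv("abfss://raw@examplestorage.dfs.core.windows.net/sales/")
)

# Register a temp view and aggregate with Spark SQL.
raw.createOrReplaceTempView("sales_raw")
daily = spark.sql("""
    SELECT region,
           to_date(order_ts) AS order_date,
           SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM sales_raw
    GROUP BY region, to_date(order_ts)
""")

# Write the curated result back to the lake as Parquet, partitioned by date.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/sales_daily/"
)
```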

Environment: Azure Data Factory (V2), Azure Databricks, Python, SSIS, Azure SQL, Azure Data Lake, Azure Blob Storage, Spark 2.0, Hive

Confidential

Big Data Developer

Responsibilities:

  • Contributed to the analysis of functional requirements by collaborating with business users, product owners, and developers. Worked on analyzing Hadoop clusters using various big data analytic tools such as Pig, the HBase database, and Sqoop. Transferred data from HDFS to relational database systems using Sqoop for business intelligence, visualization, and user report generation.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using Spark context, Spark SQL, DataFrames, and pair RDDs. Used MapReduce on many occasions, for example to reduce the number of tasks in Pig and Hive for data cleansing and pre-processing. Built Hadoop solutions for big data problems using MR1 and MR2 on YARN. Created Hive tables, used Sqoop to load claims data from Oracle, and then put the processed data into the target database. Contributed to the conversion of HiveQL into Spark transformations using Spark RDDs and Scala (a PySpark equivalent is sketched after these bullets). Integrated Kafka with Spark Streaming for high throughput and reliability.
  • Job management experience using the Fair Scheduler, as well as script development with Oozie workflows. Worked on Spark and Hive to perform the transformations required to link daily ingested data to historical data. Analyzed Snowflake Event, Change, and Job data and built a dependency-tree model based on the occurrence of incidents for each internal application service. Performed Sqoop ingestion from MS SQL Server and SAP HANA views using Oozie workflows. Created Cassandra tables to store various formats of data coming from different sources. Designed and implemented data integration applications in a Hadoop environment for data access and analysis using the NoSQL data store Cassandra. Used Cloudera Manager for installation and management of the Hadoop cluster.
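As referenced above, a minimal sketch of converting a HiveQL aggregation into a Spark DataFrame transformation. The original work is described in Scala; the equivalent is shown here in PySpark, and the table and column names are illustrative.

```python
# Hypothetical sketch: DataFrame equivalent of a HiveQL aggregation over claims data.
# Equivalent HiveQL: SELECT member_id, SUM(claim_amount) FROM claims
#                    WHERE claim_status = 'APPROVED' GROUP BY member_id
# Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hiveql-to-spark").enableHiveSupport().getOrCreate()

claims = spark.table("claims")
approved_totals = (
    claims
    .filter(F.col("claim_status") == "APPROVED")
    .groupBy("member_id")
    .agg(F.sum("claim_amount").alias("total_approved"))
)

# Persist the result to a target Hive table, as the job did after Sqoop ingestion.
approved_totals.write.mode("overwrite").saveAsTable("claims_summary")
```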

Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery

Confidential

Data Engineer

Responsibilities:

  • Running Spark SQL operations on JSON, converting the data into a tabular format with DataFrames, then saving and publishing the data to Hive and HDFS (see the sketch after these bullets). Developing and refining shell scripts for data input and validation with various parameters, as well as developing custom shell scripts to execute Spark jobs. Creating Spark tasks by building RDDs in Python and DataFrames in Spark SQL to analyze data and store it in S3 buckets. Working with JSON files, parsing them, saving data in external tables, and altering and improving data for future use. Taking part in design, code, and test inspections to discover problems throughout the life cycle, and explaining technical considerations and upgrades to clients at appropriate meetings.
  • Creating data processing pipelines by building Spark jobs in Scala for data transformation and analysis. Working with structured and semi-structured data to process it for ingestion, transformation, and analysis of data behavior for storage. Using the Agile/Scrum approach for application analysis, design, implementation, and improvement as stated by the standards. Creating Hive tables and dynamically loading data into EDW and historical-metrics tables using partitioning and bucketing. Performed Linux operations on the HDFS server for data lookups, job changes when commits were disabled, and data storage rescheduling. Used SQL queries to test and validate database tables in relational databases, as well as to execute data validation and data integration.
  • Collaborate with SAs and Product Owners to gather requirements and analyze them for documentation in JIRA user stories so that technical and business teams can refine the requirements. Documenting tool and technology procedures and workflows in Confluence for future usage, improvements, and upgrades. Migrating code to version control using Git commands for future usage and to guarantee a seamless development workflow.
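A minimal sketch of the JSON-to-Hive flow referenced in the first bullet above: load JSON into a DataFrame, clean it with Spark SQL, and publish to a Hive table and HDFS. The paths, table names, and field names are illustrative assumptions.

```python
# Hypothetical sketch: parse JSON into DataFrames and publish to Hive and HDFS.
# Paths, table names, and the nested field layout are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-hive").enableHiveSupport().getOrCreate()

# Read nested JSON into a DataFrame; Spark infers the tabular schema.
events = spark.read.json("hdfs:///data/raw/events/*.json")

# Flatten and clean with Spark SQL before publishing.
events.createOrReplaceTempView("events_raw")
clean = spark.sql("""
    SELECT id,
           payload.type AS event_type,
           to_timestamp(payload.ts) AS event_time
    FROM events_raw
    WHERE payload.type IS NOT NULL
""")

# Publish: save as a Hive table and keep a Parquet copy on HDFS.
clean.write.mode("overwrite").saveAsTable("events_clean")
clean.write.mode("overwrite").parquet("hdfs:///data/curated/events/")
```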

Environment: MobaXterm, Linux, Shell, JIRA, Confluence, Jupyter, SQL, HDFS, Spark, Hive 2.0, Python, AWS, CDH, PuTTY.
