
AWS Data Engineer Resume


PA

SUMMARY

  • 8+ years of professional IT experience in Big Data using the Hadoop framework, covering analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies as well as Java/J2EE technologies on AWS and Azure.
  • Experience with Hadoop ecosystem components such as Hive, HDFS, Sqoop, Spark, Kafka, and Pig.
  • Experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters on the MapR, Hortonworks, and Cloudera distributions.
  • Good understanding of Hadoop architecture and hands-on experience with components such as the Resource Manager, Node Manager, NameNode, and DataNode, along with MapReduce concepts and the HDFS framework.
  • Expertise in data migration, data profiling, data ingestion, data cleansing, transformation, data import, and data export using ETL tools such as Informatica PowerCenter.
  • Working knowledge of the Spark RDD, DataFrame, Dataset, and Data Source APIs, as well as Spark SQL and Spark Streaming.
  • Hands-on experience in unified data analytics with Databricks, including the Databricks workspace UI, notebook management, and Delta Lake with Spark SQL.
  • Extensive experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed system architecture and parallel processing, with a deep understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Hands-on experience with programming languages such as Python and SAS.
  • Experienced in using Spark to improve performance and optimize existing algorithms in Hadoop with SparkContext, Spark SQL, DataFrame APIs, Spark Streaming, and pair RDDs; worked extensively with PySpark and Scala (a brief PySpark sketch follows this summary).
  • Handled data ingestion from various sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce before loading the data into HDFS. Managed Sqoop jobs with incremental loads to populate external Hive tables, and imported streaming data into HDFS with Flume sources and sinks, transforming the data with Flume interceptors.
  • Experience with Oozie workflows to manage Hadoop jobs through directed acyclic graphs (DAGs) of actions with control flows.
  • Profound experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka, and Flume (see the Kafka streaming sketch after this summary).
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Good experience in handling messaging services using Apache Kafka.
  • Expert in designing parallel jobs with different stages such as Join, Merge, Find, Remove Duplicates, Filter, Data Set, Find File Set, Complex Flat File, Modify, Aggregator, and XML.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other AWS family services.
  • Well versed in big data on AWS cloud services, i.e., EC2, S3, Glue, Athena, DynamoDB, and Redshift.
  • Created and configured new batch jobs in the Denodo scheduler with email notification capabilities; implemented cluster configuration for multiple Denodo nodes and set up load balancing to improve performance.
  • Instantiate, build, and maintain CI/CD pipelines and apply automation to environments and applications; worked with automation tools such as Git, CloudFormation templates (CFT), and Ansible.
  • Experienced in the Software Development Lifecycle (SDLC) using Scrum and Agile methodologies.
  • Strong knowledge of data preparation, data modeling, and data visualization using Power BI, with experience in developing various analytics using DAX queries.
  • Have profound knowledge of computing platforms such as Amazon Web Services (AWS), Azure and Google Cloud (GCP).
  • Can work across both the GCP and Azure clouds in parallel.
  • Extensive experience in IT data analysis projects. Hands-on experience migrating from on-premises ETL to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
  • Very interested in learning more about the latest technology stack that Google Cloud Platform (GCP) is adding.
  • Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
  • Profound Knowledge of the Hadoop architecture and its components such as YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker and Map Reduce programming.
  • Extensive Hadoop experience which has led to the development of enterprise solutions using Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper and YARN.
  • Good understanding of the Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, used the Databricks workspace for business analytics, managed clusters on Databricks, and managed the machine learning lifecycle.
  • Experienced with cloud platforms like Amazon Web Services, Azure, Databricks.
  • Proficient with Azure Data Lake Storage (ADLS), Databricks and iPython notebook formats, Databricks Delta Lake, and Amazon Web Services (AWS).
  • Hands on experience in configuring workflows using Apache Airflow and the Oozie workflow engine to manage and schedule Hadoop jobs.
  • Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance. Experience with various file formats such as Avro, Parquet, ORC, JSON, and XML.
  • Experience in creating, debugging, scheduling, and monitoring jobs with Control-M and Oozie.
  • Hands-on experience dealing with database issues and connections to SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Creation of Java applications to process data in MongoDB and HBase. Used Phoenix to create a SQL layer in HBase.
  • Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
  • Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, as well as migrating and transforming data from Oracle, Access, and Excel sheets using SQL Server.
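
A brief PySpark sketch of the DataFrame/Spark SQL work referenced in the summary above. This is a minimal illustration only, assuming a hypothetical orders dataset; the S3 paths and column names are placeholders rather than details from any engagement.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders-enrichment").getOrCreate()

    # Ingest raw CSV data (hypothetical path), cleanse it, and deduplicate on the business key.
    orders = (spark.read.option("header", "true").csv("s3://example-bucket/raw/orders/")
              .withColumn("order_ts", F.to_timestamp("order_ts"))
              .withColumn("amount", F.col("amount").cast("double"))
              .dropDuplicates(["order_id"])
              .filter(F.col("amount") > 0))

    # Aggregate with Spark SQL and write the curated result back as Parquet.
    orders.createOrReplaceTempView("orders")
    daily_totals = spark.sql("""
        SELECT date(order_ts) AS order_date, count(*) AS order_count, sum(amount) AS revenue
        FROM orders
        GROUP BY date(order_ts)
    """)
    daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")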
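
A brief Spark Structured Streaming sketch for the Kafka streaming bullet above. The broker address, topic name, and console sink are assumptions for illustration, and the job expects the spark-sql-kafka connector package on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("clickstream-counts").getOrCreate()

    # Read a Kafka topic as an unbounded streaming DataFrame (placeholder broker/topic).
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .load()
              .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

    # Count events per 5-minute window; a console sink keeps the example self-contained.
    windowed = events.groupBy(F.window("timestamp", "5 minutes")).count()

    query = (windowed.writeStream
             .outputMode("complete")
             .format("console")
             .option("truncate", "false")
             .start())
    query.awaitTermination()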

TECHNICAL SKILLS

Programming: Python, PySpark, Scala, Shell script, Perl script, SQL

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Impala, Kafka, Zookeeper, Yarn, Apache Spark, Mahout, Spark MLlib.

Cloud Technologies: AWS, Microsoft Azure, Google Cloud Platform (GCP), Data Pipeline, Databricks.

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata.

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, Teradata SQL Assistant, Postman, AWS/Azure account setup, Databricks account setup.

Versioning tools: SVN, Git, GitHub

Scheduling/Monitoring Tools: Control-M, Oozie, Apache Airflow

Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, MacOS

Network Security: Kerberos

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling.

PROFESSIONAL EXPERIENCE

Confidential, PA

AWS Data Engineer

Responsibilities:

  • Develop and add features to existing data analytics applications built with Spark and Hadoop on a Scala, Java, and Python development platform on top of AWS services.
  • Programming using Python, Scala along with Hadoop framework utilizing Cloudera Hadoop Ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, Zookeeper etc.).
  • Involved in developing Spark applications using Scala and Python for data transformation, cleansing, and validation using the Spark API.
  • Worked on all the Spark APIs, like RDD, Dataframe, Data source and Dataset, to transform the data.
  • Worked on both batch and streaming data sources; used Spark Streaming and Kafka for streaming data processing.
  • Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
  • Worked on Apache Nifi to automate the data movement between RDBMS and HDFS.
  • Created shell scripts to handle various jobs like Map Reduce, Hive, Pig, Spark etc., based on the requirement.
  • SSIS performance tuning using counters, error handling, event handling, re-running failed SSIS packages with checkpoints, and scripting with ActiveX and VB.NET in SSIS.
  • Development using SSIS script tasks, lookup transformations, and data flow tasks with T-SQL and Visual Basic (VB) scripts.
  • Transferred the data (ETL) to the data warehouse with SSIS and processed SSAS cubes to store data in OLAP databases.
  • Performance monitoring with SQL Profiler and Windows System Monitor.
  • Used MongoDB to store data in JSON format and developed and tested many dashboard features using Python, Bootstrap, CSS, and JavaScript.
  • Developed and set up the enterprise data lake to support different use cases, including analysis, processing, storage, and reporting of voluminous, rapidly changing data.
  • Responsible for maintaining high quality reference data at source by performing operations such as cleaning, transformation and ensuring integrity in a relational environment in close collaboration with stakeholders and the solution architect.
  • Designed and developed Security Framework to provide access to objects in AWS S3 using AWS Lambda, DynamoDB.
  • Configured and worked with Kerberos authentication principals to set up secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access to the cluster for new users.
  • Comprehensive architectural assessment and implementation of various AWS services such as Amazon EMR, Redshift, and S3. Implemented ML algorithms in Python to predict the quantity a user would want to order for a given item so it could be suggested automatically, using Kinesis Data Firehose and the S3 data lake.
  • Developed the PySpark code for AWS Glue jobs and for EMR (a Glue job skeleton is sketched after this list).
  • Used Spark SQL through the Scala and Python interfaces, which automatically convert RDDs of case classes to DataFrames.
  • Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision making using Python's matplotlib library.
  • Imported data from various sources such as HDFS/HBase into Spark RDDs and performed calculations with PySpark to generate the output response.
  • Created Lambda functions with Boto3 to deregister idle AMIs in all application regions to reduce EC2 resource costs (a Boto3 sketch follows this list).
  • Performed data blending, data preparation using Alteryx and SQL for Tableau consumption, and publishing data sources to Tableau Server.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote SnowSQL scripts.
  • Gather data from Data warehouses in Teradata and Snowflake.
  • Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
  • Developed Kibana dashboards based on Logstash data and integrated various source and target systems with Elasticsearch for near real-time log analysis of end-to-end transaction monitoring.
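
A skeleton of an AWS Glue PySpark job, as referenced in the Glue bullet above. The catalog database, table, column mappings, and S3 output path are placeholders, and the script only runs inside a Glue job environment where the awsglue libraries are available.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a crawler-cataloged table (placeholder database/table names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders")

    # Rename/retype columns with a DynamicFrame transform.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")])

    # Write the curated output back to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")

    job.commit()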
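
A Boto3 sketch for the AMI cleanup Lambda mentioned above. The region list, the 90-day cutoff, and the "not referenced by any instance" idleness check are simplifying assumptions, not the exact production rules.

    import boto3
    from datetime import datetime, timezone, timedelta

    REGIONS = ["us-east-1", "us-west-2"]                      # hypothetical application regions
    CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)  # hypothetical idle threshold

    def lambda_handler(event, context):
        """Deregister self-owned AMIs older than the cutoff that are not
        referenced by any instance in the region (simplified idleness check)."""
        for region in REGIONS:
            ec2 = boto3.client("ec2", region_name=region)
            in_use = {inst["ImageId"]
                      for page in ec2.get_paginator("describe_instances").paginate()
                      for res in page["Reservations"]
                      for inst in res["Instances"]}
            for image in ec2.describe_images(Owners=["self"])["Images"]:
                created = datetime.fromisoformat(image["CreationDate"].replace("Z", "+00:00"))
                if image["ImageId"] not in in_use and created < CUTOFF:
                    ec2.deregister_image(ImageId=image["ImageId"])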

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Apache Spark, HBase, Apache Kafka, HIVE, SQOOP, Map Reduce, Snowflake, Apache PIG, Python, SSRS, Tableau.

Confidential, VIRGINIA

Azure Data Engineer

Responsibilities:

  • Worked on Azure Data Factory to integrate data from both on-prem and cloud sources (Blob Storage, Azure SQL DB) and applied transformations before loading back to Azure Synapse.
  • Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
  • Monitored Spark clusters with Log Analytics and the Ambari web UI; switched log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
  • Involved in developing data ingestion pipelines on Azure HDInsight Spark clusters using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
  • Exposure to NoSQL databases such as MongoDB.
  • Develop dashboards and visualizations to help business users analyze data and provide data insights to senior management with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
  • Migrated large datasets to Databricks (Spark), handling cluster creation and management, data pipeline configuration, and data loading from ADLS Gen2 to Databricks using ADF pipelines.
  • Created Multiple pipelines to load Data from Azure data Lake into staging SQLDB and then into Azure SQL DB.
  • Built Databricks notebooks to streamline and curate data for various business use cases and mounted Blob storage on Databricks (see the mount sketch after this list).
  • Experience in Spark application development using Spark SQL on Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to gain insights into customer usage patterns.
  • Responsible for cluster size estimation, monitoring, and troubleshooting of the Spark Databricks cluster.
  • Experience in building cloud functions in GCP.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a sample DAG follows this list).
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Used Azure Logic Apps to create workflows to schedule and automate batch jobs by integrating apps, ADF pipelines, and other services such as HTTP requests, email triggers, and more.
  • Extensive work on Azure Data Factory, including data transformations, integration runtimes, Azure Key Vaults, triggers and migration of Data Factory pipelines to higher environments using ARM templates.
  • Automated scripts and resulting workflows using Apache Airflow and Shell Scripts to ensure daily execution in production.
  • Built data pipeline architecture on the Azure cloud platform using NiFi, Azure Data Lake Storage, Azure HDInsight, Airflow, and other data engineering tools.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data store, and created DAGs to run in Airflow.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.
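
An illustrative Databricks notebook cell for the Blob storage mount bullet above. The storage account, container, and secret scope/key names are placeholders; spark and dbutils are globals provided by the Databricks notebook runtime.

    # Mount an Azure Blob Storage container onto DBFS if it is not already mounted.
    storage_account = "examplestorageacct"   # placeholder storage account
    container = "raw"                        # placeholder container
    mount_point = f"/mnt/{container}"

    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
            mount_point=mount_point,
            extra_configs={
                f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                    dbutils.secrets.get(scope="storage-secrets", key="account-key")
            },
        )

    # Curate the mounted data with Spark and write it back as a Delta table (illustrative path).
    df = spark.read.option("header", "true").csv(f"{mount_point}/orders/")
    (df.dropDuplicates(["order_id"])
       .write.format("delta").mode("overwrite").save(f"{mount_point}/curated/orders"))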
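
A sample Airflow DAG for the GCP pipeline bullet above. The bucket names, BigQuery dataset, and gsutil/bq commands are placeholders; the sketch assumes Airflow 2.x, for example on Cloud Composer.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

    # Daily ETL: copy raw files between buckets, then load them into BigQuery.
    with DAG(
        dag_id="daily_gcs_to_bigquery",
        default_args=default_args,
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        land_files = BashOperator(
            task_id="copy_raw_files",
            bash_command="gsutil -m cp gs://example-landing/{{ ds }}/*.csv gs://example-raw/{{ ds }}/",
        )
        load_bq = BashOperator(
            task_id="load_to_bigquery",
            bash_command=(
                "bq load --source_format=CSV --skip_leading_rows=1 "
                "analytics.daily_orders gs://example-raw/{{ ds }}/*.csv"
            ),
        )
        land_files >> load_bq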

Environment: Azure SQL DW, Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Apache Spark, Apache Airflow.

Confidential

Hadoop Developer

Responsibilities:

  • Communicated with business partners, business analysts, and product owners to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.
  • Development of Spark Streaming programs to process Kafka data in near real time, using both stateless and stateful transformations.
  • Responsible for estimating the cluster size, monitoring and troubleshooting of the Spark databricks cluster.
  • Created Databricks notebooks using SQL and Python, and automated notebooks using jobs.
  • Transform data by running a Python activity in Azure Databricks.
  • Migrate data into RV data pipeline using Databricks, Spark SQL and Scala.
  • Used Databricks for encrypting data using server-side encryption.
  • Performed data purging and applied changes using Databricks and Spark data analysis.
  • Used Databricks notebooks for interactive analysis utilizing Spark APIs.
  • Worked with Hive's data storage infrastructure, creating tables, distributing data by implementing partitions and buckets, and writing and optimizing HQL queries.
  • Created and implemented automated procedures to split large files into smaller batches of data to simplify FTP transfers and reduced execution time by 60%.
  • Analyzed data where it lives by Mounting Azure Data Lake and Blob to Databricks.
  • Built a common sftp download or upload framework using Azure Data Factory and Databricks.
  • Worked on developing ETL (Data Stage Open Studio) processes to load data from multiple data sources into HDFS using FLUME and SQOOP and made structural changes using Map Reduce, HIVE.
  • Wrote several MapReduce jobs using the Java API, Pig, and Hive to extract, transform, and aggregate data from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, and compression codecs such as gzip, Snappy, and LZO.
  • Strong understanding of partitioning and bucketing concepts in Hive; designed both external and managed tables in Hive to optimize performance (a partitioned-table sketch follows this list).
  • Developed ETL pipelines inside and outside the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Wrote Oozie scripts and configured workflows using the Apache Oozie workflow engine to manage and schedule Hadoop jobs.
  • Worked on implementing a log producer in Scala that retrieves application logs, transforms incremental logs, and sends them to Kafka and Zookeeper based on the log collection platform.
  • Used Big Data Development and Analytics using Hadoop stack (HDFS, MapReduce, Pig, Hive, Impala), Spark to analyze the data brought into HBase using the Hive-HBase integration and compute various metrics for dashboard reports.
  • Was responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue using Python and PySpark.
  • Transformed data with dynamic AWS Glue frame using PySpark; cataloged the transformed data using Crawlers and scheduled the job and crawler using the workflow feature.
  • End-to-end deployment of PySpark ETL pipelines to Azure Databricks to perform an Azure Data Factory (ADF) orchestrated data transformation, scheduled through Azure automation accounts and triggered with Tidal Scheduler.
  • Worked on cluster setup, data node commissioning and decommissioning, name node recovery, capacity planning and slot configuration.
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Used Python Boto3 to configure AWS services such as Glue, EC2, and S3.
  • Developed data pipeline programs with the Spark Scala API, performed data aggregations with Hive, and formatted data as JSON for visualization and generation.
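
A partitioned-table sketch for the Hive bullet above, expressed through Spark SQL with Hive support. The sales database, table, and column names are illustrative only.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()
    spark.sql("CREATE DATABASE IF NOT EXISTS sales")

    # External table partitioned by load date and bucketed by customer_id, stored as ORC,
    # so queries can prune partitions and benefit from bucketed joins.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION '/data/warehouse/sales/orders'
    """)

    # Partition-pruned query: only the requested load_date directory is scanned.
    spark.sql("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM sales.orders
        WHERE load_date = '2023-01-01'
        GROUP BY customer_id
    """).show()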

Environment: AWS, Cassandra, PySpark, Databricks, Apache Spark, HBase, Apache Kafka, HIVE, SQOOP, FLUME, Apache oozie, Zookeeper, ETL, UDF, Map Reduce, Snowflake, Apache Pig, Python, Java, SSRS.

Confidential

Python Developer

Responsibilities:

  • Used a test-driven approach to developing the application and implemented the unit tests using the Python unit testing framework.
  • Successfully migrated Django database from SQLite to MySQL to PostgreSQL with full data integrity.
  • Contributed to writing reports using SQL Server Reporting Services (SSRS), creating different types of reports such as grid, matrix, chart, and web reports by customizing URL access.
  • Developed views and templates using Python and the Django view controller and template language to create a user-friendly website interface.
  • Performed API tests with the Postman tool for different request methods such as GET, POST, PUT, and DELETE against each URL to verify responses and error handling.
  • Created Python and Bash tools to increase the efficiency of retail management application system and operations; data conversion scripts, AMQP/Rabbit MQ scripts, REST, JSON and CRUD for API integration.
  • Created several types of data visualizations using Python and Tableau.
  • Executed debugging and troubleshooting of web applications using Git as a version control tool to collaborate and coordinate with team members.
  • Developed and executed various MySQL database queries from Python using the Python MySQL connector and the MySQL database package (a brief sketch follows this list).
  • Built various graphs for business decision making using Python matplotlib library.
  • Implement code in Python to retrieve and manipulate data.
  • Designed and maintained databases using Python and developed a Python-based API (RESTful web service) using SQLAlchemy and PostgreSQL.
  • Created a web application with Python scripting for data processing, MySQL for the database, and HTML, CSS, jQuery, and Highcharts for data visualization of the provided pages.
  • Dynamically generated a list of properties for each application using Python modules such as math, glob, random, itertools, functools, NumPy, matplotlib, seaborn, and pandas.
  • Added navigation and paging and filter columns and added and removed desired columns for view with Python based GUI components.
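
A brief Python MySQL Connector sketch for the query bullet above. The connection parameters and the sales table are placeholders for the retail application's database.

    import mysql.connector

    # Placeholder credentials; in practice these came from configuration, not source code.
    conn = mysql.connector.connect(
        host="localhost", user="app_user", password="secret", database="retail")

    try:
        cursor = conn.cursor(dictionary=True)
        # Parameterized query to avoid SQL injection when filtering by store.
        cursor.execute(
            "SELECT sku, SUM(quantity) AS units FROM sales WHERE store_id = %s GROUP BY sku",
            (42,))
        for row in cursor.fetchall():
            print(row["sku"], row["units"])
    finally:
        conn.close()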

Environment: SQLite, MySQL, PostgreSQL, Python, Git, CRUD, Postman, RESTful web service, SOAP, HTML, CSS, jQuery, Django 1.4.
