
Senior AWS Big Data Engineer Resume


SUMMARY:

  • 7+ years of professional experience in Information Technology, with expertise in Big Data on the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
  • Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, ZooKeeper, and Hue.
  • Good experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Adept in programming languages such as Python, Scala, and Java, in Apache Spark and PySpark, and in Big Data technologies such as Hadoop, Hive, Kafka, ZooKeeper, and Grafana.
  • Expertise working with AWS cloud services such as EMR, S3, Redshift, and CloudWatch for Big Data development.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, EBS, Auto Scaling, Security Groups, EC2 Container Service (ECS), Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
  • Worked on streaming pipelines that consume data from Kafka topics and load it into a landing area for near-real-time reporting.
  • Experience developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
  • Experience developing Kafka producers and consumers that stream millions of events per second, as sketched after this list.
  • Extensively used Terraform with AWS Virtual Private Cloud to automatically set up and modify infrastructure by interfacing with the AWS control layer.
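
A minimal sketch of the kind of Kafka producer/consumer pair described above, using the kafka-python client; the broker address, topic name, and consumer group are illustrative assumptions, not values from the original projects.

```python
# Hypothetical Kafka producer/consumer pair (kafka-python client).
# Broker list, topic, and group id are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # assumed broker list
TOPIC = "clickstream-events"   # assumed topic name

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view"})
producer.flush()

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="reporting-loader",   # assumed consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    # Land each event for near-real-time reporting (destination is app-specific).
    print(message.value)
```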

PROFESSIONAL EXPERIENCE:

Confidential

Senior AWS Big Data Engineer

Responsibilities:

  • Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets, as sketched after this list.
  • Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
  • Migrated data from on-premises systems to AWS storage buckets.
  • Created data pipelines for business reports and processed streaming data using an on-premises Kafka cluster.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Processed data from Kafka topics and displayed the real-time streams in dashboards.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Responsible for building scalable, distributed data solutions on Amazon EMR clusters.
  • Developed Java MapReduce programs to analyze sample log files stored in the clusters.
  • Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling with Erwin.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Developed complex yet maintainable and easy-to-use Python and Scala code that satisfies application requirements for data processing and analytics using built-in libraries.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
  • Used Sqoop to channel data between HDFS and various RDBMS sources.
  • Worked on Docker and Kubernetes architecture for deploying the data pipelines.
  • Worked in the Databricks environment and created Delta Lake tables.
  • Worked with Airflow as a scheduling and orchestration tool.
  • Performed Spark job performance tuning for better performance and cost savings.
  • Worked on moving data from S3 to the Snowflake data warehouse; created external storage integration
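
A minimal, hypothetical sketch of a Lambda function that receives S3 object-created events and kicks off a Glue job, as mentioned above; the Glue job name and argument keys are assumptions rather than the actual project configuration.

```python
# Hypothetical AWS Lambda handler: triggered by S3 ObjectCreated events,
# it starts a Glue job for cleansing/transformation. The job name and
# argument keys are illustrative assumptions.
import boto3

glue = boto3.client("glue")

GLUE_JOB_NAME = "raw-to-curated-etl"   # assumed Glue job name

def lambda_handler(event, context):
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the newly landed object to the Glue job as custom job arguments.
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        runs.append(response["JobRunId"])
    return {"started_job_runs": runs}
```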

Confidential

Data Engineer

Responsibilities:

  • Designed AWS architecture and cloud migration involving AWS EMR, DynamoDB, Redshift, and event processing with Lambda functions.
  • Implemented Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Utilized AWS services with a focus on big data analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, and flexibility.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Participated in the development and implementation of the Cloudera Hadoop environment.
  • Ran queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Used Bash and Python (including Boto3) to supplement the automation provided by Ansible and Terraform for tasks such as encrypting the EBS volumes backing AMIs, as sketched after this list.
  • Used Terraform to migrate legacy and monolithic systems to Amazon Web Services.
  • Wrote Lambda function code and set a CloudWatch Events rule with a cron expression as the trigger.
  • Validated Sqoop jobs and shell scripts and performed data validation to check that data loaded correctly without any discrepancy.
  • Performed migration and testing of static and transactional data from one core system to another.
  • Created and ran Docker images with multiple microservices and orchestrated Docker containers using ECS, ALB, and Lambda.
  • Developed Spark scripts with custom RDDs in Scala for data transformations and performed actions on the RDDs.
  • Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Implemented Kafka producers with custom partitions, configured brokers, and implemented high-level consumers for the data platform.
  • Developed best practices, processes, and standards for effectively carrying out data migration activities.
  • Worked across multiple functional projects to understand data usage and its implications for data migration.
  • Prepared data migration plans including migration risk, milestones, quality, and business sign-off details.
  • Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Scala.
  • Worked on migrating MapReduce programs into Spark transformations using Scala.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Wrote Python modules to extract data from the MySQL source database.
  • Worked on the Cloudera distribution deployed on AWS EC2 instances.
  • Deployed the project on Amazon EMR with S3 connectivity for backup storage.
  • Created Jenkins jobs for CI/CD using Git, Maven, and Bash scripting.
  • Built a regression test suite in the CI/CD pipeline with data setup, test case execution, and tear-down using Cucumber/Gherkin, Java, Spring DAO, and PostgreSQL.
  • Connected Redshift to
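
A minimal sketch of the Boto3 automation mentioned above for encrypting the EBS volumes backing an AMI, done here by copying the image with encryption enabled; the AMI ID, region, and KMS key alias are assumptions.

```python
# Hypothetical Boto3 helper: create an encrypted copy of an AMI so that its
# backing EBS snapshots are encrypted. AMI ID, region, and KMS key are
# illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

def copy_ami_encrypted(source_image_id, kms_key_id="alias/aws/ebs"):
    """Copy an AMI with encryption enabled for all backing EBS snapshots."""
    response = ec2.copy_image(
        SourceImageId=source_image_id,
        SourceRegion="us-east-1",
        Name=f"{source_image_id}-encrypted",
        Encrypted=True,          # forces encrypted snapshots for the copy
        KmsKeyId=kms_key_id,
    )
    return response["ImageId"]

# Example usage with an assumed AMI ID:
# new_ami = copy_ami_encrypted("ami-0123456789abcdef0")
```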

Confidential

Data Engineer/Hadoop Spark Developer

Responsibilities:

  • Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.
  • Applied Hive tuning techniques such as partitioning, bucketing, and memory optimization, as sketched after this list.
  • Worked with different file formats including Parquet, ORC, JSON, and text files.
  • Migrated MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
  • Used Spark SQL to load data and create schema RDDs that load into Hive tables, and handled structured data with Spark SQL.
  • Analyzed the Hadoop cluster using different big data analytics tools including Flume, Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Spark, and Kafka.
  • As a Big Data developer, implemented solutions for ingesting data from various sources and processing data-at-rest using Big Data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Sqoop, and Talend.
  • Improved the performance and optimization of existing algorithms in Hadoop with Spark, using SparkContext, Spark SQL, DataFrames, pair RDDs, YARN, and PySpark.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
  • Architected and implemented medium-to-large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Worked in the Azure environment for development and deployment of custom Hadoop applications.
  • Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Developed a Python scripting process to connect to an Azure Blob container, retrieve the latest .bai and .xml files from the container, and load them into the SQL database.
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
  • Performed tuning on PostgreSQL databases using VACUUM and ANALYZE.
  • Extracted and loaded CSV and JSON file data from AWS S3 into the Snowflake cloud data warehouse.
  • Extracted and loaded data into the data lake environment (MS Azure) using Sqoop, which was accessed by business users.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Followed Databricks platform best practices for securing network access to cloud applications.
  • Performed data validation with record-wise counts between the source and destination.
  • Worked on the data support team handling bug fixes, schedule changes, memory tuning, schema changes, and loading of historic data.
  • Worked on implementing checkpoints such as Hive count check, Sqoop records check, done file create check, done
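
A minimal PySpark sketch of the Hive partitioning and bucketing tuning described above; the table name, database, column names, and input path are illustrative assumptions.

```python
# Hypothetical PySpark job: write source data into a Hive table that is
# partitioned and bucketed for faster, pruned reads. Table, database,
# columns, and input path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Assumed input: Parquet files with event_date, customer_id, amount columns.
df = spark.read.parquet("s3://example-bucket/raw/transactions/")

(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")        # partition pruning on date filters
    .bucketBy(32, "customer_id")      # bucketing to speed up joins/lookups
    .sortBy("customer_id")
    .saveAsTable("analytics.transactions_curated")  # assumed existing Hive database
)
```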

Confidential

Data Engineer/Hadoop Spark Developer

Responsibilities:

  • Extensively worked with the Spark SQL context to create DataFrames and Datasets for preprocessing the model data.
  • Used Hive to implement the data warehouse and stored data in HDFS.
  • Stored data in Hadoop clusters set up on AWS EMR.
  • Designed the HBase row key to store text and JSON as key values in HBase tables, structuring the key so rows can be fetched and scanned in sorted order, as sketched after this list.
  • Wrote JUnit tests and integration test cases for those microservices.
  • Worked in the Azure environment for development and deployment of custom Hadoop applications.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries and Pig scripts.
  • Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
  • Responsible for data extraction and ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
  • Built pipelines to move hashed and un-hashed data from XML files to the data lake.
  • Developed a NiFi workflow to pick up multiple files from an FTP location and move them to HDFS on a daily basis.
  • Worked with developer teams on a NiFi workflow to pick up data from a REST API server, the data lake, and an SFTP server and send it to Kafka.
  • Developed and maintained batch data flows using HiveQL and Unix scripting.
  • Built large-scale data processing systems in data warehousing solutions and worked with unstructured data mining on NoSQL.
  • Handled S3 data lake management; responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Specified the cluster size, allocated resource pools, and defined the Hadoop distribution by writing specification files in JSON format.
  • Configured Hadoop tools such as Hive, Pig, ZooKeeper, Flume, Impala, and Sqoop.
  • Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language (HQL).
  • Developed customized Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.
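
A minimal sketch of the sorted row-key design and range scan mentioned above, using the happybase Thrift client as one possible access path; the Thrift host, table name, column family, and key layout are assumptions.

```python
# Hypothetical HBase access via happybase (HBase Thrift client): row keys are
# built as "<entity_id>#<reverse_timestamp>" so a prefix scan returns rows in
# sorted order. Host, table, column family, and key layout are assumptions.
import happybase

MAX_TS = 10**13  # reverse timestamps so the newest rows sort first

def make_row_key(entity_id, timestamp_ms):
    return f"{entity_id}#{MAX_TS - timestamp_ms:013d}".encode("utf-8")

connection = happybase.Connection("hbase-thrift.example.internal")  # assumed host
table = connection.table("events")                                  # assumed table

# Write a JSON payload as a cell value keyed for sorted retrieval.
table.put(make_row_key("user42", 1700000000000),
          {b"d:payload": b'{"action": "login"}'})   # assumed column family "d"

# A prefix scan returns this entity's rows already sorted (newest first).
for key, data in table.scan(row_prefix=b"user42#"):
    print(key, data)
```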

Environment: Hadoop, Microservices, Java, MapReduce, Agile, HBase, JSON, Spark, Kafka, JDBC, AWS EMR/EC2/S3, Hive, Pig, Flume, Zookeeper, Impala, Sqoop.
