AWS Big Data Engineer Resume

Malvern, PA

SUMMARY

  • Around 9 years of total IT experience, including over 5 years in AWS Big Data and Hadoop and 2 years in the development and design of Java-based enterprise applications.
  • Extensive working experience with Hadoop ecosystem components such as HDFS, MapReduce, Hive, Sqoop, Flume, Spark, Kafka, Oozie, and Zookeeper.
  • Implemented performance tuning techniques for Spark SQL queries.
  • Strong knowledge of Hadoop HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
  • Strong hands-on experience in publishing messages to various Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
  • Created Spark jobs that process the true source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs.
  • Developed Sqoop scripts to migrate data from Teradata and Oracle to the big data environment.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked with the Hue GUI for scheduling jobs, file browsing, job browsing, and metastore management.
  • Experience in importing and exporting data between HDFS and relational database systems using Sqoop.
  • Hands-on experience in installing, configuring, supporting, and managing Hadoop clusters using Apache, Cloudera (CDH3, CDH4), and YARN-based (CDH 5.x) distributions.
  • Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets (see the sketch after this list).
  • Integrated AWS DynamoDB using AWS Lambda to store item values and back up the DynamoDB streams.
  • Experienced with AWS Elastic Beanstalk for application deployments and worked on AWS Lambda with Amazon Kinesis.
  • Developed a Marketing Cloud service on Amazon AWS and built serverless applications using AWS Lambda, S3, Redshift, and RDS.
  • Worked on large-scale data transfers across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
  • Wrote Python scripts to parse XML documents and load the data into the database.
  • Added support for AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
  • Experience in project deployment using Heroku/Jenkins and web services such as Amazon Web Services (AWS) EC2, S3, Auto Scaling, CloudWatch, and SNS.
  • Performed data scrubbing and processing with Oozie for workflow automation and coordination.
  • Hands-on experience in analyzing log files for Hadoop and ecosystem services and finding root causes.
  • Hands-on experience handling different file formats such as Avro, Parquet, SequenceFile, MapFile, CSV, XML, log, ORC, and RC.
  • Experience with NoSQL databases HBase, Cassandra, and MongoDB.
  • Experience with AIX/Linux RHEL, Unix Shell Scripting and SQL Server 2008.
  • Worked on the data search tool Elasticsearch and the data collection tool Logstash.
  • Strong knowledge in Hadoop cluster installation, capacity planning and performance tuning, benchmarking, disaster recovery plan and application deployment in production cluster.
  • Experience in developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008.
  • Exposure to Scrum, Agile, and Waterfall methodologies.
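
The Kinesis-to-DynamoDB Lambda pattern referenced in the bullets above can be pictured with a minimal Python sketch; the table name, partition key, and payload fields below are hypothetical placeholders rather than the actual production design.

    # Minimal sketch of a Kinesis-triggered Lambda writing records to DynamoDB.
    # Table name and payload fields are hypothetical placeholders.
    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("transactions")  # hypothetical table name


    def lambda_handler(event, context):
        """Decode each Kinesis record and persist it as a DynamoDB item."""
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            table.put_item(
                Item={
                    "transaction_id": payload["id"],  # hypothetical partition key
                    "arrival_time": str(record["kinesis"]["approximateArrivalTimestamp"]),
                    "body": json.dumps(payload),
                }
            )
        return {"processed": len(event["Records"])}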

TECHNICAL SKILLS

Programming Languages: Java, Scala, Python, SQL, and C/C++

Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend.

Hadoop Distributions: Cloudera Enterprise, Databricks, Hortonworks, EMC Pivotal.

Databases: Oracle, SQL Server, PostgreSQL.

Web Technologies: HTML, XML, jQuery, Ajax, CSS, JavaScript, JSON.

Streaming Tools: Kafka, RabbitMQ

Testing: Hadoop Testing, Hive Testing, MRUnit.

Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP.

Cloud: AWS EMR, Glue, RDS, CloudWatch, S3, Redshift Cluster, Kinesis, DynamoDB.

Technologies and Tools: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo.

Application Servers: Tomcat, JBoss.

IDEs: Eclipse, NetBeans, IntelliJ.

PROFESSIONAL EXPERIENCE

AWS Big Data Engineer

Confidential, Malvern, PA

RESPONSIBILITIES:

  • Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
  • Involved in creating an end-to-end data pipeline within a distributed environment using big data tools, the Spark framework, and Tableau for data visualization.
  • Ensured that the application continues to function normally through software maintenance and testing in the production environment.
  • Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
  • Implemented real-time solutions for Money Movement and transactional data using Kafka, Spark Streaming, and HBase.
  • The project also used a range of big data tools and programming languages, including Sqoop, Python, and Oozie.
  • Worked on the Oozie workflow engine to schedule and run multiple jobs.
  • Experience in creating a Python topology script to generate the CloudFormation template for creating the EMR cluster in AWS.
  • Experience in using the AWS services Athena, Redshift, and Glue ETL jobs.
  • Good knowledge of AWS services such as EC2, EMR, S3, Service Catalog, and CloudWatch.
  • Experience in using Spark SQL to handle structured data from Hive on the AWS EMR platform (m4.xlarge and m5.12xlarge clusters), as shown in the sketch after this list.
  • Explored Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Experienced in optimizing Hive queries and joins to handle different data sets.
  • Involved in creating Hive tables (managed and external), loading data, and analyzing it using Hive queries.
  • Actively involved in code reviews and bug fixing to improve performance.
  • Good experience in handling data manipulation using Python scripts.
  • Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
  • Created a Splunk dashboard to capture logs for the end-to-end data ingestion process.
  • Wrote unit test cases for Spark code as part of the CI/CD process.
  • Good knowledge of configuration management tools such as Bitbucket/GitHub and Bamboo (CI/CD).
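
A minimal PySpark sketch of the Spark SQL over Hive usage on EMR referenced above; the database, table, column, and S3 bucket names are hypothetical stand-ins, not the actual workloads.

    # Illustrative PySpark job reading Hive tables on EMR, joining with a broadcast,
    # and writing an aggregate to S3; all names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("daily-money-movement")  # hypothetical application name
        .enableHiveSupport()              # lets Spark SQL read Hive tables on EMR
        .getOrCreate()
    )

    # Read structured data from an existing Hive table.
    txns = spark.sql("SELECT account_id, amount, txn_date FROM finance_db.transactions")

    # Broadcast the small reference table to avoid shuffling the large side of the join.
    accounts = spark.table("finance_db.accounts")
    daily = (
        txns.join(F.broadcast(accounts), "account_id")
            .groupBy("account_id", "txn_date")
            .agg(F.sum("amount").alias("daily_total"))
    )

    daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")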

Big Data Engineer

Confidential, Phoenix, AZ

RESPONSIBILITIES:

  • Worked with extensive data sets in Big Data to uncover patterns and problems and unlock value for the enterprise.
  • Worked with internal and external data sources to improve data accuracy and coverage and generated recommendations on the process flow to accomplish the goal.
  • Ingested various types of data feeds, from both SOR and use-case perspectives, into the Cornerstone 3.0 platform.
  • Re-engineered the legacy IDN FastTrack process to get the Bloomberg data directly from the source to CS3.0.
  • Converted legacy shell scripts to MapReduce jobs running in a distributed manner, eliminating the burden of processing on the edge node.
  • Responsible for estimating the cluster size and monitoring and troubleshooting the Spark Databricks cluster.
  • Created Spark applications for data preprocessing for greater performance.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Experience in creating Spark applications using RDDs and DataFrames.
  • Worked extensively with Hive to analyze the data and create data-quality reports.
  • Implemented partitioning, dynamic partitions, and buckets in Hive to improve performance and organize data in a logical fashion (see the sketch after this list).
  • Wrote Hive queries for data analysis to meet the business requirements, and designed and developed user-defined functions (UDFs) for Hive.
  • Involved in creating Hive tables (managed and external), loading data, and analyzing it using Hive queries.
  • Good knowledge of configuration management tools such as SVN/CVS/GitHub.
  • Experience in configuring Event Engine nodes to import and export data between Teradata and HDFS.
  • Worked with the source to bring history data as well as BAU data from IDN Teradata to the Cornerstone platform, and also migrated feeds from CS2.0.
  • Expert in creating nodes in Event Engine per the use-case requirements to automate the BAU data flow.
  • Exported the Event Engine nodes created in the silver environment to the IDN repository in BitBucket and created DaVinci package to migrate it to Platinum.
  • Worked with the FDP team to create a secured flow to get data from the Kafka queue to CS3.0.
  • Expert in creating SFTP connections to internal and external sources to get data in a secured manner without any breakage.
  • Handled production incidents assigned to our workgroup promptly, fixing bugs or routing them to the respective teams, and optimized SLAs.
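
A minimal sketch of the Hive partitioning and dynamic-partition load pattern mentioned above, issued through PySpark with Hive support; the database, tables, and columns are hypothetical.

    # Partitioned Hive table plus a dynamic-partition load; names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Table partitioned by load date; bucketing (CLUSTERED BY ... INTO N BUCKETS)
    # can be added to the DDL when the loads run through Hive itself.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS cs_db.trades (
            trade_id  STRING,
            symbol    STRING,
            notional  DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS ORC
    """)

    # Dynamic partitioning lets one insert populate every load_date partition at once.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE cs_db.trades PARTITION (load_date)
        SELECT trade_id, symbol, notional, load_date
        FROM cs_db.trades_staging
    """)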

Hadoop Developer

Confidential, Emeryville, CA

RESPONSIBILITIES:

  • Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from various streaming sources such as Kafka and JMS (see the sketch after this list).
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data.
  • Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and partitions.
  • Applied custom Kafka encoders for custom input formats to load data into Kafka partitions.
  • Implemented a POC with Hadoop and extracted data into HDFS with Spark.
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed code to read data streams from Kafka and send them to the respective bolts through the respective streams.
  • Worked on Spark streaming using Apache Kafka for real time data processing.
  • Developed MapReduce jobs using the MapReduce Java API and HiveQL.
  • Developed UDF, UDAF, and UDTF functions and implemented them in Hive queries.
  • Developed scripts and batch jobs to schedule a bundle (a group of coordinators) consisting of various Hadoop programs using Oozie.
  • Experienced in optimizing Hive queries and joins to handle different data sets.
  • Involved in ETL, data integration, and migration by writing Pig scripts.
  • Integrated Hadoop with Solr and implemented search algorithms.
  • Experience with Storm for handling real-time processing.
  • Hands-on experience working in the Hortonworks distribution.
  • Worked hands-on with NoSQL databases such as MongoDB for POC purposes, storing images and URIs.
  • Designed and implemented MongoDB and associated RESTful web service.
  • Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks.
  • Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
  • Experience in processing large volumes of data and skills in parallel execution of processes using Talend functionality.
  • Used the Talend tool to create workflows for processing data from multiple source systems.
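
The Spark-plus-Kafka streaming work above used the DStream-based Spark Streaming API; the sketch below shows the same idea with the Structured Streaming Kafka source, using hypothetical brokers, topic, schema, and paths (it assumes the spark-sql-kafka connector is on the classpath).

    # Illustrative Structured Streaming job consuming a Kafka topic and landing
    # parsed events in HDFS; brokers, topic, schema, and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    event_schema = StructType([
        StructField("order_id", StringType()),
        StructField("status", StringType()),
        StructField("event_time", TimestampType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "orders")          # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
        .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/orders/")
        .option("checkpointLocation", "hdfs:///checkpoints/orders/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()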

Hadoop Developer

Confidential, Oakbrook, IL

RESPONSIBILITIES:

  • Experience with professional software engineering practices and best practices for the full software development life cycle including coding standards, code reviews, source control management and build processes.
  • Worked on analyzing the Hadoop cluster and different big data analytic tools, including MapReduce and Hive.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
  • Worked on Teradata Parallel Transporter (TPT) to load data from databases and files into Teradata.
  • Wrote views based on user and/or reporting requirements.
  • Wrote Teradata Macros and used various Teradata analytic functions.
  • Involved in migration projects to migrate data from Oracle/DB2 data warehouses to Teradata.
  • Configured Flume source, sink and memory channel to handle streaming data from server logs and JMS sources.
  • Experience in working with Flume to load the log data from multiple sources directly into HDFS.
  • Worked in the BI team in Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Involved in source system analysis, data analysis, data modeling to ETL (Extract, Transform and Load).
  • Handled structured and unstructured data and applied ETL processes.
  • Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes, and for loading data into HDFS.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
  • Implemented a logging framework, the ELK stack (Elasticsearch, Logstash, and Kibana), on AWS.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Coded complex Oracle stored procedures, functions, packages, and cursors for client-specific applications.
  • Experienced in using a Java REST API to perform CRUD operations on HBase data (a rough Python equivalent is sketched after this list).
  • Applied Hive queries to perform data analysis on HBase using the storage handler to meet the business requirements.
  • Wrote Hive queries to aggregate data that needs to be pushed to the HBase tables.
  • Created and modified shell scripts for scheduling various data cleansing scripts and the ETL loading process.
  • Supported and assisted QA engineers in understanding, testing, and troubleshooting.
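
The HBase access described above went through a Java REST API; the rough Python equivalent below uses the happybase Thrift client as a stand-in, with a hypothetical host, table, and column family, to show the same CRUD pattern.

    # CRUD against HBase via the happybase Thrift client; names are hypothetical.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = connection.table("customer_events")             # hypothetical table

    # Create / update: write a cell into the 'cf' column family.
    table.put(b"row-2024-001", {b"cf:status": b"PROCESSED"})

    # Read: fetch the whole row back as a dict of column -> value.
    row = table.row(b"row-2024-001")
    print(row.get(b"cf:status"))

    # Delete: remove the row once it is no longer needed.
    table.delete(b"row-2024-001")

    connection.close()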

ETL Developer

Confidential

RESPONSIBILITIES:

  • Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
  • Extensively used the Spark stack to develop a preprocessing job that uses the RDD, Dataset, and DataFrame APIs to transform data for upstream consumption.
  • Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from various streaming sources such as Kafka, Flume, and JMS.
  • Worked on extracting and enriching HBase data between multiple tables using joins in Spark.
  • Worked on writing APIs to load the processed data into HBase tables.
  • Replaced the existing MapReduce programs with Spark applications using Scala.
  • Built on-premise data pipelines using Kafka and Spark Streaming, fed from the API streaming gateway REST service.
  • Developed Hive UDFs to handle data quality and create filtered datasets for further processing (a PySpark variation is sketched after this list).
  • Experienced in writing Sqoop scripts to import data into Hive/HDFS from RDBMS.
  • Good knowledge of the Kafka Streams API for data transformation.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Used the Talend tool to create workflows for processing data from multiple source systems.
  • Created sample flows in Talend and StreamSets with custom-coded JARs and analyzed the performance of StreamSets and Kafka Streams.
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Optimized HiveQL and Pig scripts by using execution engines such as Tez and Spark.
  • Developed Hive Queries to analyze the data in HDFS to identify issues and behavioral patterns.
  • Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
  • Deployed applications using the Jenkins framework, integrating Git version control with it.
  • Participated in production support on a regular basis to support the analytics platform.
  • Used Rally for task/bug tracking.
  • Used GIT for version control.
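
The data-quality UDFs mentioned above were written as Hive UDFs; the sketch below shows the same filtering idea as a PySpark UDF, with a hypothetical column name, validation rule, and paths.

    # Data-quality filter expressed as a PySpark UDF; rule and paths are hypothetical.
    import re

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("dq-filter").getOrCreate()

    ACCOUNT_PATTERN = re.compile(r"^[A-Z]{2}\d{8}$")  # hypothetical account-id format


    @F.udf(returnType=BooleanType())
    def is_valid_account(account_id):
        """Return True only for non-null ids matching the expected pattern."""
        return account_id is not None and bool(ACCOUNT_PATTERN.match(account_id))


    raw = spark.read.parquet("hdfs:///data/raw/accounts/")
    clean = raw.filter(is_valid_account(F.col("account_id")))  # filtered dataset
    clean.write.mode("overwrite").parquet("hdfs:///data/clean/accounts/")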

ETL Developer

Confidential

RESPONSIBILITIES:

  • Analyzed business requirements, transformed data, and mapped source data from the source system to the Teradata Physical Data Model using the Teradata Financial Services Logical Data Model tool.
  • Developed Ab Initio graphs in accordance with the business requirements and worked on unit testing as well as system testing.
  • Played a key role in the overall design of the entire application.
  • Wrote multiple Ab Initio transforms to convert currency values coming from different countries.
  • Led the design and development effort to consolidate all business rules into a common XFR for reusability.
  • Configured Ab Initio config files to interact with the Teradata database.
  • Worked on common extract and load graphs from a Teradata database.
  • Reviewed and helped define test cases for system test and UAT.
  • Played a key role in database design for tables used for reporting purposes.
  • Responsible for code reviews across all modules of the application.
  • Anticipated changes (client requests, scope of work, adjustments in deadlines, etc.) and prepared teams for them.
  • Identified and applied best practices to the issues at hand, generated new perspectives and frameworks that allowed problems to be solved, and made difficult ideas and concepts easy to understand (e.g., using diagrams and analogies).
  • Integrated requirements from multiple subject areas/business areas to define solution requirements.
  • Facilitated and negotiated the resolution of conflicting requirements and out of scope requirements.
