AWS Data Engineer Resume
Dallas, TX
SUMMARY
- Around 5.5 years of experience as a Data Engineer in Big Data, using the Hadoop and Spark frameworks for analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies.
- 2+ years of experience as a Snowflake Engineer.
- Well versed in configuring and administering Hadoop clusters using Cloudera and Hortonworks.
- Experience in creating separate virtual data warehouses with different size classes in AWS Snowflake.
- Experience with data transformations utilizing SnowSQL and Python in Snowflake.
- Hands-on experience in bulk loading and unloading data into Snowflake tables using the COPY command (a minimal load/unload sketch follows this summary).
- Experience in working with AWS S3 and Snowflake cloud Data warehouse.
- Expertise in building Azure native enterprise applications and migrating applications from on-premises to Azure environment.
- Implementation experience with Data lakes and Business intelligence tools in Azure.
- Experience in creating real-time data streaming solutions using Apache Spark / Spark Streaming, Apache Storm, Kafka, and Flume.
- Currently working on Spark applications extensively, using Scala as the main programming language and processing streaming data with the Spark Streaming API.
- Used Spark DataFrames, Spark SQL, and the RDD API for various data transformations and dataset building.
- Developed RESTful web services to retrieve, transform, and aggregate data from different endpoints into Hadoop (HBase, Solr).
- Created Jenkins Pipeline using Groovy scripts for CI/CD.
- Exposure to data lake implementation; developed data pipelines and applied business logic using Spark.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
- Hands-on experience with real-time workloads on NoSQL databases such as MongoDB, HBase, and Cassandra.
- Experience in creating MongoDB clusters and hands-on experience with complex MongoDB aggregation functions and mapping.
- Experience in using Flume to load log files into HDFS and Oozie for data scrubbing and processing.
- Experience in performance tuning of Hive queries and MapReduce programs for scalability and faster execution.
- Experienced in handling real time analytics using HBase on top of HDFS data.
- Experience with transformations, grouping, aggregations, and joins using the Kafka Streams API.
- Hands-on experience deploying Kafka Connect in standalone and distributed modes and creating Docker containers.
- Created Kafka topics and wrote Kafka producers and consumers in Python as required; developed Kafka source/sink connectors to stream new data into topics and from topics into different databases as part of ETL tasks.
- Used Akka toolkit with Scala to perform builds.
- Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.
- Knowledge of Storm architecture; experience using data modeling tools such as Erwin
- Excellent experience using scheduling tools to automate batch jobs
- Hands-on experience using Apache Solr/Lucene
- Expertise with SQL Server: SQL queries, stored procedures, and functions
- Hands-on experience in application development using Hadoop, RDBMS, and Linux shell scripting
- Strong experience in Extending Hive and Pig core functionality by writing custom UDFs
- Experience in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across large volumes of structured and unstructured data.
- Extensive experience in text analytics, developing statistical, machine learning, and data mining solutions to various business problems, and building data visualizations using Python and R.
- Ability to work in a team and individually on many cutting-edge technologies.
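For illustration, a minimal sketch of the Snowflake bulk load/unload pattern referenced above, using the COPY command through the Snowflake Python connector. Connection parameters, stage, and table names are hypothetical placeholders, not values from any specific engagement.

```python
# Bulk load/unload sketch with the Snowflake Python connector.
# All identifiers and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="etl_user",           # placeholder credentials
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # Bulk load staged CSV files into a target table.
    cur.execute("""
        COPY INTO RAW.ORDERS
        FROM @RAW.S3_ORDERS_STAGE/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    # Unload the same table back to the stage as compressed CSV.
    cur.execute("""
        COPY INTO @RAW.S3_ORDERS_STAGE/exports/orders_
        FROM RAW.ORDERS
        FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
        OVERWRITE = TRUE
    """)
finally:
    conn.close()
```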
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Yarn, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera
Programming languages: SQL, Python, R, Scala, Spark, Linux shell scripts
Databases: RDBMS (MySQL, DB2, MS-SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse
OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica Power Center, Pentaho, Talend
Data Modelling Tools: Microsoft Visio, ER Studio, Erwin
Python and R libraries: R - tidyr, tidyverse, dplyr, reshape, lubridate; Python - BeautifulSoup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras.
Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, and Gradient Boost & Adaboost, Neural Networks and Time Series Analysis.
Data analysis Tools: Machine Learning, Deep Learning, Data Warehouse, Data Mining, Data Analysis, Big data, Visualizing, Data Munging, Data Modelling
Cloud Computing Tools: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Amazon Web Services: EMR, EC2, S3, RDS, Cloud Search, Redshift, Data Pipeline, Lambda.
Reporting Tools: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
IDEs: PyCharm
Development Methodologies: Agile, Scrum, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
AWS Data Engineer
Responsibilities:
- Developed Talend Big Data jobs to load heavy volumes of data into the S3 data lake and then into Snowflake.
- Developed Snowpipes for continuous ingestion of data using event notifications from AWS S3 buckets.
- Developed SnowSQL scripts to deploy new objects and update changes in Snowflake.
- Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.
- Working with AWS stack S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.
- Designing and implementing new Hive tables, views, and schemas, and storing data optimally.
- Performing Sqoop jobs to land data on HDFS and running validations.
- Configuring Oozie Scheduler Jobs to run the Extract jobs and queries in an automated way.
- Querying data by optimizing the query and increasing the query performance.
- Designing and creating SQL Server tables, views, stored procedures, and functions.
- Performing ETL operations using Apache Spark, also using Ad-Hoc queries, and implementing Machine Learning techniques.
- Worked on configuring CI/CD for CaaS deployments (Kubernetes).
- Involved in migrating master data from Hadoop to AWS.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
- Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files (a minimal sketch follows this list).
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Worked on Big Data infrastructure for batch processing and real-time processing using Apache Spark
- Developed Apache Spark applications by using Scala for data processing from various streaming sources
- Processed web server logs by developing multi-hop Flume agents using Avro sinks and loaded the data into Cassandra for further analysis; extracted files from Cassandra through Flume.
- Responsible for design and development of Spark SQL Scripts based on Functional Specifications
- Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDD's and Scala
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
- Developed helper classes abstracting the Cassandra cluster connection to act as a core toolkit.
- Involved in creating Data Lake by extracting customer's data from various data sources to HDFS which include data from Excel, databases, and log data from servers
- Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
- Extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them using Hive.
- Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system
- Extended Hive/Pig core functionality by writing custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs) for Hive and Pig.
- Involved in loading data from REST endpoints into Kafka producers and transferring the data to Kafka brokers (a minimal Python producer/consumer sketch follows this section's environment list).
- Used Apache Kafka functionalities such as distribution, partitioning, and the replicated commit log service for messaging.
- Partitioned data streams using Kafka; designed and configured the Kafka cluster to accommodate heavy throughput.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team
- Used Apache Oozie for scheduling and managing multiple Hive Jobs. Knowledge of HCatalog for Hadoop based storage management
- Migrated an existing on-premises application to Amazon Web Services (AWS), using services such as EC2 and S3 for small-data-set processing and storage; experienced in maintaining the Hadoop cluster on AWS EMR.
- Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats like Text, Avro, Sequence, XML, JSON, and Parquet
- Generated various kinds of reports using Pentaho and Tableau based on Client specification
- Gained exposure to new tools such as Jenkins, Chef, and RabbitMQ.
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint
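For illustration, a minimal PySpark sketch of the JSON-flattening preprocessing job described above: read nested JSON from S3 with Spark DataFrames and write flat files. Bucket names, paths, and field names are hypothetical.

```python
# Flatten nested JSON documents from S3 into flat CSV files.
# Paths and field names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Read nested JSON documents from S3 into a DataFrame.
raw = spark.read.json("s3a://example-bucket/raw/orders/")

# Explode the nested items array and pull selected fields to top level.
flat = (
    raw.withColumn("item", F.explode_outer("order.items"))
       .select(
           F.col("order.id").alias("order_id"),
           F.col("customer.name").alias("customer_name"),
           F.col("item.sku").alias("sku"),
           F.col("item.qty").cast("int").alias("qty"),
       )
)

# Write the flattened records out as flat files (CSV with a header).
flat.write.mode("overwrite").option("header", True).csv(
    "s3a://example-bucket/curated/orders_flat/"
)

spark.stop()
```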
Environment: Snowflake, SnowSQL, Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, Oozie, Spark, Scala, AWS, EC2, S3, EMR, Cassandra, Flume, Kafka, Pig, Linux, Shell Scripting
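For illustration, a minimal sketch of the Kafka producer/consumer work in Python mentioned above, using the kafka-python client. Broker addresses, topic names, and record contents are placeholders.

```python
# Simple Kafka producer and consumer using the kafka-python client.
# Brokers, topic, and payload are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]
TOPIC = "orders-events"

# Producer: publish JSON-encoded records (e.g. pulled from a REST endpoint).
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 42, "status": "CREATED"})
producer.flush()

# Consumer: read records back for a downstream ETL step.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.value)       # replace with the actual sink/ETL write
```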
Confidential, Woonsocket, RI
Hadoop / Spark Developer
Responsibilities:
- Responsible for scalable distributed infrastructure for model development and code promotion using Hadoop.
- Developed Spark scripts by using Python and Shell scripting commands as per the requirement.
- Used Spark API over Cloudera/Hadoop/YARN to perform analytics on data in Hive and MongoDB.
- Developed Scala scripts, UDFs using both Data frames/SQL/Data sets and RDD/MapReduce in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
- Performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Used Spark for series of dependent jobs and for iterative algorithms. Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS.
- Developed MapReduce jobs in Java to convert data files into Parquet file format.
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
- Developed a Spark Streaming pipeline in Python to parse JSON data and store it in Hive tables (a minimal streaming sketch follows this list).
- Developed Hive queries to process the data and generate the data cubes for visualizing.
- Implemented schema extraction for Parquet and Avro file Formats in Hive/MongoDB.
- Experienced working on AWS using EMR; performed operations on AWS with EC2 instances and S3 storage, and worked with RDS, Lambda, and Redshift for analytics.
- Worked with Talend open studio for designing ETL Jobs for Processing of data.
- Used Zookeeper to manage Hadoop clusters and Oozie to schedule job workflows.
- Worked with continuous Integration of application using Jenkins.
- Developed ETL applications using Hive, Spark, Impala, and Sqoop, with automation using Oozie.
- Used AWS EMR as the data processing platform and worked with AWS S3 and Snowflake as data storage platforms.
- Used Reporting tools to connect with Hive for generating daily reports of data.
- Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
- Wrote HiveQL as per requirements, processed data in the Spark engine, and stored results in Hive tables.
- Imported existing datasets from Oracle into the Hadoop system using Sqoop.
- Wrote Spark Core programs to process and cleanse data, then loaded that data into Hive or HBase for further processing.
- Used Partitioning and Bucketing techniques in Hive to improve the performance, involved in choosing different file formats like ORC, Parquet over text file format.
- Responsible for importing data from Postgres to HDFS, HIVE, MongoDB, HBASE using SQOOP tool.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement.
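For illustration, a minimal Spark Structured Streaming sketch of the Kafka-to-HDFS/Hive pipeline described above: parse JSON events from Kafka and land them as Parquet. Broker, topic, schema, and paths are placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Parse JSON events from Kafka and write them to HDFS as Parquet.
# Broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "app-events")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

query = (
    events.writeStream.format("parquet")
          .option("path", "hdfs:///data/app_events/")
          .option("checkpointLocation", "hdfs:///checkpoints/app_events/")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()
```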
Environment: Hadoop YARN, AWS (EC2, S3, RDS, Lambda, Redshift), Spark Core, Spark SQL, Scala, Python, Hive, MongoDB, Sqoop, Impala, Oracle, HBase, Oozie, Jenkins, HDFS, Snowflake, Zookeeper, Autosys, Windows, SQL, OLTP, Shell Script, Cloudera, MapReduce, Parquet, Avro, Linux, Git.
Confidential, Boston, MA
Spark/ Hadoop Developer
Responsibilities:
- Developed Spark applications using Scala.
- Used Data frames/ Datasets to write SQL type queries using Spark SQL to work with datasets.
- Performed real-time streaming jobs using Spark Streaming to analyze incoming data from Kafka over regular window intervals.
- Created Hive tables and had extensive experience with HiveQL.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
- Extended Hive functionality by writing custom UDFs, UDAFs, UDTFs to process large data.
- Performed Hive upserts, partitioning, bucketing, and windowing operations, and wrote efficient queries for faster data operations (a minimal windowed-query sketch follows this list).
- Involved in moving data from HDFS to AWS Simple Storage Service (S3) and extensively worked with S3 buckets in AWS.
- Created and maintained technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts
- Responsible for loading bulk amounts of data into HBase using MapReduce by directly creating HFiles and loading them.
- Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS with partitions, and used Spark to extract the schema of the JSON files.
- Imported and exported data between relational database systems and HDFS/Hive using Sqoop.
- Wrote custom Kafka consumer code and modified existing producer code in Python to push data to Spark-streaming jobs.
- Scheduled jobs and automated workflows using Oozie.
- Automated the movement of data using NIFI dataflow framework and performed streaming and batch processing via micro batches. Controlled and monitored data flow using web UI.
- Worked with the HBase database to perform operations on large sets of structured, semi-structured, and unstructured data coming from different data sources.
- Exported analytical results to MS SQL Server and used Tableau to generate reports and visualization dashboards.
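For illustration, a minimal sketch of the partition-aware, windowed HiveQL referenced above, run through Spark SQL. Database, table, partition, and column names are hypothetical.

```python
# Windowed query over a partitioned Hive table via Spark SQL.
# Database, table, and column names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-window-queries")
    .enableHiveSupport()
    .getOrCreate()
)

# Keep the most recent record per customer (a common upsert-style dedup),
# pruning the scan to a single date partition.
latest = spark.sql("""
    SELECT customer_id, order_id, amount, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY updated_at DESC) AS rn
        FROM sales.orders
        WHERE ds = '2020-01-01'        -- prune to one Hive partition
    ) t
    WHERE rn = 1
""")

latest.write.mode("overwrite").saveAsTable("sales.orders_latest")
```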
Environment: AWS, S3, Cloudera, Spark, Spark SQL, HDFS, HiveQL, Hive, Zookeeper, Hadoop, Python, Scala, Kafka, Sqoop, MapReduce, Oozie, Tableau, MS SQL Server, HBase, Agile, Eclipse.
Confidential
Hadoop Developer
Responsibilities:
- Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR and MapR (MapR data platform).
- Developed simple to complex MapReduce streaming jobs using Python, Hive, and Pig.
- Used various compression mechanisms to optimize Map/Reduce Jobs to use HDFS efficiently.
- Used ETL component Sqoop to extract the data from MySQL and load data into HDFS.
- Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
- Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Involved in migrating data from different sources into a pub-sub model in Kafka using producers and consumers, and pre-processed data using Storm topologies.
- Strong exposure to Unix scripting and good hands-on shell scripting experience.
- Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export through Python.
- Wrote Python scripts to process semi-structured data in formats like JSON.
- Involved in loading and transforming of large sets of structured, semi structured and unstructured data.
- Troubleshot and found bugs in the Hadoop applications, working with the testing team to clear them.
- Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Load data into Amazon Redshift and use AWS Cloud Watch to collect and monitor AWS RDS instances within Confidential.
- Used the Python API to develop Kafka producers and consumers for writing Avro schemas.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Developed PySpark code for AWS Glue jobs and for EMR (a minimal Glue job sketch follows this section's environment list).
- Installed the Ganglia monitoring tool to generate Hadoop cluster reports such as CPUs running and hosts up/down, and performed operations to maintain the Hadoop cluster.
- Responsible for analyzing and data cleaning using Spark SQL Queries.
- Handled importing of data from various data sources performed transformations using spark and loaded data into hive.
- Created HBase tables and HBase sinks and loaded data into them to perform analytics using Tableau.
- Imported Data from AWS S3 into Spark RDD and performed transformation and actions on RDDs
- Developed Python code for tasks, dependencies, and time sensors for job workflow management and automation using Apache Airflow.
- Developed DAGs in the Airflow scheduler using Python; experienced in creating DAG runs and task instances (a minimal DAG sketch follows this list).
- Created instances of operators, sensors, and transfers to trigger DAGs and perform computations.
- Worked with spark core, Spark Streaming and Spark SQL modules of Spark.
- Used Scala to write code for all Spark use cases, with extensive experience using Scala for data analytics on the Spark cluster; performed map-side joins on RDDs.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Involved in implementing DevOps culture through CI/CD tools like Repos, Code Deploy, Code Pipeline, GitHub.
- Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
- Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
- Determined the viability of business problems for Big Data solutions with PySpark.
- Proactively monitored systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Monitored multiple Hadoop cluster environments using Ganglia, and monitored workload, job performance, and capacity planning using MapR.
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
- Involved in time series data representation using HBase.
- Strong working experience with Splunk for real-time log data monitoring.
- Built clusters in the AWS environment using EMR with S3, EC2, and Redshift.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports by our BI team.
- Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Worked with BI/Tableau teams on dataset requirements, with good working experience in data visualization.
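For illustration, a minimal Airflow DAG sketch matching the workflow-automation bullets above: a time sensor followed by dependent Python tasks. Airflow 2.x imports are assumed; the schedule, task logic, and names are placeholders.

```python
# Minimal Airflow DAG with a time sensor and two dependent tasks.
# Schedule, callables, and names are placeholders (Airflow 2.x style).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_delta import TimeDeltaSensor


def extract():
    print("pull source data")       # placeholder for the real extract step


def load():
    print("load curated data")      # placeholder for the real load step


with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    wait = TimeDeltaSensor(task_id="wait_for_upstream", delta=timedelta(minutes=15))
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    wait >> extract_task >> load_task
```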
Environment: MapReduce, AWS, S3, EC2, EMR, MySQL, RedShift, Hadoop, Tableau, Lambda, Glue, Java, HDFS, Hive, Pig, Tez, Oozie, Hbase, Spark, GitHub, Docker, Scala, Spark SQL, Spark Streaming, Kafka, Python, Putty, Pyspark, Cassandra, Shell Scripting, ETL, YARN,Splunk, Sqoop, LINUX, Cloudera, Ganglia, SQL Server.
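For illustration, a minimal AWS Glue PySpark job sketch in the spirit of the Glue bullet above: read a catalog table, deduplicate, and write Parquet to S3. Database, table, bucket names, and the dedup key are placeholders; the script assumes the Glue job runtime, where the awsglue libraries are provided.

```python
# Minimal AWS Glue PySpark job: catalog read -> transform -> S3 Parquet write.
# Catalog and S3 names are placeholders; runs inside the Glue job runtime.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
)

# Simple transformation on the underlying Spark DataFrame.
df = dyf.toDF().dropDuplicates(["order_id"])

# Write the curated result back to S3 as Parquet.
df.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

job.commit()
```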
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce, HDFS, Developed multiple MapReduce jobs for data cleaning and preprocessing.
- Involved in creating Hive tables, writing complex Hive queries to populate Hive tables.
- Load and transform large sets of structured, semi-structured and unstructured data.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for dashboard reporting.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Worked on Spark RDD transformations to map business analysis and applied actions on top of the transformations (a minimal RDD sketch follows this list).
- Experienced in working with the Spark ecosystem using Spark SQL and Scala on different formats such as text, Avro, and Parquet files.
- Optimized HiveQL scripts by using the Tez execution engine.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist the data into HDFS.
- Implemented SQOOP for large dataset transfer between Hadoop and RDBMS.
- Used different file formats like Text files, Avro, Parquet and ORC.
- Worked with different File Formats like text file, Parquet for HIVE querying and processing based on business logic.
- Used JIRA for creating the user stories and creating branches in the bitbucket repositories based on the story.
- Knowledge on creating various repositories and version control using GIT.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
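For illustration, a minimal PySpark RDD sketch of the transformation/action pattern referenced above: parse raw text records, aggregate per key, and pull a small result back to the driver. The input path and record layout are hypothetical.

```python
# RDD transformations and actions over raw text records.
# Input path and record layout are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-metrics").getOrCreate()
sc = spark.sparkContext

# Each input line: "user_id,page,duration_ms"
lines = sc.textFile("hdfs:///data/clickstream/*.txt")

page_time = (
    lines.map(lambda line: line.split(","))
         .filter(lambda parts: len(parts) == 3)
         .map(lambda parts: (parts[1], int(parts[2])))   # (page, duration_ms)
         .reduceByKey(lambda a, b: a + b)                # total time per page
)

# Action: bring the top pages back to the driver for reporting.
for page, total_ms in page_time.takeOrdered(10, key=lambda kv: -kv[1]):
    print(page, total_ms)

spark.stop()
```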
Environment: Spark, Scala, Python, Hadoop, MapReduce, CDH, Cloudera Manager, Control M Scheduler, Shell Scripting, Agile Methodology, JIRA, Git, Tableau.