Hadoop Data Engineer Resume
SUMMARY
- Skilled in managing data analytics, data processing, database, and data-driven projects
- Skilled in architecting big data systems, ETL pipelines, and analytics systems for diverse end users
- Skilled in database systems and administration
- Proficient in writing technical reports and documentation
- Adept with various distributions and platforms such as Cloudera Hadoop, Hortonworks, Elastic Cloud, and Elasticsearch
- Expert in bucketing and partitioning
- Expert in Performance Optimization
TECHNICAL SKILLS
- Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS
- Hortonworks, MapR, MapReduce
- HiveQL, MapReduce, XML, FTP, Python, UNIX, shell scripting, Linux
- Unix/Linux, Windows 10, Ubuntu, Apple OS
- Parquet, Avro, JSON, ORC, text, CSV
- Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic/ELK
- Apache Spark, Spark Streaming, Apache Flink, Apache Storm
- Pentaho, QlikView, Tableau, PowerBI, Matplotlib, Plotly, Dash
- Databases & data stores: Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
- Microsoft Project, Primavera P6, VMware, Microsoft Word, Excel, Outlook, PowerPoint; technical documentation skills
PROFESSIONAL EXPERIENCE
Confidential
HADOOP DATA ENGINEER
Responsibilities:
- Configured Linux across multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration
- Created a pipeline to gather data using PySpark, Kafka, and HBase (see the sketch after this list)
- Sent requests to the source REST-based API from a Scala script via a Kafka producer
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance
- Hands-on experience with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs
- Used Spark jobs, Spark SQL, and the DataFrame API to load structured data into Spark clusters
- Configured a Kafka broker and applied the defined schema to consume structured data via Spark Structured Streaming
- Defined Spark data schemas and set up the development environment inside the cluster
- Managed spark-submit jobs across all environments
- Interacted with data residing in HDFS using PySpark to process the data
- Decoded raw data and loaded it into JSON before sending the batched streaming file over the Kafka producer
- Received the JSON response in a Kafka consumer written in Python
- Parsed the JSON and formatted the response into a DataFrame using a schema containing News Type, Article Type, Word Count, and News Snippet
- Established a connection between HBase and Spark to transfer the newly populated DataFrame
- Designed a Spark Scala job to consume information from S3 buckets
- Monitored background operations in Hortonworks Ambari
- Monitored HDFS job status and DataNode health according to the specs
- Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster
- Managed Hive connections with databases, managed tables, and external tables
- Set up ELK collections across all environments and configured shard replication
- Created standardized documents for company-wide usage
- Worked one-on-one with clients to resolve issues regarding Spark job submissions
- Worked on AWS Kinesis for processing huge amounts of real-time data
- Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS
- Implemented Hortonworks medium- and low-priority recommendations across all environments
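
Below is a minimal sketch of the Kafka-to-Spark step in the pipeline described above: reading a topic with Spark Structured Streaming and applying the News Type / Article Type / Word Count / News Snippet schema. The topic name, broker addresses, field names, and the console sink are illustrative assumptions; the production sink was HBase.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("news-stream").getOrCreate()

# Schema used to parse the JSON payload (fields as described in the bullets above)
schema = StructType([
    StructField("news_type", StringType()),
    StructField("article_type", StringType()),
    StructField("word_count", IntegerType()),
    StructField("news_snippet", StringType()),
])

# Read the batched JSON messages published by the Kafka producer
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
       .option("subscribe", "news-articles")                            # hypothetical topic
       .load())

# Kafka delivers bytes; cast the value to a string and apply the schema
articles = (raw.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("data"))
            .select("data.*"))

# Placeholder sink for this sketch; in production each micro-batch went to HBase
query = articles.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```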
Confidential
HADOOP DATA ENGINEER
Responsibilities:
- Involved in the creation of Hive tables, loading data from different sources and performing Hive queries.
- Implemented data queries and transformations on Hive/SQL tables by using the Spark SQL and Spark DataFrames APIs
- Worked on importing and exporting data between HDFS and relational databases
- Created a pipeline to gather new music releases of a country for a given week using PySpark, Kafka, and Hive (see the sketch after this list)
- Sent requests to the Confidential REST-based API from a Python script via a Kafka producer
- Collected log data from various sources and integrated it into HDFS using Flume, staging the data in HDFS for further analysis
- Received the JSON response in a Python Kafka consumer and parsed it into a DataFrame using a schema containing country code, artist name, number of plays, and genre
- Established a connection between Hive and Spark to transfer the newly populated DataFrame
- Extracted metadata from Hive tables with HiveQL
- Utilized a cluster of three Kafka brokers to handle replication needs and allow for fault tolerance
- Stored the data pulled from the API into Apache Hive on Hortonworks Sandbox
- Utilized HiveQL to query the data to discover music release trends from week to week
- Assisted in the installation and configuration of Hive, Sqoop, Flume, and Oozie on the Hadoop cluster with the latest patches
- Loaded ingested data into Hive managed and external tables
- Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)
- Performed upgrades, patches and bug fixes in Hadoop in a cluster environment
- Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access the data through Hive based views
- Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language
- Built Hive views on top of the source data tables and set up secured provisioning
- Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster
- Wrote shell scripts for automating the process of data loading
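
A minimal sketch of the consumer-to-Hive flow described above: a Python Kafka consumer collects the JSON responses, they are formatted into a Spark DataFrame with the country code / artist name / number of plays / genre schema, and the result is stored in Hive. The topic, broker addresses, and table name are illustrative assumptions.

```python
import json
from kafka import KafkaConsumer                      # kafka-python client
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (SparkSession.builder
         .appName("music-releases")
         .enableHiveSupport()                        # allow writing to Hive tables
         .getOrCreate())

schema = StructType([
    StructField("country_code", StringType()),
    StructField("artist_name", StringType()),
    StructField("number_of_plays", IntegerType()),
    StructField("genre", StringType()),
])

consumer = KafkaConsumer(
    "new-releases",                                  # hypothetical topic name
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Collect one weekly batch of responses and format it into a DataFrame
rows = [(m.value["country_code"], m.value["artist_name"],
         m.value["number_of_plays"], m.value["genre"]) for m in consumer]
df = spark.createDataFrame(rows, schema=schema)

# Persist into a Hive table so weekly release trends can be queried in HiveQL
df.write.mode("append").saveAsTable("music.new_releases")
```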
Confidential
AWS Cloud DATA ENGINEER
Responsibilities:
- Created and managed cloud VMs with the AWS EC2 command-line clients and the AWS Management Console
- Used Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
- Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud
- Used an Ansible Python script to generate inventory and push deployments to AWS instances
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with storage in Amazon Simple Storage Service (S3) and AWS Redshift
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (see the sketch after this list)
- Populated database tables via AWS Kinesis Firehose and AWS Redshift
- Automated the installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without data loss and publish it to various destinations
- Used AWS CloudFormation templates alongside Terraform with existing plugins
- Developed AWS CloudFormation templates to create custom infrastructure for our pipeline
- Implemented AWS IAM user roles and policies to authenticate and control access
- Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS
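
A minimal sketch of an S3-triggered Lambda function of the kind described above, written in Python with boto3. The DynamoDB table name and the audit-record fields are illustrative assumptions; the actual triggers and downstream actions were project-specific.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingest_audit")               # hypothetical audit table


def lambda_handler(event, context):
    """Record each newly created S3 object in a DynamoDB audit table."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
    return {"statusCode": 200, "body": json.dumps(f"processed {len(records)} records")}
```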
Confidential
BIG DATA ENGINEER
Responsibilities:
- Wrote shell scripts to automate data ingestion tasks.
- Used Cron jobs to schedule the execution of data processing scripts.
- Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift
- Used AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored in Amazon Simple Storage Service (S3)
- Worked with Spark using Scala and Spark SQL for faster data processing; built Spark Streaming pipelines with ingestion tools like Kafka and Flume
- Automated AWS components like EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates
- Implemented AWS-provided security measures, employing Confidential concepts of AWS Identity and Access Management (IAM)
- Migrated complex MapReduce scripts to Apache Spark RDD code (see the sketch after this list)
- Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data from the HDFS.
- Developed data transformation pipelines using Spark RDDs and Spark SQL.
- Created multiple batch Spark jobs using Java
- Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications
- Developed metrics, attributes, filters, reports, dashboards and created advanced chart types, visualizations and complex calculations to manipulate the data.
- Implemented a Hadoop cluster and different processing tools including Spark, MapReduce
- Pushed containers into AWS ECS
- Used Scala to connect to EC2 and push files to AWS S3
- Installed, configured, and managed tools such as ELK and AWS CloudWatch for resource monitoring
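
A minimal sketch of a MapReduce-to-Spark-RDD migration of the kind described above, using a word count as the stand-in job; the input and output HDFS paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-rdd").getOrCreate()
sc = spark.sparkContext

# The classic map and reduce phases expressed as RDD transformations
counts = (sc.textFile("hdfs:///data/logs/*.txt")        # hypothetical input path
          .flatMap(lambda line: line.split())           # "map": emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))             # "reduce": sum counts per word

counts.saveAsTextFile("hdfs:///data/output/wordcount")  # hypothetical output path
spark.stop()
```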