Hadoop Data Engineer Resume

SUMMARY

  • Skilled in managing data analytics, data processing, database, and data-driven projects
  • Skilled in architecting big data systems, ETL pipelines, and analytics systems for diverse end users
  • Skilled in Database systems and administration
  • Proficient in writing technical reports and documentation
  • Adept with various distributions and platforms such as Cloudera Hadoop, Hortonworks, Elastic Cloud, and Elasticsearch
  • Expert in bucketing and partitioning
  • Expert in Performance Optimization

TECHNICAL SKILLS

  • Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS
  • Hortonworks, MapR, MapReduce
  • HiveQL, XML, FTP, Python, UNIX/Linux, shell scripting
  • Unix/Linux, Windows 10, Ubuntu, Apple macOS
  • Parquet, Avro, JSON, ORC, text, CSV
  • Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic/ELK
  • Apache Spark, Spark Streaming, Apache Flink, Apache Storm
  • Pentaho, QlikView, Tableau, Power BI, Matplotlib, Plotly, Dash
  • Databases and data structures: Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
  • Microsoft Project, Primavera P6, VMware, Microsoft Word, Excel, Outlook, PowerPoint; technical documentation

PROFESSIONAL EXPERIENCE

Confidential

HADOOP DATA ENGINEER

Responsibilities:

  • Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration
  • Created a pipeline to gather data using PySpark, Kafka and HBase
  • Sent requests to source REST Based API from a Scala script via Kafka producer
  • Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance
  • Hands-on experience with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs
  • Developed Spark jobs using Spark SQL and the DataFrame API to load structured data into Spark clusters
  • Created a Kafka broker that serves schema-conformant structured data to Spark Structured Streaming jobs
  • Defined Spark data schemas and set up the development environment inside the cluster
  • Managed spark-submit jobs across all environments
  • Interacted with data residing in HDFS using PySpark to process the data
  • Decoded the raw data, converted it to JSON, and sent the batched streaming file through the Kafka producer
  • Received the JSON response in a Kafka consumer written in Python
  • Formatted the response into a data frame, using a schema containing News Type, Article Type, Word Count, and News Snippet to parse the JSON (a sketch of this consume-and-parse step follows this list)
  • Established a connection between the HBase and Spark for the transfer of the newly populated data frame
  • Designed Spark Scala jobs to consume data from S3 buckets
  • Monitored background operations in Hortonworks Ambari
  • Monitored HDFS job status and DataNode health against the cluster specifications
  • Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster
  • Managed hive connection with tables, databases and external tables
  • Set up ELK collections in all environments and configured shard replication
  • Created standardized documents for company-wide use
  • Worked one-on-one with clients to resolve Spark job submission issues
  • Worked on AWS Kinesis for processing huge amounts of real-time data
  • Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS
  • Implemented Hortonworks medium and low recommendations across all environments
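
A minimal PySpark Structured Streaming sketch of the consume-and-parse step referenced above. The broker addresses, topic name, and column names are illustrative assumptions rather than the actual project configuration; the real job wrote the resulting data frame to HBase.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("news-stream").getOrCreate()

    # Schema mirroring the fields named in the bullet above.
    schema = StructType([
        StructField("news_type", StringType()),
        StructField("article_type", StringType()),
        StructField("word_count", IntegerType()),
        StructField("news_snippet", StringType()),
    ])

    # Read the batched JSON messages published by the Kafka producer.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # assumed brokers
           .option("subscribe", "news-articles")                            # hypothetical topic
           .load())

    # Kafka values arrive as bytes; cast to string and unpack the JSON fields.
    articles = (raw.selectExpr("CAST(value AS STRING) AS json")
                .select(from_json(col("json"), schema).alias("a"))
                .select("a.*"))

    # The real pipeline handed this frame to HBase; echo to the console here.
    query = articles.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()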

Confidential

HADOOP DATA ENGINEER

Responsibilities:

  • Involved in the creation of Hive tables, loading data from different sources and performing Hive queries.
  • Implemented data queries and transformations on Hive/SQL tables by using the Spark SQL and Spark DataFrames APIs
  • Worked on importing and exporting data between HDFS and relational databases
  • Created a pipeline to gather new music releases of a country for a given week using PySpark, Kafka and Hive
  • Sent requests to the Confidential REST-based API from a Python script via a Kafka producer
  • Collected log data from various sources and integrated it into HDFS using Flume, staging the data in HDFS for further analysis
  • Received the JSON response in a Python Kafka consumer and formatted it into a data frame, using a schema containing country code, artist name, number of plays, and genre to parse the JSON
  • Established a connection between Hive and Spark for the transfer of the newly populated data frame
  • Extracted metadata from Hive tables with HiveQL
  • Utilized a cluster of three Kafka brokers to handle replication needs and allow for fault tolerance
  • Stored the data pulled from the API into Apache Hive on Hortonworks Sandbox
  • Utilized HiveQL to query the data to discover music release trends from week to week
  • Assisted in the installation and configuration of Hive, Sqoop, Flume, and Oozie on the Hadoop cluster with the latest patches
  • Loaded ingested data into Hive managed and external tables (a brief sketch of this load-and-query step follows this list)
  • Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)
  • Performed upgrades, patches and bug fixes in Hadoop in a cluster environment
  • Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access the data through Hive based views
  • Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language
  • Built Hive views on top of the source data tables and built a secured provisioning layer for user access
  • Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster
  • Wrote shell scripts for automating the process of data loading
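
A minimal sketch of the Hive load-and-query step referenced above, written in PySpark with Hive support. The database, table, and column names are illustrative assumptions, not the actual project objects.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("music-releases")
             .enableHiveSupport()   # connect to the Hive metastore
             .getOrCreate())

    # Stand-in for the data frame built from the Kafka consumer's JSON response.
    releases_df = spark.createDataFrame(
        [("US", "Some Artist", 1200, "pop", "2019-W32")],
        ["country_code", "artist_name", "plays", "genre", "release_week"],
    )

    # Load into a Hive managed table, partitioned by release week.
    spark.sql("CREATE DATABASE IF NOT EXISTS music")
    (releases_df.write
     .mode("append")
     .partitionBy("release_week")
     .saveAsTable("music.releases"))

    # Week-over-week release trends expressed through Spark SQL / HiveQL.
    spark.sql("""
        SELECT release_week, genre,
               COUNT(*)   AS new_releases,
               SUM(plays) AS total_plays
        FROM music.releases
        GROUP BY release_week, genre
        ORDER BY release_week, total_plays DESC
    """).show()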

Confidential

AWS CLOUD DATA ENGINEER

Responsibilities:

  • Created and managed cloud VMs with the AWS EC2 command-line clients and the AWS management console.
  • Used Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
  • Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
  • Used Ansible Python scripts to generate inventory and push deployments to AWS instances.
  • Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets.
  • Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift for storage.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (a minimal sketch follows this list).
  • Populated database tables via AWS Kinesis Firehose into AWS Redshift.
  • Automated installation of the ELK agent (Filebeat) with an Ansible playbook. Developed a Kafka queue system to collect log data without data loss and publish it to various sources.
  • Used AWS CloudFormation templates alongside Terraform with existing plugins.
  • Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline
  • Implemented AWS IAM user roles and policies to authenticate and control access
  • Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS
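
A minimal sketch of an S3-triggered AWS Lambda handler of the kind referenced above, writing an audit record to DynamoDB with boto3. The table name and item attributes are hypothetical assumptions for illustration only.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ingest-audit")   # hypothetical audit table


    def handler(event, context):
        """Record each newly created S3 object in DynamoDB."""
        records = event.get("Records", [])
        for record in records:
            s3 = record["s3"]
            table.put_item(Item={
                "object_key": s3["object"]["key"],      # assumed partition key
                "bucket": s3["bucket"]["name"],
                "size_bytes": s3["object"].get("size", 0),
            })
        return {"processed": len(records)}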

Confidential

BIG DATA ENGINEER

Responsibilities:

  • Wrote shell scripts to automate data ingestion tasks.
  • Used Cron jobs to schedule the execution of data processing scripts.
  • Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift
  • Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3)
  • Worked with Spark using Scala and Spark SQL for faster processing of the data; built Spark Streaming pipelines with ingestion tools such as Kafka and Flume
  • Automated AWS components like EC2 instances, Security groups, ELB, RDS, Lambda and IAM through AWS Cloud Formation templates
  • Implemented security measures AWS provides, employing Confidential concepts of AWS Identity and Access Management (IAM)
  • Migrated complex MapReduce scripts to Apache Spark RDD code
  • Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data from HDFS
  • Developed data transformation pipelines using Spark RDDs and Spark SQL (a minimal sketch follows this list)
  • Created multiple batch Spark jobs using Java
  • Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications
  • Developed metrics, attributes, filters, reports, dashboards and created advanced chart types, visualizations and complex calculations to manipulate the data.
  • Implemented a Hadoop cluster and different processing tools including Spark, MapReduce
  • Pushed containers into AWS ECS
  • Used Scala to connect to EC2 and push files to AWS S3
  • Installed, configured, and managed monitoring tools such as ELK and AWS CloudWatch for resource monitoring
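
A minimal sketch of an RDD-plus-Spark-SQL transformation pipeline like the ones referenced above, in PySpark. The HDFS paths and record layout are illustrative assumptions rather than the actual project data.

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("etl-pipeline").getOrCreate()
    sc = spark.sparkContext

    # Raw, unstructured lines from HDFS handled with the RDD API.
    lines = sc.textFile("hdfs:///data/raw/events/*.log")   # hypothetical path
    rows = (lines
            .map(lambda line: line.split("\t"))
            .filter(lambda parts: len(parts) == 3)
            .map(lambda p: Row(user=p[0], event=p[1], value=int(p[2]))))

    # Promote the RDD to a DataFrame and finish the transformation in Spark SQL.
    events = spark.createDataFrame(rows)
    events.createOrReplaceTempView("events")
    summary = spark.sql("""
        SELECT event, COUNT(*) AS occurrences, SUM(value) AS total_value
        FROM events
        GROUP BY event
    """)

    # The production jobs landed results in S3/Redshift; write Parquet here.
    summary.write.mode("overwrite").parquet("hdfs:///data/curated/event_summary")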
