Hadoop Data Engineer Resume
SUMMARY
- Skilled in managing data analytics, data processing, database, and data-driven projects
- Skilled in architecting big data systems, ETL pipelines, and analytics systems for diverse end users
- Skilled in database systems and administration
- Proficient in writing technical reports and documentation
- Adept with various distributions and platforms such as Cloudera Hadoop, Hortonworks, Elastic Cloud, and Elasticsearch
- Expert in bucketing and partitioning
- Expert in Performance Optimization
TECHNICAL SKILLS
- Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS
- Hortonworks, MapR, MapReduce
- HiveQL, MapReduce, XML, FTP, Python, UNIX, shell scripting, Linux
- Unix/Linux, Windows 10, Ubuntu, Apple OS
- Parquet, Avro, JSON, ORC, text, CSV
- Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic/ELK
- Apache Spark, Spark Streaming, Apache Flink, Apache Storm
- Pentaho, QlikView, Tableau, PowerBI, Matplotlib, Plotly, Dash
- Databases & data stores: Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
- Microsoft Project, Primavera P6, VMware, Microsoft Word, Excel, Outlook, PowerPoint; technical documentation skills
PROFESSIONAL EXPERIENCE
Confidential
HADOOP DATA ENGINEER
Responsibilities:
- Configured Linux across multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration
- Created a pipeline to gather data using PySpark, Kafka, and HBase (see the sketch after this list)
- Sent requests to the source REST-based API from a Scala script via a Kafka producer
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance
- Hands-on experience with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs
- Used Spark jobs, Spark SQL, and the DataFrame API to load structured data into Spark clusters
- Configured a Kafka broker and applied the defined schema to consume structured data via Spark Structured Streaming
- Defined Spark data schemas and set up the development environment inside the cluster
- Managed spark-submit jobs across all environments
- Interacted with data residing in HDFS using PySpark to process the data
- Decoded raw data and loaded it into JSON before sending the batched streaming file over the Kafka producer
- Received the JSON response in a Kafka consumer written in Python
- Parsed the JSON and formatted the response into a DataFrame using a schema containing News Type, Article Type, Word Count, and News Snippet
- Established a connection between HBase and Spark to transfer the newly populated DataFrame
- Designed a Spark Scala job to consume information from S3 buckets
- Monitored background operations in Hortonworks Ambari
- Monitored HDFS job status and DataNode health according to the specs
- Managed Zookeeper configurations and ZNodes to ensure High Availability on the Hadoop Cluster
- Managed Hive connections with databases, managed tables, and external tables
- Set up ELK collections across all environments and configured shard replication
- Created standardized documents for company-wide usage
- Worked one-on-one with clients to resolve issues regarding Spark job submissions
- Worked on AWS Kinesis for processing huge amounts of real-time data
- Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS
- Implemented Hortonworks medium- and low-priority recommendations across all environments
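
Below is a minimal sketch of the Kafka-to-Spark step in the pipeline described above: reading a topic with Spark Structured Streaming and applying the News Type / Article Type / Word Count / News Snippet schema. The topic name, broker addresses, field names, and the console sink are illustrative assumptions; the production sink was HBase.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("news-stream").getOrCreate()

# Schema used to parse the JSON payload (fields as described in the bullets above)
schema = StructType([
    StructField("news_type", StringType()),
    StructField("article_type", StringType()),
    StructField("word_count", IntegerType()),
    StructField("news_snippet", StringType()),
])

# Read the batched JSON messages published by the Kafka producer
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
       .option("subscribe", "news-articles")                            # hypothetical topic
       .load())

# Kafka delivers bytes; cast the value to a string and apply the schema
articles = (raw.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("data"))
            .select("data.*"))

# Placeholder sink for this sketch; in production each micro-batch went to HBase
query = articles.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```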
Confidential
HADOOP DATA ENGINEER
Responsibilities:
- Involved in the creation of Hive tables, loading data from different sources and performing Hive queries.
- Implemented data queries and transformations on Hive/SQL tables by using the Spark SQL and Spark DataFrames APIs
- Worked on importing and exporting data between HDFS and relational databases
- Created a pipeline to gather new music releases of a country for a given week using PySpark, Kafka, and Hive (see the sketch after this list)
- Sent requests to the Confidential REST-based API from a Python script via a Kafka producer
- Collected log data from various sources and integrated it into HDFS using Flume, staging the data in HDFS for further analysis
- Received the JSON response in a Python Kafka consumer and parsed it into a DataFrame using a schema containing country code, artist name, number of plays, and genre
- Established a connection between Hive and Spark to transfer the newly populated DataFrame
- Extracted metadata from Hive tables with HiveQL
- Utilized a cluster of three Kafka brokers to handle replication needs and allow for fault tolerance
- Stored the data pulled from the API into Apache Hive on Hortonworks Sandbox
- Utilized HiveQL to query the data to discover music release trends from week to week
- Assisted in the installation and configuration of Hive, Sqoop, Flume, and Oozie on the Hadoop cluster with the latest patches
- Loaded ingested data into Hive managed and external tables
- Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)
- Performed upgrades, patches and bug fixes in Hadoop in a cluster environment
- Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access the data through Hive based views
- Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language
- Built Hive views on top of the source data tables and set up secured provisioning
- Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster
- Wrote shell scripts for automating the process of data loading
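
A minimal sketch of the consumer-to-Hive flow described above: a Python Kafka consumer collects the JSON responses, they are formatted into a Spark DataFrame with the country code / artist name / number of plays / genre schema, and the result is stored in Hive. The topic, broker addresses, and table name are illustrative assumptions.

```python
import json
from kafka import KafkaConsumer                      # kafka-python client
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (SparkSession.builder
         .appName("music-releases")
         .enableHiveSupport()                        # allow writing to Hive tables
         .getOrCreate())

schema = StructType([
    StructField("country_code", StringType()),
    StructField("artist_name", StringType()),
    StructField("number_of_plays", IntegerType()),
    StructField("genre", StringType()),
])

consumer = KafkaConsumer(
    "new-releases",                                  # hypothetical topic name
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Collect one weekly batch of responses and format it into a DataFrame
rows = [(m.value["country_code"], m.value["artist_name"],
         m.value["number_of_plays"], m.value["genre"]) for m in consumer]
df = spark.createDataFrame(rows, schema=schema)

# Persist into a Hive table so weekly release trends can be queried in HiveQL
df.write.mode("append").saveAsTable("music.new_releases")
```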
Confidential
AWS Cloud DATA ENGINEER
Responsibilities:
- Created and managed cloud VMs with the AWS EC2 command-line clients and the AWS Management Console
- Used Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
- Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud
- Used an Ansible Python script to generate inventory and push deployments to AWS instances
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with storage in Amazon Simple Storage Service (S3) and AWS Redshift
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (see the sketch after this list)
- Populated database tables via AWS Kinesis Firehose and AWS Redshift
- Automated the installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without data loss and publish it to various destinations
- Used AWS CloudFormation templates alongside Terraform with existing plugins
- Developed AWS CloudFormation templates to create custom infrastructure for our pipeline
- Implemented AWS IAM user roles and policies to authenticate and control access
- Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS
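
A minimal sketch of an S3-triggered Lambda function of the kind described above, written in Python with boto3. The DynamoDB table name and the audit-record fields are illustrative assumptions; the actual triggers and downstream actions were project-specific.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingest_audit")               # hypothetical audit table


def lambda_handler(event, context):
    """Record each newly created S3 object in a DynamoDB audit table."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
    return {"statusCode": 200, "body": json.dumps(f"processed {len(records)} records")}
```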
Confidential
BIG DATA ENGINEER
Responsibilities:
- Wrote shell scripts to automate data ingestion tasks.
- Used Cron jobs to schedule the execution of data processing scripts.
- Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift
- Used AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored in Amazon Simple Storage Service (S3)
- Worked with Spark using Scala and Spark SQL for faster data processing; built Spark Streaming pipelines with ingestion tools like Kafka and Flume
- Automated AWS components like EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates
- Implemented AWS-provided security measures, employing Confidential concepts of AWS Identity and Access Management (IAM)
- Migrated complex MapReduce scripts to Apache Spark RDD code (see the sketch after this list)
- Designed and developed ETL workflows using Scala and Python for processing structured and unstructured data from the HDFS.
- Developed data transformation pipelines using Spark RDDs and Spark SQL.
- Created multiple batch Spark jobs using Java
- Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications
- Developed metrics, attributes, filters, reports, dashboards and created advanced chart types, visualizations and complex calculations to manipulate the data.
- Implemented a Hadoop cluster and different processing tools including Spark, MapReduce
- Pushed containers into AWS ECS
- Used Scala to connect to EC2 and push files to AWS S3
- Installed, configured, and managed tools such as ELK and AWS CloudWatch for resource monitoring
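
A minimal sketch of a MapReduce-to-Spark-RDD migration of the kind described above, using a word count as the stand-in job; the input and output HDFS paths are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-rdd").getOrCreate()
sc = spark.sparkContext

# The classic map and reduce phases expressed as RDD transformations
counts = (sc.textFile("hdfs:///data/logs/*.txt")        # hypothetical input path
          .flatMap(lambda line: line.split())           # "map": emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))             # "reduce": sum counts per word

counts.saveAsTextFile("hdfs:///data/output/wordcount")  # hypothetical output path
spark.stop()
```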