Data Engineer Resume
Bentonville, AR
SUMMARY
- Around 8 years of IT experience, including 5+ years as a Hadoop/Spark developer working with Big Data technologies across the Hadoop and Spark ecosystems, and 2 years of application development using J2EE.
- Experience working with Hadoop data access components such as MapReduce, Pig, Hive, HBase, Spark and Kafka.
- Experience handling Hive queries using Spark SQL integrated with the Spark environment.
- Good knowledge of Hadoop data management components like HDFS and YARN.
- Hands-on experience using Hadoop workflow components like Sqoop, Flume and Kafka.
- Worked on Hadoop data operation components like ZooKeeper and Oozie.
- Working knowledge of AWS technologies like S3 and EMR for storage, big data processing and analysis.
- Good understanding of Hadoop security components like Ranger and Knox.
- Good experience working with Hadoop distributions such as Hortonworks and Cloudera.
- Excellent programming skills at a higher level of abstraction using Scala and Java.
- Experience in Java programming with skills in analysis, design, testing and deployment using technologies such as J2EE, JavaScript, JSP, JDBC, HTML, XML and JUnit.
- Good knowledge of Apache Spark components including Spark Core, Spark SQL, Spark Streaming and Spark MLlib.
- Experience in performing transformations and actions on Spark RDDs using Spark Core.
- Experience in using broadcast variables, accumulator variables and RDD caching in Spark (see the Spark sketch at the end of this summary).
- Experience in troubleshooting cluster jobs using the Spark UI.
- Experience working with Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP).
- Expert in the Hadoop and Big Data ecosystem, including Hive, HDFS, Spark, Kafka, MapReduce, Sqoop, Oozie and ZooKeeper.
- Good knowledge of Hadoop cluster architecture and cluster monitoring.
- Hands-on experience with distributed systems technologies, infrastructure administration, and monitoring configuration.
- Expertise in data transformation and analysis using Spark and Hive.
- Knowledge of writing Hive queries to generate reports using Hive Query Language (HQL).
- Hands-on experience with Spark SQL for complex data transformations using the Scala programming language.
- Developed Spark code using Python/Scala and Spark SQL for faster testing and processing of data.
- Good knowledge of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
- Extensive experience in data ingestion technologies like Flume, Kafka, Sqoop and NiFi.
- Used Flume, Kafka and NiFi to bring real-time and near real-time streaming data into HDFS from different data sources.
- Skilled in analyzing data using HiveQL and custom MapReduce programs in Java.
- Good knowledge of working with the AWS (Amazon Web Services) cloud platform.
- Good knowledge of Unix shell commands.
- Experience analyzing log files for Hadoop and ecosystem services to find root causes, and setting up and managing the batch scheduler in Oozie.
- Thorough knowledge of release management, CI/CD processes using Jenkins, and configuration management using Visual Studio Online.
- Experience extracting data from RDBMS into HDFS using Sqoop ingestion and collecting logs from log collectors into HDFS using Flume.
- Used Project Management services like JIRA for handling service requests and tracking issues.
- Good experience with Software methodologies like Agile and Waterfall.
- Experienced working with ZooKeeper to provide coordination services to the cluster.
- Skilled in Tableau 9 for data visualization, reporting and analysis.
- Involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation, using Agile and Waterfall methodologies.
- Good team player with ability to solve problems, organize and prioritize multiple tasks.
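Below is a minimal sketch of the broadcast variable, accumulator and RDD caching techniques referenced in this summary; the lookup map, HDFS paths and column positions are illustrative assumptions rather than project code.

import org.apache.spark.sql.SparkSession

object BroadcastAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-accumulator-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Small reference data shipped once per executor via a broadcast variable (illustrative map)
    val countryLookup = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    // Accumulator to count records with an unknown country code
    val unknownCodes = sc.longAccumulator("unknownCountryCodes")

    val events = sc.textFile("hdfs:///data/events/*.csv")   // illustrative path
    val enriched = events
      .map(_.split(","))
      .map { cols =>
        val code = cols(1)
        val name = countryLookup.value.getOrElse(code, {
          unknownCodes.add(1)          // counted on executors when the code is missing
          "UNKNOWN"
        })
        (cols(0), name)
      }
      .cache()                         // reuse the RDD across the two actions below

    println(s"records: ${enriched.count()}, unknown codes: ${unknownCodes.value}")
    enriched.saveAsTextFile("hdfs:///data/events_enriched")

    spark.stop()
  }
}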
TECHNICAL SKILLS
Data Access Tools: HDFS, YARN, Hive, Pig, HBase, Solr, Impala, Spark Core, Spark SQL, Spark Streaming
Data Management: HDFS, YARN
Data Workflow: Sqoop, Flume, Kafka
Data Operation: Zookeeper, Oozie
Data Security: Ranger, Knox
Big Data Distributions: Hortonworks, Cloudera
Cloud Technologies: AWS (Amazon Web Services) EC2, S3, IAM, CloudWatch, DynamoDB, SNS, SQS, EMR, Kinesis
Programming & Languages: Java, Scala, Pig Latin, HQL, SQL, Shell Scripting, HTML, CSS, JavaScript
IDE/Build Tools: Eclipse, IntelliJ
Java/J2EE Technologies: XML, JUnit, JDBC, AJAX, JSON, JSP
Operating Systems: Linux, Windows, Kali Linux
SDLC: Agile/SCRUM, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Bentonville, AR
DATA ENGINEER
Responsibilities:
- Created atomic workflows as part of an automation effort.
- Worked on YAML files driven by Hive properties; created YAML files using shell commands and shell scripts.
- Worked with Unix and Linux commands.
- Imported and exported Hive properties to generate Hive raw and stage tables; created partitioned tables and applied bucketing (see the sketch after this list).
- Performed data analysis using SQL against source tables (Oracle) as well as target tables (Hive).
- Created data models to meet business requirements; optimized and analyzed the data.
- Applied business logic to assigned tasks; worked with the Tez and MapReduce execution engines and performed data validation.
- Worked on data modeling and data warehousing.
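A minimal sketch of creating a partitioned Hive stage table and loading it with dynamic partitioning, as referenced above; the database, table and column names are illustrative assumptions.

import org.apache.spark.sql.SparkSession

object HivePartitionLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partition-load-sketch")
      .enableHiveSupport()          // requires a Hive metastore configured on the cluster
      .getOrCreate()

    // Stage table partitioned by load date (names are illustrative)
    spark.sql(
      """CREATE TABLE IF NOT EXISTS stage.orders (
        |  order_id BIGINT,
        |  customer_id BIGINT,
        |  amount DECIMAL(10,2)
        |) PARTITIONED BY (load_dt STRING)
        |STORED AS ORC""".stripMargin)

    // Allow dynamic partitioning so the partition value comes from the data itself
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Load from an (illustrative) raw table into the partitioned stage table
    spark.sql(
      """INSERT OVERWRITE TABLE stage.orders PARTITION (load_dt)
        |SELECT order_id, customer_id, amount, load_dt
        |FROM raw.orders""".stripMargin)

    spark.stop()
  }
}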
Environment: Hadoop 2.x, Spark Core, Spark SQL, Spark Streaming, Scala, PySpark, Hive, Pig, Kafka, Oozie, Amazon EMR, Tableau, Impala, RDBMS, HDFS, YARN, JIRA, MapReduce.
Confidential, Durham, NC
Spark/Hadoop Developer
Responsibilities:
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames and the Spark SQL APIs (illustrated in a sketch after this list).
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze customer behavioral data.
- Worked on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
- Developed real-time data processing applications in Scala and Python, implementing Apache Spark Streaming against streaming sources such as Kafka (illustrated in a sketch after this list).
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Implemented Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
- Experienced in developing Spark scripts for data analysis in Scala.
- Used Spark-Streaming APIs to perform necessary transformations.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
- Worked with Spark to consume data from Kafka and convert it to a common format using Scala.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
- Wrote new Spark jobs in Scala to analyze customer and sales history data.
- Involved in requirement analysis, design, coding and implementation phases of the project.
- Used Spark API over Hadoop YARN to perform analytics on data in Hive.
- Worked with both SQLContext and SparkSession.
- Developed Scala based Spark applications for performing data cleansing, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
- Worked on troubleshooting Spark applications to make them more error tolerant.
- Involved in HDFS maintenance and loading of structured and unstructured data; imported data from mainframe datasets into HDFS using Sqoop and wrote PySpark scripts to process the HDFS data.
- Extensively worked on the core and Spark SQL modules of Spark.
- Involved in Spark and Spark Streaming work, creating RDDs and applying transformations and actions.
- Created partitioned tables and loaded data using both static and dynamic partitioning.
- Implemented POCs on migrating to Spark Streaming to process live data.
- Executed Hive queries on Parquet tables to perform data analysis and meet business requirements.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to HDFS as per the business requirements.
- Used Impala to read, write and query the data in HDFS.
- Stored the output files for export on HDFS, where they were later picked up by downstream systems.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output.
- Performed Hadoop administration tasks: deploying and maintaining Hadoop clusters, adding and removing nodes using cluster monitoring tools such as Ganglia, Nagios and Cloudera Manager, configuring NameNode high availability, and keeping track of all running Hadoop jobs.
- Implemented, managed and administered the overall Hadoop infrastructure.
- Took care of the day-to-day running of Hadoop clusters.
- Worked closely with the database, network, BI and application teams to ensure that all big data applications were highly available and performing as expected.
- Manually set up the configuration files (core-site, hdfs-site, yarn-site and mapred-site) when working with the open-source Apache distribution; with distributions such as Hortonworks, Cloudera or MapR, these files are generated during setup and need not be configured manually.
- Performed capacity planning and estimated the requirements for lowering or increasing the capacity of the Hadoop cluster.
- Decided the size of the Hadoop cluster based on the data to be stored in HDFS.
- Ensured that the Hadoop cluster was up and running at all times.
- Monitored cluster connectivity and performance.
- Managed and reviewed Hadoop log files.
- Performed backup and recovery tasks.
- Handled resource and security management.
- Troubleshot application errors and ensured they did not recur.
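A minimal sketch of reworking a MapReduce-style aggregation as Spark RDD transformations and actions, in the spirit of the conversion bullets above; the input path and record layout are illustrative assumptions.

import org.apache.spark.sql.SparkSession

object MapReduceToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mr-to-spark-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Equivalent of a mapper emitting (customerId, amount) and a reducer summing per key
    val totalsByCustomer = sc.textFile("hdfs:///data/sales/*.csv")   // illustrative path
      .map(_.split(","))
      .map(cols => (cols(0), cols(2).toDouble))   // transformation: (customerId, amount)
      .reduceByKey(_ + _)                          // shuffle + combine, like the reduce phase
      .cache()                                     // reused by the two actions below

    // Actions trigger the actual computation
    totalsByCustomer.saveAsTextFile("hdfs:///data/sales_totals")
    println(s"customers: ${totalsByCustomer.count()}")

    spark.stop()
  }
}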
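A minimal sketch of consuming Kafka events with Spark and landing them on HDFS, along the lines of the streaming bullets above; the broker, topic and paths are illustrative, and the original work may have used the DStream-based Spark Streaming API rather than Structured Streaming.

import org.apache.spark.sql.SparkSession

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    // Requires the spark-sql-kafka connector package on the classpath
    val spark = SparkSession.builder().appName("kafka-to-hdfs-sketch").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // illustrative broker
      .option("subscribe", "customer-events")              // illustrative topic
      .load()

    // Kafka delivers key/value as binary; cast the value to a string payload
    val events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/customer_events")              // illustrative path
      .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
      .start()

    query.awaitTermination()
  }
}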
Environment: Hadoop 2.x, Spark Core, Spark SQL, Spark Streaming, Scala, PySpark, Hive, Pig, Kafka, Oozie, Amazon EMR, Tableau, Impala, RDBMS, HDFS, YARN, JIRA, MapReduce.
Confidential, New York
Spark/Hadoop Developer
Responsibilities:
- Responsible for collecting, cleaning, and storing data for analysis using Kafka, Sqoop, Spark and HDFS.
- Used Kafka and Spark framework for real time and batch data processing
- Ingested large amount of data from different data sources into HDFS using Kafka
- Implemented Spark using Scala and performed cleansing of data by applying Transformations and Actions
- Used Scala case classes to convert RDDs into DataFrames in Spark (see the sketch after this list).
- Processed and analyzed data stored in HBase and HDFS.
- Developed Spark jobs using Scala on top of YARN for interactive and batch analysis.
- Developed UNIX shell scripts to load large number of files into HDFS from Linux File System.
- Experience in querying data using Spark SQL for faster processing of the data sets.
- Offloaded data from EDW into Hadoop Cluster using Sqoop.
- Developed Sqoop scripts for importing and exporting data into HDFS and Hive
- Created Hive internal and external tables with partitioning and bucketing for further analysis using Hive.
- Used Oozie workflow to automate and schedule jobs
- Used Zookeeper for maintaining and monitoring clusters
- Exported the data into RDBMS using Sqoop for BI team to perform visualization and to generate reports
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager
- Used JIRA for project tracking and participated in daily scrum meetings
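A minimal sketch of the case-class-to-DataFrame conversion mentioned above; the Customer fields and the input path are illustrative assumptions.

import org.apache.spark.sql.SparkSession

// Case class gives the RDD a schema that Spark SQL can infer
case class Customer(id: Long, name: String, state: String)

object RddToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-dataframe-sketch").getOrCreate()
    import spark.implicits._

    val customerRdd = spark.sparkContext
      .textFile("hdfs:///data/customers.csv")      // illustrative path
      .map(_.split(","))
      .map(cols => Customer(cols(0).toLong, cols(1), cols(2)))

    // toDF() relies on the case class and spark.implicits._ for schema inference
    val customers = customerRdd.toDF()
    customers.createOrReplaceTempView("customers")

    spark.sql("SELECT state, COUNT(*) AS cnt FROM customers GROUP BY state").show()

    spark.stop()
  }
}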
Environment: Spark, Sqoop, Scala, Hive, Kafka, YARN, Teradata, RDBMS, HDFS, Oozie, Zookeeper, HBase, Tableau, Hadoop (Cloudera), JIRA
Confidential, New York
Hadoop Developer
Responsibilities:
- Actively participated in interaction with users to fully understand the requirements of the system
- Experience with the Hadoop ecosystem and NoSQL databases.
- Migrated the required data from Oracle and MySQL into HDFS using Sqoop, and imported flat files in various formats into HDFS.
- Imported data from RDBMS (MySQL, Teradata) to HDFS and vice versa using Sqoop (Big Data ETL tool) for Business Intelligence, visualization and report generation
- Worked with Kafka to bring near real-time data onto the big data cluster and the required data into Spark for analysis.
- Used Spark Streaming to receive near real-time data from Kafka and stored the streamed data, using Scala, in HDFS and NoSQL databases such as Cassandra.
- Involved in Analyzing data by writing queries using HiveQL for faster data processing
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning and buckets (see the sketch after this list).
- Optimized Hive queries to improve performance and reduce query execution time.
- Involved in writing Flume and Hive scripts to extract, transform and load the data into Database
- Created tables in DataStax Cassandra and loaded large sets of data for processing
- Worked on Oozie workflows, coordinators to run multiple Hive jobs
- Used Git for version control, JIRA for project tracking and Jenkins for continuous integration
- Utilized Agile and Scrum methodology to help manage and organize a team of developers with regular code review sessions.
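A minimal sketch of creating an external, partitioned and bucketed Hive table against a shared metastore, as described above; the database, columns and HDFS location are illustrative assumptions.

import org.apache.spark.sql.SparkSession

object ExternalHiveTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-hive-table-sketch")
      .enableHiveSupport()   // points at the shared Hive metastore configured on the cluster
      .getOrCreate()

    // External table: Hive manages metadata only; the data stays at the HDFS location
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
        |  user_id BIGINT,
        |  url STRING,
        |  status INT
        |)
        |PARTITIONED BY (event_date STRING)
        |CLUSTERED BY (user_id) INTO 32 BUCKETS
        |STORED AS ORC
        |LOCATION 'hdfs:///data/web_logs'""".stripMargin)

    // Register a partition whose data already exists on HDFS
    spark.sql("ALTER TABLE analytics.web_logs ADD IF NOT EXISTS PARTITION (event_date = '2017-01-01')")

    spark.stop()
  }
}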
Environment: HDFS, Kafka, Sqoop, Scala, Java, Hive, Oozie, NoSQL, Oracle, MySQL, Git, ZooKeeper, DataStax Cassandra, JIRA, Hortonworks Data Platform, Jenkins, Agile (Scrum).