Data Engineer (Spark and Scala)/Hadoop Admin Resume
Virginia Beach, VA
SUMMARY:
- Snapshot: 6+ years of overall IT experience, including 3+ years in Big Data as a Hadoop Developer/Data Engineer implementing end-to-end Hadoop solutions and 3+ years in development, maintenance, and support of applications using Java/J2EE technologies.
- Experience across business domains such as Finance, Insurance, and Banking.
- Experience with Hadoop clusters on major distributions including Cloudera and Hortonworks, as well as clusters hosted on AWS servers.
- Expertise in Hadoop architecture and ecosystem components: MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Flume, Kafka, YARN, Oozie, ZooKeeper, HCatalog, Spark, Shark, Spark SQL, Spark Streaming, and Hadoop Streaming for scalability, distributed computing, and high-performance computing.
- Experienced in creating MapReduce jobs in Java as per the business requirements.
- Migrated an ETL project to Hadoop with zero defects; it now runs in production.
- Strong experience in data analytics using Spark Streaming, Storm, Hive, Pig Latin, and HBase.
- Experience in writing custom UDFs in Java for Hive and Pig to extend their functionality.
- Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH ... GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, and SPLIT to extract data from source files and load it into HDFS.
- Experience in writing MapReduce programs in java for data cleansing and preprocessing.
- Experience in working with Flume to load log data from multiple sources directly into HDFS.
- Experience in working with Message Broker services like Kafka and Amazon SQS.
- Worked with different file formats such as flat files, SequenceFile, Avro, and Parquet.
- Experience compressing data with algorithms such as gzip and bzip2.
- Built real-time Big Data solutions using HBase, handling billions of records.
- Experience in designing both time driven and data driven automated workflows using Oozie.
- Experience working with Apache SOLR.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
- Exposure to Cloudera development environment and management using Cloudera Manager.
- Experience in AWS Hadoop distribution.
- Experience in Core Java and multithreaded processing.
- Extensive knowledge in using SQL Queries for backend database analysis.
- Good experience with Linux/Unix shell scripting.
- Experience in UML (Class Diagrams, Sequence Diagrams, Use case Diagrams).
- Hands on experience in application development using Core JAVA, Scala, RDBMS and Linux shell scripting.
- Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Document (FSD).
- Excellent working experience with Scrum/Agile, Iterative, and Waterfall project execution methodologies.
TECHNICAL SKILLS:
Hadoop/Big Data: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Flume, Kafka, Zookeeper, Spark
IDE Tools: Eclipse, IBM WebSphere, NetBeans
Programming languages: C, C++, Java, J2EE, Scala, Pig Latin, Hive QL, UML, Linux shell scripts
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBase, MongoDB
Web Technologies: JavaScript, JSP, Servlets, JDBC, Unix/Linux Shell Scripting, Python, HTML, XML
Version control: Git, GitHub
Design Technologies: UML
Development Approach: Agile/Scrum, Waterfall, Iterative, Spiral
Operating Systems: Microsoft Windows (all versions), UNIX, and Linux
Protocols: TCP/IP, HTTP, HTTPS, TELNET, FTP
Cloud: AWS, Azure and OpenStack
PROFESSIONAL EXPERIENCE:
Confidential, Virginia Beach, VA
Data Engineer (Spark and Scala)/Hadoop Admin
Environment: Scala, Apache Spark, AWS, Spark MLlib, Spark SQL, PostgreSQL, Hive, MongoDB, Apache Storm, Kafka, Git, Jira
Responsibilities:
- Developed Spark SQL scripts for data ingestion from Oracle into Spark clusters and for the relevant data joins.
- Experience building distributed, high-performance systems using Spark and Scala.
- Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.
- Designed distributed algorithms for identifying trends in data and processing them effectively.
- Used Spark and Scala to develop machine learning algorithms that analyze clickstream data.
- Experience developing machine learning code using Spark MLlib.
- Used Spark SQL for data pre-processing, cleaning, and joining very large data sets.
- Experience creating a data lake with Spark that serves downstream applications.
- Designed and developed Scala workflows to pull data from cloud-based systems and apply transformations to it.
- Installed and configured a multi-node, fully distributed Hadoop cluster.
- Involved in Hadoop cluster administration, including commissioning and decommissioning nodes, cluster capacity planning, balancing, performance tuning, monitoring, and troubleshooting.
- Configured the Fair Scheduler to provide service-level agreements for multiple users of a cluster.
- Implemented NameNode HA to make the Hadoop services highly available.
- Developed a cron job to store NameNode metadata on an NFS-mounted directory.
- Worked on installing Hadoop Ecosystem components such as Sqoop, Pig, Hive, Oozie, and HCatalog.
- Involved in HDFS maintenance and administered it through the Hadoop Java API.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Proficient in writing Flume and Hive scripts to extract, transform, and load data into the database.
- Responsible for maintaining, managing, and upgrading Hadoop cluster connectivity and security.
- Worked on machine learning algorithm development for analyzing clickstream data using Spark and Scala.
- Migrated databases from traditional data warehouses to Spark clusters.
- Created data workflows and pipelines for the data transition and for trend analysis using Spark MLlib.
- Set up the entire project on the Amazon Web Services cloud, with all algorithms tuned for best performance.
- Analyzed streaming data and identified important trends for further analysis using Spark Streaming and Storm.
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Kafka and stored the data into HDFS for analysis.
- Experience configuring spouts and bolts in various Apache Storm topologies and validating data in the bolts.
- Used Spark Streaming to collect data from Kafka in near real time and performed the necessary transformations and aggregations on the fly to build the common learner data model, persisting the data in a NoSQL store (a minimal sketch follows this list).
- Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
- Worked on the AWS cloud to create EC2 instances and installed Java, ZooKeeper, and Kafka on those instances.
- Worked with S3 buckets on the AWS cloud to store CloudFormation templates.
- Batch-loaded data into NoSQL storage such as MongoDB.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
- Used Git to check in and check out code changes.
- Used Jira for bug tracking.
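For illustration only, a minimal Structured Streaming sketch of the Kafka-to-Spark flow described in this role; the broker address, topic name, and click-stream schema are hypothetical, and a console sink stands in for the actual NoSQL store:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, window}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object ClickStreamIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ClickStreamIngest").getOrCreate()

    // Hypothetical click-stream event layout
    val clickSchema = StructType(Seq(
      StructField("userId", StringType),
      StructField("page", StringType),
      StructField("eventTime", TimestampType)))

    // Subscribe to the Kafka topic carrying raw click-stream events
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
      .option("subscribe", "clickstream")                // hypothetical topic
      .load()

    // Parse the JSON payload into typed columns
    val events = raw
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), clickSchema).as("e"))
      .select("e.*")

    // Aggregate page views per user in 5-minute windows to build the learner data model
    val model = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
      .count()

    // In the real pipeline this sink would be the NoSQL store (e.g. MongoDB or Cassandra)
    model.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Running a job like this would also require the spark-sql-kafka connector on the classpath; the exact sink and schema depend on the target NoSQL store.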
Hadoop Developer/Hadoop Admin
Environment: Cloudera Hadoop Framework, MapReduce, Hive, Pig, HBase, Business Objects, Platfora, HParser, Java, Python, UNIX Shell Scripting
Responsibilities:
- Wrote Apache Pig scripts to process HDFS data.
- Created Hive tables to store the processed results in a tabular format.
- Developed Sqoop scripts to handle the interaction between Pig and the MySQL database.
- Involved in requirements gathering, design, development, and testing.
- Wrote script files for processing data and loading it into HDFS.
- Stored and retrieved data in Hive using HiveQL.
- Developed UNIX shell scripts to create reports from Hive data.
- Data ingestion using Kafka, data pipeline architecture, data cleansing, ETL, processing, and some visualization experience; enabled CDH to consume data from the customer's enterprise tools (sources such as RabbitMQ, IBM MQ, and RDBMSs).
- Use-case development with Hive, Pig, Spark, and Spark Streaming; implemented MapReduce jobs to discover interesting patterns in the data.
- Installed and configured Hadoop cluster in Development, Testing and Production environments.
- Performed both major and minor upgrades to the existing CDH cluster.
- Responsible for monitoring and supporting Development activities.
- Responsible for administering and maintaining applications on a daily basis; prepared the System Design document covering all functional implementations.
- Installed various Hadoop ecosystem components and Hadoop daemons.
- Installed and configured Sqoop and Flume.
- Involved in data modeling sessions to develop models for Hive tables.
- Studied the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MapReduce, Hive, Sqoop, and Pig Latin.
- Developed Java MapReduce programs that use custom data types, input formats, record readers, etc.
- Involved in writing Flume and Hive scripts to extract, transform, and load data into the database.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting (a Spark SQL equivalent is sketched after this list).
- Converted ETL logic to Hadoop mappings.
- Extensive hands-on experience with HDFS commands for file-handling operations.
- Worked with SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning for Hive performance and storage improvements.
- Worked with Sqoop import and export to transfer large data sets between the DB2 database and HDFS.
- Used Sentry to control access to databases/data-sets.
- Worked on security of the Hadoop cluster and tuning the cluster to meet necessary performance standards.
- Configured backups and performed NameNode recoveries from previous backups.
- Experienced in managing and analyzing Hadoop log files.
- Provided documentation on the architecture, deployment, and all details the customer would require to run the CDH cluster, as part of the delivery documents.
- RDBMS: MySQL and PostgreSQL (experience supporting them as backends for the Hive Metastore, Cloudera Manager components, Oozie, etc.).
- Responsible for developing efficient MapReduce programs on the AWS cloud over more than 20 years' worth of claim data to detect and separate fraudulent claims.
- Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Provided subject matter expertise on Linux to support running CDH/Hadoop optimally on the underlying OS.
- Trained customers/partners when required.
- Analyzed customer requirements and identified how the Hadoop ecosystem could be leveraged to implement them, how CDH fits into the current infrastructure, where Hadoop can complement existing products, etc.
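For illustration, a minimal Spark SQL sketch of the kind of metric computation over a partitioned Hive warehouse table described above; the table, partition column, and field names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count}

object ClaimMetrics {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark read the partitioned/bucketed Hive warehouse tables directly
    val spark = SparkSession.builder()
      .appName("ClaimMetrics")
      .enableHiveSupport()
      .getOrCreate()

    // "claims" is a hypothetical Hive table partitioned by claim_year
    val claims = spark.table("claims")

    // Partition pruning on claim_year, then per-state aggregates for the report
    val metrics = claims
      .filter(col("claim_year") === 2016)
      .groupBy(col("state"))
      .agg(count("*").as("claim_count"), avg(col("claim_amount")).as("avg_claim_amount"))

    // Persist the report back into the warehouse (hypothetical target table)
    metrics.write.mode("overwrite").saveAsTable("reports.claim_metrics")
    spark.stop()
  }
}
```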
Hadoop Developer
Environment: Hadoop Framework, MapReduce, Hive, Sqoop, Pig, HBase, Flume, Oozie, Java, Python, UNIX Shell Scripting, Spark
Responsibilities:
- Worked with the business users to gather and define business requirements and analyze possible technical solutions.
- Developed job flows in Oozie to automate the workflow for Pig and Hive jobs.
- Designed and built the reporting application that uses Spark SQL to fetch HBase table data and generate reports (a minimal sketch follows this list).
- Worked with the Data Science team to gather requirements for various data mining projects
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Implemented helper classes that access HBase directly from Java using Java API.
- Integrated MapReduce with HBase to bulk-import data into HBase using MapReduce programs.
- Experienced in converting ETL operations to the Hadoop system using Pig Latin operations, transformations, and functions.
- Extracted the needed data from servers into HDFS and bulk-loaded the cleaned data into HBase.
- Handled time-series data in HBase, storing it and performing time-based analytics to improve query retrieval times.
- Implemented CDH3 Hadoop cluster on CentOS, assisted with performance tuning and monitoring.
- Used Hive to analyze data ingested into HBase and compute various metrics for reporting on the dashboard.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Managed and reviewed Hadoop log files.
- Involved in review of functional and non-functional requirements.
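A minimal sketch, for illustration only, of the Spark SQL reporting path over HBase described above, using the standard HBase TableInputFormat; the table name, column family, and qualifiers are hypothetical:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HBaseReport").getOrCreate()
    import spark.implicits._

    // Point the HBase input format at the (hypothetical) "events" table
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")

    // Scan the table as an RDD of (row key, Result) pairs
    val rows = spark.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Pull the columns needed for the report out of each HBase Result
    val events = rows.map { case (_, result) =>
      val rowKey = Bytes.toString(result.getRow)
      val eventType = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("type")))
      (rowKey, eventType)
    }.toDF("row_key", "event_type")

    // Spark SQL produces the report aggregates
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) AS event_count FROM events GROUP BY event_type").show()

    spark.stop()
  }
}
```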
Java Developer
Environment: J2EE, JDBC, Java, Servlets, JSP, Struts, Hibernate, Web services, SOAP, WSDL, Design Patterns, MVC, HTML, JavaScript 1.2, WebLogic, XML and Junit
Responsibilities:
- Developed the user interface module using JSP, JavaScript, DHTML, and form beans for the presentation layer.
- Developed Servlets and Java Server Pages (JSP).
- Developed PL/SQL queries, and wrote stored procedures and JDBC routines to generate reports based on client requirements.
- Enhanced the system per customer requirements.
- Involved in the customization of the available functionalities of the software for an NBFC (Non-Banking Financial Company).
- Involved in putting proper review processes and documentation in place for functionality development.
- Provided support and guidance for production and implementation issues.
- Used JavaScript validation in JSP.
- Used Hibernate framework to access the data from back-end SQL Server database.
- Used AJAX (Asynchronous JavaScript and XML) to implement user friendly and efficient client interface.
- Used MDBs (message-driven beans) to consume messages from JMS queues/topics.
- Designed and developed Web Application using Struts Framework.
- Used Ant to compile and generate EAR, WAR, and JAR files.
- Conducted Design reviews and Technical reviews with other project stakeholders.
- Implemented Services using Core Java.
- Created test case scenarios for Functional Testing and wrote Unit test cases with JUnit.
- Responsible for Integration, unit testing, system testing and stress testing for all the phases of project.