Sr. Big Data Spark Engineer Resume
TN
PROFESSIONAL SUMMARY:
- Overall 8+ years of IT experience in the analysis, design, development and implementation of business applications, with thorough knowledge of Java, J2EE, Big Data, the Hadoop ecosystem and RDBMS-related technologies, and domain exposure in Retail, Healthcare, Banking, E-commerce, Insurance, Logistics and Financial (Mortgage) systems.
- Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Kafka, YARN, Oozie, Ambari and ZooKeeper.
- Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm.
- Strong experience with Hadoop distributions such as Cloudera, MapR and Hortonworks.
- Good knowledge of Hadoop cluster architecture and cluster monitoring; worked with the admin team on cluster setup.
- Hands-on experience installing and configuring Cloudera Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop and Pig.
- Experience developing MapReduce programs on Apache Hadoop to analyze big data according to requirements.
- Highly capable of processing large structured, semi-structured and unstructured datasets supporting Big Data applications.
- Extended Hive and Pig core functionality by writing Pig Latin UDFs in Java and used various UDFs from Piggybank and other sources.
- Good experience with Hive partitioning and bucketing, performing different types of joins on Hive tables, and implementing Hive SerDes such as JSON and ORC.
- Worked with different file formats (ORCFile, TextFile) and compression codecs (GZIP, Snappy, LZO).
- Proficient in Hadoop data formats such as Avro and Parquet.
- Comprehensive knowledge and experience in process improvement, normalization/de-normalization, data extraction, data cleansing and data manipulation in Hive.
- Good knowledge of NoSQL databases such as HBase, Cassandra and MongoDB.
- Proficient in implementing HBase.
- Used ZooKeeper to provide coordination services to the cluster.
- Experience with the Oozie workflow scheduler, managing Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Experience using Sqoop to import data into HDFS from RDBMS and vice versa.
- Extensive experience importing and exporting data using stream processing platforms such as Flume and Kafka.
- Implemented indexing for logs from Oozie into Elasticsearch.
- Analyzed the integration of Kibana with Elasticsearch.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala (see the sketch at the end of this summary).
- Developed Apache Spark jobs in Scala in a test environment for faster data processing and used Spark SQL for querying.
- Experience creating SparkContext, SQLContext and StreamingContext instances to process large datasets.
- Explored Spark to improve performance and optimize existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Experienced in Spark Core, Spark RDDs, pair RDDs and Spark deployment architectures.
- Extensive experience using Maven and Ant as build tools for producing deployable artifacts from source code.
- Experience in Unix shell scripting.
- Worked with Big Data distributions such as Cloudera (CDH 3 and 4) with Cloudera Manager.
- Knowledge of cloud technologies such as AWS and Amazon Elastic MapReduce (EMR).
- Good knowledge of Azure technologies: Azure Data Lake, Azure Data Factory and HDInsight.
- Experience working with Amazon Web Services: S3, EC2 and Redshift.
- Proficient in OOP concepts such as polymorphism, inheritance and encapsulation.
- Extensive programming experience developing web-based applications using Java, J2EE, JSP, Servlets, EJB, Struts, Spring, Hibernate, JDBC, JavaScript, HTML, JavaScript libraries and web services.
- Proficient in developing web pages quickly and effectively using HTML5, CSS3, JavaScript and jQuery, with experience making pages cross-browser compatible.
- Good knowledge of advanced Java topics such as generics, collections and multithreading.
- Experience in database development using SQL and PL/SQL, working with databases such as Oracle 9i/10g, SQL Server and MySQL.
- Good knowledge of software-defined networking, controllers, OpenFlow, NFV and cloud/OpenStack.
- Data warehouse experience using Informatica PowerCenter as an ETL tool.
- Excellent interpersonal skills and experience interacting with clients; a strong team player with good problem-solving skills.
- Hands-on experience provisioning and managing multi-tenant Hadoop clusters in public cloud environments (Amazon Web Services) and on private cloud infrastructure (the OpenStack cloud platform).
- Strong knowledge of object-oriented and distributed application development.
- Wrote unit test cases for MapReduce jobs using JUnit and MRUnit.
- Experience with development and CI tools such as GitHub and Jenkins.
- Expertise with application servers and web servers such as WebLogic, IBM WebSphere, Apache Tomcat and JBoss, and with VMware.
- Good understanding of Hadoop Gen1/Gen2 architecture, with hands-on experience in components such as JobTracker, TaskTracker, NameNode, Secondary NameNode and DataNode, MapReduce concepts, and YARN/Mesos architecture, including the NodeManager, ResourceManager and ApplicationMaster.
- Deployed and monitored scalable infrastructure on Amazon Web Services (AWS).
- Knowledge of Splunk architecture and its components: indexer, forwarder, search head, deployment server, heavy and universal forwarders, and the license model.
- Knowledge of machine learning: linear regression, logistic regression, clustering, classification, decision trees, support vector machines and dimensionality reduction.
- Comprehensive knowledge of the Software Development Life Cycle (SDLC), with a thorough understanding of phases such as requirements analysis, design, development and testing.
- Worked in both Agile and Waterfall software life cycles, estimating project timelines.
- Ability to quickly master new concepts and applications.
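A minimal sketch of the MapReduce-to-Spark RDD migration POC referenced above, in Scala. The input/output paths, delimiter and key column are illustrative assumptions, not details from a specific engagement.

```scala
// Sketch: a Mapper/Reducer-style count-by-key rewritten as Spark RDD transformations.
// Paths, delimiter and field positions below are hypothetical.
import org.apache.spark.sql.SparkSession

object LogCountBySource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LogCountBySource")
      .getOrCreate()

    // Read raw log lines from HDFS (hypothetical path).
    val lines = spark.sparkContext.textFile("hdfs:///data/raw/logs/")

    // map + reduceByKey replaces the Mapper/Reducer pair of the original MR job.
    val countsBySource = lines
      .map(_.split('\t'))
      .filter(_.nonEmpty)
      .map(fields => (fields(0), 1L))   // key on the first column, e.g. source system
      .reduceByKey(_ + _)

    countsBySource.saveAsTextFile("hdfs:///data/curated/log_counts/")   // hypothetical output
    spark.stop()
  }
}
```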
TECHNICAL SKILLS:
Big Data Technologies: Hadoop (HDFS & MapReduce), Pig, Hive, HBase, ZooKeeper, Sqoop, Apache Storm, Flume, Kafka, Spark, Spark Streaming, MLlib, Spark SQL and DataFrames, GraphX, Scala, Solr, Lucene, Elasticsearch and AWS
Programming & Scripting Languages: Java, C, SQL, R, Python, Impala, Scala, C++
J2EE Technologies: JSP, Servlets, EJB, AngularJS
Web Technologies: HTML, JavaScript
Frameworks: Spring 3.5 - Spring MVC, Spring ORM, Spring Security, Spring Roo, Hibernate, Struts.
Application Servers: IBM WebSphere, JBoss, WebLogic
Web Servers: Apache Tomcat
Databases: MS SQL Server & SQL Server Integration Services (SSIS), MySQL, MongoDB, Cassandra, Oracle DB, Teradata
Designing Tools: UML, Visio
IDEs: Eclipse, NetBeans
Operating Systems: Unix, Windows, Linux, CentOS
Others: PuTTY, WinSCP, Data Lake, Talend, Tableau, GitHub, SVN, CVS.
PROFESSIONAL EXPERIENCE:
Sr. Big Data Spark Engineer
Confidential, TN
Responsibilities:
- Worked on Big Data infrastructure for both batch and real-time processing; responsible for building scalable distributed data solutions using Hadoop.
- Created Hive tables, loaded them with data and wrote Hive queries that invoke and run MapReduce jobs in the backend.
- Kept knowledge current on AWS concepts such as the EMR and EC2 web services, which provide fast and efficient processing of Big Data.
- Designed and implemented Incremental Imports into Hive tables.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Imported and exported terabytes of data between HDFS and relational database systems using Sqoop.
- Moved relational database data into Hive dynamic-partition tables via staging tables using Sqoop.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
- Experienced in managing and reviewing Hadoop log files.
- Involved in developing Pig scripts for change data capture and delta record processing between newly arrived data and data already existing in HDFS.
- Deployed a Hadoop cluster on AWS EC2.
- Migrated ETL jobs to Pig scripts that perform transformations, joins and pre-aggregations before storing the data in HDFS.
- Implemented the workflows using Apache Oozie framework to automate tasks.
- Used ZooKeeper to coordinate cluster services.
- Worked on different file formats such as Sequence files, XML files and Map files using MapReduce programs.
- Used Impala wherever possible to achieve faster results than Hive during data analysis.
- Implemented real-time data ingestion and cluster handling using Kafka.
- Worked on writing transformer/mapping MapReduce pipelines using Java.
- Developed scripts to automate end-to-end data management and synchronization between all clusters.
- Transformed log data into a data model using Apache Pig and wrote UDFs to format the log data.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Experience with both SQLContext and SparkSession.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala to query different formats such as text and CSV files.
- Used Spark API over Hadoop YARN to perform analytics on data in Hive.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming (see the sketch after this section).
- Developed Spark scripts using the Scala shell as per requirements.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Experience implementing a log error alarmer in Spark.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Integrated AWS Kinesis with an on-premises Kafka cluster.
- Experienced in monitoring the cluster using Cloudera Manager.
- Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, AWS EC2, Flume, Kafka, Apache Spark, Storm, Solr, Shell Scripting, HBase, Scala, Python, Kerberos, Agile, ZooKeeper, Maven, AWS, MySQL.
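A minimal sketch of the Kafka-to-Spark Streaming pipeline referenced in this section, using Spark Structured Streaming in Scala. Broker addresses, the topic name and the sink/checkpoint paths are illustrative assumptions, and the job assumes the Spark-Kafka connector is on the classpath.

```scala
// Sketch: consume server logs from Kafka and land them on HDFS with Structured Streaming.
// Brokers, topic and paths are hypothetical.
import org.apache.spark.sql.SparkSession

object ServerLogStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ServerLogStream")
      .getOrCreate()

    val rawLogs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  // hypothetical brokers
      .option("subscribe", "server-logs")                              // hypothetical topic
      .load()

    // Kafka delivers keys/values as bytes; cast the value to a string log line.
    val logLines = rawLogs.selectExpr("CAST(value AS STRING) AS line")

    val query = logLines.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/streaming/server_logs/")           // hypothetical sink
      .option("checkpointLocation", "hdfs:///checkpoints/server_logs/")
      .start()

    query.awaitTermination()
  }
}
```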
Sr. Big Data Engineer
Confidential, OH
Responsibilities:
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports per user needs.
- Worked on analyzing, writing Hadoop MapReduce jobs using Java API, Pig and Hive.
- Developed Spark scripts using the Python shell as per requirements.
- Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
- Implemented data access jobs through Pig, Hive, Tez, Solr, Accumulo, HBase and Storm.
- Worked on Developing custom MapReduce programs and User Defined Functions (UDFs) in Hive to transform the large volumes of data with respect to business requirement.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs) and User Defined Aggregate Functions (UDAFs) written in Python.
- Developed Hive scripts to meet analysts' requirements.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Analyzed data using Hive queries (HiveQL), Impala and Pig Latin scripts to study customer behavior.
- Developed ETL jobs in Talend to load data from ASCII flat files.
- Processed HDFS data and created external tables using Hive and developed scripts to ingest and repair tables that can be reused across the project.
- Implemented real-time data ingestion and cluster handling using Kafka.
- Filtered datasets with Pig UDFs and Pig scripts in HDFS, and with bolts in Apache Storm.
- Involved in writing Pig Scripts for Cleansing the data and implemented Hive tables for the processed data in tabular format.
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Developed ETL jobs using Spark and Scala to migrate data from Oracle to new Hive tables (see the sketch after this section).
- Experienced in performing CRUD operations in HBase.
- Used JSON, Parquet and Avro SerDes for serialization and deserialization.
- Set up cron jobs to delete Hadoop logs, old local job files and cluster temp files.
- Used HBase to store the majority of the data, which needs to be divided by region.
- Used Maven extensively for building jar files of MapReduce programs and deployed to cluster.
- Used ZooKeeper to provide coordination services to the cluster; experienced in managing and reviewing Hadoop log files.
- Hands-on expertise with various architectures in MongoDB and Cassandra.
- Very good experience in monitoring and managing the Hadoop cluster using Hortonworks.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Used Amazon Redshift for data warehouse and to generate backend reports.
- Exported the business-required information to RDBMS using Sqoop, making the data available for the BI team to generate reports.
Environment: Hadoop, HDFS, Pig, Hive, HBase, MapReduce, Sqoop, Flume, Impala, Oozie, ZooKeeper, ETL, Linux, Big Data, Apache Ranger, MapR, Netezza, Java, Eclipse, Maven, SQL, Knox, Ambari, NoSQL.
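A minimal sketch of the Oracle-to-Hive migration pattern referenced in this section, as a Spark/Scala job. The JDBC URL, credentials, source table, target table and partition column are illustrative assumptions; the job also assumes the Oracle JDBC driver is on the classpath.

```scala
// Sketch: pull an Oracle table over JDBC and land it as a partitioned Hive table.
// Connection details and table/column names are hypothetical.
import org.apache.spark.sql.SparkSession

object OracleToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OracleToHive")
      .enableHiveSupport()              // lets saveAsTable write to the Hive metastore
      .getOrCreate()

    val ordersDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // hypothetical connection
      .option("dbtable", "SALES.ORDERS")                       // hypothetical source table
      .option("user", sys.env("ORACLE_USER"))
      .option("password", sys.env("ORACLE_PASSWORD"))
      .load()

    ordersDf.write
      .mode("overwrite")
      .partitionBy("order_date")        // hypothetical partition column
      .format("parquet")
      .saveAsTable("curated.orders")    // hypothetical target Hive table
  }
}
```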
Big Data Engineer
Confidential, NJ
Responsibilities:
- Involved in all phases of the Big Data implementation, including requirement analysis, design, development, building, testing and deployment of the Hadoop cluster in fully distributed mode; mapped DB2 V9.7 and V10.x data types to Hive data types and performed validations.
- Identified the various data sources and understood the data schemas in the source environment.
- Designed, built and supported pipelines for data ingestion, transformation, conversion and validation.
- Provided quick response to ad hoc internal and external client requests for data and experienced in creating ad hoc reports.
- Experience ingesting data into Cassandra and consuming the ingested data from Cassandra into Hadoop.
- Worked on a cloud platform built as a scalable distributed data solution using Hadoop on a 40-node AWS cluster to run analysis on 25+ terabytes of customer usage data.
- Implemented partitioning, dynamic partitions and bucketing in Hive for efficient data access (see the sketch after this section).
- Created final tables in Parquet format and used Impala to create and manage the Parquet tables.
- Enhanced Hive query performance using Tez for customer attribution datasets.
- Implemented real-time data ingestion and cluster handling using Apache Kafka.
- Involved in running ad hoc queries through Pig Latin, Hive or Java MapReduce.
- Worked extensively on importing metadata into Hive and migrated existing tables and applications to Hive and the AWS cloud.
- Worked on NoSQL databases including HBase and Cassandra.
- Participated in the development and implementation of a Cloudera Impala Hadoop environment.
- Installed Oozie workflow engine to run multiple MapReduce jobs.
- Developed a NoSQL database in MongoDB using CRUD operations, indexing, replication and sharding.
- Developed the data model to manage the summarized data.
Environment: Hadoop, HDFS, Pig, Hive, AWS, MapReduce, Java, Flume, Knox, Talend, Oozie, Linux/Unix Shell scripting, Avro, Parquet, Cassandra, MongoDB, Azure, Python, Perl, Git, Maven, Jenkins.
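A minimal sketch of the Hive partitioning pattern referenced in this section, issued here through Spark SQL in Scala for illustration. The database, table and column names are hypothetical; the dynamic-partition settings and INSERT ... PARTITION syntax follow standard Hive usage.

```scala
// Sketch: create a partitioned Hive table and load it with a dynamic-partition insert.
// Database, table and column names are hypothetical.
import org.apache.spark.sql.SparkSession

object HiveDynamicPartitionLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveDynamicPartitionLoad")
      .enableHiveSupport()
      .getOrCreate()

    // Allow partition values to be derived from the data instead of being hard-coded.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
      CREATE TABLE IF NOT EXISTS curated.usage_events (
        customer_id STRING,
        event_type  STRING,
        payload     STRING
      )
      PARTITIONED BY (event_date STRING)
      STORED AS PARQUET
    """)

    // The trailing event_date column in the SELECT drives the target partition.
    spark.sql("""
      INSERT OVERWRITE TABLE curated.usage_events PARTITION (event_date)
      SELECT customer_id, event_type, payload, event_date
      FROM staging.usage_events_raw
    """)
  }
}
```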
Hadoop Developer
Confidential, CA
Responsibilities:
- Worked with the Big Data team responsible for building the Hadoop stack and various big data analytic tools, including migration from RDBMS to Hadoop using Sqoop.
- Used Bash shell scripting to perform Hadoop operations.
- Designed the sequence diagrams to depict the data flow into Hadoop.
- Involved in importing and exporting data between HDFS and relational systems such as Oracle, MySQL, DB2 and Teradata using Sqoop.
- As a POC, extensively worked with Oozie workflow engine to run multiple Hive Jobs.
- Worked on Hive to analyze the data and extract reports.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Developed simple to complex MapReduce jobs using Hive and Pig; developed shell and Python scripts to automate and provide control flow to Pig scripts.
- Responsible for managing data from multiple sources.
- Managing and scheduling Jobs on a Hadoop cluster using Oozie.
- Developed simple to complex MapReduce jobs in the Java programming language, implemented alongside Hive and Pig.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Hadoop, Hive, Pig, Sqoop, MapReduce, Linux, HDFS, Java.