ETL & Hadoop & Spark Developer Resume
Charlotte, NC
SUMMARY
- Around 9 years of professional experience in Information Technology, including 5+ years in Big Data and the Hadoop ecosystem.
- Experience working with BI teams to translate big data requirements into Hadoop-centric solutions.
- Expert in creating indexes, views, complex stored procedures, user-defined functions, cursors, derived tables, common table expressions (CTEs) and triggers to facilitate efficient data manipulation and data consistency.
- Excellent understanding of Hadoop architecture and the various components of Big Data.
- Hands-on experience in installing, configuring, and using Hadoop ecosystem components like MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Pig, Solr and Zookeeper.
- Experience in analyzing data using HQL, Pig Latin and custom MapReduce programs in Java.
- Knowledge of file formats such as Avro, Parquet and ORC, and of compression codecs such as GZip, Snappy and LZO.
- Experience in installing, customizing and testing Hadoop ecosystem components such as Hive, Pig, Sqoop, Spark and Oozie.
- Extensive experience in Core Java, Struts 2, JSF 2.2, Spring 3.1, Hibernate, Servlets and JSP, with hands-on experience in PL/SQL, XML and SOAP.
- Currently developing Spark applications extensively, using Scala as the primary programming language.
- Experience working with Eclipse IDE, NetBeans, and Rational Application Developer.
- Extensive experience with SOA-based solutions - Web Services, Web API, WCF and SOAP, including RESTful API services.
- Experience in Hadoop MapReduce, Pig, Hive, Oozie, Sqoop, Flume, Zookeeper
- Excellent experience with AWS and the Cloudera and Hortonworks Hadoop distributions, maintaining and optimizing AWS infrastructure (EC2 and EBS).
- Strong competency in Hive schema design, partitioning and bucketing.
- Experience in ingestion, storage, querying, processing and analysis of Big Data with hands-on experience in Apache Spark and Spark Streaming.
- Owned the design, development and maintenance of ongoing metrics, reports, analyses, dashboards, etc., to drive key business decisions and communicate key concepts to readers.
- Expertise in designing and developing a distributed processing system feeding a Data Warehousing platform for reporting.
- Data cleaning, pre-processing and modelling using Spark and Python.
- Strong experience in designing and developing Business Intelligence solutions for Data Warehousing using ETL tools; solid understanding of Data Warehousing concepts and best practices, involved in the full Data Warehousing development life cycle.
- Experience in performing ETL on top of streaming log data from various web servers into HDFS using Flume.
- Experienced in analyzing, designing and developing ETL strategies and processes, and writing ETL specifications.
- Extensive experience in writing UNIX shell scripts to automate ETL processes, and used Netezza utilities to load data and execute SQL scripts on UNIX.
- Expertise in working with relational databases such as Oracle 12c/11g/10g, SQL Server 2012/2008, DB2 8.0/7.0, UDB, MS Access, Teradata and Netezza.
- Performed data analytics and insights using Impala and Hive.
- Expert in developing and scheduling jobs using Oozie and Crontab.
- Hands-on experience with Git, Agile (Scrum), JIRA and Confluence.
TECHNICAL SKILLS
Languages: Java, Python, Scala, HiveQL.
Hadoop Ecosystem: HDFS, Hive, MapReduce, HBase, YARN, Sqoop, Flume, Oozie, Zookeeper, Impala.
Relational Databases: Oracle, DB2, SQL Server, MySQL.
NoSQL Databases: HBase, MongoDB, Cassandra.
Scripting Languages: JavaScript, CSS, Python, Perl, Shell Script.
PROFESSIONAL EXPERIENCE
ETL & Hadoop & Spark Developer
Confidential - Charlotte, NC
Responsibilities:
- Worked on implementation and data integration for large-scale system software using Hadoop ecosystem components such as HBase, Sqoop, Zookeeper, Oozie, Hive and Pig.
- Developed Hive UDFs to extend built-in functionality and wrote HiveQL for sorting, joining, filtering and grouping structured data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Developed processes on both Teradata and Oracle using shell scripting and RDBMS utilities such as MultiLoad, FastLoad, FastExport and BTEQ (Teradata), and SQL*Plus and SQL*Loader (Oracle).
- Developed ETL applications using Hive, Spark, Impala and Sqoop, automated with Oozie. Used Pig as an ETL tool for transformations, event joins and pre-aggregations before storing the data onto HDFS.
- Designed and created ETL jobs in Talend to load large volumes of data into Cassandra, the Hadoop ecosystem and relational databases.
- Translated high-level design specifications into simple ETL coding and mapping standards, and provided cluster coordination services through Zookeeper.
- Automated the process for extraction of data from warehouses and weblogs by developing work-flows and coordinator jobs in Oozie.
- Worked with Confidential EMR to process data directly in S3 and to copy data from S3 to the Hadoop Distributed File System (HDFS) on the Confidential EMR cluster, setting up Spark Core for analysis work.
- Worked on Apache Solr, which is used as the indexing and search engine.
- Used Amazon Kinesis to stream data in real time on AWS.
- Worked on both batch and streaming data processing, with ingestion into NoSQL stores and HDFS in file formats such as Parquet and Avro.
- Involved in configuring and developing the Hadoop environment on the AWS cloud with services such as EC2, EMR, Redshift, CloudWatch and Route 53.
- Responsible for coding MapReduce programs and Hive queries, and for testing and debugging the MapReduce programs.
- Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames and loaded it into Cassandra.
- Developed workflows in Oozie to automate the tasks of loading the data into HDFS.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL.
- Used Sqoop to import data into HBase and Hive, and exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Wrote Hive queries to analyze the data and generate the end reports used by business users.
- Introduced Tableau visualization on Hadoop to produce reports for the business and BI teams, designed ETL jobs per requirements in ODI and loaded data tables to the Teradata server.
- Worked on the Cloudera distribution and deployed it on AWS EC2 instances.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
- Transferred streaming data from different data sources into HDFS and NoSQL databases using Apache Flume.
- Developed Spark jobs written in Scala to perform operations like data aggregation, data processing and data analysis.
- Used SOAP for web services, exchanging XML data between applications over HTTP.
- Used Kafka and Flume to build a robust, fault-tolerant data ingestion pipeline between JMS and Spark Streaming applications for transporting streaming web log data into HDFS.
- Used Spark for series of dependent jobs and for iterative algorithms. Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS (see the sketch following this section).
Environment: Hadoop HDFS, Flume, ETL Tool, CDH, Pig, Hive, Oozie, Zookeeper, HBase, Spark, Storm, Spark SQL, NoSQL, Scala, Teradata, Kafka, MongoDB
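A minimal Scala sketch of the Kafka-to-HDFS ingestion pipeline described above, shown here with Spark Structured Streaming for brevity; the broker address, topic name, log schema and HDFS paths are illustrative placeholders rather than the actual production values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object WebLogIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WebLogIngest").getOrCreate()
    import spark.implicits._

    // Hypothetical schema of the incoming web log events
    val logSchema = new StructType()
      .add("host", StringType)
      .add("url", StringType)
      .add("status", IntegerType)
      .add("ts", TimestampType)

    // Subscribe to the web log topic on Kafka (placeholder broker/topic)
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "weblogs")
      .load()

    // Kafka delivers the payload as bytes; parse the JSON value into columns
    val logs = raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", logSchema).as("log"))
      .select("log.*")

    // Persist the parsed events to HDFS as Parquet, partitioned by event date
    val query = logs
      .withColumn("dt", to_date($"ts"))
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///data/weblogs")
      .option("checkpointLocation", "hdfs:///checkpoints/weblogs")
      .partitionBy("dt")
      .start()

    query.awaitTermination()
  }
}
```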
ETL Big Data/Hadoop Developer
Confidential - Dallas, TX
Responsibilities:
- Developed Scala UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop (see the sketch following this section).
- Exported analyzed data to relational databases using Sqoop, deployed data from various sources into HDFS and built reports using Tableau.
- Developed RESTful web services using Spring IoC to provide users a way to run the job and generate the daily status report.
- Exported analyzed data to a relational database using Sqoop for visualization and to generate reports for the BI team.
- Responsible for developing a data pipeline with Confidential AWS to extract data from web logs and store it in MongoDB.
- Developed unit test case scenarios for thoroughly testing ETL processes and shared them with testing team.
- Involved in migrating ETL processes from Oracle to Hive to simplify data manipulation.
- Used the JSON and Avro serialization/deserialization (SerDe) libraries packaged with Hive to parse the contents of streamed log data, and implemented custom Hive UDFs.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
- Involved in developing ETL data pipelines for performing real-time streaming by ingesting data into HDFS and HBase using Kafka and Storm.
- Used Teradata utilities (TPT, BTEQ) to load data from source to target table and created various kinds of indexes for performance enhancement.
- Involved in moving log files generated by varied sources into HDFS through Flume for further processing.
- Involved in creating Hive tables using Impala, working on them with HiveQL, and performing data analysis using Hive and Pig.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase.
- Worked on Apache Flume to collect and aggregate huge amounts of log data and stored it on HDFS for further analysis.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Used Spring Boot to develop and deploy both RESTful web services and MVC applications
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
Environment: Hadoop, Flume, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, ETL Tool, Hive, Scala, Pig, NoSQL, Impala, Oozie, Teradata, HBase, Zookeeper.
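A minimal Scala sketch of the DataFrame/UDF aggregation pattern referenced above; the table, column and path names are hypothetical, and the final export into the RDBMS is assumed to be handled by a separate Sqoop job as described in the bullet.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RegionAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RegionAggregation")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical Hive staging table holding transaction records
    val txns = spark.table("staging.transactions")

    // Scala UDF that normalizes free-text region codes before grouping
    val normalizeRegion = udf((region: String) =>
      Option(region).map(_.trim.toUpperCase).getOrElse("UNKNOWN"))

    // Aggregate amounts per normalized region and transaction date
    val daily = txns
      .withColumn("region_cd", normalizeRegion($"region"))
      .groupBy($"region_cd", to_date($"txn_ts").as("txn_date"))
      .agg(sum($"amount").as("total_amount"), count(lit(1)).as("txn_count"))

    // Land the aggregates as delimited files on HDFS; a downstream Sqoop
    // export job (per the bullet above) pushes them into the RDBMS table
    daily.write.mode("overwrite")
      .option("sep", ",")
      .csv("hdfs:///staging/daily_region_totals")

    spark.stop()
  }
}
```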
ETL Hadoop Consultant
Confidential - Dallas, TX
Responsibilities:
- Worked on loading the customer's data and event logs from Kafka into HBase using REST API.
- Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, manage and review data backups and log files.
- Involved in identifying source data from different systems and mapping the data into the warehouse.
- Responsible for designing, developing and testing the software (PL/SQL, UNIX shell scripts) to maintain the data marts (loading data and analyzing it using OLAP tools).
- Implemented various MapReduce jobs in custom environments and updated HBase tables by generating Hive queries.
- Used Pig as an ETL tool for transformations, event joins and pre-aggregations before storing the data onto HDFS.
- Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Developed REST web services producing XML and JSON to perform tasks serving both web and mobile applications.
- Collected and aggregated large amounts of log data using Apache Flume, staging the data in HDFS for further analysis.
- Created Hive tables from JSON data using serialization frameworks with the Avro and Parquet file formats and Snappy compression (see the sketch following this section).
- Developed data pipelines using Pig and Hive from Teradata and DB2 data sources; these pipelines used customized UDFs to extend the ETL functionality.
- Implemented generic export framework for moving data from HDFS to RDBMS and vice-versa.
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Kafka, Pig, Hive and MapReduce (MR1 and MR2).
- Automated all the jobs that pull data from the FTP server and load it into Hive tables using Oozie workflows. Used HCatalog to access Hive table metadata from Pig code.
- Implemented SQL and PL/SQL stored procedures. Actively involved in code review and bug fixing to improve performance.
- Worked on tuning the performance of Pig queries and was involved in loading data from the Linux file system into HDFS. Imported and exported data to and from HDFS using Sqoop and Kafka.
- Created HBase tables to store various formats of PII data coming from different portfolios, and implemented MapReduce jobs to load data from the Oracle database.
- Used NoSQL databases HBase and MongoDB. Exported result sets from Hive to MySQL using shell scripts.
- Gained experience in managing and reviewing Hadoop log files. Involved in scheduling Oozie workflow engine to run multiple Pig jobs.
Environment: Hadoop, HDFS, HBase, Pig, Hive, Spark, Hortonworks, Oozie, ETL Tool, MapReduce, Sqoop, MongoDB, Kafka, LINUX, Java APIs, Java collection.
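A minimal Scala sketch of one way to land JSON data as Snappy-compressed Parquet and expose it to Hive, in the spirit of the table-creation work above; the field names, database, table and HDFS paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveTableSetup {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; all names and paths below are placeholders
    val spark = SparkSession.builder()
      .appName("HiveTableSetup")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Configure Snappy compression for Parquet output
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    // Parse the raw JSON events and keep only the fields the reports need
    val events = spark.read.json("hdfs:///raw/click_events")
      .select($"user_id", $"url", $"ts".cast("timestamp").as("ts"))

    // Persist as Parquet on HDFS
    events.write.mode("overwrite").parquet("hdfs:///warehouse/click_events_parquet")

    // Expose the Parquet files to Hive (and Impala) as an external table
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS analytics.click_events (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
      )
      STORED AS PARQUET
      LOCATION 'hdfs:///warehouse/click_events_parquet'
    """)

    spark.stop()
  }
}
```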
Hadoop Developer
Confidential - Overland Park, KS
Responsibilities:
- Worked extensively on Hadoop Components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN and MapReduce programming.
- Developed PySpark and Scala code to cleanse the data and perform ETL at different stages of the data pipeline.
- Developed ETL processes to transfer data from different sources using Sqoop, Impala, and Bash.
- Involved in loading data from UNIX file system to HDFS. Imported and exported data into HDFS and Hive using Sqoop.
- Created ETL mappings with Talend Integration Suite to pull data from the source, apply transformations, and load the data into the target database.
- Developed a batch processing framework to ingest data into HDFS, Hive and Cassandra.
- Worked on Hive and Pig extensively to analyze network data in collecting metrics for Hadoop clusters.
- Automated data pulls from SQL Server to Hadoop for analyzing large data sets to determine the optimal way to aggregate and report on them. Provided quick responses to client requests and created ad hoc reports.
- Tuned Hive and Pig job performance parameters along with native MapReduce parameters to avoid excessive disk spills, and enabled temp file compression between jobs in the data pipeline to handle production-size data in a multi-tenant cluster environment.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive and Pig jobs on Apache Hadoop environment by Hortonworks.
- Developed a process for the batch ingestion of CSV files and Sqoop imports from different sources, and generated views on the data sources using shell scripting and Python.
- Delivered Hadoop migration strategy, roadmap and technology fitment for importing real time network log data into HDFS.
- Ran POCs on moving existing Hive/Pig Latin jobs to Spark, and deployed and configured agents to stream log events into HDFS for analysis.
- Loaded data into Hive tables using HiveQL with deduplication and windowing to generate ad hoc reports in Hive, used to validate customer viewing history and debug issues in production (see the sketch following this section).
- Experienced with multiple input formats, such as text file, key-value and sequence file input formats, when loading to HDFS.
- Worked on business-specific custom UDFs in Hive and Pig for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
Environment: ETL Tool, HDFS, MapReduce, Pig, Hive, Oozie, Sqoop, Cassandra, Hortonworks.
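A minimal Scala sketch of the deduplication-by-windowing pattern behind the ad hoc viewing-history reports mentioned above; the database, table and column names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ViewingHistoryDedup {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; table and column names below are illustrative
    val spark = SparkSession.builder()
      .appName("ViewingHistoryDedup")
      .enableHiveSupport()
      .getOrCreate()

    // Keep only the most recent record per (customer, asset) using ROW_NUMBER(),
    // the same windowing/deduplication pattern used for the ad hoc reports
    val deduped = spark.sql("""
      SELECT customer_id, asset_id, event_ts, watch_seconds
      FROM (
        SELECT customer_id, asset_id, event_ts, watch_seconds,
               ROW_NUMBER() OVER (
                 PARTITION BY customer_id, asset_id
                 ORDER BY event_ts DESC
               ) AS rn
        FROM raw.viewing_history
      ) t
      WHERE rn = 1
    """)

    // Materialize the cleaned history for reporting and production debugging
    deduped.write.mode("overwrite").saveAsTable("reporting.viewing_history_latest")

    spark.stop()
  }
}
```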
Java Developer
Confidential - Detroit, MI
Responsibilities:
- Experience with middleware architectures using Java technologies such as J2EE and Servlets, and application servers such as WebSphere and WebLogic.
- Worked on loading data from Linux file system to HDFS.
- Understood and analyzed the requirements. Designed, developed and validated the user interface using HTML, JavaScript, and XML.
- Wrote SQL queries using joins and stored procedures, and used Maven to build and deploy the applications on the JBoss application server within the software development life cycle model.
- Worked in the Eclipse IDE as the front-end development environment for insert, update and retrieval operations against the Oracle database by writing stored procedures.
- Developed MapReduce jobs to convert data files into Parquet file format and included MRUnit to test the correctness of MapReduce programs.
- Worked with structured, semi-structured and unstructured datasets in Teradata and Oracle, successfully loading files from Teradata to HDFS and from HDFS into Hive.
- Installed the Oozie workflow engine to run multiple Hive jobs. Developed Hive queries to process the data and generate data cubes for visualization.
- Migrated ETL logic from the RDBMS to Hive.
- Implemented partitioning and bucketing and worked on Hive, using file formats and compression techniques with optimizations (see the sketch following this section).
- Used MapReduce to compute metrics that define user experience, revenue, etc.
Environment: Hadoop, HDFS, Pig, Oozie, Hive, Python, MapReduce, Java, SQL Scripting and Linux Shell Scripting, Cloudera, Cloudera Manager.
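A minimal Scala sketch of a partitioned, bucketed, compressed table layout of the kind referenced above, shown here with Spark's DataFrameWriter bucketing as one possible approach rather than Hive DDL; the database, table, column names and bucket count are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedEventsTable {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; names, paths and bucket count are placeholders
    val spark = SparkSession.builder()
      .appName("PartitionedEventsTable")
      .enableHiveSupport()
      .getOrCreate()

    // Snappy-compressed Parquet output (the same idea applies to ORC)
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    // Hypothetical staging data to be laid out for efficient querying
    val events = spark.table("staging.user_events_raw")

    // Partition by event date and bucket by user id so per-user lookups and
    // date-range scans stay cheap; bucketBy requires saveAsTable
    events.write
      .partitionBy("event_date")
      .bucketBy(32, "user_id")
      .sortBy("user_id")
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("metrics.user_events")

    spark.stop()
  }
}
```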
Software Developer
Confidential
Responsibilities:
- Worked on business logic for the web service using Spring annotations, which enable dependency injection.
- Developed Spring Application Framework for Dependency Injection, support for the Data Access Object (DAO) pattern and integrated with Hibernate ORM.
- Developed the user interface for the application using Java, JEE and Spring Core with JSP, JavaScript, Ajax, jQuery, HTML, CSS and JSTL.
- Followed Agile methodology with daily scrums, using TDD and continuous integration in the SDLC process, and used JIRA for bug tracking and task management.
- Developed Talend jobs to populate the claims data into the data warehouse (star schema).
- Used Jenkins for continuous integration with SVN for version control and JUnit and Mockito for unit testing, creating design documents and test cases for development work.
Environment: Java, Servlets, JSP, HTML, CSS, Talend, Ajax, JavaScript, Hibernate, Spring, WebLogic, JMS, REST, SVN