Hadoop/Spark Developer Resume
Herndon, VA
SUMMARY:
- Expertise in Hadoop ecosystem - HDFS, YARN, Pig, HBase, Spark and Hive for data analysis, Sqoop for data migration, Flume for data ingestion, Oozie for scheduling and Zookeeper for coordinating cluster resources.
- 8+ years of experience in the IT industry, including over 4 years as a Hadoop Developer, with a strong background in databases and data warehousing concepts.
- Extensive experience in managing data from different sources, including HDFS maintenance and loading of structured and unstructured data.
- Experience in installation and configuration of Hadoop ecosystem components.
- Experience with AWS components and services, particularly EMR, S3, and Lambda.
- Experience with open-source NoSQL technologies such as HBase and Cassandra.
- Experience with messaging & complex event processing systems such as Kafka and Storm.
- Sound knowledge of using Apache Solr to search against structured and unstructured data.
- Experience in importing and exporting data between HDFS and relational database systems using Sqoop.
- Excellent understanding of the Hadoop Distributed File System (HDFS) and experienced in developing efficient jobs to process large datasets.
- Sound knowledge of Hadoop distributions such as Cloudera and Hortonworks.
- Highly proficient in Agile, Test-Driven, Iterative, Scrum, and Waterfall software development life cycles.
- Strong problem-solving, communication, and interpersonal skills; a good team player.
- Motivated to take on independent responsibility as well as to contribute as a productive team member.
- Good working knowledge of using Sqoop and Flume for data ingestion into HDFS.
- Self-motivated, with a strong desire to learn, and an effective team player.
TECHNICAL SKILLS:
Hadoop Ecosystem: HDFS, YARN, MapReduce, Hive, HBase, Pig, Sqoop, Impala, ZooKeeper, Storm, Kafka, Hue, Cloudera Manager, Ambari and Spark
Programming Languages: Java, SQL, HQL, Pig Latin, Python, Scala and shell scripting
Databases: HBase, SQL Server 2005, MS Access, DB2, Oracle
Platforms: UNIX, Win 98/XP/2000/NT, DOS
Reporting Tools: Tableau
PROFESSIONAL EXPERIENCE:
Confidential, Herndon, VA
Hadoop/Spark Developer
Environment: CDH 5.7.0 (CentOS), Apache Hadoop 2.7.1, MapReduce, HBase 1.1.2, Pig 0.15.0, Sqoop 1.4.6, Oozie 4.2.0, Java 8, Hive 1.2.1, Impala, ZooKeeper 3.4.6, Oracle 11g, PL/SQL, SQL Developer 4.0, UNIX, REST web services, SQL, shell scripting
Responsibilities:
- Configured Hadoop components including Hive, Pig, HBase, Spark, Sqoop, Oozie and Hue in the client environment.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in HDFS.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with HDFS reference tables and historical metrics.
- Responsible for managing data coming from different sources and involved in HDFS maintenance and loading of structured and unstructured data.
- Defined Oozie job flows to automate data loading into HDFS and Pig scripts to pre-process the data, enabling faster reviews and first-mover advantage for the business.
- Used Flume to collect and aggregate web log data from different sources, such as web servers and network devices, and store it in HDFS.
- Developed Pig scripts to transform raw data from several data sources into baseline data and loaded it into HBase.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and XML.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website.
- Involved in creating POCs to ingest and process streaming data using Spark and HDFS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Integrated Apache Storm with Kafka to perform web analytics. Uploaded click stream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to HBase per business requirements.
- Worked in provisioning and managing multi-tenant Hadoop clusters on Amazon Web Services (AWS).
- Involved in installing the Cloudera distribution of Hadoop on Amazon EC2 instances.
- Created Hive UDFs to process business logic that varies by policy.
- Monitored the cluster using Cloudera Manager.
- Used Sqoop to import data from MySQL into S3 buckets on a regular basis (a command sketch follows this list).
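
A minimal sketch of the kind of scheduled Sqoop import referenced above; the host, credential path, table, and bucket names are placeholders, and writing directly to S3 assumes an S3-capable Hadoop filesystem (for example EMRFS on EMR).

```bash
#!/bin/bash
# Illustrative sketch only: connection details, table, and bucket names are placeholders.
# Assumes Sqoop with the MySQL JDBC driver and an S3-capable Hadoop filesystem (e.g., EMRFS).
LAST_VALUE=${1:-0}   # highest txn_id already imported, passed in by the scheduler

sqoop import \
  --connect jdbc:mysql://mysql-host:3306/salesdb \
  --username etl_user \
  --password-file /user/etl/.mysql.pwd \
  --table transactions \
  --target-dir "s3://example-bucket/raw/transactions/$(date +%Y-%m-%d)" \
  --incremental append \
  --check-column txn_id \
  --last-value "${LAST_VALUE}" \
  --num-mappers 4
```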
Hadoop Developer
Environment: Hadoop 2.2 (Hortonworks): Pig 0.14.0, MapReduce, Hive 0.14.0, Tez 0.5.2, HDFS 2.6.0, Sqoop 1.4.5, Oozie 4.1.0, HBase 0.98.4, ZooKeeper, Hadoop data lake on Linux (CentOS).
Responsibilities:
- Participated in brainstorming sessions on finalizing the data ingestion requirements and design.
- Implemented solutions for ingesting data from various sources and processing the data utilizing Big Data Technologies such as Hive, Pig, Sqoop, HBase, MapReduce.
- Designed Sqoop jobs to parallelize data loads from source systems.
- Designed and developed a daily process to do incremental import of raw data from DB2 into Hive tables using Sqoop.
- Worked on design and development of Oozie workflows to orchestrate MapReduce, Pig, and Hive jobs.
- Participated in providing inputs for the design of the ingestion patterns.
- Worked on the design of Hive data store to store the data from various data sources.
- Developed MapReduce jobs to operate on streaming data, along with MRUnit tests to validate them.
- Worked with source system load testing teams to perform loads while ingestion jobs are in progress.
- Developed MapReduce programs to parse the raw data, populate tables and store the refined data in partitioned tables.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Worked on building analytical data stores for data science team's model development.
- Debugged MapReduce jobs using the MRUnit framework and optimized MapReduce performance.
- Involved in providing inputs to analyst team for functional testing.
- Extensively used Hive/HQL queries to query data in Hive tables.
- Worked on performance tuning of Hive queries using partitioning and bucketing (a sketch follows this list).
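
A minimal sketch of the partitioned and bucketed table layout referenced in the tuning bullet above; the database, table, and column names are placeholders.

```bash
#!/bin/bash
# Illustrative sketch only: database, table, and column names are placeholders.
hive -e "
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

CREATE TABLE IF NOT EXISTS analytics.web_events (
  user_id  STRING,
  url      STRING,
  duration INT
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Load from the raw staging table; partition pruning on event_date and
-- bucketing on user_id reduce the data scanned per query.
INSERT OVERWRITE TABLE analytics.web_events PARTITION (event_date)
SELECT user_id, url, duration, event_date
FROM analytics.web_events_raw;
"
```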
Hadoop Developer
Environment: Hadoop Cloudera CDH3: MapReduce, Hive, Pig, Sqoop, Oozie, Java, HBase, ZooKeeper, InfoSphere DataStage v8.5, Oracle 11g, SQL Developer 4.1.3, UNIX.
Responsibilities:
- Analyzed data on the Hadoop cluster using big data analytic tools including MapReduce, Pig, and Hive.
- Actively monitored systems architecture design as well as the implementation and configuration of Hadoop deployment, backup, and disaster recovery systems and procedures.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Designed and developed Data Ingestion components.
- Assisted in upgrading, configuration, and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in HDFS.
- Developed data pipelines using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Monitored multiple Hadoop clusters environments using Ganglia.
- Worked with TOAD for data analysis and IBM InfoSphere DataStage for data mapping and ETL transformations between the source and target databases.
- Extensively used Teradata utilities and created BTEQ and FastLoad scripts to perform ETL operations within the Teradata environment.
- Used the Oozie workflow engine to manage and automate Hadoop jobs (a submission sketch follows this list).
- Implemented test scripts to support test-driven development and continuous integration.
- Responsible for managing data coming from different sources.
- Used ZooKeeper for coordinating the clusters and maintaining data consistency.
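
A minimal sketch of the Oozie submission step referenced above; the Oozie URL, NameNode and ResourceManager addresses, and the workflow path are placeholders.

```bash
#!/bin/bash
# Illustrative sketch only: hosts and HDFS paths are placeholders.
# The heredoc is quoted so ${nameNode} stays literal for Oozie to resolve.
cat > job.properties <<'EOF'
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/etl/workflows/ingest-wf
EOF

# Submit and start the workflow, then list recent jobs to confirm status.
oozie job  -oozie http://oozie-host:11000/oozie -config job.properties -run
oozie jobs -oozie http://oozie-host:11000/oozie -len 5
```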
DataStage Consultant
Environment: IBM InfoSphere Information Server DataStage v9.1/8.5 (Parallel Jobs), Oracle Business Intelligence Suite Enterprise Edition (OBIEE), Oracle 10g, Toad, SQL Developer 1.5, SQL*Plus, Oracle PeopleSoft Campus Solutions, UC4 v8 Application Manager, Windows XP, Linux.
Responsibilities:
- Responsible for all aspects of managing and operating the InfoSphere platform: architecture, infrastructure, user/service account setup, access control, troubleshooting, technology strategy for upgrades, managing work requests, and stakeholder communication.
- Implemented CDC (change data capture) technology in the Student Warehouse platform, which enhanced job performance and reduced the daily batch completion time.
- Prepared the Technical Design Approach document for Server-to-Parallel DataStage job migration.
- Extensively worked with various Parallel Extender stages such as Sequential File, Dataset, Lookup, Peek, Transformer, Merge, Aggregator, Row Generator, and Surrogate Key Generator, among others, to design jobs and load data into fact and dimension tables.
- Involved in design and development of optimized parallel jobs to extract, transform, and load data into Oracle and text files using best practices.
- Created master controlling sequencer jobs, implemented restartability using checkpoints while automating the entire EDW load process, and implemented proper failure actions.
- Involved in providing technical design review, development plan review, code review, test plans, and results as per best practices of IBM DataStage.
DataStage Consultant
Environment: IBM InfoSphere Information Server DataStage 8.0 (Parallel Jobs), Oracle Business Intelligence Suite Enterprise Edition (OBIEE), Oracle 10g, Toad, SQL Developer 1.5, SQL*Plus, UC4 V8 Application Manager, Control-M, Windows XP, Linux.
Responsibilities:
- Involved in multiple FDIC ETL projects which involved new development as well as maintenance of existing Process using IBM DataStage.
- Worked extensively on different types of stages such as Sequential File, ODBC, Aggregator, Transformer, Copy, Merge, Join, Filter, Column Generator, Funnel, Peek, Change Capture, and Change Apply, among several others, for developing parallel jobs.
- Responsible for creating parameter driven jobs and multiple instance jobs to be used across the projects.
- Designed and Developed DataStage jobs that handle the Initial load and the Incremental load for the Financial System’s EPM.
- Involved in various roles of Administrator and Developer throughout the project. Extensively worked in the design and development of the data acquisition process for the data warehouse including the initial load and subsequent refreshes.
- Responsible for identifying opportunities to optimize ETL environment including monitoring and implementing quality and validation processes to ensure data accuracy and integrity as a part of data optimization project.
- Worked on existing change log issues pertaining to performance optimization including new requirements and adding additional conditions.
- Involved in production scheduling to set up jobs in the proper order through Control-M and provided production support.
- Responsible for Unit Testing of code and running multiple cycles end-to-end before migrating code to QA and prod.
DataStage Developer
Environment: Ascential DataStage 7.5, Oracle 10g/11g, Sybase, Greenplum, pgAdmin III, AIX 5.3, Erwin 4.0, SmartCVS 7.0.9, UNIX scripts, IBM Rational ClearQuest Web Client 7.1.1, Toad 9.7.2.5, Harvest 12.0.2
Responsibilities:
- Involved in meetings with business analysts and DBAs to gather and understand the business requirements and process flow details to plan the ETL extraction and loading.
- Extracted data using different strategies via native SQL against relational databases such as Oracle, Sybase, and Greenplum, as well as from flat files, for the multiple projects involved.
- Enhanced and developed DataStage jobs for data load into Fact and Dimension Tables using different stages like Transformer, Aggregator, Sort, Join, Merge, Lookup, Sequential file, Dataset, Funnel, Remove Duplicates, Copy, Modify, Filter, Surrogate Key.
- Used pgAdmin III and PostgreSQL to access the Greenplum database for extraction and loading as part of the migration effort.
- Imported healthcare data from various transactional data sources residing on Facets, Oracle, Sybase, and flat files, and performed null-value handling and data cleansing using null-handling functions and UNIX routines.
- Developed complex jobs and ensured that each record in all processed files has at least one ETL action indicator in its corresponding ETL audit table, identifying whether the record was an insert, update, no change, or delete, before loading into the Oracle database and flat files.
- Enhanced the reusability of the jobs by making and deploying shared containers and multiple instances of the jobs.
- Involved in writing UNIX shell scripts for automation, job runs, file processing, initial loads, batch loads, cleanup, job scheduling, and reports in a Linux/UNIX environment (a wrapper sketch follows this list).
- Extensively worked on error handling, data cleansing, and performing lookups for faster access to data.
- Used lookup stage with reference to Oracle tables for insert/update strategy and updating of slowly changing dimensions.
- Involved in Unit testing, Integration testing, UAT by creating test cases, test plans and helping DataStage administrator in the deployment of code across Dev, Test and Prod Repositories.
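
A minimal sketch of the kind of UNIX wrapper script referenced above; the project name, job name, and log path are placeholders, and it assumes the DataStage dsjob CLI is available on the engine host.

```bash
#!/bin/bash
# Illustrative sketch only: project/job names and log path are placeholders.
# Assumes the DataStage engine environment (dsenv) has been sourced so dsjob is on the PATH.
PROJECT="DW_PROJECT"
JOB="seq_daily_load"
LOG="/var/log/etl/${JOB}.log"

dsjob -run -jobstatus "$PROJECT" "$JOB"
status=$?

# With -jobstatus, dsjob exits with the job's status code
# (typically 1 = finished OK, 2 = finished with warnings).
if [ "$status" -eq 1 ] || [ "$status" -eq 2 ]; then
  echo "$(date): $JOB completed with status $status" >> "$LOG"
else
  echo "$(date): $JOB failed with status $status" >> "$LOG"
  dsjob -logsum "$PROJECT" "$JOB" | tail -n 50 >> "$LOG"   # keep the recent log entries
  exit 1
fi
```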