Hadoop / Spark Developer Resume
Phoenix, AZ
SUMMARY:
- 5 years of IT work experience in Big Data Analytics, which includes Analysis, Design, Development, Deployment and Maintenance of projects using Apache Hadoop - HDFS, Amazon S3, MapReduce, YARN, Hive, Pig Latin, Impala, Hue, Sqoop, Kafka, Flume, Spark, Scala, Oozie and the Hadoop APIs.
- Experience working with Hadoop clusters on Cloudera and MapR distributions, following Agile Scrum methodologies.
- Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments, working with various file formats such as Avro, Parquet, JSON and ORC.
- Experience in data ingestion, processing and analysis using Spark with Scala, Spark Streaming, Kafka, Flume, Sqoop and Shell Script.
- Efficient in developing Sqoop jobs for migrating data from RDBMS to Hive / HDFS and vice versa.
- Experience in developing NoSQL applications using MongoDB, HBase and Cassandra.
- Thorough knowledge of Hadoop architecture and core components: NameNode, DataNode, JobTracker, TaskTracker, Oozie, Hue, Flume, HBase, etc.
- Very good experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (a minimal sketch follows at the end of this summary).
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
- Experience in extending Hive and Pig core functionality by using custom User Defined Functions (UDFs).
- Working knowledge of Oozie, a workflow scheduler system used to manage jobs that run on Pig, Hive and Sqoop.
- Experience developing Spark applications in Scala to ease transitions from classic Hadoop MapReduce.
- Knowledge of Kafka and Flume for real-time data streaming and ingestion.
- Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatches.
- Wrote multiple customized MapReduce Programs for various Input file formats.
- Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
- Involved in converting SQL queries into HiveQL.
- Designed and created Hive external tables using a shared metastore instead of the default Derby metastore, with partitioning, dynamic partitioning and buckets.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Experience in integrating Hive and HBase for effective operations.
- Designed and developed a full-text search feature with multi-tenant Elasticsearch after collecting real-time data through Spark Streaming.
- Experienced in working with the Apache Spark ecosystem, using Spark SQL and Scala to query different data file formats such as .txt and .csv.
- Developed data pipeline for real time use cases using Kafka, Flume and Spark Streaming.
- Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
- An excellent team player with good organizational, interpersonal and communication skills and leadership qualities; a quick learner with a positive attitude and flexibility towards the ever-changing industry.
- Technically strong, with the ability to work with business users, project managers, team leads, architects and peers, maintaining a healthy environment in the project.
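Illustrative sketch of the Hive external-table partitioning pattern referenced in this summary. The table names, columns and HDFS path below are hypothetical placeholders, and the staging table is assumed to already exist; this is a sketch, not a project artifact.

    # A minimal PySpark/Hive sketch: external table with partitioning and a
    # dynamic-partition insert. All names and paths are illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # External table: Hive tracks the schema, the data stays at the HDFS location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
            order_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS PARQUET
        LOCATION '/data/warehouse/sales_ext'
    """)

    # Enable dynamic partitioning so partitions are created from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Dynamic-partition insert: the partition column is selected last, so each
    # distinct order_date value lands in its own partition directory.
    spark.sql("""
        INSERT INTO TABLE sales_ext PARTITION (order_date)
        SELECT order_id, amount, order_date
        FROM staging_sales
    """)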
TECHNICAL SKILLS:
Distributed Computing: Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Zookeeper, Hue, Impala, Oozie, Kafka, Spark
SDLC Methodologies: Agile Scrum
Relational Databases: Teradata, Netezza, Oracle, MySQL
Distributed Databases: NoSQL (HBase, Cassandra, MongoDB)
Distributed Filesystems: HDFS, Amazon S3
Distributed Query Engines: Hive, Presto
Distributed Computing Environment: Cloudera, MapR
Operating Systems: Windows, Mac OS, Unix, Ubuntu
Programming: Java, Python, Scala, UNIX Shell, Pig Latin, HiveQL
Scripting: Shell Scripting
Version Control: GitHub
IDE: Scala IDE, PyCharm, Jupyter Notebook, Eclipse (PyDev)
PROFESSIONAL EXPERIENCE:
Confidential, Phoenix, AZ
Hadoop / Spark Developer
Responsibilities -
- Worked with Hadoop Ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive and Pig with MapR Hadoop distribution.
- Wrote Pig Scripts for sorting, joining, filtering and grouping the data.
- Developed programs in Spark based on the application for faster data processing than standard MapReduce programs.
- Developed Spark programs using Scala, created Spark SQL queries and developed Oozie workflows for Spark jobs.
- Developed Oozie workflows with Sqoop actions to migrate data from relational databases such as Oracle and Teradata to HDFS.
- Used Hadoop FS actions to move data from upstream locations to local data locations.
- Wrote extensive Hive queries to perform transformations on the data used by downstream models.
- Developed MapReduce programs as part of predictive analytical model development.
- Developed Hive queries to analyze the data and generate the end reports used by business users.
- Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop and Apache Spark; ingested streaming data into Hadoop using the Spark framework and Scala.
- Expertise working with NoSQL databases like MongoDB.
- Extensively used Git as the code repository and VersionOne for managing the day-to-day Agile development process and keeping track of issues and blockers.
- Wrote Spark (Python) code for the model integration layer.
- Implemented Spark applications using Scala and Java, utilizing DataFrames and the Spark SQL API for faster processing of data (see the sketch after this list).
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Wrote new Spark jobs in Scala to analyze customer and sales-history data.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data.
- Developed a data pipeline using Kafka, HBase, Spark on Mesos and Hive to ingest, transform and analyze customer behavioral data.
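A minimal PySpark sketch of the DataFrame / Spark SQL style used for the customer and sales-history analysis above. The production jobs described here were written in Scala; the table names, columns and date filter below are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("customer-sales-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical Hive tables holding customer master data and sales history.
    customers = spark.table("crm.customers")
    sales = spark.table("pos.sales_history")

    # DataFrame API: total and average spend per customer for recent sales.
    summary = (sales
               .filter(F.col("sale_date") >= "2017-01-01")
               .join(customers, "customer_id")
               .groupBy("customer_id", "segment")
               .agg(F.sum("amount").alias("total_spend"),
                    F.avg("amount").alias("avg_ticket")))

    # The same aggregation expressed through the Spark SQL API.
    sales.createOrReplaceTempView("sales_v")
    customers.createOrReplaceTempView("customers_v")
    summary_sql = spark.sql("""
        SELECT c.customer_id, c.segment,
               SUM(s.amount) AS total_spend,
               AVG(s.amount) AS avg_ticket
        FROM sales_v s
        JOIN customers_v c ON s.customer_id = c.customer_id
        WHERE s.sale_date >= '2017-01-01'
        GROUP BY c.customer_id, c.segment
    """)

    summary.write.mode("overwrite").saveAsTable("analytics.customer_spend")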
Technologies Used: MapR, LINUX, Hadoop, HBase, Hive, Impala, Oracle, Spark, Scala, Python, Pig, Sqoop, Teradata, Zookeeper, Oozie, MongoDB, MapReduce, GitHub.
Confidential, Waltham, MA
Big Data Hadoop Engineer
Responsibilities -
- Involved in designing the data pipeline using various Hadoop components like Spark, Hive, Impala, Sqoop with Cloudera distribution.
- Developed Spark (Python API) applications in PyCharm using Spark core libraries (RDD, Spark SQL) to perform ETL transformations, thereby eliminating the need for a separate ETL tool (SSIS/ODI); a sketch follows this list.
- Developed Spark applications for batch processing.
- Implemented partitioning, dynamic partitioning and bucketing in Hive using internal and external tables for more efficient data access.
- Using open-source packages, designed a POC to demonstrate integration of Kafka/Flume with Spark Streaming for real-time data ingestion and processing.
- Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
- Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
- Performed data validation using Sqoop on the exported data.
- Performed transformations and analysis by writing complex HQL queries in Hive and exported results to HDFS in various file formats (JSON, Avro, Parquet, ORC).
- Automated data ingestion and transformation workflows using Oozie.
- Created audit reports using Cloudera Navigator to flag security threats and track user and tool activity across the various Hadoop components.
- Developed a strategy for tracking data lineage and metadata extracted from the pipeline across Hadoop components using Cloudera Navigator.
- Designed various plots showing HDFS analytics and other operations performed on the environment.
- Worked with the infrastructure team to test the environment after patches, upgrades and migrations.
- Developed multiple Python scripts delivering end-to-end support and common routines while maintaining product integrity.
- Performed analytical queries using Impala (Cloudera) for faster responses and exported results to Tableau for data visualization.
- Documented all the applications worked on and presented them to senior stakeholders.
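A simplified sketch of the PySpark ETL pattern described in this list. The claims dataset, HDFS path, columns and cleansing rules below are hypothetical placeholders, not the actual pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("pyspark-etl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Raw extract previously landed on HDFS (e.g. by a Sqoop import from the RDBMS).
    raw = spark.read.parquet("/data/raw/claims")

    # Typical ETL-style transformations: deduplicate, filter, standardize, derive.
    cleaned = (raw
               .dropDuplicates(["claim_id"])
               .filter(F.col("claim_amount") > 0)
               .withColumn("claim_month", F.date_format("claim_date", "yyyy-MM"))
               .withColumn("member_name", F.upper(F.trim(F.col("member_name")))))

    # Write to a partitioned Hive table consumed by downstream reporting.
    (cleaned.write
            .mode("overwrite")
            .partitionBy("claim_month")
            .format("parquet")
            .saveAsTable("curated.claims"))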
Technologies Used: Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Kafka, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.
Confidential
Hadoop Developer
Responsibilities -
- Imported data from relational data sources (RDBMS) such as Teradata into HDFS using Sqoop.
- Analyzed large data sets by running Hive queries and exported the results as views, flat files, etc.
- Worked with the Data Science team to gather requirements for various data mining projects and conducted POCs.
- Involved in creating Hive managed/external tables while maintaining raw file integrity, and analyzed data using Hive queries.
- Worked extensively with Spark modules such as RDDs, Spark SQL and Spark Streaming to handle structured and unstructured data.
- Designed Spark applications performing ETL transformations using Python API.
- Developed Simple to complex MapReduce Jobs using Hive.
- Wrote Hive Queries to have a consolidated view of the mortgage and retail data.
- Orchestrated hundreds of Sqoop scripts, Pig scripts and Hive queries using Oozie workflows and sub-workflows.
- Used Hive to analyze the partitioned, bucketed data and compute various metrics for reporting.
- Developed a Python utility to validate Hive tables ingested via Sqoop against the source RDBMS tables (see the sketch after this list).
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Responsible for managing data from multiple sources using Flume.
- Loaded and transformed large data sets consisting of structured, semi-structured and unstructured data.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries.
- Performed statistical analysis on datasets using Python libraries such as Pandas, NumPy and SciPy; discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
- Integrated Tableau with Impala as a source to create interactive BI dashboard.
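A rough sketch of the kind of Python validation utility mentioned in this list, comparing row counts between a source RDBMS table and its Sqoop-ingested Hive counterpart. The JDBC URL, credentials and table names are placeholders, and a count comparison stands in for whatever checks the real utility performed.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sqoop-ingest-validation-sketch")
             .enableHiveSupport()
             .getOrCreate())

    def validate_counts(jdbc_url, source_table, hive_table, user, password):
        """Compare row counts between an RDBMS source table and the ingested Hive table."""
        source_count = (spark.read.format("jdbc")
                        .option("url", jdbc_url)
                        .option("dbtable", source_table)
                        .option("user", user)
                        .option("password", password)
                        .load()
                        .count())
        hive_count = spark.table(hive_table).count()
        status = "MATCH" if source_count == hive_count else "MISMATCH"
        print("{} -> {}: source={}, hive={} [{}]".format(
            source_table, hive_table, source_count, hive_count, status))
        return source_count == hive_count

    # Placeholder connection details and table names.
    validate_counts("jdbc:mysql://dbhost:3306/sales", "orders",
                    "staging.orders", "etl_user", "*****")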
Technologies Used: Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.
Confidential
Hadoop Developer
Responsibilities -
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Imported and exported data into HDFS and Hive using Sqoop.
- Experience in defining job flows using Oozie and shell scripts.
- Experienced in implementing various customizations in MapReduce by writing custom input formats, custom record readers, partitioners and data types in Java.
- Experience in ingesting data using Flume from web server logs and telnet sources.
- Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
- Experienced in managing disaster recovery cluster and responsible for data migration and backup.
- Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
- Implemented encryption and masking of customer-sensitive data in Flume by building a custom interceptor that masks and encrypts the data according to rules maintained in MySQL.
- Experience in managing and reviewing Hadoop log files.
- Extracted data from RDBMS sources through Sqoop, placed it in HDFS and processed it.
- Experience in running Hadoop streaming jobs to process terabytes of XML-format data.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from UNIX file systems to HDFS.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Experienced in implementing Hive-HBase integration by creating Hive external tables and using the HBase storage handler.
- Executed queries using Hive and developed MapReduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed Hive queries for the analysts.
- Involved in loading data from LINUX and UNIX filesystem to HDFS.
- Designed and implemented a MapReduce-based large-scale parallel relation learning system.
- Developed master tables in Hive using a JSON SerDe and Hive functions such as get_json_object and json_tuple (see the sketch after this list).
- Designed the entire data flow in HDFS so that it could be achieved using Oozie workflows.
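The Hive JSON handling mentioned above can be sketched with the equivalent Spark SQL functions from PySpark; the original work used a Hive JSON SerDe and native HiveQL, and the payload structure and table names below are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-json-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical raw table with one JSON document per row in a string column.
    raw = spark.table("raw.events_json")

    # get_json_object extracts individual fields from the JSON string by path;
    # Hive's json_tuple can pull several top-level fields in a single call.
    parsed = raw.select(
        F.get_json_object("payload", "$.user.id").alias("user_id"),
        F.get_json_object("payload", "$.event_type").alias("event_type"),
        F.get_json_object("payload", "$.timestamp").alias("event_ts"))

    parsed.write.mode("overwrite").saveAsTable("master.events")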
Technologies Used: Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, PIG, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.
Confidential
Jr. Hadoop Developer
Responsibilities -
- Involved in creating Hive tables, loading with data and writing hive queries to process the data.
- Developing and maintaining Workflow Scheduling Jobs in Oozie for importing data from RDBMS to Hive.
- Implemented Partitioning, Bucketing in Hive for better organization of the data.
- Worked with the team on fetching live streaming data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
- Developed Spark Streaming programs in Scala to import data from Kafka topics into HBase tables (see the sketch after this list).
- Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text file, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
- Involved in the Design Phase for getting live event data from the database to the front-end application using Spark Ecosystem.
- Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs using Spark SQL.
- Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
- Collected the log data from web servers and integrated into HDFS using Flume.
- Responsible for managing data coming from different sources.
- Extracted data from CouchDB, placed it into HDFS using Sqoop and pre-processed it for analysis.
- Developed the sub-queries in Hive.
- Partitioned and bucketed the imported data using HiveQL.
- Partitioned dynamically using the dynamic-partition insert feature.
- Moved the partitioned data into different tables as per business requirements.
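A minimal sketch of the Kafka-to-Spark streaming path described in this list. The job described here used Scala Spark Streaming (DStreams) writing to HBase; this Python sketch uses the Structured Streaming Kafka source instead (it assumes the spark-sql-kafka connector is on the classpath), with a console sink standing in for the HBase write. Broker addresses, topic name and record layout are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    # Subscribe to a hypothetical Kafka topic carrying change events from DB2.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "db2.change.events")
              .load())

    # Kafka delivers key/value as binary; cast the value and split out fields
    # (a simple comma-delimited layout is assumed here).
    parsed = (events
              .selectExpr("CAST(value AS STRING) AS line")
              .select(F.split("line", ",").alias("f"))
              .select(F.col("f")[0].alias("row_key"),
                      F.col("f")[1].alias("column_value")))

    def write_batch(batch_df, batch_id):
        # The real pipeline wrote each micro-batch to HBase via a connector;
        # printing the batch stands in for that here.
        batch_df.show(truncate=False)

    query = (parsed.writeStream
             .foreachBatch(write_batch)
             .outputMode("append")
             .start())
    query.awaitTermination()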
Technologies Used: Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Spark, Kafka, Flume, HBase, CouchDB, Apache Maven.