
Sr. Hadoop/Spark Developer Resume


New York, NY

SUMMARY

  • 8+ years of professional experience working with data, including 3+ years of hands-on experience in the analysis, design, development, and maintenance of Hadoop and Java-based applications.
  • Expertise in Hadoop architecture and its components such as Kafka, Flume, and MapReduce, with experience writing MapReduce programs on Apache Hadoop to analyze large data sets efficiently.
  • Extensive development experience across the Hadoop ecosystem, covering MapReduce, HDFS, YARN, Hive, Impala, Pig, HBase, Spark, Sqoop, Oozie, and Cloudera.
  • In-depth understanding of the strategy and practical implementation of AWS cloud technologies including IAM, EC2, EMR, SNS, RDS, Redshift, Athena, DynamoDB, Lambda, CloudWatch, Auto Scaling, S3, and Route 53.
  • Strong experience in analyzing data using HiveQL, Spark SQL, HBase, and custom MapReduce programs (a brief sketch follows this summary).
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Experience writing shell scripts to dump shared data from MySQL servers to HDFS.
  • Working knowledge of Python and Scala for Spark development.
  • Knowledge of extracting an Avro schema using avro-tools, extracting an XML schema using XSD, and evolving an Avro schema by changing its JSON definition.
  • Experience in Amazon AWS cloud administration; actively involved in building highly available, scalable, cost-effective, and fault-tolerant systems using multiple AWS services.
  • Strong problem-solving, organizing, team management, communication, and planning skills, with the ability to work in a team environment. Able to write clear, well-documented, well-commented, and efficient code as per requirements.
  • Capable of processing large sets of structured, Semi-structured and unstructured data and supporting systems application architecture.
  • Able to assess business rules, collaborate wif stakeholders and perform source-to-target data mapping, design and review.
  • Good knowledge of NoSQL databases: Cassandra, MongoDB, and HBase.
  • Worked on HBase to load and retrieve data for real-time processing using REST APIs.
  • Experience in developing applications using Waterfall and Agile (XP and Scrum) methodologies.
  • Strong problem-solving and analytical skills and the ability to make balanced, independent decisions.
  • Expertise in database design, creation and management of schemas, and writing stored procedures, functions, and DDL/DML SQL queries.
  • Experienced with build tools such as Ant and Maven and continuous integration tools like Jenkins.
  • An excellent team player and self-starter with good communication skills and a proven ability to finish tasks before target deadlines.
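
Illustrative sketch of the kind of Spark SQL analysis over Hive data mentioned above; the table and column names are hypothetical placeholders, not taken from any actual project:

from pyspark.sql import SparkSession

# Minimal Spark SQL analysis over a Hive table (hypothetical example).
spark = (SparkSession.builder
         .appName("customer-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Table and column names below are placeholders only.
spark.sql("""
    SELECT region, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY region
    ORDER BY total_amount DESC
""").show()

spark.stop()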

TECHNICAL SKILLS

Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Pig, Sqoop, Spark, Cassandra, Oozie, Flume, Kafka and Talend

Programming Languages: Java, C/C++, Scala, Python and Shell Scripting

Scripting Languages: JavaScript, XML, HTML, Python and Linux Bash Shell Scripting, Unix

Tools: Eclipse, JDeveloper, JProbe, CVS, MS Visual Studio

Platforms: Windows (2000/XP), Linux, Solaris

Databases: NoSQL, Oracle, DB2, MS SQL Server (2000, 2008), Teradata, HBase, Cassandra, Cloudera 5.9

PROFESSIONAL EXPERIENCE

Sr. Hadoop/Spark Developer

Confidential, New York, NY

Responsibilities:

  • Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer behavioral data and financial histories into the Hadoop cluster for analysis.
  • Responsible for implementing a generic framework to handle different data collection methodologies from the client's primary data sources, validating and transforming the data using Spark, and loading it into S3.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (see the sketch at the end of this role).
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Worked on the Spark SQL and Spark Streaming modules of Spark and used Scala and Python to write code for all Spark use cases.
  • Migrated historical data to S3 and developed a reliable mechanism for processing the incremental updates.
  • Scheduled Spark jobs with Apache Airflow on EMR to read data from S3, transform it, and load it into Postgres RDS.
  • Implemented a Kafka-based data solution to correlate data from SQL and NoSQL databases.
  • Wrote Spark scripts using the Scala shell as per requirements.
  • Analyzed data in Hive using the Spark API on Hortonworks.
  • Used the Oozie workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, and Sqoop, as well as system-specific jobs.
  • Monitored and debugged Hadoop jobs and applications running in production.
  • Provided user and application support on the Hadoop infrastructure.
  • Evaluated and compared different tools for test data management with Hadoop.
  • Supported the testing team on Hadoop application testing.

Environment: Cloudera, Spark, Hive, Pig, Spark SQL, Spark Streaming, HBase, Sqoop, Kafka, AWS EC2, S3, EMR, RDS, Linux Shell Scripting, Postgres, MySQL.
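
Illustrative sketch of the S3-to-HDFS Spark Streaming step referenced above, assuming Spark Structured Streaming's file source; the bucket name, schema, and paths are hypothetical placeholders rather than actual project details:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("s3-learner-stream").getOrCreate()

# Schema, bucket, and paths below are hypothetical placeholders.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Stream new JSON files as they land in the S3 bucket.
events = (spark.readStream
          .schema(schema)
          .json("s3a://example-bucket/incoming/events/"))

# Light transformation: drop malformed rows and derive a partition date.
learner = (events
           .filter(col("customer_id").isNotNull())
           .withColumn("event_date", to_date(col("event_ts"))))

# Persist the result to HDFS as Parquet, partitioned by date.
query = (learner.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner_model/")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_model/")
         .partitionBy("event_date")
         .outputMode("append")
         .start())

query.awaitTermination()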

Hadoop/Spark Developer

Confidential, Atlanta, GA

Responsibilities:

  • Involved in the high-level design of the Hadoop 2.6.3 architecture for the existing data structure and problem statement; set up a new cluster and configured the entire Hadoop platform.
  • Extracted files from MySQL, Oracle, and Teradata through Sqoop 1.4.6, placed them in HDFS, and processed them.
  • Pushed data from Amazon S3 storage to Redshift using key-value pairs as required by the BI team.
  • Processed data on S3 using Athena; worked on gateway nodes and connectors (JAR files) connecting sources with the AWS cloud.
  • Developed efficient MapReduce programs for filtering out unstructured data and developed multiple MapReduce jobs to perform data cleaning and preprocessing on EMR.
  • Worked with various HDFS file formats like Avro 1.7.6, SequenceFile, and JSON, and various compression formats like Snappy and bzip2.
  • Continuously monitored and managed the Hadoop cluster using Ambari.
  • Used Pig to validate the data ingested using Sqoop and Flume, and pushed the cleansed data set into Hive.
  • Improved HiveQL performance by splitting larger queries into smaller ones and introducing temporary tables between them.
  • Implemented performance techniques such as partitioning and bucketing in Hive to get better performance.
  • Designed and built the reporting application, which uses Spark SQL to fetch and generate reports on HBase table data.
  • Developed a data pipeline using Kafka to ingest behavioral data and used Spark Streaming to filter the data and store it in HDFS.
  • Consumed data from Kafka topics using PySpark, parsed and transformed the data using Python and built-in Spark functions, then stored it in Hive tables (see the sketch at the end of this role).
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
  • Developed custom Unix shell scripts to perform pre- and post-validations of master and slave nodes, before and after configuring the NameNode and DataNodes respectively.
  • Drove the application from the development phase to the production phase using a Continuous Integration and Continuous Deployment (CI/CD) model with Maven and Jenkins.
  • Developed small distributed applications in our projects using ZooKeeper 3.4.7 and scheduled the workflows using Oozie 4.2.0.

Environment: Hadoop, Amazon S3, EMR, Redshift, HDFS, Hive, Impala, Spark, Scala, Python, Pig, Sqoop, Oozie, Git, Oracle, DB2, MySQL, UNIX Shell Scripting, JDBC.
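
Illustrative sketch of the Kafka-to-Hive consumption referenced above, assuming Spark Structured Streaming with the spark-sql-kafka connector available; the broker address, topic, schema, and table names are hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Broker, topic, schema, and table names are hypothetical placeholders.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "behavior-events")
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*")
          .filter(col("user_id").isNotNull()))

# Append each micro-batch into a Hive table.
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("analytics.behavior_events")

query = (parsed.writeStream
         .foreachBatch(write_to_hive)
         .option("checkpointLocation", "hdfs:///checkpoints/behavior_events/")
         .start())

query.awaitTermination()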

Hadoop Developer

Confidential, Dallas, TX

Responsibilities:

  • Used Sqoop to import data from MySQL into HDFS on a regular basis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging the data in HDFS for further analysis.
  • Collected and aggregated large amounts of web log data from different sources such as web servers, mobile, and network devices using Apache Kafka and stored the data in HDFS for analysis.
  • Developed multiple Kafka producers and consumers from scratch as per the organization's requirements.
  • Set up Flume for different sources to bring log messages from outside into HDFS.
  • Responsible for creating and modifying topics (Kafka queues) as and when required, with varying configurations involving replication factors, partitions, and TTL.
  • Performed aggregations on large amounts of data using Apache Spark and Scala and landed the data in the Hive warehouse for further analysis.
  • Wrote and tested complex MapReduce jobs for aggregating identified and validated data.
  • Created managed and external Hive tables with static/dynamic partitioning (see the sketch at the end of this role).
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Improved HiveQL performance by splitting larger queries into smaller ones and introducing temporary tables between them.
  • Used an open-source Python web scraping framework to crawl and extract data from web pages.
  • Optimized Hive queries by setting different combinations of Hive parameters.
  • Developed UDFs (user-defined functions) to extend the core functionality of Pig and Hive queries as per requirements.
  • Implemented workflows using Oozie for running MapReduce jobs and Hive queries.
  • Extensively involved in performance tuning of Impala QL by performing bucketing on large tables.
  • Designed extraction, transformation, and loading solutions using Informatica PowerCenter and Teradata tools: BTEQ, FastLoad, MultiLoad, and TPump.

Environment: Apache Hadoop, HDFS, MapReduce, Hive, Sqoop, Kafka, Flume, ZooKeeper, Spark, HBase, Python, Shell Scripting, Oozie.
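
Illustrative sketch of the managed/external Hive tables with dynamic partitioning referenced above, expressed through Spark's Hive support (PySpark) to keep these examples in one language; the table names, columns, and HDFS paths are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioned-tables")
         .enableHiveSupport()
         .getOrCreate())

# External table over data already sitting in HDFS; names and paths are placeholders.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs_ext (
        user_id STRING,
        action  STRING,
        amount  DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/logs_ext/'
""")

# Managed table populated with dynamic partitioning.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs_managed (
        user_id STRING,
        action  STRING,
        amount  DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
""")
spark.sql("""
    INSERT INTO TABLE logs_managed PARTITION (load_date)
    SELECT user_id, action, amount, load_date
    FROM logs_ext
""")

spark.stop()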
