Big Data Developer Resume
New York, NY
SUMMARY
- Over five and a half years of IT experience as a developer, designer, and quality tester, with cross-platform integration experience in Hadoop development and administration.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components - HDFS, MapReduce, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop.
- Strong understanding of various Hadoop services, MapReduce and YARN architecture.
- Responsible for writing MapReduce programs.
- Experienced in importing and exporting data into and out of HDFS using Sqoop.
- Experience loading data to Hive partitions and creating buckets in Hive.
- Developed MapReduce jobs to automate data transfer from HBase.
- Expertise in analysis using Pig, Hive, and MapReduce.
- Experience in HDFS data storage and support for running MapReduce jobs.
- Experience in Chef, Puppet or related tools for configuration management.
- Worked on analyzing Hadoop clusters and different big data analytic tools, including Pig, the HBase database, and Sqoop.
- Involved in infrastructure setup and installation of the HDP stack on the Amazon cloud.
- Experience ingesting data from RDBMS sources such as Oracle, SQL, and Teradata into HDFS using Sqoop.
- Experience in big data technologies: Hadoop HDFS, MapReduce, Pig, Hive, Oozie, Sqoop, Zookeeper, and NoSQL.
- Installed new components and removed existing ones through Cloudera Manager.
- Experience in benchmarking and in performing backup and disaster recovery of NameNode metadata and important sensitive data residing on the cluster.
- Experience in designing and implementing HDFS access controls, directory and file permissions, and user authorization that facilitate stable, secure access for multiple users in a large multi-tenant cluster.
- Extensive experience ingesting streaming data into HDFS using stream-processing platforms such as Flume and the Kafka messaging system.
- Implemented the Capacity Scheduler on the JobTracker to share cluster resources among users' MapReduce jobs.
- Responsible for provisioning, installing, configuring, monitoring, and maintaining HDFS, YARN, HBase, Flume, Sqoop, Oozie, Pig, Hive, Ranger, Falcon, SmartSense, Storm, and Kafka.
- Experience in AWS CloudFront, including creating and managing distributions to provide access to S3 buckets or HTTP servers running on EC2 instances.
- Good working knowledge of Vertica DB architecture, column orientation and High Availability.
- Good understanding of Scrum methodologies, Test Driven Development and continuous integration.
- Major strengths include familiarity with multiple software systems, the ability to learn new technologies quickly and adapt to new environments, and being a self-motivated, focused team player and quick learner with excellent interpersonal, technical, and communication skills.
- Experience in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope.
- Experience in gathering and defining functional and user interface requirements for software applications.
- Experience in real-time analytics with Apache Spark (RDDs, DataFrames, and the Streaming API).
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data (see the sketch following this summary).
- Experience in integrating Hadoop with Kafka, including uploading clickstream data from Kafka to HDFS.
- Expert in using Kafka as a publish-subscribe messaging system.
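A minimal sketch of the DataFrame-over-Hive analytics mentioned above, assuming Spark 2.x with Hive support enabled; the table name web.clickstream and the column names are hypothetical placeholders, not the actual production schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveClickstreamAnalytics {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support so existing Hive tables are visible to Spark SQL
    val spark = SparkSession.builder()
      .appName("HiveClickstreamAnalytics")
      .enableHiveSupport()
      .getOrCreate()

    // Read an existing Hive table as a DataFrame (table and column names are hypothetical)
    val clicks = spark.table("web.clickstream")

    // DataFrame API aggregation: successful hits per page per day
    val summary = clicks
      .filter(col("http_status") === 200)
      .groupBy(col("page"), col("event_date"))
      .agg(count(lit(1)).as("hits"))
      .orderBy(desc("hits"))

    summary.show(20)
    spark.stop()
  }
}
```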
PROFESSIONAL EXPERIENCE
Confidential, New York
Big Data Developer
Responsibilities:
- Developed Spark applications in Scala using the DataFrames and Spark SQL APIs for faster data processing.
- Followed Agile methodologies with daily scrum meetings and sprint planning; wrote scripts to distribute queries for performance-test jobs in the Amazon data lake.
- Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data 2 petabytes in size.
- Developed MapReduce jobs to clean, access, and validate the data, and created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Developed optimal strategies for distributing the weblog data over the cluster; imported and exported the stored weblog data into HDFS and Hive using Sqoop.
- Installed and configured multi-node Apache Hadoop on AWS EC2, developed Pig Latin scripts to replace the existing legacy process with Hadoop, and fed the data to AWS S3.
- Responsible for building scalable distributed data solutions using Cloudera Hadoop; designed and developed automation test scripts using Python.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Analyzed the SQL scripts and designed solutions to implement them using Spark; implemented Hive generic UDFs to incorporate business logic into Hive queries.
- Responsible for developing a data pipeline on Amazon AWS to extract data from weblogs and store it in HDFS.
- Loaded streaming data from Kafka into HDFS, HBase, and Hive by integrating with Storm, and wrote Pig scripts to transform raw data from several data sources into baseline data.
- Worked on MongoDB by using CRUD (Create, Read, Update and Delete), Indexing, Replication, and Sharding features.
- Designed the HBase row key to store text and JSON as key values in the HBase table, structuring the key so that gets and scans return data in sorted order.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Created Hive tables and worked on them using HiveQL; designed and implemented static and dynamic partitioning and bucketing in Hive.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (see the sketch following this job entry).
- Developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
- Worked on cluster coordination services through Zookeeper and monitored workload, job performance, and capacity planning using Cloudera Manager.
- Built applications using Maven and integrated them with CI servers such as Jenkins to run build jobs.
- Configured, deployed, and maintained multi-node dev and test Kafka clusters, and implemented real-time data ingestion and processing with Kafka.
- Created cubes in Talend to produce different types of aggregations over the data and to visualize them.
- Monitored Hadoop NameNode health, the number of TaskTrackers running, and the number of DataNodes running, and automated all jobs from pulling data from sources such as MySQL to pushing result sets into HDFS.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server; used GitHub version control to maintain project versions.
Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java, PL/SQL, Oracle 11g, Unix/Linux.
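The following is a hedged sketch of the Kafka-to-Cassandra streaming flow described in this role (the near-real-time learner data model), assuming the spark-streaming-kafka-0-10 and spark-cassandra-connector libraries; the broker address, topic, keyspace, table, and comma-delimited record layout are hypothetical placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.kafka.common.serialization.StringDeserializer
import com.datastax.spark.connector._

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("LearnerModelStream")
      .set("spark.cassandra.connection.host", "cassandra-host") // hypothetical host

    // 10-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",                   // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "learner-model",
      "auto.offset.reset"  -> "latest")

    // Direct stream from a hypothetical "learner-events" topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Parse comma-delimited records and persist each micro-batch to Cassandra
    stream.map(_.value.split(","))
      .filter(_.length >= 3)
      .map(f => (f(0), f(1), f(2)))                             // (learner_id, course_id, event_ts)
      .foreachRDD(_.saveToCassandra("learning", "learner_events",
        SomeColumns("learner_id", "course_id", "event_ts")))

    ssc.start()
    ssc.awaitTermination()
  }
}
```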
Confidential, New York, New York
Big Data Developer
Responsibilities:
- Worked on Hive/SQL queries and performed Spark transformations using Spark RDDs and Python (PySpark).
- Created a Serverless data ingestion pipeline on AWS using lambda functions.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in DynamoDB using Scala.
- Developed Apache Spark applications using Scala and Python, and implemented Spark data processing modules to handle data from various RDBMS and streaming sources.
- Experience developing and scheduling various Spark streaming and batch jobs using Python (PySpark) and Scala.
- Developed Spark code in PySpark, applying various transformations and actions for faster data processing.
- Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming.
- Used Spark stream processing in Scala to bring data into memory, created RDDs and DataFrames, and applied transformations and actions.
- Used various Python libraries with Spark to create DataFrames and store them in Hive.
- Created Sqoop jobs and Hive queries for data ingestion from relational databases in order to analyze historical data.
- Experience working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
- Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
- Knowledge of creating user-defined functions (UDFs) in Hive.
- Worked with different file formats such as Avro and Parquet for Hive querying and processing based on business logic.
- Worked on sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
- Implemented Hive UDFs to implement business logic (see the sketch at the end of this section) and was responsible for performing extensive data validation using Hive.
- Loaded structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
- Developed code to generate various DataFrames based on the business requirements and created temporary tables in Hive.
- Utilized AWS CloudWatch to monitor environment instances for operational and performance metrics during load testing; scripted Hadoop package installation and configuration to support fully automated deployments.
- Involved in Chef infrastructure maintenance, including backups and security fixes on the Chef server.
- Deployed application updates using Jenkins; installed, configured, and managed Jenkins.
- Triggered the client's SIT environment builds remotely through Jenkins.
- Deployed and configured Git repositories with branching, forks, tagging, and notifications.
- Experienced and proficient in deploying and administering GitHub.
- Deployed builds to production and worked with the teams to identify and troubleshoot any issues.
- Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
- Consulted with the operations team on deploying, migrating data, monitoring, analyzing, and tuning MongoDB applications.
- Reviewed selected issues through the SonarQube web interface.
- Developed a fully functional login page for the company's user-facing website with complete UI and validations.
- Installed, configured, and utilized AppDynamics (an application performance management tool) across the entire JBoss environment (prod and non-prod).
- Responsible for upgrading SonarQube using its update center.
- Resolved tickets submitted by users, including P1 issues; troubleshot, documented, and resolved errors.
- Installed and configured Hive in the Hadoop cluster and helped business users and application teams fine-tune their HiveQL for optimal performance and efficient use of cluster resources.
- Conducted performance tuning of the Hadoop cluster, MapReduce jobs, and real-time applications, applying best practices to fix design flaws.
- Implemented Oozie workflow for ETL Process for critical data feeds across the platform.
- Configured Ethernet bonding for all nodes to double the network bandwidth.
- Implemented the Kerberos security authentication protocol for the existing cluster.
- Built high availability for the major production cluster and designed automatic failover control using the ZooKeeper Failover Controller (ZKFC) and Quorum Journal Nodes.
Environment: HDFS, MapReduce, Hive 1.1.0, Kafka, Hue 3.9.0, Pig, Flume, Oozie, Sqoop, Apache Hadoop 2.6, Spark, SOLR, Storm, Cloudera Manager, Red Hat, MySQL, Prometheus, Docker, Puppet.
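As a hedged illustration of the Hive UDF work referenced in this role, here is a minimal sketch using Hive's old-style org.apache.hadoop.hive.ql.exec.UDF base class, which is available in Hive 1.1.0; the function name and the normalization rule are hypothetical stand-ins for the actual business logic.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Old-style Hive UDF: Hive resolves the evaluate method by reflection.
// The rule shown here (normalize US phone numbers to digits only) is a
// hypothetical example of business logic, not the production rule.
// Note: the Scala runtime jar must be on Hive's classpath alongside this UDF jar.
class NormalizePhone extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else {
      val digits = input.toString.replaceAll("[^0-9]", "")
      new Text(if (digits.length == 10) "1" + digits else digits)
    }
  }
}
```

After packaging the class into a JAR, it would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION normalize_phone AS 'NormalizePhone'; before being called from HiveQL queries.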