Sr. Hadoop Administrator Resume
SUMMARY
- 6 years of IT experience in project development, implementation, deployment, and maintenance using the Hadoop ecosystem and related technologies, with domain knowledge in Finance, Banking, Manufacturing, and Healthcare.
- 5 years of experience using Hadoop and its ecosystem components, including HDFS, MapReduce, YARN, Spark, Hive, Pig, HBase, ZooKeeper, Oozie, Flume, Storm, and Sqoop.
- In-depth understanding of Hadoop architecture and its components, such as JobTracker, TaskTracker, NameNode, DataNode, and ResourceManager.
- In-depth understanding of MapReduce and AWS cloud concepts and their critical role in analyzing large, complex datasets.
- Expertise in writing MapReduce jobs in Java and handling different file formats such as Parquet, Avro, JSON, and ORC.
- Skilled in streaming data using Apache Spark and migrating data from Oracle to HDFS using Sqoop.
- Proficient in processing data with Apache Pig by registering User Defined Functions (UDFs) written in Java.
- Proficient in designing and querying NoSQL databases such as MongoDB, HBase, and Cassandra.
- Skilled in RDBMS concepts, with strong hands-on experience in database systems such as Oracle and MySQL.
- Proficient in advanced UNIX concepts, with working experience in advanced scripting and programming.
- Analyzed data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications.
- Strong experience creating real-time data streaming solutions using Spark Core, Spark SQL, and DataFrames.
- Hands-on experience with Spark Streaming to receive real-time data from Kafka.
- Worked extensively with Hive DDL and HiveQL.
- Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
- Developed Sqoop scripts for large dataset transfers between Hadoop and RDBMSs.
- Experience with big data ML toolkits such as Mahout and Spark ML.
- Experience in HBase cluster setup and implementation.
- Strong experience troubleshooting failures in Spark applications and fine-tuning them for better performance.
- Experience in installation, configuration, support, and monitoring of Hadoop clusters using Apache and Cloudera distributions and AWS.
- Extensive knowledge of AWS cloud infrastructure: RDS, Redshift, DynamoDB, EC2, EMR, EBS, S3, Auto Scaling, CloudWatch, and IAM.
- Understanding of Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies including Waterfall and Agile.
- Proficient in using data visualization tools such as Tableau and MS Excel.
- Extensive work experience developing enterprise solutions using Python, Java, J2EE, Servlets, JSP, JDBC, Struts, Spring, Hibernate, JavaBeans, JSF, and MVC.
- Fluent in core Java concepts such as I/O, multithreading, exceptions, collections, data structures, and serialization.
- Experience in sending weekly and monthly status reports to clients and senior management.
- Flexible and ready to take on new challenges, with excellent problem-solving skills; a self-starter with the ability to explore and learn new concepts, tools, and applications.
- Possess good team management, coordination, documentation and presentation skills along with excellent communication and interpersonal skills.
TECHNICAL SKILLS
Hadoop/Big Data: Hadoop (YARN), HDFS, MapReduce, Spark, Hive, Pig, Sqoop, Flume, Kafka, Storm, ZooKeeper, Oozie, Tez, Impala, Mahout
Java/J2EE Technologies: JavaBeans, JDBC, Servlets, RMI & Web Services
Development Tools: Eclipse, IBM DB2 Command Editor, QTOAD, SQL Developer, Microsoft Suite (Word, Excel, PowerPoint, Access), VMware
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere.
Frameworks: Hibernate, EJB, Struts, Spring
Programming/Scripting Languages: Java, SQL, Unix Shell Scripting, Python
Databases: Oracle 11g/10g/9i, MySQL, SQL Server 2005/2008
NoSQL Databases: HBase, Cassandra, MongoDB
ETL Tools: Informatica
Visualization: Tableau and MS Excel.
Version Control Tools: Subversion (SVN), Concurrent Versions System (CVS), and IBM Rational ClearCase.
Methodologies: Agile/ Scrum, Rational Unified Process and Waterfall.
Operating Systems: Windows 98/2000/XP/Vista/7/8/10, Macintosh, UNIX, Linux, and Solaris.
PROFESSIONAL EXPERIENCE
Confidential
Sr. Hadoop Administrator
Responsibilities:
- Analyzed the functional specifications provided by the client and developed a detailed solution design document with the architect and the team.
- Worked with the client's business teams to confirm the solution design and refine the requirements when needed.
- Added support for three different modes of execution: Spark, Hive, and Native.
- Spark mode of execution for the Amazon S3 and Redshift connectors (see the sketch after this group):
- Designed the approach to pull the data from Amazon Redshift to S3 through UNLOAD queries based on the requirement.
- Used Hadoop APIs to stream the data from Amazon S3.
- Wrote Spark jobs in Scala to process the data according to the transformations configured in a mapping, such as aggregation, filter, and expression.
- Once processing was complete, used Hadoop APIs to push the data into Amazon S3 based on the user's target selection.
- Used COPY queries to push the resultant data from Amazon S3 into Amazon Redshift.
- Supported different file formats such as CSV, Avro, Parquet, and ORC.
- Added compression support (Snappy, Gzip, etc.) when storing the data on Amazon S3.
- Added partitioning support while pulling and processing the data to improve performance.
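A minimal sketch of this Spark execution mode, assuming a Redshift JDBC driver on the classpath and S3 (s3a) access configured on the cluster; the endpoint, credentials, IAM role, bucket, table, and column names are hypothetical placeholders rather than actual project values:

```scala
// Sketch of the Spark execution mode: UNLOAD from Redshift to S3, process with Spark,
// write back to S3, then COPY into Redshift. All names below are placeholders.
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkModeSketch {
  def main(args: Array[String]): Unit = {
    val jdbcUrl = "jdbc:redshift://example-cluster:5439/dev"   // placeholder endpoint

    // 1. UNLOAD the source table from Redshift to S3 as delimited files (issued over JDBC).
    val unload = DriverManager.getConnection(jdbcUrl, "user", "password")
    unload.createStatement().execute(
      """UNLOAD ('SELECT region, amount, sale_date FROM sales')
        |TO 's3://example-bucket/unload/sales_'
        |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3'
        |DELIMITER ',' ALLOWOVERWRITE""".stripMargin)
    unload.close()

    // 2. Read the unloaded files from S3 and apply the mapping's transformations
    //    (filter, expression, aggregation).
    val spark = SparkSession.builder().appName("RedshiftS3Connector").getOrCreate()
    val df = spark.read.csv("s3a://example-bucket/unload/")
      .toDF("region", "amount", "sale_date")

    val result = df
      .filter(col("amount").cast("double") > 0)                     // filter
      .withColumn("year", year(to_date(col("sale_date"))))          // expression
      .groupBy("region", "year")
      .agg(sum(col("amount").cast("double")).as("total_amount"))    // aggregation

    // 3. Push the result back to S3, partitioned by year and Snappy-compressed.
    result.write.mode("overwrite")
      .partitionBy("year")
      .option("compression", "snappy")
      .parquet("s3a://example-bucket/processed/sales_summary/")

    // 4. COPY the processed files from S3 into a Redshift target table whose columns
    //    match the Parquet data files (region, total_amount).
    val copy = DriverManager.getConnection(jdbcUrl, "user", "password")
    copy.createStatement().execute(
      """COPY sales_summary
        |FROM 's3://example-bucket/processed/sales_summary/'
        |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3'
        |FORMAT AS PARQUET""".stripMargin)
    copy.close()
    spark.stop()
  }
}
```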
- Hive mode of execution for the Amazon S3 and Redshift connectors (see the sketch after this group):
- Designed the approach to pull the data from Amazon Redshift to S3 through UNLOAD queries based on the requirement.
- As the data is stored as files on Amazon S3, used the DistCp command to copy the files to local HDFS before processing.
- Used Java to generate dynamic Hive scripts for different mappings.
- Tested and certified the scripts for different kinds of mappings with multiple transformations in between.
- After processing the data, used DistCp to push the resultant files back to Amazon S3.
- Used COPY queries to push the resultant data from Amazon S3 to Amazon Redshift service.
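A minimal sketch of the Hive execution mode, rendered in Scala for consistency with the other sketches (the original script generator was written in Java). It assumes the hadoop and hive CLIs are available on the node and that the staging table is already defined over the landing directory; the paths, table names, and mapping model are hypothetical:

```scala
// Sketch of the Hive execution mode: stage files from S3 to HDFS with DistCp, generate a
// HiveQL script for a configured mapping, run it, and stage the result back to S3.
// Paths, table names, and the Mapping model below are hypothetical placeholders.
import java.nio.file.{Files, Paths}
import scala.sys.process._

case class Mapping(sourceTable: String, targetTable: String,
                   filter: String, groupBy: String, aggregate: String)

object HiveModeSketch {
  def main(args: Array[String]): Unit = {
    // 1. Copy the UNLOAD-ed files from S3 down to local HDFS before processing.
    Seq("hadoop", "distcp", "s3a://example-bucket/unload/sales/", "hdfs:///staging/sales/").!

    // 2. Generate a dynamic Hive script from the mapping configuration
    //    (assumes staging_sales is an external table over hdfs:///staging/sales/).
    val m = Mapping("staging_sales", "sales_summary", "amount > 0", "region", "SUM(amount) AS total_amount")
    val script =
      s"""CREATE TABLE IF NOT EXISTS ${m.targetTable} (region STRING, total_amount DOUBLE);
         |INSERT OVERWRITE TABLE ${m.targetTable}
         |SELECT ${m.groupBy}, ${m.aggregate}
         |FROM ${m.sourceTable}
         |WHERE ${m.filter}
         |GROUP BY ${m.groupBy};
         |""".stripMargin
    Files.write(Paths.get("/tmp/mapping_job.hql"), script.getBytes("UTF-8"))

    // 3. Execute the generated script, then push the result files back to S3.
    Seq("hive", "-f", "/tmp/mapping_job.hql").!
    Seq("hadoop", "distcp", "hdfs:///warehouse/sales_summary/", "s3a://example-bucket/processed/sales_summary/").!
  }
}
```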
- Native mode of execution for the Amazon S3 and Redshift connectors (see the sketch after this group):
- Designed the approach to pull the data from Amazon Redshift to S3 through UNLOAD queries based on the requirement.
- Used the AWS SDK for Java to download the files.
- Implemented partitioning support to download the files in parallel, which improved performance.
- Based on the file type (CSV, Avro, Parquet, JSON), used the appropriate parser to parse the metadata and data for processing.
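A minimal sketch of the partitioned parallel download in Native mode, using ranged GETs from the AWS SDK for Java (v1) and assuming credentials and region come from the default provider chain; the bucket, key, and part count are hypothetical, and each worker simply counts bytes where the real connector would hand the stream to a format-specific parser:

```scala
// Sketch of a partitioned (ranged) parallel S3 download with the AWS SDK for Java v1.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GetObjectRequest
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object NativeModeSketch {
  def main(args: Array[String]): Unit = {
    val s3     = AmazonS3ClientBuilder.defaultClient()   // region/credentials from default chain
    val bucket = "example-bucket"                        // placeholder bucket
    val key    = "unload/sales_000.csv"                  // placeholder object key
    val parts  = 4

    // Determine the object size, then split it into byte ranges of roughly equal size.
    val size   = s3.getObjectMetadata(bucket, key).getContentLength
    val chunk  = (size + parts - 1) / parts
    val ranges = (0 until parts).map(i => (i * chunk, math.min(size, (i + 1) * chunk) - 1))

    // Download each range in parallel with a ranged GET; here each worker only counts the
    // bytes it reads, where the real connector would feed the stream to a parser.
    val downloads = ranges.map { case (start, end) =>
      Future {
        val req    = new GetObjectRequest(bucket, key).withRange(start, end)
        val stream = s3.getObject(req).getObjectContent
        val buffer = new Array[Byte](8192)
        var total  = 0L
        var read   = stream.read(buffer)
        while (read != -1) { total += read; read = stream.read(buffer) }
        stream.close()
        total
      }
    }
    val bytes = Await.result(Future.sequence(downloads), Duration.Inf).sum
    println(s"Downloaded $bytes bytes of $size in $parts parallel ranges")
  }
}
```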
- Added 25+ features in Native mode with multiple main releases and hotfixes.
- Fixed numerous performance issues in different product components using profilers such as YourKit and Valgrind.
- Helped the QA team test these connectors on a 16-node cluster with a few terabytes of data.
- Worked on different Hadoop distributions, such as Cloudera and Hortonworks.
Environment: CDH 5, Pig 0.12.0, HDFS 2.6.0, Hive 2.0, HBase 1.0.0, Sqoop 1.4.6, Oozie 2.3.2, ZooKeeper 3.4.5, Impala 2.3.x, Spark 2.2.0, Java 1.8, Parquet, Oracle 11g/10g, SQL Server 2008, Hortonworks Hadoop distribution, MapReduce, YARN, Apache Kafka, PL/SQL, SQL*Plus, SQuirreL SQL, DbVisualizer, Windows NT, Linux, UNIX shell scripting.
Confidential
Hadoop Administrator
Responsibilities:
- Developed custom input adapters in Java for moving the data from raw sources (FTP, S3) to HDFS.
- Developed Spark applications using Scala to perform data cleansing, data validation, data transformations and other enrichments.
- Worked extensively on making the Spark applications production-ready by implementing best practices to make them highly scalable and fault-tolerant.
- Developed many Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning exercises.
- Troubleshot Spark applications to make them more resilient to errors.
- Fine-tuned Spark applications to improve the overall processing time of the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase (see the sketch after this list).
- Utilized the DataFrame and Spark SQL APIs extensively wherever needed.
- The data pipeline consists of Sqoop, custom-built input adapters, Spark, and Hive.
- Performed Hive modeling and wrote many Hive scripts for the various kinds of data preparation needed to run machine learning models.
- Worked closely with the data science team to automate and productionize various models, such as logistic regression and k-means, using Spark ML.
- Built a working prototype of a real-time workflow for streaming user events from external applications.
- Utilized Kafka and Spark Streaming to build the real-time pipeline.
- Converted existing MapReduce jobs to Spark jobs.
- Developed Oozie workflows to automate and productionize the data pipelines.
- Implemented Logistic Regression and K-Means models and automated them to run in production.
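A minimal sketch of the Kafka-to-HBase streaming path described above, using the Spark Streaming Kafka 0-10 integration; the broker addresses, topic, table, and column-family names are hypothetical placeholders, and hbase-site.xml is assumed to be on the classpath:

```scala
// Sketch: consume events from a Kafka topic with Spark Streaming and write them to HBase.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("EventsToHBase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",   // placeholder brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "events-to-hbase",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("user-events"), kafkaParams))

    // Write each micro-batch to HBase, opening one connection per partition so that no
    // non-serializable client objects are captured by the closure.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("user_events"))
        records.foreach { rec =>
          val put = new Put(Bytes.toBytes(s"${rec.partition()}-${rec.offset()}"))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value()))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```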
Environment: Cloudera Distribution, Hadoop, HDFS, Spark, Scala, Kafka, HBase, Oozie, Hive, Flume, Sqoop, Java, SQL, Oracle 11g, Unix/Linux
Confidential
Big Data Admin
Responsibilities:
- Involved in installing Hadoop ecosystem components under the Cloudera distribution.
- Gathered the business requirements from the business partners and subject matter experts.
- Responsible for managing data coming from different sources.
- Supported MapReduce programs running on the cluster.
- Provided routine performance analysis, capacity analysis, and security audit reports to the customer for necessary planned changes, and coordinated with the team per business requirements.
- Created and managed users and groups in Centrify.
- Automated Pig and Hive jobs using Oozie.
- Installed patches and packages.
- Configured and administered the NFS server and clients, including starting and stopping the NFS service.
- Handled the server for the disaster recovery (DR) activity project.
- Followed the change management process for any changes in the production environment.
- Reviewed risk and issue logs as frequently as possible.
- Held weekly status reviews with the customer and maintained the access checklist document for the servers.
- Wrote MapReduce jobs using the Java API for data analysis and dimension/fact generation.
- Installed and configured Pig and wrote Pig Latin scripts.
- Imported data from MySQL to HDFS using Sqoop on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote Hive queries for data analysis to meet the business requirements (see the sketch after this list).
- Utilized Agile Scrum Methodology to help manage and organize a team of 4 developers with regular code review sessions.
- Used Storm to analyze large amounts of non-unique data points with low latency and high throughput.
- Held weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers.
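A minimal sketch of running one such analysis query through Hive's JDBC interface (HiveServer2), written in Scala for consistency with the other sketches; the host, credentials, table, and column names are hypothetical placeholders:

```scala
// Sketch: execute an analysis query against HiveServer2 over JDBC and print the result.
import java.sql.DriverManager

object HiveAnalysisSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "hadoop", "")
    val stmt = conn.createStatement()

    // Example business-style aggregation: daily event counts per source system.
    val rs = stmt.executeQuery(
      """SELECT source_system, to_date(event_ts) AS event_day, COUNT(*) AS events
        |FROM raw_events
        |GROUP BY source_system, to_date(event_ts)
        |ORDER BY event_day""".stripMargin)

    while (rs.next()) {
      println(s"${rs.getString("source_system")} ${rs.getString("event_day")} ${rs.getLong("events")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}
```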
Environment: Cloudera, Java 7, MapReduce, HDFS, Hive, Pig, Linux, XML, MySQL, MySQL Workbench, Eclipse, PL/SQL, SQL connector, Subversion.