Hadoop & Spark Engineer Resume
Overland Park, KS
SUMMARY
- Over 16 years of IT experience as a Developer, Designer & QA Engineer, with cross-platform integration experience using the Hadoop ecosystem.
- Hands-on experience installing, configuring, and architecting Hadoop and Hortonworks clusters and services: HDFS, MapReduce, YARN, Pig, Hive, HBase, Spark, Sqoop, Flume, and Oozie.
- 7+ years of experience with cloud platforms.
- 7+ years of experience working with Spark technology.
- Expertise in Spark Streaming (Lambda Architecture), Spark SQL, and tuning and debugging Spark clusters on Mesos.
- Expertise in machine learning with MLlib using Python.
- Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, and data mining.
- Expertise in working with MongoDB and Apache Cassandra.
- Programming knowledge of Scala, Python, and C#.
- Experience working with Teradata and preparing its data for batch processing using distributed computing.
- Good working experience with Hadoop data warehousing tools such as Hive and Pig, and with loading data onto the cluster for these tools using Sqoop.
- Developed Oozie workflow schedulers to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Good knowledge of High-Availability, Fault Tolerance, Scalability, Database Concepts, System and Software Architecture, Security and IT Infrastructure.
- Led onshore and offshore service delivery functions to ensure end-to-end ownership of incidents and service requests.
- Mentored junior developers and kept them current on cutting-edge technologies such as Hadoop, Spark, Spark SQL, and Presto.
- All of the projects I have worked on are open-source based and have been tracked using JIRA.
- Experience with Agile methodologies; over 10 years of IT experience with technologies including Kafka, Apache Spark (Big Data ecosystem), and Informatica PowerCenter.
- Possess good technical and architectural knowledge of the Big Data ecosystem, including Spark, HDFS, MapReduce, Hive, Oozie, HBase, Sqoop, and Kafka.
- Strong analytical and problem-solving skills.
- 7+ years of experience as a Hadoop developer with strong exposure to Hadoop technologies such as HDFS, Spark, Hive, Sqoop, and Kafka.
- Around 7 years of experience developing data warehousing applications using Informatica, Oracle, and UNIX.
- Good experience developing streaming applications using Kafka.
- Built a predictive modeling algorithm in Spark using multivariate linear regression that minimizes a squared-error cost function (see the sketch following this summary).
- Extensively worked on extracting, transforming, and loading data from sources such as Oracle, DB2, and flat files.
- Extensive experience in developing complex SQL/HQL queries for data analysis and exploration
- Able to assess business rules, collaborate with stakeholders and perform source-to-target data mapping, design and review
- Excellent communication, documentation and presentation skills using tools like Visio and PowerPoint
- Extensive experience with the software development life cycle (SDLC) in both Waterfall and Scrum methodologies.
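The predictive-modeling bullet above refers to multivariate linear regression in Spark; a minimal PySpark sketch of that approach follows. The HDFS path, column names, and hyperparameters are placeholders, not the original model.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("multivariate-lr-sketch").getOrCreate()

# Hypothetical training set with several numeric predictors and one target column.
df = spark.read.parquet("hdfs:///data/model/training_set")

# Assemble the predictor columns into a single feature vector.
assembled = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"],
    outputCol="features").transform(df)

# Ordinary least squares: Spark fits the coefficients by minimising the
# squared-error cost function over the training data.
lr = LinearRegression(featuresCol="features", labelCol="target", maxIter=50)
model = lr.fit(assembled)

print("RMSE on training data:", model.summary.rootMeanSquaredError)
```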
PROFESSIONAL EXPERIENCE
Hadoop & Spark Engineer
Confidential, Overland Park, KS
Responsibilities:
- Worked directly with the Big Data Architecture team that created the foundation of this enterprise analytics initiative in a Hadoop-based data lake.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Worked with the JSON file format in StreamSets. Worked with the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs.
- Extracted a real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the sketch at the end of this role).
- Developed data pipeline using Flume, Sqoop, Spark with Scala to ingest customer behavioral data and financial histories into HDFS for analysis.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Upgraded the Hadoop cluster from CDH 4.7 to CDH 5.2 and worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Developed Spark scripts to import large files from Amazon S3 buckets and imported the data from different sources like HDFS/HBase into Spark RDD.
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
- Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting failures in MapReduce job execution by inspecting and reviewing log files.
- Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
- Installed the OS and administered the Hadoop stack on the Cloudera CDH5 (with YARN) distribution, including configuration management, monitoring, debugging, and performance tuning.
- Worked with AWS team to manage servers in AWS.
- Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for processing and storing large data sets, worked with Elastic MapReduce (EMR), and set up a Hadoop environment on AWS EC2 instances.
- Created Hive external tables, loaded them with data, and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Developed PySpark code to read data from Hive, group fields, and generate XML files, and enhanced it to write the generated XML files to a directory and zip them into CDAs.
- Have hands-on experience in working with Hadoop distribution platforms like Hortonworks, Cloudera, MapR, and others
- Experience installing, configuring, and testing the Hadoop ecosystem from the ground up, including both node hardware and software configurations.
- Experience in performance benchmarking and understanding of both hardware and software bottlenecks in system design
Environment: Hadoop, MapReduce, HDFS, Hive, Java, Oozie, Linux, Eclipse, PuTTY, WinSCP, Oracle 10g, PL/SQL, YARN, Spark, Scala, Python, Sqoop, DB2, AWS.
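The real-time Kafka feed described in this role followed roughly the shape sketched below. The original pipeline used RDD-based DStreams; this sketch uses PySpark Structured Streaming for brevity, and the broker address, topic name, schema, and HDFS paths are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

# Placeholder schema for the incoming customer events.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Consume the real-time feed from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "customer-events")
       .load())

# Parse the JSON payload into typed columns (a DataFrame).
events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Persist each micro-batch to HDFS in Parquet format.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/lake/customer_events")
         .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
         .start())
query.awaitTermination()
```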
Hadoop Engineer
Confidential, Detroit, MI
Responsibilities:
- Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Developed MapReduce programs to clean and aggregate the data.
- Responsible for building scalable distributed data solutions using Hadoop and Spark.
- Implemented Hive ad-hoc queries to handle member data from different data sources such as Epic and Centricity.
- Implemented Hive UDFs and performed tuning for better results.
- Analyzed the data by performing Hive queries and running Pig Scripts.
- Involved in loading data from UNIX file system to HDFS.
- Implemented optimized map joins to bring data from different sources together and perform cleaning operations before applying the algorithms (see the sketch at the end of this role).
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis
- Used Sqoop to import and export data between Netezza and Oracle databases and HDFS/Hive.
- Implemented a POC to introduce Spark transformations.
- Worked with the NoSQL databases HBase, MongoDB, and Cassandra to create tables and store data, and worked with Microsoft Azure.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, streamed data using Flume, and loaded it into HDFS.
- Worked on transforming data from MapReduce into HBase as bulk operations.
- Implemented CRUD operations on HBase data using the Thrift API to get real-time insights.
- Installed the Oozie workflow engine to run multiple MapReduce, Hive, Impala, ZooKeeper, and Pig jobs that run independently based on time and data availability.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster for generating reports on nightly, weekly and monthly basis.
- Used Zookeeper to manage Hadoop clusters and Oozie to schedule job workflows.
- Implemented Incorta test scripts to support test-driven development and continuous integration.
- Involved in data ingestion into HDFS using Apache Sqoop from a variety of sources, using JDBC connectors and import parameters.
- Coordinated with Hadoop admins during deployments to production.
- Developed Pig Latin scripts to extract data from log files and store it in HDFS; created user-defined functions (UDFs) to pre-process data for analysis.
- Developed PySpark scripts and batch jobs to schedule various Hadoop programs.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Participated in design and implementation discussions for the developing Cloudera 5 Hadoop ecosystem.
- Used JIRA and Confluence to update tasks and maintain documentation.
- Worked in an Agile development environment in two-week sprint cycles, dividing and organizing tasks; participated in daily scrums and other design-related meetings.
- Created final reports of analyzed data using Apache Hue and the Hive browser, and generated graphs for study by the data analytics team.
- Processed large sets of structured, semi-structured, and unstructured data and supported the systems application architecture.
- Familiar with data architecture including Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing
- Used Sqoop to export the analyzed data to a relational database for use by the data analytics team.
Environment: Hadoop, Cloudera Hadoop, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Java, JSON, Spark, HDFS, YARN, Oozie Scheduler, ZooKeeper, Mahout, Linux, UNIX, ETL, MySQL.
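The optimized map joins called out in this role are sketched below as a Spark SQL broadcast join over Hive tables (the Hive-native equivalent uses the MAPJOIN hint). The table names, columns, and cleansing logic are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("member-cleansing-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Broadcast (map-side) join: the small reference table is shipped to every
# executor, so the large member table never needs to be shuffled.
cleaned = spark.sql("""
    SELECT /*+ BROADCAST(s) */
           m.member_id,
           s.source_name,
           TRIM(UPPER(m.first_name)) AS first_name
    FROM   member_events m
    JOIN   source_systems s
      ON   m.source_id = s.source_id
    WHERE  m.member_id IS NOT NULL
""")

# Write the cleansed result back to Hive for the downstream algorithms.
cleaned.write.mode("overwrite").saveAsTable("analytics.member_events_clean")
```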
Hadoop Admin
Confidential, Sacramento, CA
Responsibilities:
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis
- Installed Cloudera distribution of Hadoop Cluster and services HDFS, Pig, Hive, Sqoop, Flume and MapReduce.
- Responsible for providing an open-source platform based on Apache Hadoop for analyzing, storing, and managing big data.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Responsible for managing data coming from different sources.
- Imported and exported data into HDFS and Hive using Sqoop, and worked with Microsoft Azure.
- Wrote Hive queries.
- Involved in loading data from UNIX file system to HDFS.
- Created Hive tables, loaded them with data, and wrote queries that run internally as MapReduce jobs, performing data analysis per the business requirements (see the sketch at the end of this role).
- Worked with analysts to determine and understand business requirements.
- Loaded and transformed large datasets of structured, semi structured and unstructured data using Hadoop/Big Data concepts
- Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest customer data and financial histories into HDFS for analysis
- Used MapReduce and Flume to load, aggregate, store and analyze web log data from different web servers.
- Created MapReduce programs to handle semi/unstructured data like XML, JSON, AVRO data files and sequence files for log files.
- Involved in submitting and tracking MapReduce jobs using Job Tracker.
- Experience writing Pig Latin scripts for data cleansing, ETL operations, and query optimization of existing scripts.
- Wrote Hive UDFs to sort struct fields and return complex data types.
- Created Hive tables from JSON data using serialization frameworks such as Avro.
- Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources.
- Experience working with the NoSQL database HBase for real-time data analytics.
- Integrated Hive tables to HBase to perform row level analytics.
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files
- Developed Unit Test Cases for Mapper, Reducer and Driver classes using MR Testing library.
- Supported operations team in Hadoop cluster maintenance including commissioning and decommissioning nodes and upgrades.
- Took end-to-end responsibility for the Hadoop life cycle in the organization.
- Led the team and acted as the bridge between data scientists, engineers, and organizational needs.
- Performed in-depth requirement analyses and was solely responsible for selecting the work platform.
- Full knowledge of Hadoop architecture and HDFS.
- Provided technical assistance to all development projects.
- Hands-on experience with Qlik Sense for data visualization and analysis on large data sets, drawing various insights.
- Created dashboards using Qlik Sense and performed data extracts, data blending, forecasting, and table calculations.
Environment: Hadoop, MapReduce, YARN, Hive, HDFS, Pig, Sqoop, Solr, Oozie, Impala, Spark, Hortonworks, HBase, ZooKeeper, Unix/Linux, Hue (Beeswax), AWS.
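The Hive tables and HQL analysis mentioned in this role looked roughly like the sketch below; on the cluster the queries were plain HQL compiled to MapReduce, and they are wrapped in PySpark here only to keep the example self-contained. The database, paths, columns, and SerDe choice are assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weblog-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS logs")

# External table over semi-structured JSON web logs already landed in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_events (
        host    STRING,
        request STRING,
        status  INT,
        bytes   BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 'hdfs:///data/raw/web_events'
""")

# Ad-hoc analysis query; under Hive-on-MapReduce this compiles to MR jobs.
top_errors = spark.sql("""
    SELECT host, COUNT(*) AS error_count
    FROM   logs.web_events
    WHERE  status >= 500
    GROUP BY host
    ORDER BY error_count DESC
    LIMIT 20
""")
top_errors.show()
```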
Hadoop Engineer
Confidential
Responsibilities:
- Spark- and Cassandra-based analytics and content-serving platform design and development
- Technology analysis and selection (Spark vs. Storm, Cassandra vs. Hadoop vs. Redis, Mesos vs. YARN)
- Real-time data streaming, storage, and analysis
- Spark cluster programmatic node control (custom Akka extensions)
- Spark embedded Jetty server extensions and UI
- Custom JSON API layer
- Custom Cassandra ORMs (one with the Scala reflection API, another with PHP)
- Content Serving application layer
- Platform sys-admin and deployment operations
- R and MLlib data analytics, statistical modeling, and algorithm development
- General Exploratory Data Analysis
- Descriptive statistics workups (web visitor behaviour KPIs)
- Classification (web visitor audience segments as a function of web visitor behaviour KPIs); see the sketch at the end of this role
- A/B testing (UI element performance as a function of web visitor behaviour)
- Outlier, dependency, and association tests (web visitor behaviour KPIs)
- Logistic modeling (UI element performance as a function of web visitor behaviour)
- Defined the Big Data / Hadoop architecture adoption roadmap
- Designed and implemented Sqoop and Hive scripts for Data Archival and Ingestion into Hadoop.
- Managed architectural design and development of data management projects.
- Gathered business requirements, working closely with business users and technical teams to design conceptual, logical and physical data models.
- Development and maintenance of logical and physical Erwin data models, along with corresponding metadata, to support Marketplace and Strategic projects.
- Defined and maintained naming standards and data dictionary for data models.
- Performed performance tuning of complex queries for data retrieval and modification.
- Analyzed business needs; worked closely with project staff to make decisions and recommendations on the future developments.
- Interacted with the support teams for any co-ordination points and business users to ensure the requirements were addressed.
- Managed multiple data warehousing projects spanning build and test for different clients over the period of 6+ years.
- Managed DWBI projects encompassing multivendor teams across broad technical and business disciplines.
- Was responsible for leadership reporting, managing customer expectations, high impact and timely communication and resolving risks and issues.
- Conducted performance appraisals and talent management.
- Worked with HR to manage staffing issues.
- Made vital decisions and drove decision-making across projects on a day-to-day basis
- Mentored the new team members to observe the DWBI development standards and architecture.
- Prepared project management plan, prioritized needs, managed change requests, tracked costs and efforts to keep budget under control.
- Developed status reports, cost estimates and resource plans.
- Prepared the Statement of Work and related Service Level Agreement for the suggested projects.
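The classification and logistic-modeling work on web-visitor behaviour KPIs is sketched below with Spark MLlib's logistic regression. The KPI columns, label, and data location are placeholders; in the actual platform the behavioural data lived in the Cassandra-backed analytics store.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("visitor-segmentation-sketch").getOrCreate()

# Placeholder behavioural KPIs per visitor.
visits = spark.read.parquet("hdfs:///data/kpis/visitor_behaviour")

# Combine the KPI columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["pages_per_session", "avg_dwell_seconds", "bounce_rate"],
    outputCol="features")

# Logistic regression modelling segment membership as a function of the KPIs.
lr = LogisticRegression(labelCol="in_segment", featuresCol="features")

model = Pipeline(stages=[assembler, lr]).fit(visits)
scored = model.transform(visits)
scored.select("visitor_id", "probability", "prediction").show(10)
```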
Confidential
Sr. QA Analyst
Responsibilities:
- Gathered business requirements, functional requirements, business strategies, use cases, and guidelines for Alberta Enterprise and Advanced Education; in each phase of the application, I was involved in a series of meetings with the development team and business team.
- Organized multiple brainstorming sessions with the end users to understand the business requirements and functional requirements.
- Assigned as QA representative on complex application development or maintenance projects.
- Participated in release meetings for maintenance and development projects ensuring all platform requirements are met.
- Performed unit, regression, smoke, system, integration, and user acceptance testing of the PAPRS application.
- Participated in periodic requirement analysis and review sessions.
- Involved in various meetings with the DBA and the systems analyst to plan and coordinate the performance testing effort.
- Worked closely with development and operations groups in support of software and production releases.
- Led the team in multi-faceted testing, including both user acceptance testing (UAT) and operational readiness testing (ORT).
- Maintained effective communication of status with Project Owner, management, and the team.
- Defined test strategies and test plans to collaborate on complex application solutions.
- Reviewed and understood cross-platform impacts during the creation of test plans.
- Provided application risk analysis (gaps) and recommended mitigation plans for the gaps.
- Involved in performing user acceptance testing (UAT) and signing off the project.
- Derived test cases from use case, non-functional requirements and design documents using Team Foundation Server.
- Tracked defects, which included adding new defects, re-testing open defects against new builds of the application, and analyzing defect data using Microsoft Visual Studio; also involved in data testing and logging defects.
- Worked with Microsoft Azure and Grey Matter.
- Paired with business users to support black box exploratory testing on the required software.
- Located, reported, and followed up on defects using the Bugzilla defect management software.
- Used exploratory testing as a check on the formal test process, helping ensure that the most serious defects had been found.
- Wrote complex SQL queries to verify data using Microsoft SQL Server 2008 R2 (see the sketch at the end of this section).
- Involved in testing XML messages supported by different protocols and APIs (e.g., HTTP, JMS) and emulating server responses.
- Performed modeling, editing, transformation, and debugging of XML-related technologies.
- Involved in implementing CMMI quality standards, procedures, and other process improvement methodologies.
- Established quality standards, procedures and QA methodologies.
- Following Agile methodology, restructured the system without changing its functionality to improve simplicity and flexibility.
- Participated in daily Scrum meetings where day-to-day challenges were discussed with the team.
- Worked continuously with teams to implement the Agile SDLC.
- Configured the test environment necessary for each cycle of the Agile SDLC.
- Wrote complex SQL queries for report generation and verification of data.
- Set up security for external users.
- Performed quick tests and verified that the UAT issues had been fixed.
- Performed new UAT SIAMS test account set-up.
- Participated in Status Updates and Meetings.
- Prepared LF Sessions and Costs Detailed Report.
- Reviewed Business Design Documentation.
- Performed PAPRS Security testing - LF User Security Testing.
- Performed PAPRS Security Testing - PSP User Security Testing.
- Performed SIAMS Security setup for Public & Private Institutions.
- Prepared LF Sessions and Costs Completed Record Summary Report.
- Set up access IDs for PVTs and checked access for PSPs.
Platforms: VBScript, ASP.NET, Oracle, SQL Server 2008 R2, Microsoft Visual Studio, Microsoft Azure, Grey Matter, Team Foundation Server, Sparx Systems Enterprise Architect, SQL Server Management Studio, IBM Rational Quality Manager 5.0 & 5.0.2 (Cúram projects), Microsoft SQL Server 6.5, HTML, XML, Java, JavaScript, JMeter, Sun Solaris, MKS, MQ, IBM AIX, Informatica (ETL), Forms, Crystal Reports, J2EE, Lotus Notes, Microsoft SharePoint, MS Access, Bugzilla, UNIX shell scripting, IIS, PWS.
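The data-verification SQL mentioned in this role was run directly against SQL Server 2008 R2; the sketch below wraps one such reconciliation check in Python with pyodbc purely for illustration. The server, database, and table names are placeholders.

```python
import pyodbc

# Placeholder connection to the SQL Server 2008 R2 test environment.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=testdb01;DATABASE=PAPRS_QA;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Reconciliation check: row counts in the staging table must match the target.
cursor.execute("""
    SELECT
        (SELECT COUNT(*) FROM dbo.stg_sessions)  AS staging_rows,
        (SELECT COUNT(*) FROM dbo.fact_sessions) AS target_rows
""")
staging_rows, target_rows = cursor.fetchone()

if staging_rows != target_rows:
    print(f"Mismatch: staging={staging_rows}, target={target_rows}")
else:
    print("Row counts match.")

conn.close()
```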