Hadoop Data Engineer Resume
St Paul, MN
PROFESSIONAL SUMMARY:
- Hadoop Developer with 9 years of IT experience, including 6 years in the Big Data and Analytics field covering storage, querying, processing, and analysis for developing end-to-end (E2E) data pipelines.
- Expertise in designing scalable Big Data solutions and data warehouse models on large-scale distributed data, and in performing a wide range of analytics.
- Expertise in all components of the Hadoop/Spark ecosystem - Spark, Hive, Pig, Flume, Sqoop, HBase, Kafka, Oozie, Impala, StreamSets, Apache NiFi, Hue, and AWS.
- 3+ years of experience programming in Scala and Python.
- Extensive knowledge of data serialization formats such as Avro, SequenceFiles, Parquet, JSON, and ORC.
- Strong knowledge of Spark architecture and real-time streaming using Spark.
- Hands-on experience with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
- Good knowledge of Amazon Web Services (AWS) cloud services such as EC2, S3, EMR, and VPC.
- Experienced in data ingestion, processing, aggregation, and visualization in Spark environments.
- Hands-on experience working with large volumes of structured and unstructured data.
- Expert in migrating code components from SVN repositories to Bitbucket repositories.
- Experienced in building Jenkins pipelines for continuous integration of code from GitHub onto Linux machines. Experience in Object-Oriented Analysis and Design (OOAD) and development.
- Good understanding of end-to-end web applications and design patterns.
- Hands-on experience in application development using Java, RDBMS, and Linux shell scripting.
- Well versed in software development methodologies such as Agile and Waterfall.
- Experienced in handling databases: Netezza, Oracle and Teradata.
- Strong team player with good communication, analytical, presentation and inter-personal skills.
TECHNICAL SKILLS:
Big Data Technologies : HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, Flume, Ambari, Hue
Hadoop Distributions : Cloudera CDH, Hortonworks HDP, MapR
Databases : Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Languages : C, C++, Java, Scala, Python
AWS Components : IAM, S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS
Methodologies : Agile, Waterfall
Build Tools : Maven, Gradle, Jenkins.
NoSQL Databases : HBase, Cassandra, MongoDB, DynamoDB
IDE Tools : Eclipse, NetBeans, IntelliJ IDEA
Modelling Tools : Rational Rose, StarUML, Visual Paradigm for UML
BI Tools : Tableau
Operating System : Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
WORK EXPERIENCE:
Hadoop Data Engineer
Confidential, St Paul, MN
Responsibilities:
- Designed data models and data flow diagrams for various insights.
- Designed Hive tables, their schemas, and the overall approach for processing data in Hive.
- Analyzed and processed the data according to the needs of each insight.
- Developed aggregation logic in Spark Scala to calculate the results of each insight.
- Applied performance improvement techniques such as partitioning, bucketing, and the Parquet file format (see the sketch after this list).
- Developed various HQL scripts and UNIX shell scripts for the insights.
- Developed UNIX shell scripts to automate the project.
- Processed metadata files into AWS S3 and an Elasticsearch cluster.
- Worked with the testing team to resolve the defects they raised.
- Deployed the Hortonworks stack for use with HDFS and Spark.
- Used Informatica 8.5 ETL to move data into Hortonworks HDFS.
- Worked on the DataStage production job scheduling process using the Control-M scheduling tool.
- Involved in the deployment of DataStage jobs from Development to Production environment.
- Experienced in using advanced DataStage real-time stages such as SAP IDoc, ABAP, Web Services, XML, and MQ. Used Regroup, Parser, H-Join, and Sort steps for XML processing.
- Worked with DataStage Designer to create table definitions for CSV and flat files and to import the table definitions.
- Deployed new applications into the production environment. Supported existing and new DataStage applications.
- Developed Statements of Work, RFIs, RFPs, and POCs for new Hortonworks Hadoop initiatives.
- Developed project plans and timelines, and designed the AWS EC2 architecture supporting the Hortonworks Hadoop stack 2.3.
- Provisioned AWS instances using Ansible Tower and used Hortonworks Cloudbreak to build clusters on those instances.
- Responsible for the implementation and administration of the Hortonworks infrastructure.
- Set up NameNode HA, ResourceManager HA, and multiple HBase Masters using the Hortonworks Ambari console.
- Performed various tuning to run Spark Scala code against high data volumes.
- Developed solutions to big data problems utilizing common tools found in the ecosystem.
- Developed solutions for real-time and offline event collection from various systems.
- Developed, maintained, and performed analysis within a real-time architecture supporting large amounts of data from various sources.
- Analyzed massive amounts of data and helped drive prototype ideas for new tools and products.
- Designed, built, and supported APIs and services exposed to other internal teams.
- Employed rigorous continuous delivery practices managed under an agile software development approach.
- Work included migration of existing applications and development of new applications using AWS cloud services.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
- Ensured a quality transition to production and solid production operation of the software.
- Practiced test-driven development, test automation, continuous integration, and deployment automation.
- Enjoy working with data: data analysis, data quality, reporting, and visualization.
- Good communicator, able to analyze and clearly articulate complex issues and technologies in an understandable and engaging way.
- Great design and problem solving skills, with a strong bias for architecting at scale.
- Adaptable, proactive and willing to take ownership.
- Keen attention to detail and high level of commitment.
- Good understanding of advanced mathematics, statistics, and probability.
- Experience working in agile/iterative development and delivery environments where requirements change quickly and the team must constantly adapt to moving targets.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters with agile methodology.
- Monitored multiple Hadoop cluster environments using Ganglia; monitored workload, job performance, and capacity planning using Cloudera Manager.
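A minimal PySpark sketch of the insight aggregation and partitioned/bucketed Parquet layout described in the bullets above; the original logic was written in Scala, and the database, table, and column names used here are hypothetical placeholders.

```python
# Minimal PySpark sketch of the aggregation and Hive table layout described above.
# The original work used Scala; database, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("insight-aggregation")
    .enableHiveSupport()  # allows reading/writing Hive tables
    .getOrCreate()
)

# Read the source Hive table and compute a per-region daily aggregate.
claims = spark.table("claims_db.claims_raw")
daily_totals = (
    claims
    .groupBy("region", F.to_date("claim_date").alias("claim_dt"))
    .agg(
        F.count("*").alias("claim_count"),
        F.sum("claim_amount").alias("total_amount"),
    )
)

# Partitioning by date, bucketing by region, and storing as Parquet are the
# performance techniques called out in the bullet above.
(
    daily_totals.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("claim_dt")
    .bucketBy(8, "region")
    .sortBy("region")
    .saveAsTable("claims_db.claims_daily_agg")
)
```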
Environment: AWS (Core, Kinesis, IAM, S3/Glacier, Glue, DynamoDB, SQS, Step Functions, Lambda, API Gateway, Cognito, EMR, RDS/Aurora, CloudFormation, CloudWatch), Python, Scala/Java, Spark (batch, streaming, ML, performance tuning at scale), Hadoop, Hive, HiveQL, YARN, Pig, Sqoop, Ranger, real-time streaming, Kafka, Kinesis, Avro, Parquet, JSON, ORC, CSV, XML, NoSQL/SQL, microservice development, RESTful API development, CI/CD
Hadoop Developer
Confidential, Edison, NJ
Responsibilities:
- Worked with Hadoop ecosystem components such as HBase, Sqoop, ZooKeeper, Oozie, Hive, and Pig on the Cloudera Hadoop distribution.
- Developed Pig and Hive UDFs in Java to extend Pig and Hive functionality, and wrote Pig scripts for sorting, joining, filtering, and grouping the data.
- Redesigned jobs in DataStage Designer to accommodate changes in new incoming feeds. Involved in importing and exporting jobs by category and maintaining regular backups.
- Used DataStage as the ETL tool to extract data from sources such as flat files and DB2 and load it into the target DB2 UDB.
- Implemented Slowly Changing Dimensions (Type 1 and Type 2) using DataStage ETL jobs.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed programs in Spark, based on application needs, for faster data processing than standard MapReduce programs.
- Installed and tested Hortonworks HDP on IBM Power Systems.
- Installed Hortonworks on a new server model and tested performance.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
- Prepared the Oozie workflows with Sqoop actions to migrate the data from relational databases like Oracle, Teradata to HDFS.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL.
- Used Sqoop to load the data into HBase and Hive.
- Wrote Hive queries to analyze the data and generate the end reports used by business users.
- Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop, Apache Spark, and Apache Storm, and ingested streaming data into Hadoop using Spark, the Storm framework, and Scala.
- Good experience with NoSQL databases such as MongoDB.
- Wrote Spark Python code for the model integration layer.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other features (see the broadcast join sketch after this list).
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
- Developed a data pipeline using Kafka, HBase, Spark on Mesos, and Hive to ingest, transform, and analyze customer behavioral data.
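A minimal PySpark sketch of the broadcast-join pattern referenced above; the original work was in Scala, and the dataset paths and column names here are hypothetical.

```python
# Minimal PySpark sketch of a broadcast join; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Large fact table (e.g. customer events) and a small dimension/lookup table.
events = spark.read.parquet("hdfs:///data/events")          # hypothetical path
regions = spark.read.parquet("hdfs:///data/region_lookup")  # small lookup table

# Broadcasting the small table avoids shuffling the large one across the cluster.
enriched = events.join(F.broadcast(regions), on="region_id", how="left")

enriched.groupBy("region_name").count().show()
```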
Environment: Hadoop, HDFS, CDH, Pig, Hive, Oozie, ZooKeeper, HBase, Spark, Storm, Spark SQL, NoSQL, Scala, Kafka, Mesos, MongoDB.
Hadoop Developer
Confidential, Oak Brook, IL
Responsibilities:
- Involved in discussions with business users to gather the required knowledge.
- Analyzed the requirements to develop the framework.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
- Developed Java Spark streaming scripts to load raw files and the corresponding processed metadata files into AWS S3 and an Elasticsearch cluster.
- Developed Python scripts to get the most recent S3 keys from Elasticsearch.
- Developed Python scripts to fetch S3 files using the Boto3 module (see the S3-fetch sketch after this list).
- Implemented PySpark logic to transform and process various formats of data such as XLSX, XLS, JSON, and TXT.
- Built scripts to load PySpark-processed files into a Redshift database, using a variety of PySpark logic (see the Redshift sketch after this list).
- Developed scripts to monitor and capture the state of each file as it moves through the pipeline.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
- Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie operational services for batch processing and dynamic workflow scheduling.
- Work included migration of existing applications and development of new applications using AWS cloud services.
- Worked with data investigation, discovery, and mapping tools to scan every data record from many sources.
- Implemented shell scripts to automate the whole process.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
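A minimal sketch of the S3-fetch step described above: query Elasticsearch for the most recently indexed S3 keys, then download those objects with Boto3. The Elasticsearch endpoint, index name, field names, and bucket name are hypothetical placeholders.

```python
# Minimal sketch: look up the most recent S3 keys indexed in Elasticsearch,
# then download those objects with Boto3. All names below are hypothetical.
import boto3
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://es.example.com:9200"])  # hypothetical endpoint
s3 = boto3.client("s3")

# Query Elasticsearch for the ten most recently indexed file records.
resp = es.search(
    index="file-metadata",
    body={"sort": [{"ingest_time": "desc"}], "size": 10, "query": {"match_all": {}}},
)

for hit in resp["hits"]["hits"]:
    key = hit["_source"]["s3_key"]
    # Download each object from S3 into the local working directory.
    s3.download_file("example-raw-bucket", key, key.split("/")[-1])
```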
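A minimal PySpark sketch of processing mixed-format input and loading the result into Redshift over JDBC, as described above. Paths, column names, the target table, and connection details are hypothetical; a production setup might instead use the spark-redshift connector or a COPY from S3.

```python
# Minimal PySpark sketch: read JSON and delimited text inputs, normalize them,
# and append the result to a Redshift table over JDBC. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("formats-to-redshift").getOrCreate()

# Read JSON and pipe-delimited text inputs into DataFrames.
json_df = spark.read.json("s3a://example-bucket/incoming/json/")
txt_df = (
    spark.read
    .option("header", "true")
    .option("sep", "|")
    .csv("s3a://example-bucket/incoming/txt/")
)

# Normalize both sources to a common schema before combining them.
normalized = (
    json_df.select("record_id", F.col("amount").cast("double"))
    .unionByName(txt_df.select("record_id", F.col("amount").cast("double")))
)

# Append the processed records to Redshift through the JDBC driver.
(
    normalized.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")
    .option("dbtable", "public.processed_records")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save()
)
```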
Environment: AWS S3, Java, Maven, Python, Spark, Kafka, Elasticsearch, MapR Cluster, Amazon Redshift, shell scripting, Boto3, pandas, certifi, PySpark, Pig, Hive, Oozie, JSON.
Hadoop Developer
Confidential, Boston, MA
Responsibilities:
- Developed simple to complex MapReduce jobs using Java language for processing and validating the data.
- Developed a data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavioral data.
- Exported analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Implemented Spark jobs using Python and Spark SQL for faster data processing, along with algorithms for real-time analysis in Spark.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Used the Spark-Cassandra Connector to load data to and from Cassandra (see the Cassandra sketch after this list), and streamed data in real time using Spark with Kafka.
- Developed Kafka producers and consumers in Java, integrated them with Apache Storm, and ingested data into HDFS and HBase by implementing the rules in Storm.
- Built a prototype for real-time analysis using Spark Streaming and Kafka (see the streaming sketch after this list).
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Involved in creating Hive tables and working on them using HiveQL, and performed data analysis using Hive and Pig.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Expertise in extending Hive and Pig core functionalities by writing custom User Defined Functions (UDF).
- Used Impala to pull data from Hive tables.
- Worked on Apache Flume for collecting and aggregating huge amounts of log data and stored it in HDFS for further analysis.
- Created and developed an end-to-end data ingestion process onto Hadoop.
- Involved in architecture and design of distributed time-series database platform using NOSQL technologies like Hadoop/HBase, Zookeeper.
- Integrated NoSQL databases like HBase with MapReduce to move bulk amounts of data into HBase.
- Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
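A minimal sketch of a real-time Kafka prototype like the one mentioned above, written here with Spark Structured Streaming in Python (the original prototype may have used the Scala/DStream API). The broker address, topic, and payload fields are hypothetical, and the spark-sql-kafka package must be on the classpath.

```python
# Minimal sketch of a Kafka-to-Spark real-time prototype using Structured Streaming.
# Broker, topic, and field names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-streaming-prototype").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Subscribe to the Kafka topic; requires the spark-sql-kafka connector package.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "customer-events")
    .load()
)

# Parse the JSON payload and count events per type in one-minute windows.
events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e"),
    F.col("timestamp"),
)
counts = (
    events
    .groupBy(F.window("timestamp", "1 minute"), F.col("e.event_type"))
    .count()
)

# Write the running counts to the console for the prototype.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```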
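A minimal PySpark sketch of reading from and writing to Cassandra through the Spark-Cassandra Connector, as referenced above; the original work used Scala, the keyspace, table, and column names are hypothetical, and the spark-cassandra-connector package plus connection host must be configured on the cluster.

```python
# Minimal PySpark sketch using the Spark-Cassandra Connector.
# Keyspace, table, column names, and host are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-connector-sketch")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()
)

# Load a Cassandra table into a DataFrame.
sessions = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="user_sessions")
    .load()
)

# Aggregate and write the result back to another Cassandra table.
summary = sessions.groupBy("user_id").count()
(
    summary.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="session_counts")
    .mode("append")
    .save()
)
```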
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Zookeeper.
Hadoop developer
Confidential
Responsibilities:
- Identified System Requirements and Developed System Specifications, responsible for high-level design and development of use cases.
- Involved in designing Database Connections using JDBC.
- Organized and participated in meetings with clients and team members.
- Developed a web-based Bristow application using J2EE (Spring MVC framework), POJOs, JSP, JavaScript, HTML, jQuery, business classes, and queries to retrieve data from the backend.
- Developed client-side validation techniques using jQuery.
- Worked with Bootstrap to develop responsive web pages.
- Implemented client-side and server-side data validation using JavaScript.
- Responsible for customizing the data model for new applications using Hibernate ORM technology. Involved in implementing DAOs and DTOs using Spring with Hibernate ORM.
- Implemented Hibernate for the ORM layer in transacting with MySQL database.
- Developed authentication and access control services for the application using Spring LDAP.
- Experience in event-driven applications using AJAX, object-oriented JavaScript, JSON, and XML. Good knowledge of developing asynchronous applications using jQuery. Valuable experience with form validation using regular expressions and jQuery Lightbox.
- Used MySQL for the EIS layer.
- Involved in the design and development of the UI using HTML, JavaScript, and CSS.
- Designed and developed various data gathering forms using HTML, CSS, JavaScript, JSP and Servlets.
- Developed user interface modules using JSP, Servlets and MVC framework.
- Experience in implementing J2EE standards and MVC2 architecture using the Struts framework.
- Developed J2EE components on Eclipse IDE.
- Used JDBC to invoke stored procedures and for database connectivity.
- Deployed the applications on Tomcat Application Server.
- Developed RESTful web services using JSON.
- Created Java Beans accessed from JSPs to transfer data across tiers.
- Performed database modifications using SQL, PL/SQL, stored procedures, triggers, and views in Oracle 9i.
Environment: Java, JSP, Servlets, JDBC, Eclipse, Web services, Spring 3.0, Hibernate 3.0, MySQL, JSON, Struts, HTML, JavaScript, CSS