Hadoop/Big Data/Kafka Engineer Resume
Chicago, IL
PROFESSIONAL SUMMARY:
- 8 years of experience in Big Data analytics using various Hadoop ecosystem tools and the Spark framework; currently working extensively with Spark and Spark Streaming using Scala as the primary programming language
- Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs and security groups
- Good experience working with the Amazon EMR framework for processing large data sets using Spark
- Experience in building AWS Data Pipeline to configure data loads from S3 into Redshift and Snowflake
- Experience installing, configuring and maintaining Apache Hadoop clusters for application development, along with Hadoop ecosystem tools such as Sqoop, Hive, Pig, Flume, HBase, Kafka, Hue, Storm, ZooKeeper, Oozie, Cassandra and Python
- Worked with major distributions such as Cloudera (CDH 3 & 4) and Hortonworks, as well as AWS; also worked with Unix and data warehousing (DWH) in support of these distributions
- Experience with cloud databases and data warehouses, including SQL Azure and Redshift
- Hands-on experience in developing and deploying enterprise applications using major components of the Hadoop ecosystem such as Hadoop 2.x, YARN, Hive, Pig, MapReduce, Spark, Kafka, Storm, Oozie, HBase, Flume, Sqoop and ZooKeeper
- Experience with the AWS platform and its features, including IAM, EC2, EBS, VPC, RDS, CloudWatch, CloudTrail, CloudFormation, AWS Config, Auto Scaling, CloudFront, S3, SQS, SNS, Lambda and Route 53
- Experience in handling large datasets using partitioning, Spark's in-memory capabilities and broadcast variables with Scala and Python, applying effective and efficient joins, transformations and other operations during the ingestion process itself
- Experienced with different file formats like Parquet, ORC, CSV, Text, Sequence, XML, JSON and Avro files.
- Experience in developing data pipelines using Pig, Sqoop and Flume to extract data from weblogs and store it in HDFS, and in developing Pig Latin scripts and HiveQL queries for data analytics
- Worked extensively with Spark Streaming and Apache Kafka to ingest live stream data (a minimal sketch follows this list)
- Experience in converting Hive/SQL queries into Spark transformations using Java and experience in ETL development using Kafka, Flume and Sqoop
- Good experience in writing Spark applications using Scala and Java, building Scala projects with sbt and running them via spark-submit
- Experience working on NoSQL databases including HBase, Cassandra and MongoDB and experience using Sqoop to import data into HDFS from RDBMS and vice-versa
- Developed Spark scripts using Scala shell commands as per requirements
- Good experience in writing Sqoop jobs for transferring bulk data between Apache Hadoop and structured data stores
- Substantial experience in writing MapReduce jobs in Java and working with Pig, Flume, ZooKeeper, Hive and Storm
- Created multiple MapReduce Jobs using Java API, Pig and Hive for data extraction
- Strong expertise in troubleshooting and performance fine-tuning Spark, MapReduce and Hive applications
- Good experience working with the Amazon EMR framework for processing data on EMR and EC2 instances
- Extensive experience in developing applications that perform data processing tasks against Teradata, Oracle, SQL Server and MySQL databases
- Worked on data warehousing, ETL and BI tools such as Informatica, Pentaho and Tableau
- Experience in understanding Hadoop security requirements and integrating with Kerberos authentication and authorization infrastructure
- Familiar with Agile and Waterfall methodologies; handled several client-facing meetings with strong communication skills
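The following is a minimal, illustrative Scala sketch of the Spark Streaming with Kafka pattern referenced above (not taken from any specific engagement): a Structured Streaming job that consumes a Kafka topic. The broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.
```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .getOrCreate()

    // Subscribe to a Kafka topic (broker address and topic name are placeholders)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Write each micro-batch to the console; a real job would write to HDFS, Hive, etc.
    val query = events.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```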
TECHNICAL SKILLS:
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, ZooKeeper, Scala, PySpark, Spark, Kafka, Flume, Ambari, Hue
Hadoop Distributions: Cloudera CDH, Hortonworks HDP, MapR
Database: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Language: C, C++, Java, Scala, Python
AWS Components: IAM, S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS
Methodologies: Agile, Waterfall
Build Tools: Maven, Gradle, Jenkins.
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
IDE Tools: Eclipse, NetBeans, IntelliJ
Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML
Other Tools: Tableau, Datameer, AutoSys
Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
PROFESSIONAL EXPERIENCE:
Hadoop/Big Data/Kafka Engineer
Confidential, Chicago, IL
Responsibilities:
- Preparing Design Documents (Request-Response Mapping Documents, Hive Mapping Documents).
- Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
- Experienced in implementing Spark RDD transformations and actions to carry out business analysis.
- Migrated HiveQL queries on structured data to Spark SQL to improve performance.
- Documented the data flow from the application through Kafka and Storm to HDFS and Hive tables.
- Configured, deployed and maintained a single-node Storm cluster in the DEV environment.
- Developed predictive analytics using Apache Spark Scala APIs.
- Implemented Spring Boot microservices to process messages into the Kafka cluster.
- Worked closely with the Kafka admin team to set up Kafka clusters in the QA and production environments.
- Knowledgeable in Kibana and Elasticsearch for identifying Kafka message failure scenarios.
- Implemented reprocessing of failed messages in Kafka using their offset IDs (see the sketch after this section).
- Implemented Kafka producer and consumer applications on a Kafka cluster coordinated by ZooKeeper.
- Used the Spring Kafka API to process messages reliably on the Kafka cluster.
- Knowledgeable in Kafka message partitioning and configuring replication factors in the Kafka cluster.
- Worked on Big Data Integration & Analytics based on Hadoop, Solr, Spark, Kafka, Storm and webMethods.
- Developed solutions to pre-process large sets of structured and semi-structured data in different file formats (text, Avro, Sequence, XML, JSON, ORC and Parquet).
- Handled importing of data from RDBMS into HDFS using Sqoop.
- Collected JSON data from an HTTP source and developed Spark APIs to perform inserts and updates on Hive tables.
- Developed Spark scripts to import large files from Amazon S3 buckets.
- Developed Spark core and Spark SQL scripts using Scala for faster data processing.
- Experienced in data cleansing processing using Pig Latin operations and UDFs.
- Experienced in writing Hive Scripts for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Involved in creating Hive tables, loading them with data and writing Hive queries to process the data.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Created scripts to automate the process of Data Ingestion.
- Experience using big data testing frameworks such as MRUnit and PigUnit to test raw data, and executed performance scripts. Extracted, transformed and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
- Utilized Spark SQL to extract and process data by parsing it with Datasets or RDDs in HiveContext, applying transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Extended the capabilities of DataFrames using user-defined functions in Python and Scala.
- Resolved missing fields in DataFrame rows using filtering and imputation.
- Involved in Agile methodologies, daily scrum meetings and sprint planning.
- Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
- Implemented discretization and binning, data wrangling, cleaning, transforming, merging and reshaping data frames using Python.
- Wrote an MR2 batch job to fetch the required data from the database and store it as a static CSV file.
- Developed a Spark job to process files from Vision EMS and AMN Cache, identify violations and send them to Smarts as SNMP traps.
- Automated workflows using shell scripts to schedule Spark jobs via crontab.
- Developed data pipeline using Flume, Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Experience in deploying data from various sources into HDFS and building reports using Tableau.
- Developed a data pipeline using Kafka and Storm to store data into HDFS.
- Developed REST APIs using Scala and Play framework to retrieve processed data from Cassandra database.
- Performed real time analysis on the incoming data.
- Re-engineered n-tiered architecture involving technologies like EJB, XML and JAVA into distributed applications.
- Loaded data into HBase using both bulk and non-bulk loads.
- Created HBase column families to store various data types coming from various sources.
- Loaded data into the cluster from dynamically generated files
- Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures
- Created common audit and error-logging processes along with job monitoring and reporting mechanisms
- Troubleshooting performance issues with ETL/SQL tuning.
- Developed and maintained the continuous integration and deployment systems using Jenkins, Ant, Akka and Maven.
- Effectively used Git (version control) to collaborate with the Akka team members.
Environment: HDFS, Apache Spark, Kafka, Cassandra, Hive, Scala, Java, Sqoop, shell scripting.
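Illustrative sketch of the offset-based reprocessing item above: a minimal Scala example using the standard kafka-clients consumer API. The broker, group id, topic, partition and offset are hypothetical placeholders.
```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

object ReplayFromOffsetSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // placeholder broker
    props.put("group.id", "replay-group")            // placeholder consumer group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")

    val consumer = new KafkaConsumer[String, String](props)
    val partition = new TopicPartition("orders", 0)  // placeholder topic and partition

    // Assign the partition explicitly and rewind to the offset of the first failed message
    consumer.assign(Collections.singletonList(partition))
    consumer.seek(partition, 42000L)                 // placeholder offset id

    // Poll once and reprocess whatever comes back; a real job would loop and commit offsets
    consumer.poll(Duration.ofSeconds(5)).asScala.foreach { record =>
      println(s"reprocessing offset=${record.offset()} value=${record.value()}")
    }
    consumer.close()
  }
}
```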
Hadoop/Big Data/Kafka Engineer
Confidential, McLean, VA
Responsibilities:
- Installed and configured Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Developed Simple to complex MapReduce Jobs using Hive.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Worked with Senior Engineer on configuring Kafka for streaming data.
- Developed Spark jobs by using Scala as per the requirement.
- Worked on Big Data Integration & Analytics based on Hadoop, Solr, Spark, Kafka, Storm and webMethods.
- Worked on analyzing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
- Performed processing on large sets of structured, unstructured and semi structured data.
- Created Kafka applications that monitor consumer lag within Apache Kafka clusters (see the sketch after this section).
- Handled importing of data from various data sources using Sqoop, performed transformations using Spark and loaded the data into DynamoDB.
- Analyzed the data by performing Hive queries and used visualization tools to generate insights and analyze customer behavior.
- Worked with the Spark ecosystem using Scala and Spark SQL queries on different data formats such as text files and Parquet.
- Used Hive UDFs to implement business logic in Hadoop.
- Implemented business logic by writing custom UDFs in Python and used various built-in UDFs.
- Responsible for migrating workloads from Hadoop MapReduce to Spark for in-memory distributed computing in real-time fraud detection.
- Used Spark to cache data in memory.
- Implemented batch processing of data sources using Apache Spark.
- Responsible for data cleaning, feature scaling and feature engineering using NumPy and Pandas in Python.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with HiveQL.
- Developed and maintained the continuous integration and deployment systems using Jenkins, Ant, Akka and Maven.
- Effectively used Git (version control) to collaborate with the Akka team members.
- Involved in creating Hive tables, loading data and writing Hive queries that run internally as MapReduce jobs.
- Developed predictive analytics using Apache Spark Scala APIs.
- Provided cluster coordination services through ZooKeeper.
- Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution.
- Used Docker as part of CI/CD to build and deploy applications using ECS in AWS.
- Installed Oozie workflow engine to run multiple Hive and Pig jobs.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Hadoop, MapReduce, Akka, HDFS, Hive, Python, Kafka, SQL, Cloudera Manager, Pig, Apache Sqoop, Spark, Oozie, HBase, AWS, PL/SQL, MySQL and Windows.
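Illustrative sketch of the consumer-lag monitoring item above: a minimal Scala example using the Kafka AdminClient API. The broker address and consumer group name are hypothetical, and it assumes a Kafka client version that supports listOffsets (2.5 or later).
```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, OffsetSpec}
import scala.jdk.CollectionConverters._

object ConsumerLagSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")      // placeholder broker
    val admin = AdminClient.create(props)
    val groupId = "analytics-consumers"                 // placeholder consumer group

    // Offsets the consumer group has committed, keyed by topic-partition
    val committed = admin.listConsumerGroupOffsets(groupId)
      .partitionsToOffsetAndMetadata().get().asScala

    // Latest (log-end) offsets for the same partitions
    val latestSpec = committed.keys.map(tp => tp -> OffsetSpec.latest()).toMap.asJava
    val latest = admin.listOffsets(latestSpec).all().get().asScala

    // Lag per partition = log-end offset minus committed offset
    committed.foreach { case (tp, meta) =>
      println(s"$tp lag=${latest(tp).offset() - meta.offset()}")
    }
    admin.close()
  }
}
```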
Spark/Hadoop/Big Data Engineer
Confidential, Boston, MA
Responsibilities:
- Developed Spark applications using Scala.
- Used DataFrames/Datasets to write SQL-style queries with Spark SQL to work with datasets.
- Performed real-time streaming jobs using Spark Streaming to analyze incoming data from Kafka over regular window intervals.
- Created Hive tables and had extensive experience with HiveQL.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
- Extended Hive functionality by writing custom UDFs, UDAFs, UDTFs to process large data.
- Performed Hive UPSERTS, partitioning, bucketing, windowing operations, efficient queries for faster data operations.
- Involved in moving data from HDFS to AWS Simple Storage Service (S3) and extensively worked with S3 bucket in AWS.
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts
- Responsible for loading bulk amounts of data into HBase using MapReduce by directly creating HFiles and loading them.
- Developed a Spark application to filter JSON source data in an AWS S3 location and store it into HDFS with partitions, and used Spark to extract the schema of the JSON files.
- Imported and exported data between relational database systems and HDFS/Hive using Sqoop.
- Wrote custom Kafka consumer code and modified existing producer code in Python to push data to Spark Streaming jobs.
- Scheduled jobs and automated workflows using Oozie.
- Automated the movement of data using NIFI dataflow framework and performed streaming and batch processing via micro batches. Controlled and monitored data flow using web UI.
- Worked with HBase database to perform operations with large sets of structured, semi-structured and unstructured data coming from different data sources.
- Exported analytical results to MS SQL Server and used Tableau to generate reports and visualization dashboards.
- Interacted with business/IT stakeholders and other involved teams to understand requirements and build data pipelines
- Leveraged Hortonworks HDP cluster on AWS EC2 instances to support data storage and processing needs.
- Developed data ingestion framework to import data from PostgreSQL into data lake using Spark's JDBC connectors.
- Built PySpark APIs to enrich raw trips data and perform aggregations on the enriched data.
- Designed and implemented ETL jobs to load, transform and join data from multiple data sources using PySpark APIs.
- Developed Spark jobs to perform data validations and data quality checks on raw and transformed data
- Designed and developed data lake delete (DLD) process to be in compliance with GDPR regulations.
- Developed and implemented real-time data pipelines with Spark Streaming, Kafka, and Cassandra to replace existing lambda architecture without losing the fault-tolerant capabilities of the existing architecture.
- Created a Spark Streaming application to consume real-time data from Kafka sources and applied real-time data analysis models that we can update on new data in the stream as it arrives.
- Built Spark jobs to efficiently read and write data from/to AWS S3 buckets.
- Created Athena tables by defining crawlers to create table definitions in AWS Glue Data Catalog.
- Led efforts for migrating from a shared multi-tenant HDP cluster to transient EMR clusters
- Addressed data skew issues by implementing a salting technique in Spark aggregation jobs (see the sketch after this section)
- Designed and created Hive tables to perform data analysis on HDFS data
- Assisted other teams and engineers with performance tuning of Spark jobs
- Supported data science teams by building performant data platforms to generate business insights
- Leveraged GitHub for source code version control, and Jenkins for scheduling data pipelines job execution
- Adhered to Agile process using Scrum methodology.
Environment: Hadoop, Spark, Spark SQL, PySpark, Python, Scala, PostgreSQL, AWS, EMR, S3, Athena, Hortonworks, Cloudera, HDFS, Hive, HiveQL, ZooKeeper, Kafka, Sqoop, MapReduce, Oozie, Tableau, MS SQL Server, HBase, GitHub, Jenkins, Eclipse, Scrum, Agile.
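Illustrative sketch of the salting pattern used to address data skew above. The project work was in PySpark; the sketch is shown in Scala for consistency with the other examples, and the dataset, column names and salt-bucket count are hypothetical.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltedAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("salting-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical skewed trips dataset: a few cities dominate the key distribution
    val trips = Seq(("nyc", 12.5), ("nyc", 3.2), ("bos", 7.1), ("nyc", 9.9))
      .toDF("city", "fare")
    val saltBuckets = 8

    // Stage 1: append a random salt so each heavy key is spread across several
    // partitions, then pre-aggregate per (key, salt)
    val partial = trips
      .withColumn("salt", (rand() * saltBuckets).cast("int"))
      .groupBy($"city", $"salt")
      .agg(sum($"fare").as("fare_sum"), count(lit(1)).as("cnt"))

    // Stage 2: drop the salt and combine the partial aggregates per original key
    val result = partial
      .groupBy($"city")
      .agg(sum($"fare_sum").as("total_fare"), sum($"cnt").as("trip_count"))

    result.show()
    spark.stop()
  }
}
```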
Hadoop Developer
Confidential, New York, NY
Responsibilities:
- Responsible for implementation, administration and management of Hadoop infrastructures
- Evaluated Hadoop infrastructure requirements and designed/deployed solutions (high availability, big data clusters); involved in cluster monitoring and troubleshooting Hadoop issues
- Worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required
- Helped maintain and troubleshoot UNIX and Linux environment
- Analyzed and evaluated system security threats and safeguards
- Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
- Experienced in handling data from different datasets, join them and preprocess using Pig join operations.
- Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
- Developed different kinds of custom filters and applied pre-defined filters on HBase data using the API (see the sketch after this section).
- Imported and exported data from Teradata to HDFS and vice-versa.
- Strong understanding of Hadoop eco system such as HDFS, MapReduce, HBase, Zookeeper, Pig, Hadoop streaming, Sqoop, Oozie and Hive
- Implemented counters on HBase data to count total records in different tables.
- Experienced in handling Avro data files by passing the schema into HDFS using Avro tools and MapReduce.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV, etc.
- Used Amazon Web Services to perform big data analytics.
- Implemented secondary sorting to sort reducer output globally in MapReduce.
- Implemented a data pipeline by chaining multiple mappers using ChainMapper.
- Created Hive Dynamic partitions to load time series data
- Handled different types of joins in Hive, such as map joins, bucket map joins and sorted bucket map joins.
- Created tables, partitions and buckets, and performed analytics using Hive ad-hoc queries.
- Experienced in importing/exporting data between HDFS/Hive and relational databases and Teradata using Sqoop.
- Handled continuous streaming data from different sources using Flume and set destination as HDFS.
- Integrated Spring schedulers with the Oozie client as beans to handle cron jobs.
- Experience with CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters
- Actively participated in software development lifecycle (scope, design, implement, deploy, test), including design and code reviews.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
- Worked on the Spring Framework for multi-threading.
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, RDBMS, flat files, Teradata, MySQL, CSV, Avro data files, Java, J2EE.
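Illustrative sketch of applying a pre-defined HBase filter through the client API, as referenced above (written in Scala over the HBase 1.x Java client for consistency with the other examples; the table name, column family, qualifier and value are hypothetical).
```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes
import scala.jdk.CollectionConverters._

object HBaseFilterScanSketch {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("events"))  // placeholder table

    // Keep only rows whose "info:status" column equals "ERROR"
    val filter = new SingleColumnValueFilter(
      Bytes.toBytes("info"), Bytes.toBytes("status"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("ERROR"))

    val scan = new Scan()
    scan.setFilter(filter)

    // Scan the table and print the matching row keys
    val scanner = table.getScanner(scan)
    scanner.asScala.foreach(result => println(Bytes.toString(result.getRow)))

    scanner.close()
    table.close()
    connection.close()
  }
}
```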
Hadoop Developer
Confidential
Responsibilities:
- Developed several advanced Map Reduce programs to process data files received.
- Developed Map Reduce Programs for data analysis and data cleaning.
- Firm knowledge on various summarization patterns to calculate aggregate statistical values over dataset.
- Experience in implementing joins in the analysis of dataset to discover interesting relationships.
- Completely involved in the requirement analysis phase.
- Extending Hive and Pig core functionality by writing custom UDFs.
- Worked on partitioning Hive tables and running the scripts in parallel to reduce their run time.
- Strong expertise in internal and external Hive tables; created Hive tables to store the processed results in a tabular format.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Developed Pig Scripts and Pig UDFs to load data files into Hadoop.
- Analyzed the data by performing Hive queries and running Pig scripts.
- Developed PIG Latin scripts for the analysis of semi structured data and unstructured data.
- Strong knowledge on the process of creating complex data pipelines using transformations, aggregations, cleansing and filtering.
- Experience in writing cron jobs to run at regular intervals.
- Wrote ETL scripts in Python/SQL for extraction and validating the data.
- Create data models in Python to store data from various sources.
- Developed MapReduce jobs for Log Analysis, Recommendation and Analytics.
- Experience in using Flume to efficiently collect, aggregate and move large amounts of log data.
- Involved in loading data from edge node to HDFS using shell scripting.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Experience in managing and reviewing Hadoop log files.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
Environment: Hadoop, Python, Apache Pig, Apache Hive, MapReduce, HDFS, Flume, GIT, UNIX Shell scripting, PostgreSQL, Linux.
SQL/Java Developer
Confidential
Responsibilities:
- Worked with several clients on day-to-day requests and responsibilities
- Designed and developed a Struts-like MVC 2 web framework using the front-controller design pattern, which has been used successfully in a number of production systems
- Wrote SQL queries to perform back-end database operations
- Wrote various SQL and PL/SQL queries and stored procedures for data retrieval
- Prepared utilities for unit testing of the application using JSP and Servlets
- Developed Database applications using SQL and PL/SQL
- Applied design patterns and Object-Oriented design concept to improve the existing Java/J2EE based code base
- Converted PowerCenter code to BDM mappings using Informatica Developer.
- Identified and fixed transactional issues due to incorrect exception handling and concurrency issues due to unsynchronized blocks of code
- Resolved product complications at customer sites and relayed the findings to the development and deployment teams to adopt a long-term product development strategy with minimal roadblocks
- Convinced business users and analysts of alternative solutions that were more robust and simpler to implement from a technical perspective while satisfying the functional requirements from the business perspective
- Played a crucial role in developing persistence layer
- Developed RESTful web services and RESTful APIs.
- Developed screens using Java, HTML, DHTML, CSS, JSP and JavaScript.
- Diagnosed and corrected errors within Java/HTML/PHP code to allow for connection to and utilization of proprietary applications.
- Provided end-user support and administrative functions, including password and account management.
- Used Quartz schedulers to run jobs sequentially within the given time window.
- Used JSP and JSTL Tag Libraries for developing User Interface components.
- Analyzed, developed, tuned, tested, debugged and documented processes using SQL, PL/SQL, Informatica, UNIX and Control-M
- Documented technical specs, class diagrams and sequence diagrams, developed technical design documents based on changes. Analyzed Portal capabilities and scalability and identified area where Portal could be used to enhance usability and improve productivity
Environment: Java, HTML, Servlets, JSP, Junit Testing, J2EE, JSP, Eclipse, SQL, Windows, PL/SQL, Oracle, Informatica, UNIX, Control-M.