Big Data/PySpark Developer Resume
Eden Prairie, MN
SUMMARY
- Over 8+ years of experience in the IT industry, including 4+ years developing large-scale applications using Hadoop and other big data tools and 1 year developing Spark applications.
- Well experienced with Hadoop ecosystem components such as MapReduce, Cloudera, Hortonworks, Mahout, HBase, Oozie, Hive, Sqoop, Pig, and Flume.
- Experience in developing solutions by analyzing large data sets efficiently.
- Good experience with data lake projects.
- Experience with distributed systems, large-scale non-relational data stores, MapReduce systems, data modeling, and big data systems.
- Good working experience with programming languages such as Java, Scala, Python, and R.
- Knowledge of implementing big data workloads on Amazon Elastic MapReduce (Amazon EMR) to process and manage the Hadoop framework on dynamically scalable Amazon EC2 instances.
- In-depth understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
- Experience with Amazon Web Services, AWS command line interface, and AWS data pipeline.
- Experience with different distributions such as MapR, Cloudera, Hortonworks, and EMR (AWS).
- Extensive hands-on experience writing complex MapReduce jobs, Pig scripts, and Hive data models; using Sqoop to move data between RDBMS and HDFS in both directions; and extending Hive and Pig core functionality with custom UDFs.
- Good hands-on experience running Pig and Hive scripts on Tez.
- Experience converting MapReduce applications to Spark and developing streaming applications using the Scala Akka framework.
- Good working knowledge of job scheduling and workflow design tools such as Oozie.
- Experience working with BI teams to transform big data requirements into Hadoop-centric solutions.
- Good knowledge of Hadoop cluster administration, including monitoring and managing clusters with Cloudera Manager.
- Good hands-on experience creating real-time data streaming solutions using Apache Spark/Spark Streaming, Apache Storm, Kafka, and Flume.
- Good experience with file formats such as ORC, Avro, and Parquet.
- Good understanding of Data Mining and Machine Learning techniques.
- Experience in handling messaging services using Apache Kafka.
- Experience in fine-tuning MapReduce jobs for better scalability and performance.
- Worked extensively with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Good knowledge of database management systems such as Vertica and Couchbase.
- Good knowledge of Kerberos for user authentication and for running clusters in secure mode.
- Good working experience with Redshift and snowflake data warehouse schemas.
- Expertise in design patterns including Front Controller, Data Access Object, Session Facade, Business Delegate, Service Locator, MVC, Data Transfer Object and Singleton.
- Sound understanding of relational database concepts; worked extensively with Oracle, MySQL, DB2, and SQL Server.
- Good experience with databases, writing complex queries and stored procedures using SQL and PL/SQL; expertise with tools such as SQL Workbench, SQL Developer, and TOAD for accessing database servers.
- Experience in developing and implementing web applications using Java, JSP, jQuery UI, CSS, HTML, HTML5, XHTML, JavaScript, AJAX, JSON, XML, JDBC, and JNDI.
- Expertise in writing shell scripts, cron automation, and regular expressions.
- Expertise in Web Services architecture in SOAP and WSDL using JAX-RPC.
- Expertise in using version control tools such as Subversion (SVN), Rational ClearCase, CVS, and Git.
- Expert in Agile methodologies such as Scrum, Test-Driven Development, incremental and iterative development, and pair programming.
- Experience in writing SQL, PL/SQL queries, Stored Procedures for accessing and managing databases such as SQL, MySQL, and IBM DB2.
- Involved in all phases of Software Development Life Cycle (SDLC) in large scale enterprise software using Object Oriented Analysis and Design.
- Highly motivated team player with zeal to learn new technologies.
TECHNICAL SKILLS
Big Data Skillset - Frameworks & Environments: Cloudera CDH, Hortonworks HDP, Hadoop 1.0, Hadoop 2.0, HDFS, MapReduce, Pig, Hive, Impala, HBase, Data Lake, Cassandra, MongoDB, Mahout, Sqoop, Oozie, ZooKeeper, Flume, Splunk, Spark, Storm, Kafka, YARN, Falcon, Avro.
Amazon Web Services (AWS): Elastic MapReduce (EMR), Amazon EC2, Amazon S3, AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline, Amazon CloudFront, AWS Import/Export, Amazon Redshift.
Java & J2EE Technologies: Core Java, Hibernate, Spring, JSP, Servlets, Java Beans, JDBC, EJB 3.0, JSON, JavaScript, jQuery, JSF, PrimeFaces, XML, HTML, XHTML, CSS, SOAP, XSLT, and DHTML.
Messaging Services & Frameworks: JMS, MQ Series, MDB, Struts, Spring 3.2, MVC, Spring Web Flow, AJAX.
IDE & Build Tools: Eclipse, NetBeans, IntelliJ IDEA, Spring Tool Suite (STS), Hue, Maven, SBT, Gradle.
Web Services & Technologies: XML, HTML, XHTML, JNDI, HTML5, AJAX, jQuery, CSS, JavaScript, AngularJS, VBScript, WSDL, SOAP, JDBC, ODBC; architectures: REST, MVC.
Databases & Application Servers: Oracle 8i/9i/10g/11i, MySQL, DB2 8.x/9.x, Cassandra, MongoDB, HBase, MS Access, Teradata, Vertica, Microsoft SQL Server 2000, PostgreSQL.
Other Tools: PuTTY, WinSCP, Talend, Tableau, Datameer, GitHub, SVN, CVS.
PROFESSIONAL EXPERIENCE
Big Data/PySpark Developer
Confidential - Eden Prairie, MN
Responsibilities:
- Worked on a scalable distributed data system using the Hadoop ecosystem on the MapR distribution.
- Gained strong working experience with the MapR distribution.
- Used Python to develop a custom framework for generating rules (similar to a rules engine).
- Developed Hadoop Streaming jobs in Python to integrate applications that expose Python APIs (a streaming mapper/reducer sketch follows this list).
- Developed Python metaprograms that apply business rules to the data and automatically create, spawn, monitor, and then terminate customized programs as dictated by the events and problems detected within the data stream.
- Developed Python code to gather data from HBase and designed the solution for implementation in PySpark.
- Developed Python code using Pandas DataFrames to read rule definitions from Excel and create Measure Rule objects dynamically.
- Used Apache Spark DataFrames/RDDs to apply business transformations and HiveContext objects to perform read/write operations (see the PySpark sketch after this list).
- Rewrote Hive queries in Spark SQL to reduce the overall batch time.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Spark, Hive, and Sqoop) as well as system-specific jobs (such as Python programs and shell scripts).
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked on Python and R scripts for statistical analysis and data-quality reporting.
- Good experience reading and interpreting R code to analyze machine learning models.
- Used Python for statistical analysis of data, including data-quality checks and confidence intervals.
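Below is a minimal sketch of the rule-driven PySpark flow described in the bullets above: rule definitions are read from Excel with Pandas and applied as filters over a Hive table through a HiveContext. The file name, table names, and columns (measure_rules.xlsx, claims, operator, threshold) are hypothetical placeholders, not the actual project artifacts.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext
import pandas as pd

sc = SparkContext(appName="measure-rules")
hive_ctx = HiveContext(sc)

# Hypothetical rule sheet: one row per measure rule (column, operator, threshold).
rules = pd.read_excel("measure_rules.xlsx")

# Source data lives in a Hive table; HiveContext exposes it as a DataFrame.
df = hive_ctx.table("claims")

# Apply each rule as a filter; only simple ">=" rules are handled in this sketch.
for _, rule in rules.iterrows():
    if rule["operator"] == ">=":
        df = df.filter(df[rule["column"]] >= rule["threshold"])

# Persist the rule-filtered result back to Hive for downstream reporting.
df.write.mode("overwrite").saveAsTable("claims_measures")
```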
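As a companion to the Hadoop Streaming bullet, here is a toy mapper/reducer pair in Python; the tab-delimited field layout and the invocation paths are assumptions for illustration, not the production job.

```python
#!/usr/bin/env python
# mapper.py - emit (key, 1) for each valid record; malformed lines are skipped.
# Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#       -mapper mapper.py -reducer reducer.py -input /data/raw -output /data/counts
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and fields[1]:      # assumed: second field is the grouping key
        print("%s\t1" % fields[1])
```

```python
#!/usr/bin/env python
# reducer.py - sum the per-key counts produced by the mapper (input arrives sorted by key).
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key and current_key is not None:
        print("%s\t%d" % (current_key, total))
        total = 0
    current_key = key
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```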
Environment: MapReduce, Python, HDFS, Hive, Pig, Tez, Oozie, HBase, PySpark, Spark, Scala, Spark SQL, UNIX, PuTTY, WinSCP, Shell Scripting, YARN.
Hadoop/Spark Developer
Confidential - Jersey City, NJ
Responsibilities:
- Worked on a scalable distributed data system using the Hadoop ecosystem on HDP (Hortonworks Data Platform).
- Developed simple to complex MapReduce and streaming jobs using Java, Hive, and Pig.
- Used various compression mechanisms to optimize MapReduce jobs and use HDFS efficiently.
- Transformed the imported data using Hive and MapReduce.
- Used ETL component Sqoop to extract the data from MySQL and load data into HDFS.
- Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
- Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs and developed Oozie workflows for MapReduce and HiveQL jobs.
- Strong exposure to UNIX scripting and good hands-on shell scripting experience.
- Wrote Python scripts to process semi-structured data in formats like JSON (a small JSON-flattening sketch follows this list).
- Worked closely with the data modelers to model the new incoming data sets.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Worked with the team to update cluster configurations and installed ZooKeeper to maintain high availability of the NameNode.
- Troubleshot and located bugs in Hadoop applications, working with the testing team to resolve them.
- Good hands-on experience with the Kafka Java API, developing producers and consumers that use Avro schemas.
- Installed the Ganglia monitoring tool to generate Hadoop cluster reports (CPUs running, hosts up/down, etc.) and performed operations to maintain the cluster.
- Good hands-on experience with real-time data ingestion using Kafka and real-time processing through Storm (spouts and bolts), persisting data into HBase for analytics.
- Responsible for analyzing and cleaning data using Spark SQL queries.
- Handled importing data from various sources, performed transformations using Spark, and loaded the data into Hive.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules.
- Used Scala to write the code for all the Spark use cases, with extensive data-analytics work on the Spark cluster, and performed map-side joins on RDDs.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
- Migrated an existing on-premises application to AWS.
- Used AWS services like EC2 and S3 for small data sets.
- Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
- Good experience with an Enterprise Data Hub (EDH) for working with third-party BI tools.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data (see the pipeline sketch after this list).
- Good hands-on experience with microservices on Cloud Foundry for real-time data streaming, persisting data to HBase and communicating with RESTful web services through a Java API.
- Exported the analyzed data to relational databases using Sqoop for visualization and BI reporting, and worked with Datameer.
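A compact sketch of the Kafka-to-Hive pipeline mentioned above, written here with Spark Structured Streaming in PySpark (it needs the spark-sql-kafka connector on the classpath). The broker address, topic, schema, and output paths are illustrative assumptions rather than the project's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical event schema for the JSON messages arriving on the topic.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

# Ingest: read the Kafka topic as a streaming DataFrame.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "events")                       # assumed topic
       .load())

# Transform: parse the JSON payload and keep only well-formed records.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*")
          .filter(col("event_id").isNotNull()))

# Load: append micro-batches as Parquet files under a path that an external Hive table can point at.
query = (events.writeStream
         .format("parquet")
         .option("path", "/warehouse/events")               # assumed location
         .option("checkpointLocation", "/tmp/chk/events")
         .start())
query.awaitTermination()
```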
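The JSON-processing bullet above can be illustrated with a small standalone script that flattens nested documents into tab-separated records; the file names and the assumption of one JSON document per line are mine, not the project's.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested JSON objects into dotted keys (e.g. user.address.city)."""
    items = {}
    for key, value in record.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

# Assumes every record carries the same fields, so the sorted keys line up as columns.
with open("events.json") as src, open("events_flat.tsv", "w") as dst:
    for line in src:
        flat = flatten(json.loads(line))
        dst.write("\t".join(str(flat[k]) for k in sorted(flat)) + "\n")
```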
Environment: MapReduce, S3, EC2, EMR, Java, HDFS, Hive, Pig, Tez, Oozie, HBase, PySpark, Spark, Scala, Spark SQL, Kafka, Python, LINUX, PuTTY, Cassandra, Shell Scripting, ETL, YARN.
Hadoop Developer
Confidential
Responsibilities:
- Determining the viability of a business problem for a Big Data solution.
- Worked on a 42-node CDH cluster.
- Worked with highly unstructured and semi-structured data sets of 120 TB in size.
- Proactively monitored systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Responsible for managing data coming from various sources.
- Documented the system's processes and procedures for future reference.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Monitored multiple Hadoop cluster environments using Ganglia; monitored workload, job performance, and capacity planning using Cloudera Manager.
- Involved in time series data representation using HBase.
- Wrote MapReduce programs in Java on log data to transform it into a structured form and derive user location, age group, and time spent.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Automated workflows using Oozie on the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce (Java), Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Used Flume to collect, aggregate, and store the web log data from various sources like web servers, mobile and network devices and pushed to HDFS.
- Good working experience with Splunk for real-time log data monitoring.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website.
- Built a cluster in the AWS environment using EMR with S3, EC2, and Redshift.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Good hands-on experience with PySpark, using Spark libraries from Python scripts for data analysis (see the sketch after this list).
- Worked with the Tableau (BI) team on dataset requirements and gained good working experience with data visualization.
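A small PySpark sketch of the kind of web-log analysis this role involved (unique visitors per day and page views); the log location and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, count

spark = SparkSession.builder.appName("weblog-metrics").getOrCreate()

# Assumed layout: CSV logs with date, visitor_id, page, and duration columns.
logs = spark.read.option("header", "true").csv("/data/weblogs/*.csv")

# Unique visitors and page views per day, mirroring the HiveQL metrics above.
daily = (logs.groupBy("date")
         .agg(countDistinct("visitor_id").alias("unique_visitors"),
              count("page").alias("page_views"))
         .orderBy("date"))

daily.show(truncate=False)
```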
Environment: AWS S3, EC2, EMR, Hadoop, HDFS, Pig, Hive, MapReduce, Splunk, PySpark, Sqoop, Spark, Spark SQL, LINUX, Cloudera, Big Data, Ganglia, Java APIs, Java Collections, Python, SQL Server, Cassandra, HBase.
Hadoop Developer
Confidential
Responsibilities:
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Hive, MapReduce, Pig, and Flume.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Managed Hadoop clusters using Cloudera. Extracted, Transformed, and Loaded (ETL) of data from multiple sources like Flat files, XML files, and Databases.
- Worked on Talend ETL to load data from various sources into the data lake, using tMap, tReplicate, tFilterRow, tSort, and various other Talend components.
- Developed datasets that follow CDISC standards and stored them in HDFS (S3).
- Good experience with the ODM and SEND XML data formats when interacting with user data.
- Migrated ETL jobs to Pig scripts that perform transformations, joins, and some pre-aggregations before storing the data in HDFS, using UDFs developed in Python and Java.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to any warning or failure conditions.
- Good knowledge on Amazon EMR (Elastic Map Reduce).
- Installed, configured and optimized Hadoop infrastructure using Apache Hadoop and Cloudera Hadoop distributions. Developed Simple to complex MapReduce Jobs using Hive and Pig.
- Managed and scheduled Jobs on a Hadoop cluster. Designed a data warehouse using Hive.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed Pig UDFs to pre-process the data for analysis and wrote Hive queries for the analysts (a Python UDF sketch follows this list).
- Developed Oozie workflows to automate loading data into HDFS and pre-processing with Pig; handled cluster coordination services through ZooKeeper.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
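A minimal example of the kind of Python UDF used to pre-process fields for Pig; the function name, schema, and alias are illustrative only.

```python
# udfs.py - registered in a Pig script with:
#   REGISTER 'udfs.py' USING jython AS myudfs;
# and then called as myudfs.normalize(field) inside a FOREACH ... GENERATE.
from pig_util import outputSchema

@outputSchema('normalized:chararray')
def normalize(value):
    """Trim whitespace and lower-case a field before analysis."""
    if value is None:
        return None
    return value.strip().lower()
```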
Environment: Hadoop, Talend, HBase, Python, Cloudera, ETL, HDFS, Hive, Java (JDK 1.7), Pig, ZooKeeper, Oozie, Flume.
Hadoop/ETL Developer (Intern)
Confidential
Responsibilities:
- Worked on designing PoCs for implementing various ETL processes.
- Responsible for building scalable distributed data solutions using Hadoop.
- Analyzed large data sets by running Hive Queries and Pig scripts.
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Extracted data from Teradata into HDFS using Sqoop (see the import sketch after this list).
- Developed simple to complex MapReduce jobs.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Mentored the analyst and test teams in writing Hive queries.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install Hadoop updates, patches and version upgrades as required.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Implemented best income logic using Pig scripts and UDFs.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on tuning the performance for Hive and Pig queries.
- Developed UNIX shell scripts to automate repetitive database processes.
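The Teradata-to-HDFS extraction above boils down to a Sqoop import; the sketch below wraps one in a small Python script so it can be scheduled alongside other automation. The JDBC URL, credentials file, table, and target directory are hypothetical, and the Teradata connector jars are assumed to be installed.

```python
import subprocess

# Hypothetical connection details; real credentials belong in a secured password file.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:teradata://td-host/DATABASE=sales",  # assumed JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.td_pass",
    "--table", "ORDERS",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
]
subprocess.check_call(cmd)
```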
Environment: HDFS, Hive, HBase, MapReduce, Pig, Sqoop, UNIX Shell Scripting, Teradata, Python.
JAVA/J2EE Developer
Confidential
Responsibilities:
- Designed, configured and developed the web application using JSP, Jasper Report, JavaScript, HTML.
- Developed Session Beans for JSP clients. Configured and Deployed EAR & WAR files on WebSphere Application Server.
- Defined and designed the layers and modules of the project using OOAD methodologies and standard J2EE design patterns & guidelines.
- Designed and developed all the user interfaces using JSP, Servlets and Spring framework.
- Used Hibernate framework and Spring JDBC framework modules for backend communication in the extended application.
- Developed the DAO layer using Hibernate and used caching system for real time performance.
- Designed the application to allow all users to utilize core functionality, as well as business specific functionality based on log on ID.
- Developed Web Service provider methods (bottom up approach) using WSDL, XML and SOAP for transferring data between the Applications.
- Configured Java Message Service (JMS) on WebSphere Server using the Eclipse IDE.
- Used AJAX for developing asynchronous web applications on client side.
- Designed various applications using multi-threading concepts, mostly used to perform time consuming tasks in the background. Wrote JSP & Servlets classes to generate dynamic HTML pages. Designed class and sequence diagrams for Modify and Add modules.
- Designed and developed XML processing components for dynamic menus on the application.
- Adopted Spring framework for the development of the project.
- Developed the user interface presentation screens using HTML.
- Coordinated with the QA lead on developing the test plan, test cases, and test code and on actual testing; responsible for defect allocation and resolution.
- Maintained the existing code base developed in the Spring and Hibernate frameworks by incorporating new features and fixing bugs; used Log4j for application logging and debugging.
- Involved in fixing bugs and unit testing with test cases using JUnit framework.
- Developed build and deployment scripts using Apache ANT to customize WAR and EAR files.
- Developed stored procedures and triggers using PL/SQL in order to calculate and update the tables to implement business logic using Oracle database.
- Used Spring ORM module for integration with Hibernate for persistence layer.
- Involved in writing Hibernate Query Language (HQL) for persistence layer.
Environment: Java SE 7, Java EE 6, JSP 2.1, Servlets 3.0, HTML, JDBC 4.0, IBM WebSphere 8.0, PL/SQL, XML, Spring 3.0, Hibernate 4.0, Oracle 12c, ANT, JavaScript & jQuery, JUnit, Windows 7, and Eclipse 3.7.
Jr. Java Developer
Confidential
Responsibilities:
- Developing new pages for personals.
- Implementing MVC Design pattern for the Application.
- Using Content Management tool (Dynapub) for publishing data.
- Implementing AJAX to represent data in friendly and efficient manner.
- Developing Action classes.
- Used JMeter for load testing of the application and captured its response time.
- Created simple user interface for application's configuration system using MVC design patterns and swing framework.
- Implementing Log4j for logging and debugging.
- Implementing a form-based approach for the programming team's ease of development.
- Involved in software development life cycle as a team lead.
Environment: Core Java, Java Swing, Struts, J2EE (JSP/Servlets), XML, AJAX, DB2, MySQL, Tomcat, JMeter.