
Spark/Big Data Engineer Resume


Alpharetta, GA

PROFESSIONAL SUMMARY:

  • 9+ years of total IT development experience in all phases of the SDLC
  • 1+ year of Python/Apache Spark experience, 3+ years of Hadoop/Java developer experience across all phases of Hadoop and HDFS development, and 1+ year of ETL/Informatica exposure.
  • Extensive experience and active involvement in requirements gathering, analysis, design, coding, code reviews, and unit and integration testing.
  • Experience in designing use cases, class diagrams, and sequence and collaboration diagrams for multi-tiered object-oriented system architectures using Unified Modeling Language (UML) tools such as Rational Rose and the Rational Unified Process (RUP). Working knowledge of Agile, Test-Driven Development (TDD) and Behavior-Driven Development (BDD) methodologies.
  • Extensive knowledge of Client-Server technology, web-based n-tier architecture, Database Design and development of applications using J2EE Design Patterns like Singleton, Session Facade, Factory Pattern and Business Delegate.
  • Hands-on experience with the Hadoop ecosystem, including Spark, Kafka, HBase, Pig, Impala, Sqoop, Oozie, Flume, Mahout, Storm, Tableau and Talend big data technologies.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark (see the PySpark sketch after this list).
  • Experience working with SQL, PL/SQL and NoSQL databases like Microsoft SQL Server, Oracle, HBase and Cassandra.
  • Experience in importing and exporting data between HDFS and databases such as MySQL, Oracle, Netezza, Teradata and DB2 using Sqoop and Talend.
  • Involved in writing Pig scripts to transform raw data into baseline data.
  • Worked on Amazon Redshift, the data warehouse product that is part of AWS.
  • Good experience in designing jobs and transformations in Talend and loading data sequentially and in parallel for initial and incremental loads.
  • Experience in developing and scheduling ETL workflows in Hadoop using Oozie.
  • Experience in deploying and managing the Hadoop cluster using Cloudera Manager.
  • Experience in developing applications using Map Reduce, Pig and Hive.
  • Loading log data directly into HDFS using Flume.
  • Implemented Kerberos authentication for client/server applications using secret-key cryptography.
  • Experience with Tableau as a reporting tool.
  • Experienced in creative and effective front-end development using HTML, CSS, JavaScript, Bootstrap, jQuery, Ajax and XML.
  • Good working experience using Spring modules such as Spring IoC, Spring MVC, Spring AOP, Spring JDBC and Spring ORM in web applications.
  • Used Hibernate and JPA persistence services for object-relational mapping with the database; configured XML mapping files and integrated them with other frameworks like Spring and Struts.
  • Good exposure to web services using Apache CXF and Apache Axis for exposing and consuming SOAP messages.
  • Good knowledge in developing RESTful web services.
  • Working knowledge of databases such as Oracle 8i/9i/10g, Microsoft SQL Server and DB2.
  • Strong experience in database design, writing complex SQL Queries and Stored Procedures
  • Have extensive experience in building and deploying applications on web/application servers like WebLogic, WebSphere, JBoss and Tomcat
  • Experience in Building, Deploying and Integrating with Ant, Maven
  • Experience in development of logging standards and mechanism based on Log4J
  • Experience in writing and executing unit test cases using the JUnit testing framework
  • Excellent communication skills and strong architecture skills
  • Ability to learn and adapt quickly to the emerging new technologies.
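For illustration, a minimal PySpark sketch of the Hive-to-Spark conversion mentioned above: the same aggregation expressed as a Spark SQL query and as DataFrame transformations. The table and column names are hypothetical and not taken from any project described here.

```python
# Minimal sketch: a HiveQL-style aggregation run through Spark SQL, and the
# equivalent DataFrame transformations. Table/column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# HiveQL-style query executed through Spark SQL
sql_df = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales.transactions
    WHERE txn_date >= '2016-01-01'
    GROUP BY customer_id
""")

# Equivalent DataFrame API transformations
api_df = (
    spark.table("sales.transactions")
         .filter(F.col("txn_date") >= "2016-01-01")
         .groupBy("customer_id")
         .agg(F.sum("amount").alias("total_amount"))
)

api_df.show(10)
```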

TECHNICAL SKILLS:

Technologies: Hadoop (Cloudera, Hortonworks, Pivotal HD), Apache Spark, Apache Kafka, Apache HBase, Flume, Talend, Hive, Pig, Sqoop, Storm, Mahout, Oozie, Tableau, Java Beans, Servlets, JSP, JDBC, EJB, JNDI, JMS, RMI.

Architecture & Framework: Client-Server, MVC, J2EE, Struts, Spring, Hibernate.

Database: Cassandra, HBase, Oracle 11g, SQL Server 2008, MySQL

IDE: Eclipse, NetBeans, IBM RAD, JBuilder.

Design Methodology: UML, Waterfall, Agile

Operating Systems: Windows, Linux, Unix

GUI: HTML, XML, XSLT, AJAX, JavaScript, CSS, jQuery

Query Languages: SQL, PL/SQL.

Programming Languages: Python, Java, C, C++, Perl

Design Patterns: Business Delegate, Business Object, Value Object, Front Controller, Data Access Object, Factory, Singleton, Session Facade.

Tools: BEA WebLogic, JBoss, IBM WebSphere Application Server 6.1, Tomcat 6.0, JUnit 4.0, ANT, Maven, Log4j, Mercury Quality Center, Rational ClearQuest, SVN, Toad

Design & Control: UML, Rational Rose, CVS, ClearCase

PROFESSIONAL EXPERIENCE:

Confidential, Alpharetta, GA

Spark/Big Data Engineer

Responsibilities:

  • Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plug in to generate the necessary reports
  • Created Source to Target Mappings (STM) for the required tables by understanding the business requirements for the reports
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed (a minimal sketch follows this list)
  • Created Hive tables on HDFS in Parquet format to store the data processed by Apache Spark on the Cloudera Hadoop cluster.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Loaded log data directly into HDFS using Flume.
  • Leveraged AWS S3 as storage layer for HDFS.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark
  • Used Bitbucket as the code repository and frequently used Git commands such as clone, push and pull against the Git repository
  • Used the Hadoop ResourceManager to monitor the jobs run on the Hadoop cluster
  • Used Confluence to store the design documents and the STMs
  • Met with business and engineering teams on a regular basis to keep the requirements in sync and deliver on them
  • Used Jira to track the stories worked on within the Agile methodology
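The sketch referenced above: a minimal STM-driven PySpark job on EMR that reads raw JSON from S3, applies a few column mappings, and writes a Parquet-backed Hive table. The S3 path, column mappings and target table are hypothetical, standing in for the actual STMs.

```python
# Minimal STM-style PySpark sketch: read raw JSON from S3, rename/cast columns
# per a hypothetical source-to-target mapping, and save as a Parquet Hive table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stm-load").enableHiveSupport().getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path

mapped = (
    raw.select(
        F.col("evt_id").cast("long").alias("event_id"),
        F.col("evt_ts").cast("timestamp").alias("event_time"),
        F.col("usr").alias("user_id"),
    )
    .withColumn("load_date", F.current_date())
)

# Parquet-backed table so Hive and reporting tools like Tableau can query it
mapped.write.mode("overwrite").format("parquet").saveAsTable("curated.events")
```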

Environment: Spark, PySpark, SparkSQL, Hive, Pig, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git

Confidential, Long Beach, CA

Hadoop Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
  • Setup and benchmarked Hadoop/HBase clusters for internal use
  • Developed simple to complex MapReduce jobs in Java and implemented equivalent logic in Hive and Pig
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Used UDFs to implement business logic in Hadoop.
  • Used Impala to read, write and query the Hadoop data in HBase.
  • Developed Spark programs for faster data processing than standard MapReduce jobs (see the sketch after this list).
  • Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Worked with application teams to install the operating system, Hadoop updates, patches and version upgrades as required
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Experience with Storm for real-time processing of data.
  • Used Solr to navigate through data sets in the HDFS storage.
  • Loading log data directly into HDFS using Flume.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
  • Stored Solr indexes in HDFS.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
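As referenced above, a minimal PySpark RDD sketch of the kind of aggregation that would otherwise be a hand-written MapReduce job. The HDFS paths and the tab-delimited record layout are hypothetical.

```python
# Minimal PySpark RDD sketch replacing a simple MapReduce-style aggregation.
# Input/output paths and the record layout are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="log-aggregation")

counts = (
    sc.textFile("hdfs:///data/weblogs/")           # map phase input
      .map(lambda line: (line.split("\t")[0], 1))  # key on the first field
      .reduceByKey(lambda a, b: a + b)             # reduce phase: sum per key
)

counts.saveAsTextFile("hdfs:///output/weblog_counts")
sc.stop()
```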

Environment: Hadoop, MapReduce, HDFS, Hive, Spark, Pig, Java (JDK 1.6), SQL, Cloudera Manager, Sqoop, Storm, Solr, Mahout, Flume, Oozie, Eclipse

Confidential, Warnshills, Illinois

Big Data Engineer

Responsibilities:

  • Developed a big data web application in Scala using Agile methodology, as Scala combines functional and object-oriented programming.
  • Worked with different data sources such as HDFS, Hive and Teradata for Spark to process the data.
  • Used Spark to process the data before ingesting it into HBase; created both batch and real-time Spark jobs in Scala.
  • Used HBase as the database to store application data, as HBase offers high scalability, a distributed NoSQL design, column orientation and real-time querying, among other features.
  • Used Kafka, a publish-subscribe messaging system, creating topics with producers and consumers to ingest data into the application for Spark to process, and created Kafka topics for application and system logs (a streaming-read sketch follows this list).
  • Utilized the Play framework, which combines easily with Akka, to build web applications.
  • Configured ZooKeeper to coordinate and support the distributed applications, as it offers high throughput and availability with low latency.
  • Created and updated Terraform scripts to build the infrastructure and Consul scripts to enable service discovery for the application's systems.
  • Configured Nginx to serve the static content of the web pages, reducing the load on the web server for static content.
  • Wrote SQL queries to perform CRUD operations (saving, updating and deleting rows) on PostgreSQL using Play Slick.
  • Performed database migrations as and when needed.
  • Used SBT to build the Scala project.
  • Involved in creating and updating stories for each sprint in Agile and suggested the technical direction for each story.
  • Demoed the application to customers once a month, explaining new features, answering questions from the discussions and taking suggestions to improve the user experience.
  • Created and updated Jenkins jobs and pipelines to deploy the application to environments such as develop, QA and Production.
  • Used Git commands extensively for code check-in.
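The streaming-read sketch referenced above. The project itself used Scala; this is a rough PySpark Structured Streaming equivalent of the Kafka ingestion pattern, kept in Python for consistency with the other sketches. Broker addresses and the topic name are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Rough PySpark sketch of the Kafka-to-Spark ingestion pattern (the project
# used Scala). Brokers and topic are hypothetical; requires the
# spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
         .option("subscribe", "app-events")
         .load()
         .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

# Console sink for the sketch; the real pipeline would write to HBase instead
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```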

Environment: Spark, Scala, Python, IntelliJ IDE, Kafka, Play Framework, Slick, PostgreSQL, AWS CLI, Terraform, Consul, SBT, HBase, Akka.

Confidential, NY

Hadoop Developer

Responsibilities:

  • Used Sqoop to extract data from Oracle and MySQL databases into HDFS.
  • Developed workflows in Oozie for business requirements to extract the data using Sqoop.
  • Developed MapReduce (YARN) jobs for cleaning, accessing and validating the data.
  • Wrote MapReduce jobs using Pig Latin, Optimized the existing Hive and Pig Scripts.
  • Used Hive and Impala to query the data in HBase.
  • Wrote Hive scripts in HiveQL to de-normalize and aggregate the data.
  • Indexed documents in HDFS using Solr Hadoop connectors.
  • Automated the workflows using shell scripts (Bash) to export data from databases into Hadoop (a small wrapper sketch follows this list).
  • Used the JUnit framework for unit testing of the application.
  • Wrote Hive queries to meet the business requirements.
  • Developed product profiles using Pig and commodity UDFs.
  • Designed workflows by scheduling Hive processes for Log file data, which is streamed into HDFS using Flume.
  • Developed schemas to handle reporting requirements using Tableau.
  • Actively participated in weekly meetings with the technical teams to review the code.
  • Involved in loading data from UNIX file system to HDFS.
  • Implemented test scripts to support test driven development and continuous integration.
  • Responsible for managing data coming from different sources.
  • Have deep and thorough understanding of ETL tools and how they can be applied in a Big Data environment.
  • Participated in the requirement gathering and analysis phase of the project, documenting the business requirements by conducting workshops/meetings with various business users.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
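The wrapper sketch referenced above. The original automation was written in Bash; this is a hedged Python equivalent that invokes a Sqoop import, with the connection string, credentials, table name and target directory made up for illustration.

```python
# Sketch of automating a Sqoop import from Python (the original used Bash).
# Connection string, credentials, table and target directory are hypothetical.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",   # keep passwords off the command line
    "--table", "transactions",
    "--target-dir", "/data/raw/transactions",
    "--num-mappers", "4",
]

subprocess.run(sqoop_cmd, check=True)  # raises CalledProcessError on failure
```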

Environment: Hadoop, MapReduce, HiveQL, Hive, HBase, Sqoop, Solr, Cassandra, Flume, Tableau, Impala, Oozie, MySQL, Oracle SQL, Java, Unix Shell, YARN, Pig Latin.

Confidential, Reston, VA

Java/J2EE Developer

Responsibilities:

  • Involved in all the phases of SDLC including Requirements Collection, Design & Analysis of the Customer Specifications, Development and Customization of the Application.
  • Developed JSP, JSF and Servlets to dynamically generate HTML and display the data to the client side.
  • Extensively used JSP tag libraries; used Spring Security for authentication and authorization.
  • Designed and developed Application based on Struts Framework using MVC design pattern.
  • Used Struts Validator framework for client-side validations.
  • Used Spring Core for dependency injection/Inversion of control (IOC).
  • Used the Hibernate framework for persistence with the Oracle database.
  • Written and debugged the ANT Scripts for building the entire web application.
  • Used XML to transfer the application data between client and server.
  • Used XSLT style sheets for XML data transformations.
  • Developed web services in Java; experienced with SOAP and WSDL.
  • Used Log4j for logging Errors.
  • Used MAVEN as build tool.
  • Used Spring Batch for scheduling and maintenance of batch jobs.
  • Deployed the application in various environments DEV, QA and Production.
  • Used the JDBC for data retrieval from the database for various inquiries.
  • Performed cleanup of the application database entries using Oracle 10g.
  • Used CVS as source control.
  • Created Application Property Files and implemented internationalization.
  • Used JUnit to write repeatable tests, mainly for unit testing.
  • Followed the Agile development methodology throughout and tested the application in each iteration.
  • Wrote complex SQL and HQL queries to retrieve data from the Oracle database.
  • Involved in fixing System testing issues and UAT issues.

Environment: Java, J2EE, JSF, Struts, Spring, JDBC, Web Services, XML, JNDI, Hibernate, JMS, Eclipse, Oracle Xg, WinCvs 1.2, Rational Rose XDE, Spring Batch, Maven, Log4j, jQuery, XML/XSLT, SAX, DOM.

Confidential

ETL/Informatica Developer

Responsibilities:

  • Responsible for the logical dimensional data model and used ETL skills to load the dimensional physical layer from various sources including DB2, SQL Server, Oracle, flat files, etc.
  • Successfully collaborated with business users to capture & define business requirements and contribute to defining the data warehouse architecture (data models, data analysis, data sourcing and data integrity)
  • Analyzed source data for potential data quality issues and addressed these issues in ETL procedures.
  • Developed technical design documents and mappings specifications to build Informatica Mappings to load data into target tables adhering to the business rules.
  • Designed, developed, tested, maintained and organized complex Informatica mappings, sessions and workflows.
  • Completed technical documentation to ensure the system was fully documented.
  • Designed and developed ETL with change data capture (CDC) using PowerExchange 9.1 in a mainframe DB2 environment.
  • Created registrations and data maps for the mainframe source.
  • Demonstrated in-depth understanding of Data Warehousing (DWH) and ETL concepts and ETL loading strategy.
  • Worked with SAP Data Services for data quality and data integration.
  • Created a Unix script to identify hanging CDC workflows.
  • Participated in developing PL/SQL procedures and Korn shell scripts to automate the process for daily and nightly loads.
  • Created sequential/concurrent sessions and batches for the data loading process and used pre- and post-session SQL scripts to meet business logic.
  • Extensively used pmcmd commands on command prompt and executed Unix Shell scripts to automate workflows and to populate parameter files.
  • Developed complex mappings with varied transformation logic such as Connected and Unconnected Lookups, Router, Aggregator, Joiner and Update Strategy.
  • Worked with mapping/session/worklet/workflow variables and parameters, running workflows via RHEL Unix shell scripts.
  • Created Informatica PowerExchange restart tokens and enabled recovery for all real-time sessions.
  • Worked on data warehouses with sizes from 2-3 Terabytes.
  • Worked on Teradata utilities like FLOAD, MLOAD and TPUMP to load data to stage and DWH.
  • Worked on Data Modeling using Star/Snowflake Schema Design, Data Marts, Relational and Dimensional Data Modeling, Slowly Changing Dimensions, Fact and Dimensional tables, Physical and Logical data modeling using Erwin.

Environment: Informatica PowerCenter 9.x, PowerExchange 9.x, SQL Server 2008, Teradata, UNIX, Toad, Erwin, Linux, Harvest, Putty, DB2 Mainframe.
