Sr. Spark Developer Resume
Dallas, TX
SUMMARY
- 10+ years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
- Around 5 years of experience in Big Data using the Hadoop framework and related technologies such as HDFS, HBase, MapReduce, Spark, Hive, Pig, Flume, Oozie, Sqoop, and ZooKeeper.
- Experience in data analysis using Hive, Pig Latin, HBase, and custom MapReduce programs in Java.
- Experience in writing custom UDFs in Java and Scala to extend Hive and Pig functionality (see the sketch after this list).
- Experience with Confidential and Hortonworks distributions.
- Over 3 years of experience with Spark, Scala, HBase, and Kafka.
- Developed analytical components using Kafka, Scala, Spark, HBase, and Spark Streaming.
- Experience working with Flume to load log data from multiple sources directly into HDFS.
- Good knowledge of Hortonworks administration and security components such as Apache Ranger, Knox Gateway, and High Availability.
- Implemented a Hadoop backup strategy covering Hive, HDFS, HBase, Oozie, etc.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and from RDBMS to HDFS.
- Involved in creating HDInsight clusters in the Confidential Azure portal; also created Event Hubs and Azure SQL databases.
- Worked on clustered Hadoop for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
- Experience in building ETL pipelines using NiFi.
- Built a real-time pipeline for streaming data using Event Hubs / Confidential Azure Queue and Spark Streaming.
- Microservice architecture development using Python and Docker on an Ubuntu Linux platform using HTTP/REST interfaces with deployment into a multi-node Kubernetes environment.
- Loaded the aggregated data into HBase for reporting purposes.
- Read data from HBase into Spark to perform joins across different tables.
- In-depth understanding of NiFi.
- Created HBase tables for validation, audit, and offset management.
- Created logical views instead of tables to enhance the performance of Hive queries.
- Involved in developing Hive DDLs to create, alter, and drop Hive tables.
- Good knowledge of Hive optimization techniques such as vectorization and column-based optimization.
- Wrote Oozie workflows to invoke jobs at predefined intervals.
- Expert in scheduling Oozie coordinators on input data events, so that the workflow starts when input data becomes available.
- Worked on a POC with Kafka and NiFi to pull real-time events into the Hadoop environment.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
- Experienced in managing Hadoop clusters using Hortonworks Ambari.
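The custom UDF work above can be illustrated with a minimal sketch, assuming a Hive UDF written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name, masking logic, and function name below are hypothetical rather than taken from a real project.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hypothetical Hive UDF: masks all but the last four characters of a value.
// Hive locates the evaluate() method by reflection on the classic UDF API.
class MaskValue extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) return null
    val s = input.toString
    new Text(("*" * math.max(0, s.length - 4)) + s.takeRight(4))
  }
}
```

Packaged into a JAR, a UDF like this would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION mask_value AS '<package>.MaskValue' before use in queries.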
PROFESSIONAL EXPERIENCE
Confidential, Dallas TX
Sr. SPARK DEVELOPER
Responsibilities:
- Developed a framework to encrypt sensitive data (SSN, account number, etc.) in all kinds of datasets and moved datasets from one S3 bucket to another.
- Processed dataset formats such as text, Parquet, Avro, fixed-width, Zip, Gzip, JSON, and XML.
- Developed a framework to check the data quality of datasets against schemas defined in the cloud. Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in AWS S3 using Scala (a minimal sketch follows this list).
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
- Experienced in handling large datasets using partitioning, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Built an end-to-end data platform with Snowflake, Matillion, Power BI, Qubole, Databricks, Tableau, Looker, and Python.
- Migrated jobs from DataStage to Talend using big data components.
- Developed REST APIs using Java and the Play framework.
- Very good understanding of NoSQL databases like Cassandra.
- Used NiFi to ping Snowflake to keep the client session alive.
- Worked with tools such as GitHub, Maven, Eclipse, NetBeans, IntelliJ, and Jenkins.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Hands-on experience with various AWS services such as Redshift clusters, Route 53 domain configuration, SQS, EBS, CloudWatch, EC2, ELB, RDS, S3, and SNS.
- Used Talend to load data into the warehouse systems.
- Implemented automated CI/CD pipelines with Jenkins to deploy microservices in AWS ECS for streaming data, Python jobs in AWS Lambda, and containerized deployments of Java and Python.
- Experience in AWS with Lambda, WorkSpaces, Kinesis, and DynamoDB.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, KSQL, Sqoop, Pig, Hive, Impala, and NoSQL databases.
- Used Spark-Streaming APIs to perform required transformations and actions on the learner data model which gets the data from Kafka in near real time.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX and NoSQL sources.
- Worked with global context variables and context variables, and extensively used over 30 components in Talend to create jobs. Leveraged SPARQL against Semantic Web sources for public data, and used Pig and HiveQL for unstructured data from both internal and external sources, integrating it into the architecture of the DaaS platform.
- Ran predictive, spatial, and statistical analytics in one intuitive interface in Alteryx.
- Managed cluster nodes using Confidential Manager.
- Wrote and tuned complex Java, Scala, Spark, and Airflow jobs.
- Developed multiple Spark jobs in Scala & Python for data cleaning and preprocessing.
- Implemented monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
- Worked on migrating NiFi jobs from development to the pre-prod and production clusters.
- Worked on migrating MapReduce programs to Spark transformations, and used File Broker to schedule workflows that run Spark jobs transforming data on a recurring schedule.
- Developed ELT workflows using NiFi to load data into Hive and Teradata.
- Used sbt to develop Scala-based Spark projects and executed them using spark-submit.
- Experience developing and deploying shell scripts for automation, notification, and monitoring.
- Extensively used Apache Kafka, Apache Spark, HDFS, and Apache Impala to build near-real-time data pipelines that ingest, transform, store, and analyze clickstream data to provide a better personalized user experience.
- Maintained and developed Docker images for a tech stack including Cassandra, Kafka, Apache, and several in-house Java services on Kubernetes.
- Utilized Alteryx to prepare, blend, and analyze data using repeatable workflows.
- Participated in installing and configuring the Confidential (CDH4) SuSE package on a 40-node cluster.
- Analyzed the SQL scripts and designed the solution to implement using PySpark.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Worked on performance tuning of Spark applications.
- Responsible for architecting Hadoop clusters with CDH4 on CentOS, managed with Confidential Manager.
- Written applications on NoSQL databases like HBase, Cassandra and MongoDB.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark.
- Worked with Apache Spark SQL and DataFrame functions to perform data transformations and aggregations on complex semi-structured data.
- Hands-on experience in creating RDDs, transformations, and actions while implementing Spark applications.
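A minimal sketch of the Kafka-to-S3 streaming ingestion called out above, using Spark Structured Streaming in Scala; the broker, topic, bucket, and checkpoint locations are placeholders rather than values from the actual engagement, and the job assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Minimal sketch: read a Kafka topic with Structured Streaming and land the
// payload in S3 as Parquet. All endpoints and paths below are placeholders.
object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3-stream")
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")  // placeholder broker
      .option("subscribe", "clickstream-events")            // placeholder topic
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

    events.writeStream
      .format("parquet")
      .option("path", "s3a://example-bucket/clickstream/")        // placeholder bucket
      .option("checkpointLocation", "s3a://example-bucket/chk/")  // placeholder checkpoint
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()
      .awaitTermination()
  }
}
```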
Environment: AWS, Spark, Hive, Spark SQL, NiFi, Kafka, EMR, Snowflake, Nebula, Python, Scala, Maven, Jupyter Notebook, Visual Studio, UNIX shell scripting, SPARQL
Confidential, Dallas TX
SPARK DEVELOPER
Responsibilities:
- Involved in the complete big data flow of the application, from ingesting data from upstream sources into HDFS to processing and analyzing the data in HDFS.
- Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers.
- Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Managed Alteryx workflows and ETL processes.
- Scheduled Talend jobs with the Talend Admin Console, setting up best practices and a migration strategy.
- Implemented solutions using Hadoop, HBase, Hive, Sqoop, Java API, etc.
- Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
- Extensively used Confidential Manager and Confidential Director to manage cluster nodes and services, administer the cluster, and assign users, groups, and roles for authorization.
- Involved in running the Hive scripts through Hive, Hive on Spark, and in some cases Spark SQL.
- Explored the use of Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Expertise in optimizing Spark jobs when dealing with huge joins and data skew.
- Performed different types of optimizations in Spark such as broadcast joins, repartitioning, and Kryo serialization (a minimal sketch follows this list).
- Wrote and tuned complex Java, Scala, Spark, and Airflow jobs.
- Used Spark and Spark SQL to read the Parquet data and create the tables in Hive using the Scala API.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
- Implemented distributed, enterprise, web, and client-server systems using Java.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
- Interacted with multiple teams who are responsible for Azure Platform to fix the Azure Platform Bugs.
- Wrote a Kafka REST API to collect events from the front end.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Knowledge of handling Hive queries using Spark SQL integrated with the Spark environment.
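A minimal sketch of the broadcast-join, repartition, and Kryo serialization optimizations referenced above; the table names, join key, and partition count are illustrative assumptions rather than details of the actual jobs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Sketch of the skew/large-join mitigations: Kryo serialization, an explicit
// broadcast of the small dimension, and a repartition on the join key.
val spark = SparkSession.builder()
  .appName("skewed-join-tuning")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()

val transactions = spark.table("staging.transactions") // large, skewed side (placeholder)
val accounts     = spark.table("dim.accounts")          // small dimension (placeholder)

val joined = transactions
  .repartition(400, transactions("account_id"))          // spread hot keys across tasks
  .join(broadcast(accounts), Seq("account_id"), "left")  // keep the small side out of the shuffle

joined.write.mode("overwrite").saveAsTable("curated.transactions_enriched")
```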
Confidential, Dallas TX
SPARK DEVELOPER
Responsibilities:
- Developed a fully automated, configuration-driven data pipeline using Spark, Hive, HDFS, SQL Server, Oozie, and Azure File Storage to load client data into mirror databases. Performed the necessary transformations and aggregations on the fly to build the common learner data model and persisted the data in HDFS.
- Explored the use of Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the sketch after this list).
- Expertise in optimizing Spark jobs when dealing with huge joins and data skew.
- Performed different types of optimizations in Spark such as broadcast joins, repartitioning, and Kryo serialization.
- Worked in Scala to implement Spark machine learning libraries and Spark Streaming.
- Developed and maintained continuous integration and deployment systems using Jenkins, Ant, Akka, and Maven.
- Wrote Oozie workflows to invoke jobs at predefined intervals.
- Expert in scheduling Oozie coordinators on input data events, so that the workflow starts when input data becomes available.
- Optimized Hive queries using best practices and the right parameters, and using technologies such as Hadoop, YARN, Python, and PySpark.
- Developed jobs in Talend Enterprise Edition from stage to source, intermediate, conversion, and target.
- Good experience with the JBoss Drools rules engine.
- Developed a robust application with the Drools engine in order to execute complex rules on top of customer data.
- Troubleshot and debugged Talend-specific issues while maintaining the health and performance of the ETL environment.
- Good experience creating Drools DRL files and kmodule.xml files.
- Good knowledge of Hortonworks administration and security components such as Apache Ranger, Knox Gateway, and High Availability.
- Implemented solutions using Hadoop, Apache Spark, Spark Streaming, Spark SQL, HBase and Scala.
- Implemented a Hadoop backup strategy covering Hive, HDFS, HBase, Oozie, etc.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Performed Hive performance tuning using map joins, cost-based optimization, and column-level statistics.
- Created logical views instead of tables to enhance the performance of Hive queries.
- Involved in developing Hive DDLs to create, alter, and drop Hive tables.
- Experience in creating UDFs and UDAFs for Hive and Pig.
- Extensively used Pig for data cleansing and Hive queries for the analysts.
- Performed benchmarks comparing Hive and Spark SQL.
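A minimal sketch of converting a HiveQL aggregation into Spark DataFrame transformations, as referenced in the list above; the database, table, and column names are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Original HiveQL, for reference:
//   SELECT client_id, course_id, COUNT(*) AS attempts
//   FROM learner.activity
//   WHERE event_date = '2019-01-01'
//   GROUP BY client_id, course_id;
val spark = SparkSession.builder()
  .appName("hive-to-spark-rewrite")
  .enableHiveSupport()
  .getOrCreate()

val attempts = spark.table("learner.activity")          // placeholder Hive table
  .filter(col("event_date") === "2019-01-01")
  .groupBy("client_id", "course_id")
  .agg(count(lit(1)).as("attempts"))

attempts.write.mode("overwrite").saveAsTable("learner.activity_attempts")
```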
Environment: Azure, Spark, Hive, Spark SQL, Kafka, Hortonworks, JBoss Drools, Pig, Oozie, HBase, Python, Scala, Maven, Jupyter Notebook, Visual Studio, UNIX shell scripting
Confidential, REDMOND WA
HADOOP AND SPARK DEVELOPER
Responsibilities:
- Developed a data pipeline using Event Hubs, Spark, Hive, Pig, and Azure SQL Database to ingest customer behavioral data and financial histories into an HDInsight cluster for analysis.
- Involved in creating HDInsight clusters in the Confidential Azure portal; also created Event Hubs and Azure SQL databases.
- Worked on clustered Hadoop for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
- Spark Streaming collects this data from Event Hubs in near real time, performs the necessary transformations and aggregations on the fly to build the common learner data model, and persists the data in the Azure database.
- Used Pig for transformations, event joins, filtering bot traffic, and some pre-aggregations before storing the data in the Azure database.
- Expertise with the tools in the Hadoop ecosystem, including Pig, Hive, HDFS, YARN, Oozie, and ZooKeeper, as well as the Hadoop architecture and its components.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Explored Spark for improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Experienced with Spark Streaming to ingest data into the Spark engine.
- Imported data from different sources such as Event Hubs and Cosmos into Spark RDDs.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Worked extensively on the Spark SQL and Spark Streaming modules and used Scala to write code for all Spark use cases.
- Used the DataFrame API in Scala for working with distributed collections of data organized into named columns.
- Involved in converting JSON data into DataFrames and storing it in Hive tables (a minimal sketch follows this list).
- Experienced with AzCopy, Livy, Windows PowerShell, and cURL to submit Spark jobs on the HDInsight cluster.
- Analyzed the SQL scripts and designed the solution to implement them using Scala.
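A minimal sketch of loading JSON event data into a DataFrame and persisting it to a Hive table, as mentioned above; the storage path, partition column, and table name are assumptions rather than actual project values.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: read newline-delimited JSON, inspect the inferred schema, and
// persist the result as a partitioned Hive table. Paths and names are placeholders.
val spark = SparkSession.builder()
  .appName("json-to-hive")
  .enableHiveSupport()
  .getOrCreate()

val events = spark.read
  .json("wasb:///data/raw/events/")   // placeholder HDInsight storage path

events.printSchema()                  // verify the inferred schema before writing

events.write
  .mode("append")
  .partitionBy("event_date")          // assumes an event_date column exists
  .saveAsTable("analytics.events")    // placeholder Hive database.table
```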
Environment: Azure, Spark, Hive, Spark SQL, Kafka, Hortonworks, JBoss Drools, Pig, Oozie, HBase, Python, Scala, Maven, Jupyter Notebook, Visual Studio, UNIX shell scripting.
Confidential
HADOOP DEVELOPER
Responsibilities:
- Importing and exporting data into HDFS and Hive using Sqoop.
- Used Bash shell scripting, Sqoop, Avro, Hive, Pig, Java, and MapReduce daily to develop ETL, batch processing, and data storage functionality (a MapReduce sketch follows this list).
- Used Pig for data transformations, event joins, and some pre-aggregations before storing the data on HDFS.
- Used the Hadoop MySQL connector to store MapReduce results in an RDBMS.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Worked on loading all tables from the reference source database schema through Sqoop.
- Designed, coded, and configured server-side J2EE components such as JSP, AWS, and Java.
- Collected data from different databases (e.g., Oracle, MySQL) into Hadoop.
- Used Oozie and Zookeeper for workflow scheduling and monitoring.
- Worked on designing and developing ETL workflows in Java for processing data in HDFS/HBase using Oozie.
- Experienced in managing and reviewing Hadoop log files.
- Involved in loading and transforming large sets of structured, semi structured and unstructured data from relational databases into HDFS using Sqoop imports.
- Worked on extracting data from MySQL through Sqoop, placing it in HDFS, and processing it.
- Supported MapReduce programs running on the cluster.
- Provided cluster coordination services through ZooKeeper.
- Involved in loading data from UNIX file system to HDFS.
- Created several Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
- Developed simple to complex MapReduce jobs using Hive and Pig.
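The MapReduce development listed above was done in Java; to keep one language across these sketches, the outline below shows a comparable, hypothetical token-count cleaning job written in Scala against the Hadoop MapReduce API. Class names and paths are illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (normalized token, 1) for each whitespace-separated token.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.toLowerCase.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reducer: sum the counts emitted for each token.
class TokenReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var total = 0
    val it = values.iterator()
    while (it.hasNext) total += it.next().get()
    context.write(key, new IntWritable(total))
  }
}

// Driver: wires the mapper and reducer together; input/output paths come from args.
object TokenCountJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "token-count")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[TokenReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```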
Confidential
JAVA DEVELOPER
Responsibilities:
- Involved in the analysis, design, development, and testing phases of the Software Development Life Cycle (SDLC).
- Designed and developed framework components; involved in designing the MVC pattern using the Struts and Spring frameworks.
- Responsible for developing Use case, Class diagrams and Sequence diagrams for the modules using UML and Rational Rose.
- Developed Action classes and ActionForm classes, created JSPs using Struts tag libraries, and configured them in the struts-config.xml and web.xml files.
- Involved in deploying and configuring applications in WebLogic Server.
- Used SOAP for exchanging XML based messages.
- Used Confidential Visio for developing use case diagrams, sequence diagrams, and class diagrams in the design phase.
- Developed Custom Tags to simplify the JSP code. Designed UI screens using JSP and HTML.
- Actively involved in designing and implementing Factory method, Singleton, MVC and Data Access Object design patterns.
- Used web services to send and receive data from different applications via SOAP messages, then used a DOM XML parser for data retrieval.
- Wrote JUnit test cases for the controller, service, and DAO layers using Mockito and DbUnit.
- Developed unit test cases using a proprietary framework similar to JUnit.
- Used the JUnit framework for unit testing of the application and Ant to build and deploy the application on WebLogic Server.