
Data Engineer/Spark Developer Resume


Austin, TX

PROFESSIONAL SUMMARY:

  • Around 8 years of IT experience, including Data Engineering and implementation of Hadoop, Spark, and cloud data warehousing solutions.
  • Extensive experience in developing Kafka producers and consumers for streaming millions of events per minute using PySpark, Python & Spark Streaming (a minimal PySpark streaming sketch follows this list).
  • Significant experience in Scala, Python & Shell languages.
  • Experience in the Spark ecosystem: Core, SQL, and Streaming modules.
  • Extensive experience in AWS big data services: S3, ECS (Elastic Container Service), and EMR Spark.
  • Experienced in using AWS security components IAM, KMS, VPC, Route 53, and HashiCorp Vault.
  • Experience in configuring AWS networking services (ELB, NLB, ALB) for TCP & HTTP applications.
  • Implemented automatic CI/CD pipelines with Jenkins to deploy microservices in AWS ECS for streaming data, Python jobs in AWS Lambda, and containerized deployments of Java & Python.
  • Strong experience productionizing end-to-end data pipelines on the Hadoop platform.
  • Good hands-on experience on DevOps Stack like Terraform, Vault, Jenkins, Ansible, Boto3, Docker, Elastic Container Service (ECS), CloudFormation and System Manager.
  • Hands-on experience with AWS messaging & streaming systems: Kinesis Data Streams, SQS, SNS.
  • Continuously monitored and supported production deployments using observability tools such as AWS CloudWatch and Datadog.
  • Good experience with the Snowflake data warehouse; developed data extraction queries and automated ETL for loading data from the data lake.
  • Projects implemented using Agile methodology, with Scrum management in JIRA.
  • Experience working with different operating systems and environments: Windows, Mac, UNIX/Linux, EC2, Ubuntu, CentOS.
  • Experienced in Databricks, Hive SQL, Azure CI/CD pipelines, Delta Lake, the Hadoop file system, and Snowflake.
  • Experience in source code & build management with Git & Enterprise GitHub with Jenkins, Artifactory.
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
  • Extensive experience in creating batch & streaming data pipelines in AWS & Big Data infrastructure.
  • Skilled in working with the Hive data warehouse and Snowflake modeling.
  • Drove Snowflake cost optimization initiatives with data models & efficient queries.
  • Strong understanding of real-time streaming technologies Spark and Kafka.
  • Strong understanding of logical and physical database models and entity-relationship modeling.
  • Strong understanding of message queue and streaming system architectures: ZeroMQ, SQS, Kinesis, Kafka, Spark.
  • Developed Microservices solution for streaming data in AWS ECS.
  • Developed a REST API service in Python Flask to mask data for Dev & QA region testing as part of team utility initiatives (see the Flask sketch after this list).
  • Possess excellent communication, interpersonal and analytical skills along with positive attitude.
  • Experience working with Apache Hadoop components like HDFS, MapReduce, HiveQL, HBase, Pig, and Sqoop, and with Big Data analytics.
  • Hands-on experience developing ETL jobs in the Hadoop ecosystem using Oozie & NiFi, and streaming and batch jobs in AWS ECS & Lambda, EMR Spark, and Docker containers.
  • Hands-on experience installing, configuring, and using ecosystem components like Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, and Pig with the Hortonworks Data Platform.
  • Experience working with Job Scheduling tools (Airflow, Luigi, Oozie).
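
The Kafka streaming experience summarized above generally follows the pattern below. This is a minimal PySpark Structured Streaming sketch, assuming illustrative broker, topic, bucket, and schema names and the spark-sql-kafka package on the classpath; it reads JSON events from a Kafka topic and lands micro-batches on S3 as Parquet.

    # Minimal sketch: consume a Kafka topic with PySpark Structured Streaming and
    # land micro-batches on S3 as Parquet. Broker, topic, bucket, and schema are
    # placeholders, not taken from the projects above.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-event-ingest").getOrCreate()

    # Assumed JSON payload schema
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "events")                      # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    (events.writeStream
     .format("parquet")
     .option("path", "s3a://example-bucket/events/")                     # placeholder bucket
     .option("checkpointLocation", "s3a://example-bucket/checkpoints/")  # required for recovery
     .trigger(processingTime="1 minute")
     .start()
     .awaitTermination())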
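
The Flask data-masking service mentioned above can be sketched as a small app of the following shape; the endpoint path, field names, and masking rule are assumptions for illustration only.

    # Hypothetical data-masking REST service: accepts a JSON record and returns a
    # copy with assumed sensitive fields masked for Dev/QA use.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    SENSITIVE_FIELDS = {"ssn", "email", "phone"}  # assumed sensitive attributes


    def mask_value(value: str) -> str:
        """Mask all but the last four characters of a value."""
        if len(value) <= 4:
            return "*" * len(value)
        return "*" * (len(value) - 4) + value[-4:]


    @app.route("/mask", methods=["POST"])
    def mask_record():
        record = request.get_json(force=True)
        masked = {
            key: mask_value(str(val)) if key in SENSITIVE_FIELDS else val
            for key, val in record.items()
        }
        return jsonify(masked)


    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)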

TECHNICAL SKILLS:

Programming/Scripting Languages: Scala, Java, Python, C, C++, SQL, Shell.

Big Data, Streaming & MQ: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Spark 2.0, Kafka, Kinesis Firehose & Data Streams, SQS, SNS, ZeroMQ, Athena.

AWS: EC2, S3, ECS, ECR, Lambda, CloudWatch, ELB, NLB, ALB, VPC, Route 53, IAM, KMS.

Other tools: Microsoft Office tools, VSTS, VMware, IoT, Git, Enterprise GitHub, IntelliJ, PyCharm, Maven and SBT.

Databases: Oracle SQL, MySQL and other RDBMS, NoSQL (Apache Cassandra, HBase), Snowflake

Big data Eco System: HDFS, Oozie, Zookeeper, Spark, Spark streaming, Kafka, NiFi

UNIX Tools: Apache, Yum, RPM

File Formats: Text, XML, JSON, Avro, Parquet, ORC, Protobuf

Cloud Computing: AWS

Visualization and Reporting Tools: Tableau, Microsoft Power BI, AWS QuickSight.

Methodologies: Agile, UML, Design Patterns

PROFESSIONAL EXPERIENCE

Confidential, Austin TX

Data Engineer/Spark Developer

Responsibilities:

  • Worked with Hadoop 2.x version and Spark 2.x (Python and Scala).
  • Used Spark for interactive queries, processing of streaming data, and integration with NoSQL databases for huge volumes of data.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective & efficient joins, and transformations during the ingestion process itself.
  • Developed custom ETL solutions, batch processing, and real-time data ingestion pipelines to move data in and out of Hadoop using Python and shell scripting.
  • Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Designed and developed Hadoop ecosystem components; handled installation, configuration, support, and monitoring of Hadoop clusters using Apache and Cloudera distributions, Azure Databricks, and AWS.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a minimal aggregation sketch follows this list).
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Experienced in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
  • Experience with Google Cloud components, Google Container Builder, GCP client libraries, and Cloud SDKs.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the Beam sketch after this list).
  • Used REST APIs with Python to ingest data from various sites into BigQuery.
  • Built a program with Python and Apache Beam, executed in Cloud Dataflow, to run data validation between raw source files and BigQuery tables.
  • Built a configurable Scala and Spark-based framework to connect to common data sources such as MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
  • Created Databricks notebooks using SQL, Python and automated notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Worked on improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Worked with Sqoop import and export functionalities to handle large data set transfer between Oracle databases and HDFS.
  • Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.
  • Developed Custom Input Formats in Spark jobs to handle custom file formats.
  • Configured Oozie workflow to run multiple Hive jobs which run independently with time and data availability.
  • Utilized Hive tables and HQL queries for daily and weekly reports. Worked on complex data types in Hive like Structs and Maps.
  • Designed and constructed AWS data pipelines using various AWS resources, including API Gateway endpoints backed by AWS Lambda functions that retrieve data from Snowflake and return the response in JSON format, alongside DynamoDB and S3 (a Lambda-to-Snowflake sketch follows this list).
  • Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
  • Good experience with the Snowflake data warehouse; developed data extraction queries and automated ETL for data loading from the data lake.
  • Drove Snowflake cost optimization initiatives with data models & efficient queries.
  • Developed HiveQL queries for trend analysis and pattern recognition on user data.
  • Helped this regional bank streamline business processes by developing, installing, and configuring Hadoop ecosystem components that moved data from individual servers to HDFS.
  • Used Spark Streaming as a single framework to satisfy all of the bank's processing needs.
  • Imported data from AWS S3 into Spark RDD, performed transformations and actions on RDDs.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Worked heavily with Python, C++, Spark, SQL, Airflow, and Looker.
  • Proven experience with ETL frameworks (Airflow, Luigi, Garcon).
  • Involved in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
  • Experienced in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Supported code/design analysis, strategy development and project planning.
  • Created reports for the BI team using Sqoop to export data into HDFS and Hive.
  • Developed multiple Spark jobs in Scala & Python for data cleaning and preprocessing.
  • Assisted with data capacity planning and node forecasting.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
  • Designed ETL processes using Informatica to load data from flat files, Oracle, and Excel files into the target Oracle data warehouse database.
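
A minimal sketch of the Spark SQL extraction and aggregation pattern referenced in the Databricks bullets above, assuming placeholder mount paths and column names: two raw feeds are normalized to a common shape and daily usage is aggregated per customer.

    # Sketch of a Databricks-style Spark SQL aggregation job; paths, columns, and
    # the Delta output location are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Raw feeds arrive in two formats; both are reduced to a common column set.
    cols = ["customer_id", "event_ts", "duration_sec"]
    parquet_events = spark.read.parquet("/mnt/raw/usage_parquet/").select(*cols)
    json_events = spark.read.json("/mnt/raw/usage_json/").select(*cols)
    events = parquet_events.unionByName(json_events)

    daily_usage = (events
                   .withColumn("usage_date", F.to_date("event_ts"))
                   .groupBy("customer_id", "usage_date")
                   .agg(F.count("*").alias("event_count"),
                        F.sum("duration_sec").alias("total_duration_sec")))

    # Delta Lake output on Databricks; swap the format for Parquet elsewhere.
    daily_usage.write.format("delta").mode("overwrite").save("/mnt/curated/daily_usage/")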
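
The Pub/Sub-to-BigQuery Dataflow work can be sketched with Apache Beam in Python as below; the project, region, topic, table, and schema are placeholders, not details from the engagement.

    # Streaming Dataflow sketch: read from Pub/Sub, parse JSON, write to BigQuery.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="example-project",            # placeholder project
        region="us-central1",
        temp_location="gs://example-bucket/tmp",
    )
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadPubSub" >> beam.io.ReadFromPubSub(
               topic="projects/example-project/topics/events")
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteBigQuery" >> beam.io.WriteToBigQuery(
               "example-project:analytics.events",  # placeholder table
               schema="event_id:STRING,event_type:STRING,event_ts:TIMESTAMP",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))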
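
A sketch of the API Gateway / Lambda / Snowflake path described above: a Lambda handler that runs a query against Snowflake and returns the rows as JSON. Connection settings come from environment variables, and the table, columns, and query-string parameter are assumptions.

    # Hypothetical Lambda handler behind API Gateway that queries Snowflake and
    # returns JSON; requires the snowflake-connector-python package in the layer.
    import json
    import os

    import snowflake.connector


    def lambda_handler(event, context):
        conn = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse=os.environ["SF_WAREHOUSE"],
            database=os.environ["SF_DATABASE"],
            schema=os.environ["SF_SCHEMA"],
        )
        try:
            cur = conn.cursor(snowflake.connector.DictCursor)
            cur.execute(
                "SELECT order_id, status, amount FROM orders WHERE order_date = %s",
                (event["queryStringParameters"]["order_date"],),  # assumed parameter
            )
            rows = cur.fetchall()
        finally:
            conn.close()

        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(rows, default=str),
        }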

Environment: Spark, Spark SQL, HBase, Hive, Oozie, Informatica, HQL, Sqoop, Java, Scala, Python, Shell scripting, Maven, GIT, Tableau.

Confidential, Dallas, TX

Hadoop Developer

Responsibilities:

  • Experienced in writing Spark Applications in Scala and Python (PySpark).
  • Imported Avro files using Apache Kafka and performed analytics using Spark with Scala.
  • Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results in Cassandra.
  • Configured, deployed and maintained multi-node Dev and Test Kafka Clusters.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (a Structured Streaming variant is sketched after this list).
  • Used sbt to build Scala-based Spark projects and executed them using spark-submit.
  • Launched multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the dockerized application from AWS to GCP.
  • Deployed the application to GCP using Spinnaker (RPM-based).
  • Developed pipeline for POC to compare performance/efficiency while running pipeline using the AWS EMR Spark cluster and Cloud Dataflow on GCP.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
  • Built the Cassandra nodes on AWS & set up the Cassandra cluster using Ansible automation tools.
  • Involved in Building ETL to Kubernetes with Apache Airflow and Spark in GCP.
  • Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run the Airflow workflows (see the DAG sketch after this list).
  • Worked and learned a great deal from Amazon Web Services (AWS) cloud services like EC2, S3, EMR, EBS, RDS and VPC.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
  • Involved in executing various Oozie workflows and automating parallel Hadoop MapReduce jobs.
  • Developed Oozie Bundles to Schedule Pig, Sqoop and Hive jobs to create data pipelines.
  • Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
  • Designed solutions for various system components using Microsoft Azure.
  • Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3 and create a virtual data lake without having to go through an ETL process.
  • Loaded Salesforce data every 15 minutes on an incremental basis into the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.
  • Configured Azure cloud services for endpoint deployment.
  • Designed & implemented migration strategies for traditional systems on Azure (lift and shift/Azure Migrate and other third-party tools); worked on the Azure suite: Azure SQL Database, Azure Data Lake (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and Azure Data Platform Services.
  • Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.
  • Wrote a generic, extensive data quality check framework to be used by the application, built on Impala.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using python (PySpark).
  • Providing guidance to the development team working on PySpark as ETL platform.
  • Ran PySpark jobs on a Kubernetes cluster for faster data processing.
  • Involved in the process of Cassandra data modelling and building efficient data structures.
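
The Kafka-to-Cassandra flow above used the DStream API; the sketch below shows an equivalent path with Structured Streaming and the DataStax Spark Cassandra Connector, writing each micro-batch via foreachBatch. Broker, topic, keyspace, table, and schema are placeholders.

    # Structured Streaming variant of the Kafka -> Spark -> Cassandra path; assumes
    # the spark-sql-kafka and spark-cassandra-connector packages are available.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = (SparkSession.builder
             .appName("learner-events-to-cassandra")
             .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
             .getOrCreate())

    schema = StructType([
        StructField("learner_id", StringType()),
        StructField("course_id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
              .option("subscribe", "learner-events")              # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))


    def write_to_cassandra(batch_df, batch_id):
        # Append each micro-batch to the Cassandra table through the connector.
        (batch_df.write
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="learning", table="learner_events")  # placeholders
         .mode("append")
         .save())


    (stream.writeStream
     .foreachBatch(write_to_cassandra)
     .option("checkpointLocation", "/tmp/checkpoints/learner-events")
     .start()
     .awaitTermination())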
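
A minimal Airflow DAG sketch (Airflow 2.x style) for the S3-to-Snowflake load mentioned above: a daily task issues a COPY INTO from an external stage. The connection settings, stage, and table names are illustrative, and a production DAG would more likely use the Snowflake provider operators.

    # Hypothetical daily DAG that copies staged S3 files into a Snowflake table.
    import os
    from datetime import datetime

    import snowflake.connector
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def load_s3_to_snowflake(**_):
        conn = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],  # assumed env-based credentials
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse="LOAD_WH",
            database="ANALYTICS",
            schema="RAW",
        )
        try:
            conn.cursor().execute(
                "COPY INTO raw.events FROM @raw.events_stage "
                "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
            )
        finally:
            conn.close()


    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="copy_events", python_callable=load_s3_to_snowflake)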

Environment: Hadoop, Hive, Impala, Oracle, Spark, Python, Pig, Sqoop, Oozie, Map Reduce, GIT, HDFS, Cassandra, Apache Kafka, Storm, Linux, Solr, Confluence, Jenkins.

Confidential, Dallas, TX

Hadoop developer

Responsibilities:

  • Handling the importing of data from various data sources (media, MySQL) and performing transformations using Hive, MapReduce.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Ran Pig scripts on Local Mode, Pseudo Mode, and Distributed Mode in various stages of testing.
  • Configured Hadoop cluster with Name node and slaves and formatted HDFS.
  • Performed Importing and exporting data from SQL server to HDFS and Hive using Sqoop.
  • End-to-end involvement in data ingestion, cleansing, and transformation in Hadoop.
  • Logical implementation and interaction with HBase.
  • Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users.
  • Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos Activity.
  • Implemented Apache Pig scripts to load data from and store data into Hive.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Used Impala to read, write and query the Hadoop data in HDFS from Cassandra and configured Kafka to read and write messages from external programs.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Created a complete processing engine based on the Cloudera distribution, enhancing performance.
  • Developed a data pipeline using Flume, Spark, and Hive to ingest, transform, and analyze data (a downstream PySpark sketch follows this list).
  • Wrote Flume configuration files for importing streaming log data into MongoDB with Flume.
  • Involved in production Hadoop cluster set up, administration, maintenance, monitoring and support.
  • Designed and Modified Database tables and used HBASE Queries to insert and fetch data from tables.
  • Developing and supporting Map-Reduce Programs running on the cluster.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume 1.7.0.
  • Implemented the file validation framework, UDFs, UDTFs and DAOs.
  • Preparation of Technical architecture and Low -level design documents.
  • Tested raw data and executed performance scripts.
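
A minimal PySpark sketch of the downstream half of the Flume pipeline described above: reading the raw log files Flume lands on HDFS, parsing them, and writing a Hive table. The path, delimiter, columns, and table name are assumptions.

    # Parse Flume-landed logs and persist curated records to a Hive table.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("flume-log-curation")
             .enableHiveSupport()
             .getOrCreate())

    raw_logs = spark.read.text("hdfs:///data/raw/app_logs/")  # placeholder path

    # Assumed pipe-delimited layout: timestamp|level|component|message
    parts = F.split("value", r"\|")
    parsed = raw_logs.select(
        parts[0].cast("timestamp").alias("log_ts"),
        parts[1].alias("level"),
        parts[2].alias("component"),
        parts[3].alias("message"),
    )

    (parsed.filter(F.col("level").isin("WARN", "ERROR"))
     .write.mode("overwrite")
     .saveAsTable("logs.app_errors"))  # placeholder Hive table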

Environment: Linux SUSE 12, Eclipse Photon (64-bit), JDK 1.8.0, Hadoop 2.9.0, Flume 1.7.0, HDFS, MapReduce, Pig 0.16.0, Spark, Hive 2.0, Apache Maven 3.0.3

Confidential

Java developer

Responsibilities:

  • Developed application web pages using JavaScript to add functionality, validate forms, and communicate with the server.
  • Extensively used Core Java concepts like Collections, Exception Handling, and Generics during development of business logic.
  • Extensively written Core Java and Multi-Threading code in application.
  • Provided connections using JDBC to the database and developed SQL queries to manipulate the data on DB.
  • Written JDBC statements, prepared statements and callable statements in Java, JSPs and Servlets.
  • Developed application components using JSPs, EJB's, Value Objects and model layer logic.
  • Performed CRUD operations like Update, Insert and Delete data in Oracle.
  • Created functional test cases and delivered bug fixes.
  • Performed code reviews and functional testing for a better client interface and usability.
  • Participated in meetings with the team, senior management, and client stakeholders.
  • Set up the WebLogic application server & deployed the application on it; good working background with J2EE technologies like Servlets and JSP.
  • Responsible for building scalable distributed data solutions using DataStax Cassandra.
  • Implemented the DAO layer using iBATIS and wrote queries for persisting core banking related information from the backend system using a query tool.
  • Developed web-based presentation using JSP and Servlet technologies and implemented MVC pattern using STRUTS framework.
  • Used Spring REST controllers and service classes to support migration to the Spring framework.
  • Experienced in developing UNIX shell scripts and Perl scripts to execute jobs and manipulate files and directories.
  • Used Ajax to communicate with the server to get the asynchronous response.
  • Developed code for Web services using XML, SOAP and used SOAP UI tool for testing the services.
  • Developed SOAP and WSDL files for Web Services interacting with business Logic.
  • Involved in planning process of iterations under the Agile Scrum methodology.
  • Used Log4J for logging user events and Maven for compilation and building JAR, WAR, and EAR files.
  • Used JUnit for unit testing and Continuum for integration testing.

Environment: Java 1.4, HTML, CSS, JSP 2.0, Servlets, Struts, EJB, JDBC, SQL, Oracle, Swing, Eclipse, MS Office, Windows, JPA Annotations, WebLogic.
