Spark Data Engineer Resume

SUMMARY

5+ years of IT experience in a variety of industries working with Big Data technologies, including the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.
Adept at configuring and installing Hadoop/Spark Ecosystem Components.
Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities, written in Scala. Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
Worked on real-time data integration using Kafka.
Experience integrating various data sources such as Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
Experience in extracting, transforming and loading (ETL) data from various sources into data warehouses, as well as data processing such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode and DataNode, and with Hadoop MapReduce programming.
Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering and data aggregation. Also possess detailed knowledge of the MapReduce framework.
Used IDEs and editors like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
Seasoned in machine learning algorithms and predictive modeling, such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
Experience working with NoSQL databases like Cassandra and HBase; developed real-time read/write access to very large datasets via HBase.
Developed Spark applications that can handle data from various RDBMS (MySQL, Oracle Database) and streaming sources.
Proficient in SQL for querying, data extraction/transformation and developing queries for a wide range of applications.
Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as the storage mechanism.
Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
Experience working with GitHub/Git 2.12 source and version control systems.
Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like the Collections Framework, exception handling, and the I/O system.

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark, Solr, Storm, Drill, Ambari, Mahout, MongoDB, Avro, Parquet and Snappy.

Hadoop Distributions: Cloudera, MapR, Hortonworks

Languages: Java, Scala, Python, Ruby, SQL, HTML, DHTML, JavaScript, XML and C/C++

NoSQL Databases: MongoDB and HBase

Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB

Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery, CSS, AngularJS, ExtJS and JSON

Frameworks: Struts, Spring and Hibernate

App/Web servers: WebSphere, WebLogic, JBoss and Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL, and Oracle

RDBMS: Teradata, Oracle PL/SQL, MS SQL Server, MySQL and DB2

Operating systems: UNIX, Linux, Mac OS, and Windows

ETL Tools: Informatica PowerCenter

Reporting tools: Tableau

PROFESSIONAL EXPERIENCE:

Confidential

Spark Data Engineer

Responsibilities:

Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Extracted, transformed and loaded data from various heterogeneous data sources and destinations using AWS Redshift.
Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
Wrote shell scripts to trigger DataStage jobs.
Assisted service developers in finding relevant content in the existing reference models.
Worked with sources such as Access, Excel, CSV, Oracle and flat files using connectors, tasks and transformations provided by AWS Data Pipeline.
Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
Developed a PySpark script to encrypt the raw data by applying hashing algorithms to client-specified columns.
Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
Compiled and validated data from all departments and presented it to the Director of Operations.
Created a KPI calculator sheet and maintained that sheet within SharePoint.
Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
Created a data model that correlates all the metrics and gives a valuable output.
Tuned SQL queries to bring down run time by working on indexes and execution plans.
Performed ETL testing activities such as running the jobs, extracting the data from the database using the necessary queries, transforming it, and uploading it into the data warehouse servers.
Performed pre-processing using Hive and Pig.
Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Implemented Copy activity and custom Azure Data Factory pipeline activities.
Primarily involved in data migration using SQL, SQL Azure, Kafka, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
Ensured deliverables (daily, weekly and monthly MIS reports) were prepared to satisfy the project requirements, cost and schedule.
Used DirectQuery in Power BI to compare legacy data with the current data, and generated and stored reports and dashboards.
Designed SSIS packages to extract, transform and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, Top N, summary, form, OLAP, ad-hoc, parameterized, interactive and custom reports, as well as subreports.
Created action filters, parameters and calculated sets for preparing dashboards and worksheets using Power BI.
Developed visualizations and dashboards using Power BI.
Used ETL to implement the Slowly Changing Dimension transformation to maintain historical data in the data warehouse.
Created dashboards for analyzing POS data using Power BI
Met with business/user groups to understand the requirements for the new Data Lake project.
Worked in Agile iterative sessions to create a Hadoop data lake for the client.
Defined the reference architecture for Big Data Hadoop to maintain structured and unstructured data within the enterprise.
Led the efforts to develop and deliver the data architecture plan and data models for the multiple data warehouses and data marts attached to the Data Lake project.
Created Talend jobs to copy files from one server to another, using Talend FTP components.
Used Power Query to acquire data and Power BI Desktop for designing rich visuals.

Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, macros), Spark, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, Confidential Cognos 10.1, DataStage, Cognos Report Studio 10.1, Cognos 8 & 10 BI, Cognos Connection, Cognos Office Connection, Cognos 8.2/3/4, DataStage and QualityStage 7.5

Confidential

Hadoop/AWS Developer

Responsibilities:

Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
Strong understanding of AWS components such as EC2 and S3
Performed Data Migration to GCP
Responsible for data services and data movement infrastructures
Experienced in ETL concepts, building ETL solutions and Data modeling
Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
Loaded application analytics data into data warehouse in regular intervals of time
Designed and built infrastructure for the Google Cloud environment from scratch
Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP
Worked on Confluence and Jira
Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores
Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
Compiled data from various sources to perform complex analysis for actionable results
Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
Optimized the TensorFlow model for efficiency
Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes
Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS
Built performant, scalable ETL processes to load, cleanse and validate data
Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
Collaborated with team members and stakeholders in design and development of the data environment
Prepared associated documentation for specifications, requirements, and testing.

Environment: AWS, GCP, BigQuery, GCS bucket, Google Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Confidential

Hadoop Developer

Responsibilities:

Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP
Strong understanding of AWS components such as EC2 and S3
Implemented a Continuous Delivery pipeline with Docker and GitHub
Used Google Cloud Functions with Python to load data into BigQuery from CSV files as they arrive in a GCS bucket
Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
Devised simple and complex SQL scripts to check and validate data flow in various applications.
Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export through Python.
Developed and deployed data pipelines in cloud environments such as AWS and GCP
Performed data engineering functions: data extraction, transformation, loading, and integration in support of enterprise data infrastructures such as data warehouses, operational data stores and master data management
Responsible for data services and data movement infrastructures; good experience with ETL concepts, building ETL solutions and data modeling
Architected several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.
Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications)
Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
Developed logistic regression models (Python) to predict subscription response rate based on customer variables such as past transactions, response to prior mailings, promotions, demographics, interests, and hobbies.
Developed a near-real-time data pipeline using Spark
Hands-on experience with GCP, BigQuery, GCS bucket, Google Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Stackdriver
Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
Worked on Confluence and Jira; skilled in data visualization with libraries like Matplotlib and seaborn
Hands-on experience with big data tools like Hadoop, Spark, Hive
Experience implementing machine learning back-end pipelines with Pandas and NumPy

Environment: GCP, BigQuery, GCS bucket, Google Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, seaborn library, text mining, NumPy, Scikit-learn, Heat maps, Bar charts, Line charts, ETL workflows, linear regression, multivariate regression, Python, Scala, Spar
