
Data Engineer Resume


Austin, TX

PROFESSIONAL SUMMARY:

  • Data Engineer with 8+ years of experience in building data intensive applications, tackling challenging architectural and scalability problems.
  • Experience with Big Data technologies such as Hadoop, Hive, Spark, Kafka, Sqoop, Oozie, Flume, HBase, and AWS.
  • Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Experience in developing Spark applications using Spark SQL, PySpark, and Delta Lake in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a minimal sketch follows this summary).
  • Good understanding of Spark Architecture, MPP Architecture, including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.
  • Productionized models in cloud environments, including automated processes, CI/CD pipelines, monitoring and alerting, and issue troubleshooting; presented models and results to technical and non-technical audiences.
  • Experience in Database Design and development with Business Intelligence using SQL Server Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema and Snowflake Schema.
  • Experience working with Public Cloud platforms like Google Cloud, AWS, and Azure
  • Experience in creating AWS compute services such as EC2 instances and Elastic Load Balancing, as well as creating and managing AWS storage services such as S3, EBS, and Amazon CloudFront.
  • Expertise working with Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
  • Highly skilled in various automation tools, continuous integration workflows, managing binary repositories and containerizing application deployments and test environments.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Experience in creating Docker containers leveraging existing Linux Containers and AMI's in addition to creating Docker containers from scratch.
  • Experience in Extraction, Transformation and Loading (ETL) data from various sources into Data Warehouses, as well as data processing like collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
  • Used Kafka for activity tracking and log aggregation.
  • Experience in efficiently performing ETL using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, DynamoDB, and other services of the AWS family.
  • Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad ++, and Visual Studio for development.
  • Experience in handling Python and Spark contexts when writing PySpark programs for ETL.
  • Design & implement migration strategies for traditional systems on Azure (Lift and shift/Azure Migrate, other third-party tools).
  • Collaborated with application architects on migrating Infrastructure as a Service (IaaS) applications to Platform as a Service (PaaS).
  • Experience in Performance Tuning and optimization (PTO), Microsoft Hyper-V virtual infrastructure.
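
A minimal sketch of the Databricks extract-transform-aggregate pattern described in the summary above, assuming a Spark session with the Delta Lake runtime available; paths, table names, and columns are illustrative, not taken from the actual projects:

    # Minimal Databricks/Delta Lake sketch; paths, tables, and columns are illustrative assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("usage_aggregation").getOrCreate()

    # Extract from mixed source formats (hypothetical mount points)
    csv_events = spark.read.option("header", True).csv("/mnt/raw/events_csv/")
    json_events = spark.read.json("/mnt/raw/events_json/")
    # unionByName with allowMissingColumns requires Spark 3.1+
    events = csv_events.unionByName(json_events, allowMissingColumns=True)

    # Transform and aggregate customer usage
    usage = (events
             .withColumn("event_date", F.to_date("event_ts"))
             .groupBy("customer_id", "event_date")
             .agg(F.count("*").alias("events"),
                  F.sum("duration_sec").alias("total_seconds")))

    # Load into a Delta table (requires the Delta Lake runtime, e.g. on Databricks)
    usage.write.format("delta").mode("overwrite").saveAsTable("analytics.customer_usage_daily")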

TECHNICAL SKILLS:

Programming Language: Java 1.7/1.8, SQL, Python, Scala, UNIX Shell Script, PowerShell, YAML

Cloud Platform: Azure, AWS, GCP

Application/Web Servers: WebLogic, Apache Tomcat 5.x/6.x/7.x/8.x

Hadoop Distributions: Hortonworks, Cloudera Hadoop

Hadoop/Bigdata Technologies: HDFS, Hive, Sqoop, Yarn, Spark, Spark SQL

Big Data: Azure Storage, Azure Data Factory, Azure Analysis Services, Azure Database, Map Reduce, AWS

Database Server: Oracle 9i/10g, SQL server, MySQL, SSIS, SSAS, SSRS

Version Control: SVN, GIT, GITHUB

DevOps, CI/CD Tools: Docker, Jenkins, Terraform

ETL Tools: Informatica, Data Studio

Reporting Tools: Power BI, Tableau, SSRS

Virtualization: Citrix, VDI, VMware

PROFESSIONAL EXPERIENCE:

Data Engineer

Confidential - Austin TX

Responsibilities:

  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Wrote shell scripts to trigger DataStage jobs.
  • In-depth experience in automating various PySpark, Hive, Bash and python applications using Airflow and Oozie
  • Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers
  • Implemented PySpark logic to transform and process various formats of data like XLSX, XLS, JSON, TXT.
  • Built scripts to load PySpark-processed files into Redshift and applied a variety of PySpark transformations.
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Designed and Implemented Sharding and Indexing Strategies for MongoDB servers.
  • Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
  • Created a data model that correlates all the metrics and produces valuable output.
  • Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and loading it into the data warehouse servers.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data modeling tools like ER Studio.
  • Worked on Oracle databases, Redshift, and Snowflake.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Implemented Copy activity, Custom Azure Data Factory Pipeline Activities
  • Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB).
  • Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target snowflake database.
  • Design, develop, and test dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance (a minimal sketch follows this list).
  • Worked extensively in Jupyter to develop and test PySpark applications.
  • Develop and deliver management dashboards that describe the state of master data health across the company
  • Worked on DirectQuery in Power BI to compare legacy data with current data, and generated and stored reports and dashboards.
  • Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP)
  • Experience with Unix/Linux systems, shell scripting, and building data pipelines.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
  • Developed visualizations and dashboards using Power BI.
  • Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
  • Created dashboards for analyzing POS data using Power BI.
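
A minimal sketch of the PySpark Spark SQL extract/transform pattern referenced in this list; the paths, view name, and columns are illustrative assumptions, not the actual project code:

    # Minimal PySpark Spark SQL ETL sketch; paths and schema are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("revenue_etl").getOrCreate()

    # Extract: read raw JSON events (hypothetical location)
    events = spark.read.json("s3://example-bucket/raw/events/")
    events.createOrReplaceTempView("events")

    # Transform: aggregate with Spark SQL
    daily_revenue = spark.sql("""
        SELECT event_date, customer_id, SUM(amount) AS revenue
        FROM events
        GROUP BY event_date, customer_id
    """)

    # Load: write partitioned Parquet for downstream warehouse loads
    (daily_revenue.write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet("s3://example-bucket/curated/daily_revenue/"))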

Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Spark, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, MongoDB, DataStage and QualityStage.

Data Engineer

Confidential - San Jose, CA

Responsibilities:

  • Responsible for architecting Hadoop clusters with CDH3; involved in the installation of CDH3 and the upgrade from CDH3 to CDH4.
  • Worked on creating keyspaces in Cassandra for saving Spark batch output.
  • Worked on a Spark application to compact small files in the Hive ecosystem so they match the HDFS block size.
  • Managed migration of on-premises servers to AWS by creating golden images for upload and deployment.
  • Manage multiple AWS accounts with multiple VPC's for both production and non-production where primary objectives are automation, build out, integration and cost control
  • Implemented real-time streaming ingestion using Kafka and Spark Streaming (a minimal sketch follows this list).
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Loaded data using Spark-streaming with Python
  • Involved in requirement and design phase to implement Streaming Lambda Architecture to use real time streaming using Spark and Kafka
  • Experience in loading the data into Spark RDD and performing in-memory data computation to generate the output responses
  • Migrated complex Map Reduce programs into In-memory Spark processing using Transformations and actions
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services.
  • Used AWS Redshift to extract, transform, and load data from various heterogeneous data sources and destinations.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with Continuous Integration and Continuous Deployment using AWS Lambda and AWS Glue code pipelines with AWS Connect.
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design
  • Developed a full-text search platform using NoSQL and the Logstash/Elasticsearch engine, allowing for much faster, more scalable, and more intuitive user searches.
  • Developed Sqoop scripts to handle the interaction between Pig and the MySQL database.
  • Worked on Performance Enhancement in Pig, Hive and HBase on multiple nodes
  • Worked with Distributed n-tier architecture and Client/Server architecture
  • Supported MapReduce programs running on the cluster and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Developed MapReduce application using Hadoop, MapReduce programming and HBase
  • Documented logical, physical, relational, and dimensional data models. Designed the data marts in dimensional data modeling using star and snowflake schemas.
  • Evaluated usage of Oozie for Workflow Orchestration and experienced in cluster coordination using Zookeeper
  • Developing ETL jobs with organization and project defined standards and processes
  • Experienced in enabling Kerberos authentication in ETL process
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Continuously monitor and manage data pipeline (CI/CD) performance alongside applications from a single console with GCP.
  • Designed the GUI using the Model-View-Controller architecture (Struts framework).
  • Integrated Spring DAO for data access using Hibernate and involved in the Development of Spring Framework Controller
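
A minimal sketch of Kafka-to-HDFS streaming ingestion along the lines described in this list, written with Spark Structured Streaming rather than the original DStream API; the broker, topic, and paths are illustrative assumptions, and the spark-sql-kafka connector package is assumed to be on the classpath:

    # Minimal Kafka ingestion sketch with Spark Structured Streaming;
    # broker addresses, topic, and paths are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_ingest").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers key/value as binary; cast the payload to string for downstream parsing
    messages = stream.select(col("value").cast("string").alias("json_payload"))

    query = (messages.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/landing/clickstream/")
             .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
             .start())
    query.awaitTermination()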

Environment: Spark, Hadoop, HDFS, Map Reduce, Hive, Pig, Sqoop, Oozie, HBase, GCP, AWS, MySQL, Java, J2EE, Eclipse, HQL, ETL.

Data Engineer

Confidential, St. Louis, MO

Responsibilities:

  • Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • Strong understanding of AWS components such as EC2 and S3
  • Implemented a Continuous Delivery pipeline with Docker and Git Hub
  • Worked with Google Cloud Functions in Python to load data into BigQuery on arrival of CSV files in a GCS bucket (a minimal sketch follows this list).
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
  • Implemented reprocessing of failed messages in Kafka using offset IDs.
  • Implemented Kafka producer and consumer applications on a Kafka cluster set up with the help of ZooKeeper.
  • Used Spring Kafka API calls to process the messages smoothly on Kafka Cluster setup.
  • Responsible for data services and data movement infrastructure; good experience with ETL concepts, building ETL solutions, and data modeling.
  • Developed various automated scripts for data ingestion (DI) and data loading (DL) using Python and Java MapReduce.
  • Hands-on experience architecting ETL transformation layers and writing Spark jobs to do the processing.
  • Gathered and processed raw data at scale, including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Developed logistic regression models (Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests, and hobbies, etc.
  • Hands-on experience with AWS services such as EC2, S3, ELB, RDS, SQS, EBS, VPC, AMI, SNS, CloudWatch, CloudTrail, CloudFormation, AWS Config, Auto Scaling, CloudFront, IAM, and Route 53.
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a minimal DAG sketch appears after the Environment line below).
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
  • Worked with Confluence and Jira; skilled in data visualization libraries such as Matplotlib and Seaborn.
  • Hands on experience with big data tools like Hadoop, Spark, Hive
  • Experience implementing machine learning back-end pipeline with Pandas, Numpy
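
A minimal sketch of the GCS-triggered Cloud Function load into BigQuery described in this list; the project, dataset, and table identifiers are illustrative assumptions:

    # Sketch of a GCS-triggered Cloud Function that loads an arriving CSV into BigQuery.
    # Project, dataset, and table names are illustrative assumptions.
    from google.cloud import bigquery

    def load_csv_to_bq(event, context):
        """Triggered by a finalize event on a GCS bucket (1st-gen background function)."""
        client = bigquery.Client()
        uri = f"gs://{event['bucket']}/{event['name']}"

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = client.load_table_from_uri(
            uri, "example_project.landing.sales_raw", job_config=job_config
        )
        load_job.result()  # wait for the load job to complete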

Environment: GCP, BigQuery, GCS bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, text mining, NumPy, Scikit-learn, heat maps, bar charts, line charts, ETL workflows, linear regression, multivariate regression, Scala, Spark
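
And a minimal Apache Airflow DAG sketch for the pipeline scheduling mentioned in the responsibilities above (Airflow 2.x imports; the DAG id, schedule, and callable body are illustrative assumptions):

    # Minimal Airflow DAG sketch for scheduling a daily ETL task;
    # DAG id, schedule, and the callable are illustrative assumptions.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_daily_etl(**context):
        # Placeholder for the actual extract/transform/load logic
        print(f"Running ETL for {context['ds']}")

    with DAG(
        dag_id="daily_etl_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        etl_task = PythonOperator(
            task_id="run_daily_etl",
            python_callable=run_daily_etl,
        )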

Big Data Engineer

Confidential

Responsibilities:

  • Imported data from different relational data sources like RDBMS, Teradata to HDFS using Sqoop.
  • Imported bulk data into HBase using Map Reduce programs.
  • Performed analytics on time-series data stored in HBase using the HBase API.
  • Designed and implemented Incremental Imports into Hive tables.
  • Used the REST API to access HBase data and perform analytics (a minimal sketch follows this list).
  • Worked in loading and transforming large sets of structured, semi structured, and unstructured data.
  • Involved in collecting, aggregating, and moving data from servers to HDFS using Apache Flume.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Experienced in managing and reviewing the Hadoop log files.
  • Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
  • Worked with the Avro data serialization system to handle JSON data formats.
  • Worked on different file formats like Sequence files, XML files and Map files using Map Reduce Programs.
  • Involved in Unit testing and delivered Unit test plans and results documents using Junit.
  • Exported data from HDFS environment into RDBMS using Sqoop for report generation and visualization purpose.
  • Worked on Oozie workflow engine for job scheduling.
  • Created and maintained technical documentation for launching Hadoop Clusters and for executing Pig Scripts.
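
A minimal sketch of reading HBase data over its REST gateway, as referenced in this list; the gateway host, table, and row key are illustrative assumptions:

    # Sketch of reading a row from HBase over its REST gateway (Stargate);
    # host, table, and row key are illustrative assumptions.
    import base64
    import requests

    HBASE_REST = "http://hbase-rest.example.com:8080"

    def get_row(table, row_key):
        resp = requests.get(
            f"{HBASE_REST}/{table}/{row_key}",
            headers={"Accept": "application/json"},
            timeout=10,
        )
        resp.raise_for_status()
        cells = resp.json()["Row"][0]["Cell"]
        # HBase REST returns column names and values base64-encoded
        return {
            base64.b64decode(c["column"]).decode(): base64.b64decode(c["$"]).decode()
            for c in cells
        }

    print(get_row("metrics", "sensor42#2015-06-01"))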

Environment: Hadoop, HDFS, Map Reduce, Hive, Oozie, Sqoop, Pig, Java, Rest API, Maven, JUnit.

Hadoop Developer

Confidential

Responsibilities:

  • Developed Spark Python code for a regular expression project in the Hadoop/Hive environment.
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Developed Hive, Bash scripts for source data validation and transformation. Automated data loading into HDFS and Hive for pre-processing the data using One Automation.
  • Gather data from Data warehouses in Teradata and Snowflake
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Generate reports using Tableau.
  • Experience building Big Data applications using Cassandra and Hadoop.
  • Utilized Sqoop, ETL, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
  • Worked on batch data of different granularities, ranging from hourly and daily to weekly and monthly.
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Ambari, PIG, and Hive.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote SnowSQL scripts.
  • Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
  • Optimized Hive queries by parallelizing with partitioning and bucketing (a minimal sketch follows this list).
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Worked extensively with Teradata, Hadoop Hive, Spark, SQL, PL/SQL, and SnowSQL.
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from Hortonworks Data Platform and running services through Cloudera manager
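
A minimal sketch of the Hive partitioning and bucketing approach mentioned in this list, expressed through PySpark; the table and column names are illustrative assumptions:

    # Sketch of writing a Hive table partitioned and bucketed for faster queries;
    # source table, target table, and column names are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_partition_bucket")
             .enableHiveSupport()
             .getOrCreate())

    # Assumes the staging table already exists in the Hive metastore
    orders = spark.table("staging.orders_raw")

    # Partition by load date and bucket by customer_id so filters and joins prune work
    (orders.write
     .mode("overwrite")
     .partitionBy("load_date")
     .bucketBy(32, "customer_id")
     .sortBy("customer_id")
     .saveAsTable("warehouse.orders"))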

Environment: Hadoop, HDFS, AWS, Vertica, Bash, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Python, Java, NiFi, HBase, MySQL, Kerberos, Maven, Shell Scripting, SQL

Data Analyst

Confidential

Responsibilities:

  • Involved in designing physical and logical data model using ERwin Data modeling tool.
  • Designed the relational data model for operational data store and staging areas, Designed Dimension & Fact tables for data marts.
  • Extensively used ERwin data modeler to design Logical/Physical Data Models, relational database design.
  • Created Stored Procedures, Database Triggers, Functions and Packages to manipulate the database and to apply the business logic according to the user's specifications.
  • Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.
  • Created database links to connect to other servers and access the required information.
  • Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.
  • Used Advanced Querying for exchanging messages and communicating between different modules.
  • Performed system analysis and design for enhancements; tested forms, reports, and user interaction.

Environment: Oracle 9i, SQL* Plus, PL/SQL, ERwin, TOAD, Stored Procedures.
