Data Engineer Resume
Plano, TX
SUMMARY
- Experienced Data Engineering and Analytics leader with expertise in building and growing high-performing teams. Experience with 'Big Data' cloud technologies, traditional databases, traditional BI platforms, lightweight visualization tools, data pipeline/ETL architecture, and overall business alignment for optimal data impact.
- Detail-oriented, with proven expertise in gathering requirements from clients, vendors, consultants, and other stakeholders, and delivering on them through data warehousing / business intelligence solutions.
- Experienced in IT project management activities including project scoping, estimation, planning, finalization of technical/functional specifications, resource optimization, and quality management of products and software applications.
- Insightful knowledge of business process analysis, design and debugging of jobs for cleansing and transforming data, application-based process reengineering, process optimization, cost control, and revenue maximization through data warehousing / business intelligence solutions using cutting-edge technologies.
- Well-versed in normalization/denormalization techniques for optimum performance in relational and dimensional database environments.
- Developed Bash scripts that run Python and Hive jobs as scheduled crontab entries on servers across the organization, automating ETL loads to the database with notification on completion.
- Built Airflow pipelines that migrated petabytes of data from Oracle, Hadoop, MSSQL, and MySQL sources to the AWS cloud (a minimal PySpark sketch appears at the end of this summary).
- Implemented data pipeline checkpoints to validate data flow and mask sensitive data.
- Proficient in data quality, cleansing, migration, and warehousing activities, with exposure to ETL processes and knowledge of the AWS Glue suite, Snowflake (cloud data warehouse), EMR, Zeppelin, and PySpark.
- Trained co-analysts on the various data sources and tools used to monitor daily pipelines and validate data flow.
- Developed and maintained automated APIs and ETL pipelines in PySpark that support product development.
- Expertise in writing T-SQL queries, dynamic queries, sub-queries, and complex joins for building complex stored procedures, triggers, user-defined functions, views, cursors, and common table expressions (CTEs), along with backups, recovery, SQL Server Agent, and Profiler on SQL Server 2000/2005/2008 R2/2012.
- Experienced in building interactive reports and dashboards, with strong data visualization design skills.
- Adept at researching the best ways to visualize, monitor, and automate pipelines for better performance.
- Interacted with clients for requirements gathering, system study, and analysis, in addition to documenting, tracking, and communicating bugs, enhancements, analyses, and unresolved problems.
- Delivered all projects within the stipulated budget and timelines despite demanding schedules.
- Improved delivery speed and time to market threefold by applying CI/CD best practices with Jenkins pipelines.
- An effective communicator with excellent analytical, interpersonal, and leadership skills.
- Expertise in writing/debugging/enhancing UNIX Shell Scripts.
- Experience in configuring REST APIs and Web APIs and troubleshooting issues.
- Working on an in-house metadata tool, a centralized repository for all enterprise table metadata, built with React, Redux, Flask, Neo4j, and Elasticsearch.
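A minimal PySpark sketch of the kind of migration step described above (an on-premises Oracle table landed in S3 over JDBC); the host, table, bucket, and partition names are hypothetical placeholders:

```python
# Sketch: read one table from an on-premises Oracle source over JDBC and
# land it in S3 as Parquet. All connection details are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oracle_to_s3_migration")
    .getOrCreate()
)

# Read the source table over JDBC (credentials would come from a secret store).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # hypothetical host
    .option("dbtable", "SALES.TRANSACTIONS")                        # hypothetical table
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", 10000)
    .load()
)

# Light cleansing: drop fully null rows and normalize column names.
cleaned_df = source_df.dropna(how="all")
cleaned_df = cleaned_df.toDF(*[c.lower() for c in cleaned_df.columns])

# Land the data in S3 as partitioned Parquet for downstream consumption.
(
    cleaned_df.write.mode("overwrite")
    .partitionBy("txn_date")                                        # hypothetical partition column
    .parquet("s3://example-data-lake/raw/sales/transactions/")      # hypothetical bucket
)
```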
TECHNICAL SKILLS
Streaming: Spark Streaming, AWS Kinesis, Kafka
Data Migration: AWS Data Pipeline, Airflow, Apache Sqoop
Programming Languages: Python, R, T-SQL, PL/SQL, JavaScript (React), VBA
Big Data: HDFS, MapReduce, PySpark, SparkSQL, SparkML
Cloud: AWS (EMR, CloudFormation, Lambda, VPC, Route53, EC2, Kinesis, Security Groups, IAM, S3, RDS, Redshift, Elasticsearch)
Database: OracleDB, DynamoDB, Neo4j, MS SQL Server, MySQL, MS Access
Statistical Software: R Studio, IBM SPSS, SSAS, SSRS, Tableau, MS Excel
Data Pipeline: Nebula, Apache Nifi
Agile Board: JIRA, Scrum
PROFESSIONAL EXPERIENCE
Confidential, Plano, TX
Data Engineer
Responsibilities:
- Worked with business partners, product development teams, and senior designers to capture requirements, determine solutions that integrate with existing BI assets, review pipelines, develop iterative revisions, and translate those requirements into ETL workflows using SparkSQL and PySpark.
- Established best practices in data ingestion for all internal and external feeds across the exploratory, development, and production lakes.
- Built an ETL framework for data migration from on-premises data sources such as Hadoop and Oracle to AWS using Apache Airflow, Apache Sqoop, and Apache Spark (PySpark).
- Established a per-feed data quality and validation check process to maintain data accuracy and reliability.
- Worked with clients to assist with the migration of a new fulfillment center data platform from local infrastructure to the AWS cloud.
- Engineered the release, integration, and customization of Airflow for a 25+ person analytics team.
- Automated DAG generation per source system using dynamic DAGs (see the DAG sketch after this list); data extraction was done with Sqoop and AWS DMS. Migrated 500+ tables from in-house storage to S3.
- Working on AWS Data Pipeline to set up a pipeline that ingests data from Spark and migrates it to a Snowflake database.
- Created Airflow DAGs to sync files from Box, analyze data quality, and alert on missing files.
- Set up an automated process to consume Confidential client transactions stored in DynamoDB, then transformed and stored the data in relational format in Spark tables (see the DynamoDB sketch after this list).
- Working on building an enterprise data dictionary and metadata tool using React, Redux, Neo4j, Flask, and Elasticsearch.
- Maintained pipeline metadata, naming standards, and warehouse standards for future application development; parsed high-level design specs into simple ETL coding and mapping standards.
- Managed a legacy SQL system of record (SOR) for 1+ million financial instruments; added 15 new financial products, including interest rate futures, using Python.
- Optimized query performance through T-SQL modifications, removing unnecessary columns and eliminating redundant and inconsistent data.
- Provided ad-hoc report metrics and dashboards to drive key business decisions using Tableau and Confidential tools.
- Designed a library for emailing executive reports from the Tableau REST API using Python, Kubernetes, Git, AWS CodeBuild, and Airflow.
- Presented dashboards to business users and cross-functional teams, defined KPIs (key performance indicators), and identified data sources.
- Documented the source-to-target mappings for both data integration and web services.
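A minimal sketch of the dynamic-DAG pattern referenced above: one ingestion DAG is generated per source system from a configuration list. The source names, schedule, and Sqoop command are hypothetical placeholders.

```python
# Sketch: generate one Airflow DAG per source system (dynamic DAGs).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical per-source configuration; in practice this could be read
# from a metadata table or a YAML file.
SOURCE_SYSTEMS = [
    {"name": "oracle_sales", "tables": ["orders", "customers"]},
    {"name": "mysql_inventory", "tables": ["stock_levels"]},
]

def build_dag(source):
    """Build an ingestion DAG for a single source system."""
    dag = DAG(
        dag_id=f"ingest_{source['name']}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    with dag:
        for table in source["tables"]:
            BashOperator(
                task_id=f"sqoop_import_{table}",
                # Hypothetical sqoop invocation; real connection strings would
                # come from Airflow connections or a secrets backend.
                bash_command=(
                    f"echo sqoop import --table {table} "
                    f"--target-dir s3://example-lake/{source['name']}/{table}"
                ),
            )
    return dag

# Register one DAG per source system so the scheduler picks them all up.
for src in SOURCE_SYSTEMS:
    globals()[f"ingest_{src['name']}"] = build_dag(src)
```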
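A minimal sketch of the DynamoDB consumption step mentioned above, assuming a hypothetical table name, region, and target table, and a configured Spark catalog:

```python
# Sketch: scan a DynamoDB transactions table with boto3, flatten the items,
# and store them in relational form as a Spark table.
from decimal import Decimal

import boto3
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dynamodb_to_spark").getOrCreate()

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("client_transactions")  # hypothetical table name

# Paginate through the table with scan(); DynamoDB numbers arrive as Decimal,
# so cast them to float for Spark schema inference.
items = []
response = table.scan()
while True:
    for item in response["Items"]:
        items.append({k: (float(v) if isinstance(v, Decimal) else v)
                      for k, v in item.items()})
    if "LastEvaluatedKey" not in response:
        break
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])

# Store the flattened records as a Spark table (hypothetical target name).
df = spark.createDataFrame([Row(**it) for it in items])
df.write.mode("overwrite").saveAsTable("client_transactions_relational")
```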
Environment: AWS (S3, RDS, DMS, Snowflake, EMR, EC2, Glue), Kubernetes, Airflow, Nebula, OracleDB, DynamoDB, HDFS, PySpark, SQL Server Management Studio (SSIS, SSRS), Flask, Zeppelin, SQL, Python, Git, Tableau, MS Office, JIRA, Windows.
Confidential, Bloomfield, CT
Data Engineer
Responsibilities:
- Member of the fraud analytics team responsible for developing intelligent, fault-tolerant big data applications that flag potentially fraudulent claims by analyzing 837P EDI files, provider records, and NPPES healthcare data in an automated manner.
- Automated ad-hoc ETL processes, making it easier to provision data while reducing query response times by as much as 50%.
- Built scalable, robust data pipelines for the business partners' analytical platform to automate their reporting dashboards using SparkSQL and PySpark, and scheduled those pipelines.
- Coordinated weekly on-call support with onshore and offshore software teams at different hospitals and resolved incidents they reported.
- Involved in the data integration process, identifying issues and resolving them efficiently.
- Implemented Spark Streaming with a Kafka consumer to subscribe to the organization's claims intake network and consume the feed in real time (see the streaming sketch after this list).
- Optimized Spark Jobs for the efficient utilization of the cluster.
- Developed application manager services to monitor the health of the dedicated edge node and the applications running on YARN.
- Tested data models and data maps (extract, transform, and load analysis) of the data mart and feeder systems in the aggregation effort.
- Created stored procedures to maintain metadata and triggers for audit purposes in the MSSQL Server database.
- Implemented data partitioning, error handling through TRY-CATCH-THROW statements, and common table expressions (CTEs).
- Achieved continuous integration and delivery to Databricks by defining build and release pipeline tasks on AWS DevOps that build, drop, and deploy the application libraries into DBFS.
- Used the SQLAlchemy ORM on top of SQL Server for caching and auditing parser runs, with Alembic for database schema migrations (see the audit-model sketch after this list).
- Utilized GitHub as a version control platform for developed applications.
- Developed Python applications to push data from big data tables to archive tables, ensuring the data models do not exhibit performance-specific issues.
- Developed dashboards and report stories on sensitive data to support KPI monitoring for senior management.
- Troubleshot data and reporting issues reported by the team or client.
- Provided ongoing and ad-hoc reporting on renewal rates, match-back reporting, employee KPIs, and customer demographics to identify fraud, waste, and abuse in healthcare claims.
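A minimal Structured Streaming sketch of the Kafka consumption described above; the broker, topic, payload schema, and output paths are hypothetical placeholders:

```python
# Sketch: subscribe to a claims topic on Kafka, parse the JSON payload,
# and continuously append the parsed feed for downstream fraud-flagging jobs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

# Hypothetical claim payload schema.
claim_schema = StructType([
    StructField("claim_id", StringType()),
    StructField("provider_npi", StringType()),
    StructField("billed_amount", StringType()),
])

raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "claims-inlet")                # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; cast to string and parse the JSON.
claims = (
    raw_stream
    .select(from_json(col("value").cast("string"), claim_schema).alias("claim"))
    .select("claim.*")
)

query = (
    claims.writeStream.format("parquet")
    .option("path", "s3://example-claims/stream/")               # hypothetical path
    .option("checkpointLocation", "s3://example-claims/checkpoints/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```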
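A minimal sketch of the SQLAlchemy audit pattern mentioned above: a parser-run model persisted to SQL Server, with one row recorded per run. The connection string, table, and columns are hypothetical; Alembic would own the schema migrations for this model.

```python
# Sketch: ORM model plus a helper that records an audit row per parser run.
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ParserRun(Base):
    """Audit record for a single parser run (hypothetical columns)."""
    __tablename__ = "parser_runs"

    id = Column(Integer, primary_key=True)
    feed_name = Column(String(100), nullable=False)
    status = Column(String(20), nullable=False)
    records_parsed = Column(Integer, default=0)
    started_at = Column(DateTime, default=datetime.utcnow)

# Hypothetical SQL Server connection via pyodbc; the table itself would be
# created by an Alembic migration rather than create_all().
engine = create_engine("mssql+pyodbc://user:pass@dsn_name")
Session = sessionmaker(bind=engine)

def record_run(feed_name: str, status: str, records_parsed: int) -> None:
    """Persist one audit row describing a completed parser run."""
    session = Session()
    try:
        session.add(ParserRun(feed_name=feed_name, status=status,
                              records_parsed=records_parsed))
        session.commit()
    finally:
        session.close()
```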
Environment: PySpark, Databricks, Kafka, SQL Server 2012, Tableau, JavaScript, Python, SQL, MS Excel, MS Access, UNIX, JIRA, Git.
Confidential, Richardson, TX
Data Analyst Intern
Responsibilities:
- Designed a data-driven email campaign using Salesforce which led to a 15% increase in lead-to-enrollment conversion.
- Performed A/B test analysis and measured campaign performance across customer segments (see the sketch after this list).
- Worked closely with the team to review code for compatibility issues, resolve issues as they arise, and implement deployment processes and improvements on a continuous basis.
- Customized workflow implementation using SFDC standard objects such as Leads, Reports, Dashboards.
- Collaborated with the data engineering team and the A&M marketing operations team to implement the ETL process; wrote and optimized SQL queries to extract the data required for the target analytical requirements.
- Monitored data pipelines that extract student data from disparate sources such as CSV, text, Parquet, and log files into landing tables for data mining.
- Performed EDA on A&M student data to identify key attributes and demographics, and used those attributes to build a targeted student enrollment campaign. Built data pipelines from relational databases and shared insights with the downstream CRM.
- Involved in designing, configuring, and customizing the Salesforce/Force.com Service Cloud, Sales Cloud, and Marketing Cloud implementation.
- Developed SOQL and SOSL queries to retrieve data from related objects and used Force.com Explorer for SOQL testing.
- Used Data Loader for insert, update, and bulk import/export of data for Salesforce.com objects, reading, extracting, and loading data from comma-separated values (CSV) files.
- Created Excel and MS Word add-ins using Visual Studio Tools for Office (VSTO) per business requirements.
- Created a word cloud in Tableau to monitor popular queries received over time and flag those needing immediate attention.
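A minimal sketch of the kind of two-proportion comparison used in the A/B test analysis above; the variant counts are illustrative placeholders, not campaign results:

```python
# Sketch: two-proportion z-test comparing lead-to-enrollment conversion
# between two email campaign variants.
from math import sqrt

from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Illustrative counts: variant A converted 180 of 2,000 leads, variant B 240 of 2,100.
z, p = two_proportion_ztest(180, 2000, 240, 2100)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```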
Environment: Salesforce.com (Sandbox), MS Office, MS Visual Studio, SQL, Apex, Trello.