
Data Engineer Resume


San Jose, CA

SUMMARY:

  • Over 6 years of experience as a Data Scientist/Data Engineer and SAS Developer with excellent Statistical Analysis, Data Mining and Machine Learning skills.
  • Worked in the domains of Telecommunications, Financial Services, Healthcare and Retail.
  • Sound financial domain knowledge of Fixed Income, Bonds, Equities, Trade Cycle, Derivatives (Options and Futures), Portfolio Management, Sales and Marketing, CCAR and risk management.
  • Expertise in managing the full life cycle of Data Science projects, transforming business requirements into Data Collection, Data Cleaning, Data Preparation, Data Validation, Data Mining, and Data Visualization from structured and unstructured data sources.
  • Hands-on experience in writing queries in SQL and R to extract, transform and load (ETL) data from large datasets using data staging.
  • Proven ability in using Text Analytics such as Topic Modeling and sentiment analysis.
  • Hands-on experience with statistical modeling techniques such as linear regression, Lasso regression, logistic regression, elastic net, ANOVA, Monte Carlo methods, factor analysis, clustering analysis, principal component analysis and Bayesian inference.
  • Hands-on experience in writing User Defined Functions (UDFs) in Scala to extend functionality for data preprocessing.
  • Professional working experience with Machine Learning algorithms such as LDA, linear regression, logistic regression, GLM, SVM, Naive Bayes, Random Forests, Decision Trees, Clustering, neural networks and Principal Component Analysis.
  • Experienced in SAS/BASE, SAS/MACRO, SAS/SQL and SAS/ODS in Windows and Unix environments.
  • Skilled in using SAS statistical procedures such as PROC REPORT, PROC TABULATE, PROC CORR, PROC ANOVA, PROC LOGISTIC, PROC FREQ, PROC MEANS and PROC UNIVARIATE.
  • Working knowledge of Anomaly Detection, Recommender Systems and Feature Creation; model validation using ROC plots and K-fold cross-validation.
  • Professional working experience of using programming languages and tools such as Python, Hive, Spark, Java, PHP and PL/SQL.
  • Hands-on experience with the ELK (Elasticsearch, Logstash, and Kibana) stack.
  • Working experience in statistical analysis using R, SAS (STAT, macros, EM), SPSS, Matlab and Excel.
  • Hands-on experience with Data Science libraries in Python such as Pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn, Beautiful Soup, Orange, Rpy2, LibSVM, neurolab and NLTK.
  • Familiar with R packages such as ggplot2, caret, dplyr, tidyr, wordcloud, stringr, e1071, MASS, rjson, plyr, FactoMineR and MDP.
  • Working knowledge of NLP-based deep learning models in Python 3.
  • Working experience in RDBMS such as SQL Server 2012/2008 and Oracle 11g.
  • Extensive experience of Hadoop, Hive and NoSQL databases such as MongoDB, Cassandra and HBase.
  • Experience in data visualizations using Python, R, D3.js and Tableau 9.4/9.2.
  • Highly experienced in MS SQL Server with Business Intelligence in SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS).
  • Familiar with conducting GAP analysis, User Acceptance Testing (UAT), SWOT analysis, cost benefit analysis and ROI analysis.
  • Deep understanding of Software Development Life Cycle (SDLC) as well as Agile/Scrum methodology to accelerate Software Development iteration.
  • Experience with version control tool- Git.
  • Extensive experience in handling multiple tasks to meet deadlines and creating deliverables in fast-paced environments and interacting with business and end users.

TECHNICAL SKILLS:

Data Analysis/Visualization: Tableau 9.4/9.2, Matplotlib, D3.js, R Shiny

Hadoop Ecosystem: Hadoop 2.X, Spark 1.6+, Hive 2.1, HBase 1.0+, Scala 2.10.X

Languages: Python 2.7/3, R 3, PL/SQL, SAS 9.4, Hive, Pig, Java, PHP

Databases: MySQL 5.X, Oracle 11g, SQL Server 2012/2008, MongoDB 3.2, HBase 1.0+, Cassandra 3.0

Packages: Pandas, NumPy, scikit-learn, Beautiful Soup, ggplot2, caret, dplyr, tidyr, wordcloud, stringr, e1071, MASS, rjson, plyr, FactoMineR, Seaborn, Matplotlib, MDP, Orange, Rpy2, LibSVM, neurolab, NLTK

Machine Learning: LDA, Naive Bayes, Decision Trees, Regression Models, Neural Networks, SVM, XGBoost, Random Forests, Bagging, Gradient Boosting Machines, K-Means

Business Analysis: Requirements Engineering, Business Process Modeling & Improvement, Gap Analysis, Cause and Effect Analysis, UI Design, UML Modeling, User Acceptance Testing (UAT), RACI Chart, Financial Modeling

Documentation/Modeling Tools: MS Office 2010, MS Project, MS Visio, Rational Rose, Excel (Pivot Tables, Lookups), SharePoint, Rational Requisite Pro, MS Word, PowerPoint, Outlook

Scripting Language: UNIX Shell, HTML, XML, CSS, JSP, SQL, Markdown

Operating Systems: Linux, Ubuntu, Mac OS, CentOS, Windows

Version Control: Git, TFVC

PROFESSIONAL EXPERIENCE:

Confidential - San Jose, CA

Data Engineer

Responsibility:

  • Extracted data from a MemSQL database using Spark SQL.
  • Analyzed data trends using packages such as NumPy, Pandas and Matplotlib in Python.
  • Tuned and optimized parameters for the decay function using Matplotlib in Python.
  • Used Scala to repartition data based on profile.
  • Applied decay functions in Scala to produce the scores used to predict users' preferences.
  • Loaded the results into multiple destinations, including S3, Spark and MemSQL.
  • Deployed the project onto clusters.
  • Involved in creating reports in Tableau to display the results and monitor the health of the project.
  • Used Git for version control.

Environment: Python 3, Scala 2.10, Spark 2.0, AWS, Kafka Streaming, Janitor, Mesos, Tableau 9.4, MemSQL, Spark SQL, PL/SQL, Git
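The decay-based preference scoring above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Scala/Spark used on the project; the half-life value and event data are assumptions made for the example:

```python
import math
from datetime import datetime, timedelta

def decay_score(event_times, now, half_life_days=7.0):
    """Sum exponentially decayed event weights: a fresh event counts ~1.0,
    an event one half-life old counts 0.5, two half-lives old counts 0.25."""
    lam = math.log(2) / half_life_days
    return sum(math.exp(-lam * (now - t).days) for t in event_times)

# Hypothetical user activity: events today, 7 days ago and 14 days ago.
now = datetime(2024, 1, 15)
events = [now, now - timedelta(days=7), now - timedelta(days=14)]
score = decay_score(events, now)  # 1.0 + 0.5 + 0.25 = 1.75
```

Expressing the decay via a half-life keeps the parameter interpretable, which makes the tuning step (comparing decay curves in Matplotlib) straightforward.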

Confidential, New York, NY

Data Scientist/Data Engineer

Responsibility:

  • Implemented and supported strategic data sourcing to automate the population of annual, monthly, and quarterly CCAR reports in a Big Data platform.
  • Developed the statistical models designed to forecast market variables under stress scenarios within Financial Models using R.
  • Created queries using Scala, Hive, SAS (Proc SQL) and PL/SQL to load large amounts of data from MongoDB and SQL Server into HDFS to spot data trends.
  • Wrote Hive-QL to retrieve, query and process raw data.
  • Used Scala to perform data cleansing, transformation and filtering, such as identifying outliers, missing values and invalid values.
  • Utilized K-means clustering technique to classify unlabeled data.
  • Worked on data pattern recognition, data cleaning and data visualizations such as scatter plots, box plots and histograms to explore the data, using Matplotlib and Seaborn in Python, ggplot in R, and SAS.
  • Used LDA, PCA and Factor Analysis to perform dimensional reduction.
  • Modified and applied Machine Learning algorithms such as Neural Networks, SVM, Bagging, Gradient Boosting and K-Means using SAS, PySpark and MLlib to detect target customers.
  • Worked on customer segmentation based on the similarities of the customers using an unsupervised learning technique, cluster analysis.
  • Used Pandas, NumPy, SciPy, scikit-learn and NLTK in Python for scientific computing and data analysis.
  • Applied cross-validation to evaluate and compare the performance of different models; validated the machine learning classifiers using ROC curves and lift charts.
  • Configured Spark Streaming with Kafka to clean and aggregate real time data.
  • Involved in Text Analytics such as analyzing text, language syntax, structure and semantics.
  • Generated weekly and monthly reports and maintained and manipulated data using SAS macros, Tableau and D3.js.
  • Involved in using Sqoop to load historical data from SQL Server into HDFS.
  • Used Git for version control.

Environment: Python 3/2.7, R 3, SAS 9.4, HDFS, MongoDB 3.2, Elasticsearch, Hadoop, Hive, Linux, Spark, Scala, Kafka, Tableau 9.4, D3.js, SQL Server 2012, Spark SQL, PL/SQL, UML, Git
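The K-means step used above to classify unlabeled data boils down to two alternating moves: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. A minimal pure-Python sketch (the six sample points are invented for illustration; the project itself used SAS/PySpark/MLlib implementations):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute centroids; keep the old one if a cluster went empty.
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
sizes = sorted(len(c) for c in clusters)  # the two natural groups: [3, 3]
```

In practice one would use `sklearn.cluster.KMeans` or Spark MLlib's `KMeans`, which add smarter initialization (k-means++) and convergence checks; the sketch only shows the core loop.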

Confidential, Hartford, CT

Data Scientist/Data Engineer

Responsibility:

  • Conducted comprehensive analysis and evaluations of business needs; provided analytical support for policy; assessed financial, operational and reputational impacts and influenced decisions for different models.
  • Retrieved, manipulated, analyzed, aggregated and performed ETL through billions of records of claim data from databases like RDBMS and Hadoop cluster using SAS (Proc SQL), PL/SQL, Scala, Sqoop and Flume.
  • Used Matplotlib and Seaborn in Python to visualize the data and performed feature engineering such as detecting outliers, handling missing values and interpreting variables.
  • Worked on transformation and dimensionality reduction of the dataset using PCA and Factor Analysis.
  • Developed, validated and executed machine learning algorithms including Naive Bayes, Decision trees, Regression models, SVM, XG Boost to identify different kinds of fraud and reporting tools that answer applied research and business questions for internal and external clients.
  • Implemented models such as Linear Regression, Lasso Regression, Ridge Regression, Elastic Net, Random Forest and Neural Networks to provide predictions that help reduce the fraud rate.
  • Experienced in using Pandas, Numpy, SciPy, Scikit-learn to develop various machine learning algorithms.
  • Used SAS, PySpark and MLlib to evaluate models with metrics such as F-score, precision and recall, and via A/B testing.
  • Fine-tuned the developed algorithms using regularization terms to avoid overfitting.
  • Configured Kafka with the Spark Streaming API to fetch near-real-time data from multiple sources such as web logs.
  • Analyzed real time data using Spark Streaming and Spark core with MLlib.
  • Used the final machine learning model to detect fraud in real-time data.
  • Extensively involved in data visualization using D3.js and Tableau.

Environment: Python 3/2.7, R 3, SAS 9.4, HBase 1.0+, Kafka, HDFS, Hadoop, Hive, Linux, Spark, Scala, Tableau 9.2, D3.js, SQL Server 2012, Excel, Spark SQL
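The precision/recall/F-score evaluation mentioned above is easy to state concretely. This is a generic sketch of the metrics themselves, not the project's SAS/PySpark code, and the example labels are invented:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged cases that were fraud
    recall = tp / (tp + fn) if tp + fn else 0.0      # frauds that were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]  # hypothetical ground-truth fraud flags
y_pred = [1, 0, 0, 1, 1]  # hypothetical classifier output
p, r, f1 = precision_recall_f1(y_true, y_pred)  # all three equal 2/3 here
```

In fraud detection, recall matters because a missed fraud is expensive while precision matters because a false alarm annoys a legitimate customer, which is why both are tracked alongside the combined F-score.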

Confidential

Data Scientist/Data Engineer

Responsibility:

  • Created new features based on information from millions of transaction records and trained models using Machine Learning techniques such as Gradient Boosting Trees and Deep Learning.
  • Analyzed and determined a cutoff point for accepting/declining transactions to minimize fraud losses and improve customer experience, using machine learning algorithms such as Logistic Regression, Classification, Random Forests and Clustering in SAS, R and Python.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn and NLTK in Python for implementing various machine learning algorithms.
  • Used SAS, SQL, Oracle, Teradata and MS Office analysis tools to complete analysis requirements; created SAS data sets by extracting data from Oracle databases and flat files.
  • Used Proc SQL, Proc Import and SAS Data Steps to clean, validate and manipulate data.
  • Updated data weekly and monthly; maintained and manipulated the data for database management; used SAS Macros and Excel Macros for the monthly production.
  • Wrote SQL queries to retrieve and validate data and prepared the data mapping document.
  • Experienced in creating dashboards using SSAS and building matrix and tabular reports using Reporting Services.
  • Worked on RDBMS like MySQL and NoSQL databases like MongoDB.
  • Used the Agile Scrum methodology to build the different phases of the software development life cycle.

Environment: SAS 9.4, Base SAS, SAS/Macros, SAS/Graph, SAS/Access, SAS/STAT, SAS/ODS, SAS/SQL, SAS/ETL, SAS Enterprise Miner, Python, PL/SQL, Oracle 9i, Hadoop, MongoDB
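The accept/decline cutoff described above can be framed as a cost-minimization sweep over candidate thresholds. A minimal sketch with invented scores, labels and cost values (the actual project used SAS/R/Python models and business-specific costs):

```python
def best_cutoff(scores, labels, fraud_cost=100.0, decline_cost=1.0):
    """Sweep candidate score thresholds: transactions scoring at or above the
    cutoff are declined. Declining a legitimate customer costs decline_cost
    (lost customer experience); accepting a fraud costs fraud_cost."""
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)) + [max(scores) + 1.0]:  # +1.0 = "accept all"
        cost = sum(
            (decline_cost if y == 0 else 0.0) if s >= t
            else (fraud_cost if y == 1 else 0.0)
            for s, y in zip(scores, labels)
        )
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

scores = [0.9, 0.8, 0.2, 0.1]  # hypothetical model fraud scores
labels = [1, 0, 0, 1]          # 1 = confirmed fraud
cutoff, cost = best_cutoff(scores, labels)
```

With fraud 100x more costly than a declined legitimate customer, the sweep here declines everything (cutoff 0.1, total cost 2.0); the interesting behavior emerges once the cost ratio and class balance reflect real data.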

Confidential

Data Analyst

Responsibility:

  • Analyzed online user behavior, conversion data and customer journeys; performed funnel analysis and multi-channel attribution.
  • Worked on business forecasting, segmentation analysis and data mining.
  • Involved in the development of a Data Warehouse for personal lines property and casualty insurance.
  • Generated graphs and reports using ggplot in RStudio for analyzing models.
  • Developed and implemented R Shiny applications for business forecasting.
  • Developed predictive models using Decision Tree, Random Forest and Naïve Bayes.
  • Used available data sources to deep dive and troubleshoot campaign performance issues.

Environment: MySQL 5.5, R 3, caret, Shiny

Confidential

Data Analyst

Responsibilities:

  • Provided analytical support for the Claims, Ancillary, and Medical Management teams.
  • Performed Data Mapping and Logical Data Modeling; Created class diagrams, ER diagrams
  • Cleaned data by analyzing and eliminating duplicate and inaccurate data using PROC FREQ, PROC MEANS, PROC UNIVARIATE, PROC RANK and macros in SAS; used SQL queries to filter data.
  • Converted various SQL statements into stored procedures, thereby reducing the number of database accesses.
  • Worked with Quality Control Teams to develop Test Plan and Test Cases.
  • Designed and implemented basic SQL queries for Data Report and Data Validation.
  • Developed user manuals and provided orientation and training to end users for all modified and new systems.

Environment: Base SAS, SAS/Access, SAS/Stat, SAS/Graph, SAS/SQL, SAS/ODS, SAS DI Studio, SAS/Macros, MS Excel, MS Word, PowerPoint, Oracle 9i, DB2, UNIX, SAS Enterprise Miner, SAS EBI (Enterprise Business Intelligence) 9.4
