
Data Scientist (Python/Machine Learning/NLP) Resume


Boston, MA

SUMMARY

  • Data Scientist with 6 years of experience in Statistical Modeling, Data Mining, Time Series Forecasting, Data Visualization, Machine Learning, and Applied Bayesian Statistics.
  • Familiar with the entire data science project life cycle, including Data Acquisition, Data Cleansing, Data Manipulation, Feature Engineering, Modeling, Evaluation, Optimization, Testing, and Deployment.
  • Expertise in Machine Learning algorithms and Predictive Modeling, including Regression Models, Decision Trees, Random Forests, Naïve Bayes Classifier, Bootstrap, K-Means Clustering, AdaBoost, ROC/AUC, Support Vector Machines (SVM), and Principal Component Analysis (PCA).
  • Experienced in Natural Language Processing (NLP), including Latent Dirichlet Allocation (LDA), Guided LDA, Labeled LDA, Lemmatization, Sentiment Analysis, tf-idf, and Word2Vec.
  • Proficient in various statistical methodologies such as Hypothesis Testing, ANOVA, the EM Algorithm, Maximum Likelihood Estimation, Point Estimation, Interval Estimation, and Bayesian Methods.
  • Knowledge of developing Time Series Forecasting models such as Generalized Autoregressive Conditional Heteroskedasticity (GARCH), Autoregressive Conditional Heteroskedasticity (ARCH), Autoregressive Moving Average (ARMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and the Holt-Winters procedure.
  • Solid experience in Deep Learning techniques, including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Generative Adversarial Networks (GAN).
  • Proficient in model validation and optimization through model selection, parameter tuning, and K-fold cross-validation (see the sketch after this list).
  • Proficient in Python 3.x, including NumPy, Pandas, scikit-learn, NLTK, Gensim, Matplotlib, and Seaborn.
  • Manipulated Big Data in the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce, HiveQL, SparkSQL, and PySpark.
  • Extensive experience with RDBMS such as SQL Server and Oracle, designing Data Warehouse schemas, and developing complex SQL queries.
  • Experienced in building ETL packages and business reports in SQL Server Integration Services (SSIS) and scheduling ETL packages in SQL Server Job Agent.
  • Proficient in data visualization tools such as Tableau, Matplotlib, and Seaborn.
  • Experienced with ticketing systems such as JIRA/Confluence and version control tools such as GitHub.
  • Great passion for data manipulation and for learning cutting-edge theories and algorithms in Machine Learning and Artificial Intelligence.
  • Strong business sense and ability to communicate data insights to both technical and non-technical clients.
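
A minimal sketch of the K-fold cross-validated model selection and parameter tuning noted above, using scikit-learn; the dataset and parameter grid are hypothetical placeholders rather than project code:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Hypothetical dataset standing in for real project data.
    X, y = make_classification(n_samples=500, n_features=20, random_state=42)

    # Grid search with 5-fold cross-validation: model selection and
    # parameter tuning in a single pass, scored by ROC AUC.
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [5, None]},
        cv=5,
        scoring="roc_auc",
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)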

TECHNICAL SKILLS

Statistical Methods: Hypothesis Testing, Confidence Intervals, Principal Component Analysis (PCA), Cross-Validation, Auto-correlation, EM Algorithm, Maximum Likelihood Estimation, Hessian Matrix, Gibbs Sampler, Bayesian Methods

Machine Learning: Regression Analysis, Decision Trees, Random Forests, ROC/AUC, Support Vector Machines (SVM), Naïve Bayes, Bootstrap, Deep Learning, K-Means Clustering, KNN, AdaBoost, NLP

Programming Languages: Python (2.x/3.x), R, SAS, MATLAB, SQL, PySpark

Time Series Analysis: ARMA, SARIMA, ARCH, GARCH, Holt-Winters Procedure

Databases: Microsoft SQL Server (2012/2017), Oracle SQL Developer, PostgreSQL

Data Visualization: Tableau 9.x/10.x/2018.x/2019.x, Matplotlib, Seaborn, ggplot2, Plotly, Power BI

Cloud Services: Amazon Web Services (AWS) EC2/S3/Redshift, Microsoft Azure

Project Management Tools: GitHub, Google Shared Services, Visual Studio 2017

Operating Systems: Microsoft Windows, Linux (Ubuntu)

PROFESSIONAL EXPERIENCE

Confidential, Boston, MA

Data Scientist (Python/Machine Learning/NLP)

Responsibilities:

  • Used Machine Learning, Time Series analysis, and Natural Language Processing (NLP) to build an intelligent supply chain management system in Python.
  • Optimized the flow between distribution centers, manufacturers, suppliers, and customer demand to reduce cost and fulfill orders DIFOT (delivery in full, on time).
  • Combined time series models, machine learning methods, and business logic to predict DC/MFG inventory, shipment times, and recommended suppliers (see the forecasting sketch after this list).
  • Allocated inventory of high-demand and regular products by prioritizing customers into different groups.
  • Recommended production strategies by combining time series models and business rules to reduce storage cost and maximize profit.
  • Implemented an early warning system that prompts users to take early action on PO/STO order delay issues based on the predictions of the machine learning models.
  • Used NLP to improve the flexibility of the supply chain management system by leveraging historical user search data and adding voice search to the search interface.
  • Collaborated with the engineering team to productionize the system and supported its intelligent modules.
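
A minimal sketch of the kind of seasonal time series forecast referenced above for inventory prediction; the weekly demand series is simulated, since the actual models and data are confidential:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Simulated weekly demand with a trend and a yearly (52-week) cycle.
    rng = np.random.default_rng(1)
    weeks = pd.date_range("2021-01-03", periods=156, freq="W")
    demand = pd.Series(
        100 + 0.3 * np.arange(156)
        + 10 * np.sin(2 * np.pi * np.arange(156) / 52)
        + rng.normal(0, 3, 156),
        index=weeks,
    )

    # Seasonal ARIMA model with a 52-week seasonal period.
    result = SARIMAX(demand, order=(1, 1, 1), seasonal_order=(1, 1, 0, 52)).fit(disp=False)

    # Forecast the next 4 weeks of demand for inventory planning.
    print(result.forecast(steps=4))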

Environment: Machine Learning (Random Forest, Logistic Regression, XGBoost), Natural Language Processing (LDA, tf-idf), Python 3.x (Numpy/Pandas/Pickle/Sklearn/Plotly/Json), MongoDB, Spark

Confidential, Farmington, CT

Data Scientist (NLP/Python/Tableau)

Responsibilities:

  • Used Natural Language Processing (NLP) to build a case management system that helps the HR department categorize cases and track popular topics among employees to achieve service excellence.
  • Worked with engineers to extract millions of case records from internal service sources such as email, chat dialogs, and online support records.
  • Conducted exploratory spatio-temporal data analysis in Python to understand the underlying distribution of cases.
  • Completed data preprocessing by handling slang, removing chat jargon and automated text, performing lemmatization, setting up customized stop words, and generating bi-gram features to build the corpus.
  • Generated tf-idf values and fed them into a Naïve Bayes model to extract important features from the labeled data (see the sketch after this list).
  • Introduced the Guided LDA method to perform topic modeling using the important features of each label and seed words picked by HR specialists.
  • Used bubble plots and metrics to validate results with the HR team and helped them name the topics.
  • Visualized weekly/monthly topic popularity trends and topic details using line and scatter charts in Tableau.
  • Collaborated with data engineers to automate the project pipeline and delivered final results to stakeholders in Tableau.
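
A minimal sketch of the tf-idf plus Naïve Bayes step referenced above, using scikit-learn; the sample cases and labels are hypothetical stand-ins for the confidential HR data:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical labeled cases standing in for the real service records.
    texts = ["payroll deposit missing", "laptop will not boot", "benefits enrollment question"]
    labels = ["payroll", "it", "benefits"]

    # Bi-gram tf-idf features, as in the preprocessing step above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    X = vectorizer.fit_transform(texts)

    clf = MultinomialNB().fit(X, labels)

    # The top-weighted terms per class approximate the "important features" of each label.
    terms = np.array(vectorizer.get_feature_names_out())
    for idx, label in enumerate(clf.classes_):
        print(label, terms[np.argsort(clf.feature_log_prob_[idx])[::-1][:3]])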

Environment: Natural Language Processing (Guided LDA, tf-idf), Machine Learning (Random Forest, Naïve Bayes, AdaBoost), Python 3.x (NLTK/Gensim/Numpy/Pandas/re/Pickle/Sklearn/Guidedlda), SQL Server 2017, SQL Server Management Studio (SSMS), T-SQL, Spark, Tableau (Desktop 2018.x/Server 2018.x)

Confidential, Boston, MA

Data Scientist (NLP/Python/Tableau)

Responsibilities:

  • Built an internal job matching system using Natural Language Processing (NLP) to help HR specialists find the top N candidates for a given open position and, conversely, recommend the top N matching open positions to a candidate.
  • Communicated with the HR department to understand the problem and document their requirements and expectations for the job matching system.
  • Collaborated with engineers to understand the data structure and design an ETL package on Azure Databricks that automated extraction of different file types, such as PDF, image, and Excel files, from different sources.
  • Completed resume preprocessing by converting the different file types into text files and removing empty or undersized files.
  • Extracted skillsets from job descriptions and resumes by finding nouns in the top 20% of tf-idf values.
  • Generated an expanded skillset that groups similar skills by training Word2Vec embeddings and retrieving the K closest skills with KNN (see the sketch after this list).
  • Utilized Elasticsearch to retrieve the top N records whose skillsets were closest to a given resume or job description, powering the job matching system.
  • Worked with the engineering team to demo the application and collected suggestions from the HR department to improve the system.
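
A minimal sketch of the skill-grouping step referenced above, using Gensim's Word2Vec; the tokenized skill lists are hypothetical examples:

    from gensim.models import Word2Vec

    # Hypothetical skill lists extracted from resumes and job descriptions.
    skill_sentences = [
        ["python", "pandas", "numpy", "sklearn"],
        ["python", "pyspark", "hive", "sql"],
        ["tableau", "powerbi", "sql", "excel"],
    ]

    # Train small embeddings over co-occurring skills.
    model = Word2Vec(skill_sentences, vector_size=50, window=5, min_count=1, epochs=50)

    # The K nearest skills in embedding space form one expanded skill group.
    print(model.wv.most_similar("python", topn=3))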

Environment: Natural Language Processing (tf-idf, Word2Vec), Elasticsearch API, Tableau, Spark, Azure Databricks, Python 3.x (PyPDF2/pytesseract/Numpy/Pandas/Gensim), SQL Server 2017, SQL Server Management Studio (SSMS), T-SQL

Confidential, Worcester, MA

Data Scientist (Python/Machine Learning/SQL)

Responsibilities:

  • Worked with the data science team to build a machine learning model in Python to predict 3-year relapse among recovered breast cancer patients.
  • Collaborated with oncologists to understand the background of the problem and to select both medical and non-medical features for more than 11,000 patients from 2008 to 2018.
  • Communicated with the data engineering team to understand the data schema before consuming data from different sources.
  • Preprocessed the data to handle the imbalanced dataset and missing values, and created dummy variables for categorical features.
  • Implemented dimensionality reduction methods such as PCA to reduce the number of features and avoid overfitting (see the sketch after this list).
  • Applied different classification models, including Random Forest, Logistic Regression, and Support Vector Machines, to predict patient relapse.
  • Evaluated the models and selected the best one using performance metrics such as accuracy, precision, recall, and AUC-ROC.
  • Developed a KPI dashboard for high-risk relapse patients using donut charts, line charts, crosstabs, bar charts, and density charts in Tableau.
  • Designed and created an ETL pipeline with PySpark on Azure Databricks to automate extracting data from the local database, transforming it, and loading it into the Data Warehouse.
  • Stored patient-level relapse prediction results in the information system database for use by medical specialists.
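
A minimal sketch of the PCA-plus-classifier pipeline referenced above, using scikit-learn; a synthetic imbalanced dataset stands in for the confidential patient data:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic imbalanced data: ~10% positive class, like a relapse outcome.
    X, y = make_classification(n_samples=1000, n_features=40, weights=[0.9, 0.1], random_state=0)

    # Scale, reduce dimensionality with PCA, then classify;
    # class_weight="balanced" offsets the class imbalance.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=10)),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])

    print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())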

Environment: Machine Learning (Logistic Regression, Random Forest, Naïve Bayes, AdaBoost), Python 3.x (NLTK/Gensim/Numpy/Pandas/Sklearn), SQL Server 2017, SQL Server Management Studio (SSMS), T-SQL, Spark

Confidential, Boston, MA

Data Scientist (R/MLE/Tableau)

Responsibilities:

  • Helped the company make better decisions by estimating the prevalence of a specific disease in a given area, applying Bayesian methods to sensitive-question response models.
  • Designed a reliable response survey by introducing unrelated questions and randomizing when potentially sensitive questions were shown to survey takers.
  • Conducted the survey across different groups of people, where each group had a different probability of being asked the sensitive questions.
  • Identified and validated source data per requirements, then built an ETL pipeline with PySpark on Azure Databricks to automate collecting responses from different data sources and counties and loading them into SQL Database in Azure.
  • Designed a Bayesian algorithm and workflow to estimate the true answers to the sensitive questions from the survey results (see the sketch after this list).
  • Implemented the EM algorithm to maximize the likelihood function in the presence of latent variables.
  • Applied the bootstrap method and the Hessian matrix of the likelihood to compute standard deviations and confidence intervals for the estimated parameters.
  • Improved the estimates with a Gibbs sampler and validated the outcome against the existing results.
  • Visualized respondent information, estimation results, and the disease distribution using big numbers, bar charts, donut charts, tables, and geographic maps in Tableau.
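
A minimal sketch of an unrelated-question randomized response estimator with a bootstrap confidence interval, written in Python here rather than the project's R; the design constants p and pi_u and the simulated responses are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)

    def estimate_prevalence(responses, p, pi_u):
        # lambda = p*pi + (1-p)*pi_u  =>  pi = (lambda - (1-p)*pi_u) / p
        return (np.mean(responses) - (1 - p) * pi_u) / p

    # Hypothetical design: with probability p the sensitive question is shown,
    # otherwise an unrelated question with known prevalence pi_u.
    p, pi_u, true_pi = 0.7, 0.5, 0.12
    responses = rng.random(2000) < (p * true_pi + (1 - p) * pi_u)

    point = estimate_prevalence(responses, p, pi_u)

    # Bootstrap the standard deviation and a 95% confidence interval.
    boot = [estimate_prevalence(rng.choice(responses, size=responses.size), p, pi_u)
            for _ in range(2000)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"pi_hat={point:.3f}, sd={np.std(boot):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")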

Environment: EM Algorithm, Maximum Likelihood Estimation, Latent Variables, Hessian Matrix, Point Estimation, Interval Estimation, Standard Deviation, Gibbs Sampler, Bootstrap, R, Tableau (Desktop 10.x/Server 10.x)

Confidential

Data Analyst (SQL/SSIS/Tableau)

Responsibilities:

  • Collaborated with the marketing team to understand business processes, gathered information and requirements from various departments, and converted them into documents and reports for data warehouse design.
  • Worked with different departments to finalize the data structure design of the Data Warehouse.
  • Created ETL packages, deployed them to SQL Server, and scheduled daily jobs in SQL Server Job Agent.
  • Developed and updated views, stored procedures, and User-Defined Functions that met business requirements using T-SQL.
  • Developed Tableau data visualizations using cross tabs, heat maps, scatter plots, geographic maps, pie charts, donut charts, and density charts.
  • Used big numbers in Tableau to show current week, month, and year figures along with the percentage difference from the previous period.
  • Communicated clearly and thoroughly with business analysts to support consumption of data services and deliverables.

Environment: SQL Server 2014, SQL Server Management Studio (SSMS), T-SQL, Visual Studio 2015, Excel, PowerPoint, Tableau 9.x
