
Senior Data Scientist/Big Data Engineer Resume


Philadelphia, PA

SUMMARY

  • Extensive experience as a Data Scientist/Data Engineer and SAS Developer with excellent Statistical Analysis, Data Mining and Machine Learning skills.
  • Worked in the Telecommunications, Financial Services, Healthcare and Retail domains.
  • Sound financial domain knowledge of Fixed Income, Bonds, Equities, Trade Cycle, Derivatives (Options and Futures), Portfolio Management, Sales and Marketing, CCAR and risk management.
  • Expertise in managing the full life cycle of Data Science projects, including transforming business requirements into Data Collection, Data Cleaning, Data Preparation, Data Validation, Data Mining and Data Visualization from structured and unstructured data sources.
  • Hands-on experience in data conversions from various data sources (flat files, MS Excel files, text files, XML files, etc.) using SAS on the Linux platform, and in testing for accuracy.
  • Hands-on experience in writing queries in SQL and Talend Open Studio to extract, transform and load (ETL) data from large datasets.
  • Proven ability in using Text Analytics techniques such as Topic Modelling and Sentiment Analysis.
  • Hands-on experience with statistical modeling techniques such as linear regression, Lasso regression, logistic regression, elastic net, ANOVA, factor analysis, clustering analysis, principal component analysis and Bayesian inference.
  • Hands-on experience in writing User Defined Functions (UDFs) in Scala to extend functionality for data preprocessing.
  • Professional working experience with Machine Learning algorithms such as LDA, linear regression, logistic regression, GLM, SVM, Naive Bayes, Random Forests, Decision Trees, clustering, neural networks and Principal Component Analysis.
  • Experienced in SAS/BASE, SAS/MACRO, SAS/SQL and SAS/ODS in Windows and Unix environments.
  • Skilled in using SAS statistical procedures such as PROC REPORT, PROC TABULATE, PROC CORR, PROC ANOVA, PROC LOGISTIC, PROC FREQ, PROC MEANS and PROC UNIVARIATE.
  • Working knowledge of Anomaly Detection, Recommender Systems and Feature Creation, with model validation using ROC plots and K-fold cross-validation (a brief sketch follows this list).
  • Professional working experience of using programming languages and tools such as Python, Hive, Spark, Java, PHP and PL/SQL.
  • Hands on experience in Google Analytics and ELK (Elasticsearch, Logstash, and Kibana) stack.
  • Working experience of statistical analysis using Python, SAP Predictive Analytics, R, SAS (STAT, Macros, EM), SPSS, MATLAB and Excel.
  • Hands-on experience with Data Science libraries in Python such as Pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn, Beautiful Soup, Orange, Rpy2, LibSVM, neurolab, NLTK and TensorFlow.
  • Familiar with packages in RStudio such as ggplot2, caret, dplyr, tidyr, wordcloud, stringr, e1071, MASS, rjson, plyr, FactoMineR and MDP.
  • Working knowledge of NLP based deep learning models in Python 3.
  • Working experience in RDBMS such as SQL Server 2012/2008 and Oracle 11g.
  • Extensive experience with Hadoop, multi-tenancy, Hive and NoSQL databases such as MongoDB, Cassandra and HBase.
  • Experience in data visualization using Python, R, D3.js and Tableau 9.4/9.2.
  • Highly experienced in MS SQL Server with Business Intelligence in SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), and SQL Server Reporting Services (SSRS).
  • Familiar with conducting GAP analysis, User Acceptance Testing (UAT), SWOT analysis, cost-benefit analysis and ROI analysis.
  • Deep understanding of Software Development Life Cycle (SDLC) as well as Agile/Scrum methodology to accelerate Software Development iteration.
  • Experience with the version control tool Git.
  • Extensive experience in handling multiple tasks to meet deadlines and creating deliverables in fast-paced environments and interacting with business and end users.
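The following is a minimal, self-contained sketch of the validation approach referenced above (K-fold cross-validation plus an ROC/AUC check) using scikit-learn. The synthetic dataset and the logistic-regression model are illustrative assumptions, not details of any engagement listed below.

```python
# Minimal sketch: K-fold cross-validation with ROC/AUC evaluation (scikit-learn).
# The synthetic data and logistic-regression model are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5-fold cross-validated AUC gives a robust estimate of out-of-sample performance.
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Mean CV AUC:", auc_scores.mean())

# ROC curve on a held-out split (the fpr/tpr arrays would feed an ROC plot).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
print("Hold-out AUC:", roc_auc_score(y_test, probs))
```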

TECHNICAL SKILLS

Hadoop Ecosystem: Hadoop 2.X, Spark 1.6+, Hive 2.1, HBase 1.0+, Scala 2.10.X

Data Analysis/Visualization: Tableau 9.4/9.2, Matplotlib, D3.js, RShiny

Languages: Python 2.7/3, R 3, PL/SQL, SAS 9.4, Hive, Pig, Java

Databases: MySQL 5.X, Oracle 11g, SQL Server 2012, MongoDB 3.2, HBase 1.0+, Cassandra 3.0

Packages: Pandas, NumPy, Scikit-learn, Beautiful Soup, ggplot2, caret, dplyr, tidyr, wordcloud, stringr, e1071, MASS, rjson, plyr, FactoMineR, Seaborn, Matplotlib, MDP, Orange, Rpy2, LibSVM, neurolab, NLTK, TensorFlow

Machine Learning: LDA, Naive Bayes, Decision Trees, Regression Models, Neural Networks, SVM, XGBoost, Random Forests, Bagging, Gradient Boosting Machines, K-means

Business Analysis: Requirements Engineering, Business Process Modeling & Improvement, Gap Analysis, Cause and Effect Analysis, UI Design, UML Modeling, User Acceptance Testing (UAT), RACI Chart, Financial Modeling

Documentation/Modeling Tools: MS Office 2010, MS Project, MS Visio, Rational Rose, Excel (Pivot Tables, Lookups), SharePoint, Rational Requisite Pro, MS Word, PowerPoint, Outlook

Scripting Languages: UNIX Shell, HTML, XML, CSS, JSP, SQL, Markdown

Operating Systems: Linux, Ubuntu, Mac OS, CentOS, Windows

Version Control: Git, TFVC

PROFESSIONAL EXPERIENCE

Confidential, Philadelphia, PA

Senior Data Scientist/Big Data Engineer

Responsibilities:

  • Build Spark jobs using Scala and Python to process terabytes of data and develop new ETL pipelines on a Cloudera Hadoop cluster.
  • Implement best practices associated with Big Data and value-driven development, ensuring quality from concept to production.
  • Implement data science and processing logic with Spark SQL, the DataFrame/Dataset/RDD APIs, and the Scala and Python languages.
  • Import data from Teradata, Oracle and MySQL and ingest it into HDFS using Sqoop.
  • Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization to deliver data science solutions.
  • Implemented a Python-based distributed random forest via PySpark (a sketch follows this list).
  • Used Pandas, NumPy, Seaborn, Matplotlib and scikit-learn in Python to develop various machine learning models, utilizing algorithms such as Linear Regression, Logistic Regression, Gradient Boosting, SVM and KNN.
  • Analyzed and visualized different user segments with Kibana to better understand customer behavior.
  • Performed data analysis using the NumPy and Pandas packages in Python and used the Seaborn package to get insights from the data.
  • Implemented various machine learning algorithms in Python, such as Linear Regression, Logistic Regression, Decision Trees, Naive Bayes, K-Nearest Neighbors, Random Forests, Support Vector Machines (SVM) and K-means clustering.
  • Used Python (NumPy, SciPy, Pandas, scikit-learn, Seaborn, NLTK) and Spark (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Used stepwise selection, cross-validation and confusion matrix results to identify the best performing model.
  • Used PCA to reduce the dimensionality of the data to accelerate the training process.
  • Deployed feature engineering techniques to generate new features and select features with the scikit-learn library in Python.
  • Used cross-validation to train the models with different batches of data to optimize the models and prevent overfitting (a combined sketch of these steps follows the Environment line below).
  • Migrated MapReduce jobs into RDDs (Resilient Distributed Datasets) or DataFrames and created Spark jobs for better performance.
  • Evaluated performance of the model through different methods, such as Accuracy, Precision, Recall and F1 score.
  • Operationalized and performance-tuned Spark jobs on large volumes of data using the Hadoop cluster and containers/executors.
  • Developed scripts to run Klondike project automation and generate the output response, using UC4 for scheduling.
  • Wrote Spark projects and Scala scripts performing Klondike data quality checks on both monthly and daily data across the entire Klondike pipelines.
  • Monitored the performance of Klondike spark-submit jobs in the Resource Manager, wrote unit test cases, and performed exploratory analysis and data debugging in the Spark shell.
  • Defined data development tactics and requirements for ad hoc data analysis.
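A minimal sketch of the PySpark distributed random forest and the accuracy/precision/recall/F1 evaluation mentioned above; the HDFS input path and the "label"/feature column names are illustrative assumptions, not actual project details.

```python
# Sketch: distributed random forest in PySpark with multi-metric evaluation.
# The input path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///data/training")  # hypothetical path

# Assemble all non-label columns into a single feature vector (assumes numeric columns).
feature_cols = [c for c in df.columns if c != "label"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)
predictions = model.transform(test)

# Evaluate with the metrics cited above: accuracy, precision, recall, F1.
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName=metric)
    print(metric, evaluator.evaluate(predictions))
```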

Environment: PySpark, Scala 2.1, Spark 2.4, AWS, Kibana, Spark SQL, Mesos, PL/SQL, HBase, Elasticsearch, DevOps
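The PCA dimensionality reduction, scikit-learn feature selection and cross-validation steps referenced in the bullets above can be sketched as follows; the synthetic data, the selected feature count and the component count are illustrative assumptions.

```python
# Sketch: feature selection + PCA in a pipeline, scored with cross-validation
# to guard against overfitting. All parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=100, n_informative=15, random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),  # keep the 50 strongest features
    ("pca", PCA(n_components=20)),                         # compress to 20 components
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation compares configurations without leaking test data.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```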

Confidential, San Jose, CA

Data Engineer

Responsibilities:

  • Extracted data from the MemSQL database using Spark SQL.
  • Worked on real time data using Spark Streaming and Spark core.
  • Created and maintained Data pipelines and message queuing in the big data platform.
  • Analyzed data trends using packages like Numpy, Pandas and Matplotlib in Python.
  • Figured out and optimized parameters for the decay function using ggplot2 in RStudio.
  • Used Scala to repartition data based on profile.
  • Applied decay functions in Scala to produce scores predicting users' preferences (a sketch follows this list).
  • Loaded the results into multiple data stores, including S3, Spark and MemSQL.
  • Deployed the project onto clusters.
  • Performed real-time simulation of the machine learning model on GPUs.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Restricted data for particular users using Row level security and User filters.
  • Designed, created and deployed interactive Tableau visualization reports to display results and monitor the health of the project using Tableau Server/Desktop.
  • Developed Tableau workbooks from multiple data sources using Data Blending.
  • Used Git for version control.
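A minimal sketch of the time-decay preference scoring described above, written in PySpark for consistency with the other examples (the project itself used Scala for this step). The column names, the 30-day half-life and the S3 paths are illustrative assumptions.

```python
# Sketch: exponential time-decay preference score in PySpark; recent interactions
# weigh more. Columns, half-life and paths are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("decay-score-sketch").getOrCreate()
events = spark.read.parquet("s3://bucket/events")  # hypothetical source

half_life_days = 30.0
now_ts = F.unix_timestamp(F.current_timestamp())
age_days = (now_ts - F.unix_timestamp("event_ts")) / 86400.0

# Exponential decay: each interaction's weight halves every `half_life_days`.
decayed = events.withColumn(
    "decayed_weight",
    F.col("weight") * F.pow(F.lit(0.5), age_days / half_life_days),
)

scores = (decayed.groupBy("user_id", "item_id")
                 .agg(F.sum("decayed_weight").alias("preference_score")))
scores.write.mode("overwrite").parquet("s3://bucket/preference_scores")  # hypothetical sink
```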

Environment: Python 3, RStudio, SAS, Scala 2.1, Spark 2.0, AWS, Kafka Streaming, Janitor, Mesos, Tableau 9.4, MemSQL, Spark SQL, PL/SQL, HBase, MongoDB, Git, Elasticsearch, DevOps

Confidential, Piscataway, NJ

Data Scientist/ Data Engineer

Responsibilities:

  • Optimized Data pipelines in SAP and Big Data platform.
  • Developed the statistical models designed to forecast market variables under stress scenarios within Financial Models using RStudio.
  • Analyzed Google Analytics data, presented findings to the Product Manager and discussed changes to the digital marketing strategy.
  • Used SAS, SQL, Oracle, Teradata and MS Office analysis tools to complete analysis requirements; created SAS data sets by extracting data from the Oracle database and flat files.
  • Used PROC SQL, PROC IMPORT and the SAS DATA step to clean, validate and manipulate data with SAS and SQL.
  • Updated data weekly and monthly and maintained and manipulated it for database management; used SAS macros and Excel macros for the monthly production.
  • Created queries in Scala, Hive, SAS (PROC SQL) and PL/SQL to load large amounts of data from MongoDB and SQL Server into HDFS to spot data trends.
  • Used Talend Open Studio to retrieve, query and process raw data.
  • Used Scala to perform data cleansing, transformation and filtering, such as identifying outliers, missing values and invalid values.
  • Utilized the K-means clustering technique to classify unlabeled data (a clustering sketch follows this list).
  • Worked on data pattern recognition, data cleaning as well as data visualizations such as Scatter Plot, Box Plot and Histogram Plot to explore the data using packages Matplotlib, Seaborn in Python, ggplot in R and SAS.
  • Used LDA, PCA and Factor Analysis to perform dimensional reduction.
  • Modified and applied Machine Learning algorithm such as Neural Networks, SVM, Bagging, Gradient Boosting, K-Means using SAP Predictive Analytics, SAS, PySpark and MLlib to detect target customers.
  • Worked on customer segmentation based on similarities among customers using an unsupervised learning technique, cluster analysis.
  • Used Pandas, Numpy, Scipy, Scikit-learn, NLTK in Python for scientific computing and data analysis.
  • Applied cross validation to evaluate and compare the performance among different models. Validated the machine learning classifiers using ROC Curves and Lift Charts.
  • Configured GPU/Spark Streaming with Kafka to clean and aggregate real-time data (a streaming sketch follows the Environment line below).
  • Involved in Text Analytics such as analyzing text, language syntax, structure and semantics.
  • Generated weekly and monthly reports and maintained, manipulated data using SAS macro, Tableau and D3.js.
  • Involved in using Sqoop to load historical data from SQL Server into HDFS.
  • Used Git for version control.
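A minimal sketch of the K-means customer segmentation referenced above, using scikit-learn; the input file, the RFM-style feature columns and the choice of five clusters are illustrative assumptions.

```python
# Sketch: K-means customer segmentation; file name, features and k are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

customers = pd.read_csv("customers.csv")                       # hypothetical input
features = customers[["recency", "frequency", "monetary"]]     # assumed RFM-style columns

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X)

# Silhouette score is one quick check of cluster separation; segment means help
# characterize each customer group.
print("Silhouette:", silhouette_score(X, customers["segment"]))
print(customers.groupby("segment")[["recency", "frequency", "monetary"]].mean())
```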

Environment: Python 3/2.7, RStudio, TensorFlow, SAP Predictive Analytics, R 3, SAS 9.4, Google Analytics, HDFS, MongoDB 3.2, Elasticsearch, Hadoop, Hive, Linux, Spark, Scala, Kafka, Tableau 9.4, D3.js, SQL Server 2012, Spark SQL, PL/SQL, UML, Git
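The Kafka-to-Spark cleaning and aggregation mentioned above can be sketched with Spark Structured Streaming as below; the broker address, topic name and JSON schema are illustrative assumptions (the project may have used the older DStream API instead).

```python
# Sketch: consume events from Kafka, clean them, and aggregate per-minute stats.
# Broker, topic and schema are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "events")                       # hypothetical topic
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(F.from_json("json", schema).alias("e"))
             .select("e.*")
             .filter(F.col("amount").isNotNull()))          # basic cleaning

counts = (events.withWatermark("event_time", "10 minutes")
                .groupBy(F.window("event_time", "1 minute"))
                .agg(F.count("*").alias("events"), F.sum("amount").alias("total")))

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```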

Confidential

Data Researcher

Responsibilities:

  • Conducted comprehensive analysis and evaluations of business needs; provided analytical support for policy; assessed financial, operational and reputational impacts and influenced decisions for different models.
  • Retrieved, manipulated, analyzed, aggregated and performed ETL on billions of records of claim data from databases such as RDBMSs and the Hadoop cluster using SAS (PROC SQL), PL/SQL, Scala, Sqoop and Flume.
  • Used Matplotlib and Seaborn in Python to visualize the data and performed feature engineering such as detecting outliers, handling missing values and interpreting variables.
  • Worked on transformation and dimension reduction of the dataset using PCA and Factor Analysis.
  • Developed, validated and executed machine learning algorithms including Naive Bayes, Decision Trees, Regression models, SVM and XGBoost to identify different kinds of fraud, along with reporting tools that answer applied research and business questions for internal and external clients (a sketch follows this list).
  • Implemented models such as Linear Regression, Lasso Regression, Ridge Regression, Elastic Net, Random Forest and Neural Network to provide predictions that help reduce the rate of fraud.
  • Experienced in using Pandas, Numpy, SciPy, Scikit-learn to develop various machine learning algorithms.
  • Used SAS, PySpark and MLlib to evaluate models with measures such as F-score, Precision and Recall, as well as A/B testing.
  • Fine-tuned the developed algorithms using a regularization term to avoid overfitting.
  • Configured Kafka with the Spark Streaming API to fetch near-real-time data from multiple sources such as web logs.
  • Analyzed real time data using Spark Streaming and Spark core with MLlib.
  • Used the final machine learning model to detect fraud in real-time data.
  • Extensively involved in data visualization using D3.js and Tableau.
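A minimal sketch of a regularized fraud classifier evaluated with precision, recall and F-score, as referenced in the bullets above. The synthetic imbalanced data and the chosen penalty strength are illustrative assumptions; the project also used tree ensembles such as XGBoost.

```python
# Sketch: fraud classifier with an explicit regularization term, scored with
# precision/recall/F1. Data and hyperparameters are hypothetical.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Imbalanced classes mimic the rarity of fraud.
X, y = make_classification(n_samples=10000, n_features=30, weights=[0.97, 0.03], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# L2 regularization (C controls its strength) helps avoid overfitting;
# class_weight="balanced" compensates for the skewed labels.
clf = LogisticRegression(penalty="l2", C=0.5, class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("F1:       ", f1_score(y_test, pred))
```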

Environment: Python 3/2.7, R3, SAS 9.4, HBase 1.0+, Kafka, HDFS, Hadoop, Hive, Linux, Spark, Scala, Tableau 9.2, D3.js, SQL Server 2012, Excel, Spark SQL
