ML Engineer/Big Data Engineer Resume
San Francisco, CA
SUMMARY
- Highly efficient Data Scientist/Data Analyst/ML Engineer with 5+ years of experience in Data Analysis, Machine Learning, Data Mining with large structured and unstructured data sets, Data Acquisition, Data Validation, Predictive Modeling, Data Visualization, and Web Scraping. Adept in statistical programming languages like R and Python, as well as Big Data technologies like Hadoop and Hive.
- Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
- Able to build advanced statistical and predictive models, such as generalized linear models, decision trees, neural network models, ensemble models, Support Vector Machines (SVM), and Random Forest.
- Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction, along with developing, deploying, and maintaining production NLP models at scale. Creative thinker who proposes innovative ways to look at problems using data mining approaches on the available information.
- Experience in working with relational databases (Teradata, Oracle) with advanced SQL programming skills.
- Experience in Big Data platforms like Hadoop (Cloudera, Hortonworks, and others) and graph databases.
- Experience in designing visualizations using Tableau software and publishing and presenting dashboards.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Skilled in performing data parsing, data manipulation, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Experience in using various packages in R and Python such as ggplot2, caret, dplyr, gmodels, twitteR, NLP, reshape2, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and Beautiful Soup.
- Extensive experience in Text Analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
- Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, PySpark, and Spark SQL.
- Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, Principal Component Analysis, and Boosting.
- Experience and technical proficiency in designing and data modeling for online applications; solution lead for architecting data warehouse/business intelligence applications.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables, and OLAP reporting.
- Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
- Worked on and extracted data from various database sources like Oracle, SQL Server, and DB2, regularly accessing the JIRA tool and other internal issue trackers for project development.
- Highly creative, innovative, committed, intellectually curious, and business savvy, with good communication and interpersonal skills.
- Expertise in implementing Time Series Models using RNNs, LSTMs, ARIMA, VARIMA, and SARIMA.
- Experience implementing deep learning algorithms such as Artificial Neural Networks (ANN) and Recurrent Neural Networks (RNN); tuned hyperparameters and improved models with Python packages such as TensorFlow.
- Extracted data from HDFS and prepared data for exploratory analysis using data munging.
- Implemented, tuned, and tested the model on AWS EC2 with the best algorithm and parameters.
- Extensively worked on Spark using PySpark and Python on clusters for computational (analytics) workloads, installed on top of Hadoop; performed advanced analytical applications by making use of Spark with Hive and SQL/Oracle, Azure Databricks, Azure Data Lake, and Azure Data Factory (a minimal PySpark-with-Hive sketch follows this summary).
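A minimal sketch of the Spark-with-Hive analysis pattern referenced above, assuming a cluster with Hive support; the database, table, and column names (sales_db.orders, region, amount, order_id) are hypothetical placeholders, not from the actual projects.

```python
# Minimal PySpark sketch: read a Hive table and run a simple aggregation.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-analytics")
    .enableHiveSupport()  # read tables registered in the Hive metastore
    .getOrCreate()
)

orders = spark.table("sales_db.orders")
summary = (
    orders.groupBy("region")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("order_id").alias("order_count"),
    )
)
summary.show()
```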
TECHNICAL SKILLS
Languages: C, C++, Java, Scala, Python 2.x/3.x, R/R Studio (packages: stats, zoo, Matrix, data.table, OpenSSL), SQL, Shell Scripting, XML, SAS, SAS Enterprise Guide, Maven, Spark 2.x/2.3, Spark SQL, Spark Streaming, Hadoop, MapReduce, HDFS, Eclipse, Anaconda, Jupyter Notebook
NoSQL Databases: Cassandra, HBase, MongoDB
Statistics: Hypothesis Testing, ANOVA, Confidence Intervals, Bayes' Law, MLE, Fisher Information, Principal Component Analysis (PCA), Cross-Validation, Correlation
BI Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP Business Objects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse, Azure Data Factory, SSIS
Algorithms: Logistic Regression, Linear Regression, Lasso Regression, Generalized Linear Models, Random Forest, XGBoost, KNN, SVM, Neural Networks, K-Means Clustering, Boxplots; Tools: SVN, PuTTY, WinSCP, Redmine (bug tracking, documentation, Scrum), Teradata, Tableau, GitHub
Data Analysis and Data Science: Deep Neural Networks, Logistic Regression, Decision Trees, Random Forests, KNN, XGBoost, Ensembles (Bagging, Boosting), Support Vector Machines, Neural Networks, graph/network analysis, time series analysis (ARIMA model), NLP
Big Data: Hadoop, HDFS, Hive, PuTTY, Spark, Scala, Sqoop, MongoDB, HBase
Reporting Tools: Tableau, Power BI, SSRS
Database Design Tools and Data Modeling: MS Visio, ERWIN 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimension tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball & Inmon Methodologies
PROFESSIONAL EXPERIENCE
Confidential, San Francisco, CA
ML Engineer/Big Data Engineer
Responsibilities:
- Performed data science processes such as data mining, data collection, data cleansing, and dataset preparation for machine learning models in an RDBMS using T-SQL.
- Used trend analysis, predictive modeling, machine learning, statistics, and other data analysis techniques to collect, explore, and identify the data needed to explain customer behavior and segmentation, text and big data analytics, product-level analysis, and customer experience analysis.
- Performed data cleaning, applying backward/forward filling methods on the dataset to handle missing values (see the pandas sketch after this list).
- Plan, develop, and apply leading-edge analytic and quantitative tools and modeling techniques to help clients gain insights and improve decision-making.
- Utilize Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, MLlib, R, Python, and a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
- Apply various machine learning algorithms and statistical modeling like decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and PySpark.
- Built and tested different Ensemble Models such as Bootstrap aggregating, Bagged Decision Trees and Random Forest, Gradient boosting, XGBoost, and AdaBoost to improve accuracy, reduce variance and bias, and improve stability of a model
- Worked in an agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure the project progressed satisfactorily. Also worked in a Kanban environment.
- Optimized performance using hyperparameter tuning, debugging, parameter fitting, and troubleshooting of models, and automated the processes (see the GridSearchCV sketch after this list).
- Developed reports, charts, tables, and other visual aids in support of findings to recommend business direction or outcomes.
- Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
- Optimized algorithms with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian Optimization.
- Worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
- Successfully read, cleaned, filtered, preprocessed, and subset data, removed outliers, applied models (Linear Regression, Random Forest Regression), and selected the most reasonable model on the basis of R-squared and accuracy using R and Python.
- Improved estimated customer delivery date (ECDD) accuracy up to 80% using AI/ML models, which increased customer satisfaction by 20% and reduced customer calls by 10%.
- Ingested batch files and tables from Supply Chain Warehouse databases (OMS, TMS, WMS, PKMS, YARDVIEW, STELLA, BOLD360), mostly Oracle and DB2, converted them into Parquet and Delta files, and transformed and loaded them into Azure Data Lake (ADL) by reading from Azure Databricks using Scala and PySpark, supplying each database's server, host, username, password, and JDBC driver JAR details (see the JDBC-to-Delta sketch after this list).
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources like Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write back.
- Designed and built ETL pipelines to automate ingestion of structured and unstructured data
- Successfully connected to different warehouse supply chain databases like OMS, TMS, WMS, PKMS, YARDVIEW, WCC from Hadoop, Azure Databricks and Azure Data Factory
- Experienced with the Hadoop ecosystem and Spark framework (YARN, HDFS, Spark, Scala, PySpark).
- Extensively involved in writing SQL queries (subqueries, nested queries, views, join conditions, removal of duplicates) in Impala/Hive, Oracle, and Spark SQL.
- Read a variety of databases from Azure Databricks over JDBC connections using Scala and PySpark and saved the data in ADL.
- Successfully connected to different data sources using SSH and SFTP from the Hadoop cluster and Azure Data Factory.
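A minimal sketch of the backward/forward filling mentioned above, using pandas; the toy DataFrame is an illustrative stand-in for the real dataset.

```python
# Backward/forward filling of missing values with pandas.
# The 'sensor' column is an illustrative stand-in for the real data.
import numpy as np
import pandas as pd

df = pd.DataFrame({"sensor": [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan]})

# Forward fill propagates the last valid value; backward fill then covers
# any NaNs left at the start of the series.
df["filled"] = df["sensor"].ffill().bfill()
print(df)
```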
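A hedged sketch of the hyperparameter-tuning step using scikit-learn's GridSearchCV; the estimator, parameter grid, and synthetic data are illustrative assumptions, not the production setup.

```python
# Grid-search hyperparameter tuning with cross-validation (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```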
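A sketch of the JDBC ingestion path described above (an Oracle source read from Azure Databricks and landed as Delta in ADL). The host, service name, table, credentials, and storage path are placeholders, and the Delta Lake and Oracle JDBC libraries are assumed to be installed on the cluster.

```python
# Read a warehouse table over JDBC and land it as Delta in Azure Data Lake.
# All connection details below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-ingest").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/OMS")  # placeholder host/service
    .option("dbtable", "SHIPMENTS")                         # placeholder table
    .option("user", "svc_user")                             # placeholder credentials
    .option("password", "***")
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

df.write.format("delta").mode("overwrite").save(
    "abfss://raw@datalake.dfs.core.windows.net/oms/shipments"  # placeholder ADL path
)
```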
Environment: R/R Studio, SAS, SSRS, SSIS, Oracle Database 11g, Oracle BI tools, Tableau, MS-Excel, Python, Naive Bayes, SVM, K-means, ANN, Regression, MS Access, SQL Server Management Studio, SAS E-Miner
Confidential - Bellevue, WA
ML Engineer/Data Scientist
Responsibilities:
- Performed feature selection using Random Forest, SelectKBest, and RFE. Improved accuracy, reduced variance and bias, and improved the stability of models using ensemble methods in machine learning, such as boosting (Gradient Boosting, XGBoost, and AdaBoost) and bootstrap aggregating (bagging: Bagged Decision Trees and Random Forest).
- Created data models for data analysis and extraction, writing complex SQL queries in Oracle, PostgreSQL, and MySQL.
- Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark) on Azure Databricks.
- Connected to files, relational sources, and big data sources using Tableau to visually analyze and process data; also used Tableau to create and distribute interactive, shareable dashboards showing the trends, density, and variations of the data in graphs and charts.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data; supported the design of efficient and robust ETL workflows (extract, transform, load) on large datasets and the creation of big data warehouses that can be used for reporting or analysis by data scientists.
- Worked with cloud computing to store, retrieve, and share large quantities of data in Azure Data Lake; read from and wrote to the Data Lake from Apache Hadoop, Apache Spark, and Apache Hive. Used PCA for dimensionality reduction and created the K-means clustering.
- Used Apache Flume and Apache Sqoop to load both structured and unstructured streaming data into HDFS, Hive, and HBase.
- Used Spark stream processing to bring data in-memory and implemented RDD transformations and actions to process it in units.
- Created Hive tables and implemented partitioning, dynamic partitions, and buckets, and created external tables to optimize performance (see the Spark SQL DDL sketch after this list).
- Used Spark and Spark SQL to read the Parquet data and create the tables in Hive using the Scala API.
- Collaborated with product management and other departments to gather requirements. Improved model performance using the K-fold cross-validation technique and tested the model on sample data to enhance it before finalizing. Used a confusion matrix and ROC chart to evaluate the classification model (see the evaluation sketch after this list).
- Applied various machine learning algorithms and statistical modeling like decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Built and tested different Ensemble Models such as Bootstrap aggregating, Bagged Decision Trees and Random Forest, Gradient boosting, XGBoost, and AdaBoost to improve accuracy, reduce variance and bias, and improve stability of a model
- Experienced with telecom mobile and wireless data network architecture.
- Documented logical data models, semantic data models and physical data models.
- Implemented model on batch data using Spark SQL.
- Performed MapReduce jobs and Spark analysis using Python and R for machine learning and predictive analytics models on big data in the Hadoop ecosystem on the AWS cloud platform, as well as some data from on-premises SQL.
- Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources. Used the K-means clustering technique to identify outliers and to classify unlabeled data.
- Worked in the agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure the project progressed satisfactorily.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
- Delivered fraud dashboards, trends, and plots on fraud data.
- Studied fraud cases and identified process gaps to prevent losses.
- Identified fraud patterns and rebuilt the machine learning models for fraud alarms.
- Used clustering and statistical plots to analyze data using R.
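A sketch of the partitioned external Hive table work described above, issued through Spark SQL; the database, table, columns, and location path are illustrative.

```python
# Create an external, partitioned Hive table and load it with a
# dynamic-partition insert. Names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()
)

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.events (
        user_id STRING,
        action  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/data/logs/events'
""")

# Let Hive derive the partition values from the query itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO logs.events PARTITION (event_date)
    SELECT user_id, action, event_date
    FROM logs.events_staging
""")
```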
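A minimal sketch of the evaluation workflow above: K-fold cross-validation on a training split, then a confusion matrix and ROC AUC on held-out data. The dataset and estimator are synthetic stand-ins.

```python
# K-fold cross-validation plus confusion-matrix/ROC evaluation (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
print("5-fold CV accuracy:", cross_val_score(model, X_tr, y_tr, cv=5).mean())

model.fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```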
Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, LightGBM Classifier, XGBoost Classifier, A/B experiments, Git 2.x, Agile/SCRUM.
Confidential - Deerfield, IL
Data Scientist/ML Engineer
Responsibilities:
- Developed pipelines using SparkML that drive data for automated training and testing of the models.
- Supervised model types including Generalized Linear Models, Random Forests, Gradient Boosting Machines, Support Vector Machines, Deep Learning Neural Nets, and Ensemble Learning/Stacking; unsupervised model types like Principal Component Analysis, K-means clustering, Hierarchical Clustering, and Autoencoders.
- Built models for highly imbalanced data sets; bias/variance tradeoff; model quality metrics like R-squared and AUC; outlier detection and removal.
- Advanced Statistics/Math: ANOVA/ANCOVA. Bootstrapping, confidence intervals.
- Worked in Big Data Hadoop Hortonworks, HDFS architecture, R, Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib, PyHive, Keras, Hive, NoSQL (HBase), Sqoop, Pig, MapReduce, Oozie, and Spark MLlib.
- Used Cloudera Hadoop YARN to perform analytics on data in Hive and build models with big data frameworks like Cloudera Manager and Hadoop.
- Worked with different data science models and machine learning algorithms such as Linear and Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, KNN, and deep learning.
- Causal modeling in both experimental and observational data sets. Bayesian networks. Bayesian regression.
- Advanced programming in Python using SciKit-Learn and NumPy libraries.
- Predicted the Remaining Useful Life (RUL), or Time to Failure (TTF), using regression.
- Predicted whether an asset will fail within a certain time frame (e.g. days) with binary classification.
- Used an LSTM to predict the probability of failure at different time intervals, compensating for independent variables reflecting states of wear (see the Keras sketch after this list).
- Worked in the agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure the project progressed satisfactorily.
- Detected anomalies using the data science concepts of Support Vector Machines (SVM), K-means clustering, and K-nearest neighbors.
- Utilized Pandas, NumPy, Keras, scikit-learn, SciPy, and TensorFlow to predict categories based on location, time, and other features, with Linear Regression, Logistic Regression, Decision Trees, Random Forests, Deep Neural Networks, KNN, XGBoost, K-means Clustering, Support Vector Machines, time series analysis (ARIMA model), Ensembles (Bagging, Boosting), Neural Networks, and graph/network analysis.
- Trained a large set of models, then evaluated, compared, and selected the best models for prediction and forecasting. Set up and maintained effective procedures for confusion matrices and K-fold validation, and validated different models. Used the TensorFlow deep learning library to address image recognition problems, developing convolutional neural networks in Python.
- Performed feature selection using Random Forest, SelectKBest, and RFE; improved accuracy, reduced variance and bias, and improved the stability of models using ensemble methods in machine learning, for example boosting (Gradient Boosting, XGBoost, and AdaBoost) and bootstrap aggregating (bagging: Bagged Decision Trees and Random Forest). Managed and deployed machine learning workflows and models into production, being familiar with Azure Machine Learning Model Management.
- Worked on ETL on Hadoop, including data wrangling and feature selection and extraction. Used Spark SQL for ETL of raw data.
- Connected to files, relational sources, and big data sources using Tableau to visually analyze and process data; also used Tableau to create and distribute interactive, shareable dashboards showing the trends, density, and variations of the data in graphs and charts.
- Created data models for data analysis and extraction, writing complex SQL queries in Oracle, PostgreSQL, and MySQL.
- Knowledge of AR (Autoregressive), MA (Moving Average), and ARIMA (Autoregressive Integrated Moving Average) time series analysis models. A statistical model with grid search over ARIMA orders was used to predict CVS market demand (see the statsmodels sketch after this list).
- Worked in the Agile process with self-organizing, cross-functional teams moving toward results in rapid, iterative, incremental, and adaptive steps.
- Worked with cloud computing to store, retrieve, and share large quantities of data in the Amazon S3 object store on AWS; read from and wrote to S3 from Apache Hadoop, Apache Spark, and Apache Hive. Used PCA for dimensionality reduction and created the K-means clustering.
- Experience with graphical analyses and NoSQL databases. Assessed and performed POCs on new strategic products and applications.
- Used Apache Flume and Apache Sqoop to load both structured and unstructured streaming data into HDFS, Hive, and HBase.
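A hedged Keras sketch of the LSTM failure-probability model mentioned above: sequences of sensor readings mapped to a probability of failure within a window. The shapes, layer sizes, and random data are illustrative, not the production configuration.

```python
# LSTM over sensor sequences -> probability of failure within a time window.
import numpy as np
from tensorflow import keras

timesteps, n_features = 50, 8
X = np.random.rand(200, timesteps, n_features)  # stand-in sensor sequences
y = np.random.randint(0, 2, size=(200,))        # 1 = failed within the window

model = keras.Sequential([
    keras.layers.Input(shape=(timesteps, n_features)),
    keras.layers.LSTM(64),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation="sigmoid"),  # failure probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```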
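A sketch of the grid search over ARIMA orders mentioned above, using statsmodels; the synthetic series and the (p, d, q) grid are illustrative assumptions.

```python
# Grid search over ARIMA (p, d, q) orders by AIC (statsmodels).
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.randn(200))  # stand-in demand series

best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(series, order=(p, d, q)).fit()
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, d, q)
    except Exception:
        continue  # some orders fail to converge; skip them

print("best order:", best_order, "AIC:", best_aic)
```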
Confidential
Data Analyst/Data Scientist
Responsibilities:
- Integrated data from multiple data sources or functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.
- Expertise leveraging SQL, Excel and Tableau to manipulate, analyze and present data.
- Performed analyses of structured and unstructured data to solve multiple and/or complex business problems utilizing advanced statistical techniques and mathematical analyses.
- Developed advanced models using multivariate regression, Logistic regression, Random forests, decision trees and clustering.
- Used Pandas, NumPy, Seaborn, and scikit-learn in Python for developing various machine learning algorithms.
- Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data (see the scikit-learn sketch after this list).
- Experienced working with distributed computing technologies (Apache Spark, Hive).
- Applied predictive analysis and statistical modeling techniques to analyze customer behavior and offer customized products, reducing the delinquency rate and default rate; this led to a fall in default rates from 5% to 2%.
- Applied machine learning techniques to tap into new markets and new customers, and put forth recommendations to top management, which resulted in a 5% increase in the customer base and a 9% increase in the customer portfolio.
- Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore opportunities for cross-selling of financial products; 60% of customers (up from 40%) availed themselves of more than 6 products.
- Experienced in implementing Time Series Models using RNNs, LSTMs, ARIMA, VARIMA, and SARIMA.
- Collaborated with business partners to understand their problems and goals and to develop predictive modeling, statistical analysis, data reports, and performance metrics.
- Participated in the on-going design and development of a consolidated data warehouse supporting key business metrics across the organization.
- Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.
- Dashboard and report development experience using Tableau.
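A minimal sketch of the NLP modeling described above: TF-IDF features feeding a classifier to pull signal from unstructured text. The toy corpus and labels are illustrative placeholders.

```python
# TF-IDF + logistic regression over unstructured text (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "payment failed again",
    "great service, thank you",
    "card declined at checkout",
    "fast delivery, very happy",
]
labels = [1, 0, 1, 0]  # 1 = complaint, 0 = praise (illustrative)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["my card was declined"]))
```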
Environment: MS SQL Server, R/R Studio, SQL Enterprise Manager, Python, Redshift, MS Excel, Power BI, Tableau, T-SQL, ETL, MS Access, XML, MS Office, Outlook, SAS E-Miner