Data Scientist Resume
CA
SUMMARY
- Data Scientist with 8+ years of experience executing data-driven solutions, with strong knowledge of Data Analytics, Text Mining, Machine Learning (ML), Predictive Modeling, and Natural Language Processing (NLP).
- Experience productionizing Machine Learning pipelines on Google Cloud Platform that perform data extraction, data cleaning, model training, and performance-based model updates.
- Utilized GCP resources, namely BigQuery, Cloud Composer, Compute Engine, Kubernetes clusters, and Cloud Storage buckets, to build production ML pipelines.
- Expertise in building batch and streaming data pipelines that pull data from multiple sources into Google BigQuery, using Python, Kafka, Dataflow, and Dataproc.
- Expertise in building ML models that predict failure events on store self-checkout machines and provide the root cause of those failures.
- Proficient in Statistical Modeling and Machine Learning techniques (Linear and Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian methods, XGBoost) for forecasting and predictive analytics.
- Hands-on experience solving problems that bring significant business value by building predictive models on structured and unstructured data.
- Built a Machine Learning model to predict hourly sales (Orders, Invoices and Shipments) for an ecommerce platform.
- Hands-on experience with Machine Learning algorithms such as Linear Regression, GLM, CART, SVM, KNN, LDA/QDA, Naive Bayes, Random Forest, and Boosting.
- Experience working on Windows, Linux, and UNIX platforms, including programming and debugging skills in UNIX shell scripting.
- Hands-on experience creating data visualizations and dashboards in Tableau Desktop.
- Expertise in time series analysis; built forecasting models to predict temperature and humidity spikes inside cold storage warehouses (see the forecasting sketch at the end of this summary).
- Expertise in building monitoring dashboards that visualize the current and predicted health of the cold storage warehouses.
- Proficient in Python and its libraries such as NumPy, Pandas, Scikit-learn, Matplotlib and Seaborn.
- Experience building data warehouses, data marts, and data cubes that feed Power BI reports visualizing key business performance indicators.
- Utilized Python libraries, namely pandas, Matplotlib, and Plotly, for data analysis and visualization, and predicted unexpected reboot events on store self-checkout machines (POS systems).
- Built a facial recognition model that is used for user authentication in an employee work-hours tracking system.
- Expertise in building computer vision and deep learning applications.
- Utilized Python's Flask framework to build REST APIs on top of the data lake (BigQuery, Cloud SQL); see the REST API sketch at the end of this summary.
- Using Python, Telegraf, and Kafka, built metrics data pipelines that push virtual infrastructure performance metrics and failure events to a Wavefront tenant, and built monitoring dashboards on top of them.
- Using Docker and Ansible, containerized the virtual infrastructure's configuration management tasks, which detect configuration drift and restore the original configurations.
- Expertise in building user interface (UI) applications using Bootstrap.
- Expertise in containerizing applications using Docker Compose.
- Achieved Continuous Integration & Continuous Deployment (CI/CD) for applications using Concourse.
- Experience with Test-Driven Development (TDD), Agile methodologies, and Scrum processes.
- Experience with version control and collaboration tools such as Git and SourceTree.
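The following is a minimal sketch of the cold-storage forecasting approach described above, using Prophet on an hourly sensor extract. The file name, the `ds`/`y` columns, and the 8 °C alert threshold are illustrative assumptions, not the production code.

```python
# Minimal sketch (assumptions noted above): hourly temperature forecasting
# for a cold-storage warehouse with Prophet.
import pandas as pd
from prophet import Prophet

# Hourly sensor readings; 'ds' is the timestamp, 'y' the observed temperature.
history = pd.read_csv("warehouse_temperature.csv", parse_dates=["ds"])

model = Prophet(daily_seasonality=True, weekly_seasonality=True)
model.fit(history[["ds", "y"]])

# Forecast the next 24 hours and flag readings expected to breach a threshold.
future = model.make_future_dataframe(periods=24, freq="H")
forecast = model.predict(future)
spikes = forecast[forecast["yhat_upper"] > 8.0]  # 8 °C threshold is illustrative
print(spikes[["ds", "yhat", "yhat_upper"]].tail())
```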
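And a minimal sketch of a Flask REST endpoint served on top of BigQuery, of the kind described above. The dataset, table, and field names are hypothetical placeholders, not the original data-lake schema.

```python
# Minimal sketch: Flask REST API over a BigQuery table of device events.
from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

@app.route("/stores/<store_id>/failures")
def store_failures(store_id):
    """Return recent failure events for one store from the data lake."""
    sql = """
        SELECT event_ts, device_id, error_code
        FROM `analytics.device_failure_events`   -- hypothetical table
        WHERE store_id = @store_id
        ORDER BY event_ts DESC
        LIMIT 100
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("store_id", "STRING", store_id)
        ]
    )
    rows = bq.query(sql, job_config=job_config).result()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=8080)
```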
TECHNICAL SKILLS
Languages: Python, Java, JavaScript, C, C++, SQL
ML/AI: TensorFlow, Keras, Scikit-learn, Prophet, PySpark, NLTK, Airflow, Pandas, OpenCV
Databases: MySQL, SQL Server, PostgreSQL, MongoDB
Reporting Tools: Power BI, WaveFront
Predictive and Machine Learning: Regression (Linear, Logistic, Bayesian, Polynomial, Ridge, Lasso), Classification (Logistic Reg., two/multiclass classification, Boosted Decision Tree, Random Forest, Decision Tree, Naïve Bayes, Support Vector Machines, k-Nearest Neighbors, Neural Network, and various other models), Clustering (K-means, Hierarchical), Anomaly Detection, LSTM, RNN
Cloud: Google Cloud Platform, Pivotal Cloud Foundry, Azure, AWS
GCP ML Resources: BigQuery, Cloud Composer, AI Platform, Kubeflow
GCP Other Resources: Dataflow, Dataproc, Compute Engine, Google Kubernetes Engine, App Engine
Other Cloud Resources: Azure Databricks, AWS Glue
Frameworks: Flask, Django, Falcon, Bottle
Tools: Apache Spark, Kafka, Docker, Git, Concourse, Swagger
Operating Systems: Linux, Windows, Unix, macOS
Automation Tools: Ansible, Telegraf
PROFESSIONAL EXPERIENCE
Confidential, CA
Data Scientist
Responsibilities:
- Built Power BI reports showing key business KPIs and the complete order life cycle, from order purchase to shipment.
- Applied supervised machine learning algorithms (Logistic Regression, Decision Tree, and Random Forest) to predictive modeling of several problem types: successful transition from a skilled nursing facility, identifying predictors for Medicare Advantage members, lowering the cost of mitigating homelessness, and issues management.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python, and dashboards in Tableau.
- Built an ML model for customer segmentation, which helped increase sales through strategic email campaigns targeting offers to specific customers.
- Developed a machine learning model to predict hourly sales (orders, invoices, and shipments).
- Built a data warehouse by gathering all the business data related to doctors, patients, prescriptions, orders and calls from different sources.
- Built doctor report cards to visualize doctor performance over the years.
- Built financial and executive reports to visualize revenue, profit margin, and other key performance indicators of the business.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Automated the temperature monitoring system by building a forecasting model to predict the temperature and humidity spikes inside cold storage warehouses.
- Utilized IoT sensors to collect warehouse health data and built a streaming data pipeline into GCP BigQuery.
- Used Pandas, NumPy, and Scikit-learn in Python to develop machine learning models with algorithms such as Linear Regression, Logistic Regression, Gradient Boosting, SVM, and KNN (see the model-comparison sketch after this list).
- Developed a pipeline using Hive (HQL) to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, with ETL for data transformation.
- Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality. Performed data wrangling to clean, transform, and reshape the data using the NumPy and Pandas libraries.
- Completed a highly immersive data science program involving data manipulation & visualization, web scraping, machine learning, Python programming, SQL, and Git.
- Worked with datasets of varying size and complexity, both structured and unstructured, and participated in all phases of the workflow: data collection, data cleaning, data mining, variable selection, feature engineering, model development, validation, visualization, and gap analysis.
- Used Tableau Desktop for creating data visualizations.
- Wrote UNIX shell scripts and SQL scripts for development, automation of the ETL process, error handling, and auditing purposes.
- Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, data lakes, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Installed and configured a multi-node cluster in the cloud using AWS.
- Productionized machine learning pipelines that gather data from BigQuery and build forecasting models to predict temperature and humidity spikes inside the warehouses.
- Built monitoring dashboards that visualize the current and predicted health of the cold storage warehouses.
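A minimal sketch of the supervised model comparison described above, assuming a prepared feature extract with a binary `target` column; the file name and columns are hypothetical, not the real datasets.

```python
# Minimal sketch: cross-validated comparison of the classifiers named above.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("member_features.csv")            # hypothetical extract
X, y = data.drop(columns=["target"]), data["target"]

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=6),
    "random_forest": RandomForestClassifier(n_estimators=300, n_jobs=-1),
}

# 5-fold cross-validated AUC for each candidate model.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```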
Environment: Python, HDFS, ODS, OLTP, Power BI, Oracle 10g, Hive, AWS, OLAP, DB2, Metadata, Teradata, MS Excel, Mainframes, MS Visio, Spark, MapReduce, Rational Rose, SQL, MongoDB, Unix/Linux.
Confidential, Alpharetta, GA
Data Scientist
Responsibilities:
- Contributed to a talented research team of data scientists in the field of Computer Vision: translated application requirements into data models, supported the standardization and effective adoption of cutting-edge scientific norms and practices, and helped integrate AI/ML into everyday workflows, sharing learnings, best practices, and research across many domains.
- Managed data pipelines using RDF graphs and primitives from Apache Beam and Apache NiFi to build transparent, manageable data flows on GCP Dataflow and Google BigQuery, delivering a practically fully automated solution that reduced daily routine work.
- Applied probabilistic graphical methods (Bayesian and Gaussian networks) to create machine learning models.
- Built machine learning classification models using supervised algorithms such as Boosted Decision Trees, Logistic Regression, SVM, Random Forest, Naïve Bayes, and KNN.
- Implemented classification models, including Random Forest and Logistic Regression, to quantify the likelihood of each member enrolling for the upcoming enrollment period.
- Built deep learning architectures to achieve higher performance on classification, object detection and localization, and image segmentation tasks across a variety of image data, rapidly experimenting with and customizing models from the latest research via transfer learning, and observed model performance on new image data to quantify data ambiguity.
- Gathered the required data from multiple data sources and created the datasets used in analysis.
- Used Linear Regression to estimate member cost for the upcoming enrollment period.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Developed Spark code using Scala and Spark-SQL for faster processing and testing.
- Hands-on expertise working with data formats such as JSON and XML, and applying machine learning algorithms in Python.
- Explored and analyzed gaming-specific features using Matplotlib and ggplot2. Extracted structured data from MySQL databases and CRM systems, developed basic visualizations, and analyzed A/B test results.
- Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, R, and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Updated Python scripts to match training data against a database stored in AWS CloudSearch, assigning each document a response label for further classification.
- Responsible for reporting findings, using gathered metrics to infer and draw logical conclusions about past and future behavior.
- Collaborated with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to extract data to fit analytical requirements.
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Executed the SAS jobs in batch mode through UNIX shell scripts.
- Created a deep learning algorithm that predicts the type of object found in a typical house. Used OpenCV for image analysis, and Keras with TensorFlow for implementing artificial neural networks (ANNs).
- Created dense models, autoencoders, and convolutional neural networks (CNNs) from pre-trained models such as VGG, MobileNet, NASNet, ResNet, and DenseNet, using Vowpal Wabbit (VW), TensorFlow, Keras, and PyTorch in Python, training them on unseen real-world and synthetic image data for medical image multi-class labeling and product categorization (see the transfer-learning sketch after this list).
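A minimal Keras transfer-learning sketch of the kind of pre-trained-backbone classifier described above, assuming images organized in class-named folders; the directory path, class count, and MobileNetV2 choice are illustrative assumptions.

```python
# Minimal sketch: frozen pre-trained backbone + small classification head.
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of labels
train_ds = tf.keras.utils.image_dataset_from_directory(
    "images/train", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained backbone

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```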
Environment: Data Pipeline Management, Acyclic Graphs, AWS, Data Governance, Hybrid Cloud, GCP Dataflow, Google BigQuery, Apache NiFi, Scala, Apache Beam, Spark, Azure Cloud, Azure ML Studio, Vowpal Wabbit, Image Processing, Computer Vision (CV), Unix/Linux.
Confidential, Austin TX
Machine Learning Consultant
Responsibilities:
- Built an ML model to automate root-cause identification for failure events on store self-checkout machines (POS systems).
- Decreased enterprise ServiceNow tickets by 15% by building a Backup-as-a-Service offering that gives customers the ability to initiate backups and restores on servers.
- As part of capacity planning, built an ML model to predict the CPU and disk usage of on-premises infrastructure.
- Using Apache Airflow, built data pipelines to gather data from store checkout devices into BigQuery (see the DAG sketch after this list).
- Used classification techniques, including Random Forest and Logistic Regression, to quantify the likelihood of each user making a referral.
- On Google Cloud Platform, productionized machine learning pipelines using Cloud Composer, BigQuery, and Cloud Storage buckets that perform data extraction, data cleaning, model training, and performance-based model updates.
- Using Python's Django framework and CloudBolt, built a web user interface where users can perform backups and restores on servers.
- Designed and developed an automation process that helps the enterprise maintain common configurations and detect configuration drift across its virtual infrastructure.
- Using Docker, containerized all the configuration management tasks and uploaded the Docker images to the local Artifactory registry.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, Natural Language Processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the Scikit-learn package in Python.
- Visualized data using MS Power BI, ggplot, Seaborn, Matplotlib, and Plotly.
- Utilizing Python and Kafka, built data pipelines that pull data from multiple sources (vCenters, databases, store devices) into Google BigQuery (see the Kafka producer sketch after this list).
- Utilized machine learning algorithms such as logistic regression, multivariate regression, K-means, and recommendation algorithms to extract hidden information from the data.
- Used Pandas, NumPy, and Scikit-learn in Python to develop machine learning models with algorithms such as Linear Regression, Logistic Regression, Gradient Boosting, SVM, and KNN.
- For serving data, built REST APIs on top of the data lake (BigQuery, Cloud SQL).
- Wrote Dataflow jobs for moving data across Google Cloud Platform.
- Using Python, built an alerting application that sends alerts (email, Slack messages) and creates ServiceNow tickets on critical Rubrik (backup and recovery management system) failure events.
- Using Python, built a data pipeline that pulls the performance metrics of all virtual infrastructure into a Wavefront tenant, thereby achieving longer data retention periods.
- Using Wavefront, built dashboards showing the performance of all virtual infrastructure, including vCenters, clusters, ESXi and Nutanix hosts.
- Using Angular and Bootstrap, built a UI application that serves as a single stop for information about all enterprise virtual infrastructure.
- Built a navigation robot that helps a person navigate to any specified location inside the office area.
- Deployed all applications and REST APIs to Google App Engine, Pivotal Cloud Foundry, and on-premises Linux servers.
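The following is a minimal Airflow (Cloud Composer) sketch of the extract-load-retrain flow described above; the DAG id, task bodies, and the retrain helper are hypothetical placeholders, not the production pipeline.

```python
# Minimal sketch: Airflow DAG for device events -> BigQuery -> model retrain.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_device_events(**context):
    """Pull the latest self-checkout events from the source (placeholder)."""
    ...

def load_to_bigquery(**context):
    """Append the cleaned events to a BigQuery staging table (placeholder)."""
    ...

def retrain_if_degraded(**context):
    """Retrain and publish the failure model when metrics slip (placeholder)."""
    ...

with DAG(
    dag_id="pos_failure_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_device_events)
    load = PythonOperator(task_id="load", python_callable=load_to_bigquery)
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_if_degraded)
    extract >> load >> retrain
```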
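And a minimal sketch of a Kafka producer publishing device metrics of the kind fed into BigQuery above, using the kafka-python client; the broker address, topic name, and payload fields are hypothetical.

```python
# Minimal sketch: publish one performance sample per device to a Kafka topic;
# a downstream consumer (not shown) writes the samples into BigQuery.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],          # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_metric(store_id, device_id, cpu_pct, disk_pct):
    """Send one device performance sample to the metrics topic."""
    producer.send("device-metrics", {
        "store_id": store_id,
        "device_id": device_id,
        "cpu_pct": cpu_pct,
        "disk_pct": disk_pct,
        "ts": time.time(),
    })

publish_metric("store-042", "sco-7", 63.5, 81.2)
producer.flush()
```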
Environment: Customer Lifetime Value (CLTV), Risk Management, Fraud Detection, Customer Segmentation, Hadoop Ecosystem, HDFS, Hive, Hive QL, Pig, Sqoop, Map Reduce, Regression, Time-Series Forecasting, Predictive Analytics, Clustering, Text Mining, NLP, Unix/Linux.