
Sr Data Engineer/python With Aws Cloud Resume


Atlanta, GA

SUMMARY

  • Overall, 8+ years of technical IT experience in all phases of Software Development Life Cycle (SDLC) with skills in data analysis, design, development, testing and deployment of software systems.
  • Strong experience in data engineering, Big Data analytics, and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Presto, Kafka, Flume, Oozie, Sqoop, and Spark integration with Cassandra and Zookeeper.
  • Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda, Step Functions, CloudWatch, SNS, DynamoDB, and SQS.
  • Experience in creating real-time data streaming solutions using Apache Spark, Spark Streaming, Apache Kafka, and Apache Flink.
  • Managed databases and Azure Data Platform services (Azure Data Lake Storage (ADLS), Azure Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), as well as SQL Server, Oracle, and data warehouses; built multiple data lakes.
  • Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau and Power BI.
  • Created snowflake schemas by normalizing the dimension tables as appropriate, and created a sub-dimension named Demographic as a subset of the Customer dimension.
  • Hands-on experience with test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD) approaches.
  • Provided full life cycle support for logical/physical database design, schema management, and deployment; adapted to the database deployment phase with strict configuration management and controlled coordination with different teams.
  • Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
  • Familiar with latest software development practices such as Agile Software Development, Scrum, Test Driven Development (TDD) and Continuous Integration (CI).
  • Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Extensive hands-on experience in using distributed computing architectures such as AWS products (EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Knowledge of working with proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension tables and fact tables.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automation regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Skilled in performing data parsing, data ingestion, data manipulation, data architecture, data modeling, and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.

TECHNICAL SKILLS

Hadoop Distributions: Cloudera, AWS EMR and Azure Data Factory.

Languages: Scala, Python, SQL, HiveQL, KSQL.

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

AWS Services: VPC, IAM, S3, Elastic Beanstalk, CloudFront, Redshift, Lambda, Kinesis, DynamoDB, Direct Connect, Storage Gateway, EKS, DMS, SMS, SNS, and SWF

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera

Machine Learning and Statistics: Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Other Software: Control M, Eclipse, PyCharm, Jupyter, Apache, Jira, Putty, Advanced Excel

Frameworks: Django, Flask, WebApp2

PROFESSIONAL EXPERIENCE

Confidential, Atlanta, GA

Sr Data Engineer/Python with AWS Cloud

Responsibilities:

  • Designed and set up an Enterprise Data Lake to support various use cases, including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
  • Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation and ensuring integrity in a relational environment, working closely with the stakeholders and solution architect.
  • Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
  • Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3.
  • Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
  • Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources (a sketch of this approach follows this list).
  • Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.
  • Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD2 date chaining and duplicate cleanup.
  • Developed a fully automated continuous integration system using Git, Jenkins, MySQL and custom tools developed in Python and Bash which saved $85K YOY.
  • Automated the nightly build to run quality control using Python with the Boto3 library to make sure the pipeline does not fail, reducing effort by 70%.
  • Involved in the development of Ansible playbooks with Python and SSH as a wrapper for management of AWS node configurations and testing playbooks on AWS instances.
  • Developed Python AWS serverless Lambda functions with concurrency and multi-threading to make processing faster by executing callables asynchronously (see the sketch after this list).
  • Chunked larger data sets into smaller pieces using Python scripts for faster data processing.
  • Processed this data using the Spark Streaming API with Scala.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, allowing a more reliable and faster reporting interface with sub-second response for basic queries.
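
A minimal sketch of the AMI-cleanup Lambda described in the list above, assuming the standard boto3 client available in the Lambda runtime; the region list and the dry_run flag are illustrative, and the only retirement criterion shown is "no instance in the region still references the image":

# Minimal sketch (not the production framework): deregister AMIs that no
# EC2 instance in the region references. Region list is an assumption.
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # hypothetical application regions

def images_in_use(ec2):
    """Collect the ImageIds referenced by any instance in the region."""
    used = set()
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                used.add(instance["ImageId"])
    return used

def lambda_handler(event, context, dry_run=True):
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        used = images_in_use(ec2)
        # Only consider AMIs owned by this account.
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            if image["ImageId"] in used:
                continue
            if dry_run:
                print(f"Would deregister {image['ImageId']} in {region}")
                continue
            ec2.deregister_image(ImageId=image["ImageId"])
            # Clean up the snapshots backing the AMI; this is where the
            # storage cost savings actually come from.
            for mapping in image.get("BlockDeviceMappings", []):
                snapshot_id = mapping.get("Ebs", {}).get("SnapshotId")
                if snapshot_id:
                    ec2.delete_snapshot(SnapshotId=snapshot_id)
    return {"status": "done"}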
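
A second minimal sketch, this one of running independent callables concurrently inside a Lambda handler with the standard-library concurrent.futures module; process_item and the event shape are hypothetical stand-ins for the real work:

# Minimal sketch of concurrent execution inside a Lambda handler.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_item(item):
    # Placeholder for I/O-bound work (API call, S3 read, DB query, ...).
    return {"item": item, "ok": True}

def lambda_handler(event, context):
    items = event.get("items", [])
    results = []
    # Threads help most when the work is I/O bound, which is the typical case.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(process_item, item) for item in items]
        for future in as_completed(futures):
            results.append(future.result())
    return {"processed": len(results), "results": results}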

Environment: Hadoop, MapReduce, HDFS, Hive, Presto, Apache Flink, Python, Streaming, SQL, Amazon RDS, Amazon EC2, S3, CloudWatch, Spark, Scala, AWS, Git, Kafka, Redshift, DynamoDB, PostgreSQL.

Confidential, Boston, MA

Sr Data Engineer

Responsibilities:

  • Performed data analysis and developed analytic solutions, investigating data to discover correlations and trends and to explain them.
  • Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design.
  • Designed and developed Oracle PL/SQL and shell scripts, data import/export, data conversions, and data cleansing.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Stored data files in Google Cloud Storage buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain the GCP cloud-based solution.
  • Performed data analysis and statistical analysis and generated reports, listings, and graphs using SAS tools: SAS/Graph, SAS/SQL, SAS/Connect, and SAS/Access.
  • Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, using Kafka and integrating with Spark Streaming. Developed data analysis tools using SQL and Python code.
  • Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize raw data.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks. Migrated data from on-premises systems to AWS storage buckets.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Applied Spark Streaming for real-time data transformation.
  • Developed and deployed data pipeline in cloud such as AWS and GCP.
  • Followed Agile methodology, including test-driven development and pair-programming concepts.
  • Developed a Spark Streaming module for consumption of Avro messages from Kafka (a minimal sketch follows this list).
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
  • Developed a Python script using REST APIs to transfer and extract data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
  • Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures, and managing containers.
  • Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs, and UNIX shell scripting.
  • Developed complex Talend ETL jobs to migrate data from flat files to databases. Pulled files from the mainframe into the Talend execution server using multiple FTP components.
  • Developed Talend ESB services and deployed them on ESB servers on different instances.
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to upsert data into Snowflake from an ETL source (a minimal sketch follows this list).
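
A minimal sketch of a Spark Structured Streaming consumer for Avro messages from Kafka, assuming Spark 3.x with the spark-avro package on the classpath and plain Avro payloads (no schema-registry wire header); the broker, topic, schema, and sink paths are illustrative:

# Minimal sketch: consume Avro records from Kafka and write them to Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

ORDER_SCHEMA = """
{"type": "record", "name": "Order",
 "fields": [{"name": "order_id", "type": "string"},
            {"name": "amount",   "type": "double"}]}
"""

spark = SparkSession.builder.appName("kafka-avro-consumer").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "orders")                      # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# The Kafka "value" column is binary; decode it with the Avro schema.
orders = (raw
          .select(from_avro(raw.value, ORDER_SCHEMA).alias("order"))
          .select("order.*"))

query = (orders.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/orders/")            # hypothetical sink
         .option("checkpointLocation", "s3a://bucket/chk/")
         .outputMode("append")
         .start())

query.awaitTermination()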
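
And a minimal sketch of the Snowflake upsert, assuming the snowflake-connector-python driver and a staging table already populated by the ETL; the connection parameters and table/column names are illustrative:

# Minimal sketch: MERGE (upsert) from a staging table into a dimension table.
import snowflake.connector

MERGE_SQL = """
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
  tgt.name       = src.name,
  tgt.email      = src.email,
  tgt.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
  VALUES (src.customer_id, src.name, src.email, CURRENT_TIMESTAMP());
"""

def upsert_customers():
    conn = snowflake.connector.connect(
        account="my_account",      # hypothetical account/credentials
        user="etl_user",
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(MERGE_SQL)
        print(f"Rows affected: {cur.rowcount}")
    finally:
        conn.close()

if __name__ == "__main__":
    upsert_customers()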

Environment: Hadoop, MapReduce, HDFS, Hive, Apache Flink, Spring Boot, Cassandra, Data Streaming, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, GCP, GitHub, Redshift, Talend Big Data Integration, Solr, Impala.

Confidential, Atlanta, GA

Big Data Engineer

Responsibilities:

  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using linked services/datasets/pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back to the sources.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with data governance and data quality teams to design various models and processes.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
  • Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
  • Experience in designing both time-driven and data-driven automated workflows using Oozie.
  • Implemented real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
  • Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Built a new CI pipeline with testing and deployment automation using Docker, Docker Swarm, Jenkins, and Puppet. Utilized continuous integration and automated deployments with Jenkins and Docker.
  • Data visualization: Pentaho, Tableau, D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple. Performed big data analysis using Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, Pandas, and scikit-learn.
  • Worked with teams in setting up AWS EC2 instances using different AWS services like S3, EBS, Elastic Load Balancing, Auto Scaling groups, VPC subnets, and CloudWatch.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity. Built an ETL that utilizes a Spark JAR which executes the business analytical model.
  • Data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
  • Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn (a minimal sketch follows this list).
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
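
A minimal sketch of the tree-based feature-selection step on an imbalanced dataset with scikit-learn; the synthetic data generated here only stands in for the real fraud data:

# Minimal sketch: tree-based feature selection followed by a boosted model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~2% positive class, like a fraud label.
X, y = make_classification(n_samples=20_000, n_features=40, n_informative=8,
                           weights=[0.98, 0.02], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Random forest with class_weight to compensate for the class imbalance.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=42)
selector = SelectFromModel(rf, threshold="median").fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Train the downstream model on the reduced feature set.
gb = GradientBoostingClassifier(random_state=42).fit(X_train_sel, y_train)
print("Selected features:", X_train_sel.shape[1], "of", X.shape[1])
print("Test accuracy:", round(gb.score(X_test_sel, y_test), 4))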

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Docker Swarm, Azure, Spark SQL, TDD, Spark Streaming, GCP, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Confidential

AWS Data Engineer/Data Analyst

Responsibilities:

  • Gathered business requirements, definition and design of the data sourcing, worked with the data warehouse architect on the development of logical data models.
  • Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, tree maps, and complex reports involving property controls and custom expressions.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Created several types of data visualizations using Python and Tableau. Extracted large volumes of data from AWS using SQL queries to create reports.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated existing logical and physical data models. Developed a data pipeline using Kafka to store data in HDFS.
  • Performed regression testing for golden test cases from the state (end-to-end test cases) and automated the process using Python scripts.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets and managed their policies; utilized S3 and Glacier for storage and backup on AWS.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
  • Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, and sampling.
  • Worked on R packages to interface with the Caffe deep learning framework. Performed validation on machine learning output from R.
  • Applied different dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) on the feature matrix.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Responsible for design and development of Python programs/scripts to prepare, transform, and harmonize data sets in preparation for modeling.
  • Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
  • Implemented various machine learning algorithms and statistical models such as Decision Tree, text analytics, sentiment analysis, Naive Bayes, Logistic Regression, and Linear Regression using Python to determine the accuracy rate of each model (a minimal comparison sketch follows this list).
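
A minimal sketch of the model-accuracy comparison described above, using scikit-learn; the built-in breast-cancer dataset is only a stand-in for the project data:

# Minimal sketch: fit several classifiers and compare test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:20s} accuracy = {acc:.4f}")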

Environment: Spark, YARN, Hive, Pig, Scala, NiFi, TDD, Python, AWS, Hadoop, Azure, DynamoDB, NoSQL, Sqoop, MySQL.

Confidential

Software Engineer

Responsibilities:

  • Experience in Data Analytics and design in Hadoop ecosystem using Map Reduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built the Oozie pipeline, which performs several actions such as the file move process, Sqooping data from the source Teradata or SQL system, exporting it into the Hive staging tables, performing aggregations per business requirements, and loading into the main tables.
  • Ran Apache Hadoop, CDH, and MapR distros, dubbed Elastic MapReduce (EMR), on EC2.
  • Performed forking wherever there was scope for parallel processing to optimize data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS
  • Developed a Pig script that picks data from one HDFS path, performs aggregation, and loads it into another path, which later populates another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
  • Hands-on experience with Git commands such as git pull to pull the code from source and develop it per the requirements, git add to stage files, git commit after the code builds, and git push to the pre-prod environment for code review; later used a screwdriver.yaml which builds the code and generates artifacts that are released into production.
  • Created a logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Involved in creating UNIX shell scripts; handled defragmentation of tables, partitioning, compression, and indexes for improved performance and efficiency.
  • Developed a Python script to run SQL queries in parallel for the initial load of data into target tables (a minimal sketch follows this list). Involved in loading data from the edge node to HDFS using shell scripting and assisted in designing the overall ETL strategy.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.
  • Developed and implemented an R and Shiny application showcasing machine learning for business forecasting. Developed predictive models using Python and R to predict customer churn and classify customers.
  • Worked with applications like R, SPSS and Python to develop neural network algorithms, cluster analysis, ggplot2 and shiny in R to understand data and developing applications.
  • Partner with technical and non-technical resources across the business to leverage their support and integrate our efforts.
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
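
A minimal sketch of the parallel initial-load pattern with concurrent.futures; get_connection(), the source table, and the id-range partitions are illustrative placeholders for whatever DB-API driver and tables the job actually uses:

# Minimal sketch: run partitioned extract queries concurrently.
from concurrent.futures import ThreadPoolExecutor, as_completed

PARTITIONS = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

QUERY = """
SELECT *
FROM source_table
WHERE id >= {low} AND id < {high}
"""

def get_connection():
    # Placeholder: return a DB-API connection (pyodbc, psycopg2, cx_Oracle, ...).
    raise NotImplementedError

def load_partition(low, high):
    """Extract one id range and return the number of rows copied."""
    conn = get_connection()
    try:
        cur = conn.cursor()
        cur.execute(QUERY.format(low=low, high=high))
        rows = cur.fetchall()
        # In the real job the rows would be bulk-inserted into the target table.
        return len(rows)
    finally:
        conn.close()

def run_initial_load(max_workers=4):
    total = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_partition, lo, hi): (lo, hi)
                   for lo, hi in PARTITIONS}
        for future in as_completed(futures):
            lo, hi = futures[future]
            count = future.result()
            total += count
            print(f"Partition [{lo}, {hi}) copied {count} rows")
    return total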

Environment: Hadoop, MapReduce, Hive, Pig, Spark, HBase, Oozie, Impala, Kafka, Azure Data Factory, Databricks, AWS, Azure, Python, NumPy, Pandas, PL/SQL, SQL Server, UNIX, Shell Scripting, Git.
