
Big Data Engineer Resume


AZ

SUMMARY

  • Around 8 years of professional experience in the IT industry on the Big Data platform using the Hadoop framework, covering analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies.
  • Experience in implementing various Big Data Analytical, Cloud Data engineering, Data Warehouse, Data Visualization, Reporting, Data Quality, and Data virtualization solutions.
  • Proven track record of working as a Data Engineer on Amazon cloud services, Big Data/Hadoop applications, and product development.
  • Experience in designing Conceptual, Logical and Physical data models using the Erwin and E/R Studio data modeling tools, as well as AWS.
  • Strong big data experience with AWS cloud services: EC2, S3, Glue, Athena, DynamoDB, and Redshift.
  • Experience in job/workflow scheduling; worked on Oozie, AWS Data Pipeline, and AutoSys.
  • Provisioned highly available EC2 instances using Terraform and CloudFormation, and wrote new plugins to support new functionality in Terraform.
  • Knowledge of pushdown optimization concepts and tuning Informatica objects for optimal execution times.
  • Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as DBT and DataStage.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Good experience in deploying, managing and developing with MongoDB clusters.
  • Docker container orchestration using ECS, ALB, and Lambda.
  • Experience with Unix/Linux systems with scripting experience and building data pipelines.
  • Responsible for migrating applications running on premises onto the Azure cloud.
  • Experience in detailed system design using use case analysis, functional analysis, and modeling with class, sequence, activity, and state diagrams using UML and Rational Rose.
  • Experience with cloud databases and data warehouses (SQL Azure and Confidential Redshift/RDS).
  • Worked with the Informatica Data Quality toolkit: analysis, data cleansing, data matching, data conversion, exception handling, and the reporting and monitoring capabilities of IDQ 8.6.1.
  • Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, MS SQL.
  • Cluster monitoring and troubleshooting with Cloudera, Ganglia, Nagios, and Ambari metrics.
  • Expert in setting up Hortonworks clusters with and without using Ambari.
  • Experience in deploying and managing multi-node development and production Hadoop clusters with different Hadoop components (Hive, Pig, Sqoop, Oozie, Flume, HCatalog, HBase, ZooKeeper) using Hortonworks Ambari.
  • Played a key role in migrating Cassandra and Hadoop clusters to AWS and defined different read/write strategies.
  • Strong SQL development skills including writing Stored Procedures, Triggers, Views, and User Defined functions.
  • Experience in developing core Java/J2EE-based applications and database applications using Spring, Hibernate, Oracle/SQL, batch programming, Unix, and PHP.
  • Designed Tableau data extracts to improve visualization performance.
  • Hands-on experience helping clients create and adjust worksheets and data visualization dashboards in Tableau.
  • Implemented solutions using Hadoop, HBase, Hive, Sqoop, Java API, etc.
  • Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/ data marts from heterogeneous sources.
  • Good understanding of software development methodologies, including Agile (Scrum).
  • Expertise in development of various reports, dashboards using various Tableau visualizations.
  • Hands on experience with different programming languages such as Java, Python, R, SAS.
  • Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, Kafka, and Crontab tools.
  • Expert in creating Hive UDFs in Java in order to analyze data sets for complex requirements.
  • Extensive Tableau experience in enterprise environments, with Tableau 10.1/9.x/8.x/7.x as a Developer and Admin.
  • Expert in data visualization development using Tableau to create complex, intuitive, and creative dashboards.
  • Experience in developing ETL applications on large volumes of data using different tools: MapReduce, Spark-Scala, PySpark, Spark-Sql, and Pig.
  • Experience in using SQOOP for importing and exporting data from RDBMS to HDFS and Hive.
  • Experience on MS SQL Server, including SSRS, SSIS, and T-SQL.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.

Cloud Technologies: AWS, Azure, Google cloud platform (GCP)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML - Scikit-Learn, MLjar, etc.

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE

Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, NiFi, GCP, Google Shell, Linux, BigQuery, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.

PROFESSIONAL EXPERIENCE

Confidential, AZ

Big Data Engineer

Responsibilities:

  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Designed and implemented a Big Data analytics architecture, transferring data from Oracle.
  • Created DDLs for tables and executed them to create tables in the warehouse for ETL data.
  • Implemented the logical and physical relational database and maintained database objects in the data model using Erwin.
  • Designed, implemented, and maintained database schemas, entity relationship diagrams, data models, tables, stored procedures, functions, triggers, constraints, clustered and non-clustered indexes, partitioned tables, views, rules, defaults, and complex SQL statements for business requirements and performance enhancement.
  • Developed data pipelines using Flume, Sqoop, Pig, Java MapReduce, and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Exported the analyzed and processed data to an RDBMS using Sqoop for visualization and for generating reports for the BI team.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue using Python and PySpark.
  • Worked on designing, building, deploying, and maintaining MongoDB.
  • Design SSIS packages to bring data from existing OLTP databases to new data warehouse using various transformations and tasks like Sequence Containers, Script, for loop and Foreach Loop Container, Execute SQL/Package, Send Mail, File System, Conditional Split, Data Conversion, Derived Column, Lookup, Merge Join, Union All, OLE DB source and destination, excel source and destination with multiple data flow tasks.
  • Developed ETL framework using Spark and Hive (including daily runs, error handling, and logging)
  • Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data, automated using Oozie.
  • Improved the performance of SSIS packages by implementing parallel execution, removing unnecessary sorting, and using optimized queries and stored procedures.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, and DataFrames.
  • Developed pipeline for POC to compare performance/efficiency while running pipeline using the AWS EMR Spark cluster and Cloud Dataflow on GCP.
  • Configure and manage data sources, data source views, cubes, dimensions, mining structures, roles, defined hierarchy and usage-based aggregations with SSAS.
  • Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR and MapR (MapR data platform).
  • Worked on a small dashboard project using Java, Spring Boot, and REST APIs.
  • Responsible for maintaining and tuning existing cubes using SSAS and Power BI.
  • Worked on cloud deployments using maven, docker and Jenkins.
  • Coordinated with the Data Science team in designing and implementing advanced analytical models on the Hadoop cluster over large data sets.
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Used Python (Boto3) to configure AWS services such as Glue, EC2, and S3.
  • Involved in designing serverless application CI/CD using the AWS Serverless Application Model (Lambda); participated in the development and maintenance of Snowflake database applications.
  • Built the logical and physical data models for Snowflake as per the required changes.
  • Evaluated Snowflake design considerations for any change in the application.
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source (a minimal sketch follows below).
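
As an illustration of the merge scripts mentioned in the last bullet, a minimal sketch using the snowflake-connector-python package is shown below; the table names, columns, and connection parameters are hypothetical placeholders, not the actual project values.

```python
# Minimal sketch: UPSERT data into a Snowflake dimension table from an ETL staging table.
# All table/column names and connection parameters below are hypothetical placeholders;
# real credentials would come from a secrets manager rather than literals.
import snowflake.connector

MERGE_SQL = """
MERGE INTO analytics.dim_customer AS tgt
USING staging.etl_customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = src.customer_name,
    updated_at    = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, updated_at)
    VALUES (src.customer_id, src.customer_name, CURRENT_TIMESTAMP())
"""

def run_merge():
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
    )
    try:
        # The MERGE both updates existing rows and inserts new ones in a single statement.
        conn.cursor().execute(MERGE_SQL)
    finally:
        conn.close()

if __name__ == "__main__":
    run_merge()
```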

Confidential

Big Data Developer

Environment: Hadoop, Map Reduce, HDFS, Hive, Ni-fi, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend Big Data Integration, Impala.

Responsibilities:

  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Informatica BDM, T-SQL, Spark SQL, and Azure Data Lake Analytics. Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Implemented an ETL process through Informatica BDM and Python scripting to load data from Denodo (virtualization layer) to ThoughtSpot, helping the business run advanced algorithms.
  • Applied Hadoop MapReduce performance tuning techniques and built Hive queries efficiently.
  • Designed the distribution strategy for tables in Azure SQL Data Warehouse.
  • Designed and created Informatica mappings used to build data marts according to requirements.
  • Performed technical validation of functional specifications and data mapping from the source model into the target model.
  • Collaborate with business stakeholders to identify and meet data requirements.
  • Involved in the Workshop and High-Level Design of Claims Monthly Management Metrics (C3M) project under DEEP that enables claim historical reporting.
  • Supported business development in Claims Area by doing impact analysis, effort estimation and future state solution development.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and back again via the write-back tool.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with data governance and data quality to design various models and processes.
  • Implemented Big Data analytics and advanced data science techniques to identify trends, patterns, and discrepancies in petabytes of data using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning (a PySpark sketch follows this list).
  • Analyzed clickstream data from Google Analytics with BigQuery. Designed APIs to load data from Omniture, Google Analytics, and Google BigQuery.
  • Maintained JIRA team and program management review dashboards and maintained COP account and JIRA team sprint metrics reportable to customer and SAIC division management.
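
For illustration only, a minimal PySpark sketch of the Databricks-side processing referenced above is shown below; the storage account, container names, paths, and column names are hypothetical placeholders.

```python
# Minimal PySpark sketch: read raw claims data from Azure Data Lake Storage Gen2,
# aggregate it by month and claim type, and write curated output back to the lake.
# Storage account, containers, paths, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-curation").getOrCreate()

raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/claims/"
curated_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/claims_monthly/"

claims = spark.read.parquet(raw_path)

monthly = (
    claims
    .withColumn("claim_month", F.date_trunc("month", F.col("claim_date")))
    .groupBy("claim_month", "claim_type")
    .agg(F.count("*").alias("claim_count"),
         F.sum("claim_amount").alias("total_amount"))
)

# Partitioning by month keeps downstream reads (e.g. monthly reporting) selective.
monthly.write.mode("overwrite").partitionBy("claim_month").parquet(curated_path)
```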

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Swamp, Big Query, Spark SQL, TDD, Spark-Streaming, Hive, Scala, pig, NoSQL, Impala, Oozie, Hbase, Data Lake, Zookeeper.

Confidential

Data Engineer/Scala Developer

Responsibilities:
  • Responsible for architecting Hadoop clusters and translating functional and technical requirements into detailed architecture and design.
  • Worked on analyzing the Hadoop cluster and different big data analytical and processing tools including Pig, Hive, Spark, and Spark Streaming.
  • Migrated various Hive UDFs and queries into Spark SQL for faster requests.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala (an illustrative sketch follows this list).
  • Hands-on experience in Spark and Spark Streaming, creating RDDs and applying operations: transformations and actions.
  • Developed multiple POCs using Scala and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata.
  • Analyzed the SQL scripts and designed the solution to implement them using Scala.
  • Developed an analytical component using Scala, Spark, and Spark Streaming.
  • Developed and implemented custom Hive UDFs involving date functions.
  • Used Sqoop to import data from Oracle to Hadoop.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
  • Experienced in developing scripts for transformations using Scala.
  • Involved in developing shell scripts to orchestrate the execution of all other scripts and move data files within and outside of HDFS.
  • Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
  • Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability, and durability.
  • Used Tableau to generate weekly reports for the customer.
  • Analyzed the Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, and Sqoop.
  • Implemented the Kerberos security authentication protocol for the existing cluster.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated existing logical and physical data models. Developed a data pipeline using Kafka to store data into HDFS.
  • Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
  • Generated graphs and reports using the ggplot2 package in RStudio for analytical models. Developed and implemented an R and Shiny application showcasing machine learning for business forecasting.
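
The streaming ingestion above was written in Scala; purely for illustration, an equivalent sketch in PySpark Structured Streaming is shown below. Broker addresses, the topic name, and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster.

```python
# Illustrative PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion
# described above (the production job used Scala). Brokers, topic, and paths are
# hypothetical placeholders; the spark-sql-kafka package must be on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream-events")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Append each micro-batch to HDFS as Parquet; the checkpoint tracks Kafka offsets.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream/")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```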

Confidential

Hadoop Developer

Environment: Spark, YARN, HIVE, Pig, Scala, Mahout, NiFi, Java,TDD, Python, Spring Boot, Hadoop, Azure, Dynamo DB, Kibana, NOSQL, Sqoop, MYSQL.

Responsibilities:

  • Involved in installing Hadoop Ecosystem components.
  • Developed and ran MapReduce jobs on multi-petabyte YARN and Hadoop clusters that process billions of events every day, generating daily and monthly reports as needed.
  • Managed and reviewed Hadoop log files and was responsible for managing data coming from different sources.
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
  • Led discussions with users to gather business processes requirements and data requirements to develop a variety of Conceptual, Logical and Physical Data Models. Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.
  • Worked on machine learning on large size data using Spark and MapReduce.
  • Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms and utilized optimization techniques such as linear regression, K-means clustering, and Naive Bayes.
  • Developed UDFs in Java for Hive and Pig; developed Spark/Scala and Python code for a regular-expression project in the Hadoop/Hive environment on Linux/Windows big data resources.
  • Extracted, transformed, and loaded data from source systems to generate CSV data files using Python programming and SQL queries (a minimal sketch follows this list).
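
A minimal sketch of the Python-and-SQL extract step referenced above; the connection string, driver, query, and column names are hypothetical placeholders.

```python
# Minimal sketch: pull rows from a relational source with SQL, apply a light
# transformation with pandas, and emit a CSV data file. The connection string,
# driver, query, and column names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Assumes a MySQL source and the pymysql driver; swap for the actual database.
engine = create_engine("mysql+pymysql://etl_user:***@dbhost:3306/sales")

query = """
    SELECT order_id, customer_id, order_date, order_total
    FROM orders
    WHERE order_date >= CURDATE() - INTERVAL 1 DAY
"""

orders = pd.read_sql(query, engine)

# Simple transformation: normalize column names and derive a reporting month.
orders.columns = [c.lower() for c in orders.columns]
orders["order_month"] = pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)

orders.to_csv("daily_orders.csv", index=False)
```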

Confidential

Data Analyst

Environment: Hadoop, Map Reduce, Spark, Spark MLLib, Java, Tableau, SQL, Excel, VBA, SAS, MATLAB, ASPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML, Tableau.

Responsibilities:

  • Gathered high-level requirements and developed the scope of the project for the implementation of Microsoft Office SharePoint 2007.
  • Experience with data extraction, transformation, and loading (ETL) using various tools such as Data Transformation Services (DTS), SSIS, and Bulk Insert (BCP).
  • Responsible for creating test scenarios, scripting test cases using testing tool and defect management for Policy Management Systems, Payables/Receivables and Claims processing.
  • Worked on a billing system's cash management module and enhanced the encryption standards required for the application.
  • Worked on bug tracking reports on a daily basis using Quality Center.
  • Designed, developed and tested data mart prototype (SQL 2005), ETL process (SSIS) and OLAP cube (SSAS)
  • Worked on SQL Server concepts SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services).
  • Used data transformation tools such as DTS, SSIS, Informatica, and DataStage.
  • Implemented Change Data Capture using Informatica PowerExchange 9.1.
  • Designed and developed Informatica PowerCenter 9.5 mappings to extract, transform, and load data into Oracle 11g target tables.
  • Extensively worked on Informatica tools such as Source Analyzer, Mapping Designer, Workflow Manager, Workflow Monitor, Mapplets, Worklets, and Repository Manager.
  • Created the data model, designed the ETL process approach, identified metrics and KPIs from the data, and designed mockups to present the data on Tableau dashboards (an illustrative sketch follows this list).
  • Focused continuous improvement efforts on driving gap closures identified through the implementation of rigorous metrics and Key Performance Indicators (KPIs).
  • Performed Unit Testing and User Acceptance testing and documented detailed results.
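
Purely as an illustration of the KPI preparation behind the Tableau dashboards mentioned above, a minimal pandas sketch is shown below; the input file, columns, and KPI definitions are hypothetical placeholders.

```python
# Illustrative sketch: compute two hypothetical claims-processing KPIs from an
# extract and write them out as a dashboard data source. File name, columns,
# and KPI definitions are placeholders, not the actual project metrics.
import pandas as pd

claims = pd.read_csv("claims_extract.csv", parse_dates=["opened_date", "closed_date"])

closed = claims.dropna(subset=["closed_date"])
days_to_close = (closed["closed_date"] - closed["opened_date"]).dt.days

# KPI 1: average number of days to close a claim.
avg_days_to_close = days_to_close.mean()

# KPI 2: percentage of claims closed within 30 days.
pct_closed_30 = (days_to_close <= 30).mean() * 100

kpis = pd.DataFrame(
    {"kpi": ["avg_days_to_close", "pct_closed_within_30_days"],
     "value": [round(avg_days_to_close, 1), round(pct_closed_30, 1)]}
)
kpis.to_csv("kpi_summary.csv", index=False)  # consumed by the dashboard as a data source
```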

Environment: Oracle 10g/9i/8i/7.x, MS SQL Server, UDB DB2 9.x, Teradata, Quality Center, SQL Queries., KPI's, Siebel Analytics, SSIS Oracle BI, UML, OLAP, Data mining, Teradata SQL Assistant.
