Sr. Big Data Engineer Resume
NJ
OBJECTIVE:
- 8+ years of professional experience in information technology with expertise in Big Data, Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, SQL tuning, ETL development, report development, database development and data modeling, along with strong knowledge of Oracle database architecture.
- Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, Confidential, YARN/MRv2, Pig, Hive, Confidential, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spring Boot, Spark integration with Cassandra, Solr and ZooKeeper.
- Hands-on experience in test-driven development (TDD), behavior-driven development (BDD) and acceptance-test-driven development (ATDD) approaches.
- Managed databases and Azure data platform services (Azure Data Lake (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL Confidential), SQL Server, Oracle and data warehouses; built multiple data lakes.
- Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau, Power BI.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step Functions, CloudWatch, SNS, Dynamo Confidential and SQS (see the Lambda sketch after this list).
- Proficiency in multiple databases such as MongoDB, Cassandra, MySQL, Oracle and MS SQL Server; worked on different file formats such as delimited files, Avro, JSON and Parquet.
- Docker container orchestration using ECS, ALB and Lambda.
- Created snowflake schemas by normalizing the dimension tables as appropriate, and created a sub-dimension named Demographic as a subset of the Customer dimension.
- Expertise in Java programming with a good understanding of OOP, I/O, collections, exception handling, lambda expressions and annotations.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and Confidential.
- Experience in Extraction, Transformation and Loading (ETL) data from various sources.
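As a brief illustration of the AWS pipeline work summarized above, the sketch below shows a hypothetical S3-triggered Lambda step that fans object notifications out to SQS; the queue URL, bucket layout and handler name are assumptions for the example, not details from the resume.

```python
# Hypothetical Lambda step in an S3-based ingestion pipeline: for each object
# dropped into the landing bucket, enqueue a message for downstream processing.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # placeholder


def lambda_handler(event, context):
    """Fan out one SQS message per S3 object in the triggering event."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"status": "queued", "count": len(records)}
```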
TECHNICAL SKILLS
- Apache Hadoop Confidential
- Impala
- MapReduce
- Oozie
- Sqoop
- Apache Solr
- Big Data Analytics
- Cassandra
- CDH/CDH4
- Data Analytics
- Data Cleansing
- Data Management
- ETL
- Flume
- Hadoop
- Hadoop Cluster
- Hadoop Distributed File System (HDFS)
- HBase
- Informatica
- Kafka
- Machine Learning
- Apache Spark
- Avro
- Continuous Integration/Delivery (CI/CD)
- Git
- Hive
- Jenkins
- JSON
- Pig
- Python
- NumPy
- PySpark
- R
- Real Time
- Scripting
- Shiny
- Structured Software
- Software Development
- System Development
- VBA
- XML
- ZooKeeper
- Data Analysis
- MS SQL Server
- MySQL
- OLTP
- Oracle
- PL/SQL
- PostgreSQL
- SQL
- Web Database
- Database Development
- Amazon EC2 (Elastic Compute Cloud)
- Amazon EMR (Elastic MapReduce)
- Amazon S3 (Simple Storage Service)
- Amazon Web Services (AWS)
- SSRS
- SAS
- Tableau
- T-SQL
- Apache
- Linux
- Shell Scripting
- Solaris
- Unix Shell
- Database Architecture
- JIRA
- Test Plan
- UAT
- SCRUM
- Version Control
- RDBMS
- Scala
- EMR
- Parsing
- AJAX
- Streaming
- Web Services
- CODA
- Dynamo
- QA
- Test Scripts
- Functional Testing
- Performance Testing
- Performance Tuning
- Risk Management
- Root Cause Analysis
- Test Cases
- Business Intelligence
- BI
- Marketing Analysis
- Statistical Analysis
- Statistical Modeling
- Ecosystem
- Product Support
- Marketing
- ROI
- SPSS
- Continuous Improvement
- Feasibility
- Optimization
- Algorithms
- TOPO
- Mapping
- Serial Attached Scsi
- ECO
- ECS
- Business Requirements
- Process Improvements
- Forecasting
- Integrator
- Integration
- Visualization
- Architecture
- Documentation
- Documenting
- Technical Specifications
- Writing Functional
- Data Extraction
- Scheduling
- Pipeline
- GCP
- Metrics
- Performance Analysis
- Trading
- Collection
- Gantt
- Debug
- Data Services
- Customer Service
- Database Management
- Regression Testing
- Customer Service Oriented
- Regulatory Compliance
- Decommissioning
- Cases
- Excellent communication skills
- Credit Issues
- Building Automation
- Translate
- Retail Sales
PROFESSIONAL EXPERIENCE
Confidential
Sr. Big Data Engineer
Responsibilities:
- Installing, configuring and maintaining Data Pipelines
- Transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Conduct performance analysis and optimize data processes.
- Make recommendations for continuous improvement of the data processing environment.
- Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements.
- Design and implement multiple ETL solutions with various data sources by extensive SQL Scripting, ETL tools, Python, Shell Scripting and scheduling tools.
- Data profiling and data wrangling of XML, web feeds and file handling using Python, Unix and SQL.
- Loading data from different sources into a data warehouse to perform data aggregations for business intelligence using Python.
- Designed and implemented Sqoop for the incremental job to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports (see the Sqoop sketch after this list).
- Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis.
- Authored Python (PySpark) scripts for custom Confidential's covering row/column manipulations, merges, aggregations, stacking, data labelling and all cleaning and conforming tasks (see the PySpark sketch after this list).
- Writing Pig Scripts to generate MapReduce jobs and performed ETL procedures on the data in Confidential .
- Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
- Conducted root cause analysis and resolved production problems and data issues.
- Performance tuning, code promotion and testing of application changes.
- Used Sqoop to channel data from different sources (Confidential and RDBMS).
- Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
- Used the SQL Server management tool to check the data in the database against the given requirements; validated the test data in DB2 tables on mainframes and on Teradata using SQL queries.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
- Worked with region and country AML Compliance leads to support the start-up of compliance-led projects at regional and country levels, including defining the subsequent phases: training, UAT, staffing to perform test scripts, data migration, the uplift strategy (updating customer information to bring it up to the new KYC standards) and review of customer documentation.
- Description of End-to-end development of Confidential
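The incremental DB2-to-Hive Sqoop job mentioned above could look roughly like the sketch below. It is a hedged example: the JDBC URL, table names and check column are placeholders, and the Sqoop command is wrapped in Python only so it can be scheduled with the rest of the pipeline code.

```python
# Hedged sketch of an incremental Sqoop load from DB2 into a Hive staging table.
import subprocess


def incremental_sqoop_load(last_value: str) -> None:
    """Append rows newer than last_value from DB2 into a Hive staging table."""
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:db2://db2-host:50000/SALESDB",   # placeholder DB2 URL
        "--username", "etl_user",
        "--password-file", "/user/etl/.db2.password",
        "--table", "ORDERS",                                # placeholder source table
        "--incremental", "append",
        "--check-column", "ORDER_ID",
        "--last-value", last_value,
        "--hive-import",
        "--hive-table", "staging.orders",                   # placeholder Hive table
        "--num-mappers", "4",
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    incremental_sqoop_load("1000000")
```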
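A minimal PySpark sketch of the row/column cleaning and aggregation work described above; the S3 paths, column names and aggregation logic are assumptions made for illustration rather than the project's actual rules.

```python
# Illustrative PySpark cleaning-and-conforming job with a simple aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanse_and_aggregate").getOrCreate()

# Placeholder input locations
orders = (
    spark.read.option("header", True).option("inferSchema", True)
    .csv("s3a://landing/orders/")
)
customers = spark.read.parquet("s3a://curated/customers/")

cleaned = (
    orders
    .dropDuplicates(["order_id"])                        # row-level de-duplication
    .withColumn("order_ts", F.to_timestamp("order_ts"))  # conform data types
    .na.fill({"discount": 0.0})                          # default missing values
)

daily_revenue = (
    cleaned.join(customers, "customer_id", "left")       # merge/enrich
    .groupBy(F.to_date("order_ts").alias("order_date"), "region")
    .agg(F.sum("amount").alias("revenue"),
         F.count(F.lit(1)).alias("orders"))
)

daily_revenue.write.mode("overwrite").parquet("s3a://curated/daily_revenue/")
```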
Confidential
Sr. Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (see the Airflow sketch after this list).
- Performed data extraction, transformation, loading and integration in the data warehouse, operational data stores and master data management.
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Designed and built infrastructure for the Google Cloud environment from scratch.
- Experienced in fact dimensional modeling (star schema, snowflake schema), transactional modeling and SCD (slowly changing dimensions).
- Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
- Worked on Confluence and Jira.
- Designed and implemented a configurable data delivery pipeline for scheduled updates to customer-facing data stores, built with Python.
- Strong understanding of AWS components such as EC2 and S3.
- Responsible for data services and data movement infrastructure.
- Experienced in ETL concepts, building ETL solutions and data modeling.
- Worked on the continuous integration tool Jenkins and automated jar file builds at the end of the day.
- Worked with Tableau: integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience in setting up the whole app stack, and setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (see the Databricks sketch after this list).
- Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling; compiled data from various sources to perform complex analysis for actionable results.
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into Confidential .
- Using Flume and Spool directory for loading the data from local system to Confidential .
- Installed and configured Pig and wrote Pig scripts to convert the data from text files to Avro format.
- Created Partitioned Hive tables and worked on them using Hive QL.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
- Implemented a continuous delivery pipeline.
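A minimal Apache Airflow sketch of the kind of DAG described above; the schedule, task ids and Python callables are placeholders rather than the actual pipeline definition.

```python
# Skeleton of a daily extract -> transform -> load DAG (placeholder callables).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull the day's files from the source system (placeholder)."""


def transform():
    """Run the Spark transformation layer (placeholder)."""


def load():
    """Publish conformed data to the warehouse (placeholder)."""


with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # run strictly in sequence
```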
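A hedged PySpark sketch of an ADLS-to-Databricks ingestion step like the one referenced above; the storage account, container names and partition column are illustrative assumptions, and authentication to the storage account is assumed to be configured on the cluster.

```python
# Read raw JSON events from ADLS, derive a partition column, and write curated
# Parquet back to the lake (all paths and field names are placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls_ingest").getOrCreate()

raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/events/"
curated_path = "abfss://curated@examplestorage.dfs.core.windows.net/events/"

events = (
    spark.read.json(raw_path)
    .withColumn("event_date", F.to_date("event_ts"))  # assumed timestamp field
)

events.write.mode("append").partitionBy("event_date").parquet(curated_path)
```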
Confidential, NJ
Data Engineer/ Analyst
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Confidential for business users.
- Implemented Bucketing and Partitioning using hive to assist the users with data analysis.
- Used Oozie scripts for deployment of the application and perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions and buckets in Hive (see the Hive DDL sketch after this list).
- Develop database management systems for easy access, storage, and retrieval of data.
- Perform Confidential activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analysing data using Hive QL (Queries), Pig (Data flow language), and custom MapReduce programs in Java.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
- Extended the functionality of Hive with custom Confidential's and applied various performance optimizations, such as using the distributed cache for small datasets, partitioning, bucketing in Hive and map-side joins.
- Expert in creating Hive using Java to analyse the data efficiently.
- Responsible for loading the data from the BDW Oracle database and Teradata into Confidential using Sqoop.
- Implemented AJAX, JSON, and Java script to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB (see the ingestion sketch after this list).
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analysed them by running Hive queries.
- Processed the image data through the Hadoop distributed system by using Map and Reduce then stored into Confidential .
- Created session beans and controller servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
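An illustrative sketch of the Hive partitioning and bucketing work referenced above. The statements are kept as HiveQL strings in Python so they can be templated and submitted through any Hive client; the database, table and column names are hypothetical.

```python
# Hypothetical partitioned + bucketed table and a dynamic-partition load.
HIVE_DDL = """
CREATE TABLE IF NOT EXISTS sales.transactions (
    txn_id      BIGINT,
    customer_id BIGINT,
    amount      DOUBLE
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
"""

HIVE_LOAD = """
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales.transactions PARTITION (txn_date)
SELECT txn_id, customer_id, amount, txn_date
FROM staging.transactions;
"""

if __name__ == "__main__":
    # In practice these would be submitted via beeline or a Hive connection.
    print(HIVE_DDL)
    print(HIVE_LOAD)
```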
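A hedged sketch of the RDBMS-to-MongoDB ingestion path mentioned above; the Oracle DSN, source query and Mongo collection are placeholders, and cx_Oracle/pymongo are assumed as the client libraries.

```python
# Pull rows from an Oracle table and load them into MongoDB in bounded batches.
import cx_Oracle
from pymongo import MongoClient

oracle = cx_Oracle.connect("etl_user", "secret", "oracle-host:1521/ORCLPDB")  # placeholder DSN
mongo = MongoClient("mongodb://mongo-host:27017")                             # placeholder URI
target = mongo["analytics"]["customers"]

cursor = oracle.cursor()
cursor.execute("SELECT customer_id, name, city FROM customers")               # placeholder query
columns = [col[0].lower() for col in cursor.description]

batch = []
for row in cursor:
    batch.append(dict(zip(columns, row)))
    if len(batch) == 1000:          # keep inserts bounded
        target.insert_many(batch)
        batch.clear()
if batch:
    target.insert_many(batch)

cursor.close()
oracle.close()
mongo.close()
```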
Confidential
Data Engineer
Responsibilities:
- Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
- Built the Oozie pipeline which performs several actions such as file move processes, Sqooping the data from the source Teradata or SQL system, exporting it into the Hive staging tables, performing aggregations as per business requirements and loading into the main tables.
- Ran Apache Hadoop, CDH and MapR distributions, dubbed Elastic MapReduce (EMR), on EC2.
- Performed the forking action whenever there was scope for parallel processing, to optimize data latency.
- Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
- Involved in creating UNIX shell scripts; performed defragmentation of tables, partitioning, compression and indexing for improved performance and efficiency.
- Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
- Used SQL Server Integration Services (SSIS) for extraction, transformation and loading of data into the target system from multiple sources.
- Developed and implemented an R and Shiny application which showcases machine learning for business forecasting.
- Developed predictive models using Python and R to predict customer churn and classify customers (see the churn sketch after this list).
- Partnered with infrastructure and platform teams to configure and tune tools, automate tasks and guide the evolution of the internal big data ecosystem; served as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data analytics and advanced data science techniques to identify trends, patterns and discrepancies in petabytes of data using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce and Azure Machine Learning.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Wrote a Pig script which picks the data from one Confidential path, performs aggregation and loads it into another path, which later populates another domain table.
- Converted this script into a jar and passed it as a parameter in the Oozie script.
- Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
- Built an ETL job which utilizes a Spark jar that executes the business analytical model.
- Hands-on experience with Git Bash commands: git pull to pull the code from source and develop it as per the requirements, git add to stage files, git commit after the code build, and git push to the pre-prod environment for code review; later used Screwdriver.
- Created logical data model from the conceptual model and its conversion into the physical database design using Confidential .
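A minimal, hypothetical churn-classification sketch in Python for the modeling work referenced above; the input file and feature columns are assumptions for illustration, not the actual project data.

```python
# Train a simple logistic-regression churn classifier and report its accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("customers.csv")                                  # placeholder extract
features = ["tenure_months", "monthly_spend", "support_tickets"]   # hypothetical columns

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```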
Confidential
Data & Reporting Analyst
Responsibilities:
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- As part of the data migration, wrote many SQL scripts to reconcile data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported data into Snowflake by creating staging tables to load data files of different types from Amazon S3 (see the Snowflake sketch after this list).
- Compared the data in a leaf-level process across various databases when data transformation or data loading took place.
- Analyzed and reviewed data quality when these types of loads were done, checking for any data loss or data corruption.
- Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e. name, address, SSN, phone number) in Teradata, SQL Server Management Studio and Snowflake databases for the project.
- Worked on retrieving the data from FS to S3 using Spark commands.
- Analysed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Worked with the business to identify gaps in mobile tracking and came up with solutions to close them.
- Analysed click events of the hybrid landing page, including bounce rate, conversion rate, jump-back rate and list/gallery view, and provided valuable information for landing page optimization.
- Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI.
- Suggested improvements and modified existing BI components (reports, stored procedures).
- Understood the business requirements thoroughly and came up with a test strategy based on business rules.
- Prepared a test plan to ensure the QA and development phases ran in parallel.
- Wrote and executed test cases and reviewed them with the business and development teams.
- Implemented a defect tracking process using the JIRA tool by assigning bugs to the development team.
- Automated regression testing with the Qute tool, reducing manual effort and increasing team productivity.
- Built S3 buckets and managed policies for them, using S3 and Glacier for storage and backup on AWS (see the boto3 sketch after this list).
- Created performance dashboards in Tableau, Excel and PowerPoint for the key stakeholders.
- Incorporated predictive modelling (a rule engine) to evaluate the customer/seller health score using Python scripts, performed computations and integrated the results with the Tableau viz.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs. Involved in Functional Testing, Integration testing, Regression Testing, Smoke testing and performance Testing.
- Tested Hadoop MapReduce jobs developed in Python, Pig and Hive.
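A hedged sketch of the S3-to-Snowflake staging load referenced above; the account, warehouse, stage and table names are placeholders, and the external S3 stage is assumed to already exist.

```python
# Land raw S3 files in a Snowflake staging table, then promote to the target.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER", password="secret", account="xy12345",   # placeholder credentials
    warehouse="LOAD_WH", database="EDW", schema="STAGING",
)
cur = conn.cursor()

# Copy the raw files from the pre-created external S3 stage into a staging table
cur.execute("""
    COPY INTO STAGING.PROVIDER_RAW
    FROM @EDW.STAGING.S3_LANDING_STAGE/provider/
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
""")

# Promote the staged rows into the target table
cur.execute("""
    INSERT INTO EDW.CORE.PROVIDER
    SELECT * FROM STAGING.PROVIDER_RAW
""")

cur.close()
conn.close()
```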
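A minimal boto3 sketch of the S3/Glacier storage setup mentioned above; the bucket name, prefix and 90-day transition window are illustrative assumptions, and the default us-east-1 region is assumed for bucket creation.

```python
# Create a backup bucket and age its objects into Glacier via a lifecycle rule.
import boto3

s3 = boto3.client("s3")
bucket = "example-campaign-backups"   # placeholder bucket name

s3.create_bucket(Bucket=bucket)       # assumes default us-east-1 region

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```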