Senior Big Data Engineer Resume
Fort Lauderdale, FL
SUMMARY
- 8 years of professional IT experience, including 7 years in the Big Data Hadoop ecosystem.
- Ample experience in Big Data analytics, with hands-on experience writing MapReduce jobs on the Hadoop ecosystem, including Pig and Hive.
- Hands-on experience with major Hadoop ecosystem components including Hive, Pig, Sqoop, and Flume, and knowledge of the MapReduce/HDFS framework.
- Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Extensive experience setting up single-node Hadoop clusters.
- Hands-on experience writing Pig Latin scripts, working with the Grunt shell, and scheduling jobs with Oozie.
- Experience in handling, configuring, and administering relational databases such as MySQL and NoSQL databases such as MongoDB and Cassandra.
- Strong MySQL and MongoDB administration skills on Unix, Linux, and Windows.
- Good experience writing Pig scripts and Hive queries for processing and analyzing large volumes of data.
- Good understanding of data warehouse concepts such as dimensional data models, star schema, snowflake schema, OLAP, and SCD (Slowly Changing Dimensions) Type 1 and Type 2.
- Experience analyzing data using HiveQL, Pig Latin, Flume, and custom MapReduce programs in Java.
- Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
- Experience importing and exporting data between HDFS and relational database systems/mainframes using Sqoop.
- Hands-on experience with related/complementary open-source software platforms and languages (e.g., Java, Linux, UNIX, Python).
- Hands-on experience with various AWS services such as Redshift clusters and Route 53 domain configuration.
- Experience with practical implementation of cloud-specific AWS technologies including IAM and Amazon Cloud
- Hands-on experience creating complex SQL queries and performing SQL tuning; writing PL/SQL blocks such as stored procedures, functions, cursors, indexes, triggers, and packages.
- Very good understanding of NoSQL databases such as MongoDB and HBase.
- Involved in developing solutions to analyze large data sets efficiently.
- Proficient in preparing System Design Documents and Installation Guides.
- Experience working with different file formats such as TextFile, Avro, and JSON.
- Experience in Hadoop administration activities such as installation and configuration of clusters using Cloudera.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
- Professional experience with emphasis on Azure Cloud Services like Azure Data Lake, Azure SQL Database, Azure Data Factory, Azure Storage, Azure Synapse, Azure Event Hub, Azure Logic Apps and Azure Databricks.
- Hands-on experience with Azure Data Factory, building and orchestrating pipelines that make end-to-end data integration seamless and robust.
- Experienced in building data pipelines using Azure Data Factory and Azure Databricks, and loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, controlling and granting database access.
- Experience extracting data from both relational systems and flat files. Ability to blend technical expertise with strong conceptual, business, and analytical skills to provide quality solutions, result-oriented problem solving, and leadership.
- Ability to perform at a high level, meet deadlines, and adapt to ever-changing priorities.
- Excellent work ethic; self-motivated, quick learner, and team-oriented. Continually provided value-added services to clients through thoughtful experience and excellent communication skills.
- Demonstrated expertise with ETL tools (Talend Data Integration, SQL Server Integration Services (SSIS)); developed slowly changing dimension (SCD) mappings using Type I, Type II, and Type III methods (a rough PySpark illustration of the Type 2 pattern follows this summary).
- Hands-on experience with integration processes for the Enterprise Data Warehouse (EDW) and extensive knowledge of performance tuning techniques on sources and targets using Talend mappings and jobs.
- Expertise in extracting, transforming, and loading data from Oracle, DB2, SQL Server, MS Access, Excel, flat files, and XML using Informatica and Talend.
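Illustration (not from any engagement above): the SCD Type 2 pattern referenced in the summary, sketched in PySpark rather than Talend/SSIS purely to show the mechanics of expiring changed rows and opening new versions. All table, column, and key names are hypothetical.

```python
# Minimal SCD Type 2 sketch in PySpark; dimension/source schemas are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Current dimension rows (hypothetical).
dim = spark.createDataFrame(
    [(1, "FL", "2020-01-01", None, True)],
    schema="customer_id INT, state STRING, valid_from STRING, valid_to STRING, is_current BOOLEAN",
)
# Incoming snapshot from the source system (hypothetical).
src = spark.createDataFrame(
    [(1, "PA"), (2, "NY")],
    schema="customer_id INT, state STRING",
)

today = F.date_format(F.current_date(), "yyyy-MM-dd")
current = dim.filter("is_current")

# Keys whose tracked attribute changed, plus brand-new keys.
delta = (src.join(current.select("customer_id", F.col("state").alias("prev_state")),
                  "customer_id", "left")
            .filter(F.col("prev_state").isNull() | (F.col("prev_state") != F.col("state")))
            .drop("prev_state"))

# Close out the old versions of changed keys (Type 2 expiry).
expired = (current.join(delta.select("customer_id"), "customer_id", "left_semi")
                  .withColumn("valid_to", today)
                  .withColumn("is_current", F.lit(False)))

# Open new versions for changed and brand-new keys.
opened = (delta.withColumn("valid_from", today)
               .withColumn("valid_to", F.lit(None).cast("string"))
               .withColumn("is_current", F.lit(True)))

# History rows and unchanged current rows stay as they are.
untouched = dim.join(expired.select("customer_id", "valid_from"),
                     ["customer_id", "valid_from"], "left_anti")

scd2 = untouched.unionByName(expired).unionByName(opened)
scd2.show()
```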
TECHNICAL SKILLS
Hadoop/Big Data/ETL Technologies: HDFS, MapReduce, Sqoop, Pig, Hive, HBase, Oozie, Impala, ZooKeeper, Ambari, Storm, Spark, Kafka, Hue
NoSQL Databases: HBase, Cassandra
Monitoring and Reporting: Tableau, Shell Scripts
Hadoop Distribution: Hortonworks, Cloudera
Build and Deployment Tools: Maven, GitHub, SVN, Jenkins
Programming and Scripting: SQL, Linux Shell Scripting, Python, Pig Latin, HiveQL, PySpark
Databases: Snowflake, Oracle, MySQL, MS SQL Server, Vertica, Teradata, Sybase, and DB2
Analytics Tools: Tableau, SAP BO
IDE Dev. Tools: Eclipse, IntelliJ, PyCharm, Oracle, Ant, Maven
Operating Systems: Linux, Unix, Windows 8/10, Windows Server, MacOS
Cloud: AWS, Azure, GCP
AWS Services: EC2, EMR, S3, Redshift, Lambda, Glue, Data Pipeline, Athena, AWS MWAA
Network protocols: TCP/IP, UDP, HTTP, DNS, DHCP
Project & Trouble Reporting Tools: Jira
PROFESSIONAL EXPERIENCE
Confidential, Fort Lauderdale, FL
Senior Big Data Engineer
Responsibilities:
- Involved in high-level architecture design for the migration and transformation of transactional data to HDFS.
- Created external, partitioned Hive tables and the corresponding HDFS locations to house the data.
- Created shell scripts to automate data ingestion jobs and data processing.
- Created Python wrapper scripts that act as a framework and are used by all jobs in the Transaction Data Hub.
- Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
- Created multiple databases with sharded collections, choosing shard keys based on requirements (see the pymongo sketch below).
- Managed the MongoDB environment from availability, performance, and scalability perspectives.
- Created various types of indexes on different collections to improve MongoDB query performance.
- Created documents in the MongoDB database.
- Created Oozie workflows with actions that run Hadoop MapReduce and Spark jobs and execute Python and shell scripts.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Provided 24x7 on-call support for production and development systems for both SQL and NoSQL databases.
- Working knowledge of other NoSQL databases.
- Responsible for managing more than 75 NoSQL clusters.
- Designed batch ingestion components using Sqoop scripts, and data integration and processing components using shell, Pig, and Hive scripts.
- Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
- Created PySpark scripts to transform data and write it to HDFS (see the sketch after this list).
- Used Spark SQL and DataFrames to transform data at scale.
- Streamed data from Kafka topics.
- Created CA-7 packages on the mainframe to schedule jobs.
- Implemented Kafka streaming to ingest data in near real time.
- Reviewed and merged pull requests in Bitbucket.
- Worked with Jenkins pipelines to automate the build and deployment process.
- Worked closely with the testing team to create unit test cases.
- Used Impala to create complex queries for analyzing transaction data.
- Worked in an Agile Scrum methodology.
- Worked closely with the product manager on release planning.
- Migrated services from on-premises to Azure cloud environments; collaborated with development and QA teams to maintain high-quality deployments.
- Designed client/server telemetry adopting the latest monitoring techniques.
- Worked on Continuous Integration (CI)/Continuous Delivery (CD) pipelines for Azure Cloud Services using Chef.
- Configured Azure Traffic Manager to route user traffic; drove operational efforts to migrate all legacy services to a fully virtualized infrastructure.
- Implemented HA deployment models with Azure Classic and Azure Resource Manager.
- Configured Azure Active Directory and managed users and groups.
- Created ETL/Talend jobs, both design and code, to process data into target databases.
- Created Talend jobs to load data into various Oracle tables; utilized Oracle stored procedures and wrote Java routines to capture globalMap variables and use them in the jobs.
- Created Talend jobs to copy files from one server to another using Talend FTP components.
- Involved in the development of Talend jobs and the preparation of design documents and technical specification documents.
- Processed ad-hoc business requests to load data into the production database using Talend jobs.
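Illustration only: a minimal PySpark sketch of the pattern described in this role, transforming data and writing it into an external, partitioned Hive table at an explicit HDFS location. The database, table, paths, and column names are hypothetical placeholders, not actual project assets.

```python
# Sketch: write transformed data into a partitioned, external Hive table (hypothetical names).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("txn_hub_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical database and external table pointing at an explicit HDFS location.
spark.sql("CREATE DATABASE IF NOT EXISTS txn_hub")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS txn_hub.transactions (
        txn_id STRING,
        amount DOUBLE,
        account_id STRING
    )
    PARTITIONED BY (txn_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/txn_hub/transactions'
""")

# Hypothetical raw source with an event_ts column; derive the partition column.
raw = spark.read.json("hdfs:///landing/transactions/")
curated = (raw
           .withColumn("txn_date", F.to_date("event_ts").cast("string"))
           .select(F.col("txn_id").cast("string"),
                   F.col("amount").cast("double"),
                   F.col("account_id").cast("string"),
                   "txn_date"))

# Append into the partitioned table using dynamic partitioning.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
curated.write.mode("append").insertInto("txn_hub.transactions")
```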
Environment: Lambda, MSK, Azure, KMS, Spark, NoSQL, Git, Hadoop YARN, Hive, Scala, Pig, MapReduce, Impala, SQL Server 2016, DB2, DynamoDB, CloudWatch, Spring Boot, Microservices, Tableau, Python, PySpark, SNS, NiFi, Oozie, Step Functions, MongoDB, HBase, Cassandra, ZooKeeper, UNIX
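Illustration only: a minimal pymongo sketch of the MongoDB administration work described in the role above (enabling sharding, choosing a shard key, and adding supporting indexes). The connection string, database, collection, and key names are hypothetical.

```python
# Sketch: shard a collection and add a supporting index (hypothetical cluster and names).
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos-host:27017")  # connect through a mongos router

# Enable sharding for the database and shard the collection on a hashed key.
client.admin.command("enableSharding", "txn_hub")
client.admin.command("shardCollection", "txn_hub.transactions",
                     key={"account_id": "hashed"})

# Secondary index to support common query patterns.
txns = client["txn_hub"]["transactions"]
txns.create_index([("txn_date", ASCENDING), ("account_id", ASCENDING)],
                  name="txn_date_account_idx")
```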
Confidential, Philadelphia, PA
Big Data Engineer
Responsibilities:
- Developed simple to complex MapReduce streaming jobs using Python, Hive, and Pig.
- Used various compression mechanisms to optimize MapReduce jobs and use HDFS efficiently.
- Used the ETL component Sqoop to extract data from MySQL and load it into HDFS.
- Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators (a minimal DAG sketch follows this list).
- Performed ETL on the business data and created a Spark pipeline that runs the ETL process efficiently.
- Created MapReduce jobs that perform the entire ETL process.
- Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
- Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Strong exposure to Unix scripting and good hands-on shell scripting skills.
- Wrote Python scripts to process semi-structured data in formats like JSON.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Troubleshot and identified bugs in Hadoop applications, working with the testing team to resolve them.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
- Developed Kafka producers and consumers with the Python API for writing Avro schemas.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
- Installed the Ganglia monitoring tool to generate Hadoop cluster reports (CPUs running, hosts up/down, etc.) and performed operations to maintain the cluster.
- Responsible for data analysis and cleaning using Spark SQL queries.
- Handled importing data from various sources, performed transformations using Spark, and loaded the data into Hive.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules.
- Used Scala to write the code for all Spark use cases, with extensive experience using Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.
- Explored various Spark modules, working with DataFrames, RDDs, and SparkContext.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data (see the streaming sketch below).
- Determined the viability of business problems for a Big Data solution with PySpark.
- Proactively monitored systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Monitored multiple Hadoop cluster environments using Ganglia; monitored workload, job performance, and capacity planning using MapR.
- Involved in time series data representation using HBase.
- Strong working experience with Splunk for real-time log data monitoring.
- Worked with Databricks to connect different sources and transform data for storage in the cloud platform.
- Experience in managing large-scale, geographically-distributed database systems, including relational (Oracle, SQL server) and NoSQL (MongoDB, Cassandra) systems.
- Experienced in building extensible data integration and data acquisition solutions to meet the requirement of the business.
- Experienced in building an optimized data integration platform that performs efficiently as data volumes grow.
- Enabled journaling across all MongoDB instances for automatic data recovery after unexpected shutdowns.
- Used MongoDB tools such as MongoDB Compass, MongoDB Atlas, Ops Manager, and Cloud Manager.
- Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication and schema design.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team.
- Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Worked with the BI (Tableau) team on dataset requirements, with good working experience in data visualization.
- Developed complex Talend ETL jobs to migrate data from flat files to the database.
- Implemented custom error handling in Talend jobs and worked on different logging methods.
- Followed the organization-defined naming conventions for flat file structures, Talend jobs, and the daily batches that execute the Talend jobs.
- Wrote complex SQL queries to ingest data from various sources and integrated them with Talend.
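Illustration only: a minimal Airflow DAG sketch of the GCP ETL pattern described in this role, wiring a Dataproc PySpark step and a BigQuery load through the gcloud/bq command-line utilities via BashOperator. The project, cluster, bucket, and dataset names are hypothetical.

```python
# Sketch: daily Dataproc transform followed by a BigQuery load (hypothetical names).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="gcp_daily_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # newer Airflow versions use `schedule=` instead
    catchup=False,
) as dag:

    # Run a PySpark transformation on an existing Dataproc cluster.
    transform = BashOperator(
        task_id="dataproc_transform",
        bash_command=(
            "gcloud dataproc jobs submit pyspark "
            "gs://my-etl-bucket/jobs/transform.py "
            "--cluster=etl-cluster --region=us-east1"
        ),
    )

    # Load the transformed files from GCS into a BigQuery table.
    load = BashOperator(
        task_id="bq_load",
        bash_command=(
            "bq load --source_format=PARQUET "
            "analytics.daily_transactions "
            "gs://my-etl-bucket/curated/dt={{ ds }}/*.parquet"
        ),
    )

    transform >> load
```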
Environment: Hadoop (HDFS, MapReduce), Scala, Spark, Spark SQL, Impala, NoSQL, Hive, MongoDB, HBase, Oozie, Hue, Sqoop, Flume, Oracle, AWS services, MySQL, SQL Server, Python.
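Illustration only: a rough Spark Structured Streaming sketch of the Kafka-to-HDFS/Hive pipeline described in the role above. Broker addresses, topic, schema, and paths are hypothetical, and the job assumes the Spark Kafka connector package is available on the classpath.

```python
# Sketch: stream JSON events from Kafka and land them as partitioned Parquet on HDFS.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_to_hdfs_sketch").getOrCreate()

# Hypothetical event schema.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "transactions")
          .option("startingOffsets", "latest")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("dt", F.to_date("event_ts")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streaming/transactions")
         .option("checkpointLocation", "hdfs:///checkpoints/transactions")
         .partitionBy("dt")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```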
Confidential, Fort Washington, PA
Big Data Engineer
Responsibilities:
- Worked with application and data science teams to support development of custom data solutions; supported database design, development, implementation, information storage and retrieval, and data flow and analysis activities.
- Translated requirements and data into a usable database schema by creating or recreating ad hoc queries, scripts, and macros; updated existing queries and created new ones to manipulate data into a master file. Supported development of databases, database parser software, database loading software, and database structures that fit into the overall architecture of the system under development.
- Built data pipelines to ingest daily batch loads from application teams' data sources using HiveQL and Python.
- Improved the performance of existing ETL processes and Hive queries through parameter optimization.
- Migrated data from HDFS to GCS buckets using GCP DataProc Clusters.
- Transformed existing HiveQL to Spark SQL and implemented performance tuning using historical data.
- Worked with business users to resolve error records and duplicate records across tables.
- Created data quality scripts using SQL and Hive to validate successful data loads and overall data quality (a minimal sketch follows this list).
- Automated ETL jobs with both event-based and schedule-based triggers using the Automic tool.
- Developed YAML-based ETL jobs that ensure fault-tolerant data ingestion, integration, and transformation using Sqoop, Hive, Spark, and shell scripting.
- Troubleshot data pipeline failures and slowness to ensure SLA adherence.
- Created scripts to monitor data in tables and send notifications to the team when tables contain stale data.
- Extensively worked with partitions, dynamic partitioning, and bucketing of tables in Hive; designed both managed and external tables.
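Illustration only: a minimal Spark SQL data-quality check of the kind described in this role, validating row counts, null rates, and duplicate keys after a load. The table, columns, and thresholds are hypothetical.

```python
# Sketch: post-load data-quality checks against a Hive table (hypothetical names).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dq_checks_sketch")
         .enableHiveSupport()
         .getOrCreate())

TABLE = "staging.daily_orders"   # hypothetical Hive table
LOAD_DATE = "2023-01-01"         # normally passed in by the scheduler

df = spark.table(TABLE).filter(F.col("load_date") == LOAD_DATE)

checks = df.agg(
    F.count(F.lit(1)).alias("row_count"),
    F.sum(F.col("order_id").isNull().cast("int")).alias("null_order_ids"),
    F.countDistinct("order_id").alias("distinct_order_ids"),
).first()

errors = []
if checks["row_count"] == 0:
    errors.append(f"No rows loaded for {LOAD_DATE}")
if checks["null_order_ids"] > 0:
    errors.append(f"{checks['null_order_ids']} rows with NULL order_id")
if checks["distinct_order_ids"] != checks["row_count"]:
    errors.append("Duplicate order_id values detected")

if errors:
    # In the real job this would trigger the team notification mentioned above.
    raise RuntimeError("Data quality checks failed: " + "; ".join(errors))
print("Data quality checks passed")
```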
Environment: Spark, Kafka, MapReduce, Python, Hadoop, Hive, Pig, PySpark, Spark SQL, Azure SQL DW, Databricks, Azure Synapse, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Oracle 12c, Cassandra, Git, ZooKeeper, Oozie.
Confidential
Data Engineer
Responsibilities:
- Analyzed large datasets to determine the optimal way to aggregate and report on them.
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS using Scala and in NoSQL databases such as HBase and Cassandra.
- Used tools such as RoboMongo, MongoVUE, and NoSQL Manager to migrate data between databases without data loss.
- Used Kafka for live data streaming and performed analytics on it; worked on Sqoop to transfer data between relational databases and Hadoop.
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
- Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (a minimal handler sketch follows this list).
- Created AWS data pipelines using various AWS resources, including API Gateway to receive responses from AWS Lambda, a Lambda function to retrieve data from Snowflake and convert the response into JSON format, and Snowflake, DynamoDB, and AWS S3 as the underlying stores.
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing machine learning workflows, applying algorithms such as linear regression, multivariate regression, naive Bayes, random forests, k-means, and KNN for data analysis (a toy scikit-learn sketch follows this section).
- Developed Python code for tasks, dependencies, and time sensors for each job, for workflow management and automation using Airflow.
- Worked on cloud deployments using Maven, Docker and Jenkins.
- Created Glue jobs to process data from the S3 staging area into the S3 persistence area.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Proficient in using data for interactive Power BI dashboards and reporting based on business requirements.
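Illustration only: a minimal Python Lambda handler sketch for the nested-JSON processing described in this role (flatten, sort, and write back to S3). The event shape, bucket layout, and field names are hypothetical.

```python
# Sketch: Lambda handler that flattens and sorts nested JSON from S3 (hypothetical names).
import json

import boto3

s3 = boto3.client("s3")

def flatten(record, parent_key="", out=None):
    """Flatten nested dicts into dot-separated keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    if out is None:
        out = {}
    for key, value in record.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flatten(value, new_key, out)
        else:
            out[new_key] = value
    return out

def lambda_handler(event, context):
    # Hypothetical event shape: {"bucket": "...", "key": "..."}.
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    payload = json.loads(obj["Body"].read())

    # Flatten each nested record and sort by a hypothetical timestamp field.
    records = [flatten(r) for r in payload.get("records", [])]
    records.sort(key=lambda r: r.get("event.timestamp", ""))

    s3.put_object(
        Bucket=event["bucket"],
        Key=event["key"].replace("raw/", "processed/"),
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"statusCode": 200, "recordCount": len(records)}
```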
Environment: AWS EMR, S3, EC2, Lambda, MapR, Apache Spark, Spark Streaming, Spark SQL, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Python, Scala, Shell scripting, Linux, MySQL, HBase, NoSQL, DynamoDB, Cassandra, Machine Learning, Snowflake, Maven, Docker, AWS Glue, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI.
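Illustration only: a toy scikit-learn random forest sketch of the model-building pattern described in this role; the dataset and features are synthetic and not drawn from any project listed above.

```python
# Sketch: train and evaluate a random forest on a synthetic pandas DataFrame.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and binary label (hypothetical fields).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "amount": rng.normal(100, 25, 1000),
    "num_items": rng.integers(1, 10, 1000),
    "account_age_days": rng.integers(30, 2000, 1000),
})
label = (data["amount"] > 110).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    data, label, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Holdout accuracy: {accuracy_score(y_test, preds):.3f}")
```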