
Big Data Engineer/Data Modeler Resume

Hartford, CT

SUMMARY

  • Over 8 years of professional IT experience, including over 6 years of Big Data ecosystem experience in the ingestion, storage, querying, processing, and analysis of big data.
  • Technical expertise in all phases of the SDLC (Software Development Life Cycle), with a major concentration on Big Data technologies in highly scalable end-to-end Hadoop infrastructure and analytics frameworks, various relational databases, NoSQL databases, Spark, and Java/J2EE technologies, following leading software practices.
  • In-depth understanding of Hadoop architecture and its various components, such as HDFS, YARN, Name Node, Data Node, Resource Manager, and Node Manager.
  • Experience in building and maintaining multiple Hadoop clusters of different sizes and configurations.
  • In-depth knowledge of Hadoop and Spark architecture with data mining and stream processing technologies, including Spark Core, Spark SQL, DataFrames, and Spark Streaming, for developing Spark programs for batch and real-time processing.
  • Experience in importing and exporting data between HDFS and Relational Database Management systems using Sqoop. Good knowledge in using job scheduling and monitoring tools like Oozie and Zookeeper.
  • Experienced in using Docker to automate the deployment of applications inside software containers, working with Docker images, Docker Hub, and Docker registries for virtualized deployments.
  • Experience in working with Cloudera (CDH 4 & 5), Hortonworks, and AWS (Amazon EMR, Lambda, Kinesis Data Streams) to fully leverage and implement new Hadoop features.
  • Extensive experience in developing Pig Latin scripts and using Hive Query Language for data analytics, working with Cloudera, Hortonworks, and Microsoft Azure HDInsight distributions, and optimizing Hive queries by tuning configuration parameters.
  • Experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
  • Worked on AWS SaaS, PaaS, and hybrid cloud deployments as well as Google Cloud Platform (GCP).
  • Exposure to and development experience with microservices architecture best practices, the Java Spring Boot framework, Docker, Kubernetes, Jenkins, and Python.
  • Experienced configuring Amazon EC2 instances that launch behind a load balancer, monitor the health of Amazon EC2 instances, deploy Amazon EC2 instances using command line calls and troubleshoot the most common problems with instances.
  • Implemented an AWS Data Lake leveraging S3, Terraform, Vagrant/Vault, EC2, Lambda, VPC, and IAM for data processing and storage, while writing complex SQL queries and analytical and aggregate functions on views in the Snowflake data warehouse to develop near real-time visualizations using Tableau Desktop/Server 10.4 and Alteryx.
  • Experience with Python and SQL on the AWS cloud platform, with a good understanding of data warehouse platforms such as Snowflake and Databricks.
  • Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse, Azure Data Factory, ADLS, Databricks, SQL DW.
  • Excellent understanding and knowledge of NoSQL databases like HBase and Cassandra.
  • Hands-on experience with Sequence files, RC files, Combiners, Counters, dynamic partitions, and bucketing for best practices and performance improvement (a partitioning and bucketing sketch in PySpark follows this summary).
  • Good understanding of the Oracle data dictionary and normalization techniques. Experienced in Amazon Redshift database usage, Oracle 12c/11g/10g/9i/8i systems, and MySQL, and in writing stored procedures, functions, joins, and triggers for different data models.
  • Experience in managing Hadoop clusters and services using Cloudera Manager. Proficient in using Cloudera Manager, an end-to-end tool to manage Hadoop operations.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
  • Experienced in writing MapReduce programs and UDFs for both Pig and Hive in Java.
  • Experience in collecting log files and error messages across clusters to extract data and to copy into HDFS using Apache Flume.
  • Experience in database design using PL/SQL to write stored procedures, functions, triggers, and queries for Oracle 10g, using data modelling techniques to find results based on SQL and PL/SQL queries.
  • Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
  • Experience with visualization tools such as Power BI, Tableau, Jupyter Notebook, TIBCO Spotfire, QlikView, MicroStrategy, Information Builders, and other reporting and analytical tools.
  • Hands on experience with Shell Scripting and UNIX.
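
The bullet on dynamic partitions and bucketing above can be illustrated with a minimal PySpark sketch; the table, column, and format names below are illustrative assumptions, and Spark's native bucketing is used rather than any project-specific DDL.

    # Minimal sketch: write a dynamically partitioned, bucketed table with PySpark.
    # "staging_sales", "order_date", and "order_id" are hypothetical names.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition-bucket-sketch")
             .enableHiveSupport()
             .getOrCreate())

    staging = spark.table("staging_sales")   # assumed staging table in the metastore

    # Partitions are derived dynamically from order_date values in the data;
    # rows are hash-bucketed on order_id to speed up joins and sampling.
    (staging.write
            .partitionBy("order_date")
            .bucketBy(8, "order_id")
            .sortBy("order_id")
            .format("parquet")
            .mode("overwrite")
            .saveAsTable("sales_partitioned"))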

TECHNICAL SKILLS

Hadoop Distributions: Apache, Cloudera CDH 4 and CDH 5, Hortonworks

Big Data Ecosystem: Apache Hadoop (HDFS/MapReduce), Hive, Pig, Sqoop, Zookeeper, Oozie, Hue, Spark, Spark SQL, PySpark, Apache Kafka

NoSQL Databases: HBase, Cassandra, MongoDB

Languages: SQL, Python, Scala, Core Java, PL/SQL, Azure PowerShell

Java / J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB

Cloud: AWS (SaaS, PaaS, IaaS), Lambda, Hybrid, Kinesis, S3, Azure, Google Cloud Platform (GCP) BigQuery, Databricks, Azure Data Factory

Visualization Tools: Power BI, Tableau Desktop and Server, TIBCO Spotfire, QlikView, MicroStrategy, Jupyter Notebook, Information Builders

Databases: Oracle 10g/9i/8i, DB2, MySQL, MS-SQL Server

Application Servers: WebLogic, WebSphere, JBoss, Tomcat

Development Tools: Eclipse, Rational Rose

Build Tools: Jenkins, Maven, ANT

Software Engineering: Agile/Scrum & Waterfall Methodology

PROFESSIONAL EXPERIENCE

Confidential, Hartford, CT

Big Data Engineer/Data Modeler

Responsibilities:

  • Participate in the design and development of Big Data analytical applications.
  • Design, support and continuously enhance the project code base, continuous integration pipeline, etc.
  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs on Google Cloud/BigQuery, Hadoop, and Big Data platforms.
  • Collaborated with Data Warehouse implementation teams, BI Administrators, Developers and Analysts for successful development of BI reporting and analytic solutions.
  • Designed, developed, implemented, and supported reporting and analytics applications leveraging tools such as Cognos, Tableau, SPSS, R, Python, and NodeJS to meet client objectives and requirements.
  • Worked with cutting-edge Google Cloud (GCP) BigQuery technologies to deliver next-gen cloud solutions.
  • Gathered and translated client requirements for building insightful, compelling reports and dashboards in Tableau that help end users with processes and procedures for cross-application and cross-function control reporting using the Tableau BI tool.
  • Worked on Jupyter Notebook for data cleaning, building visualizations, creating machine learning models, and other data manipulation tasks.
  • Using Jupyter, wrote Python or R code (depending on the kernel), saved the results of code execution in the cells, and shared them with other teammates.
  • Involved in defining BI standards, guidelines, and best practices related to the Google Cloud (GCP) BigQuery platform for clients, client services, and technical teams.
  • Maintained accurate and complete technical documents on Confluence and Jira for the data migration from the legacy system to the Google Cloud BigQuery platform (GCP).
  • Developed Spark and PySpark programs using Scala APIs and Python on Azure HDInsight for data aggregation and validation to compare the performance of Spark with Hive and SQL, and implemented Spark using Scala and Spark SQL for faster testing and processing of data. Also developed an analytical component using Scala, PySpark, and Spark Streaming.
  • Worked on an AWS Lambda function to process records in an Amazon Kinesis data stream (see the Lambda/Kinesis sketch after this list) and developed hybrid cloud solutions to enhance, harden, and support the client's service delivery processes.
  • Worked on Amazon Redshift to manage large storage needs, using high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage.
  • Developed various scripts to create AWS S3 buckets and load CSV files into them (a boto3-style loading sketch follows this list).
  • Built the data pipeline using Azure services such as Data Factory to load data from a legacy SQL Server to Azure databases using Data Factory, API Gateway services, SSIS packages, Talend jobs, and custom .NET and Python code, and to move data from on-premises servers to Azure Data Lake Storage Gen2 and Databricks.
  • Worked on AWS container systems like Docker and container orchestration like EC2 Container Service, providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux using Kubernetes and Terraform.
  • Worked on NoSQL/SQL, microservice, and RESTful API development for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Exported the analyzed data from HDFS to relational databases (MySQL, Oracle) using Sqoop, accessed the Kafka cluster to consume data into Hadoop, and analyzed the data by performing Hive queries and running Pig scripts.
  • Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets, and converted Hive/SQL queries into Spark transformations using Spark RDDs and Python (see the Hive-to-Spark sketch after this list).
  • Involved in analyzing business, system and data mapping requirements and developing ETL data pipelines for real-time streaming of data using Kafka and Spark.
  • Built pipelines to move hashed and un-hashed data from Azure Blob to Data Lake and wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
  • Involved in the Sqoop implementation, which helps in loading data from various RDBMS sources into Hadoop systems.
  • Responsible for high performance of data architecture and design including Star Schemas, Snowflake Schemas, and Dimensional Modeling.
  • Involved in developing the Confidential Data Lake and building the Confidential Data Cube on a Microsoft Azure HDInsight cluster.
  • Analyzed system failures, identified root causes, and recommended courses of action; documented system processes and procedures for future reference.
  • Involved in developing JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
  • Involved in the data migration process using Azure, integrating with the GitHub repository and Jenkins, and worked on Angular and React for SPA development.
  • Involved in configuring the Hadoop cluster and Hadoop installation, commissioning, decommissioning, load balancing, troubleshooting, monitoring, and debugging the configuration of multiple nodes using the Hortonworks platform.
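
The Lambda-over-Kinesis bullet above can be sketched with a minimal, hypothetical Python handler; the payload shape and downstream handling are assumptions, not details of the actual function.

    # Hypothetical Lambda handler for records arriving from a Kinesis data stream.
    import base64
    import json

    def lambda_handler(event, context):
        processed = 0
        for record in event.get("Records", []):
            # Kinesis delivers each payload base64-encoded.
            payload = base64.b64decode(record["kinesis"]["data"])
            message = json.loads(payload)          # assumed JSON payload
            # Placeholder for downstream processing (validation, enrichment, etc.).
            print(f"partitionKey={record['kinesis']['partitionKey']} body={message}")
            processed += 1
        return {"recordsProcessed": processed}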
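
The S3 loading scripts mentioned above are along the lines of this hedged boto3 sketch; the bucket name, region, and local paths are illustrative assumptions.

    # Create a bucket (if missing) and upload local CSV files into it.
    import glob
    import os
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    bucket = "example-landing-zone"               # hypothetical bucket name

    # us-east-1 needs no CreateBucketConfiguration.
    existing = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    if bucket not in existing:
        s3.create_bucket(Bucket=bucket)

    # Upload every CSV in a local staging folder, keyed by file name.
    for path in glob.glob("/data/staging/*.csv"):
        key = f"raw/{os.path.basename(path)}"
        s3.upload_file(path, bucket, key)
        print(f"uploaded {path} -> s3://{bucket}/{key}")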
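
For the Hive-to-Spark conversion bullet, the sketch below re-expresses a HiveQL aggregation as PySpark DataFrame transformations; the table and column names are assumptions.

    # Hive query being converted (for reference):
    #   SELECT customer_id, SUM(amount) AS total
    #   FROM transactions WHERE status = 'POSTED'
    #   GROUP BY customer_id
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark")
             .enableHiveSupport()
             .getOrCreate())

    totals = (spark.table("transactions")
              .filter(F.col("status") == "POSTED")
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total")))

    totals.write.mode("overwrite").saveAsTable("customer_totals")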

Confidential, IL

Cloud Data Engineer/Data Modeler

Responsibilities:

  • Involved in the integration of the Hadoop cluster with the Spark engine to perform batch operations; also used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Migrated the data coming from different sources into Spark Cluster through Spark-Submit job.
  • Developed a Python script to load CSV files into S3 buckets, created the AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Developed pipelines in Azure Data Factory and built multiple Data Lakes. Built how the data will be received, validated, transformed and then published.
  • Involved in building proofs of concept using modern Big Data technologies and converting them into production-grade implementations.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Designed database systems and developed tools for real-time and offline analytic processing and for a better understanding of data warehouses like Snowflake.
  • Accessed the Kafka cluster to consume data into Hadoop and analyzed the data by performing Hive queries.
  • Involved in running the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
  • Involved in the architecture and implementation of medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs. Used Redshift for multiple storage types.
  • Developed programs, scripts, and data models to extract and store required data from source systems to produce web-based control reporting using TIBCO.
  • Deployed Kubernetes on both AWS (using Kinesis) and Google Cloud (GCP); set up clusters and replication, and deployed multiple containers in a pod.
  • Involved in migrating the team from the current Linux environment to an AWS/RHEL Linux environment and used the auto-scaling feature.
  • Involved in configuring and monitoring Amazon Web Services resources and deploying the content cloud platform to Amazon Web Services Lambda cloud services using EC2, S3 and EBS for auto scaling and VPC to build secure, highly scalable and flexible systems that handled load on the servers.
  • Developed multiple POCs using PySpark and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL.
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java.
  • Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response, converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
  • Developed a Python script to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark (see the JDBC import sketch after this list).
  • Involved in implementing various AWS environments for provisioning Linux servers and services implemented by the providers.
  • Experience in AWS (Lambda, Kinesis, S3) environment including AWS Storage and content Delivery, Databases, Networking, Management Tools, Security & Identity etc.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Used Spark SQL to load JSON data and create schema RDDs, loaded them into Hive tables, and handled structured data using Spark SQL (see the JSON-to-Hive sketch after this list).
  • Involved in giving architecture guidance for selected Business Units, assessing migration feasibility and deployment strategy based on AWS Architecture best practices.
  • Used Hive to analyze the partitioned data and compute various metrics for reporting.
  • Reduced the latency of Spark jobs by tweaking the Spark configurations and following other performance and optimization techniques.
  • Involved in using Apache Splunk add-ons to enhance data ingestion and analysis of log data.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, and Pig.
  • Worked on various production issues during month-end support and provided resolutions without missing any SLA.
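
The SQL Server import bullet above can be sketched as follows with Spark's JDBC reader; the connection details, paths, and table names are assumptions, and the Microsoft SQL Server JDBC driver jar is assumed to be on the Spark classpath.

    # Hypothetical SQL Server -> HDFS import with a Hive view over the landed data.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sqlserver-to-hdfs")
             .enableHiveSupport()
             .getOrCreate())

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
              .option("dbtable", "dbo.orders")
              .option("user", "etl_user")
              .option("password", "********")
              .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
              .load())

    # Land the extract on HDFS as Parquet.
    orders.write.mode("overwrite").parquet("hdfs:///data/raw/orders")

    # Register the landed files in the metastore and expose a view for consumers.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders_raw
        USING PARQUET
        LOCATION 'hdfs:///data/raw/orders'
    """)
    spark.sql("CREATE OR REPLACE VIEW orders_v AS SELECT * FROM orders_raw")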
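
For the JSON handling bullet, a minimal sketch: Spark SQL infers the schema from the JSON documents and the result is persisted as a Hive table; paths, columns, and table names are illustrative assumptions.

    # Read JSON with an inferred schema, clean it with Spark SQL, persist to Hive.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.json("hdfs:///data/raw/events/*.json")
    events.createOrReplaceTempView("events_stg")

    cleaned = spark.sql("""
        SELECT event_id, event_type, CAST(event_ts AS TIMESTAMP) AS event_ts
        FROM events_stg
        WHERE event_id IS NOT NULL
    """)

    cleaned.write.mode("append").saveAsTable("events")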

Confidential, Charlotte, NC

Cloud Data Engineer/Data Modeler

Responsibilities:

  • Worked with the source team to understand the format and delimiters of the data file.
  • Ran periodic MapReduce jobs to load data from Cassandra into Hadoop.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
  • In-depth understanding and knowledge of Hadoop architecture and its various components, such as HDFS, Application Master, Node Manager, Resource Manager, Name Node, Data Node, and MapReduce concepts.
  • Involved in developing a Map Reduce framework that filters bad and unnecessary records.
  • Involved heavily in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, GitHub, and AWS.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Used Spark SQL to load JSON data and create schema RDDs, loaded them into Hive tables, and handled structured data using Spark SQL.
  • Created HBase tables to store variable-format data coming from different legacy systems.
  • Used Hive to perform transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
  • The Hive tables created as per requirement were internal or external tables defined with appropriate static and dynamic partitions, intended for efficiency.
  • Experience in analyzing Cassandra database and comparing it with other open-source NoSQL databases to find which one of them better suits the current requirements.
  • Loaded the output data into Cassandra using bulk load.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Wrote MapReduce code that takes log files as input, parses them, and structures them in tabular format to facilitate effective querying of the log data (a parsing sketch follows this list).
  • Transformed the data using Hive, Pig for BI team to perform visual analytics according to the client requirement.
  • Developed scripts and automated data management end to end, keeping all the clusters in sync.
  • Implemented the Fair Scheduler on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
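
The log-parsing MapReduce bullet above can be illustrated with a hedged Python mapper in the Hadoop Streaming style (the original job was Java MapReduce); the Apache-style access-log layout is an assumption.

    # Streaming-style mapper: parse access-log lines into tab-separated columns
    # so the output can be queried as a table by Hive or Pig.
    import sys

    def parse_line(line):
        # e.g. 10.0.0.1 - - [12/Mar/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512
        parts = line.split()
        if len(parts) < 9:
            return None
        ip = parts[0]
        timestamp = parts[3].lstrip("[")
        method = parts[5].strip('"')
        path = parts[6]
        status = parts[8]
        return ip, timestamp, method, path, status

    for line in sys.stdin:
        parsed = parse_line(line.strip())
        if parsed:
            print("\t".join(parsed))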

Confidential

Data Engineer/Data Modeler

Responsibilities:

  • Explored and used Hadoop ecosystem features and architectures.
  • Worked with the business team to gather requirements and new support features.
  • Configured Sqoop and developed scripts to extract data from MySQL into HDFS (a wrapper-script sketch follows this list).
  • Wrote programs using scripting languages like Pig to manipulate data.
  • Involved in creating Hive tables, loading structured data, and writing Hive queries that run internally as MapReduce jobs.
  • Monitored the running MapReduce programs on the cluster.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Reviewed the HDFS usage and system design for future scalability and fault-tolerance.
  • Prepared Shell scripts to get the required info from the logs.
  • Responsible for managing data coming from different sources.
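
The MySQL-to-HDFS extraction scripts mentioned above are in the spirit of this hypothetical Python wrapper around a standard Sqoop import; the connection details, password file, and paths are assumptions.

    # Wrap a Sqoop import of a MySQL table into an HDFS target directory.
    import subprocess

    def sqoop_import(table, target_dir):
        cmd = [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost:3306/retail",   # hypothetical source
            "--username", "etl_user",
            "--password-file", "/user/etl/.mysql.password",   # password kept on HDFS
            "--table", table,
            "--target-dir", target_dir,
            "--num-mappers", "4",
            "--fields-terminated-by", "\t",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        sqoop_import("customers", "/data/raw/customers")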

Confidential

ETL Developer/Data Modeler

Responsibilities:

  • Design, develop, and test processes for extracting data from legacy systems or production databases.
  • Participate in performance, integration, and system testing.
  • Created Sample Data Sets using Informatica Test Data Management.
  • Worked on Informatica Data Quality (IDQ) for address standardization, address profiling, merging, and parsing, using components such as RBA, Token Parser, and Character Parser.
  • Counseled team members on the evaluation of data using the Informatica Data Quality (IDQ) toolkit.
  • Applied the data analysis, data cleansing, data matching, exception handling, and reporting and monitoring capabilities of IDQ.
  • Extensive experience working on Address Doctor, matching, de-duping, and standardizing.
  • Provided in-depth knowledge of data governance and data management.
  • In-depth knowledge of MDM.
  • Performed hands-on development on the data quality tools IDQ and Data Archive using ILM and TDM.
  • Created and reused existing UNIX shell scripts to schedule and automate the Informatica mappings and batch jobs, and FTPed files from different UNIX boxes (a Python FTP sketch follows this list).
  • Used match/merge utilities.
  • Worked extensively with complex mappings using Expression, Aggregator, Filter, Lookup, Joiner, and Update Strategy transformations.
  • Developed procedures to populate data warehouse relational databases, transform and cleanse data, and load it into data marts.
  • Coordinated testing: wrote and executed test cases, procedures, and scripts, and created test scenarios.
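
The file-transfer step noted above can be sketched as a Python analogue of the UNIX FTP scripts (the original automation was shell-based); the host, credentials, and paths are illustrative assumptions.

    # Pull a file from a remote UNIX box over FTP into a local staging area.
    from ftplib import FTP

    def fetch_file(host, user, password, remote_path, local_path):
        with FTP(host) as ftp:
            ftp.login(user=user, passwd=password)
            with open(local_path, "wb") as fh:
                ftp.retrbinary(f"RETR {remote_path}", fh.write)

    if __name__ == "__main__":
        fetch_file("unixbox01", "etl_user", "********",
                   "/outbound/customers_20200131.dat",
                   "/staging/customers_20200131.dat")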
