
Sr. Data Engineer Resume


Pittsburgh, PA

SUMMARY

  • Data Engineer with around 7 years of experience in interpreting and analyzing complex datasets and expertise in providing business insights
  • Experience in the Agile Software Development Lifecycle (SDLC) - requirements gathering, analysis, design, development, maintenance, build, code management and testing of enterprise data warehouse applications
  • Created the AWS VPC network for the installed instances and configured the Security Groups and Elastic IPs accordingly
  • Experienced in working with the Amazon EMR framework for processing data on EMR and EC2 instances
  • Well versed in the Hadoop framework and in analysis, design, development, documentation, deployment and integration using SQL and Big Data technologies
  • Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, and Kafka
  • Experience with data warehousing and data mining, using one or more NoSQL Databases like HBase, Cassandra, and MongoDB
  • Experience in using Sqoop to ingest data from RDBMS to HDFS
  • Experience in cluster coordination using ZooKeeper; worked on file formats like Text, ORC, Avro and Parquet and compression techniques like Snappy, Gzip and Zlib
  • Experienced in using various Python libraries like NumPy, SciPy, python-twitter, Pandas, scikit-learn
  • Worked on visualization tools like Power BI, Tableau for report creation and further analysis
  • Experienced with the Spark processing framework, including Spark SQL, as well as data warehousing and ETL processes
  • Developed an end-to-end ETL pipeline using Spark SQL and Scala on the Spark engine; imported data from AWS S3 into Spark RDDs and performed transformations and actions on them (see the sketch after this list)
  • Experience with Spark Streaming and writing Spark jobs
  • Experience developing high throughput Streaming applications from Kafka queues and writing enriched data back to outbound Kafka queues
  • Experience in moving data using Sqoop from HDFS to Relational Database Systems (RDBMS) - Oracle, DB2 and SQL Server - and from RDBMS to HDFS
  • Good understanding of AWS S3, EC2, Kinesis and DynamoDB
  • Used Jupyter Notebooks for data pre-processing and building machine learning algorithms on datasets
  • Good knowledge of building Machine Learning solutions to various business problems using Python
  • Experienced in real-time analytics with Spark RDD, Data Frames and Streaming API
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data
  • Good understanding of JIRA and experience maintaining JIRA dashboards
  • Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization
  • Provided production support and involved with root cause analysis, bug fixing and promptly updating the business users on day-to-day production issues.
  • Developed DAGs and automated the process for the data science teams
  • Developed ad-hoc queries for moving data from HDFS to Hive and analyzing the data using HiveQL
  • Integrated Slack notifications with Jenkins deployments to notify the required users about the deployments
  • Involved in daily SCRUM meetings to discuss the development/progress of Sprints and was active in making SCRUM meetings more productive.
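
A minimal Scala sketch of the S3-to-Spark ETL flow referenced above; the bucket name, path and field delimiter are illustrative assumptions rather than actual project values.

    // Minimal sketch of the S3-to-Spark ETL flow; bucket, path and delimiter are assumed placeholders
    import org.apache.spark.sql.SparkSession

    object S3EtlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3-etl-sketch")
          .getOrCreate()

        // Import raw records from AWS S3 into a Spark RDD
        val raw = spark.sparkContext.textFile("s3a://example-bucket/raw/events/")

        // Transformations: drop empty lines and split each record into fields
        val parsed = raw
          .filter(_.nonEmpty)
          .map(_.split(",", -1))

        // Action: materialize a count to trigger execution of the pipeline
        println(s"parsed records: ${parsed.count()}")

        spark.stop()
      }
    }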

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

HADOOP Distribution: Cloudera, Hortonworks, AWS

Languages: Java, Shell scripting, Pig Latin, Scala, Python, R, C, C++, HiveQL

Web Technologies: HTML, CSS, JavaScript, XML, JSP, Restful, SOAP

Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS

Machine Learning: Linear regression, Logistic Regression, Random forest, k-NN

Build Automation tools: SBT, Ant, Maven

Version Control: GIT

IDE & Build Tools, Design: Eclipse, Visual Studio, NetBeans, Rational Application Developer, JUnit

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, Cassandra, MongoDB), Teradata

BI Tools: Power BI, Tableau

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Pittsburgh, PA

Responsibilities:

  • Developed highly efficient Spark batch and streaming applications running on AWS, utilizing Spark APIs such as Datasets, case classes, lambda functions and RDD transformations, adhering to industry standards and best practices for development.
  • Migrated long running Hadoop applications from legacy clusters to Spark applications running on Amazon EMR.
  • Used Spark SQL to load Parquet data, created Datasets defined by case classes, and handled structured data with Spark SQL before storing it into Hive tables for downstream consumption (see the sketch at the end of this list).
  • Wrote ETL scripts to move data between HDFS and S3 and created Hive external tables on top of this data to be utilized in Big Data applications.
  • Created scripts to sync data between local Postgres databases and those on the AWS cloud.
  • Implemented a POC to migrate Hadoop Java applications to Spark on Scala.
  • Developed Scala scripts on Spark to perform operations such as data inspection, cleaning and loading, and to transform large sets of JSON data to Parquet format.
  • Prepared Linux shell scripts to configure, deploy and manage Oozie workflows of Big Data applications.
  • Worked on Spark streaming using Amazon Kinesis for real time data processing.
  • Created, configured, managed and destroyed EMR transient non-prod clusters as well as long running Prod cluster on AWS.
  • Worked on triggering and scheduling ETL jobs using AWS Glue and automated Glue runs with CloudWatch Events.
  • Involved in developing Hive DDL templates which were hooked into Oozie workflows to create, alter and drop tables.
  • Created Hive snapshot tables and Hive Avro tables from data partitions stored on S3 and HDFS.
  • Involved in creating frameworks which utilized a large number of Spark and Hadoop applications running in series to create one cohesive E2E Big Data pipeline. Worked on Large sets of structured, semi structured and unstructured data
  • Worked with Sqoop for importing data from relational databases
  • Wrote multiple Map Reduce jobs for data cleaning and pre-processing
  • Running Hive queries and Pig scripts on large datasets to generate insights
  • Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS
  • Experience in managing and reviewing log files of HADOOP Cluster
  • Used SerDes in Hive to convert JSON-format data to CSV format for loading into tables
  • Assisted with data capacity planning and node forecasting
  • Designed and developed Spark jobs for streaming real-time data received from RabbitMQ and IBM MQ through Kafka and Spark Streaming
  • Worked with Apache Spark streaming and batch frameworks; created Spark jobs for data transformation and aggregation
  • Designed workflows by scheduling Hive processes for data, which is ingested into HDFS using Sqoop
  • Developed Hive queries to process the data and generate the data for visualizing
  • Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE
  • Used Zookeeper to manage coordination among the clusters
  • Developed scripts and batch jobs to schedule various Hadoop programs
  • Continuously scheduled and monitored streaming of data with Oozie
  • Ensured fault tolerance in the presence of machine failures using the streaming tooling
  • Reported the data to analysts for further tracking of trends across various consumers
  • Used Spark for interactive queries, processing of Streaming data and integration with NoSQL database for huge volume of data
  • Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka and Postgres running on other instances using SSL handshakes
  • Applied release process practices such as DevOps and Continuous Delivery methodologies to existing builds and deployments; also scripted with Python, Perl and shell
  • Worked with Continuous Integration/Continuous Delivery (CI/CD) using Jenkins for timely builds and test runs
  • Developed a Jenkins script integrated with the GIT repository for the build, testing, code review and deployment of the built JAR file, shell scripts and Oozie workflows to the destination HDFS paths
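
A hedged Scala sketch of the Parquet-to-Hive step referenced above; the Transaction case class, S3 path and table names are illustrative assumptions, not actual project artifacts.

    // Sketch only: the case class, path and table names below are assumed for illustration
    import org.apache.spark.sql.{SaveMode, SparkSession}

    case class Transaction(id: Long, account: String, amount: Double)

    object ParquetToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-to-hive-sketch")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Load Parquet data and bind it to a strongly typed Dataset via the case class
        val txns = spark.read.parquet("s3a://example-bucket/curated/transactions/")
          .as[Transaction]

        // Handle the structured data with Spark SQL, then persist to a Hive table
        txns.createOrReplaceTempView("txns")
        val large = spark.sql("SELECT id, account, amount FROM txns WHERE amount > 1000")
        large.write.mode(SaveMode.Overwrite).saveAsTable("analytics.large_transactions")

        spark.stop()
      }
    }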

Environment: AWS, Sqoop, MapReduce, Pig, Hive, Oozie, Zookeeper, Java, Shell scripting, Spark, Spark SQL.

Bigdata Developer

Confidential, Hartford, CT

Responsibilities:

  • Worked on analyzing the Hadoop cluster and different big data analytics and processing tools including Pig, Hive, Sqoop, Python, Spark Streaming, and Spark with Scala and Java
  • Wrote Spark Streaming applications to consume data from Kafka topics, wrote the processed streams to HBase, and streamed data using Spark with Kafka
  • Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and MongoDB
  • Involved in creating data-lake by extracting customer's data from various data sources to HDFS which include data from Excel, databases, and log data from servers
  • Developed Apache Spark applications by using Scala and python for data processing from various streaming sources
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark
  • Implemented Spark solutions to generate reports, fetch and load data in Cassandra
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system
  • Written HiveQL to analyse the number of unique visitors and their visit information such as views, most visited pages, etc
  • Configured Spark Streaming to receive real time data from the Apache Kafka and store the stream data to HDFS using Scala and Python
  • Experienced in Agile methodologies, Scrum stories and sprints in a Python-based environment, along with data analytics, data wrangling and Excel data extracts
  • Created the AWS VPC network for the installed instances and configured the Security Groups and Elastic IPs accordingly
  • Experienced in working with the Amazon EMR framework for processing data on EMR and EC2 instances
  • Designing and implementing complete end-to-end Hadoop Infrastructure including Pig, Hive, Sqoop, Oozie, Flume, and Zookeeper
  • Further used Pig to perform transformations, event joins and pre-aggregations with the Elephant Bird API before loading JSON-format files onto HDFS
  • Testing the processed data through various test cases to meet the business requirements
  • Extracted the real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS (see the sketch below)
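
A hedged Scala sketch of the Kafka-to-HDFS streaming flow described in the last bullet; the broker address, topic name, batch interval and output path are assumptions for illustration only.

    // Sketch only: broker, topic, group id and HDFS path are assumed placeholders
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object KafkaToParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-parquet-sketch").getOrCreate()
        val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
        import spark.implicits._

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092",            // assumed broker
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "sketch-consumer",
          "auto.offset.reset" -> "latest"
        )

        // Consume the real-time feed from a Kafka topic as a DStream
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // Each micro-batch arrives as an RDD; convert it to a DataFrame and append as Parquet on HDFS
        stream.map(_.value).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            rdd.toDF("value")
              .write.mode(SaveMode.Append)
              .parquet("hdfs:///data/streams/events_parquet")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }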

Environment: AWS, Ambari, Hive, Python, HBase, Spark, Scala, Map Reduce, HDFS, Sqoop, Impala, Linux, Shell scripting, Tableau.

Hadoop/Spark Developer

Confidential, Dallas TX

Responsibilities:

  • Evaluated business requirements and prepared detailed design documents following the project guidelines and SLAs required for procuring data from all the upstream data sources and developing the corresponding programs.
  • Data files are retrieved through various data transmission tools and protocols such as Sqoop, NDM, SFTP and DMS; these data files are then validated by Spark control jobs written in Scala.
  • Spark RDDs are created for all the data files and then transformed to cash-only transaction RDDs.
  • The filtered cash-only RDDs are aggregated and curated based on the business rules and CTR requirements, converted into DataFrames, and saved as temporary Hive tables for intermediate processing (see the sketch after this list).
  • The RDDs and DataFrames undergo various transformations and actions and are stored in HDFS as Parquet files and in HBase for auto-generating CTRs.
  • Developed Spark scripts by using Scala and Python shell commands as per the requirement.
  • Maintained and administered HDFS through the Hadoop Java API, shell scripting and Python.
  • Used Python to write scripts that move data across clusters.
  • Expertise in designing Python scripts to interact with middleware/back-end services.
  • Worked on Python scripts to analyze the data of the customer.
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Developed monitoring and notification tools using Python.
  • Wrote Python routines to log into the websites and fetch data for selected options.
  • Used collections in Python for manipulating and looping through different user-defined objects.
  • Wrote and tested Python scripts to create new data files for Linux server configuration using a Python templating tool.
  • Wrote shell scripts to automate the jobs in UNIX.
  • Used log4j API to write log files.
  • Understood the existing Oozie workflows and modified them as per new requirements.
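
A hedged Scala sketch of the cash-only filtering and aggregation step described above; the record layout, delimiter, threshold and table names are illustrative assumptions only.

    // Sketch only: field positions, delimiter, threshold and table names are assumed
    import org.apache.spark.sql.SparkSession

    object CashOnlySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cash-only-sketch")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Create an RDD from a validated data file (path is a placeholder)
        val records = spark.sparkContext.textFile("hdfs:///landing/transactions/part-*")

        // Transform to a cash-only RDD; assume field 3 holds the tender type
        val cashOnly = records
          .map(_.split("\\|", -1))
          .filter(fields => fields.length > 4 && fields(3) == "CASH")
          .map(fields => (fields(0), fields(1), fields(4).toDouble)) // id, account, amount

        // Convert to a DataFrame and register it as a temporary table for intermediate processing
        val cashDf = cashOnly.toDF("txn_id", "account_id", "amount")
        cashDf.createOrReplaceTempView("cash_txns_tmp")

        // Aggregate per an assumed CTR-style rule: total cash per account over 10,000
        spark.sql(
          """SELECT account_id, SUM(amount) AS total_cash
            |FROM cash_txns_tmp
            |GROUP BY account_id
            |HAVING SUM(amount) > 10000""".stripMargin)
          .write.mode("overwrite").saveAsTable("risk.ctr_candidates")

        spark.stop()
      }
    }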

Environment: Cloudera Distribution 5.5, HADOOP Map Reduce, Spark 1.6, HDFS, Python, Hive, HBase, HiveQL, SQOOP, Java, Scala 2.10.4, Unix, IntelliJ, Maven.

Hadoop Application Developer

Confidential

Responsibilities:

  • Key member of the Wholesale Credit Risk Team responsible for generating the wholesale exposure data for building the Accounting View
  • Member of Wholesale CRP Sourcing Team responsible for data sourcing from Netezza, Teradata, Exadata, SQL Server and flat files
  • Contributed to defining the directory structure and the wholesale credit risk data model
  • Involved in the initial CRP architectural and design meetings to define the directory structure and Wholesale Credit Risk Data Model
  • Designed and Developed a generic Sourcing Framework to source Reference data, Control tables, Commercial and Non-Commercial data from upstream systems
  • Defined the coding standards in HADOOP and followed the data modelling standards, guidelines, platform architecture and naming standards in CRP
  • Developed a framework to load the changed control tables from Exadata into Hadoop, running in parallel
  • Recommended best practices across the tech stack, including Autosys, Oozie, Sqoop, Hive, Impala, shell scripting, Exadata, Netezza and Spark SQL.
  • Converted all the existing Pig and Hive ETL scripts in HADOOP to run in Spark
  • Developed Surrogate ID generator, Sequence Key generator and CDC components in Scala running through Spark (see the sketch after this list)
  • Conducted data quality and data integrity checks in CRP layers such as Staging, PDM, RDM and Distribution
  • Wrote the denormalization SQL to denormalize data from the Stage to the RDM layer
  • Developed and executed validation scripts in Impala to perform count, duplicate checks for all Wholesale Risk reference data, facts and dimension tables
  • Documented HQL scripts to deploy DDLs in the Stage, PDM, RDM and Distribution layers
  • Delivered HADOOP Training to business users and warehousing teams
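
A hedged Scala sketch in the spirit of the surrogate-key component mentioned above; the dimension table, staging table and column names are assumptions for illustration, not the actual CRP model.

    // Sketch only: rdm.dim_customer, staging.customer_delta and cust_sk are assumed names
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    object SurrogateKeySketch {
      // Append a contiguous surrogate key, continuing from the last key already issued
      def withSurrogateKey(df: DataFrame, keyCol: String, startAt: Long): DataFrame = {
        val spark = df.sparkSession
        val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
          Row.fromSeq(row.toSeq :+ (startAt + idx + 1))
        }
        val schema = StructType(df.schema.fields :+ StructField(keyCol, LongType, nullable = false))
        spark.createDataFrame(indexed, schema)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("surrogate-key-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Highest key issued so far in the (assumed) dimension table
        val maxKey = spark.sql("SELECT COALESCE(MAX(cust_sk), 0) FROM rdm.dim_customer")
          .first().getLong(0)

        // Key the incoming delta and append it to the dimension
        val incoming = spark.table("staging.customer_delta")
        val keyed = withSurrogateKey(incoming, "cust_sk", maxKey)
        keyed.write.mode("append").saveAsTable("rdm.dim_customer")

        spark.stop()
      }
    }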

Environment: Cloudera(CDH 5.8.3), HADOOP (2.6), Spark (1.6), HDFS, Sqoop, MapReduce, Hive, Impala, YARN, Oozie, Autosys, Hue, Netezza, Exadata, SQL Server, Toad Data Point, Shell scripting and MicroStrategy.

Hadoop Application Developer

Confidential

Responsibilities:

  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Designed changes to transform current HADOOP jobs to HBase.
  • Handled fixing of defects efficiently and worked with the QA and BA team for clarifications
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files
  • Extended the functionality of Hive and Pig with custom UDFs and UDAFs
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users
  • Implemented Bucketing and Partitioning using Hive to assist the users with data analysis
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive (see the sketch after this list)
  • Develop database management systems for easy access, storage, and retrieval of data
  • Perform DB activities such as indexing, performance tuning, and backup and restore
  • Expertise in writing HADOOP Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java
  • Applied various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins
  • Expert in creating PIG and Hive UDFs using Java to analyze the data efficiently
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop
  • Implemented AJAX, JSON, and JavaScript to create interactive web screens
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB
  • Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts

Environment: AWS, Hadoop, Pig, Hive, MapReduce, HDFS, Sqoop, Impala, Tableau, Oozie, Linux.
