
Sr. Data Engineer Resume


Pittsburgh, PA

SUMMARY

  • Data Engineer with around 7 years of experience in interpreting and analyzing complex datasets and expertise in providing business insights
  • Experience in the Agile Software Development Lifecycle (SDLC) - requirements gathering, analysis, design, development, maintenance, build, code management and testing of enterprise data warehouse applications
  • Created the AWS VPC network for the installed instances and configured the Security Groups and Elastic IPs accordingly
  • Experienced in working with the Amazon EMR framework for processing data on EMR and EC2 instances
  • Well versed in the Hadoop framework and in analysis, design, development, documentation, deployment and integration using SQL and Big Data technologies
  • Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, and Kafka
  • Experience with data warehousing and data mining, using one or more NoSQL Databases like HBase, Cassandra, and MongoDB
  • Experience in using Sqoop to ingest data from RDBMS to HDFS
  • Experience in cluster coordination using ZooKeeper; worked on file formats like Text, ORC, Avro and Parquet and compression techniques like Snappy, Gzip and Zlib
  • Experienced in using various Python libraries like NumPy, SciPy, python-twitter, Pandas, scikit-learn
  • Worked on visualization tools like Power BI, Tableau for report creation and further analysis
  • Experienced with the Spark processing framework, including Spark SQL, as well as data warehousing and ETL processes
  • Developed an end-to-end ETL pipeline using Spark SQL and Scala on the Spark engine; imported data from AWS S3 into Spark RDDs and performed transformations and actions on them (see the sketch after this list)
  • Experience with Spark Streaming and writing Spark jobs
  • Experience developing high throughput Streaming applications from Kafka queues and writing enriched data back to outbound Kafka queues
  • Experience in moving data using Sqoop from HDFS to Relational Database Systems (RDBMS) - Oracle, DB2 and SQL Server - and from RDBMS to HDFS
  • Good understanding of AWS S3, EC2, Kinesis and DynamoDB
  • Used Jupyter Notebooks for data pre-processing and building machine learning algorithms on datasets
  • Good knowledge of building Machine Learning solutions to various business problems using Python
  • Experienced in real-time analytics with Spark RDD, Data Frames and Streaming API
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data
  • Good understanding of JIRA and experience maintaining JIRA dashboards
  • Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization
  • Provided production support and involved with root cause analysis, bug fixing and promptly updating the business users on day-to-day production issues.
  • Developed DAGs and automated the process for the data science teams
  • Developed ad-hoc queries for moving data from HDFS to Hive and analyzing the data using HiveQL
  • Integrated Slack notifications with Jenkins deployments to notify the required users about the deployments
  • Involved in daily SCRUM meetings to discuss the development/progress of Sprints and was active in making SCRUM meetings more productive.
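
A minimal Scala sketch of the S3-to-Spark ETL flow referenced above; the bucket name, path and field delimiter are illustrative assumptions rather than actual project values.

    // Minimal sketch of the S3-to-Spark ETL flow; bucket, path and delimiter are assumed placeholders
    import org.apache.spark.sql.SparkSession

    object S3EtlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3-etl-sketch")
          .getOrCreate()

        // Import raw records from AWS S3 into a Spark RDD
        val raw = spark.sparkContext.textFile("s3a://example-bucket/raw/events/")

        // Transformations: drop empty lines and split each record into fields
        val parsed = raw
          .filter(_.nonEmpty)
          .map(_.split(",", -1))

        // Action: materialize a count to trigger execution of the pipeline
        println(s"parsed records: ${parsed.count()}")

        spark.stop()
      }
    }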

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

HADOOP Distribution: Cloudera, Hortonworks, AWS

Languages: Java, Shell scripting, Pig Latin, Scala, Python, R, C, C++, HiveQL

Web Technologies: HTML, CSS, JavaScript, XML, JSP, Restful, SOAP

Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS

Machine Learning: Linear regression, Logistic Regression, Random forest, k-NN

Build Automation tools: SBT, Ant, Maven

Version Control: GIT

IDE & Build Tools, Design: Eclipse, Visual Studio, NetBeans, Rational Application Developer, JUnit

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, Cassandra, MongoDB), Teradata

BI Tools: Power BI, Tableau

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Pittsburgh, PA

Responsibilities:

  • Developed highly efficient Spark batch and streaming applications running on AWS, utilizing Spark APIs such as Datasets, case classes, lambda functions and RDD transformations, adhering to industry standards and best practices for development.
  • Migrated long running Hadoop applications from legacy clusters to Spark applications running on Amazon EMR.
  • Used Spark SQL to load Parquet data, created Datasets defined by case classes, and handled structured data with Spark SQL before storing it into Hive tables for downstream consumption (see the sketch at the end of this list).
  • Wrote ETL scripts to move data between HDFS and S3 and created Hive external tables on top of this data to be utilized in Big Data applications.
  • Created scripts to sync data between local Postgres databases and those on the AWS cloud.
  • Implemented a POC to migrate Hadoop Java applications to Spark on Scala.
  • Developed Scala scripts on Spark to perform operations such as data inspection, cleaning and loading, and to transform large sets of JSON data to Parquet format.
  • Prepared Linux shell scripts to configure, deploy and manage Oozie workflows of Big Data applications.
  • Worked on Spark streaming using Amazon Kinesis for real time data processing.
  • Created, configured, managed and destroyed EMR transient non-prod clusters as well as long running Prod cluster on AWS.
  • Worked on triggering and scheduling ETL jobs using AWS Glue and automated Glue runs with CloudWatch Events.
  • Involved in developing Hive DDL templates which were hooked into Oozie workflows to create, alter and drop tables.
  • Created Hive snapshot tables and Hive Avro tables from data partitions stored on S3 and HDFS.
  • Involved in creating frameworks which utilized a large number of Spark and Hadoop applications running in series to create one cohesive E2E Big Data pipeline. Worked on Large sets of structured, semi structured and unstructured data
  • Worked with Sqoop for importing data from relational databases
  • Wrote multiple Map Reduce jobs for data cleaning and pre-processing
  • Running Hive queries and Pig scripts on large datasets to generate insights
  • Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS
  • Experience in managing and reviewing log files of HADOOP Cluster
  • Used SerDes in Hive to convert JSON-format data to CSV format for loading into tables
  • Assisted with data capacity planning and node forecasting
  • Designed and developed Spark jobs for streaming real-time data received from RabbitMQ and IBM MQ through Kafka and Spark Streaming
  • Worked with Apache Spark streaming and batch frameworks; created Spark jobs for data transformation and aggregation
  • Designed workflows by scheduling Hive processes for data, which is ingested into HDFS using Sqoop
  • Developed Hive queries to process the data and generate the data for visualizing
  • Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE
  • Used Zookeeper to manage coordination among the clusters
  • Developed scripts and batch jobs to schedule various Hadoop programs
  • Continuously scheduled and monitored streaming of data with Oozie
  • Ensured fault tolerance in the presence of machine failures using the streaming tooling
  • Reported the data to analysts for further tracking of trends across various consumers
  • Used Spark for interactive queries, processing of Streaming data and integration with NoSQL database for huge volume of data
  • Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka and Postgres running on other instances using SSL handshakes
  • Applied release process practices such as DevOps and Continuous Delivery methodologies to existing builds and deployments; also scripted with Python, Perl and shell
  • Worked with Continuous Integration/Continuous Delivery (CI/CD) using Jenkins for timely builds and test runs
  • Developed a Jenkins script integrated with the GIT repository for the build, testing, code review and deployment of the built JAR file, shell scripts and Oozie workflows to the destination HDFS paths
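
A hedged Scala sketch of the Parquet-to-Hive step referenced above; the Transaction case class, S3 path and table names are illustrative assumptions, not actual project artifacts.

    // Sketch only: the case class, path and table names below are assumed for illustration
    import org.apache.spark.sql.{SaveMode, SparkSession}

    case class Transaction(id: Long, account: String, amount: Double)

    object ParquetToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-to-hive-sketch")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Load Parquet data and bind it to a strongly typed Dataset via the case class
        val txns = spark.read.parquet("s3a://example-bucket/curated/transactions/")
          .as[Transaction]

        // Handle the structured data with Spark SQL, then persist to a Hive table
        txns.createOrReplaceTempView("txns")
        val large = spark.sql("SELECT id, account, amount FROM txns WHERE amount > 1000")
        large.write.mode(SaveMode.Overwrite).saveAsTable("analytics.large_transactions")

        spark.stop()
      }
    }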

Environment: AWS, Sqoop, MapReduce, Pig, Hive, Oozie, Zookeeper, Java, Shell scripting, Spark, Spark SQL.

Bigdata Developer

Confidential, Hartford, CT

Responsibilities:

  • Worked on analyzing the Hadoop cluster and different big data analytics and processing tools including Pig, Hive, Sqoop, Python, Spark Streaming, and Spark with Scala and Java
  • Wrote Spark Streaming applications to consume data from Kafka topics, wrote the processed streams to HBase, and streamed data using Spark with Kafka
  • Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and MongoDB
  • Involved in creating data-lake by extracting customer's data from various data sources to HDFS which include data from Excel, databases, and log data from servers
  • Developed Apache Spark applications by using Scala and python for data processing from various streaming sources
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark
  • Implemented Spark solutions to generate reports, fetch and load data in Cassandra
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system
  • Written HiveQL to analyse the number of unique visitors and their visit information such as views, most visited pages, etc
  • Configured Spark Streaming to receive real time data from the Apache Kafka and store the stream data to HDFS using Scala and Python
  • Experienced in Agile methodologies, Scrum stories and sprints in a Python-based environment, along with data analytics, data wrangling and Excel data extracts
  • Created the AWS VPC network for the installed instances and configured the Security Groups and Elastic IPs accordingly
  • Experienced in working with the Amazon EMR framework for processing data on EMR and EC2 instances
  • Designing and implementing complete end-to-end Hadoop Infrastructure including Pig, Hive, Sqoop, Oozie, Flume, and Zookeeper
  • Further used Pig to perform transformations, event joins and pre-aggregations with the Elephant Bird API before loading JSON-format files onto HDFS
  • Testing the processed data through various test cases to meet the business requirements
  • Extracted the real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS (see the sketch below)
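
A hedged Scala sketch of the Kafka-to-HDFS streaming flow described in the last bullet; the broker address, topic name, batch interval and output path are assumptions for illustration only.

    // Sketch only: broker, topic, group id and HDFS path are assumed placeholders
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object KafkaToParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-parquet-sketch").getOrCreate()
        val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
        import spark.implicits._

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092",            // assumed broker
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "sketch-consumer",
          "auto.offset.reset" -> "latest"
        )

        // Consume the real-time feed from a Kafka topic as a DStream
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // Each micro-batch arrives as an RDD; convert it to a DataFrame and append as Parquet on HDFS
        stream.map(_.value).foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            rdd.toDF("value")
              .write.mode(SaveMode.Append)
              .parquet("hdfs:///data/streams/events_parquet")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }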

Environment: AWS, Ambari, Hive, Python, HBase, Spark, Scala, Map Reduce, HDFS, Sqoop, Impala, Linux, Shell scripting, Tableau.

Hadoop/Spark Developer

Confidential, Dallas TX

Responsibilities:

  • Evaluated business requirements and prepared detailed design documents following the project guidelines and SLAs required for procuring data from all the upstream data sources and developing the corresponding programs.
  • Data files are retrieved through various data transmission tools and protocols such as Sqoop, NDM, SFTP and DMS; these data files are then validated by Spark control jobs written in Scala.
  • Spark RDDs are created for all the data files and then transformed to cash-only transaction RDDs.
  • The filtered cash-only RDDs are aggregated and curated based on the business rules and CTR requirements, converted into DataFrames, and saved as temporary Hive tables for intermediate processing (see the sketch after this list).
  • The RDDs and DataFrames undergo various transformations and actions and are stored in HDFS as Parquet files and in HBase for auto-generating CTRs.
  • Developed Spark scripts by using Scala and Python shell commands as per the requirement.
  • Maintained and administered HDFS through the Hadoop Java API, shell scripting and Python.
  • Used Python to write scripts that move data across clusters.
  • Expertise in designing Python scripts to interact with middleware/back-end services.
  • Worked on Python scripts to analyze the data of the customer.
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Developed monitoring and notification tools using Python.
  • Wrote Python routines to log into the websites and fetch data for selected options.
  • Used collections in Python for manipulating and looping through different user-defined objects.
  • Wrote and tested Python scripts to create new data files for Linux server configuration using a Python templating tool.
  • Wrote shell scripts to automate the jobs in UNIX.
  • Used log4j API to write log files.
  • Understood the existing Oozie workflows and modified them as per new requirements.
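
A hedged Scala sketch of the cash-only filtering and aggregation step described above; the record layout, delimiter, threshold and table names are illustrative assumptions only.

    // Sketch only: field positions, delimiter, threshold and table names are assumed
    import org.apache.spark.sql.SparkSession

    object CashOnlySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cash-only-sketch")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Create an RDD from a validated data file (path is a placeholder)
        val records = spark.sparkContext.textFile("hdfs:///landing/transactions/part-*")

        // Transform to a cash-only RDD; assume field 3 holds the tender type
        val cashOnly = records
          .map(_.split("\\|", -1))
          .filter(fields => fields.length > 4 && fields(3) == "CASH")
          .map(fields => (fields(0), fields(1), fields(4).toDouble)) // id, account, amount

        // Convert to a DataFrame and register it as a temporary table for intermediate processing
        val cashDf = cashOnly.toDF("txn_id", "account_id", "amount")
        cashDf.createOrReplaceTempView("cash_txns_tmp")

        // Aggregate per an assumed CTR-style rule: total cash per account over 10,000
        spark.sql(
          """SELECT account_id, SUM(amount) AS total_cash
            |FROM cash_txns_tmp
            |GROUP BY account_id
            |HAVING SUM(amount) > 10000""".stripMargin)
          .write.mode("overwrite").saveAsTable("risk.ctr_candidates")

        spark.stop()
      }
    }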

Environment: Cloudera Distribution 5.5, HADOOP Map Reduce, Spark 1.6, HDFS, Python, Hive, HBase, HiveQL, SQOOP, Java, Scala 2.10.4, Unix, IntelliJ, Maven.

Hadoop Application Developer

Confidential

Responsibilities:

  • Key member of the Wholesale Credit Risk Team responsible for generating the wholesale exposure data for building the Accounting View
  • Member of Wholesale CRP Sourcing Team responsible for data sourcing from Netezza, Teradata, Exadata, SQL Server and flat files
  • Contributed to defining the directory structure and the wholesale credit risk data model
  • Involved in the initial CRP architectural and design meetings to define the directory structure and Wholesale Credit Risk Data Model
  • Designed and Developed a generic Sourcing Framework to source Reference data, Control tables, Commercial and Non-Commercial data from upstream systems
  • Defined the coding standards in HADOOP and followed the data modelling standards, guidelines, platform architecture and naming standards in CRP
  • Developed a framework to load the changed control tables from Exadata into Hadoop, running in parallel
  • Recommended best practices across the tech stack, including Autosys, Oozie, Sqoop, Hive, Impala, shell scripting, Exadata, Netezza and Spark SQL.
  • Converted all the existing Pig and Hive ETL scripts in HADOOP to run in Spark
  • Developed Surrogate ID generator, Sequence Key generator and CDC components in Scala running through Spark (see the sketch after this list)
  • Conducted data quality and data integrity checks in CRP layers such as Staging, PDM, RDM and Distribution
  • Wrote the denormalization SQL to denormalize data from the Stage to the RDM layer
  • Developed and executed validation scripts in Impala to perform count, duplicate checks for all Wholesale Risk reference data, facts and dimension tables
  • Documented HQL scripts to deploy DDLs in the Stage, PDM, RDM and Distribution layers
  • Delivered HADOOP Training to business users and warehousing teams
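
A hedged Scala sketch in the spirit of the surrogate-key component mentioned above; the dimension table, staging table and column names are assumptions for illustration, not the actual CRP model.

    // Sketch only: rdm.dim_customer, staging.customer_delta and cust_sk are assumed names
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    object SurrogateKeySketch {
      // Append a contiguous surrogate key, continuing from the last key already issued
      def withSurrogateKey(df: DataFrame, keyCol: String, startAt: Long): DataFrame = {
        val spark = df.sparkSession
        val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
          Row.fromSeq(row.toSeq :+ (startAt + idx + 1))
        }
        val schema = StructType(df.schema.fields :+ StructField(keyCol, LongType, nullable = false))
        spark.createDataFrame(indexed, schema)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("surrogate-key-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Highest key issued so far in the (assumed) dimension table
        val maxKey = spark.sql("SELECT COALESCE(MAX(cust_sk), 0) FROM rdm.dim_customer")
          .first().getLong(0)

        // Key the incoming delta and append it to the dimension
        val incoming = spark.table("staging.customer_delta")
        val keyed = withSurrogateKey(incoming, "cust_sk", maxKey)
        keyed.write.mode("append").saveAsTable("rdm.dim_customer")

        spark.stop()
      }
    }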

Environment: Cloudera(CDH 5.8.3), HADOOP (2.6), Spark (1.6), HDFS, Sqoop, MapReduce, Hive, Impala, YARN, Oozie, Autosys, Hue, Netezza, Exadata, SQL Server, Toad Data Point, Shell scripting and MicroStrategy.

Hadoop Application Developer

Confidential

Responsibilities:

  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Designed changes to transform current HADOOP jobs to HBase.
  • Handled fixing of defects efficiently and worked with the QA and BA team for clarifications
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files
  • Extended the functionality of Hive and Pig with custom UDFs and UDAFs
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users
  • Implemented Bucketing and Partitioning using Hive to assist the users with data analysis
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive (see the sketch after this list)
  • Develop database management systems for easy access, storage, and retrieval of data
  • Perform DB activities such as indexing, performance tuning, and backup and restore
  • Expertise in writing HADOOP Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java
  • Applied various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins
  • Expert in creating PIG and Hive UDFs using Java to analyze the data efficiently
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop
  • Implemented AJAX, JSON, and JavaScript to create interactive web screens
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB
  • Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts

Environment: AWS, Hadoop, Pig, Hive, MapReduce, HDFS, Sqoop, Impala, Tableau, Oozie, Linux.
