
Senior GCP Data Engineer Resume

Columbia, SC

PROFESSIONAL SUMMARY:

  • 8+ years of experience in all phases of software application requirement analysis, design, development, and maintenance of Hadoop/Big Data applications.
  • Hands on experience with Spark Core, Spark SQL, Spark Streaming.
  • Used Spark SQL to perform transformations and actions on data residing in Hive (see the sketch after this summary).
  • Used Kafka & Spark Streaming for real-time processing.
  • Deployed instances on AWS EC2, used EBS volumes for persistent storage, and performed access management using the IAM service.
  • Experience with NoSQL databases such as Cassandra and HBase (column-oriented) and MongoDB (document-oriented), and their integration with Hadoop clusters.
  • Experience in writing Hive Queries for processing and analyzing large volumes of data.
  • Experience in importing and exporting data using Sqoop from Relational Database Systems to HDFS and vice-versa.
  • Developed Oozie workflows by integrating all tasks relating to a project and scheduled the jobs as per requirements.
  • Automated jobs that pull data from upstream servers and load it into Hive tables using Oozie workflows.
  • Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
  • Experience building data pipelines in Python/PySpark/HiveQL/Presto/BigQuery and building Python DAGs in Apache Airflow.
  • Experience with scripting in Linux/Unix shell.
  • Implemented several optimization mechanisms such as Combiners, Distributed Cache, data compression, and custom Partitioners to speed up jobs.
  • 3+ years of hands-on experience with Big Data ecosystems including Hadoop and YARN, Spark, Kafka, DynamoDB, Redshift, SQS, SNS, Hive, Sqoop, Flume, Pig, Oozie, MapReduce, and Zookeeper, in industries such as finance and healthcare.
  • Good knowledge of Teradata.
  • Experience migrating data between RDBMS and unstructured sources and HDFS using Sqoop.
  • Templated AWS infrastructure as code with Terraform to build out staging and production environments.
  • Good knowledge of Apache Spark data processing for handling data from RDBMS and streaming sources with Spark Streaming.
  • Experience in data warehousing and ETL processes, with strong database, ETL, and data analysis skills.
  • Experience developing HiveQL scripts for data analysis and ETL purposes, and extended the default functionality by writing User Defined Functions (UDFs) for data-specific processing.
  • Extensive experience in using Flume to transfer log data files to Hadoop Distributed File System.
  • Good understanding/knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce programming paradigm.
  • Tested various Flume agents, data ingestion into HDFS, and retrieval and validation of Snappy-compressed files.
  • Good skills in writing Spark jobs in Scala for processing large sets of structured and semi-structured data and storing the results in HDFS.
  • Ability to spin up AWS resources such as EC2, EBS, S3, SQS, SNS, Lambda, Redis, and EMR inside a VPC using CloudFormation templates.
  • Hands-on experience integrating Amazon DynamoDB with Spark.
  • Ensured data integrity and data security on AWS by implementing AWS best practices.
  • Good knowledge of Spark SQL queries for loading tables into HDFS and running select queries on top of them.
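
A minimal PySpark sketch of the Spark SQL work described above (transformations and actions on data residing in Hive); the table and column names are hypothetical placeholders, and a Hive-enabled Spark cluster is assumed:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive support lets Spark SQL read tables registered in the Hive metastore.
    spark = (SparkSession.builder
             .appName("hive-transformations")
             .enableHiveSupport()
             .getOrCreate())

    # Transformations: filter and aggregate a Hive table (hypothetical names).
    txns = spark.table("sales_db.transactions")
    daily_totals = (txns
                    .filter(F.col("status") == "COMPLETE")
                    .groupBy("txn_date")
                    .agg(F.sum("amount").alias("total_amount")))

    # Action: write the aggregated result back to a Hive table.
    daily_totals.write.mode("overwrite").saveAsTable("sales_db.daily_totals")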

TECHNICAL SKILLS:

Spark Components: Spark Core, Spark SQL, Spark Streaming.

Programming Languages: SQL, Scala, Java, Python and Unix Shell Scripting.

Databases & Query Languages: MySQL, Teradata, HiveQL, Pig Latin, RDBMS.

Cloud: AWS, GCP, Azure

Operating Systems: Windows, Unix, Red Hat Linux.

Big Data/Hadoop: HDFS, Hive, Pig, Sqoop, Oozie, Flume, Impala, Zookeeper, Kafka, MapReduce, Cloudera, Amazon EMR.

PROFESSIONAL EXPERIENCE:

Confidential

Senior GCP Data Engineer

Responsibilities:

  • Working with offshore and onsite teams for sync-ups.
  • Using Hive extensively to create views for the feature data.
  • Creating and maintaining automation jobs for different data sets.
  • Interacting with multiple teams to understand their business requirements and design flexible, common components.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Developed pipelines for auditing the metrics of all applications using GCP Cloud functions, and Dataflow for a pilot project.
  • Developed an end-to-end pipeline that exports data from Parquet files in Cloud Storage to GCP Cloud SQL.
  • Implemented Spark SQL to access Hive tables from Spark for faster data processing.
  • Used Hive to do transformations, joins, filter and some pre-aggregations before storing the data.
  • Performed data visualization on selected data sets with PySpark in Jupyter notebooks.
  • Validating and visualizing the data in Tableau.
  • Created Sentry policy files to give business users access to the required databases and tables through Impala in the dev, UAT, and prod environments.
  • Created and validated Hive views using Hue.
  • Configured Airflow DAG for various feeds.
  • Created deployment document and user manual to do validations for the dataset.
  • Created Data Dictionary for Universal data sets.
  • Working with AWS and GCP, using GCP Cloud Storage, Dataproc, Dataflow, and BigQuery as well as AWS EMR, S3, Glacier, and EC2 instances with EMR clusters.
  • Working with platform and Hadoop teams closely for the needs of the team.
  • Using Kafka for Data ingestion for different data sets.
  • Experienced in importing and exporting data into HDFS and assisted in exporting analyzed data to RDBMS using Sqoop.
  • Was responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Involved in building Data Models and Dimensional Modeling with 3NF, Star and Snowflake schemas for OLAP and Operational data store applications.
  • Wrote complex Spark applications for performing various denormalizations of the datasets and creating a unified data analytics layer for downstream teams.
  • Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.
  • Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, using operators such as the bash operator, Hadoop operators, Python callables, and branching operators (see the sketch after this list).
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Worked on Informatica Power Center tools- Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Worked on generating dynamic CASE statements in Spark based on Excel files provided by the business.
  • Migrated an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Airflow jobs.
  • Created two instances in GCP where one is for development and the other for production.
  • Worked on Autosys for scheduling the Oozie Workflows.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Used SVN for branching, tagging, and merging.
  • Experience writing scripts using PySpark and Scala on the Spark framework to utilize RDDs/DataFrames and facilitate advanced data analytics.
  • Involved in developing Pig scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
  • Validating the source file for Data Integrity and Data Quality by reading header and trailer information and column validations.
  • Storing data files in Google Cloud Storage buckets on a daily basis. Using Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
  • Created shell scripts and TES job scheduler jobs for workflow execution to automate loading into different data sets.
  • Wrote many SQL scripts to reconcile data mismatches and worked on loading history data from Teradata to Snowflake.
  • Developed Python scripts to automate data sampling process.
  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications, executing machine learning use cases with Spark ML and MLlib.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Developed Sqoop jobs to import data from RDBMS and file servers into Hadoop.
  • Created heat maps, bar charts, and plots using pyplot, NumPy, pandas, and SciPy in Jupyter Notebook.
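
A minimal sketch of the kind of Airflow DAG described above for GCP Cloud Composer, showing a bash operator and a branching Python callable; the DAG id, schedule, and commands are hypothetical placeholders, not the actual production pipeline:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import BranchPythonOperator

    def choose_load_path(**context):
        # Hypothetical branching rule: run a full load on the first of the month.
        return "full_load" if context["ds"].endswith("-01") else "delta_load"

    with DAG(
        dag_id="feed_ingestion",            # placeholder DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract_feed",
            bash_command="echo 'extracting feed'",   # placeholder command
        )
        branch = BranchPythonOperator(
            task_id="choose_load_path",
            python_callable=choose_load_path,
        )
        full_load = BashOperator(task_id="full_load", bash_command="echo 'full load'")
        delta_load = BashOperator(task_id="delta_load", bash_command="echo 'delta load'")

        extract >> branch >> [full_load, delta_load]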

Environment: GCP, Airflow, Snowflake, Hive, PySpark, Python, SQL, Oozie.

Confidential, Columbia, SC

Senior Data Engineer

Responsibilities:

  • Worked on loading data from MySQL and Teradata to HBase where necessary using Sqoop.
  • Used Sqoop for importing and exporting data from Netezza, Teradata into HDFS and Hive.
  • Created Teradata schemas with constraints and macros, loaded data using the FastLoad utility, and created functions and procedures in Teradata.
  • Created external Hive tables to store and query the loaded data.
  • Applied optimization techniques including partitioning and bucketing.
  • Extensively used the PySpark API to process structured data sources.
  • Used the Avro file format compressed with Snappy for intermediate tables for faster processing of data.
  • Developed data ingestion modules (both real-time and batch data loads) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
  • Configured failure alerts and status alerts for long running jobs on Airflow.
  • Used parquet file format for published tables and created views on the tables.
  • Created Sentry policy files to give business users access to the required databases and tables through Impala in the dev, UAT, and prod environments.
  • Automated the jobs with Oozie and scheduled them with Autosys.
  • Experience in AWS spinning up EMR clusters to process large volumes of data stored in S3 and push it to HDFS.
  • Participated in a collaborative team designing software and developing a Snowflake data warehouse within AWS.
  • Developed DAGs in Airflow to schedule and orchestrate the multiple Spark jobs.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Data Integrity checks have been handled using Hive queries, Hadoop, and Spark.
  • Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
  • Used Hive and created Hive Tables and was involved in data loading and writing Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Hands-on experience with Amazon DynamoDB and Fivetran API connectors.
  • Involved in NoSQL database design, integration, and implementation.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with components on HDFS, Hive.
  • Developed Python, SQL, and Spark Streaming code using PySpark and Scala scripts.
  • Developed PySpark programs to process the data required for the model framework.
  • Implemented batch-processing data pipelines in AWS using PySpark, Airflow, Glue, and S3.
  • Participated in evaluation and selection of new technologies to support system efficiency.
  • Participated in development and execution of system and disaster recovery processes.
  • Interacting with multiple teams to understand their business requirements and design flexible, common components.
  • Validating the source file for Data Integrity and Data Quality by reading header and trailer information and column validations.
  • Used Spark SQL for creating data frames and performed transformations on data frames like adding schema manually, casting, joining data frames before storing them.
  • Implemented Spark SQL to access Hive tables from Spark for faster data processing.
  • Worked on Spark Streaming with Apache Kafka for real-time data processing (see the streaming sketch after this list).
  • Experience creating Kafka producers and consumers for Spark Streaming.
  • Used Hive to do transformations, joins, filter and some pre-aggregations before storing the data onto HDFS.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Worked on three layers for storing data such as raw layer, intermediate layer and publish layer.
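
A minimal PySpark Structured Streaming sketch of the Kafka integration described above (see the bullet on Spark Streaming with Apache Kafka); the broker address and topic name are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-streaming").getOrCreate()

    # Subscribe to a Kafka topic (placeholder broker and topic names).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load())

    # Kafka delivers key/value as binary; cast the value to a string before use.
    parsed = events.select(F.col("value").cast("string").alias("payload"))

    # Console sink used here for illustration only.
    query = (parsed.writeStream
             .outputMode("append")
             .format("console")
             .trigger(processingTime="30 seconds")
             .start())
    query.awaitTermination()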

Environment: Hive, Impala, Spark, Autosys, Kafka, DynamoDB, Lambda, S3, SQS, SNS, Sqoop, Pig, Java, Scala, Eclipse, Tableau, Teradata, UNIX, Maven, Hadoop, Cloudera, Amazon AWS, HDFS, SBT.

Confidential

Big data Developer

Responsibilities:

  • Supported Hive programs running on the cluster.
  • Involved in loading data from UNIX file system to HDFS.
  • Installed and configured Hive and wrote Hive UDFs.
  • Involved in creating Hive tables, loading data and writing Hive queries.
  • Worked closely with AWS to migrate entire data centers to the cloud using VPC, EC2, S3, EMR, RDS, Splice Machine, and DynamoDB services.
  • Worked on data pre-processing and cleaning to support feature engineering, and performed data imputation for missing values in the dataset using Python (see the sketch after this list).
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Performed end-to-end delivery of PySpark ETL pipelines on Azure Databricks to transform data orchestrated via Azure Data Factory (ADF), scheduled through Azure Automation accounts and triggered using the Tidal scheduler.
  • Hands on Experience in Oozie Job Scheduling.
  • Worked with big data developers, designers, and scientists to troubleshoot MapReduce and Hive jobs and tuned them for high performance.
  • Automated the end-to-end workflow from data preparation to the presentation layer for the Artist Dashboard project using shell scripting.
  • Used Azure PowerShell to deploy Azure Databricks workloads (languages: PySpark, Scala).
  • Provided input to Product Management to influence feature requirements for compute and networking in the VMware cloud offering.
  • Developed MapReduce programs to extract and transform data sets; the resulting datasets were loaded into Cassandra.
  • Orchestrated Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows.
  • Conducted root cause analysis (RCA) to find data issues and resolve production problems.
  • Involved in loading the generated files into MongoDB for faster access by a large customer base without taking a performance hit.
  • Proactively involved in ongoing maintenance, support and improvements in Hadoop cluster.
  • Performed data analytics in Hive and then exported these metrics back to an Oracle database using Sqoop.
  • Involved in Minor and Major Release work activities.
  • Collaborating with business users/product owners/developers to contribute to the analysis of functional requirements.
  • Importing and exporting data into HDFS using Sqoop which included incremental loading.
  • Developed Hive queries for processing and data manipulation.
  • Worked on optimizing and tuning Hive to achieve optimal performance.
  • Experienced in defining and managing job flows using Oozie.
  • Responsible for managing data coming from different sources.
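
A brief pandas sketch, purely illustrative, of the data cleaning and missing-value imputation described above (see the bullet on data pre-processing); the file name and column names are hypothetical:

    import numpy as np
    import pandas as pd

    # Hypothetical input file.
    df = pd.read_csv("customer_features.csv")

    # Numeric columns: fill missing values with the column median.
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Categorical columns: fill missing values with the most frequent value.
    for col in df.select_dtypes(include=["object"]).columns:
        if not df[col].mode().empty:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Simple feature-engineering example: bucket a numeric column into bands.
    df["spend_band"] = pd.cut(df["annual_spend"],
                              bins=[0, 500, 2000, np.inf],
                              labels=["low", "medium", "high"])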

Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Java, Scala, Hortonworks, Amazon EMR, EC2, S3.

Confidential

Hadoop Developer

Responsibilities:

  • Created and maintained Technical documentation for launching Cloudera Hadoop Clusters and for executing Hive queries and Pig Scripts
  • Implemented JMS for asynchronous auditing purposes.
  • Experience in automating deployment, management, and self-serve troubleshooting of applications.
  • Defined and evolved the existing architecture to scale with growth in data volume, users, and usage.
  • Designed and developed a Java API (Commerce API) that provides functionality to connect to Cassandra through Java services.
  • Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and pandas in Python.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on developing a PySpark script to protect raw data by hashing client-specified columns using hashing algorithms (see the sketch after this list).
  • Responsible for managing data from multiple sources, with BI/DW and analytics knowledge.
  • Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
  • Responsible for managing data coming from different sources.
  • Installed and configured Hive and wrote Hive UDFs.
  • Experience in managing the CVS and migrating into Subversion.
  • Experience in defining, designing and developing Java applications, specially using Hadoop Map/Reduce by leveraging frameworks such as Cascading and Hive.
  • Experience documenting designs and procedures for building and managing Hadoop clusters.
  • Strong experience troubleshooting operating system issues, cluster issues, and Java-related bugs.
  • Assisted in exporting analyzed data to relational databases using Sqoop.
  • Experienced in importing and exporting data into HDFS and assisted in exporting analyzed data to RDBMS using Sqoop.
  • Developed MapReduce jobs using Java API.
  • Wrote MapReduce jobs using Pig Latin.
  • Developed workflow using Oozie for running MapReduce jobs and Hive Queries.
  • Worked on Cluster coordination services through Zookeeper.
  • Worked on loading log data directly into HDFS using Flume.
  • Involved in loading data from LINUX file system to HDFS.
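
A minimal PySpark sketch of the column-hashing approach described above (see the bullet on hashing client-specified columns); the paths and column names are hypothetical, and SHA-256 is used as a representative hashing algorithm:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("column-hashing").getOrCreate()

    # Hypothetical raw dataset and client-specified sensitive columns.
    raw = spark.read.parquet("/data/raw/customers")
    sensitive_cols = ["ssn", "email", "phone"]

    # Replace each sensitive column with its SHA-256 hash.
    hashed = raw
    for col_name in sensitive_cols:
        hashed = hashed.withColumn(col_name,
                                   F.sha2(F.col(col_name).cast("string"), 256))

    hashed.write.mode("overwrite").parquet("/data/secure/customers")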

Environment: Hadoop, HDFS, Hive, Flume, Sqoop, PIG, Eclipse, MySQL and RedHat, Java (JDK 1.6).

Confidential

Python Developer

Responsibilities:

  • Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC).
  • Developed processes, DevOps tools, and automation for a Jenkins-based build system and for delivering software builds.
  • Developed and tested many features in an AGILE environment using HTML5, CSS, JavaScript, jQuery, and Bootstrap.
  • Used Python scripts to generate various reports such as transaction history, OATS, user privileges, limit rules, and commission schedule reports.
  • Designed the front end and back end of the application using Python on the Django web framework.
  • Used HTML, CSS, AJAX, and JSON to design and develop the user interface of the website.
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface (see the sketch after this list).
  • Worked with Scrapy for web scraping to extract structured data from websites in order to analyze specific data and work with it.
  • Used SVN, CVS as version control for existing systems.
  • Used JIRA to maintain system protocols by writing and updating procedures and business case requirements, functional requirement specifications documents.
  • Implemented unit testing using PyUnit and tested several RESTful services using SOAP UI.
  • Developed consumer-based features and applications using Python, Django, HTML, Behavior Driven Development (BDD), and pair programming.
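
A minimal Django sketch of the view-and-template pattern described above (see the bullet on Django views and templates); the app, model, and template names are hypothetical:

    # views.py -- a simple function-based view rendering a template.
    from django.shortcuts import render
    from .models import Report   # hypothetical model

    def report_list(request):
        """Render the fifty most recent reports."""
        reports = Report.objects.order_by("-created_at")[:50]
        return render(request, "reports/report_list.html", {"reports": reports})

    # urls.py -- route a URL to the view.
    from django.urls import path
    from . import views

    urlpatterns = [
        path("reports/", views.report_list, name="report_list"),
    ]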

Environment: HTML5, CSS, Behavior Driven Development (BDD), Agile, SVN, CVS, Python, Django, JIRA, JavaScript, jQuery, Bootstrap, SOAP UI, REST.
