GCP Data Engineer Resume
New York, NY
SUMMARY
- 7+ years of experience as a Data Engineer, with demonstrated expertise in building and deploying data pipelines using open-source, Hadoop-based technologies such as Apache Spark, Hive, and HDFS, along with Python and PySpark.
- Hands-on experience in developing Spark applications using PySpark DataFrames, RDDs, and Spark SQL.
- Experienced with GCP services including Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
- Expert in using Cloud Pub/Sub to replicate data in real time from source systems to BigQuery.
- Good knowledge of GCP service accounts, billing projects, authorized views, datasets, GCS buckets, and gsutil commands.
- Experienced in building and deploying Spark applications on Hortonworks Data Platform and AWS EMR.
- Experienced in working with AWS services such as EMR, S3, EC2, IAM, Lambda, CloudFormation, and CloudWatch.
- Worked with structured and semi-structured data storage formats such as Parquet, ORC, CSV, and JSON.
- Developed automation scripts using the AWS Boto3 SDK for Python.
- Proficient in developing UNIX Shell Scripts.
- Experienced in working with Snowflake cloud data warehouse and Snowflake Data Modeling.
- Built ELT workflows using Python and the Snowflake COPY command to load data into Snowflake (see the sketch following this summary).
- Experienced in working with relational databases such as Oracle and SQL Server.
- Developed complex SQL queries and performed SQL query performance tuning.
- Experienced in working with CI/CD pipelines such as Jenkins and Bamboo.
- Experienced in working with source code management tools such as Git and Bitbucket.
- Worked with application monitoring tools such as Splunk, Elasticsearch, Logstash, and Kibana for application logging and monitoring.
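To illustrate the Snowflake ELT bullet above, here is a minimal, hypothetical sketch of a Python-driven COPY INTO load. The account settings, stage name (@raw_stage), and target table are placeholders for illustration only, not actual project resources.

```python
import os
import snowflake.connector

# Hypothetical connection details; real values would come from a secrets store.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # COPY INTO pulls files already landed on an external stage into the target table.
    cur.execute("""
        COPY INTO STAGING.ORDERS
        FROM @raw_stage/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
    print(cur.fetchall())  # per-file load results returned by COPY INTO
finally:
    conn.close()
```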
PROFESSIONAL EXPERIENCE
GCP Data Engineer
Confidential | New York, NY
Responsibilities:
- Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
- Designed and implemented the various layers of the data lake and designed star schemas in BigQuery.
- Used Cloud Functions with Python to load data into BigQuery on arrival of CSV files in the GCS bucket (see the sketch at the end of this role).
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Designed pipelines with Apache Beam, Kubeflow, and Dataflow, and orchestrated jobs in GCP.
- Developed and demonstrated a POC to migrate on-prem workloads to Google Cloud Platform using GCS, BigQuery, Cloud SQL, and Cloud Dataproc.
- Documented the inventory of modules, infrastructure, storage, and components of the existing on-prem data warehouse to analyze and identify the technologies and strategies suited for the Google Cloud migration.
- Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark).
- Worked on a GCP POC to migrate data and applications from on-prem to Google Cloud.
- Exposure to IAM roles in GCP.
- Created firewall rules to access Dataproc clusters from other machines.
- Set up GCP firewall rules to control ingress and egress traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
Environment: GCP, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Informatica PowerCenter 10.1, Talend 6.4 for Big Data, Hadoop, Hive, Teradata, SAS, Spark, Python, Java, SQL Server.
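As referenced above, a minimal sketch of a GCS-triggered Cloud Function that loads an arriving CSV file into BigQuery might look like the following; the dataset and table names are placeholders, not the actual project schema.

```python
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background Cloud Function triggered on object finalize in a GCS bucket.

    `event` carries the bucket and object name of the newly arrived CSV file.
    The destination table below is a placeholder for illustration only.
    """
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Kick off the BigQuery load job and wait for it to finish.
    load_job = client.load_table_from_uri(
        uri, "raw_layer.incoming_events", job_config=job_config
    )
    load_job.result()
```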
Data Engineer
Confidential | San Diego, CA
Responsibilities:
- Implemented scalable infrastructure and platforms for large-scale data ingestion, aggregation, integration, and analytics in Hadoop using Spark and Hive.
- Developed streamlined workflows using high-performance API services to handle large volumes of structured and unstructured data.
- Developed Spark jobs in Python to perform data transformations using DataFrames and Spark SQL.
- Processed unstructured JSON data into structured Parquet format by applying several transformations in PySpark (see the sketch at the end of this role).
- Developed Spark applications using Spark libraries to perform ETL transformations, eliminating the need for separate ETL tools.
- Developed end-to-end data pipelines in Spark using Python to ingest, transform, and analyze data.
- Created Hive tables using HiveQL, loaded data into them, and analyzed the data by developing Hive queries.
- Created and executed unit test cases to validate that transformations and processing functions worked as expected.
- Scheduled multiple jobs using the Control-M workflow engine.
- Wrote shell scripts to automate application deployments.
- Implemented solutions to switch schemas based on dates so that the transformations ran automatically.
- Developed custom functions and UDFs in Python to incorporate methods and functionality into Spark.
- Developed data validation scripts in Hive and Spark and performed validation in Jupyter Notebook by spinning up query clusters on AWS EMR.
- Executed Hadoop and Spark jobs on AWS EMR using data stored in Amazon S3.
- Implemented Spark RDD transformations to map business logic and applied actions on top of those transformations.
- Worked with data serialization formats, Parquet, Avro, JSON, and CSV, to convert complex objects into serialized byte sequences.
Environment: Hadoop, Hive, Zookeeper, Sqoop, Spark, Control-M, Python, Bamboo, SQL, Bitbucket, AWS, Linux.
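A minimal sketch of the JSON-to-Parquet PySpark transformation mentioned above; the S3 paths, column names, and dedup key are illustrative assumptions rather than the actual project schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# Read raw JSON events (path is a placeholder).
raw = spark.read.json("s3://example-raw-bucket/events/")

# Typical cleanup transformations: derive a partition date, drop bad rows, dedupe.
curated = (
    raw
    .withColumn("event_date", F.to_date(F.col("event_ts")))
    .filter(F.col("event_type").isNotNull())
    .dropDuplicates(["event_id"])
)

# Write the structured result as partitioned Parquet (path is a placeholder).
curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-curated-bucket/events/"
)
```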
Data Engineer
Confidential | Cincinnati, OH
Responsibilities:
- Created and managed nodes that utilize Java JARs, Python, and shell scripts for scheduling jobs to customize data ingestion.
- Developed Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
- Developed and ran Sqoop imports from Oracle to load data into HDFS.
- Created partitions and buckets based on state to enable further processing with bucket-based Hive joins.
- Created Hive tables to store the processed results in a tabular format.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler and Autosys.
- Developed Kafka producers and brokers for message handling (see the sketch at the end of this role).
- Imported data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Configured a Kafka ingestion pipeline to transmit web server logs to Hadoop.
- Worked on POCs for stream processing using Apache NiFi.
- Worked on Hortonworks Hadoop solutions with real-time streaming using Apache NiFi.
- Analyzed Hadoop logs using Pig scripts to track errors raised by the team's jobs.
- Wrote MySQL queries for efficient retrieval of ingested data using MySQL Workbench.
- Implemented data ingestion and transformation through automated workflows in Oozie.
- Created audit reports to flag security threats and track all user activity using various Hadoop components.
- Designed various plots showing HDFS analytics and other operations performed on the environment.
- Worked with the infrastructure team to test the environment after patches, upgrades, and migrations.
- Developed multiple Java-based scripts to deliver end-to-end support while maintaining product integrity.
Environment: HDFS, Hive, MapReduce, Pig, Spark, Kafka, Sqoop, Scala, Oozie, Maven, GitHub, Java, Python, MySQL, Linux.
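A minimal sketch of the kind of Kafka producer referenced above, using the kafka-python client; the broker address, topic name, and sample record are placeholders.

```python
import json
from kafka import KafkaProducer

# Placeholder broker and topic; real values lived in deployment configuration.
producer = KafkaProducer(
    bootstrap_servers=["broker-1:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish_log_event(event: dict) -> None:
    """Send one web-server log record to the ingestion topic."""
    producer.send("weblogs", value=event)

publish_log_event({"host": "web01", "status": 200, "path": "/index.html"})
producer.flush()  # block until buffered records are delivered
```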
Junior Data Engineer
Confidential
Responsibilities:
- Designed and implemented code changes in existing modules (Java, Python, shell scripts) for enhancements.
- Designed User Interface and the business logic for customer registration and maintenance.
- Integrated web services and worked with data across different servers.
- Involved in the design and development of SOA services using web services.
- Created, developed, modified, and maintained database objects, PL/SQL packages, functions, stored procedures, triggers, views, and materialized views to extract data from different sources.
- Extracted data from various locations and loaded it into Oracle tables using SQL*Loader.
- Developed PL/SQL procedures and UNIX scripts for automating UNIX jobs and running files in batch mode (see the sketch at the end of this role).
- Used Informatica PowerCenter Designer to analyze, extract, and transform source data from various source systems (Oracle, SQL Server, and flat files), incorporating business rules through the objects and functions the tool supports.
- Used Oracle OLTP databases as one of the main sources for ETL processing.
- Managed the ETL process by pulling large volumes of data from various sources, including MS Access and Excel, into the staging database using BCP.
- Responsible for detecting and rectifying errors in ETL operations.
- Incorporated Error Redirection during ETL load in SSIS packages.
- Implemented various SSIS transformations in packages, including Aggregate, Fuzzy Lookup, Conditional Split, Row Count, and Derived Column.
- Implemented the master-child package technique to manage large ETL projects efficiently.
- Involved in Unit testing and System Testing of ETL Process.
Environment: MS SQL Server, SQL, SSIS, MySQL, Unix, Oracle, Java, Python, Shell.
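As noted above, batch jobs were driven by PL/SQL procedures and scripts. A minimal hypothetical sketch of invoking such a procedure from Python with cx_Oracle is shown below; the DSN, credentials, package, and procedure names are placeholders, not the actual project objects.

```python
import os
import cx_Oracle

# Placeholder connection details; real values came from the job environment.
conn = cx_Oracle.connect(
    user=os.environ["ORA_USER"],
    password=os.environ["ORA_PASSWORD"],
    dsn="dbhost:1521/ORCLPDB1",
)

try:
    cur = conn.cursor()
    # Call a hypothetical PL/SQL procedure that loads one batch file.
    cur.callproc("etl_pkg.load_batch", ["orders_20240101.dat"])
    conn.commit()
finally:
    conn.close()
```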