
GCP Data Engineer Resume


New York

SUMMARY

  • 8+ years of experience analyzing, designing, developing, and implementing data architectures and frameworks as a Data Engineer.
  • Specialized in Data Warehousing and Decision Support Systems, with extensive experience implementing full-lifecycle Data Warehousing projects and Hadoop/Big Data experience in storage, querying, processing, and analysis of data.
  • Software development on cloud computing platforms such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP).
  • Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Built a program with Python and Apache Beam and executed it on Cloud Dataflow to run data validation between raw source files and BigQuery tables (a minimal sketch follows this list).
  • Knowledge in installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Zookeeper, and Flume.
  • Experience in analyzing data using HiveQL, HBase, and custom MapReduce programs.
  • Experience in importing and exporting data using Sqoop between HDFS and relational database systems such as Teradata, Oracle, and SQL Server.
  • Designed and implemented migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools); worked on the Azure suite: Azure SQL Database, Azure Data Lake Storage (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and other Azure data platform services.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Dataproc, and Stackdriver.
  • Developed complex mappings to load data from various sources into the Data Warehouse using transformations/stages such as Joiner, Transformer, Aggregator, Update Strategy, Rank, Lookup, Filter, Sorter, Source Qualifier, and Stored Procedure.
  • Implemented a POC to migrate MapReduce jobs into Spark transformations using Python.
  • Developed Apache Spark jobs using Python in a test environment for faster data processing and used Spark SQL for querying.
  • Experienced in Spark Core, Spark RDD, Pair RDD, Spark Deployment Architectures.
  • An accomplished Data Engineer experienced in ingestion, storage, querying, processing, and analysis of big data, and an expert in delivering data warehousing solutions across a variety of database technologies.
  • Extensive experience focused on Data warehousing, Data modeling, Data integration, Data Migration, ETL process, and Business Intelligence. Package Software: Expertise in SSIS, Informatica ETL, and reporting tools.
  • Extensively worked on Spark with Scala and Python on the cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
  • Good understanding of statistics and of developing machine learning models; experience implementing data science solutions using Databricks.
  • Strong Programming experience in Python, Scala, and core CS concepts such as Data Structures and algorithms.
  • Extensive experience in developing applications that perform Data Processing tasks using Teradata, Snowflake, Oracle, SQL Server, and Postgres databases.
  • Hands-on experience building ETL pipelines, Visualizations, Analytics based quality solutions in-house using AWS, Azure Databricks, and other Open-source frameworks.
  • Extensive experience working with various distributions of Hadoop, including the enterprise versions of Cloudera and Hortonworks, with good knowledge of the MapR distribution and AWS EMR (Elastic MapReduce).
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, and other AWS services, and a deep understanding of cloud architectures on AWS, Azure, and GCP.
  • Experienced in implementing schedulers using Oozie, Airflow, crontab, and shell scripts.
  • Good working experience importing data with Sqoop from sources such as RDBMS, Snowflake, Teradata, and Oracle into HDFS and transforming it using Hive, Pig, and Spark.
  • Extensive experience importing and exporting streaming data into HDFS using stream-processing platforms such as Flume and Kafka.
  • Experienced in migrating data from different sources using the pub-sub model in Redis and Kafka producers and consumers, and preprocessing the data using Spark.
  • Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and performed data transformations using Spark Core.
  • Engaged in performance tuning, scalability engineering, reliability, and feasibility in solutions design.
  • Proficient in NoSQL databases including HBase, Cassandra, and MongoDB and their integration with Hadoop clusters.
  • Working knowledge of installing and maintaining Cassandra by configuring the cassandra.yaml file per business requirements and performing reads/writes using Java JDBC connectivity.
  • Wrote multiple MapReduce jobs using Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, and compression codecs such as gzip, Snappy, and LZO.
  • Strong understanding of Data Modeling (Relational, dimensional, Star and Snowflake Schema), Data analysis, implementations of Data warehousing using Windows and UNIX.
  • Experience in complete Software Development Life Cycle (SDLC) in both Waterfall and Agile methodologies.
  • Generated various kinds of knowledge reports using Tableau, Power BI, and Qlik based on Business specifications.
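
The following is a minimal, illustrative sketch of the Python/Apache Beam validation pipeline mentioned above (comparing a raw source file against a BigQuery table). All names here are hypothetical placeholders, and the comparison is reduced to a simple row-count check; it assumes the apache-beam[gcp] package is installed.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        # Hypothetical project, bucket, and table names; switch the runner to
        # "DataflowRunner" to execute on Cloud Dataflow instead of locally.
        opts = PipelineOptions(
            runner="DirectRunner",
            project="my-gcp-project",
            temp_location="gs://my-bucket/tmp",
        )
        with beam.Pipeline(options=opts) as p:
            # Row count of the raw extract landed in GCS.
            raw_count = (
                p
                | "ReadRawFile" >> beam.io.ReadFromText(
                    "gs://my-bucket/raw/orders.csv", skip_header_lines=1)
                | "CountRaw" >> beam.combiners.Count.Globally()
            )
            # Row count of the corresponding BigQuery table.
            bq_count = (
                p
                | "ReadBigQuery" >> beam.io.ReadFromBigQuery(table="my-gcp-project:dw.orders")
                | "CountBQ" >> beam.combiners.Count.Globally()
            )
            # Compare the two counts; the BigQuery count is passed as a side input.
            _ = (
                raw_count
                | "Compare" >> beam.Map(
                    lambda raw, bq: f"raw_rows={raw} bq_rows={bq} match={raw == bq}",
                    bq=beam.pvalue.AsSingleton(bq_count))
                | "LogResult" >> beam.Map(print)
            )

    if __name__ == "__main__":
        run()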

TECHNICAL SKILLS

Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon.

NoSQL Databases: HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, and Apache.

Languages: Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, HiveQL, Unix, Shell Scripting

Source Code Control: GitHub, CVS, SVN, ClearCase

Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP

Databases: Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2

DB languages: MySQL, PL/SQL, PostgreSQL & Oracle

Build Tools: Jenkins, Maven, Ant, Log4j

Business Intelligence Tools: Tableau, Power BI

Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans

ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS

Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral, UML

PROFESSIONAL EXPERIENCE

Confidential, New York

GCP Data engineer

Responsibilities:

  • Developed Spark programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the Enterprise Data warehouse.
  • Experience building Power BI reports on Azure Analysis Services for better performance.
  • Developed streaming applications using PySpark to read from Kafka and persist the data to NoSQL databases such as HBase and Cassandra.
  • Implemented PySpark scripts using Spark SQL to access Hive tables in Spark for faster processing of data (see the PySpark sketch after this list).
  • Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list).
  • Developed streaming and batch processing applications using PySpark to ingest data from the various sources into HDFS Data Lake.
  • Developed DDLs and DMLs scripts in SQL and HQL for analytics applications in RDBMS and Hive.
  • Developed and implemented HQL scripts to create partitioned and bucketed tables in Hive for optimized data access.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Wrote Hive UDFs to implement custom aggregation functions in Hive.
  • Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
  • Monitored YARN applications; troubleshot and resolved cluster-related system problems.
  • Created shell scripts to parameterize the Hive actions in Oozie workflow and for scheduling the jobs.
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
  • Played a key role on the team that developed an initial prototype of a NiFi big data pipeline, which demonstrated an end-to-end scenario of data ingestion and processing.
  • Used NiFi to check whether a message reached the end system.
  • Developed custom processors for NiFi.
  • Worked on NoSQL databases such as HBase and integrated them with PySpark for processing and persisting real-time streaming data.
  • Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce extraction jobs, and Zookeeper to provide coordination services to the cluster.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Assessed existing EDW (enterprise data warehouse) technologies and methods to ensure the EDW/BI architecture meets the needs of the business and enterprise and allows for growth.
  • Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods technologies.
  • Designed and developed data pipelines for integrated data analytics using Hive, Spark, Sqoop, and MySQL.
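
A minimal Airflow DAG sketch for the kind of GCS-to-BigQuery ETL job referenced in the bullets above. The DAG id, bucket, dataset, and SQL are hypothetical placeholders, and it assumes the apache-airflow-providers-google package is installed.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="daily_orders_load",              # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Land the day's raw CSV files from GCS into a BigQuery staging table.
        load_raw = GCSToBigQueryOperator(
            task_id="load_raw_orders",
            bucket="my-landing-bucket",                          # hypothetical bucket
            source_objects=["orders/{{ ds }}/*.csv"],
            destination_project_dataset_table="staging.orders_raw",
            source_format="CSV",
            skip_leading_rows=1,
            write_disposition="WRITE_TRUNCATE",
        )

        # Refine the staging data into the warehouse table with a BigQuery query job.
        build_fact = BigQueryInsertJobOperator(
            task_id="build_orders_fact",
            configuration={
                "query": {
                    "query": "SELECT * FROM staging.orders_raw WHERE order_id IS NOT NULL",
                    "destinationTable": {
                        "projectId": "my-gcp-project",           # hypothetical project
                        "datasetId": "dw",
                        "tableId": "orders_fact",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                },
            },
        )

        load_raw >> build_fact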
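
And a minimal PySpark sketch of the Spark SQL / Hive pattern described above: read a staging table populated from raw files, refine it, and store the result in a partitioned warehouse table. Database, table, and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("refine_orders")
        .enableHiveSupport()          # lets Spark SQL see the Hive metastore
        .getOrCreate()
    )

    # Read the staging table that was populated from the raw files.
    raw = spark.sql("SELECT * FROM staging.orders_raw")

    # Basic refinement: drop bad rows and derive the partition column.
    refined = (
        raw
        .filter(F.col("order_id").isNotNull())
        .withColumn("order_date", F.to_date("order_ts"))
    )

    # Store the refined data in a partitioned table in the warehouse.
    (
        refined.write
        .mode("overwrite")
        .partitionBy("order_date")
        .saveAsTable("dw.orders_refined")
    )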

Environment: CDH5, Hortonworks, Apache Hadoop 2.6.0, HDFS, Java 8, Hive 1.2.1000, Sqoop 1.4.6, HBase 1.1.2, Oozie 4.1.0, Storm 0.9.3, YARN, NiFi, Cassandra, Zookeeper, Spark, Kafka, Oracle 11g, MySQL, Shell Script, AWS, EC2, Tomcat 8, Spring 3.2.3, STS 3.6, Build Tool Gradle 2.2, Source Control Git, Teradata SQL Assistant.

Confidential, New York

Sr. Data Engineer

Responsibilities:

  • Imported data using Sqoop from Teradata to HDFS on a regular basis.
  • Wrote Hive queries for ad-hoc reporting to the business.
  • Participated in weekly release meetings with Technology stakeholders to identify and mitigate potential risks associated with the releases.
  • Implemented and was responsible for AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, and Auto Scaling groups; optimized volumes and EC2 instances.
  • Wrote Terraform and CloudFormation templates for AWS infrastructure as code to build staging and production environments, and set up build automation with Jenkins.
  • Configured Elastic Load Balancers (ELB) with EC2 Auto scaling groups.
  • Created an Amazon VPC with a public-facing subnet for web servers with internet access, and a private subnet with no internet access for backend databases and application servers.
  • Created AWS launch configurations based on customized AMIs and used them to configure Auto Scaling groups.
  • Utilized Puppet for configuration management of hosted instances within AWS; configured and networked the Virtual Private Cloud (VPC).
  • Utilized S3 bucket and Glacier for storage and backup on AWS.
  • Used AWS Identity and Access Management (IAM) to create groups and permissions so users could work collaboratively.
  • Implemented and set up a continuous build and deployment delivery process using Subversion, Git, Jenkins, IIS, and Tomcat.
  • Connected the continuous integration system to the Git version control repository so builds run continually as check-ins come in from developers.
  • Knowledge of the build tools Ant and Maven, and of writing build.xml and pom.xml respectively.
  • Knowledge of authoring pom.xml files, performing releases with the Maven release plugin, and managing Maven repositories; implemented Maven builds to automate JAR and WAR packaging.
  • Designed and built deployments using Ant/shell scripting and automated the overall process using Git and Maven.
  • Implemented a continuous delivery framework using Jenkins, Ansible/Puppet, Maven, and Nexus in a Linux environment.
  • Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark) on AWS EMR (a minimal sketch follows this list).
  • Performed Code Reviews and responsible for Design, Code, and Test signoff.
  • Assigned work to team members and assisted them with development, clarifying design issues and fixing defects.
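
A minimal sketch of the kind of PySpark ETL pipeline run on AWS EMR referenced above; the S3 buckets, file layout, and columns are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("emr_etl_orders").getOrCreate()

    # Extract: read the raw CSV drop from the S3 landing zone.
    raw = spark.read.csv("s3://my-landing-bucket/orders/", header=True, inferSchema=True)

    # Transform: basic cleansing plus a derived partition column.
    clean = (
        raw.dropDuplicates(["order_id"])
        .filter(F.col("amount") > 0)
        .withColumn("order_date", F.to_date("order_ts"))
    )

    # Load: write partitioned Parquet back to the curated S3 zone.
    (
        clean.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://my-curated-bucket/orders/")
    )

    spark.stop()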

Environment: Scala, Hadoop, MapReduce, Spark, Yarn, Hive, Pig, Nifi, Kafka, Hortonworks, Cloudera, Sqoop, Flume, Elastic Search, Cloudera Manager, Java, J2EE, Web services, Hibernate, Struts, JSP, JDBC, XML, WebLogic Workshop, Jenkins, Maven.

Confidential, Corvallis, OR

Data Engineer

Responsibilities:

  • As a Data Engineer, was responsible for the overall design and implementation of the enterprise data migration process from legacy Oracle/DB2 sources to RDS PostgreSQL and Amazon Redshift using AWS Database Migration Service, the Schema Conversion Tool, and migration agents.
  • Designed and implemented highly scalable ETL using the Matillion tool; developed numerous orchestration and transformation jobs and nested them as master jobs in Matillion.
  • Dockerized ETL components and deployed them to data-specific ECS clusters using Jenkins/Git; fed ETL service, RDS, and Redshift logs to Splunk and SteelCentral for enterprise monitoring.
  • Implemented FIPS 140-2 compliant encryption standards for data at rest and data in transit.
  • Worked as the point of contact for optimization, performance tuning, and maintenance of the cloud databases.
  • Defined and deployed monitoring, metrics, and logging systems on AWS, primarily configuring CloudWatch metrics for RDS and Redshift.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries, which allowed for a more reliable and faster reporting interface with sub-second response for basic queries.
  • Developed various Spark jobs for processing Parquet data files (see the sketch after this list); responsible for designing logical and physical data models for various data sources on Amazon Redshift.
  • Implemented data extract processes between CME BIC (Beneficiary Cloud) and CMS RASS products.
  • Optimized, tuned, and automated the Redshift DW environment using AWS utilities.
  • For the RASS project implementation, extensively used Qlik Replicate (formerly Attunity) and Qlik Compose for Data Warehouses to automate data ingestion and data curation processes.
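
A minimal sketch of the Parquet-processing Spark jobs mentioned above; the data-lake paths, table grain, and columns are hypothetical, and the aggregation is illustrative only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims_parquet_rollup").getOrCreate()

    # Read the claim-level Parquet files from the data lake.
    claims = spark.read.parquet("s3://my-data-lake/claims/parquet/")

    # Daily roll-up of claim counts and paid amounts for downstream Redshift dashboards.
    daily = (
        claims
        .groupBy("claim_date", "plan_id")
        .agg(
            F.count("*").alias("claim_count"),
            F.sum("paid_amount").alias("total_paid"),
        )
    )

    daily.write.mode("overwrite").parquet("s3://my-data-lake/claims/daily_rollup/")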

Environment: Amazon Redshift, PostgreSQL, PySpark, Oracle, Matillion, AWS Data Pipeline, S3, SQL Server Integration Services, AWS Database Migration Service, AWS SCT, Python, Qlik Replicate (formerly Attunity), Qlik Compose, Docker, Splunk, Git, SQS, SNS, and Jenkins.

Confidential

Data engineer

Responsibilities:

  • Involved in implementing a project that went through several phases: data set analysis, data set preprocessing, user-generated data extraction, and modeling.
  • Participated in Data Acquisition with the Data Engineer team to extract historical and real-time data by using Sqoop, Pig, Flume, Hive, MapReduce, and HDFS.
  • Wrote user-defined functions (UDFs) in Hive to manipulate strings, dates, and other data.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Process improvement: analyzed error data from recurrent programs using Python and devised a new process that reduced the turnaround time for problem resolution by 60%.
  • Worked on production data fixes by creating and testing SQL scripts.
  • Dived deep into complex data sets to analyze trends using Linear Regression, Logistic Regression, and Decision Trees.
  • Prepared reports using SQL and Excel to track the performance of websites and apps.
  • Visualized data using Tableau to highlight abstract information.
  • Applied clustering algorithms (hierarchical and k-means) using scikit-learn and SciPy (a minimal sketch follows this list).
  • Performed data collection, data cleaning, data visualization, and feature engineering using Python libraries such as pandas, NumPy, Matplotlib, and seaborn.
  • Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning.
  • Used Tableau for data visualization and interactive statistical analysis.
  • Worked with Business Analysts to understand the user requirements, layout, and look of the interactive dashboard.
  • Used SSIS to create ETL packages to Validate, Extract, Transform, and Load data into Data Warehouse and Data Mart.
  • Classified customer lifetime values based on the RFM model using an XGBoost classifier.
  • Maintained and developed complex SQL queries, stored procedures, views, functions, and reports that meet customer requirements using Microsoft SQL Server.
  • Participated in building machine learning models using Python.
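
A minimal scikit-learn sketch of the clustering work described above, using RFM-style features; the input file and column names are hypothetical placeholders.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical customer extract with recency/frequency/monetary columns.
    df = pd.read_csv("customer_metrics.csv")
    features = df[["recency_days", "frequency", "monetary_value"]].fillna(0)

    # Scale features so no single column dominates the distance metric.
    scaled = StandardScaler().fit_transform(features)

    # Fit k-means and attach the cluster label back onto the original frame.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
    df["segment"] = kmeans.fit_predict(scaled)

    # Quick profile of each segment for downstream reporting.
    print(df.groupby("segment")[["recency_days", "frequency", "monetary_value"]].mean())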

Environment: Python, PL/SQL scripts, Oracle Apps, Excel, IBM SPSS, Tableau, Big Data, HDFS, Sqoop, Pig, Flume, Hive, MapReduce, SQL, pandas, NumPy, Matplotlib, seaborn, ETL, SSIS, SQL Server, Windows.

Confidential

Data Analyst

Responsibilities:

  • Primarily worked on a project to develop an internal ETL product to handle complex, large-volume healthcare claims data; designed the ETL framework and developed a number of packages to extract, transform, and load data using SQL Server Integration Services (SSIS) into local MS SQL 2012 databases to facilitate reporting operations.
  • Involved in various transformation and data-cleansing activities using various control flow and data flow tasks in SSIS packages during data migration.
  • Applied various data transformations such as Lookup, Aggregate, Sort, Multicast, Conditional Split, and Derived Column.
  • Developed Mappings, Sessions, and Workflows to extract, validate, and transform data per the business rules using Informatica.
  • Supported data migration projects; migrated data from SQL Server to Netezza using the NZ Migrate utility.
  • Designed target tables per the requirements from the reporting team and designed Extraction, Transformation, and Loading (ETL) using Talend.
  • Worked on Netezza SQL scripts to load data between Netezza tables.
  • Scheduled Talend jobs using Job Conductor (the scheduling tool in Talend), available in TAC.
  • Created stored procedures and wrote complex queries and T-SQL joins to address various reporting operations and ad-hoc data requests.
  • Performed performance monitoring and index optimization tasks using Performance Monitor, SQL Profiler, Database Tuning Advisor, and the Index Tuning Wizard.
  • Acted as point of contact to resolve locking/blocking and performance issues.
  • Wrote scripts and an indexing strategy for a migration to Amazon Redshift from SQL Server and MySQL databases.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used JSON schemas to define table and column mappings from S3 data to Redshift, and worked on indexing and data distribution strategies optimized for sub-second query response (see the sketch after this list).
  • Worked on Dell Boomi connectors such as FTP, Mail, Database, Salesforce, Web Services Listener, HTTP Client, Web Services SOAP Client, SuccessFactors, and Trading Partner.
  • Developed database/flat-file/JSON/XML profiles, Boomi mappings, and processes using different connectors/shapes and logic shapes between application profiles and different trading partners in Dell Boomi.
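
A minimal sketch of the underlying Redshift COPY with a JSONPaths column mapping referenced above, shown here issued directly from Python via psycopg2 rather than through an AWS Data Pipeline definition; the cluster endpoint, credentials, bucket, and IAM role are hypothetical placeholders.

    import psycopg2

    # COPY the S3 JSON files into Redshift using a JSONPaths file for column mapping.
    COPY_SQL = """
        COPY analytics.page_events
        FROM 's3://my-events-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 's3://my-events-bucket/jsonpaths/page_events.json'
        TIMEFORMAT 'auto'
        GZIP;
    """

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="********",
    )
    try:
        # The connection context manager commits the COPY on success.
        with conn, conn.cursor() as cur:
            cur.execute(COPY_SQL)
    finally:
        conn.close()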

Environment: Amazon Redshift, AWS Data Pipeline, Talend Platform for Big Data, MS SQL Server 2008R2/2012, Oracle 10g/9i, Dell Boomi, Netezza Mako 7.2, S3, SQL Server Reporting Services (SSRS), SQL Server Integration Services (SSIS), SharePoint, TFS, MS Project, MS Access, and Informatica.
