Sr. GCP Data Engineer Resume
Houston, TX
SUMMARY
- 9+ years of proven experience in Business Intelligence reporting, Google Cloud services, Big Data/Hadoop ETL, and supply chain product development.
- Hands-on experience on Google Cloud Platform (GCP) across its big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (Airflow as a service).
- Highly knowledgeable in developing data marts in BigQuery and on-premises Hadoop clusters.
- Skilled at identifying the right cloud-native technology for developing and maintaining big data flows in organizations.
- Experience building data pipelines in Python/PySpark/Hive SQL/Presto/BigQuery and authoring Python DAGs in Apache Airflow.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Cloud Composer (Airflow); a sketch of a typical DAG follows this summary.
- Strong knowledge of data preparation, data modeling, and data visualization using Power BI, with experience developing Analysis Services models using DAX queries.
- Hands-on experience with programming languages such as Python, R, and SAS.
- Keen on keeping up with the newer technologies that Google Cloud Platform adds to its stack.
- Experience using Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, Kafka, and cron.
- Experience developing ETL applications on large volumes of data using tools such as MapReduce, Spark (Scala), Spark SQL, and Pig.
- Experience using Sqoop to import and export data between RDBMS and HDFS/Hive.
- Experience with Unix/Linux systems, bash scripting, and building data pipelines.
- Strong SQL development skills including writing Stored Procedures, Triggers, Views, and User Defined functions.
- Expertise in designing and deploying Hadoop clusters and big data analytics tools, including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
- Knowledge of HDFS file formats such as Avro, ORC, and Parquet.
- Extensive experience in writing Python functions.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
- Hands-on experience using Google Cloud Platform services including BigQuery, Cloud Dataproc, and Apache Airflow (Cloud Composer).
- Good programming experience with Python and Scala.
- Hands-on experience with NoSQL databases such as HBase and Cassandra.
- Experience using the Stackdriver service and Dataproc cluster logs in GCP for debugging.
- Experience with scripting languages like PowerShell, Perl, Shell, etc.
- Expert knowledge and experience in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Expert knowledge of Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Experienced with the Hortonworks on-premises Hadoop distribution, including writing Hive SQL and Sqoop scripts.
- Experience with GCP Dataproc, GCS, Cloud Functions, BigQuery, Azure Data Factory, and Databricks.
- Experience in building efficient pipelines for moving data between GCP and Azure using Azure Data Factory.
- Experience building Power BI reports on Azure Analysis Services for better performance compared with DirectQuery against GCP BigQuery.
- Extensive use of the Cloud Shell SDK in GCP to configure and deploy services such as Dataproc, Cloud Storage, and BigQuery.
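As a minimal sketch of the kind of Composer (Airflow) DAG and BigQuery pipeline referenced above, the example below loads daily files from GCS into a staging table and builds a mart table from it. The project, dataset, bucket, and table names are hypothetical, and it assumes the Airflow 2 Google provider operators.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical names, for illustration only.
PROJECT = "my-project"
STAGING = "staging"

with DAG(
    dag_id="daily_sales_load",            # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the day's Parquet files from GCS into a staging table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="my-landing-bucket",
        source_objects=["sales/{{ ds }}/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table=f"{PROJECT}.{STAGING}.sales_raw",
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform the staging table into the reporting mart.
    build_mart = BigQueryInsertJobOperator(
        task_id="build_mart",
        configuration={
            "query": {
                "query": f"SELECT * FROM `{PROJECT}.{STAGING}.sales_raw` WHERE amount > 0",
                "destinationTable": {
                    "projectId": PROJECT,
                    "datasetId": "mart",
                    "tableId": "sales",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> build_mart
```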
TECHNICAL SKILLS
- Python
- SAS
- ETL
- Unix
- Linux
- Hive SQL
- Presto SQL
- Spark SQL
PROFESSIONAL EXPERIENCE
Confidential - Houston, TX
Sr. GCP Data Engineer
Responsibilities:
- Developed multi-cloud strategies to make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings).
- Loaded and transformed large sets of structured and semi-structured data and analyzed them by running Hive queries.
- Developed a custom Python program, including CI/CD rules, for Google Cloud Data Catalog metadata management.
- Designed and developed Spark jobs in Scala to implement end-to-end data pipelines for batch processing.
- Performed fact/dimension modeling and proposed solutions for loading the models.
- Processed data with Scala, Spark, and Spark SQL and loaded it into partitioned Hive tables in Parquet format.
- Developed Spark jobs with partitioned RDDs (hash, range, and custom partitioners) for faster processing.
- Developed and deployed the results as Spark and Scala code on a Hadoop cluster running on GCP.
- Developed a near-real-time data pipeline using Flume, Kafka, and Spark Streaming to ingest client data from their web log servers and apply transformations.
- Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Wrote shell scripts to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing reference models.
- Worked with sources such as Access, Excel, CSV, Oracle, and flat files using the connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to mask raw data by applying hashing algorithms to client-specified columns (see the sketch after this list).
- Responsible for database design, development, and testing; developed stored procedures, views, and triggers.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Developed Sqoop scripts and Sqoop jobs to ingest data from client-provided databases in batch fashion on an incremental basis.
- Used DistCp to load files from S3 into HDFS; processed, cleansed, and filtered data using Scala, Spark, Spark SQL, Hive, and Impala queries; and loaded it into Hive tables for data scientists to apply their ML algorithms and generate recommendations as part of the data lake processing layer.
- Built data pipelines in Airflow on GCP for ETL jobs using a range of Airflow operators, both legacy and newer ones.
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Good knowledge of using Cloud Shell for various tasks and deploying services.
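A minimal sketch of the PySpark hashing approach mentioned above: client-specified columns are replaced with their SHA-256 digests before the data is written out. The GCS paths and column names are hypothetical, and SHA-256 stands in for whatever hashing scheme the client actually specified.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask_client_columns").getOrCreate()

# Hypothetical source path and client-specified columns.
raw = spark.read.parquet("gs://my-raw-bucket/customers/")
pii_columns = ["email", "ssn", "phone"]

# Replace each specified column with its SHA-256 hash.
masked = raw
for col in pii_columns:
    masked = masked.withColumn(col, F.sha2(F.col(col).cast("string"), 256))

masked.write.mode("overwrite").parquet("gs://my-curated-bucket/customers_masked/")
```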
Confidential, Charlotte, NC
GCP Data Engineer
Responsibilities:
- Involved in migrating an on-premises Hadoop system to GCP (Google Cloud Platform).
- Wrote scripts in Hive SQL and Presto SQL, using Python plugins for both Spark and Presto, to create complex tables with performance features such as partitioning, clustering, and skew handling.
- Migrated previously written cron jobs to Airflow/Composer in GCP (see the sketch after this list).
- Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
- Worked with Confluence and Jira.
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
- Compiled data from various sources to perform complex analysis for actionable results
- Measured efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
- Optimized TensorFlow models for efficiency.
- Analyzed the system for new enhancements/functionality and performed impact analysis of the application for implementing ETL changes.
- Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Wrote Python DAGs in Airflow that orchestrate end-to-end data pipelines for multiple applications.
- Involved in setting up the Apache Airflow (Cloud Composer) service in GCP.
- Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
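A minimal sketch of rehosting a legacy cron job as a Composer/Airflow DAG, as described above. The cron schedule and script path are hypothetical; the old cron expression simply becomes the DAG schedule.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The old crontab entry "30 2 * * *" becomes the DAG schedule.
with DAG(
    dag_id="nightly_hive_refresh",        # hypothetical job name
    start_date=datetime(2021, 1, 1),
    schedule_interval="30 2 * * *",
    catchup=False,
) as dag:
    refresh = BashOperator(
        task_id="run_refresh_script",
        # Trailing space keeps Airflow from treating the .sh path as a Jinja template file.
        bash_command="bash /home/airflow/gcs/data/scripts/refresh_partitions.sh ",
    )
```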
Confidential - Fremont, CA
Data Engineer
Responsibilities:
- Used the pandas library for data manipulation and NumPy for numerical analysis, and managed large datasets using pandas DataFrames and MySQL.
- Helped the client partially migrate from MySQL to the Hadoop ecosystem.
- Defined data pipelines for various clients.
- Built part of the Oracle database.
- Loaded data into NoSQL databases (HBase, Cassandra).
- Combined all the above steps in an Oozie workflow to run the end-to-end ETL process.
- Used YARN in Cloudera Manager to monitor job processing.
- Developed under Scrum methodology in a CI/CD environment using Jenkins.
- Participated in the architecture council for database architecture recommendations.
- Performed deep analysis of SQL execution plans and recommended hints, query restructuring, indexes, or materialized views for better performance.
- Deployed EC2 instances for the Oracle database.
- Worked briefly on writing PySpark core methods to speed up Hive SQL queries such as non-equi joins.
- Used Apache Sqoop import and export and handled data type mappings after moving data.
- Did POCs with bucketing and partitioning on Hive tables to understand performance.
- Wrote and executed various MySQL database queries from Python using the MySQL Connector/Python and MySQLdb packages (see the sketch after this list).
- Used the Python collections module for manipulating and looping through different user-defined objects.
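A minimal sketch of running MySQL queries from Python with MySQL Connector/Python, as mentioned above. Connection details, table, and column names are hypothetical.

```python
import mysql.connector

# Hypothetical connection details, for illustration only.
conn = mysql.connector.connect(
    host="localhost", user="etl_user", password="secret", database="sales"
)

try:
    cursor = conn.cursor(dictionary=True)
    # Parameterized query keeps literal values out of the SQL string.
    cursor.execute(
        "SELECT order_id, amount FROM orders WHERE order_date >= %s",
        ("2020-01-01",),
    )
    for row in cursor.fetchall():
        print(row["order_id"], row["amount"])
finally:
    conn.close()
```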
Confidential - Mooresville, NC
SQL Analyst
Responsibilities:
- Extracted, compiled, tracked, and analyzed data to generate reports.
- Participated in troubleshooting and problem resolution efforts.
- Performed daily data manipulation using SQL and prepared reports on a weekly, monthly, and quarterly basis.
- Used Excel functions to generate spreadsheets and pivot tables.
- Identified, analyzed, and interpreted trends and patterns in complex data sets.
- Cleansed and blended multiple data sources to allow different views of application data in a single dashboard.
- Performed data manipulation operations such as importing and exporting data in various external file formats using SQL, and generated a variety of business reports from SQL Server using Excel pivot tables and pivot charts (see the sketch after this list).
- Developed and implemented databases, data collection systems, data analytics, and other strategies that optimize statistical efficiency and quality.
- Checked for invalid/out-of-range data.
- Developed, updated, and maintained current SQL programs.
- Acquired data from primary and secondary data sources and maintained databases/data systems.
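A minimal sketch of the kind of pivot-style reporting described above, reproduced in Python with pandas rather than Excel. The input extract and column names are hypothetical; writing to .xlsx assumes openpyxl is installed.

```python
import pandas as pd

# Hypothetical extract pulled from SQL Server into a CSV file.
df = pd.read_csv("weekly_orders.csv", parse_dates=["order_date"])

# Total order amount by region and month, mirroring an Excel pivot table.
report = pd.pivot_table(
    df,
    index="region",
    columns=pd.Grouper(key="order_date", freq="M"),
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

report.to_excel("weekly_report.xlsx")
```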