Data Engineer Resume
SUMMARY
- Around 8 years of IT experience across the data warehouse life cycle, data lakes and big data project implementation
- Strong knowledge of Entity-Relationship concepts, fact and dimension tables, Slowly Changing Dimensions (SCD), dimensional modeling (Kimball/Inmon), star schema and snowflake schema (an illustrative sketch follows this summary)
- Experience working with the Hadoop ecosystem, along with extensive experience on the AWS and GCP platforms
- Experience integrating data sources such as Oracle, SQL Server, Salesforce Cloud, Teradata, JSON, XML files, flat files and APIs.
- Extensive experience in creating complex mappings in Talend using transformation and big data components
- Expertise in defining and documenting ETL Process Flow, Job Execution Sequence, Job Scheduling and Alerting Mechanisms using command line utilities.
- Extensive experience in implementing error handling, auditing, reconciliation and balancing mechanisms in ETL processes.
- Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, DataNode and MapReduce programming
- Experienced in optimizing Hive queries by tuning configuration parameters.
- Experienced working with NoSQL databases (HBase, Cassandra) and Impala, including database performance tuning and data modeling.
- Experience with the Google Cloud ecosystem, including BigQuery, Bigtable, Dataproc, Dialogflow, Cloud Storage and IAM policies.
- Experienced with Terraform to automate infrastructure provisioning (EC2, EMR, Lambda)
- Knowledge of Amazon EC2, Amazon S3, Amazon RDS, NoSQL (DynamoDB), Redshift, Lambda, VPC, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers. Experience includes Requirements Gathering, Design, Development, Integration, Documentation, Testing and Build
- Hands-on experience in tuning mappings and identifying and resolving performance bottlenecks at various levels such as sources, targets, mappings, and sessions
- Strong knowledge of Spark Core and its libraries: Spark SQL, MLlib, GraphX and Spark Streaming.
- Strong understanding of project life cycle and SDLC methodologies including Waterfall and Agile
- Expertise in understanding and supporting the client with project planning, project definition, requirements definition, analysis, design, testing, system documentation and user training
- Experience in UNIX shell scripting, CRON, FTP and file management in various UNIX environments.
- Knowledge in designing Dimensional models for Data Mart and Staging database.
- Excellent analytical, written and verbal communication skills
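Illustrative sketch referenced above: a minimal PySpark example of the star-schema pattern, joining a fact table to its dimensions on surrogate keys and aggregating a measure. It assumes Hive-backed tables; all table and column names are hypothetical.

    # Illustrative only: table and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("star_schema_example")
        .enableHiveSupport()   # read warehouse tables through the Hive metastore
        .getOrCreate()
    )

    # A typical star schema: one fact table plus its dimension tables
    fact_sales  = spark.table("dw.fact_sales")
    dim_date    = spark.table("dw.dim_date")
    dim_product = spark.table("dw.dim_product")

    # Join the fact to its dimensions on surrogate keys and aggregate a measure
    monthly_revenue = (
        fact_sales
        .join(dim_date, "date_key")
        .join(dim_product, "product_key")
        .groupBy("year", "month", "product_category")
        .agg(F.sum("sales_amount").alias("revenue"))
    )

    monthly_revenue.show()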
TECHNICAL SKILLS
Big Data Tools: Google Cloud, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Flume, Oozie, Kafka, Spark
Databases: Oracle, MS SQL, Teradata, BigQuery, Hive, LUDP
Data Modeling: Erwin 4.5, star schema modeling, snowflake modeling
Programming: Python, Core Java, SQL
Scheduling Tools: TAC, Airflow, NiFi
Operating system: UNIX, Linux, Windows Variants
Other Tools: Eclipse, IntelliJ, GitHub, Jira, Confluence, PuTTY, WinSCP, TSA, Postman, Swagger, Bitbucket, Bamboo, Talend, Informatica, Tableau, Docker, Terraform
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Analyzed business requirements and converted them into a high-level data model document for the development team
- Designed, developed and tested various use cases implemented on the Hadoop big data platform.
- Transformed the ingested data using Spark, Sqoop and Hive according to the data model
- Created Hive tables to store data arriving in varying formats from different applications
- Ingested large volumes of data into Hadoop in Parquet format
- Used Sqoop extensively to import and export data between SQL Server and HDFS/Hive.
- Wrote Hive (HQL) and Impala scripts to extract and load data into the Hive data warehouse.
- Performed data loading through Hive and HBase, and ETL through Talend.
- Analyzed business requirements and cross-verified them against the functionality and features of engines such as Impala.
- Implemented Spark jobs using PySpark and Spark SQL for faster testing and processing of data from different sources (see the illustrative sketch after this section).
- Created POCs in Google Cloud and AWS environments to check the feasibility of moving data from on-premises systems to the cloud ecosystem.
- Created an AWS environment to test S3, EMR, CloudWatch, EC2 and Lambda for one business group and demonstrated the issues found to the product management team
- Managed and scheduled jobs on the Hadoop cluster using Oozie.
- Implemented data ingestion using Sqoop for large dataset transfers between Hadoop and RDBMS sources.
- Involved in ingestion of data from RDBMS to Hive.
- Responsible for creating Hive tables, loading them with data and writing Hive queries.
- Experienced in managing and reviewing log files using Web UI and Cloudera Manager.
- Attended daily Scrum meetings with the team to share status updates and the action plan for the day
Environment: Talend (free version), Oracle, Postgres, HDFS, Hive, Impala, Java, Git, Python, JIRA, Agile Methodology
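Illustrative sketch referenced in this section: a minimal PySpark job of the kind described above, reading a Sqoop-landed Parquet dataset from HDFS, applying basic transformations per the data model, and writing a partitioned Hive table. Paths, database and table names are hypothetical.

    # Illustrative only: paths, database and table names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("ingest_orders_to_hive")
        .enableHiveSupport()   # needed to save managed Hive tables
        .getOrCreate()
    )

    # Read Parquet files landed in HDFS (e.g., by a Sqoop import)
    orders = spark.read.parquet("hdfs:///data/raw/orders/")

    # Basic transformations per the data model: typing, derived columns, filtering
    curated = (
        orders
        .withColumn("order_date", F.to_date("order_ts"))
        .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
        .filter(F.col("order_status").isNotNull())
    )

    # Write a partitioned Hive table in Parquet format
    (
        curated.write
        .mode("overwrite")
        .format("parquet")
        .partitionBy("order_date")
        .saveAsTable("curated_db.orders")
    )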
Confidential, Chicago IL
Data Engineer/BI Developer
Responsibilities:
- Worked with the ecommerce, marketing (Salesforce), sales & ops, manufacturing and customer support teams to implement projects covering data warehousing, data engineering, data integration automation, process design, API enablement, analytics, data quality, etc.
- Developed data pipelines to implement an enterprise data warehouse in Google Cloud and LUDP environments
- Developed an ingestion layer in Google Cloud Storage for the manufacturing team to process 200 GB of data daily.
- Handled importing of data from various data sources, performed transformations using Hive, Pig and Spark, and loaded the data into LUDP.
- Worked on migrating customer engagement team data from on-premises Oracle databases to AWS Redshift and S3
- Selected appropriate AWS services to design, develop and deploy applications based on given requirements
- Worked with EMR clusters and S3 in the AWS cloud.
- Created Hive external tables, loaded data into them and queried the data using HQL
- Developed RESTful APIs in Python for the customer care system, providing easy access to customer and product data
- Developed PySpark scripts using DataFrames, Spark SQL and RDDs for data aggregation and queries.
- Developed Spark code in Python and Spark SQL for faster testing and processing of data, loading data into Spark RDDs and performing in-memory computations to generate output responses efficiently.
- Developed Oozie workflow jobs to execute Hive, Sqoop and Spark actions.
- Developed workflows in Apache Airflow to automate loading data into HDFS and pre-processing it with Python scripts (see the illustrative Airflow sketch after this section).
- Used Terraform to spin up EC2, EMR and Lambda infrastructure as required
- Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
- Responsible for data modeling and development of an internal business intelligence chat bot that provides real-time access to business KPIs, built with Python Flask and Google Cloud
- Improved daily job performance through data cleaning, query optimization and table partitioning
- Created an automated process for the distribution group that receives inventory and sales data and sends activation reports, using Talend and BigQuery
- Worked with the ecommerce DevOps engineer to create an automated log-filtering process using AWS S3, gsutil, Python and Splunk
- Developed data migration processes to move historical data from LUDP Hive tables to the Google Cloud environment
- Involved in building the ETL architecture and Source to Target mapping to load data into Data warehouse.
- Developed a process using Tableau to analyze customer data for a push notification promotion campaign, which increased adoption rates by 28%
- Identified defective manufacturing stations and test processes using Tableau reports based on the smart factory method, and suggested process changes that reduced cost by 31%
- Designed and customized data models for the data warehouse, supporting data from multiple sources in real time.
- Implemented complex business rules by creating reusable transformations and robust mappings with Talend components such as tConvertType, tSortRow, tReplace, tAggregateRow and tUnite.
- Worked on a migration project to convert Informatica ETL jobs to Talend
- Involved in analyzing system failures, identifying root causes and recommending courses of action.
- Worked on Hive and BigQuery (BQ) to export data for further analysis and to transform files from different analytical formats into text files.
- Performed extraction, transformation and loading of data from heterogeneous source systems such as complex JSON, XML, flat files, Excel, Oracle, MySQL, SQL Server, Salesforce Cloud and API endpoints.
- Created an API service for the customer care center using Talend ESB and BigQuery (BQ)
- Created a mechanism to import third-party vendor orders and distributor information using API endpoint extraction
- Created a process to extract email attachments and send the required information from BigQuery
- Mapped source data to target data and converted JSON to XML (Accord format) using Talend Data Mapper and the tXMLMap component.
- Created execution plans in TAC
- Created Talend data quality check joblets based on business requirements
- Created Talend mappings to populate data into dimension and fact tables
- Developed Talend jobs to move inbound files to vendor server locations on monthly, weekly and daily schedules.
Environment: Spark, Oracle, Hive 0.13, HDFS, Google Cloud, XML files, flat files, JSON, Hadoop, JIRA, Postman, Oozie, PySpark, Talend
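Illustrative sketch referenced in this section: a minimal Apache Airflow DAG (assuming Airflow 2.x APIs) that copies a daily extract into HDFS and runs a Python pre-processing step, in the spirit of the Airflow workflow described above. The DAG id, paths and the pre-processing callable are hypothetical.

    # Illustrative only: DAG id, paths and callables are hypothetical (Airflow 2.x).
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def preprocess_daily_extract(**context):
        """Placeholder pre-processing step (e.g., validate and normalize the extract)."""
        ds = context["ds"]  # execution date supplied by Airflow
        print(f"Pre-processing extract for {ds}")


    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_hdfs_ingest",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:

        # Copy the day's extract from the landing directory into HDFS
        load_to_hdfs = BashOperator(
            task_id="load_to_hdfs",
            bash_command=(
                "hdfs dfs -mkdir -p /data/raw/sales/{{ ds }} && "
                "hdfs dfs -put -f /landing/sales/{{ ds }}/*.csv /data/raw/sales/{{ ds }}/"
            ),
        )

        preprocess = PythonOperator(
            task_id="preprocess",
            python_callable=preprocess_daily_extract,
        )

        load_to_hdfs >> preprocess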
Confidential
Data Analyst
Responsibilities:
- Responsible for the design, development and administration of complex T-SQL queries (DDL/DML), stored procedures, views and functions for transactional and analytical data structures
- Identified and interpreted trends and patterns in large, complex datasets and analyzed trends in key metrics
- Collaborated with the team to identify data quality, metadata and data profiling issues
Confidential
ETL Developer
Responsibilities:
- Designed and implemented ETL for data loads from source to target databases, including fact tables and Slowly Changing Dimensions (SCD) Type 1, Type 2 and Type 3 to capture changes.
- Involved in writing SQL queries and using joins to access data from Oracle and MySQL.
- Participated in all phases of development life-cycle with extensive involvement in the definition and design meetings, functional and technical walkthroughs.
- Designed, developed and deployed end-to-end data integration solutions.
- Implemented custom error handling in Talend jobs and worked on different methods of logging.
- Developed ETL mappings for XML, .CSV and .TXT sources and loaded the data from these sources into relational tables with Talend; developed joblets for reusability and to improve performance.
- Created UNIX scripts to automate status reporting for long-running and failed jobs.
- Developed a high-level data dictionary of ETL data mappings and transformations from a series of complex Talend data integration jobs.
- Developed mappings to load fact and dimension tables, SCD Type 1 and Type 2 dimensions and incremental loads, and unit tested the mappings (see the illustrative SCD Type 2 sketch after this section)
- Interacted with end users and functional analysts to identify and develop Business Requirement Documents (BRD) and Functional Specification Documents (FSD).
- Prepared ETL mapping documents for every mapping and data migration documents for smooth transfer of the project from development to testing and then to production.
- Created context variables and groups to run Talend jobs against different environments.
- Used Talend components tMap, tDie, tConvertType, tFlowMeter, tLogCatcher, tRowGenerator.
- Created the data model, data entities and views for Master Data Management (MDM)
- Involved in creating roles and access control
- Set up event management to listen continuously for events in the MDM hub
- Used triggers to launch processes based on a given set of conditions
- Worked on creating data models and campaigns for Talend Data Stewardship
- Created data entities in the data model and defined roles at the entity level
- Deployed the data model, data entities and views for Talend MDM
- Performed requirements gathering, data analysis using data profiling and Data Quality (DQ) scripts, and unit testing of ETL jobs.
- Created triggers for a Talend job to run automatically on server.
- Installed and configured Talend Enterprise Studio (Windows, UNIX) along with Java.
- Set up and managed transaction log shipping, SQL Server mirroring, failover clustering and replication.
- Worked on AMC tables (Error Logging tables)
Environment: Talend Platform for Data Management 5.6, UNIX scripting, Toad, Oracle 10g, SQL Server
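Illustrative sketch referenced in this section: the SCD jobs above were built in Talend; the following PySpark sketch only illustrates the SCD Type 2 logic itself (expire the changed current row, insert a new version). Assumed schemas and names are hypothetical: dw.dim_customer(customer_id, address, effective_from, effective_to, is_current) and staging.customer_extract(customer_id, address).

    # Illustrative only: a PySpark rendering of SCD Type 2 logic; the actual jobs used Talend.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scd2_sketch").enableHiveSupport().getOrCreate()

    dim = spark.table("dw.dim_customer")           # dimension with full history
    src = spark.table("staging.customer_extract")  # latest source snapshot
    current = dim.filter(F.col("is_current"))

    # Current rows whose tracked attribute changed in the source
    changed = (
        current.alias("d")
        .join(src.alias("s"), F.col("d.customer_id") == F.col("s.customer_id"))
        .filter(F.col("d.address") != F.col("s.address"))
    )
    changed_keys = changed.select(F.col("d.customer_id").alias("customer_id"))

    # 1) Expire the changed current rows (close out the old version)
    expired = (
        changed.select("d.*")
        .withColumn("is_current", F.lit(False))
        .withColumn("effective_to", F.current_date())
    )

    # 2) New versions for changed customers, plus customers not yet in the dimension
    incoming = changed.select("s.*").unionByName(
        src.join(current, "customer_id", "left_anti")
    )
    new_rows = (
        incoming
        .withColumn("effective_from", F.current_date())
        .withColumn("effective_to", F.lit(None).cast("date"))
        .withColumn("is_current", F.lit(True))
    )

    # 3) Keep every existing row except the current row of each changed customer
    kept = dim.join(
        changed_keys.withColumn("is_current", F.lit(True)),
        ["customer_id", "is_current"],
        "left_anti",
    )

    result = kept.unionByName(expired).unionByName(new_rows)
    result.write.mode("overwrite").saveAsTable("dw.dim_customer_scd2")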