Senior Data Engineer Resume
Charlotte, NC
SUMMARY
- Over 13 years of extensive experience in varied data warehousing technologies, Big Data and Hadoop ecosystems, data modeling, data integration, data quality, data migration, and OLAP reporting.
- Extensive experience in Requirements gathering, Profiling, Design, Modeling, Development and Testing of Enterprise Data warehouse, Data Marts, ODS and Data Quality Frameworks.
- 7 years of extensive data engineering experience on Hadoop ecosystems and Spark, with hands-on work in multiple greenfield and migration projects under Agile and Waterfall methodologies.
- Lifetime learner, quick to adapt to new technologies.
- Extensively performed senior data engineer activities such as data ingestion, data profiling, and data analysis for the Enterprise Retail and Wholesale Credit Authorized Data Sources.
- Excellent knowledge of data sourcing, data processing, and distribution.
- Extensively worked on requirements gathering, profiling and analysis, design, development, and testing for Wholesale Credit data marts.
- Extensive knowledge of building new data pipelines, identifying existing data gaps, and providing automated solutions that deliver analytical capabilities and enriched data to applications.
- Built scalable and reusable components in Scala and Python for the most commonly used ETL operations such as SourceOp, TargetOp, JoinerOp, SCD1, AggregatorOp, and FilterOp (an illustrative sketch follows this summary).
- Extensive use of shell scripting to integrate and schedule jobs with schedulers such as Autosys and crontab.
- Excellent knowledge of AWS cloud services such as S3, Glue, Redshift, and Athena.
- Proficient in writing packages, stored procedures, functions, views, and database triggers using T-SQL, Netezza, and Oracle.
- Sound knowledge of data warehousing concepts: dimensional data modeling, relational data modeling, data aggregation, and star and snowflake schemas.
- Sound knowledge of Hadoop components such as Sqoop, Hive, Beeline, Spark SQL, Oozie, and Hue.
- Developed reusable components for logging framework, workflow execution, purging and archival processes using Python, SQL and Scala.
- Worked with different file formats such as Avro, Parquet, and text, stored on HDFS and S3.
- Built CI/CD and automated deployment processes using shell scripting.
- Worked with version control tools such as SVN and Bitbucket.
- Sound knowledge of Teradata and Netezza architecture; expertise in Teradata utilities such as TPT, FastLoad, MultiLoad (MLoad), and BTEQ.
- Expertise in Netezza utilities such as nzsql, nzload, and nzmigrate.
- Expertise in Informatica PowerCenter administration and development; brought innovative ideas into scalable products that are still in use today.
- Streamlined code merges and deployments for Informatica and OBIEE objects, resolving recurring issues.
- Profound experience loading data from flat files into Oracle using Oracle SQL*Loader.
- Expert knowledge of UNIX shell scripting.
- Excellent knowledge of Python programming.
- Excellent knowledge of performance tuning for optimal performance in SQL Server, Oracle, and Spark 3.0; sound knowledge of collect statistics, join strategies, join types, and explain/optimizer plans.
- Excellent presentation skills; prepared presentations and numerous data flow and process flow diagrams.
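Code sketch (illustrative): a minimal Scala outline of the reusable Spark ETL operations mentioned above. The Op trait and the FilterOp/JoinerOp signatures are assumptions for illustration, not the proprietary framework itself.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Each operation is a small, composable unit that consumes and returns DataFrames.
    trait Op {
      def run(spark: SparkSession, inputs: Seq[DataFrame]): DataFrame
    }

    // FilterOp: keeps rows matching a SQL predicate supplied through configuration.
    class FilterOp(predicate: String) extends Op {
      override def run(spark: SparkSession, inputs: Seq[DataFrame]): DataFrame =
        inputs.head.filter(predicate)
    }

    // JoinerOp: joins the first two inputs on the given keys with the configured join type.
    class JoinerOp(keys: Seq[String], joinType: String = "inner") extends Op {
      override def run(spark: SparkSession, inputs: Seq[DataFrame]): DataFrame =
        inputs(0).join(inputs(1), keys, joinType)
    }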
TECHNICAL SKILLS
Big Data and Hadoop Ecosystem: Apache Spark 3.0/2.0.0, CDH 5.8.3, HDFS 2.7.3, Hive 2.0.0, Impala 2.7.0, Sqoop 1.4.6, MapReduce, Oozie 3.1.0
RDBMS: SQL Server 2017, Oracle 12c, Netezza, Teradata 14.11.0.1
Programming: PySpark, Python, Scala, Shell Scripting, Bash
Cloud Services: AWS Glue, EMR, Redshift, Athena
ETL Tools: Informatica PowerCenter 10.2/9.6.1 (Administration & Development), Informatica Cloud Services, SSIS
IDE: MS Visual Studio, Jupyter, IntelliJ, PyCharm
Scheduling: Autosys, Crontab, Airflow
Reporting Tools: OBIEE 11.1.1.7.x/11.1.1.6.x, MSBI
EIM, DQ & DP Tools: Informatica Data Explorer, Informatica Data Quality, MDM
Other Tools: Git, JIRA, Bitbucket, PuTTY, WinSCP
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Senior Data Engineer
Responsibilities:
- Worked on the end-to-end process, from requirements gathering through delivery, of building a reusable logging framework that can be embedded in all Spark jobs to emit custom logging messages (see the sketch at the end of this role's responsibilities).
- Used the datamover framework to integrate with log4j logging and customized it to generate messages related to the job flow at every stage of the flow.
- Wrote Scala APIs to customize the logger classes and integrated them into the various operations (transformations) such as Source, Target, Filter, and Joiner.
- Wrote a Scala API to generate complete log information at different levels of depth based on the log level, create a CSV, and finally save it to a Hive table for auditing.
- Worked on the ETL data pipelines for the batch processing, involving various transformations.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory settings.
- Experienced in handling large datasets using partitioning, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
- Designed, developed, and maintained data integration programs in Hadoop and RDBMS environments with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed PySpark and Spark SQL code to process DataFrames in Apache Spark.
- Created a data quality job in Python to compare two DataFrames.
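Code sketch (illustrative): a minimal Scala outline of the stage-level job logging described above, using a log4j logger and appending an audit row to a Hive table. The JobLogger object and the audit.job_stage_log table name are assumptions; the actual datamover integration is proprietary.

    import org.apache.log4j.Logger
    import org.apache.spark.sql.SparkSession

    object JobLogger {
      private val log = Logger.getLogger(getClass.getName)

      // Wraps one stage of the job flow, logging start/end and recording an audit row.
      def stage[T](spark: SparkSession, jobName: String, stageName: String)(body: => T): T = {
        log.info(s"$jobName - $stageName started")
        val start = System.currentTimeMillis()
        val result = body
        val elapsedMs = System.currentTimeMillis() - start
        log.info(s"$jobName - $stageName finished in $elapsedMs ms")
        // Append the audit record to a Hive table (table name is illustrative).
        import spark.implicits._
        Seq((jobName, stageName, elapsedMs)).toDF("job_name", "stage_name", "elapsed_ms")
          .write.mode("append").saveAsTable("audit.job_stage_log")
        result
      }
    }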
Environment: Spark 2.3.2, Spark SQL, Python 3.0, PySpark, SQL Server, Scala, Hive, Git, Linux, Shell Scripting.
Confidential - Charlotte, NC
Sr. Big data Developer
Responsibilities:
- Gathered requirements with the business and defined low-level design documents.
- Re-architected the current sourcing and distribution system from Talend and Netezza to Hadoop and Spark.
- Worked with large datasets, sourcing from disparate source systems such as Teradata, flat files, SQL Server, SFTP pulls, and trigger-based feeds.
- Developed the end-to-end process for building active customers for the wholesale credit risk application, performing various transformations to build the real, local, secured, and inactive customer sets.
- Performed the CDC process for customers in both the daily and monthly flows (see the CDC sketch at the end of this role's responsibilities).
- Developed the wholesale credit risk application using Scala and Hive.
- Contributed to the proprietary big data processing framework that Confidential uses across multiple teams, built with big data technologies such as Hadoop, Scala, Python, Hive, and Impala.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Analyzed issues in the production environment of the credit risk application during daily/monthly batch runs and provided technical solutions to defects raised during the batch.
- Built the DAG flow for parallel execution of the Spark jobs.
- Created conf (configuration) files for sequential execution of the jobs, creating persistent tables in the process where required.
- Created various DataFrames to move data through extraction, standardization, transformation, and loading to the target.
- Wrote shell scripts for ETL and deployment jobs; created Autosys JILs to schedule command jobs and box jobs and to establish dependencies between jobs.
- Updated the code repository (Bitbucket) and maintained the golden copy of the code.
- Built minimum viable products, based on the tasks in each sprint, using big data technologies.
- Maintained coding standards for easy understanding and maintenance; debugged failures and identified technical solutions for bugs.
- Performance-tuned processes by identifying bottlenecks and ensured quick job execution.
- Developed ETL framework using Python and Hive (including daily runs, error handling, and logging) to clean useful data and improve vendor negotiations.
- Involved in the complete SDLC of multiple assignments, from requirements gathering and FSD through design, development, testing, deployment, and production support.
- Developed various mappings using HTTP, Web Service Consumer, Application Source Qualifier, Aggregator, Filter, Expression, Lookup, SQL, Router, and Update Strategy transformations.
- Developed an integration solution to implement access management, synchronizing Salesforce access control with IES access control and paving the way for a cross-platform unified user management experience.
- Developed SCD mappings to identify new, modified, and disabled records between the LDAP and Salesforce systems.
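Code sketch (illustrative): a minimal Scala outline of the daily CDC comparison described above, splitting the current snapshot into inserts, updates, and deletes against the prior snapshot. The cust_id key and the hash-over-all-attributes approach are assumptions for illustration.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    object Cdc {
      // Adds a hash over all non-key columns so changed rows can be detected cheaply.
      private def withHash(df: DataFrame, key: String): DataFrame =
        df.withColumn("row_hash", sha2(concat_ws("|", df.columns.filter(_ != key).map(col): _*), 256))

      // Returns (inserts, updates, deletes) between the previous and current snapshots.
      def delta(prev: DataFrame, curr: DataFrame, key: String = "cust_id"): (DataFrame, DataFrame, DataFrame) = {
        val p = withHash(prev, key).select(col(key), col("row_hash").as("prev_hash"))
        val c = withHash(curr, key)
        val inserts = c.join(p, Seq(key), "left_anti")   // new keys
        val deletes = p.join(c, Seq(key), "left_anti")   // keys no longer present
        val updates = c.join(p, Seq(key), "inner").filter(col("row_hash") =!= col("prev_hash"))
        (inserts, updates, deletes)
      }
    }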
Environment: Informatica PowerCenter 9.6.1 and 10.2, Salesforce.com, Salesforce Marketing Cloud, LDAP, Linux, SQL Server, Spark 2.0.0, Hive 2.0.0, Impala 2.7.0, Sqoop 1.4.6
Confidential, Charlotte, NC
Sr. Application Developer
Responsibilities:
- Gathered requirements with the business and defined low-level design documents.
- Re-architected the current sourcing and distribution system from Talend and Netezza to Hadoop (HaaS).
- Worked on structured and semi-structured data with daily incremental loads of 1 TB and monthly/quarterly loads of several TBs.
- Developed the sourcing logic, including data staging, cleansing, standardization, archiving, and purging, through Pig, Sqoop, Hive, and Oozie workflows for multiple SORs in the financial domain.
- Optimized Hive 2.0.0 scripts to use HDFS efficiently by using various compression mechanisms.
- Extracted data from Netezza, Teradata, and flat files into HDFS using Sqoop (see the ingestion sketch at the end of this role's responsibilities).
- Worked on FastLoad and FastExport to move data from one environment to another.
- Wrote BTEQ scripts to transform data from Netezza to the Teradata staging environment.
- Implemented authentication using Kerberos and authorization using Apache Sentry.
- Created a data quality framework to comply with the bank's Data Governance (EDM) team in establishing the Peaks application as an Authorized Data Source (ADS) for Confidential.
- Created environments such as Dev, SIT, UAT, and Pre-Prod to support quality assurance.
- Created automation scripts to refresh the UAT, SIT, and Pre-Prod Hadoop environments, covering source code, DDLs, environment variables, and configuration.
- Worked with business and QA teams to resolve issues during testing and to explain gaps/differences between Prod and other lanes.
- Used Netezza and Teradata windowing functions and load utilities; coordinated with the offshore team for timely execution of projects.
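Code sketch (illustrative): the production ingestion above used Sqoop; this Spark JDBC variant in Scala only illustrates the same pattern of landing an RDBMS table into HDFS as Parquet. The connection URL, credentials, table name, and HDFS path are placeholders, and the Netezza JDBC driver is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    object LandTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("land-table").enableHiveSupport().getOrCreate()

        // Read one source table over JDBC (all connection details are placeholders).
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:netezza://nz-host:5480/SALES")
          .option("dbtable", "STG.CUSTOMER")
          .option("user", sys.env.getOrElse("DB_USER", ""))
          .option("password", sys.env.getOrElse("DB_PASS", ""))
          .load()

        // Land a dated partition in the HDFS staging area (path is illustrative).
        df.write.mode("overwrite").parquet("hdfs:///data/staging/customer/dt=2020-01-01")
        spark.stop()
      }
    }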
Environment: Informatica PowerCenter 9.5.1, Talend, Netezza 14.11.0.1, Sqoop, Hive, Flume, Oozie, Pig, RHEL Linux, SQL Server.