PySpark Developer - Big Data Resume
MD
SUMMARY:
- 11.5 years of experience handling Data Warehousing and Business Intelligence projects in the Banking, Finance, Credit Card and Insurance industries, including 4.5+ years of experience as a Data Engineer.
- Evaluated technology stacks for building analytics solutions on the cloud by researching the right strategies and tools for end-to-end analytics solutions, and helped design the technology roadmap for Data Ingestion, Data Lakes, Data Processing and Visualization.
- Designed and developed real-time streaming pipelines for sourcing data from IoT devices, defining strategies for data lakes, data flow, retention, aggregation and summarization to optimize the performance of analytics products.
- Extensive experience in data analytics supporting marketing campaigns.
- Good knowledge of Hadoop architecture and its ecosystem.
- Extensive Hadoop experience in data storage, query writing, processing and analysis.
- Strong understanding of NoSQL databases such as HBase.
- Experience in Microsoft Azure data storage, Azure Data Factory, Azure Data Lake Store (ADLS), AWS S3, EC2 & Vault.
- Experience migrating on-premises ETL processes to the cloud.
- Experience working with Apache NiFi to ingest data into the big data platform from different source systems.
- Experience creating and loading data into Hive tables with appropriate static and dynamic partitions for efficiency (a minimal sketch follows this summary).
- Worked with various Hadoop file formats such as Parquet, ORC & Avro.
- Experience in Data Warehousing applications, responsible for the Extraction, Transformation and Loading (ETL) of data from multiple sources into the Data Warehouse.
- Experience optimizing Hive SQL queries and Spark jobs.
- Implemented various frameworks such as Data Quality Analysis, Data Governance, Data Trending, Data Validation and Data Profiling using technologies such as Big Data, Datastage, Spark, Python and Mainframe, with databases such as Netezza, DB2, Hive & Snowflake.
- Good knowledge of business process analysis and design, re-engineering, cost control, capacity planning, performance measurement and quality.
- Learned and implemented Velocidata to integrate AMEX mainframe data into the Big Data (Hadoop) analytics environment by offloading the mainframe transformation process.
- Experience creating technical documents such as Functional Requirements, Impact Analyses, Technical Design documents and Data Flow Diagrams with MS Visio.
- Experience delivering highly complex projects using Agile and Scrum methodologies.
- Quick learner, up-to-date with industry trends; excellent written and oral communication, analytical and problem-solving skills; a good team player, well-organized and able to work independently.
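The following is a minimal PySpark sketch of the dynamic-partition Hive loading referenced above; the application name, database, table and column names (staging.txn_raw, curated.txn_by_day, txn_id, amount, load_date) are hypothetical placeholders, not details from an actual project.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioned-load")
    .enableHiveSupport()            # required to read/write Hive metastore tables
    .getOrCreate()
)

# Allow Hive to derive partition values from the data instead of a static literal.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Hypothetical staging source; the partition column (load_date) is selected last
# so insertInto can map it onto the target table's partition spec.
source_df = spark.table("staging.txn_raw").select("txn_id", "amount", "load_date")

# Append into an existing Hive table partitioned by load_date.
source_df.write.mode("append").insertInto("curated.txn_by_day")
```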
PROFESSIONAL EXPERIENCE:
Confidential, MD
PySpark Developer - Big Data
Responsibilities:
- Design and develop ETL integration patterns using Python on Spark.
- Develop a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Create PySpark data frames to bring data from DB2 to Amazon S3 (see the sketch after this list).
- Translate business requirements into maintainable software components and understand their impact (technical and business).
- Provide guidance to the development team working on PySpark as an ETL platform.
- Ensure that quality standards are defined and met.
- Optimize PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Provide workload estimates to the client.
- Developed a framework for Behaviour-Driven Development (BDD).
- Migrated on-premises Informatica ETL processes to the AWS cloud and Snowflake.
- Implement a CI/CD (Continuous Integration and Continuous Delivery) pipeline for code deployment.
- Review components developed by team members.
Technologies: AWS Cloud, S3, EC2, PostgreSQL, Spark, Python 3.6, Big Data, Snowflake, Hadoop, Kubernetes, Docker, Airflow, Splunk, DB2, CI/CD
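The DB2-to-Amazon S3 copy mentioned above could look roughly like the sketch below; the JDBC URL, credentials, table name and bucket are placeholders, and it assumes the DB2 JDBC driver (com.ibm.db2.jcc.DB2Driver) is available on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

# Read a DB2 table over JDBC; connection details below are illustrative only.
db2_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")
    .option("dbtable", "SCHEMA.ACCOUNTS")
    .option("user", "db2_user")
    .option("password", "db2_password")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .load()
)

# Land the extract as Parquet in S3; the bucket and prefix are placeholders.
db2_df.write.mode("overwrite").parquet("s3a://example-landing-bucket/accounts/")
```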
Confidential
Data Engineer - Bigdata
Responsibilities:
- Data acquisition from internal/external data sources
- Create and maintain optimal data pipeline architecture
- Identify, design, and implement internal process improvements
- Automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability.
- Build the infrastructure required for optimal extraction, transformation and loading (ETL) of data from a wide variety of data sources such as Salesforce, SQL Server, Oracle & SAP using Azure, Spark, Python, Hive, Kafka and other big data technologies (a streaming ingest sketch follows this list).
- Data QA/QC for data transfers into the data lake and data warehouse.
- Build analytics tools that utilize the data pipeline to provide actionable insights into customer acquisition, operational efficiency and other key business performance metrics.
Technologies: Azure HDInsight, Databricks, Spark (2.1), Azure Data Factory, Azure Data Lake, HDFS, MapReduce, Hive, Kafka, ETL, Oozie, Python 2.7
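A minimal sketch of Kafka-based ingestion into a data lake zone using Spark Structured Streaming, in the spirit of the pipeline work above; the broker address, topic name and output paths are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to a Kafka topic; broker and topic names are illustrative.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary, so cast to string before landing as Parquet.
query = (
    events.select(col("key").cast("string"), col("value").cast("string"))
    .writeStream
    .format("parquet")
    .option("path", "/datalake/raw/events/")
    .option("checkpointLocation", "/datalake/checkpoints/events/")
    .start()
)

query.awaitTermination()
```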
Confidential
Data Engineer - Bigdata
Responsibilities:
- Migrated on-prem ETLs from MS SQL Server to the Azure cloud using Azure Data Factory and Databricks.
- Built reusable data ingestion and data transformation frameworks using Python.
- Built, enhanced and optimized data pipelines using the reusable frameworks to support the data needs of the analytics and business teams using Spark and Kafka.
- Migration of data warehouse from SQL Server to Hadoop and Hive.
- Responsible for architecting a complex data layer to source raw data from a variety of sources, generate derived data per business requirements, and feed the data to BI reporting and the data science team.
- Designed and built Data Quality frameworks covering aspects such as completeness, accuracy and coverage using Python, Spark and Kafka.
- Used Python for SQL/CRUD operations in the database and for file extraction, transformation and generation.
- Optimized Hive queries using best practices and the right parameters, leveraging technologies such as Hadoop, YARN, Python and PySpark.
- Coordinated with business customers to gather business requirements, interacted with technical peers to derive technical requirements, and delivered the BRD and TDD documents.
- Created Hive raw and standardized tables with partitioning and bucketing for data validation and analysis.
- Involved in loading and transforming large sets of structured and semi-structured data from multiple data sources into the Raw Data Zone (HDFS) using Sqoop imports and Spark jobs.
- Developed Hive queries for data sampling and analysis for the analysts.
- Wrote Sqoop queries to import data into Hadoop from SQL Server tables.
- Configured and used Query Surge tool to connect with HBase using Apache Phoenix for Data Validation.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
- Worked on reading and writing multiple data formats such as JSON, ORC and Parquet on HDFS using PySpark.
- Involved in converting Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS.
- Responsible for data ingestion into the big data platform using Spark Streaming and Kafka.
- Worked on a POC using SAP Smart Data Access to export SAP HANA data to the big data platform using Apache Spark.
- Implemented an ETL framework using Spark with Python and loaded standardized data into Hive and HBase tables.
- Analyzed data from different sources on the Hadoop big data platform by implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, HDInsight, Hive and Sqoop.
- Migrated on-premises data (SQL Server) to Azure Data Lake Store (ADLS) using Azure Data Factory.
- Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
- Implemented Oozie workflow engine to run multiple Hive and Python jobs.
- Involved in automating big data jobs on the Microsoft HDInsight platform and managing logs.
- Responsible for creating design documents, establishing specific solutions and creating test cases.
- Responsible for closing the defects identified by QA team and responsible for managing the Release process for the modules.
Technologies: Azure HDInsight, Databricks, Hortonworks Data Platform, Hadoop (2.6), Spark (2.1), HDFS, MapReduce, Hive, Kafka, YARN, ZooKeeper, Oozie, Python 2.7, Scala, Apache NiFi (1.1), Hue, ETL, MS SQL Server, Shell scripting, Maestro, AQT, XML, Query Surge
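A minimal sketch of the CSV-to-Hive-ORC load called out above; the paths, database/table names and target column list are hypothetical, and schema alignment is shown only as null-padding of missing columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Expected target layout; column names are illustrative.
target_cols = ["customer_id", "txn_date", "amount"]

# Read one CSV extract layout at a time; in practice this would loop over layouts.
raw = spark.read.option("header", "true").csv("hdfs:///raw/extracts/layout_a/")

# Pad any columns missing from this layout with nulls so it matches the target table.
aligned = raw.select(
    *[raw[c] if c in raw.columns else lit(None).cast("string").alias(c)
      for c in target_cols]
)

# Append into a Hive table stored as ORC.
aligned.write.mode("append").format("orc").saveAsTable("standardized.transactions")
```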
Confidential, Phoenix, AZ
Senior Developer
Responsibilities:
- In-depth expertise in implementing ETL solutions.
- Responsible for leading a project team in delivering solutions to the customer.
- Deliver new and complex high-quality solutions to clients in response to varying business requirements.
- Responsible for managing the scope, planning, tracking and change control aspects of the project.
- Responsible for effective communication between the project team and the onsite project counterpart. Provide day to day direction to the project team and regular project status to the onsite counterpart.
- Responsible for creating the design documents, establish specific solutions.
- Establish quality procedure for the team and continuously monitor and audit to ensure team meets quality goals.
- Worked extensively with NZLOAD and NZSQL to migrate data from DB2 to the Netezza database.
- Worked on migrating the existing DB2 data to Netezza.
- Involved in loading data into Netezza from legacy systems and flat files using UNIX scripts.
- Used Netezza GROOM to reclaim space for tables and databases.
- Responsible for developing and reviewing shell scripts and developing Datastage jobs.
- Involved in gathering Business Requirements from Business users. Analysed all jobs in the project and prepared ADS document for the impacted jobs.
- Designed, developed and tested the Datastage jobs using Designer and Director based on business requirements and business rules to load data from source to target tables.
- Worked on various stages like Sequential file, Hash file, Aggregator, Funnel, Change Capture, Change Apply, Row Generator, Peek, Remove Duplicates, Copy, Lookup, Join, Merge, Filter, Datasets during the development process of the Datastage jobs.
- Deployed the code to all other test environments and ensured QA passed all their test cases.
- Resolved defects raised by QA.
- Established best practices for Datastage jobs to ensure optimal performance, reusability, and restartability.
- Involved in developing Business reports by writing complex SQL queries
- Used SQL explain plans in SQL Developer to fine-tune the SQL code used to extract data in the database stages.
- Used Control-M to schedule, run and monitor Datastage jobs.
- Worked on standing up the Performance environment by creating value files, parameter sets and Unix shell scripts, and running one-time PL/SQL scripts to insert/update values in various tables.
- Extracted data from the DB2 database and loaded it into downstream mainframe files for generating reports.
- Ability to propose sound design/architecture solutions for Mainframe.
- Able to propose or choose the best possible distribution for a table based on the type and contents of its columns.
- Hands-on experience optimizing DB2 queries and stored procedure costs using APPTUNE.
- Used various JCL SORT steps to join, merge, sort and filter data on various conditions.
- Presented and Detailed Product Backlog Items to the Scrum Team in the Sprint Planning Sessions and assisted them in arriving at the Story Points for the User Stories.
- Reviewed user Stories and Acceptance Criteria with the team
- Assisted the Scrum Master in Creating and Managing the Release Planning Documents
- Worked closely with the Scrum Dev architects and Content Engineers on design and development.
- Worked with the Scrum QA team to go over the various test scenarios for different types of system records data.
Technologies: Big Data, ETL, Datastage 8.1/8.7/11.3, PL/SQL, DB2, Netezza, UNIX/Linux shell scripting, Control-M, AIX platform, HP Quality Centre, TOAD, Mainframe, XML, APPTUNE
Confidential
Application Developer
Responsibilities:
- Monitored job status and resolved errors across the environments.
- Ran jobs and scripts to check application status.
- Overrode and restarted jobs as per requirements.
- Involved in IMS setup between HP and IBM.
- Performed data refreshes, plex build activities and file fixes.
- Performed batch processing of CLAIMS and analysis of the batch runs.
Technologies: Mainframe, COBOL, JCL, DB2, VSAM, CICS, APPTUNE, SPUFI, QMF, Jobtrack, HPSD, REXX
Confidential
Software Engineer
Responsibilities:
- Played an important role in understanding the core business logic for maintenance and bug fixes.
- Interacted with the client to specify user requirements.
- Prepared Test cases and Test Case Documents.
- Modified and generated General Ledger report.
- Modified and generated feeds sent to DTCC as per business requirements.
- Handled and Implemented DTCC changes to JETS and PC Mapper.
- Generated the C-tax report for the May run and year-end run as per business needs.
- Enhanced and modified Pyramid reports according to business needs.
- Handled change management requests.
- Analyzed frequently ABENDing jobs in production and fixed them.
Technologies: COBOL, JCL, DB2, VSAM, CICS, APPTUNE, SPUFI, QMF, Jobtrack, JETS, REXX