Sr. Data Engineer/Cloud Data Engineer Resume
Nashville
SUMMARY
- Around 8+ years of professional work experience as a Data Engineer, working with Python, Spark, AWS, SQL, and MicroStrategy in the design, development, testing, and implementation of business application systems for the Health Care and Educational sectors.
- Extensively worked on system analysis, design, development, testing, and implementation of projects (SDLC); capable of handling responsibilities independently as well as being a proactive team member.
- Experience in setting up Hadoop clusters on cloud platforms like AWS and GCP.
- Hands-on experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena, Redshift, Sqoop, and Hive.
- Hands-on experience working with GCP services such as BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (Stackdriver).
- Hands on experience in programming using Python, Scala, Java and SQL.
- Sound knowledge of architecture of Distributed Systems and parallel processing frameworks.
- Designed and implemented end-to-end data pipelines to extract, cleanse, process and analyze huge amounts of behavioral data and log data.
- Good experience as an HDFS/PySpark developer using big data technologies such as the Hadoop and Spark ecosystems.
- Good experience working with various data analytics services in AWS Cloud such as EMR, Redshift, S3, Athena, and Glue.
- Experienced in developing production-ready Spark applications using Spark RDD APIs, DataFrames, Spark SQL, and the Spark Streaming APIs.
- Worked extensively on fine-tuning Spark applications to improve performance and troubleshooting failures in Spark applications; solid understanding of Spark architecture including Spark Core, Spark SQL, DataFrames, and RDDs for PySpark, as well as the pandas library.
- Extensively worked on Spark using Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Strong experience in using Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, different levels of caching, and optimization techniques for Spark jobs.
- Proficient in importing/exporting data from RDBMS to HDFS using Sqoop.
- Used Hive extensively to perform various data analytics required by business teams.
- Solid experience working with various data formats such as Parquet, ORC, Avro, and JSON.
- Experience automating end-to-end data pipelines with strong resilience and recoverability.
- Experience in creating Impala views on hive tables for fast access to data.
- Experienced in using waterfall, Agile and Scrum models of software development process framework.
- Good knowledge in Oracle PL/SQL and shell scripting.
- Database/ETL Performance Tuning: Broad experience in database development including effective use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions, sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions.
- Experienced process-oriented Data Analyst having excellent analytical, quantitative, and problem-solving skills using SQL, MicroStrategy, Advanced Excel, Python.
- Proficient in writing unit tests using unittest/pytest and integrating the test code with the build process.
- Used Python scripts to parse XML and JSON reports and load the information into a database.
- Experienced with version control systems like Git, GitHub, Bitbucket to keep the versions and configurations of the code organized.
TECHNICAL SKILLS
Big Data Eco-System: HDFS, GPFS, Hive, Sqoop, Spark, YARN, Pig, Kafka
Hadoop Distributions: Hortonworks, Cloudera, IBM Big Insights
Operating Systems: Windows, Linux (CentOS, Ubuntu)
Programming Languages: Python, Scala, Shell Scripting
Databases: Hive, MySQL, Netezza, SQL Server
IDE Tools & Utilities: IntelliJ IDEA, Eclipse, PyCharm, Aginity Workbench, Git
Markup Languages: HTML
ETL: DataStage 9.1/11.5 (Designer/Monitor/Director)
Job Scheduler: Control-M, IBM Platform Symphony, Ambari, Apache Airflow
Reporting Tools: Tableau, Lumira
Cloud Computing Tools: AWS, GCP
Methodologies & Tools: Agile, Scrum, Asana, Jira
Others: MS Office, RTC, ServiceNow, OPTIM, IGC (InfoSphere Governance Catalog), WinSCP, MS Visio
PROFESSIONAL EXPERIENCE
Confidential, Nashville
Sr. Data Engineer/Cloud Data Engineer
Responsibilities:
- Involved in writing Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
- Used Spark SQL to migrate data from Hive into Python using the PySpark library.
- Used Hadoop technologies such as Spark and Hive, including the PySpark library, to create Spark DataFrames and convert them to pandas DataFrames for analysis (see the first sketch after this section).
- Used AWS Redshift, S3, Spectrum and Athena services to query large amounts of data stored on S3 to create a Virtual Data Lake without having to go through the ETL process.
- Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL/Teradata, and developed code for reading multiple data formats on HDFS using PySpark.
- Loaded the data into Spark DataFrames and performed in-memory computations to generate the output as per the requirements.
- Worked on AWS Cloud to convert all existing on-premises processes and databases to AWS Cloud.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Designed, developed, and created ETL (Extract, Transform, and Load) packages using Python to load data into the data warehouse (Teradata) from databases such as Oracle and MS SQL Server.
- Utilized the built-in Python json module to parse member data in JSON format using json.loads/json.dumps and load it into a database for reporting (see the second sketch after this section).
- Developed a PySpark job to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Developed a daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
- Analyzed the SQL scripts and designed the solution to implement using Spark.
- Developed Python code to gather the data from HBase and designed the solution to implement using PySpark.
- Worked on importing metadata into Hive using Python and migrated existing tables and the data pipeline from Legacy to AWS cloud (S3) environment and wrote Lambda functions to run the data pipeline in the cloud.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Python libraries and SQL queries/subqueries to create several datasets that produced statistics, tables, figures, charts, and graphs; good experience with software development IDEs such as PyCharm and Jupyter Notebook.
- Extensively worked with Partitions, Dynamic Partitioning, bucketing tables in Hive, designed both Managed and External tables, also worked on optimization of Hive queries.
- Consumed REST APIs using the Python requests library (GET and POST operations) to fetch and post member data to different environments (see the second sketch after this section).
- Used the pandas API to put the data into time-series and tabular formats for timestamp-centric data manipulation and retrieval during various loads into the DataMart.
- Worked on bash scripting to automate the Python jobs for day-to-day administration.
- Performed data extraction and manipulation over large relational datasets using SQL, Python, and other analytical tools.
- Extensively worked with Teradata utilities such as BTEQ, FastExport, FastLoad, and MultiLoad to export and load Claims & Callers data to/from different source systems, including flat files.
Environment: AWS EMR, AWS Glue, Redshift, Hadoop, HDFS, Teradata, SQL, Oracle, Hive, Spark, Python, Sqoop, MicroStrategy, Excel.
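A minimal PySpark sketch of the Hive-to-pandas workflow described in the bullets above (reading a Hive table through Spark SQL, cleansing and summarizing it, then converting the small aggregated result to pandas); the table and column names are hypothetical placeholders, not the actual project objects.

```python
# Minimal sketch, assuming a Hive metastore is configured for this SparkSession.
# claims_db.member_claims and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-cleansing-example")
    .enableHiveSupport()          # lets spark.sql() read Hive tables
    .getOrCreate()
)

# Read a Hive table through Spark SQL
claims = spark.sql(
    "SELECT member_id, claim_amount, claim_date FROM claims_db.member_claims"
)

# Basic cleansing / validation / summarization
summary = (
    claims
    .dropna(subset=["member_id", "claim_amount"])        # cleanse
    .filter(F.col("claim_amount") > 0)                   # validate
    .groupBy("member_id")
    .agg(F.sum("claim_amount").alias("total_claimed"))   # summarize
)

# Convert the (small) aggregated result to a pandas DataFrame for analysis
summary_pdf = summary.toPandas()
print(summary_pdf.head())
```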
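A second minimal sketch covering the REST/JSON member-data handling mentioned above (requests GET/POST, json parsing, and a database load for reporting); the endpoint URL, payload fields, and the SQLite reporting table are hypothetical stand-ins for the actual services and database.

```python
# Minimal sketch, assuming a JSON REST endpoint; all names below are hypothetical.
import json
import sqlite3   # stand-in for the actual reporting database

import requests

BASE_URL = "https://api.example.com/members"   # hypothetical endpoint

# GET member data and parse the JSON body
resp = requests.get(BASE_URL, params={"status": "active"}, timeout=30)
resp.raise_for_status()
members = json.loads(resp.text)                # equivalent to resp.json()

# POST a member record to another environment
new_member = {"member_id": "M123", "plan": "gold"}
requests.post(BASE_URL, json=new_member, timeout=30).raise_for_status()

# Load the parsed records into a table for reporting
conn = sqlite3.connect("reporting.db")
conn.execute("CREATE TABLE IF NOT EXISTS members (member_id TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO members (member_id, plan) VALUES (?, ?)",
    [(m["member_id"], m["plan"]) for m in members],
)
conn.commit()
conn.close()
```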
Confidential - Brooklyn, NY
GCP Data Engineer
Responsibilities:
- Responsible for developing a highly scalable and flexible authority engine for all customer data.
- Worked on resetting customer attributes that provide insight about the customer, such as purchase frequency, marketing channel, and Groupon deal categorization, drawing on different sources of data using SQL, Hive, and Scala.
- Integrated third-party agency data (gender, age, and purchase history from other sites) into the existing data store.
- Normalized the data according to the business needs like data cleansing, modifying the data types and various transformations using Spark, Scala and GCP Dataproc.
- Implemented dynamic partitioning in BigQuery tables and used appropriate file formats and compression techniques to improve the performance of PySpark jobs on Dataproc.
- Developed PySpark code to mimic the transformations performed in the on-premises environment.
- Analyzed the SQL scripts and designed solutions to implement using PySpark.
- Queried data using Spark SQL on top of PySpark jobs to perform data cleansing and validation, applied transformations, and executed the programs using the Python API.
- Used the Kafka HDFS Connector to export data from Kafka topics to HDFS files in a variety of formats, integrating with Apache Hive to make the data immediately available for SQL querying.
- Built a system for analyzing column names from all tables and identifying personal-information columns across on-premises databases during data migration to GCP.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python (see the first sketch after this section).
- Worked on partitions of Pub/Sub messages and setting up the replication factors.
- Operations-focused, including building proper monitoring of data processes and data quality.
- Effectively worked and communicated with product, marketing, business owners, and business intelligence and the data infrastructure and warehouse teams.
- Performed analysis on data discrepancies and recommended solutions based upon root cause.
- Designed and developed job flows using Apache Airflow.
- Worked with IntelliJ IDEA, Eclipse, Maven, SBT, and Git.
- Worked on a data pipeline built on top of Spark using Scala.
- Designed, developed, and created ETL (Extract, Transform, and Load) packages using Python and SQL Server Integration Services (SSIS) to load data from Excel workbooks and flat files into the data warehouse (Microsoft SQL Server).
- Implemented an application for cleansing and processing terabytes of data using Python and Spark.
- Developed packages using Python, shell scripting, and XML to automate some of the menial tasks.
- Used Python to write data into JSON files for testing student item-level information.
- Created scripts for data modelling and data import and export.
- Developed SSIS Packages to extract Student data from source systems such as Transactional system for online assessments and legacy system for paper pencil assessments, transform data based on business rules and load the data into reporting DataMart tables such as dimensions, facts and aggregated fact tables.
- Developed T-SQL (transact SQL) queries, stored procedures, user-defined functions, built-in functions.
- Used advanced SQL and dynamic SQL methods, creating PIVOT and UNPIVOT functions, dynamic table expressions, and dynamic executions loaded through parameters and variables for generating data files.
- Used windowing functions such as ROW_NUMBER, RANK, DENSE_RANK, and NTILE to order data and remove duplicates in source data before loading to the DataMart for better performance (see the second sketch after this section).
- Worked on performance tuning of existing and new queries using SQL Server Execution plan, SQL Sentry Plan Explorer to identify missing indexes, table scans, index scans.
- Redesigned and tuned stored procedures, triggers, UDFs, views, and indexes to increase the performance of slow-running queries.
- Expertise in Snowflake for creating and maintaining tables and views.
- Optimized queries by adding necessary non-clustered indexes and covering indexes.
- Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie charts, bar graphs for display purposes as per business needs.
- Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
- Designed SSRS reports using parameters, drill down options, filters, sub reports.
- Developed internal dashboards for the team using Power BI tools for tracking daily tasks.
Environment: Python, GCP, Spark, Hive, Scala, Snowflake, Jupyter Notebook, Shell Scripting, SQL Server 2016/2012, T-SQL, SSIS, Visual Studio, Power BI, PowerShell.
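A minimal Apache Beam (Python) sketch of the Pub/Sub-to-BigQuery Dataflow pattern described above; the project, bucket, topic, dataset, and schema names are hypothetical placeholders.

```python
# Minimal sketch, assuming JSON messages on a Pub/Sub topic; names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,                      # unbounded (streaming) mode
    project="my-gcp-project",            # hypothetical project
    temp_location="gs://my-bucket/tmp",  # hypothetical bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/orders")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-gcp-project:analytics.orders",
            schema="order_id:STRING,amount:FLOAT,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

Run locally with the DirectRunner for testing, or pass the DataflowRunner options to execute on Cloud Dataflow.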
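A second sketch showing the ROW_NUMBER-style de-duplication described above, expressed here in PySpark for consistency with the rest of this resume (the original work used T-SQL window functions); the column names and sample rows are hypothetical.

```python
# Minimal sketch of "keep the latest row per key" using a window + row_number().
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dedup-example").getOrCreate()

rows = [("s1", "2023-01-01", 90), ("s1", "2023-02-01", 95), ("s2", "2023-01-15", 80)]
df = spark.createDataFrame(rows, ["student_id", "assessed_on", "score"])

# Mirrors ROW_NUMBER() OVER (PARTITION BY student_id ORDER BY assessed_on DESC)
w = Window.partitionBy("student_id").orderBy(F.col("assessed_on").desc())
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)   # keep only the most recent record per student
      .drop("rn")
)
deduped.show()
```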
Confidential - Chicago, IL
Hadoop Engineer
Responsibilities:
- Developed Hive Scripts, Hive UDFs, Python Scripting and used Spark (Spark-SQL, Spark-shell) to process data in Hortonworks.
- Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
- Converted traditional ETL pipelines built in Ab Initio to PySpark applications, containerized them using Docker images, and hosted them on the OpenShift platform.
- Implemented Partitioning, Dynamic Partitions and Buckets in HIVE & Impala for efficient data access.
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Extensively worked on HiveQL and join operations, wrote custom UDFs, and have good experience in optimizing Hive queries.
- Designed and developed Scala code for pulling data from cloud-based systems and applying transformations to it.
- Used Sqoop to import data into HDFS from MySQL databases and vice versa.
- Implemented optimized joins to perform analysis on different data sets using MapReduce programs.
- Created continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps to automate steps in software delivery process.
- Experienced in running queries using Impala and used BI and reporting tools (Tableau) to run ad-hoc queries directly on Hadoop.
- Worked on Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, for Hive jobs.
- Collected data using Spark Streaming and loaded it into HBase and Cassandra; used the Spark-Cassandra Connector to load data to and from Cassandra.
- Collected and aggregated large amounts of log data using Kafka and staged the data in an HDFS data lake for further analysis.
- Analyzed SQL scripts and designed the solutions to implement using PySpark.
- Experience in loading and transforming large data sets of structured, unstructured, and semi-structured data in Hortonworks.
- Experience in using Spark framework with Scala and Python. Good exposure to performance tuning hive queries and MapReduce jobs in Spark (Spark-SQL) framework on Hortonworks.
- Developed Scala and Python scripts and UDFs using both DataFrames/Spark SQL and RDD/MapReduce for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Configured Spark Streaming receivers to consume Kafka input streams and specified the exact block interval for processing data into HDFS using Scala (see the sketch after this section).
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and HBase Filters to compute various metrics for reporting on the dashboard.
- Developed shell scripts in UNIX environment to automate the dataflow from source to different zones in HDFS.
- Created and defined job workflows per their dependencies in Oozie, set up e-mail notification upon job completion for the teams requesting the data, and monitored jobs using Oozie on Hortonworks.
- Experience in designing both time driven and data driven automated workflows using Oozie.
Environment: Hadoop (Cloudera), HDFS, Map Reduce, Hive, Scala, Python, Pig, Sqoop, AWS, Azure, DB2, UNIX Shell Scripting, JDBC.
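A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion pattern described above (the original work used Scala receivers/DStreams); broker addresses, topic, and paths are hypothetical, and running it assumes the spark-sql-kafka connector package is on the Spark classpath.

```python
# Minimal sketch, assuming the spark-sql-kafka connector is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-example").getOrCreate()

# Subscribe to a Kafka topic (hypothetical brokers and topic name)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka keys/values arrive as bytes; cast to strings before landing in HDFS
decoded = events.selectExpr("CAST(key AS STRING) AS key",
                            "CAST(value AS STRING) AS value")

# Write micro-batches to HDFS as Parquet, checkpointing for recoverability
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .trigger(processingTime="30 seconds")   # roughly analogous to a block interval
    .start()
)
query.awaitTermination()
```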
Confidential
ETL Developer
Responsibilities:
- Installed, configured, and maintained Apache Hadoop clusters for application development and major components of Hadoop Ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie and Zookeeper.
- Implemented a six-node CDH4 Hadoop cluster on CentOS.
- Imported and exported data between HDFS/Hive and different RDBMSs using Sqoop.
- Experienced in defining job flows to run multiple MapReduce and Pig jobs using Oozie.
- Imported log files into HDFS using Flume and loaded them into Hive tables to query data.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Used HBase-Hive integration and wrote multiple Hive UDFs for complex queries.
- Involved in writing APIs to read HBase tables, cleanse data, and write to other HBase tables.
- Created multiple Hive tables, implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
- Experienced in running batch processes using Pig Scripts and developed Pig UDFs for data manipulation according to Business Requirements.
- Experienced in writing programs using HBase Client API.
- Involved in loading data into HBase using HBase Shell, HBase Client API, Pig and Sqoop.
- Experienced in design, development, tuning and maintenance of NoSQL database.
- Wrote MapReduce programs in Python with the Hadoop Streaming API (see the sketch after this section).
- Developed unit test cases for Hadoop MapReduce jobs with MRUnit.
- Excellent experience in ETL analysis, designing, developing, testing and implementing ETL processes including performance tuning and query optimizing of database.
- Continuously monitored and managed the Hadoop cluster using Cloudera manager and Web UI.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Used Maven as the build tool and SVN for code management.
- Worked on writing RESTful web services for the application.
- Implemented testing scripts to support test driven development and continuous integration.
Environment: Hadoop, MapReduce, HDFS, HBase, Hive, Impala, Pig, SQL, Ganglia, Sqoop, Flume, Oozie, Unix, Java, JavaScript, Maven, Eclipse.
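A minimal Hadoop Streaming sketch of the Python MapReduce work described above, with the mapper and reducer in one file for brevity; the word-count job, file name, and paths shown in the submit comment are hypothetical.

```python
# Minimal sketch of a Hadoop Streaming job in Python (hypothetical example).
# Submit roughly like:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/logs -output /data/word_counts \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys


def mapper():
    # Emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive contiguously
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```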
Confidential
Software Engineer
Responsibilities:
- Involved in the complete SDLC life cycle, design, and development of the application.
- Agile methodology was followed and was involved in Scrum meetings.
- Created various java bean classes to capture the data from the UI controls.
- Designed UML diagrams like class diagrams, sequence diagrams and activity diagrams.
- Implemented Java web services, JSPs, and Servlets for handling data.
- Designed and developed the user interface using Struts 2.0, JavaScript, and XHTML.
- Made use of Struts validation framework for validations at the server side.
- Created and Implemented the DAO layer using Hibernate tools.
- Implemented custom interceptors and exception handlers for the Struts 2 application.
- Ajax was used to provide dynamic search capabilities for the application.
- Developed business components using service locator, session facade design patterns.
- Developed session facade with stateless session beans for coarse functionality.
- Worked with Log4J for logging purpose in the project.
Environment: Java 1.5, JavaScript, Struts 2.0, Hibernate 3.0, Ajax, JAXB, XML, XSLT, Eclipse, Tomcat.