Big Data Engineer Resume
Minneapolis, MN
SUMMARY:
- Big data engineer with a passion for solving complex data analytics, data distribution and data mining problems.
- Over 8 years and 8 months of professional IT experience, including 4+ years of Hadoop ecosystem and Big Data analytics experience, working heavily with HDFS, Hive, Pig, Spark, Python, and other Big Data technologies.
SKILL:
Big Data Technologies: Hive, Apache Pig, MRJob, Sqoop, Flume, Spark, Dato, Impala, Kafka, HBase, Hadoop, MapR, CDH4
Programming Languages: Python
Libraries: NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, Plotly, Cufflinks
Databases: SQL Developer, Oracle, SQL, MySQL
Visualization Tools: Tableau, Excel
PROFESSIONAL EXPERIENCE:
Confidential, Minneapolis, MN
Big Data Engineer
ROLES AND RESPONSIBILITIES:
- Studied the company's existing claim adjudication alert system to establish an understanding of how it generates alerts based on Nelson rules applied to summary tables.
- Researched and established relationships between different features of the large claim pre-adjudicated tables, Claim reference tables, flattened views and control tables.
- Worked with data scientists to determine and provide data mining insights using Tableau and/or Python Seaborn; ran A/B tests to validate whether a feature predicts outcomes favorably or unfavorably, and chi-squared tests to determine independence of features.
- Cleaned and transformed data for feature enrichment and forecast optimization.
- Created a predictive model using logistic regression, eliminated unwanted features via backward elimination, and assessed the model using MAE, MSE, and RMSE values (see the sketch after this list).
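Below is a minimal, hypothetical sketch of the approach in the last bullet, assuming a Pandas DataFrame of claim features; the file name claims_features.csv, the alert_flag target, the feature set, and the 0.05 p-value cutoff are illustrative placeholders, not details of the actual project.

```python
# Hypothetical sketch: logistic regression with p-value-based backward
# elimination, assessed with MAE/MSE/RMSE on predicted probabilities.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

df = pd.read_csv("claims_features.csv")      # placeholder input file
X = df.drop(columns=["alert_flag"])          # placeholder target name
y = df["alert_flag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Backward elimination: repeatedly drop the feature with the highest p-value.
features = list(X_train.columns)
while True:
    model = sm.Logit(y_train, sm.add_constant(X_train[features])).fit(disp=0)
    pvalues = model.pvalues.drop("const")
    worst = pvalues.idxmax()
    if pvalues[worst] < 0.05:
        break
    features.remove(worst)

# Assess the final model on held-out data.
probs = model.predict(sm.add_constant(X_test[features]))
mae = mean_absolute_error(y_test, probs)
mse = mean_squared_error(y_test, probs)
rmse = np.sqrt(mse)
print(f"Kept features: {features}")
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```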
ENVIRONMENT/TOOLS: Tableau, Python, NumPy, Matplotlib, Seaborn, Pandas, Scikit-learn
Confidential
Big Data Engineer
ROLES AND RESPONSIBILITIES:
- Met with the source team to determine the source table scope (~2K MySQL tables) containing provider, member, accums, and claims data, along with environment and testing needs and requirements.
- Met with consumer teams to gather and design data availability, frequency, and format requirements.
- Worked with the Attunity team to establish connections between 3 different test source environments and the data lake HDFS, eventually enabling continuous change data flow between the source systems and HDFS.
- Established processes to ingest data daily from ~2K tables across 3 different environments, monitored and tested them, and maintained the data as Hive tables, Hive views, HBase tables, and ORC tables, thereby providing high data availability to 8 different consumer teams for their daily testing needs.
- Created jobs to check for and report on daily schema evolution.
- Performed daily data reconciliation between source and data lake tables (see the sketch after this list).
- Helped consumer teams design and build applications to effectively extract data from data lake tables.
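A hypothetical Python sketch of the daily reconciliation check, assuming PyHive and the MySQL Connector are available; the host names, credentials, and table list are placeholders, and the real process covered ~2K tables rather than the three shown.

```python
# Hypothetical reconciliation sketch: compare per-table row counts between a
# MySQL source and the corresponding Hive tables in the data lake.
import mysql.connector
from pyhive import hive

TABLES = ["member", "provider", "claim_header"]   # illustrative subset

mysql_conn = mysql.connector.connect(
    host="mysql-source.example.com", user="reader",
    password="***", database="claims")
hive_conn = hive.Connection(host="datalake-edge.example.com", port=10000,
                            database="landing")

def row_count(cursor, table):
    """Return COUNT(*) for a table using the given DB-API cursor."""
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

mysql_cur, hive_cur = mysql_conn.cursor(), hive_conn.cursor()
for table in TABLES:
    src, lake = row_count(mysql_cur, table), row_count(hive_cur, table)
    status = "OK" if src == lake else "MISMATCH"
    print(f"{table}: source={src} datalake={lake} -> {status}")
```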
ENVIRONMENT/TOOLS: MapR (12-node cluster), Sqoop, Hive, ORC, HBase, IBM CDC, Attunity CDC.
Confidential, Minneapolis, MN
Big Data Developer
ROLES AND RESPONSIBILITIES:
- As a big data developer, my role was to ingest structured and unstructured data from different platforms, then cleanse and aggregate it.
- Later, my role moved to writing MapReduce jobs in MRJob and writing and maintaining Hive and Pig scripts for analyzing data.
- Also worked with other teams on Spark MLlib to create applications that identify claim fraud.
Data Aggregation and Assimilation:
- Moved existing data from older structured datastores to HDFS using Sqoop.
- Aggregated and ingested large amounts of log data from user log files and moved it to HDFS using Flume.
Cleaning and Transforming data:
- Wrote Hive queries to replace existing SQL queries in the system and wrote new queries to prepare data for various kinds of analysis on a day-to-day basis.
- Created external and managed tables in Hive and used them appropriately in different Pig scripts required for reporting.
- Developed and implemented different Pig UDFs to write ad-hoc and scheduled reports as required by the Business team.
- Implemented bucketing and dynamic partitioning in Hive to maintain claim data and improve query efficiency.
- Cleansed various clickstream claims data from log files and stored it in Hive tables.
- Developed and maintained various MapReduce, Hive, and Pig jobs through Oozie workflows.
- Responsible for writing Python/MRJob MapReduce code for integrating and transforming data sets (see the sketch after this list).
- Parsed complex columnar data formats into HDFS by implementing custom SerDes.
- Used Impala for performance tuning and to handle high-concurrency queries run by various teams on HDFS.
- Responsible for troubleshooting MapReduce job issues such as data mismatches and node failures through counters and log files.
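A minimal MRJob sketch of the kind of Python MapReduce job referenced above; the pipe-delimited input layout and the claims-per-provider metric are assumptions for illustration only.

```python
# Hypothetical MRJob sketch: count claims per provider from pipe-delimited
# claim lines (provider ID assumed to be the second field).
from mrjob.job import MRJob

class ClaimsPerProvider(MRJob):

    def mapper(self, _, line):
        # Each input line is a raw claim record; emit (provider_id, 1).
        fields = line.split("|")
        if len(fields) > 1:
            yield fields[1], 1

    def reducer(self, provider_id, counts):
        # Sum the per-provider counts emitted by the mappers.
        yield provider_id, sum(counts)

if __name__ == "__main__":
    ClaimsPerProvider.run()
```

A job like this can be tested locally with `python claims_per_provider.py input.txt` and submitted to the cluster with the `-r hadoop` runner.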
Analytics:
- Created a predictive modeling application within Spark using MLlib and Python to predict claim fraud (see the sketch after this list).
- Created feature enrichment applications using Spark for claim fraud analysis and modeling.
- Benchmarked various modeling techniques, such as PCA, against test claim data.
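A hypothetical PySpark sketch of the claim-fraud model described above, using the DataFrame-based spark.ml API; the Hive table analytics.claim_features, the feature columns, and the is_fraud label are placeholder names, not the actual feature set.

```python
# Hypothetical sketch: train and evaluate a logistic-regression fraud model.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = (SparkSession.builder.appName("claim-fraud")
         .enableHiveSupport().getOrCreate())

claims = spark.table("analytics.claim_features")   # placeholder Hive table
assembler = VectorAssembler(
    inputCols=["billed_amount", "num_procedures", "provider_risk_score"],
    outputCol="features")                           # placeholder features
train, test = assembler.transform(claims).randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="is_fraud")
model = lr.fit(train)

auc = BinaryClassificationEvaluator(labelCol="is_fraud").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
spark.stop()
```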
ENVIRONMENT/TOOLS: Hadoop CDH4 (30-node cluster on AWS), Sqoop, Flume, CMR, MRJob, Python.
Confidential, Chicago, IL
Pig/Hive Developer
ROLES & RESPONSIBILITIES:
Data Aggregation and Assimilation:
- Involved in assimilating different structured and unstructured data and using Pig/Hive queries to clean, aggregate and transform data required for reporting.
- Assisted in gathering and ingesting large amounts of multiform data into HDFS using Flume through a multi-agent source-channel-sink configuration.
- Used Sqoop to extract data from multiple structured sources and to export data to other external RDBMS tables for querying and reporting.
Cleaning and Transforming data:
- Ran Pig queries on sample subsets of data to fine-tune and validate complex queries before running them on the full dataset.
- Created custom reusable UDFs, UDAFs, UDTFs, and macros in Pig Latin/Hive and used them in various reporting queries.
- Implemented join optimizations in Pig using skewed and merge joins for large datasets.
- Used partitioning and bucketing in various map-side join scenarios to improve query performance (see the sketch after this list).
- Used bucketing for effective sampling of data during query testing and refinement.
- Used storage formats such as Avro to quickly access data across multiple columns in complex queries.
- Implemented counters for diagnosing problems in queries and for quality control and application-level statistics.
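The map-side join idea from the bullets above, shown here as an illustrative PySpark broadcast join rather than the original Pig/Hive implementation; the table and column names are placeholders.

```python
# Illustrative sketch: broadcasting a small dimension table replicates it to
# every executor, so the join happens map-side with no shuffle of the large
# fact table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder.appName("map-side-join")
         .enableHiveSupport().getOrCreate())

claims = spark.table("warehouse.claims")        # large fact table (placeholder)
providers = spark.table("warehouse.providers")  # small dimension table (placeholder)

joined = claims.join(broadcast(providers), on="provider_id", how="left")
joined.groupBy("provider_specialty").count().show()
```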
ENVIRONMENT/TOOLS: Hadoop CDH4 (40-node cluster using MR2), Sqoop, Flume.
Confidential
Business Technology Analyst/Developer
ROLES & RESPONSIBILITIES:
- Initially part of the development team, mostly creating SQR (SQL) reports for enhancements to the HCM modules; later moved to a lead role providing technical and functional assistance in HCM.
- Part of the reporting team responsible for creating ad-hoc and scheduled reports using SQR.
- Performed enhancements and development in SQR reporting for the company's ERP application.
- Prepared design documents and performed development activities and SQL coding through SQR reports and processes.
- Performed analysis and feasibility studies for reporting required in various sub-modules within HCM.
- Worked on software development, requirements analysis, and database design.
- Performed unit testing and system testing, and coordinated between different cross-commit teams.
- Led the work from the technology side and served as liaison between management and clients.
- Identified areas of improvement and enhancement in the application and suggested them to the business.
- Improved the architecture and proposed improvements to the existing design.
ENVIRONMENT/TOOLS: Oracle, SQL, SQR