
Senior Big Data Engineer Resume


Alpharetta, GA

SUMMARY

  • Overall 8+ years of technical IT experience in all phases of Software Development Life Cycle (SDLC) with skills in data analysis, design, development, testing and deployment of software systems.
  • 8+ years of industrial experience in big data analytics and data manipulation using Hadoop ecosystem tools: Map Reduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spring Boot, Spark integration with Cassandra, Solr, and Zookeeper
  • Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step Functions, CloudWatch, SNS, DynamoDB, and SQS
  • Managed databases and Azure Data Platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses; built multiple data lakes
  • Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools such as Tableau and Power BI
  • Extensive programming expertise in designing and developing web-based applications using Spring Boot, Spring MVC, Java servlets, JSP, JTS, JTA, JDBC, and JNDI
  • Experience in MVC and microservices architecture with Spring Boot, Docker, and Docker Swarm
  • Proficiency in multiple databases, including MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server. Worked on different file formats such as delimited files, Avro, JSON, and Parquet. Docker container orchestration using ECS, ALB, and Lambda
  • Created Snowflake schemas by normalizing the dimension tables as appropriate and creating a sub-dimension named Demographic as a subset of the Customer dimension
  • Hands-on experience in test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD) approaches
  • Expertise in Java programming with a good understanding of OOP, I/O, collections, exception handling, lambda expressions, and annotations
  • Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
  • Experience in Spring Frameworks like Spring Boot, Spring LDAP, Spring JDBC, Spring Data JPA, Spring Data REST
  • Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
  • Familiar with latest software development practices such as Agile Software Development, Scrum, Test Driven Development (TDD) and Continuous Integration (CI)
  • Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy. Experience in working on creating and running Docker images with multiple micro services
  • Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, Map Reduce, Hive, SQL, and PySpark to solve big data problems.
  • Strong experience in Microsoft Azure Machine Learning Studio for data import, export, data preparation, exploratory data analysis, summary statistics, feature engineering, Machine learning model development and machine learning model deployment into Server system.
  • Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Worked with various text analytics libraries like Word2Vec, GloVe, LDA and experienced with Hyper Parameter Tuning techniques like Grid Search, Random Search, model performance tuning using Ensembles and Deep Learning.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Experience with proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension and fact tables
  • Extensively worked with Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle
  • Skilled in performing data parsing, data ingestion, data manipulation, data architecture, data modelling, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, re-index, melt, and reshape (a brief illustrative sketch follows this list).
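
A minimal pandas sketch of the kind of data preparation described above; the DataFrame, column names, and values are purely illustrative, not taken from any project listed here:

    import pandas as pd

    # Small in-memory stand-in for an extracted data set (columns are illustrative).
    sales = pd.DataFrame({
        "customer_id": [1, 1, 2, 2],
        "region":      ["East", "East", "West", "West"],
        "month":       ["jan", "feb", "jan", "feb"],
        "amount":      [120.0, 80.0, 200.0, None],
    })

    # Describe data contents and impute a missing value.
    print(sales.describe(include="all"))
    sales["amount"] = sales["amount"].fillna(sales["amount"].median())

    # Reshape long -> wide with a pivot, then melt back to long form.
    wide = sales.pivot_table(index=["customer_id", "region"],
                             columns="month", values="amount").reset_index()
    long_again = wide.melt(id_vars=["customer_id", "region"],
                           var_name="month", value_name="amount")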

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, Map Reduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.

Cloud Technologies: AWS, Azure, Google cloud platform (GCP)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, Auto ML - Scikit-Learn, MLjar, etc.

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE

Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, Map Reduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Shell, Linux, PuTTY, Bash Shell, Unix, Tableau, Power BI, SAS Business Intelligence, Crystal Reports, Dashboard Design.

PROFESSIONAL EXPERIENCE

Confidential, Alpharetta, GA

Senior Big Data Engineer

Responsibilities:

  • Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and explained them.
  • Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualizations)
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means
  • Implemented statistical and deep learning models (logistic regression, XGBoost, random forest, SVM, RNN, CNN).
  • Designing and Developing Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Performing data analysis, statistical analysis, generated reports, listings and graphs using SAS tools, SAS/Graph, SAS/SQL, SAS/Connect and SAS/Access.
  • Developing Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats; using Kafka and integrating it with Spark Streaming. Developed data analysis tools using SQL and Python code.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks. Migrated data from on-premises systems to AWS storage buckets.
  • Followed Agile methodology, including test-driven development and pair-programming concepts.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing.
  • Developed Python scripts to call REST APIs and extract data from on-premises systems to AWS S3. Implemented microservices-based cloud architecture using Spring Boot.
  • Ingested data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3
  • Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, EventBridge, SNS)
  • Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using commands with Crontab. Created a Lambda Deployment function, and configured it to receive events from S3 buckets
  • Built machine learning models, including SVM, random forest, and XGBoost, to score and identify potential new business cases with Python scikit-learn.
  • Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
  • Worked on Docker container snapshots, attaching to running containers, removing images, and managing directory structures and containers.
  • Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs, and UNIX shell scripting.
  • Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Weave.
  • Developed complex Talend ETL jobs to migrate data from flat files to databases; pulled files from the mainframe into the Talend execution server using multiple FTP components.
  • Developed Talend ESB services and deployed them on ESB servers on different instances.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (AWS Lambda).
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source (a minimal sketch follows this list).
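
A minimal sketch of the Snowflake UPSERT pattern noted in the last bullet, assuming the Snowflake Python connector; the account, credentials, table, and column names are placeholders, not actual project values:

    import snowflake.connector

    MERGE_SQL = """
    MERGE INTO analytics.dim_customer tgt
    USING staging.dim_customer_stg src
      ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN UPDATE SET
      tgt.name = src.name,
      tgt.segment = src.segment,
      tgt.updated_at = CURRENT_TIMESTAMP()
    WHEN NOT MATCHED THEN INSERT (customer_id, name, segment, updated_at)
      VALUES (src.customer_id, src.name, src.segment, CURRENT_TIMESTAMP())
    """

    # Placeholder connection parameters; real credentials would come from a secrets store.
    conn = snowflake.connector.connect(account="<account>", user="<user>",
                                       password="<password>", warehouse="ETL_WH",
                                       database="EDW", schema="ANALYTICS")
    try:
        conn.cursor().execute(MERGE_SQL)
    finally:
        conn.close()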

Environment: Hadoop, Map Reduce, HDFS, Hive, Spring Boot, Cassandra, Docker Swarm, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend Big Data Integration, Solr, Impala.

Confidential, Fort Washington, PA

Sr. Data Engineer

Responsibilities:

  • Transformed business problems into big data solutions and defined big data strategy and roadmap; installed, configured, and maintained data pipelines
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write the data back.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with data governance and data quality to design various models and processes.
  • Involved in all the steps and scope of the project reference data approach to MDM, have created a Data Dictionary and Mapping from Sources to the Target in MDM Data Model.
  • Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services. Knowledge of USQL
  • Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Built a new CI pipeline with testing and deployment automation using Docker, Docker Swarm, Jenkins, and Puppet. Utilized continuous integration and automated deployments with Jenkins and Docker.
  • Data visualization: Pentaho, Tableau, D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple. Performed big data analysis using Hadoop, Map Reduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, Pandas, and scikit-learn.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the engine to increase user lifetime by 45% and triple user conversions for target categories.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
  • Data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python, R, and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
  • Create and publish multiple dashboards and reports using Tableau Server and work on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
  • Work on data that was a combination of unstructured and structured data from multiple sources and automate the cleaning using Python scripts.
  • Tackled a highly imbalanced fraud dataset using under-sampling with ensemble methods, oversampling, and cost-sensitive algorithms.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn (see the sketch after this list).
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Optimized algorithms with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
  • Write research reports describing the experiment conducted, results, and findings and make strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive and NoSql.
  • Wrote production level Machine Learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
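
An illustrative scikit-learn sketch of the feature-selection approach noted above (class-weighted random forest for feature selection, then gradient boosting); the synthetic, imbalanced data stands in for the actual fraud table:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Synthetic imbalanced data as a stand-in for real fraud transactions.
    X, y = make_classification(n_samples=5000, n_features=30, weights=[0.97, 0.03],
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    # Rank features with a class-weighted random forest and keep the strongest half.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=300, class_weight="balanced",
                               random_state=42),
        threshold="median")
    selector.fit(X_train, y_train)

    # Train gradient boosting on the selected features and evaluate.
    gbm = GradientBoostingClassifier(random_state=42)
    gbm.fit(selector.transform(X_train), y_train)
    print(classification_report(y_test, gbm.predict(selector.transform(X_test))))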

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Docker Swarm, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Confidential, Wilmington, DE

Data Scientist/ R Programmer

Responsibilities:

  • Gathered business requirements, definition and design of the data sourcing, worked with the data warehouse architect on the development of logical data models.
  • Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, tree maps, and complex reports involving property controls and custom expressions.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Automated Diagnosis of Blood Loss during Emergencies and developed Machine Learning algorithm to diagnose blood loss.
  • Extensively used Agile methodology as the Organization Standard to implement the data Models. Used Micro service architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
  • Created several types of data visualizations using Python and Tableau. Extracted Mega Data from AWS using SQL Queries to create reports.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Analyzed functional and non-functional business requirements and translate into technical data requirements and create or update existing logical and physical data models. Developed a data pipeline using Kafka to store data into HDFS.
  • Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
  • Generated graphs and reports using ggplot package in RStudio for analytical models. Developed and implemented R and Shiny application which showcases machine learning for business forecasting.
  • Developed predictive models using Decision Tree, Random Forest, and Naïve Bayes.
  • Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms. Expertise in R, MATLAB, Python, and their respective libraries.
  • Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
  • Hands-on experience in implementing Naive Bayes and skilled in random forests, decision trees, linear and logistic regression, SVM, clustering, and principal component analysis.
  • Performed K-means clustering, regression, and decision trees in R. Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
  • Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, and sampling.
  • Worked on R packages to interface with Caffe Deep Learning Framework. Perform validation on machine learning output from R.
  • Applied different dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) on the feature matrix.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Responsible for design and development of Python programs/scripts to prepare, transform, and harmonize data sets in preparation for modeling.
  • Worked with Market Mix Modeling to strategize the advertisement investments to better balance the ROI on advertisements.
  • Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
  • Used grid search to evaluate the best hyper-parameters for the model and the k-fold cross-validation technique to train the model for best results (a brief sketch follows this list).
  • Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
  • Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
  • Performed Data Cleaning, features scaling, features engineering using pandas and NumPy packages in python and build models using deep learning frameworks
  • Implemented application of various machine learning algorithms and statistical modeling like Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression and Linear Regression using Python to determine the accuracy rate of each model
  • Implemented Univariate, Bivariate, and Multivariate Analysis on the cleaned data for getting actionable insights on the 500-product sales data by using visualization techniques in Matplotlib, Seaborn, Bokeh, and created reports in Power BI.
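
A hedged scikit-learn sketch of the grid search with k-fold cross-validation mentioned above; the estimator, parameter grid, and synthetic data are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    # Synthetic stand-in for a prepared modeling data set.
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                               random_state=42)

    param_grid = {
        "n_estimators": [200, 500],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 5],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        n_jobs=-1,
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))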

Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, TDD, Python, Spring Boot, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.

Confidential

Big Data Developer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using Map Reduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built an Oozie pipeline that performs several actions: moving files, Sqooping data from source Teradata or SQL databases, exporting it into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Ran Apache Hadoop, CDH, and MapR distributions on Elastic MapReduce (EMR) on EC2.
  • Performed forking wherever there was scope for parallel processing to optimize data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Wrote a Pig script that picks up data from one HDFS path, performs aggregation, and loads the result into another path, which later populates another domain table; converted this script into a JAR and passed it as a parameter in an Oozie script
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity. Built an ETL process that uses a Spark JAR to execute the business analytical model.
  • Hands-on experience with Git bash commands such as git pull to pull the code from source and develop it as per requirements, git add to stage files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released to production
  • Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Involved in creating UNIX shell scripts; performed table defragmentation, partitioning, compression, and indexing for improved performance and efficiency.
  • Developed a Python script to run SQL queries in parallel for the initial load of data into the target table (a simplified sketch follows this list). Involved in loading data from the edge node to HDFS using shell scripting and assisted in designing the overall ETL strategy
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources
  • Developed and implemented R and Shiny application which showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customers churn and classification of customers.
  • Worked with applications such as R, SPSS, and Python to develop neural network algorithms and cluster analysis, and used ggplot2 and Shiny in R to understand data and develop applications.
  • Partner with technical and non-technical resources across the business to leverage their support and integrate our efforts.
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Implemented big data analytics and advanced data science techniques to identify trends, patterns, and discrepancies on petabytes of data using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, Map Reduce, and Azure Machine Learning.
  • Data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client, with data representation of the analysis and suggested solutions for investors
  • Rapid model creation in Python using pandas, NumPy, sklearn, and plot.ly for data visualization. These models are then implemented in SAS where they are interfaced with MSSQL databases and scheduled to update on a timely basis.
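
A simplified sketch of the parallel SQL load pattern referenced above; sqlite3 stands in for the actual source/target database, and the partition values, tables, and columns are placeholders:

    import sqlite3
    from concurrent.futures import ThreadPoolExecutor

    PARTITIONS = ["2020-01", "2020-02", "2020-03", "2020-04"]  # illustrative months

    def load_partition(month: str) -> int:
        # One connection per worker; each worker loads a single partition.
        con = sqlite3.connect("warehouse.db")
        try:
            cur = con.execute(
                "INSERT INTO target_sales "
                "SELECT * FROM staging_sales WHERE strftime('%Y-%m', order_date) = ?",
                (month,),
            )
            con.commit()
            return cur.rowcount
        finally:
            con.close()

    with ThreadPoolExecutor(max_workers=4) as pool:
        loaded = list(pool.map(load_partition, PARTITIONS))
    print(dict(zip(PARTITIONS, loaded)))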

Environment: Hadoop, Map Reduce, Hive, Pig, Spark, HBase, Oozie, Impala, Kafka, Azure Data Factory, Databricks, AWS, Azure, Python, NumPy, Pandas, PL/SQL, SQL Server, Unix, Shell Scripting, Git.

Confidential 

Hadoop Developer

Responsibilities:

  • Performed data transformations like filtering, sorting, and aggregation using Pig
  • Created Sqoop jobs to import data from SQL, Oracle, and Teradata into HDFS
  • Created Hive tables to push the data to MongoDB.
  • Wrote complex aggregate queries in MongoDB for report generation (an illustrative sketch follows this list).
  • Developed scripts to run scheduled batch cycles using Oozie and present data for reports
  • Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
  • Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored it in efficient storage formats such as Parquet, and loaded it into Amazon S3 using the Spark Scala API.
  • Implement automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
  • Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and Parquet file format.
  • Performed data validation and transformation using Python and Hadoop streaming.
  • Developed highly efficient Pig Java UDFs utilizing advanced concept like Algebraic and Accumulator interface to populate ADP Benchmarks cube metrics.
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
  • Developed bash scripts to bring the TLOG files from the FTP server and then process them to load into Hive tables.
  • Automated workflows using shell scripts and Control- Confidential jobs to pull data from various databases into Hadoop Data Lake.
  • Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
  • Insert-overwrote the Hive data with HBase data daily to get fresh data every day and used Sqoop to load data from DB2 into the HBase environment.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and have a good experience in using Spark-Shell and Spark Streaming.
  • Designed, developed and maintained Big Data streaming and batch applications using Storm.
  • Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
  • Developed Oozie Workflows for daily incremental loads, which gets data from Teradata and then imported into hive tables.
  • Sqoop jobs, PIG and Hive scripts were created for data ingestion from relational databases to compare with historical data.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
  • Used Splunk to capture, index, and correlate real-time data in a searchable repository, from which it can generate reports and alerts.
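
An illustrative PyMongo sketch of the aggregate reporting queries mentioned above; the connection URI, database, collection, and field names are placeholders:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # placeholder URI
    orders = client["reporting"]["orders"]

    # Group completed orders by region and month, then rank by total amount.
    pipeline = [
        {"$match": {"status": "COMPLETE"}},
        {"$group": {
            "_id": {"region": "$region", "month": {"$month": "$order_date"}},
            "total_amount": {"$sum": "$amount"},
            "order_count": {"$sum": 1},
        }},
        {"$sort": {"total_amount": -1}},
    ]
    for row in orders.aggregate(pipeline):
        print(row["_id"], row["total_amount"], row["order_count"])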

Environment: Hadoop, HDFS, Spark, Storm, Kafka, Map Reduce, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Splunk, UNIX Shell Scripting.
