
Hadoop/Data Engineer Resume


St. Louis, MO

SUMMARY:

  • Technologically savvy and focused Data Engineer with a strong background of 8 years in advanced data management and troubleshooting.
  • More than 5 years of hands-on experience architecting and implementing solutions in Big Data technologies and more than 4 years of hands-on experience architecting and implementing solutions in the Amazon AWS cloud.
  • Effective in building highly available distributed systems for data extraction, ingestion and loading.
  • Sound experience with AWS services such as Amazon EC2, S3, EMR, Amazon RDS, VPC, Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and Lambda to trigger resources.
  • Good experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.
  • Dependable ETL Developer with 4 plus years of experience in creating reliable and accurate data transformation tools. Well-versed in technologies such as SQL and Python.
  • Proven strength in data modelling and warehousing.
  • Well experienced in applying the Software Development Life Cycle (SDLC), coding standards, code reviews, and source management.
  • Strong experience with structured data and data analysis/transformation skills.
  • Excellent understanding of algorithms and experience with source control software, such as GitHub and SVN.
  • Designed data models for a next-generation Data Lake by working with business SMEs and architecture teams.
  • Implemented BI solution framework for end-to-end business intelligence projects.
  • Experience in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Spark SQL, and Scala for better performance on huge datasets (see the sketch at the end of this summary).
  • Hands-on with essential DevOps tools like Chef, Puppet, Ansible, Docker, Kubernetes, Subversion (SVN), Git, Hudson, Jenkins, Ant, and Maven; migrated VMware VMs to AWS and managed services like EC2, S3, Route53, ELB, and EBS.
  • Experience in creating machine learning models and retraining systems and good understanding of Data Mining and Machine Learning techniques.
  • Proficient in handling and ingesting terabytes of streaming data (Kafka, Spark Streaming, Storm), batch data, and automation and scheduling (Oozie, Airflow).
  • Architected the Hadoop cluster in pseudo-distributed mode working with Zookeeper and Apache Kudu; stored, loaded, and backed up data from HDFS to Amazon S3 and created tables in the AWS cluster with S3 storage.
  • Experience in Spark API using Scala and Spark-SQL/Streaming for faster processing of data.
  • Expertise in writing Hadoop jobs using MapReduce, Apache Crunch, Apache Kudu, Hive, Pig, and Splunk.
  • Profound knowledge in developing production-ready Spark applications using Spark Components like Spark SQL, MLlib, GraphX, Data Frames, Datasets, Spark-ML and Spark Streaming.
  • Expertise in developing multiple Confluent Kafka producers and consumers to meet business requirements; stored the stream data to HDFS and processed it using Spark.
  • Created instances in AWS and migrated data from the data center to AWS using Snowball and the AWS migration service.
  • Hands on experience in provisioning and managing multi-node Hadoop Clusters on public cloud environment Amazon Web Services (AWS) - EC2 and on private cloud infrastructure.
  • Assisted in migrating from On-Premises Hadoop Services to cloud based Data Analytics using AWS.
  • Used HIVE queries to import data into Microsoft Azure cloud and analyzed the data using HIVE scripts.
  • Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup and creating data pipelines.
  • Experienced in scripting with Python (PySpark), Java, Scala and Spark-SQL for development, aggregation from various file formats such as XML, JSON, CSV, Avro, Parquet, ORC.
  • Great experience in data analysis using HiveQL, Hive-ACID tables, Pig Latin queries, custom MapReduce programs and achieved improved performance.
  • Experience in ELK stack to develop search engines on unstructured data within NoSQL databases in HDFS.
  • Extensive knowledge in all phases of Data Acquisition, Data Warehousing (gathering requirements, design, development, implementation, testing, and documentation), Data Modeling (analysis using Star Schema and Snowflake for FACT and Dimensions Tables), Data Processing and Data Transformations (Mapping, Cleansing, Monitoring, Debugging, Performance Tuning and Troubleshooting Hadoop clusters).
  • Experienced in managing and reviewing Azure AppInsights log files.
  • Implemented CRUD operations using Cassandra Query Language (CQL) and analyzed the data from Cassandra tables for quick searching, sorting, and grouping on top of the Cassandra File System.
  • Hands-on experience on Ad-hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
  • Good knowledge in understanding the security requirements like Azure Active Directory, Sentry, Ranger, and Kerberos authentication and authorization infrastructure.
  • Expertise in creating Kubernetes clusters with CloudFormation templates and PowerShell scripting to automate deployment in a cloud environment.
  • Sound knowledge in developing highly scalable and resilient Restful APIs, ETL solutions, and third-party integrations as part of Enterprise Site platform using Informatica.
  • Experience in using bug tracking and ticketing systems such as Jira, and Remedy, used Git and SVN for version control.
  • Highly involved in all facets of SDLC using Waterfall and Agile Scrum methodologies.
  • Experience in designing interactive dashboards, reports, performing ad-hoc analysis, and visualizations using Tableau, Power BI, Arcadia, and Matplotlib.
  • Involved in migration of the legacy applications to cloud platform using DevOps tools like GitHub, Jenkins, JIRA, Docker, and Slack.
  • Excellent communication skills, creative, research-minded, technically competent and result-oriented with problem solving.
  • Proactive, with constant attention to the scalability, performance, and availability of systems.
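
A minimal PySpark sketch of the Hive-to-DataFrame conversion mentioned above; the table name (sales), columns (region, amount), and output path are hypothetical, and the original HiveQL is shown in the comment:

    # hive_to_dataframe.py - illustrative sketch only; table, columns, and path are assumptions
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe")
             .enableHiveSupport()          # read Hive tables through the metastore
             .getOrCreate())

    # Original HiveQL: SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    totals = (spark.table("sales")
              .groupBy("region")
              .agg(F.sum("amount").alias("total")))

    # Write the result back out as Parquet for downstream consumers
    totals.write.mode("overwrite").parquet("/data/output/sales_totals")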

TECHNICAL PROFICIENCIES:

Bigdata Technologies: HDFS, MapReduce, YARN, Hive, Pig, Pentaho, Presto, HBase, Oozie, Zookeeper, Snowflake, Sqoop, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy.

Operating Systems: Linux (Ubuntu, RHEL 7/6.x/5.x/4.x, CentOS 4.x/5.x/6.x/7), Solaris, UNIX, Windows XP/Vista/2003/2007/2010

NO SQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis.

Cloud Services: Azure, AWS

Languages: C, C++, Java, Scala, Python, HTML, SQL, PL/SQL, Pig Latin, HiveQL, UNIX, JavaScript, Shell Scripting.

ETL Tools: Informatica, IBM DataStage, Talend.

Application Servers: Web Logic, Web Sphere, JBoss, Tomcat.

Databases: Oracle, MySQL, DB2, Teradata, NEO4J, Microsoft SQL Server.

Build Tools: Jenkins, Maven, ANT, Azure

Version Controls: Subversion, Git, Bitbucket, GitHub

Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans

Methodologies: Agile, Waterfall.

PROFESSIONAL EXPERIENCE:

Hadoop/Data Engineer

Confidential, St. Louis, MO

Responsibilities:

  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Responsible for building large-scale data processing systems and serving as an expert in data warehousing solutions while working with a variety of database technologies.
  • Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
  • Contributed to preparing a big data platform for cloud readiness within AWS/Microsoft Azure platforms by containerizing modules using Docker.
  • Primarily responsible for designing, implementing, testing, and maintaining database solutions for Azure; also worked on Spring Boot for the creation of microservices.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
  • Applied Machine Learning models (Logistics Regression, Random Forest Classifier) for prediction of user behavior in Python.
  • Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
  • Created pipelines from different sources (Hadoop, Teradata, Oracle, Linux) to ADLS Gen2 through Azure Data Factory by creating datasets and data pipelines.
  • Worked with statistical procedures that are applied in both supervised and unsupervised machine learning problems.
  • Loaded JSON from upstream systems using Spark Streaming and loaded it into Elasticsearch.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (see the sketch after this list).
  • Migrated many instances from GCP to Azure using Terraform.
  • Used Spring AOP to implement distributed declarative transactions throughout the application.
  • Used Azure Data Factory to schedule the flows by connecting different pipelines and Databricks notebooks.
  • Developed business components and integrated those using Spring features such as Dependency Injection, Auto wiring components such as DAO layers and service proxy layers.
  • Developed a POC on real-time streaming data received by Kafka; processed the data using Spark and stored it in the HDFS cluster using Scala.
  • Developed the validation code for data quality rules using Spark/Scala/PySpark.
  • Explored MLlib algorithms in Spark to understand the possible machine learning functionalities that could be used for the use case.
  • Created Mixpanel REST API calls to transfer hundreds of GBs of user experience data from Mixpanel into Google BigQuery for analysis.
  • Performed data analysis, feature selection, feature extraction using Apache Spark Machine Learning streaming libraries in Python.
  • Implemented Spring boot microservices to process the messages into the Kafka cluster setup.
  • Derived the Spring MVC Controllers from the Use cases and integrated with the Service Layer to carry out business logic operations and returning resultant Data Model Object if any.
  • Helped deconstruct RDBMS data sets into a graph database using vertices and edges with JSON documents.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
  • Experience in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse to control and grant database access.
  • Worked on Spring Web Flow Design using the Sequence Diagrams and configured the flows between the pre-defined Views and Controllers.
  • Built ETL/ELT pipelines in data technologies like PySpark, Hive, Presto, and Databricks.
  • Replaced the Presto operators (nearly 400 tables) in the pipelines using Presto SQL and Python scripting, based on the new templates generated for tables beyond the 90-day retention period.
  • Worked on Developing Data Pipeline to Ingest Hive tables and File Feeds and generate Insights into Cassandra DB.
  • Worked on logical and physical data modeling using Erwin and designing the data flow from source to target systems.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, NameNode recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported the data from different sources like HDFS/HBase into Spark RDD.
  • Developed a system to extract valuable insights from raw data to give meaningful leads using machine learning classifiers leveraging algorithms like Support Vector Machines, Naive Bayes, Linear Regression.
  • Performed architecture design, data modeling, and implementation of Big Data platform and analytics for the consumer products.
  • Strong technical background in C#.NET, Windows Azure (Cloud), Windows Services, Entity Framework, LINQ, and SQL Server.
  • Designed and implemented highly performant data ingestion pipelines using Databricks.
  • Created Snowpipe for continuous data loads and scheduled different Snowflake jobs using NiFi.
  • Responsible for managing data coming from different sources into an HDFS Data Lake.
  • Involved in migration of ETL processes from Oracle to Hive to test the easy data manipulation and worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop.
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file to create a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Created Databricks Clusters to run multiple data loads parallel in PySpark.
  • Worked on Spark Structured Streaming to develop a live streaming data pipeline with Kafka as the source and output as insights into Cassandra DB; the data was fed in JSON/XML format and then stored in Cassandra DB.
  • Experience in working with Google Cloud Storage for data migration from HDFS.
  • Experience in working with Cloudera, Hortonworks, and Microsoft Azure HDINSIGHT Distributions.
  • Experience in dimensional data modeling, ETL development, and Data Warehousing.
  • Expert in understanding the data and designing and implementing the enterprise platforms like Hadoop Data Lake and Huge Data warehouses.
  • Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Used Impala to read, write and query the Hadoop data in HDFS from HBase or Cassandra.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Loaded data using Spark-streaming with Scala and Python and worked on HBase Java API to populate operational HBase table with Key value.
  • Set up Jenkins on Amazon EC2 servers and configured the notification server to notify the Jenkins server of any changes to the repository.
  • Deployment of Cloud service including Jenkins and Nexus on Docker using Terraform.
  • Worked on Snowflake environment to remove redundancy and load real time data from various data sources into HDFS using Kafka.
  • Experienced in Microsoft Azure data storage, Azure Data Factory, and Data Lake.
  • Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Pulling the data from data lake (HDFS) and massaging the data with various RDD transformations.
  • Worked with Spark Streaming to get ongoing information from Kafka and store the stream data to HDFS.
  • Supported MapReduce Programs and distributed applications running on the Hadoop cluster and scripting Hadoop package installation and configuration to support fully automated deployments.
  • Worked with data modeling, data architecture design and leveraging large-scale data ingest from complex data sources.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Involved in creating JSON API calls and handling API responses, including errors and exceptions.
  • Working experience in CI/CD automation environment with GitHub, Bitbucket, Jenkins, Docker, and Kubernetes
  • Installed and configured OpenShift platform in managing Docker containers and Kubernetes Clusters.
  • Created Hive External tables and loaded the data into tables and query data using HQL and worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Monitoring Hadoop cluster using tools like Nagios, Ganglia, and Cloudera Manager and maintaining the Cluster by adding and removing of nodes using tools like Ganglia, Nagios, and Cloudera Manager.
  • Worked on Hive for exposing data for further analysis and for generating transforming files from different analytical formats to text files.
  • Performed data modeling and data migration, and consolidated and harmonized data. Hands-on data modeling experience using Dimensional Data Modeling, Star Schema, Snowflake, Fact and Dimension Tables, and Physical and Logical Data Modeling using Erwin 3.x/4.x.
  • Used SQL, PL/SQL to validate the Data going into the Data warehouse and Implemented a fully operational production grade large scale data solution on Snowflake Data Warehouse.
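
A minimal PySpark Structured Streaming sketch of the Kafka-to-Parquet flow referenced above; the broker address, topic name, and HDFS paths are hypothetical, and the job assumes the Spark-Kafka connector package is on the classpath:

    # kafka_to_parquet.py - illustrative sketch only; broker, topic, and paths are assumptions
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    # Read the raw feed from Kafka; each record's value arrives as bytes
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    # Keep the payload as a string column before persisting it
    events = raw.select(col("value").cast("string").alias("payload"))

    # Continuously append the stream to HDFS in Parquet format
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .start())

    query.awaitTermination()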

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, Scala, PySpark, MapR, ADLS, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, Zookeeper, Cloudera, Oracle, Kerberos and RedHat 6.5.

Bigdata Engineer

Confidential, San Jose, CA

Responsibilities:

  • Responsible for architecting Hadoop clusters with CDH3 and involved in the installation of CDH3 and the upgrade from CDH3 to CDH4.
  • Worked on creating a keyspace in Cassandra for saving the Spark batch output.
  • Worked on a Spark application to compact the small files present in the Hive ecosystem into files close to the HDFS block size (see the sketch after this list).
  • Created Databricks Delta Lake process for real-time data load from various sources (Databases, Confidential and SAP) to AWS S3 data-lake using Python/PySpark code.
  • Implementation of highly scalable and robust ETL processes using AWS (EMR, CloudWatch, IAM, EC2, S3, Lambda Functions, DynamoDB).
  • Managed multiple AWS accounts with multiple VPCs for both production and non-production, where the primary objectives were automation, build-out, integration, and cost control.
  • Contributed to preparing a big data platform for cloud readiness within AWS/Microsoft Azure platforms by containerizing modules using Docker.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models, and used AWS Rekognition for image analysis.
  • Worked with different relational database systems like Oracle (PL/SQL); used Unix shell scripting and Python, and worked on AWS EMR instances.
  • Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for processing and storing large data sets, worked with Elastic MapReduce, and set up the Hadoop environment on AWS EC2 instances.
  • Performed advanced procedures like text analytics and processing using Spark's in-memory computing with Scala.
  • Set up the Neo4j graph database, started the Neo4j server, and implemented the connection to the application.
  • Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
  • Implemented the real time streaming ingestion using Kafka and Spark Streaming.
  • Created automated pipelines in AWS Code Pipeline to deploy Docker containers in AWS ECS using S3.
  • Worked in AWS environment for development and deployment of custom Hadoop applications.
  • Knowledge and working experience with cloud data integration, analytics, and ML tools like Databricks, Azure HDInsight, and Azure Data Factory.
  • Developed analytical components using Scala, Spark, and Spark Streaming.
  • Used Jenkins for build and continuous integration for software development.
  • Used ANT and Maven scripts to build and deploy applications and helped with deployment for continuous integration using Jenkins and Maven.
  • Design, development, and implementation of performant ETL pipelines using the Python API (PySpark) of Apache Spark on Azure Databricks.
  • Worked on handling storage accounts in Azure; experience with version control systems (e.g., ClearCase, Git).
  • Created ETL scripts that extracted data from Mixpanel (MP) using MP's API, scripted in Google Cloud Shell, and converted it to Google BigQuery tables for the Data Analytics teams.
  • Created a Python Flask login and dashboard with the Neo4j graph database and executed various Cypher queries for data analytics.
  • Implemented APIs to solve technology problems using NoSQL and graph databases like Neo4j and MongoDB.
  • Created an end-to-end machine learning pipeline in PySpark and Python connecting to Neo4j.
  • Involved in the requirements and design phases to implement a streaming Lambda Architecture for real-time streaming using Spark, Kafka, and Scala.
  • Experience in loading the data into Spark RDD and performing in-memory data computation to generate the output responses.
  • Optimized pipeline to improve CPU and Memory performance by leveraging better Presto functions, better data structures, code optimization, partitioning, filtering etc.
  • Created an Automated Databricks workflow notebook to run multiple data loads (Databricks notebooks) in parallel using Python.
  • Extensively worked on Scala API's to build real time data applications using Spark.
  • Used Pig for transformations, event joins, the Elephant Bird API, and pre-aggregations performed before loading JSON-format files onto HDFS.
  • Worked with data delivery teams to set up new Hadoop users; this included setting up Linux users, setting up Kerberos principals, and testing HDFS and Hive.
  • Extensively used ETL to load data from flat files, XML, Oracle databases, and MySQL from different sources into the Data Warehouse database.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
  • Working knowledge of Kubernetes pods and scaling management.
  • Developed a full-text search platform using NoSQL and the Logstash/Elasticsearch engine, allowing for much faster, more scalable, and more intuitive user searches.
  • Designed GUI prototype using ADF 11G GUI component before finalizing it for development.
  • Developed end to end ETL batch and streaming data integration into Hadoop (MapR), transforming data.
  • Developed the Sqoop scripts to make the interaction between Pig and MySQL Database.
  • Worked on Performance Enhancement in Pig, Hive and HBase on multiple nodes.
  • Worked with Distributed n-tier architecture and Client/Server architecture.
  • Supported MapReduce programs running on the cluster and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • CI/CD pipeline setup using Git, Jenkins, Docker, and Azure Kubernetes.
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
  • Worked on different kind of custom filters and handled pre-defined filters on HBase using API.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters and worked on Hive for further analysis and for generating transforming files from different analytical formats to text files.
  • Evaluated usage of Oozie for Workflow Orchestration and experienced in cluster coordination using Zookeeper.
  • Developing ETL jobs with organization and project defined standards and processes.
  • Implemented data access using Hibernate persistence framework.
  • Design of GUI using Model View Controller Architecture (STRUTS Framework)
  • Integrated Spring DAO for data access using Hibernate and involved in the Development of Spring Framework Controllers.
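
A minimal PySpark sketch of the small-file compaction approach referenced above; the Hive table name, output path, HDFS block size, and data-size estimate are all assumptions used to pick a sensible output file count:

    # compact_small_files.py - illustrative sketch only; table, path, and sizes are assumptions
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("compact-small-files")
             .enableHiveSupport()
             .getOrCreate())

    BLOCK_SIZE = 128 * 1024 * 1024          # assumed HDFS block size (128 MB)
    TOTAL_BYTES = 64 * 1024 * 1024 * 1024   # assumed on-disk size of the table (64 GB)

    # Aim for output files roughly one HDFS block in size
    num_files = max(1, TOTAL_BYTES // BLOCK_SIZE)

    df = spark.table("warehouse.events_small_files")   # hypothetical table with many small files

    # coalesce() merges many small input partitions into fewer, larger output files
    (df.coalesce(int(num_files))
       .write
       .mode("overwrite")
       .parquet("hdfs:///warehouse/events_compacted"))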

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Zookeeper, Impala, Java (JDK 1.6), Cloudera, Oracle, SQL Server, UNIX Shell Scripting, Flume, Oozie, Scala, Spark, ETL, Sqoop, Python, Kafka, PySpark, AWS EMR, AWS S3, AWS Redshift, MongoDB.

Hadoop Engineer

Confidential, Chicago, IL

Responsibilities:

  • Responsible for building scalable distributed data solutions on Cloudera distributed Hadoop.
  • Developed complete end to end Big-Data Processing in Hadoop Ecosystems
  • Involved in gathering requirements from client and estimating timeline for developing complex queries using HIVE and IMPALA for logistics application.
  • Involved in using Spark Streaming and Spark jobs for ongoing customer transactions and Spark SQL to handle structured data in Hive.
  • Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.
  • Create and maintain automated ETL processes with special focus on data flow, error recovery, and exception handling and reporting.
  • Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
  • Conduct business/data flow modeling and generate applicable scenarios for the technology functionality testing team.
  • Wrote PySpark code to calculate aggregates such as mean, covariance, and standard deviation (see the sketch after this list).
  • Worked on analyzing the Hadoop cluster and different Big Data components including Pig, Hive, Storm, Spark, HBase, Kafka, Elasticsearch, databases, and Sqoop.
  • Integrated Kafka with Flume in a sandbox environment using the Kafka source and Kafka sink.
  • Installed Hadoop, Map Reduce, HDFS, and developed multiple Map-Reduce jobs in PIG and Hive for data cleaning and pre-processing.
  • Complete project documentation including diagram, data flow, status and usage reports, support and escalation processes.
  • Involved in Developing Insight Store data model for Cassandra which was utilized to store the transformed data.
  • Responsible for writing Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • In the data exploration stage, used Hive and Impala to get insights into the customer data.
  • Successfully secured the Kafka cluster with Kerberos and implemented Kafka security features using SSL, both with and without Kerberos. For more fine-grained security, set up Kerberos users and groups to enable more advanced security features, and integrated Apache Kafka for data ingestion.
  • Worked on NiFi workflows development for data ingestion from multiple sources. Involved in architecture and design discussions with the technical team and interface with other teams to create efficient and consistent Solutions.
  • Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
  • Created files and tuned the SQL queries in Hive using Hue.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
  • Using Apache Nifi in a Kerberos system to transfer data from relational databases like MySQL to HDFS.
  • Expertise in implementing Spark using Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
  • CI/CD pipeline setup using Git, Jenkins, Docker, and Azure Kubernetes.
  • Brought data into HBase using the HBase shell as well as the HBase client API.
  • Used Pig as ETL tool to do Transformations with joins and pre-aggregations before storing the data onto HDFS.
  • Used Kafka to build a customer activity tracking pipeline as a series of real-time publish-subscribe feeds.
  • Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
  • Provided design recommendations and thought leadership to sponsors/stakeholders that improved review process and resolved technical problems.
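
A minimal PySpark sketch of the aggregate calculations referenced above; the sample DataFrame and its columns are made up for illustration:

    # aggregates.py - illustrative sketch only; the sample data and column names are assumptions
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggregates").getOrCreate()

    # Tiny in-memory sample standing in for a real table
    df = spark.createDataFrame(
        [(1, 10.0, 3.5), (2, 12.5, 4.0), (3, 9.0, 2.8)],
        ["id", "price", "quantity"])

    # Mean and standard deviation of one column, covariance between two columns
    stats = df.agg(
        F.mean("price").alias("mean_price"),
        F.stddev("price").alias("stddev_price"),
        F.covar_samp("price", "quantity").alias("cov_price_qty"))

    stats.show()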

Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, NiFi, Spark, Scala, Kafka, Python, Cloudera, Linux, Spark Streaming, Pig.

Data Engineer

Confidential, San Jose, CA

Responsibilities:

  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data from HBase through Sqoop, placed in HDFS for further processing.
  • Monitored multiple Hadoop clusters environments using Ganglia.
  • Worked extensively with Hadoop ecosystem tools like HDFS, MapReduce, Pig, Hive, HBase, Sqoop, Spark, etc.
  • Generated Scala and Java classes from the respective APIs so that they could be incorporated into the overall application.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building common learner data model which gets the data from Kafka in near real time and persist it to Cassandra.
  • Worked in loading data from UNIX file system and Teradata to HDFS. Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Involved in creating Hive tables, loading data and running hive queries in those data.
  • Developed PIG scripts for the analysis of semi structured data and also involved in the industry specific UDF (user defined functions).
  • Extensive Working knowledge of partitioned table, UDFs, performance tuning, compression-related properties, thrift server in Hive.
  • Installed Apache NiFi and MiNiFi To Make Data Ingestion Fast, Easy and Secure from Internet of Anything with Hortonworks Data Flow.
  • Involved in writing optimized Pig Script along with involved in developing and testing Pig Latin Scripts
  • Developed Java MapReduce programs on log data to transform into structured way to find user location, age group, spending time.
  • Worked on financial software integration into Oracle Projects using Oracle's API; this involved extensive use of SQL and data mapping as well.
  • Administered large MapR Hadoop environments: built and supported cluster setup, performance tuning, and monitoring in an enterprise environment.
  • Leveraged Chef to manage and maintain builds in various environments and planned for hardware and software installation on production cluster and communicated with multiple teams to get it done.
  • Developed optimal strategies for distributing the web log data over the cluster; imported and exported the stored web log data into HDFS and Hive using Sqoop.
  • Used Flume to collect, aggregate, and store the web log data from different sources like web servers, mobile and network devices and pushed to HDFS.
  • Used Sqoop to import the data on to Cassandra tables from different relational databases like Oracle, MySQL and Designed Column families.
  • Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website (see the sketch after this list).
  • Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
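
A minimal sketch of the web-log analysis referenced above, written here with Spark SQL for illustration rather than standalone HiveQL; the table name (web_logs) and columns (visitor_id, visit_date) are hypothetical:

    # weblog_analysis.py - illustrative sketch only; table and column names are assumptions
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("weblog-analysis")
             .enableHiveSupport()
             .getOrCreate())

    # Unique visitors and page views per day from a hypothetical web_logs table
    daily = spark.sql("""
        SELECT visit_date,
               COUNT(DISTINCT visitor_id) AS unique_visitors,
               COUNT(*)                   AS page_views
        FROM   web_logs
        GROUP BY visit_date
        ORDER BY visit_date
    """)

    daily.show()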

Environment: Amazon EC2, Apache Hadoop 1.0.1, MapReduce, HDFS, CentOS 6.4, Spark, Impala, HBase, Kafka, Elastic Search, Hive, Pig, Oozie, Flume, Java (jdk 1.6), Eclipse, Sqoop, Ganglia, LINUX.

Hadoop Admin/Developer

Confidential

Responsibilities:

  • Collaborate in identifying the current problems, constraints and root causes with data sets to identify the descriptive and predictive solution with support of the Hadoop HDFS, MapReduce, Pig, Hive, and Hbase and further to develop reports in Tableau.
  • Architected the Hadoop cluster in pseudo-distributed mode working with Zookeeper; stored and loaded data from HDFS to Amazon S3, handled backups, and created tables in the AWS cluster with S3 storage.
  • Evaluated existing infrastructure, systems, and technologies and provided gap analysis, and documented requirements, evaluation, and recommendations of system, upgrades, technologies and created proposed architecture and specifications along with recommendations.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
  • Installed and Configured Sqoop to import and export the data into MapR-FS, HBase and Hive from Relational databases.
  • Administered large MapR Hadoop environments: built and supported cluster setup, performance tuning, and monitoring in an enterprise environment.
  • Installed and Configured MapR-zookeeper, MapR-cldb, MapR-jobtracker, MapR-tasktracker, MapR resource manager, MapR-node manager, MapR-fileserver, and MapR-webserver.
  • Installed and configured Knox gateway to secure HIVE through ODBC, WebHcat and Oozie services.
  • Load data from relational databases into MapR-FS filesystem and HBase using Sqoop and setting up MapR metrics with NoSQL database to log metrics data.
  • Close monitoring and analysis of the MapReduce job executions on cluster at task level and optimized Hadoop clusters components to achieve high performance.
  • Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
  • Worked on commissioning & decommissioning of Data Nodes, NameNode recovery, capacity planning and installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Worked on creating the Data Model for HBase from the current Oracle Data model.
  • Implemented High Availability and automatic failover infrastructure to overcome single point of failure for Name node utilizing zookeeper services.
  • Leveraged Chef to manage and maintain builds in various environments and planned for hardware and software installation on production cluster and communicated with multiple teams to get it done.
  • Monitoring the Hadoop cluster functioning through MCS and worked on NoSQL databases including HBase.
  • Used Hive, created Hive tables, was involved in data loading and writing Hive UDFs, and worked with the Linux server admin team in administering the server hardware and operating system.
  • Worked closely with data analysts to construct creative solutions for their analysis tasks and managed and reviewed Hadoop and HBase log files.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports and worked on importing and exporting data from Oracle into HDFS and HIVE using Sqoop.
  • Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades when required.

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Kafka, Zookeeper, Oozie, Impala, Cloudera, Hortonworks, Oracle, MySQL, Teradata, SQL Server, Python, UNIX Shell Scripting, ETL, Flume, Scala, Spark, Sqoop, AWS, S3, EC2, Java, JUnit, YARN.

Java Developer

Confidential

Responsibilities:

  • Worked in design, development, implementation and testing of Client-Server, Web Applications using Java/J2EE Technologies.
  • Experience in SDLC (Software Development Life Cycle) that includes Requirements Analysis, Design, Coding, Testing, Implementation, Maintenance the methodologies like Waterfall Model and Agile Methodology.
  • Developing well-designed, efficient, and testable code
  • Conducting software analysis, programming, testing, and debugging
  • Troubleshooting and resolving the reported issues and replying to queries in a timely manner.
  • Worked on CI/CD tools like Jenkins and Docker in the DevOps team.
  • Designed and developed Message Flows and Message Sets and other service component to expose Mainframe applications to enterprise J2EE applications.
  • Designed and developed a fully functional generic n tiered J2EE application platform whose environment was Oracle technology driven. The entire infrastructure application was developed using Oracle JDeveloper in conjunction with Oracle ADF- RichFaces and Oracle ADF-BC.
  • Developed the Action Classes, Action Form Classes, created JSPs using Struts tag libraries and configured in Struts-config.xml, Web.xml files.
  • Wrote several Action Classes and Action Forms to capture user input and created different web pages using JSTL, JSP, HTML, Custom Tags and Struts Tags.
  • Developed Custom Tags to simplify the JSP code. Designed UI screens using JSP and HTML.
  • Actively involved in designing and implementing Factory method, Singleton, MVC and Data Access Object design patterns.
  • Web services used for sending and getting data from different applications using SOAP messages. Then used DOM XML parser for data retrieval.
  • Used JUnit framework for unit testing of application and ANT to build and deploy the application on WebLogic Server.
  • Experience in design and development of web-based applications using Java, JDBC, SQL, Servlets, JSTL, JSP, XML, Java API, and Spring.
  • Experience in client-side Technologies such as HTML/HTML5, CSS/CSS3, JavaScript and jQuery, AJAX, JSON.
  • Experience in implementing SOA (Service-Oriented Architecture) using Web Services (SOAP, WSDL, RESTful, and JAX-WS) and REST services.

Environment: Java, J2EE, JSP, Servlets, HTML, DHTML, XML, JavaScript, Struts, c/c, Eclipse, WebLogic, PL/SQL, and Oracle.
