Big Data/Hadoop Developer Resume
SUMMARY:
- 5+ years of experience in analysis, design, development, testing, customization, bug fixes, enhancement, support, and implementation using Python and Spark programming for Hadoop. Worked in AWS environments with Lambda, serverless applications, EMR, Athena, Glue, IAM policies and roles, S3, CloudFormation (CFT), and EC2.
- Developed Python and PySpark programs for data analysis.
- Developed PySpark code for AWS Glue jobs and for EMR (a minimal Glue job sketch appears after this list).
- Worked on scalable distributed data systems using the Hadoop ecosystem on AWS EMR and the MapR distribution.
- Good working experience with Python, developing a custom framework for generating rules (similar to a rules engine). Developed Hadoop Streaming jobs in Python to integrate applications with Python APIs.
- Developed Python code to gather data from HBase and designed the solutions implemented with PySpark.
- Used Apache Spark DataFrames/RDDs to apply business transformations and HiveContext objects to perform read/write operations.
- Rewrote Hive queries in Spark SQL to reduce overall batch time.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Spark, Hive, and Sqoop) as well as system-specific jobs (such as Python programs and shell scripts).
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Highly motivated to work with Python and R scripts for statistical analytics and data-quality reporting.
- Good experience reading R code to analyze machine learning models.
- Worked with Python for statistical analysis of data, including data-quality checks and confidence intervals.
- Good experience in Linux Bash scripting and following PEP guidelines in Python.
- Worked on Kafka for data streaming and ingestion.
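A minimal sketch of the kind of PySpark code behind the AWS Glue jobs mentioned above, assuming a Glue Data Catalog source and an S3 Parquet target; the database, table, column, and bucket names are illustrative placeholders, not the actual project values.

```python
# Minimal AWS Glue PySpark job sketch; database/table/path names are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Resolve the job name passed in by Glue at run time.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Apply a simple column mapping / type cast as an example transformation.
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("event_id", "string", "event_id", "string"),
              ("event_ts", "string", "event_ts", "timestamp")],
)

# Write the result back to S3 as Parquet (placeholder bucket/prefix).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)

job.commit()
```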
TECHNICAL SKILLS:
Operating systems: CentOS, Ubuntu, Red Hat Linux 5.x/6.x, Amazon Linux, Windows 95/98/NT/2000, Windows Vista, Windows 7
Programming Languages: Python, Java, C# and Scala
Databases: Oracle 10g (SQL), MySQL, SQL Server 2008
Scripting Languages: Shell scripting, PowerShell.
Hadoop Ecosystem: Hive, Pig, Flume, Oozie, Sqoop, Spark, Impala, Kafka, and HBase
WORK EXPERIENCE:
Confidential
Big Data/Hadoop Developer
Responsibilities:
- Handled importing of data from various data sources, performed transformations using Spark, and loaded the data into Hive.
- Responsible for data analysis and data cleaning using Spark SQL queries.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
- Used PySpark to write the code for all the Spark use cases; extensive experience with Scala for data analytics on Spark clusters; performed map-side joins on RDDs.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
- Migrated an existing on-premises application to AWS.
- Used AWS services like EC2 and S3 for small data sets.
- Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by the applications.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data (see the sketch after this list).
- Good hands-on experience with microservices on Cloud Foundry for real-time data streaming, persisting data in HBase and communicating with RESTful web services using the Java API.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team; good working knowledge of Datameer.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop Cluster.
- Used Flume to collect, aggregate, and store web log data from various sources such as web servers, mobile, and network devices, and pushed it to HDFS.
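A minimal sketch of the Kafka-to-Hive ingestion pipeline described above, assuming Spark Structured Streaming with the spark-sql-kafka package on the classpath; the broker address, topic, message schema, and output paths are illustrative placeholders.

```python
# Minimal PySpark Structured Streaming sketch: Kafka -> transform -> Hive-readable Parquet.
# Broker, topic, schema, and path names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Expected JSON layout of each Kafka message (assumed schema).
schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw stream from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "orders")
       .load())

# Parse the JSON payload and keep only valid rows (simple cleaning step).
orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("o"))
          .select("o.*")
          .filter(col("amount") > 0))

# Append micro-batches as Parquet files at a warehouse path that a Hive external table points to.
query = (orders.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/user/hive/warehouse/orders_clean")
         .option("checkpointLocation", "/tmp/checkpoints/orders_clean")
         .start())

query.awaitTermination()
```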
Environment: MapReduce, S3, EC2, EMR, Java, HDFS, Hive, Pig, Tez, Oozie, HBase, Scala, PySpark, Spark SQL, Kafka, Python, Linux, PuTTY, Cassandra, Shell Scripting, ETL, YARN.
Confidential
Software Engineer
Responsibilities:
- Worked on scalable distributed data systems using the Hadoop ecosystem on AWS EMR and the MapR data platform.
- Developed simple to complex MapReduce streaming jobs using Python, Hive, and Pig.
- Used various compression codecs to optimize MapReduce jobs and use HDFS efficiently.
- Used the ETL component Sqoop to extract data from MySQL and load it into HDFS.
- Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
- Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Strong exposure to Unix scripting and good hands-on shell scripting skills.
- Wrote python scripts to process semi-structured data in formats like JSON.
- Worked closely with the data modelers to model the new incoming data sets.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Troubleshot and found bugs in Hadoop applications, working with the testing team to clear them.
- Good hands-on experience with the Kafka Python API, developing producers and consumers that write Avro schemas (a producer sketch appears after this list).
- Installed the Ganglia monitoring tool to generate Hadoop cluster reports (running CPUs, hosts up/down, etc.) and performed operations to maintain the cluster.
- Good hands-on experience with real-time data ingestion using Kafka and real-time processing through Storm (spouts, bolts), persisting data into HBase for analytics.
- Responsible for data analysis and data cleaning using Spark SQL queries.
- Handled importing of data from various data sources, performed transformations using Spark, and loaded the data into Hive.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
- Used Scala to write the code for all the Spark use cases; extensive experience with Scala for data analytics on Spark clusters; performed map-side joins on RDDs.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions (see the Lambda sketch after this list).
- Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by the applications.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
- Good hands-on experience with microservices on Cloud Foundry for real-time data streaming, persisting data in HBase and communicating with RESTful web services using the Java API.
- Determined the viability of business problems for big data solutions with PySpark.
- Proactively monitored systems and services; worked on architecture design and implementation of Hadoop deployments, configuration management, backup, and disaster recovery systems and procedures.
- Monitored multiple Hadoop cluster environments using Ganglia; monitored workload, job performance, and capacity planning using MapR.
- Involved in time series data representation using HBase.
- Wrote MapReduce programs in Java on log data to transform it into a structured form and derive user location, age group, and time spent.
- Strong working experience with Splunk for real-time log data monitoring.
- Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most-purchased product on the website.
- Built clusters in the AWS environment using EMR with S3, EC2, and Redshift.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team.
- Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Worked with BI (Tableau) teams on dataset requirements; good working experience with data visualization.
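A minimal sketch of a Kafka producer writing Avro-encoded records from Python, as referenced above; it assumes the confluent-kafka client with a Schema Registry, and the broker address, topic, and record schema are illustrative placeholders.

```python
# Minimal Kafka Avro producer sketch using confluent-kafka (assumed client);
# broker, registry URL, topic, and schema are placeholders.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Avro schema for the record value (assumed example layout).
value_schema = avro.loads("""
{
  "type": "record",
  "name": "Click",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "broker1:9092",
        "schema.registry.url": "http://schema-registry:8081",
    },
    default_value_schema=value_schema,
)

# Serialize the dict with the Avro schema and send it to the topic.
producer.produce(topic="clicks", value={"user_id": "u123", "url": "/home"})
producer.flush()
```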
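A minimal sketch of a Lambda handler for the on-demand EMR launcher described above, using boto3 to start a cluster with a custom spark-submit step; the release label, instance types, IAM role names, and S3 paths are illustrative placeholders.

```python
# Minimal AWS Lambda sketch: launch an EMR cluster with a spark-submit step
# when triggered (e.g. by an S3 event or SNS notification).
# Release label, instance sizes, roles, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    response = emr.run_job_flow(
        Name="on-demand-spark-job",
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate the cluster automatically once the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://example-bucket/emr-logs/",
        Steps=[
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit", "--deploy-mode", "cluster",
                        "s3://example-bucket/jobs/etl_job.py",
                    ],
                },
            }
        ],
    )
    return {"JobFlowId": response["JobFlowId"]}
```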
Environment: MapReduce, AWS (S3, EC2, EMR, Redshift), Java, Hadoop, HDFS, Hive, Pig, Tez, Oozie, HBase, Spark, Scala, Spark SQL, PySpark, Kafka, Python, PuTTY, Cassandra, Shell Scripting, ETL, YARN, Splunk, Sqoop, Linux, Cloudera, Ganglia, SQL Server.
Confidential
Software Engineer
Responsibilities:
- Installed, configured, and maintained Apache Hadoop clusters for application development along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
- Worked on analyzing Hadoop clusters using different big data analytic tools including Hive, MapReduce, Pig, and Flume.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Managed Hadoop clusters using Cloudera. Extracted, Transformed, and Loaded (ETL) of data from multiple sources like Flat files, XML files, and Databases.
- Strong experience working with Talend for data integration and workflow management through TAC.
- Worked on Talend ETL to load data from various sources into the data lake; used tMap, tReplicate, tFilterRow, tSort, and various other Talend components.
- Developed datasets that follow CDISC standards and stored them in HDFS (S3).
- Good experience with ODM and SEND XML data formats for interacting with user data.
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and pre-aggregations before storing the data in HDFS, using UDFs developed in Python and Java.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to any warning or failure conditions.
- Good knowledge of Amazon EMR (Elastic MapReduce).
- Developed Pig UDFs to pre-process the data for analysis and Hive queries for the analysts (a Python UDF sketch appears after this list).
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig; cluster coordination services through ZooKeeper.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among the users' MapReduce jobs.
- Worked on designing PoCs for implementing various ETL processes.
- Responsible for building scalable distributed data solutions using Hadoop.
- Analyzed large data sets by running Hive Queries and Pig scripts.
- Involved in creating Hive tables, loading and analyzing data using Hive Queries.
- Extracted data from Teradata into HDFS using Sqoop.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Mentored analysts and the test team in writing Hive queries.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install Hadoop updates, patches and version upgrades as required.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Implemented best income logic using Pig scripts and UDFs.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on tuning the performance for Hive and Pig queries.
- Developed UNIX shell scripts to automate repetitive database processes.
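A minimal sketch of the kind of Python (Jython) UDF used to pre-process data in the Pig scripts above; the function names, field names, and cleaning rules are illustrative placeholders.

```python
# clean_udf.py -- minimal Pig Python (Jython) UDF sketch; names and rules are placeholders.
# Registered from Pig with, for example:
#   REGISTER 'clean_udf.py' USING jython AS cleaners;
#   cleaned = FOREACH raw GENERATE cleaners.normalize_name(name), cleaners.is_valid_age(age);
# Pig provides the @outputSchema decorator to Jython UDF scripts at load time.

@outputSchema("name:chararray")
def normalize_name(name):
    # Trim whitespace and lower-case the value; treat missing input as empty.
    if name is None:
        return ""
    return name.strip().lower()

@outputSchema("valid:int")
def is_valid_age(age):
    # Flag records whose age falls in a plausible range.
    return 1 if age is not None and 0 < age < 120 else 0
```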
Environment: Hadoop, Talend, HBase, Python, MapR, ETL, HDFS, Hive, Java (JDK 1.7), Pig, ZooKeeper, Oozie, Flume, Unix Shell Scripting, Teradata, Sqoop.