Big Data Engineer/Hadoop Developer Resume
Chicago
SUMMARY:
- Big Data Developer/Hadoop Developer with eight-plus years of experience in the Hadoop ecosystem (Hive, Pig, YARN, MapReduce, Impala, Sqoop, Spark, Oozie, ZooKeeper, HBase, Hue, Ambari, Kafka and Flume), designing and implementing solutions for big data applications with excellent knowledge of Hadoop architecture (HDFS, NameNode and DataNode).
- Good knowledge of distributed computing, the Spark Core API and Spark SQL.
- Used various file formats such as Avro, Parquet, SequenceFile, JSON, ORC and plain text for loading, parsing, gathering and transforming data.
- Good experience with the Hortonworks and Cloudera distributions of Apache Hadoop.
- Experience building bi-directional data pipelines between HDFS and relational databases with Sqoop.
- Designed and created Hive external tables using a shared metastore with static and dynamic partitioning, bucketing and indexing.
- Expertise in analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Python and Java.
- Good knowledge of Pig for loading data, transformations, event joins, filtering, grouping and other aggregation functions.
- Improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and pair RDDs (a short PySpark sketch follows this summary).
- Familiarity with Python libraries such as PySpark, NumPy, pandas, Starbase and Matplotlib.
- Contributed towards building Apache Spark applications using Python and Scala.
- Experience writing complex SQL queries using joins, GROUP BY and nested queries.
- Experience loading data into HBase using connectors and writing queries against the NoSQL store.
- Improved the performance of Hive and Pig queries by running them on Apache Tez.
- Solid capabilities in exploratory data analysis, statistical analysis and visualization using R, Python, SQL and Tableau.
- Experience implementing machine learning algorithms using the Spark MLlib library.
- Ran and scheduled workflows using Oozie and ZooKeeper, identified failures, and integrated, coordinated and scheduled jobs.
- Hands-on experience with Kafka and Flume for loading log data from multiple sources directly into HDFS.
- Integrated Hadoop with Tableau to generate visualizations such as Tableau dashboards.
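The Spark bullet above refers to a short PySpark sketch; here is a minimal example of the DataFrame/Spark SQL pattern, assuming a Hive table named `sales` with `region` and `amount` columns (hypothetical names, not from an actual engagement).

```python
# Minimal PySpark sketch: read a Hive table as a DataFrame and aggregate it.
# The table and column names (sales, region, amount) are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("summary-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Load an existing Hive table into a DataFrame.
sales = spark.table("sales")

# The same aggregation expressed through the DataFrame API and through Spark SQL.
by_region_df = sales.groupBy("region").sum("amount")
by_region_sql = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

by_region_df.show()
```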
TECHNICAL SKILLS:
Operating Systems: Linux, Mac OS & Windows
Hadoop eco system: HDFS, MapReduce, Hive, Yarn, Pig, Impala, Spark SQL, HBase, Kafka, Sqoop, Flume, Spark Streaming, Oozie, Zookeeper, Hue, Ambari.
Hadoop Distribution: Hortonworks HDP 2.6.1, Cloudera CDH 5.10.
Programming Languages: R, Python, Linux shell scripting, Java and Scala
Databases: MySQL, MongoDB, Cassandra, Teradata and HBase
ETL: Talend
Cloud: AWS, Microsoft Azure, Google Cloud
Build Tools: Ant, Maven
Processing: Apache Spark, Apache Storm
Visualization: Tableau, R, Python
WORK EXPERIENCE:
Confidential, Chicago
Big Data Engineer/Hadoop Developer
Responsibilities:
- Ingested data from relational databases into HDFS using Sqoop import/export and created saved Sqoop jobs, eval checks and incremental-import jobs (a hedged command sketch follows this list).
- Applied partitioning, bucketing and indexing for optimization as part of Hive data modeling (see the example DDL after this list).
- Responsible for installation and configuration of Hive, Pig, Sqoop, Flume and Oozie on the Hadoop Cluster.
- Involved in developing Hive DDLs to create, alter and drop Hive tables.
- Built reusable Hive UDF libraries for business requirements, enabling users to apply these UDFs in Hive queries.
- Responsible for analyzing and cleansing raw data by performing Hive queries and running Pig Scripts on data.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs; experienced with spark-shell and Spark Streaming.
- Used Spark with Python and Scala, leveraging DataFrames, Datasets and the Spark SQL API for faster data processing.
- Built a recommendation system using an association rule mining algorithm in Spark MLlib to find frequent buying patterns among customers and recommend products accordingly; also implemented pruning of obvious items.
- Streamed real-time data by integrating Kafka with Spark for dynamic price surging using a machine learning algorithm.
- Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats, including XML, JSON, CSV and other compressed formats.
- Developed Hive and Impala queries for end-user/analyst requirements to perform ad hoc analysis.
- Strong knowledge of NoSQL column-oriented databases such as HBase and their integration with the Hadoop cluster using connectors.
- Experienced with the Hue UI for accessing HDFS files and data.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Designed a workflow scheduling Hive processes for log file data streamed into HDFS using Flume.
- Designed Oozie workflows and coordinators, managed with ZooKeeper, to automate and parallelize Hive, Sqoop and Pig jobs on Cloudera Hadoop using XML definitions.
- Experienced in performing in-memory batch processing using Spark Streaming (Spark, Spark SQL and spark-shell).
- Involved in building the runnable JARs for the module framework through Maven clean and Maven dependency goals.
- Tested Apache Tez, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
- Developed SQL scripts to compare all the records for every field and table at each phase of the data movement process from the original source system to the final target.
- Pre-processed large sets of structured and semi-structured data, with different formats like Text Files, Avro, Parquet, ORC, Sequence Files, and JSON Record.
- Responsible for continuous monitoring and managing the Hadoop Cluster using Cloudera Manager.
- Created customized Tableau dashboards, integrating custom SQL from Hadoop and performing data blending in reports.
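As referenced in the data ingestion bullet above, a hedged sketch of a saved incremental Sqoop import job; the JDBC URL, credentials, table and column names are hypothetical placeholders.

```sh
# Hypothetical saved Sqoop job performing incremental-append imports from MySQL into HDFS.
# JDBC URL, credentials, table and column names are placeholders.
sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user --password-file /user/etl/.db_pass \
  --table orders \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column order_id \
  --last-value 0

# Execute the saved job; Sqoop records the new --last-value after each run.
sqoop job --exec orders_incremental
```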
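The Hive data modeling bullet above refers to an example DDL; here is a hedged sketch of a partitioned, bucketed external table. Database, table, column and path names are hypothetical.

```sql
-- Hypothetical external Hive table with partitioning and bucketing for query optimization.
-- Database, table, columns and HDFS paths are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/data/warehouse/orders';

-- Dynamic-partition load from a staging table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales.orders PARTITION (order_date)
SELECT order_id, customer_id, amount, order_date FROM sales.orders_staging;
```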
Environment: Cloudera CDH 5.8, Linux, HDFS, MapReduce, Shell Scripting, Java, Talend, Hive, Pig, Spark, Storm, Impala, Sqoop, Flume, Oozie, Kafka, Eclipse, Apache Tez, Yarn, Maven, Tableau.
Confidential, Washington D.C.
Big Data Developer
Responsibilities:
- Working in an Agile team to deliver and support required business objectives by using Python, Shell Scripting and other related technologies to acquire, ingest, transform and publish data both to and from Hadoop Ecosystem.
- Extracted data from MySQL into HDFS using Sqoop import/export, handled importing data from various other sources, performed transformations using Pig and loaded the data into HDFS.
- Assisted application teams in installing Hadoop updates, operating system patches and version upgrades when required.
- Imported data from RDBMS to Hadoop using Sqoop import.
- Hands-on experience loading data from the UNIX file system to HDFS and vice versa.
- Performed transformations using Python and Scala to analyze and gather the data in the format required by the customer.
- Used a combination of Flume and Kafka to collect log data from web and mobile application servers.
- Worked on migrating custom workflows built using tools in Hadoop to a third-party tool.
- Used Pig as an ETL tool for transformations, event joins, filtering and some pre-aggregations before storing the data in HDFS (a sample script follows this list).
- Troubleshot, debugged and resolved Talend issues, and worked on the maintenance and performance of the ETL tools.
- Created Hive External and Internal tables on top of data in HDFS using various SerDe.
- Created Hive tables stored as ORC for faster access and compression as part of data modeling.
- Ran analytic queries and gathered table statistics for Hive tables using Impala.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
- Proficient in row key and schema design for the NoSQL database HBase, with knowledge of another NoSQL database, Cassandra.
- Worked on creating and automating reports in Excel using data imported from Hive via ODBC.
- Wrote RDD and DataFrame code to process data in Spark before it is ingested for different uses such as reporting.
- Worked on Python scripts to help our team internally with data management.
- Migrated complex MapReduce programs into Spark RDD transformations and actions.
- Developed RDDs using Python and coded Python applications for business requirements.
- Created workflows to automate the batch jobs using third party tools.
- Used R and R Studio for statistical models, machine learning algorithms and creating executive reports.
- Used statistical inference, linear regression and maximization techniques.
- Created CloudFormation templates to build a repeatable process for standing up application deployment environments in AWS, such as EC2 and EMR.
- Provisioned, monitored and maintained AWS EC2 instances, watching security and managing AWS S3 bucket storage in the AWS cloud environment.
- Experience with EMR clusters for running Spark algorithms through PuTTY.
- Worked in production support team to ensure data availability, data quality and data integrity for the enterprise.
- Participated in regular stand-up meetings, status calls and business owner meetings with stakeholders and risk management teams in an agile environment.
- Supported code/design analysis, strategy development and project planning.
- Followed a Scrum implementation of the scaled agile methodology for the entire project.
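A short Pig Latin sketch of the ETL pattern described above (load, filter, join, pre-aggregate, store); the relations, paths and fields are hypothetical placeholders.

```pig
-- Hypothetical Pig ETL: load raw events, filter, join with a lookup, pre-aggregate, store.
events = LOAD '/data/raw/events' USING PigStorage('\t')
         AS (user_id:chararray, event_type:chararray, amount:double);
users  = LOAD '/data/raw/users' USING PigStorage('\t')
         AS (user_id:chararray, region:chararray);

purchases = FILTER events BY event_type == 'purchase';
joined    = JOIN purchases BY user_id, users BY user_id;
grouped   = GROUP joined BY users::region;
totals    = FOREACH grouped GENERATE group AS region,
                                     SUM(joined.purchases::amount) AS total_amount;

STORE totals INTO '/data/curated/purchase_totals' USING PigStorage('\t');
```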
Skills: Cloudera CDH 4 and CDH 5, Elasticsearch, AWS EC2, Hadoop, Spark, Kafka, Flume, Sqoop, Hive, Impala, HBase, R, RStudio, Scala and Python, AWS, Zookeeper, Shell Scripting, Oozie, SQL and Tableau
Confidential, Charlotte
Big Data Developer
Responsibilities:
- Responsible for loading the customer’s data and event logs from Kafka into HBase using REST API.
- Worked on debugging, performance tuning and analyzing data using the Hadoop components Hive and Pig.
- Imported streaming data into HBase using Apache Storm and Apache Kafka and designed Hive tables on top.
- Created Hive tables from JSON data using data serialization frameworks such as Avro.
- Developed multiple POCs using PySpark, deployed them on the YARN cluster and compared the performance of Spark with Hive and SQL/Teradata.
- Deployed Hadoop cluster using Hortonworks with Pig, Hive, HBase and Spark.
- Developed RESTful web services using Spring Boot and deployed them to Pivotal Web Services.
- Used build and deployment tools like Maven.
- Involved in Test Driven Development (TDD).
- Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive (a consumer sketch follows this list).
- Importing and exporting data into HDFS and Hive using Sqoop.
- Responsible for processing unstructured data using Pig and Hive.
- Managed and reviewed Hadoop log files. Used Scala to integrate Spark with Hadoop.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Extensively used Pig for data cleansing and HIVE queries for the analysts.
- Created Pig script jobs while maintaining query optimization.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Worked on various BusinessObjects reporting functionalities such as slice and dice, master/detail, the User Response function and different formulas.
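As noted in the Kafka bullet above, a hedged PySpark Streaming consumer sketch using the Spark 1.x/2.x direct-stream API that shipped with these distributions; the broker list, topic and HDFS path are hypothetical.

```python
# Hypothetical Kafka -> Spark Streaming consumer that lands micro-batches on HDFS.
# Broker list, topic name and output path are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="customer-events-sketch")
ssc = StreamingContext(sc, 30)  # 30-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["customer-events"], {"metadata.broker.list": "broker1:9092,broker2:9092"})

# Each record arrives as a (key, value) pair; keep the value and write every batch to HDFS.
values = stream.map(lambda kv: kv[1])
values.saveAsTextFiles("hdfs:///data/raw/customer_events/batch")

ssc.start()
ssc.awaitTermination()
```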
Skills: Hortonworks HDP, Linux, Hadoop, HDFS, Pig, Hive, HBase, MapReduce, Sqoop, Oozie, Spark, Hue, Teradata, Java APIs, Java Collections, SQL, Business Objects XI R2, Apache Storm, PySpark, Spring Boot, Maven, Kafka, Scala, Spark SQL.
Confidential
Data Analyst
Responsibilities:
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data from RDBMS through Sqoop, placed in HDFS for further processing.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Built and maintained scalable data pipelines using the Hadoop ecosystem and other open source components like Hive.
- Managed and scheduled jobs on a Hadoop cluster using Oozie.
- Created tables using Hive and performed queries using HiveQL, which invokes and runs MapReduce jobs under the hood.
- Involved in creating Hive tables, loading data and running Hive queries on that data.
- Extensive working knowledge of partitioned table, UDFs, performance tuning, compression-related properties, thrift server in Hive.
- Involved in writing optimized Pig scripts, along with developing and testing Pig Latin scripts.
- Working knowledge in writing Pig's Load and Store functions.
- Developed SQL queries to join tables in MySQL and prepare data for statistical models (a sample query follows this list).
- Prepared reports using Excel Pivot Tables and Pivot Charts
- Assimilated and stitched unstructured customer data to MySQL database ensuring consistency
- Created MySQL database to capture online enquiries placed on website
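A sample of the kind of MySQL join used above to prepare inputs for statistical models; the schema and column names are hypothetical.

```sql
-- Hypothetical MySQL join building per-customer features for a statistical model.
SELECT c.customer_id,
       c.region,
       COUNT(o.order_id)          AS order_count,
       COALESCE(SUM(o.amount), 0) AS total_spend
FROM customers c
LEFT JOIN orders o
       ON o.customer_id = c.customer_id
      AND o.order_date >= '2016-01-01'
GROUP BY c.customer_id, c.region;
```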
Skills: Hortonworks, Hadoop, Ambari, HDFS, Sqoop, Hive, HBase, Pig, Oozie, MySQL, Flume, SQL and Tableau
Confidential
Software Analyst
Responsibilities:
- Created extract files for improving the performance. Used different Mark types and Mark properties in views to provide better insights into large data sets.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets
- Responsible for gathering and analyzing the business requirements and then translating them into technical report specifications.
- Designed Tableau Reports, graphs and dashboards as per requirements.
- Created Tableau scorecards and dashboards using stacked bars, bar graphs, scatter plots, geographical maps and Gantt charts via the Show Me functionality.
- Delivered reports to the business team in a timely manner.
- Worked on functional requirements sessions with business and technology stakeholders on data modeling, integration and configuration for the data warehouse, with automated and manual field data collection systems.
- Created SQL queries for testing the Tableau dashboards (a sample validation query follows this list).
- Migrated workbooks and handled Tableau upgrade/migration work. Implemented new Tableau features in the existing workbooks and dashboards.
- Created Dashboards with interactive views, trends and drill downs. Published Workbooks and Dashboards to the Tableau server.
- Combined visualizations into interactive dashboards and published them to the web.
- Involved in installation and configuration of Tableau Server.
- Publishing dashboards and extracts to Tableau server.
- Defined best practices for Tableau report development.
- Developed training plan to cross train new team members and facilitated knowledge base content management.
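A sketch of the kind of validation query mentioned above, recomputing a dashboard figure directly in SQL so it can be compared with the Tableau view; table and column names are hypothetical and the date function is MySQL-style.

```sql
-- Hypothetical check: recompute monthly revenue in SQL and compare it
-- against the totals displayed on the Tableau dashboard.
SELECT DATE_FORMAT(order_date, '%Y-%m') AS order_month,
       SUM(amount)                      AS revenue
FROM orders
WHERE order_date >= '2015-01-01'
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
ORDER BY order_month;
```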
Skills: Tableau, Dashboards, Tableau Desktop, Tableau Server, SQL, MS-Excel and MS-Office.