Spark/Hadoop Engineer Resume
Jacksonville, Florida
SUMMARY:
- Experience in Big Data analytics and development, including 2+ years of Spark development using Scala on the Cloudera and Hortonworks platforms.
- In-depth knowledge of the Big Data stack in the Hadoop ecosystem: HDFS, YARN, Hive, Impala, Spark, HBase, and Phoenix, along with Sqoop for data migration and Flume for data ingestion.
- Wrote Hive queries to analyze user request patterns, constructed UDFs in Hive, and implemented performance optimizations including partitioning and bucketing of Hive tables.
- Built Spark applications for data processing and analysis using Spark RDDs, DataFrames, Datasets, Spark SQL, DStreams, and Structured Streaming with various transformations and actions; constructed UDFs for data cleansing (see the first sketch after this list).
- Solid understanding of relational databases: created normalized schemas; wrote stored procedures, views, and functions; used JDBC to communicate with databases. Experienced with MySQL and SQL Server.
- Knowledge of designing both time-driven and data-driven automated workflows using Oozie.
- Built a POC in Spark Streaming for near-real-time and batch processing: implemented windowing functions with transformations and configured batch, window, and slide intervals (see the second sketch after this list). Sound knowledge of data ingestion using Kafka and Flume.
- Sound knowledge of the CDH (Cloudera Distribution Including Apache Hadoop) and HDP (Hortonworks Data Platform) distributions.
- Used Tableau to build reports on top of the data.
- Knowledge of extracting Avro schemas using avro-tools.
- Experience with text, SequenceFile, Avro, Parquet, ORC, CSV, and JSON file formats in the Hadoop ecosystem.
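A minimal sketch of the DataFrame-plus-UDF data cleansing work described above; the application, table, and column names are hypothetical placeholders, not details from the actual projects.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CleanseRequests {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CleanseRequests")
      .enableHiveSupport()
      .getOrCreate()

    // Simple cleansing UDF: trim whitespace and normalize case of a string column.
    val normalize = udf((s: String) => Option(s).map(_.trim.toLowerCase).orNull)

    // Hypothetical Hive source table of raw user requests.
    val raw = spark.table("staging.user_requests")

    val cleansed = raw
      .withColumn("request_type", normalize(col("request_type")))
      .filter(col("user_id").isNotNull)

    // Aggregate request patterns per user and request type.
    val patterns = cleansed
      .groupBy("user_id", "request_type")
      .agg(count(lit(1)).as("request_count"))

    patterns.show(20)
    spark.stop()
  }
}
```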
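And a minimal sketch of the windowed Spark Streaming processing from the POC item above, with assumed batch (5 s), window (60 s), and slide (10 s) intervals and a socket source standing in for the real Kafka/Flume ingestion:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedCounts")
    // Batch interval: how often a micro-batch is produced.
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpointing is required for window operations that use an inverse function.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    // Hypothetical text source; in the POC the data came through Kafka/Flume.
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      // Window of 60 seconds, sliding every 10 seconds.
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```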
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop, MapReduce, YARN, Sqoop, Hive, Impala, Oozie, Spark, Flume, Kafka, Ambari, HBase, Phoenix.
Hadoop Distributions: Cloudera (CDH4 and CDH5), Hortonworks
Languages: Scala, Python, SQL, Java
NoSQL Databases: HBase, Apache Cassandra, MongoDB
Development / Build Tools: Eclipse, IntelliJ, SBT, Maven, SQL Server Management Studio, PyCharm
DB Languages: MySQL, PL/SQL, T-SQL, Oracle
RDBMS: Oracle 10g/11g/12c, MS SQL Server 2012/2014/2016, MySQL
Operating Systems: UNIX, Linux, Mac OS, and Windows variants
Visualization: Tableau
MS Tools: Access, Visio, Word, PowerPoint, MS Project, Excel
Methodology: Agile, Waterfall
PROFESSIONAL EXPERIENCE:
Confidential - Jacksonville, Florida
Spark/Hadoop Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Assisted in loading large sets of structured, semi-structured, and unstructured data into HDFS.
- Imported data from various RDBMS systems into HDFS using Sqoop (and exported it back), including scheduled Sqoop jobs.
- Created Hive tables, loaded data, and wrote Hive queries to analyze user request patterns; implemented performance optimizations including partitioning and bucketing in Hive.
- Wrote Spark scripts in Scala, including Spark shell commands, as per requirements.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Handled large datasets using Spark transformations and actions; worked with Datasets and DataFrames to run SQL queries on top of the data.
- Used Spark SQL to read data in JSON and Parquet file formats and run SQL queries on it (see the first sketch after this list).
- Tuned Spark applications for performance: setting the right batch interval, the correct level of parallelism, and memory usage.
- Accessed data from RDBMS and NoSQL data sources.
- Worked on a POC comparing processing times for batch applications in Apache Hive versus Apache Spark.
- Used the Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which receives data from Kafka in near real time and persists it into HBase (see the second sketch after this list).
- Used Tableau for visualization by connecting to Hive and generating reports.
- Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
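A minimal sketch of reading JSON and Parquet data with Spark SQL, as described above; the HDFS paths, view names, and columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object FormatQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FormatQueries")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical HDFS locations for raw JSON and curated Parquet data.
    val requestsJson = spark.read.json("hdfs:///data/raw/requests.json")
    val requestsParquet = spark.read.parquet("hdfs:///data/curated/requests.parquet")

    requestsJson.createOrReplaceTempView("requests_raw")
    requestsParquet.createOrReplaceTempView("requests_curated")

    // Run a SQL query directly on the registered views.
    spark.sql(
      """SELECT user_id, COUNT(*) AS request_count
        |FROM requests_curated
        |GROUP BY user_id
        |ORDER BY request_count DESC
        |LIMIT 20""".stripMargin).show()

    spark.stop()
  }
}
```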
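And a minimal sketch of the Kafka-fed Spark Streaming flow described above, using the Kafka 0.10 direct stream API; the broker, topic, and consumer-group names are hypothetical, and the HBase persistence step is only indicated in a comment because it depends on the connector used.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LearnerEventStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LearnerEventStream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical broker list, topic, and consumer group.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "learner-model",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("learner-events"), kafkaParams)
    )

    stream
      .map(record => record.value) // raw message payload
      .filter(_.nonEmpty)          // simple on-the-fly cleansing
      .foreachRDD { rdd =>
        // Each micro-batch would be transformed further and persisted
        // into HBase here via the HBase client or a Spark-HBase connector.
        rdd.take(5).foreach(println)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```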
Environment: Hadoop YARN, Spark, Scala, Kafka, Hive, Sqoop, Tableau, Oozie, Cloudera, Oracle 12c, MySQL, Linux.
Confidential - Cleveland, Ohio
Spark/Hadoop Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Assisted in loading large sets of structured, semi-structured, and unstructured data into HDFS.
- Imported data from various RDBMS systems into HDFS using Sqoop (and exported it back), including scheduled Sqoop jobs.
- Prepared an ETL pipeline using Sqoop and Hive to regularly bring in data from the source and make it available for consumption.
- Created Hive tables, loaded data, and wrote Hive queries to analyze user request patterns; implemented performance optimizations including partitioning and bucketing in Hive.
- Wrote Spark scripts in Scala, including Spark shell commands, as per requirements.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed Spark applications, constructing UDFs for DataFrames, Spark SQL, and Datasets, for data aggregation and querying, writing the results into HDFS in the desired file format (see the sketch after this list).
- Developed POCs to compare the performance of Spark and SQL using HDFS as storage.
- Analyzed large datasets to determine the optimal way to aggregate and report on them.
- Created HBase tables to store data coming from different portfolios.
- Used Hive to read HBase data and apply CRUD operations to it.
- Executed Oozie workflows to run Hive jobs.
- Used Tableau for visualization by connecting to Hive and generating reports.
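A minimal sketch of the kind of DataFrame aggregation and HDFS write described above; the Hive table, columns, and output path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PortfolioDailySummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PortfolioDailySummary")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical Hive source table of portfolio transactions.
    val txns = spark.table("portfolio.transactions")

    // Aggregate per portfolio and day (the kind of query previously written in HiveQL).
    val daily = txns
      .groupBy(col("portfolio_id"), to_date(col("txn_ts")).as("txn_date"))
      .agg(sum("amount").as("total_amount"), count(lit(1)).as("txn_count"))

    // Write the result back to HDFS in the desired format, partitioned by day.
    daily.write
      .mode("overwrite")
      .partitionBy("txn_date")
      .parquet("hdfs:///warehouse/portfolio/daily_summary")

    spark.stop()
  }
}
```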
Environment: Hadoop YARN, Spark, Scala, Hive, Sqoop, HBase, Oozie, Tableau, Cloudera, Oracle 12c, Linux, Shell Scripts.
Confidential
Hadoop Developer
Responsibilities:
- Involved in creating, partitioning, and bucketing tables in Hive.
- Created Hive tables and worked with them using HiveQL.
- Worked with NoSQL (HBase) to support enterprise production, loading data into HBase using Hive and Sqoop.
- Responsible for managing data coming from different sources.
- Wrote multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Handled importing of data from various data sources, performed transformations using Hive and Pig, and loaded data into HDFS.
- Experience with Agile development methodologies.
- Experience importing and exporting data to HDFS and Hive using Sqoop.
- Used Avro and Parquet file formats for data serialization.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Implemented partitioning, dynamic partitions, and buckets in Hive (see the sketch after this list).
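A minimal sketch of Hive dynamic partitioning, expressed here through Spark's Hive support for consistency with the other examples (the original work used Hive directly); the database, table, and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveDynamicPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveDynamicPartitions")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Hypothetical table partitioned by load date.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS analytics.user_events (
        |  user_id BIGINT,
        |  event_type STRING,
        |  event_ts TIMESTAMP
        |)
        |PARTITIONED BY (load_date STRING)
        |STORED AS ORC""".stripMargin)

    // Each row lands in the partition derived from its own load_date value.
    spark.sql(
      """INSERT OVERWRITE TABLE analytics.user_events PARTITION (load_date)
        |SELECT user_id, event_type, event_ts,
        |       date_format(event_ts, 'yyyy-MM-dd') AS load_date
        |FROM staging.user_events_raw""".stripMargin)

    spark.stop()
  }
}
```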
Environment: Hadoop YARN, Hive, Sqoop, HBase, Cloudera, Oracle 11g, Linux.
Confidential
SQL Developer & Hadoop Trainee and Intern
- Created tables, sub-queries, correlated sub-queries, joins, views, indexes, SQL functions, set operators, cursors, and triggers, and applied RDBMS concepts such as normalization.
- Defined constraints, rules, and defaults to ensure data integrity and relational integrity.
- Performed data normalization.
- Maintained referential, domain, and column integrity using available options such as constraints.
- Learned Tableau as a reporting tool: generating reports, creating charts, filtering data, joins, custom groups, sets, custom measures, sorting, and blending.
- Learned Big Data technologies: Hadoop, Sqoop, Hive, Pig, and MapReduce.
- Studied the Hadoop ecosystem and HDFS architecture.
- Built Sqoop jobs to import data from various sources.
- Learned Pig Latin for data processing and HiveQL for writing queries on top of data.
- Worked with different file formats for data input and output.