Sr. Data Engineer Resume
Plano, TX
PROFESSIONAL SUMMARY:
- Proactive IT developer with 9 years of experience in Java/J2EE technologies and in designing and developing scalable systems using Hadoop technologies across various environments.
- Experience in installing, configuring, supporting and managing Hadoop clusters using Hortonworks and Cloudera (CDH3, CDH4) distributions on Amazon Web Services (AWS).
- Strong understanding of Hadoop architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, NameNode, DataNode and the HDFS framework.
- Extensive experience in analyzing data using Hadoop ecosystem components including HDFS, Hive, Pig, Sqoop, Flume, MapReduce, Spark, Kafka, HBase, Oozie, Solr and ZooKeeper.
- Extensive knowledge of NoSQL databases like HBase, Cassandra and MongoDB.
- Configured ZooKeeper, Cassandra and Flume on the existing Hadoop cluster.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
- Experience in converting Hive queries into Spark transformations using Spark RDDs and Scala (see the sketch after this list).
- Hands on Experience in troubleshooting errors in HBase Shell, Pig, Hive and MapReduce.
- Hands-on experience in provisioning and managing multi-tenant Cassandra clusters in public cloud environments: Amazon Web Services (AWS) EC2 and OpenStack.
- Experience with NoSQL column-oriented databases like HBase and Cassandra and their integration with Hadoop clusters.
- Experience in maintaining big data platforms using open source technologies such as Spark and Elasticsearch.
- Designed and built solutions for real-time data ingestion using Kafka, Storm, Spark Streaming and various NoSQL databases.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing data back into RDBMS through Sqoop.
- Experience in understanding the security requirements for Hadoop and integrating with Kerberos authentication and authorization infrastructure.
- Good hands-on experience in creating RDDs and DataFrames for the required input data and performing data transformations using Spark with Scala.
- Knowledge of developing a NiFi flow prototype for data ingestion into HDFS.
- Extensive experience working with Oracle, DB2, SQL Server, PL/SQL and MySQL databases and core Java concepts like OOP, multithreading, collections and I/O.
- Experience in service-oriented architecture using web services such as SOAP and REST.
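A minimal sketch of the Hive-to-Spark conversion described above, assuming a hypothetical sales.orders Hive table with customer_id, amount and order_date columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-query-to-spark")
      .enableHiveSupport()
      .getOrCreate()

    // Original HiveQL:
    //   SELECT customer_id, SUM(amount) AS total
    //   FROM sales.orders
    //   WHERE order_date >= '2017-01-01'
    //   GROUP BY customer_id;

    // Equivalent Spark transformations on a DataFrame backed by the Hive table.
    val totals = spark.table("sales.orders")
      .filter(col("order_date") >= "2017-01-01")
      .groupBy("customer_id")
      .agg(sum("amount").as("total"))

    totals.show(20)
  }
}
```

Expressing the query as DataFrame transformations lets Spark's Catalyst optimizer plan the scan and aggregation instead of running it through the Hive execution engine.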
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Hive, YARN, Pig, Sqoop, Kafka, Storm, Flume, Oozie, ZooKeeper, Apache Spark, Apache Tez, Impala, NiFi, Apache Solr, RabbitMQ, Scala
NoSQL Databases: HBase, Cassandra, MongoDB
Programming Languages: C, C++, Java, J2EE, PL/SQL, Pig Latin, Scala, Python
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL, RMI, JMS, JavaScript, JSP, Servlets, EJB, JSF, jQuery, AngularJS
Frameworks: MVC, Struts, Spring, Hibernate
Version control: SVN, CVS
Business Intelligence Tools: Tableau, QlikView, Pentaho, IBM Cognos intelligence
Databases: Oracle 9i/10g/11g, DB2, SQL Server, MySQL, Teradata
Tools and IDEs: Eclipse, NetBeans, Toad, Maven, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DbVisualizer, IntelliJ
Cloud Technologies: Amazon Web Services (AWS), CDH3, CDH4, CDH5, Hortonworks, Mahout, Microsoft Azure Insight, Amazon Redshift
PROFESSIONAL EXPERIENCE:
Confidential, Plano, TX
Sr. Data Engineer
Responsibilities:
- Experienced in handling large datasets using partitioning, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Enhanced and optimized production Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Used Spark Streaming APIs to perform the required transformations and actions on the learner data model, which gets its data from Kafka in near real time.
- Worked on migrating MapReduce programs into Spark transformations, and used File Broker to schedule workflows that run Spark jobs to transform data on a recurring schedule.
- Experience developing and deploying shell scripts for automation, notification and monitoring.
- Extensively used Apache Kafka, Apache Spark, HDFS and Apache Impala to build near real-time data pipelines that ingest, transform, store and analyze clickstream data to provide a more personalized user experience (see the sketch after this list).
- Worked on performance tuning of Spark applications.
- Worked with Apache Spark SQL and DataFrame functions to perform data transformations and aggregations on complex semi-structured data.
- Hands-on experience in creating RDDs and applying transformations and actions while implementing Spark applications.
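A minimal sketch of that kind of near real-time clickstream pipeline, assuming hypothetical broker addresses, topic name, JSON field names and HDFS paths (Structured Streaming is used here for brevity):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClickStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-near-real-time")
      .getOrCreate()
    import spark.implicits._

    // Read clickstream events from Kafka (broker and topic names are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "clickstream-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Parse the JSON payload and keep only the fields needed downstream.
    val parsed = events.select(
      get_json_object($"json", "$.userId").as("user_id"),
      get_json_object($"json", "$.page").as("page"),
      get_json_object($"json", "$.ts").cast("timestamp").as("event_time"))

    // Persist the transformed events to HDFS as Parquet for later analysis.
    val query = parsed.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/clickstream/parsed")
      .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
      .start()

    query.awaitTermination()
  }
}
```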
Environment: Hadoop, HDFS, Hive, Spark, AWS EC2, S3, Kafka, YARN, Shell Scripting, Scala, Agile methods, Linux, MySQL, Teradata
Confidential, Bellevue, WA
Sr. Big Data Developer
Responsibilities:
- Developed various Spark applications using Scala to perform enrichment of clickstream data merged with user profile data.
- Utilized Spark SQL for event enrichment and to prepare various levels of user behavior summaries.
- Worked on an SQS queue receiver using the Spark Streaming context to consume data from the queue and integrated it with ETL functions.
- Streamed data in real time using Spark with SQS; responsible for handling streaming data from web server console logs.
- Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance for HiveQL queries (see the sketch after this list).
- Worked on migrating data from traditional RDBMS to HDFS .
- Used Scala to convert Hive / SQL queries into RDD transformations in Apache Spark .
- Wrote Spark programs in Scala for data quality checks.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Used Amazon EMR for big data processing on a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.
- Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
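A minimal sketch of the partitioning approach, issued through Spark SQL with Hive support; the table and column names are placeholders, and bucketing would add a CLUSTERED BY ... INTO n BUCKETS clause to the same DDL:

```scala
import org.apache.spark.sql.SparkSession

object HiveTableLayout {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark issue the same DDL/DML a Hive client would.
    val spark = SparkSession.builder()
      .appName("hive-partitioning-demo")
      .enableHiveSupport()
      .getOrCreate()

    // Partition by event date so HiveQL queries that filter on a date range
    // only scan the matching partition directories.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS clicks_by_day (
        |  user_id STRING,
        |  page    STRING,
        |  ts      TIMESTAMP)
        |PARTITIONED BY (event_date STRING)
        |STORED AS ORC""".stripMargin)

    // Dynamic partition insert from a staging table (name is a placeholder).
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT OVERWRITE TABLE clicks_by_day PARTITION (event_date)
        |SELECT user_id, page, ts, CAST(to_date(ts) AS STRING) AS event_date
        |FROM clicks_staging""".stripMargin)
  }
}
```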
Environment: Hadoop, HDFS, Hive, Spark, AWS EC2, S3, Kafka, YARN, Shell Scripting, Scala, Pig, Oozie, Java, Agile methods, Linux, MySQL, Elasticsearch, Kibana, Teradata.
Confidential, Austin, TX
Hadoop Developer
Responsibilities:
- Developed Spark applications using Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Handled importing of data from various data sources, performed data control checks using Spark and loaded data into HDFS .
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Used Spark SQL to load JSON data, create SchemaRDDs and load them into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
- Implemented Spark applications in Scala utilizing DataFrames and the Spark SQL API for faster data processing.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS .
- Worked on a streaming pipeline that uses Spark to read data from Kafka, transform it and write it to HDFS.
- Analyzed weblog data using HiveQL, integrated Oozie with the rest of the Hadoop stack, and utilized cluster coordination services through ZooKeeper.
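A minimal sketch of the JSON-to-Hive flow, assuming a hypothetical S3 path and target table; the bullet above mentions Spark 1.6 (the SchemaRDD/HiveContext era), but the newer SparkSession API is shown for brevity:

```scala
import org.apache.spark.sql.SparkSession

object JsonToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Spark infers the schema from the JSON documents (path is a placeholder).
    val events = spark.read.json("s3a://my-bucket/raw/events/")

    // Register the data as a Hive table so it can be queried with HiveQL
    // (the analytics database is assumed to exist).
    events.write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("analytics.events")

    // Structured queries can now run directly against the Hive table.
    spark.sql("SELECT count(*) FROM analytics.events").show()
  }
}
```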
Environment: Scala, Spark, Spark SQL, Spark Streaming, Azkaban, Presto, Hive, Apache Crunch, Elasticsearch, Git repository, Amazon S3, Amazon AWS EC2/EMR, Spark cluster, Hadoop framework, Sqoop, DB2.
Confidential, Glendale, CA
Data Engineer
Responsibilities:
- Developed optimal strategies for distributing web log data over the cluster, and imported and exported the stored web log data into HDFS and Hive using Sqoop.
- Designed and developed an ELT data pipeline using a Spark application to fetch data from legacy systems, third-party APIs and social media sites.
- Developed custom mappers in Python scripts and Hive UDFs and UDAFs based on the given requirements.
- Designed and developed the DMA (Disney Movies Anywhere) dashboard for the BI analyst team.
- Performed data analytics and loaded data to Amazon S3, the data lake and the Spark cluster.
- Involved in querying data using Spark SQL on top of the Spark engine.
- Developed Spark scripts using Python shell commands as per requirements.
- Wrote Pig and Hive scripts with UDFs in MapReduce and Python to perform ETL on AWS cloud services.
- Worked with text, Avro, Parquet and sequence file formats (see the sketch after this list).
- Involved in migrating HiveQL into Impala to minimize query response time.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HQL.
- Defined job flows using the Azkaban scheduler to automate Hadoop jobs and installed ZooKeeper for automatic node failover.
- Performed Tableau type conversion functions when connected to relational data sources.
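A minimal sketch of the format conversion implied above, assuming hypothetical S3 paths and a tab-delimited raw layout: raw text is read with an explicit schema and written back as date-partitioned Parquet for downstream Hive/Impala queries.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object WeblogFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-format-conversion").getOrCreate()

    // Tab-delimited raw web log lines landed in S3 (bucket/paths are placeholders).
    val schema = StructType(Seq(
      StructField("user_id", StringType),
      StructField("url", StringType),
      StructField("ts", TimestampType)))

    val raw = spark.read
      .option("sep", "\t")
      .schema(schema)
      .csv("s3a://my-bucket/raw/weblogs/")

    // Derive the partition column and write back as date-partitioned Parquet,
    // the layout the downstream Hive/Impala queries read.
    raw.withColumn("event_date", to_date(col("ts")))
      .write
      .mode("append")
      .partitionBy("event_date")
      .parquet("s3a://my-bucket/curated/weblogs/")
  }
}
```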
Environment: Java (JDK 1.6 and higher), Azkaban, Spark SQL, Presto, Hive, Apache Crunch, Elasticsearch, Spring Boot, Eclipse, Git repository, Amazon S3, Amazon AWS EC2/EMR, Spark cluster, Hadoop framework, Sqoop.
Confidential, San Francisco, CA
Hadoop Developer
Responsibilities:
- Involved in managing nodes on the Hadoop cluster and monitoring Hadoop cluster job performance using Cloudera Manager.
- Involved in loading data from edge node to HDFS using shell scripting.
- Created MapReduce programs to handle semi-structured and unstructured data such as XML, JSON, Avro data files and sequence files for log files.
- Developed Spark scripts using Python shell commands as per requirements.
- Integrated Elasticsearch and implemented dynamic faceted search.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Designed and developed Pig Latin scripts and Pig command-line transformations for data joins and custom processing of MapReduce outputs.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark written in Scala.
- Implemented Spark RDD transformations to map business analysis requirements and applied actions on top of those transformations (see the sketch after this list).
- Used Maven to build and deploy the JARs for MapReduce, Pig and Hive UDFs.
- Reviewed basic SQL queries and edited inner, left, and right joins in Tableau Desktop by connecting live/dynamic and static datasets.
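A minimal RDD-level sketch of that transformation/action pattern, assuming a hypothetical HDFS staging path and a simple space-delimited log format:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogLevelCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-level-counts"))

    // Raw application log lines staged in HDFS by Flume (path is a placeholder).
    val lines = sc.textFile("hdfs:///data/staging/app-logs/")

    // Transformations only build the lineage; nothing executes until an action runs.
    val counts = lines
      .filter(_.nonEmpty)
      .map(_.split(" ", 3))            // e.g. "2017-01-01 ERROR message..."
      .filter(_.length == 3)
      .map(fields => (fields(1), 1L))  // key by log level
      .reduceByKey(_ + _)

    // The action triggers execution and pulls the small result to the driver.
    counts.collect().foreach { case (level, n) => println(s"$level\t$n") }

    sc.stop()
  }
}
```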
Environment: Hadoop, Scala, MapReduce, HDFS, Spark, Kafka, AWS, Apache Solr, Hive, Cassandra, Maven, Jenkins, Pig, UNIX, Python, MRUnit, Git.
Confidential, Mountain View, CA
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop .
- Worked on joining raw data with existing datasets using Pig scripting.
- Implemented DataStax Enterprise Search with Apache Solr .
- Created Java operators to process data using DAG streams and load data into HDFS.
- Designed, configured, implemented and monitored the Kafka cluster and connectors.
- Developed ETL jobs using Spark with Scala to migrate data from Oracle to new Hive tables (see the sketch after this list).
- Developed and Deployed applications using Apache Spark, Scala.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Helped in troubleshooting Scala problems while working with MicroStrategy to produce illustrative reports and dashboards along with ad-hoc analysis.
- Developed Hive queries for the analysts and wrote scripts using Scala.
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Worked in continuous integration environments following Scrum and Agile methodologies.
- Extracted data from Teradata into HDFS using Sqoop.
- Managed real-time data processing and data ingestion into HBase and Hive using Storm.
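A minimal sketch of such an Oracle-to-Hive migration job, assuming hypothetical connection details, a SALES.ORDERS source table, and the Oracle JDBC driver on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object OracleToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Pull the source table over JDBC; URL, table and credentials are placeholders.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
      .option("dbtable", "SALES.ORDERS")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("ORACLE_PWD", ""))
      .option("fetchsize", "10000")
      .load()

    // Land the data in a new Hive table as Parquet (target database is assumed).
    orders.write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("stage.orders")
  }
}
```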
Environment: Hadoop, HDFS, Pig, Hive, Oozie, HBase, Kafka, Apache Solr, MapReduce, Sqoop, Storm, Spark, Scala, Linux, Cloudera, Maven, Jenkins, Java, SQL.
Confidential, Tampa, Florida
Java/Hadoop Developer
Responsibilities:
- Exported data from DB2 to HDFS using Sqoop and developed MapReduce jobs using the Java API.
- Used Spring AOP to implement distributed declarative transactions throughout the application.
- Designed and developed Java batch programs in Spring Batch.
- Installed and configured Pig and wrote Pig Latin scripts .
- Created and maintained Technical documentation for launching Cloudera Hadoop Clusters and for executing Hive queries and Pig Scripts.
- Developed workflows using Oozie for running MapReduce jobs and Hive queries.
- Involved in loading data from the UNIX file system to HDFS (see the sketch after this list).
- Created Java operators to process data using DAG streams and load data into HDFS.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Involved in developing monitoring and performance metrics for Hadoop clusters.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
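A minimal sketch of the UNIX-file-system-to-HDFS load using the Hadoop FileSystem API (written in Scala for consistency with the other sketches; both paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LocalToHdfs {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Source and target paths stand in for the actual staging layout.
    val local = new Path("file:///data/staging/extract.csv")
    val hdfs  = new Path("/user/etl/incoming/extract.csv")

    // Copy the file from the local (UNIX) file system into HDFS,
    // overwriting any existing target and keeping the local copy in place.
    fs.copyFromLocalFile(false, true, local, hdfs)
    fs.close()
  }
}
```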
Environment: Hadoop, HDFS, Hive, Flume, Sqoop, HBase, Pig, Eclipse, Spark, MySQL, Ubuntu, ZooKeeper, Maven, Jenkins, Java (JDK 1.6), Oracle 10g.