StreamSets Developer / Data Engineer Resume
Scottsdale, AZ
SUMMARY
- About 8 years of professional IT experience, including 4+ years in Big Data ecosystem technologies such as Hadoop HDFS, MapReduce, Cloudera, Apache Pig, Hive, Sqoop, HBase, Flume, and Oozie, and 1.5 years implementing and developing ETL pipelines through scripts in StreamSets.
- Experience in development and maintenance of web-based and client/server applications utilizing Java, J2EE, Spring, Hibernate, JSP, Servlets, JDBC, JSON, JNDI, HTML, JavaScript, SQL, and PL/SQL.
- Good understanding of Hadoop architecture and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and Secondary NameNode.
- Strong understanding of NoSQL databases such as Cassandra, HBase, and MongoDB, as well as the Snowflake data warehouse.
- Extensive experience in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Good knowledge of loading data from Oracle and MySQL databases into HDFS using Sqoop (structured data) and Flume (log files and XML).
- Knowledge on analyzing data interactively using Apache Spark and Apache Zeppelin.
- Experience in Scrum, Agile and Waterfall models.
- Expertise in Core Java, including OOP, Collections, Functional Interfaces, Lambda Expressions, the Java Stream API, Exception Handling, Annotations, Multithreading, and Generics.
- Experience in developing Single Page Application (SPA) using VueJS, AngularJS and Angular.
- Good working knowledge of core Java concepts; experience working with Flume to load log data from multiple sources directly into HDFS.
- Proficient in Java, Scala and Python.
- Expertise in Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
- Involved in the design and development of various web and enterprise applications using technologies such as JSP, Servlets, Struts, Hibernate, Spring, JDBC, JSF, XML, JavaScript, HTML, AJAX, SOAP, and Amazon Web Services.
- Good experience in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
- Experience in understanding Hadoop security requirements and integrating with the Kerberos Key Distribution Center.
- Experience with development environments such as Eclipse and RAD.
- Expertise in unit testing, integration testing, and system testing, with experience preparing test cases, test scenarios, and test plans.
- Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization.
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, StreamSets, Hive, MapReduce, Cassandra, Pig, HCatalog, Phoenix, Falcon, Sqoop, Flume, ZooKeeper, Mahout, Kafka, Oozie, Avro, HBase, Storm, CDH 5.3, ALM, TOAD, JIRA, Selenium, TestNG, Impala, YARN, Apache NiFi
NoSQL Databases: HBase, MongoDB, Cassandra
Languages: C, Python, Java, Pig Latin, Scala, HiveQL, Perl, Unix shell scripts
Frameworks: Struts, Spring, Spring XD, Hibernate
Operating Systems: Ubuntu Linux, Windows XP/Vista/7/10, MAC OS
Web Technologies: HTML, CSS, JavaScript (ES6, ES5), VueJS, Angular2, AngularJS, TypeScript, JSP, XML, jQuery, Vuetify and Bootstrap.
Web/Application servers: Apache Tomcat, WebLogic, WebSphere
Databases: Oracle, MySQL, PL/SQL, PostgreSQL
Tools and IDEs: Eclipse, Anaconda, Spyder
Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP
Development Methodologies: Agile, Scrum, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, SCOTTSDALE, AZ
STREAMSETS DEVELOPER / DATA ENGINEER
Responsibilities:
- Managed critical data pipelines that power analytics for various business units.
- Developed various pipelines in StreamSets according to the requirements of the business owners.
- Used Python, JSON, and Groovy scripting extensively to deploy StreamSets pipelines to the server.
- Converted hundreds of Teradata-syntax macros to Greenplum/PostgreSQL-syntax functions using SQL and PL/pgSQL.
- Architected and built pipeline solutions to integrate data from multiple heterogeneous systems using StreamSets Data Collector and Azure.
- Integrated multiple years of data from legacy devices, files, and databases such as DB2, SQL Server, Oracle, Teradata, and MySQL, as well as JSON sources.
- Worked with Kafka to integrate data from multiple topics into databases (a Kafka-to-JDBC sketch follows this list); managed RESTful APIs and integrated them with StreamSets to move data.
- Facilitated data audit, integrity, and governance for all aspects of post-movement data; managed and monitored data assets.
- Worked on data extraction, transformation, and loading from Oracle to Teradata using BTEQ, FastLoad, and MultiLoad.
- Used the Teradata FastLoad/MultiLoad utilities to load data into tables.
- Used Teradata SQL Assistant to build the SQL queries.
- Led the onsite and offshore global development, configuration, unit testing, integration testing, code reviews, and debugging of complex issues involving multiple teams, and provided application support.
- Integrated data from the Cloudera Big Data stack: Hadoop, Hive, HBase, and MongoDB; built StreamSets pipelines to accommodate change.
- Adhered to compliance, audit processes, and various regulations while implementing projects and services for ETL processes.
- Responsible for sending quality data through secure channels to downstream systems using role-based access control and StreamSets.
- Worked with CI/CD workflows and Git to manage code.
- Automated testing and execution of data workflow pipelines using command-line utilities and other tools, including Python.
- Responsible for installing, configuring, supporting, and managing Hadoop clusters.
- Worked on performance tuning of Hive SQL queries.
- Created external Hive tables with proper partitions for efficiency and loaded the structured data in HDFS produced by MapReduce jobs (see the table sketch after this list).
- Monitored all MapReduce read jobs running on the cluster using Cloudera Manager and ensured that they were able to read the data from HDFS without any issues.
- Involved in moving all log files generated from various sources to HDFS for further processing.
- Involved in collecting metrics for Hadoop clusters using Ganglia.
- Worked on a Kerberized Hadoop cluster with 250 nodes.
- Created Hive tables and loaded data from the local file system into HDFS.
- Responsible for deploying patches and remediating vulnerabilities.
- Experience in setting up Test, QA, and Prod environments.
- Involved in loading data from UNIX file system to HDFS.
- Drove root cause analysis (RCA) efforts for high-severity incidents.
- Worked hands-on with the ETL process: handled importing data from various data sources and performed transformations.
- Provided updates in the daily Scrum, performed self-planning at the start of each sprint, and tracked planned tasks in JIRA; synced with the team to pick priority tasks and updated the necessary documentation in the wiki.
- Held weekly meetings with business partners and actively participated in review sessions with other developers and the manager.
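As an illustration of the partitioned external Hive tables mentioned above, the following is a minimal sketch issued through the HiveServer2 JDBC driver; the connection URL, credentials, table, columns, and HDFS paths are hypothetical placeholders, not the actual project objects.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionedTableSketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needed on older setups without JDBC 4 auto-loading).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

            // External table: dropping it leaves the underlying HDFS files in place.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS web_logs ("
                + " user_id STRING, url STRING, status INT)"
                + " PARTITIONED BY (load_date STRING)"
                + " STORED AS PARQUET"
                + " LOCATION '/data/mr_output/web_logs'");

            // Register one day's MapReduce output directory as a partition.
            stmt.execute(
                "ALTER TABLE web_logs ADD IF NOT EXISTS PARTITION (load_date='2016-01-15')"
                + " LOCATION '/data/mr_output/web_logs/2016-01-15'");
        }
    }
}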
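Similarly, the Kafka-to-database integration can be pictured with this minimal sketch using a recent Kafka Java client and JDBC; the broker, topics, connection string, and staging table are hypothetical, and the production flow ran through StreamSets pipelines rather than hand-written consumers.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicsToDatabaseSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "db-sink");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit only after rows are persisted

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://db-host:5432/warehouse", "etl_user", "secret");
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO raw_events (topic, event_key, payload) VALUES (?, ?, ?)")) {

            // Consume several source topics and land every record in one staging table.
            consumer.subscribe(Arrays.asList("orders", "payments", "shipments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.topic());
                    insert.setString(2, record.key());
                    insert.setString(3, record.value());
                    insert.executeUpdate();
                }
                consumer.commitSync(); // acknowledge the batch once it is in the database
            }
        }
    }
}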
Environment: Python, HBase, HDFS, Greenplum, Hive, Jira, Chef, Confluence, Cucumber, Teradata, StreamSets, StreamSets SDK for Python, AWS Kinesis, Confluent Kafka, Avro, Parquet, Eclipse, Spark, Bamboo, Stash, Microsoft SQL Server, CDH, Sqoop, JSON, Git, Jenkins, Docker, Maven, Oracle 12c, SBT.
Confidential, Richmond, VA
DATA ENGINEER / HADOOP DEVELOPER
Responsibilities:
- Gained hands-on experience with Cloudera and Kafka-Storm on the HDP 2.2 platform for real-time analysis.
- Created a PoC to store server log data in MongoDB to identify system alert metrics.
- Implemented a Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/results to the UI team.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Implemented Spring Boot microservices to process messages into the Kafka cluster.
- Worked closely with the Kafka admin team to set up Kafka clusters in the QA and production environments.
- Knowledgeable in Kibana and Elasticsearch for identifying Kafka message failure scenarios.
- Implemented reprocessing of failed messages in Kafka using offset IDs (see the replay sketch after this list).
- Implemented Kafka producer and consumer applications on the Kafka cluster with the help of ZooKeeper.
- Used Spring Kafka API calls to process messages smoothly on the Kafka cluster.
- Used the Spark Java API to generate PairRDDs (a PairRDD sketch follows this list).
- Knowledgeable in partitioning Kafka messages and setting up replication factors in the Kafka cluster.
- Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
- Performed analysis on the unused user navigation data by loading it into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front-end developers and the Lucent team.
- Wrote MapReduce jobs using the Java API and Pig Latin (a mapper/reducer sketch follows this list).
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Wrote Pig scripts to run ETL jobs on the data in HDFS and to perform further testing.
- Used Hive to do analysis on the data and identify different correlations.
- Used Sqoop to import data from MySQL into HDFS and Hive on a regular basis.
- Wrote Hive queries for data analysis to meet business requirements.
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
- Involved in using Oozie to define and schedule Apache Hadoop jobs as directed acyclic graphs (DAGs) of actions with control flows.
- Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.
- Automated regular data imports into Hive partitions using Sqoop, scheduled with Apache Oozie.
- Supported MapReduce programs running on the cluster.
- Held weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers.
- Used QlikView and D3 for visualization of query results required by the BI team.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
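As a rough illustration of the offset-based reprocessing mentioned above, the sketch below replays messages from a recorded offset using the Kafka Java client (2.x-style API); the broker, topic, partition, and offset value are hypothetical.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffsetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "replay-failed");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        long failedOffset = 42_000L; // offset id captured when the original processing failed
        TopicPartition partition = new TopicPartition("events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition)); // manual assignment, no group rebalance
            consumer.seek(partition, failedOffset);                // rewind to the failed message
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("replaying offset %d: %s%n", record.offset(), record.value());
            }
        }
    }
}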
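The PairRDD work is the kind of thing sketched below with Spark's Java API, assuming Java 8 lambdas and a hypothetical tab-delimited navigation log; the paths and field layout are placeholders, not the actual project data.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PairRddSketch {
    public static void main(String[] args) {
        // Local master for illustration; a real job would be spark-submitted to the cluster.
        SparkConf conf = new SparkConf().setAppName("navigation-pairs").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Tab-delimited navigation log: userId \t page \t timestamp (hypothetical layout).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/navigation/logs");

        // PairRDD keyed by user id, counting navigation events per user.
        JavaPairRDD<String, Integer> eventsPerUser = lines
                .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1))
                .reduceByKey(Integer::sum);

        eventsPerUser.saveAsTextFile("hdfs:///data/navigation/events_per_user");
        sc.stop();
    }
}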
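Finally, the MapReduce jobs written against the Java API followed roughly this shape; the sketch counts page hits from the same hypothetical navigation log, and the class, field, and path names are illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitCount {

    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {          // second field assumed to hold the page URL
                page.set(fields[1]);
                context.write(page, ONE);
            }
        }
    }

    public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-hit-count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(HitReducer.class);   // combiner reuses the reducer logic
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/navigation/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/navigation/page_hits"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}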
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Flume, ZooKeeper, Cloudera Manager, Oozie, Java (JDK 1.6), MySQL, SQL, Windows NT, Linux