Big Data Engineer Resume
Chicago, IL
SUMMARY
- 8+ years of experience with emphasis on Big Data technologies and the development and design of Java-based enterprise applications.
- 4+ years of experience developing applications that perform large-scale distributed data processing using Big Data ecosystem tools such as Hadoop, Pig, Sqoop, HBase, Cassandra, Spark, Kafka, Oozie, Zookeeper, Puppet, YARN, and Avro.
- Experience querying Snowflake, Redshift, and Oracle for OLTP and OLAP workloads.
- Experience with Hortonworks and Cloudera Hadoop environments.
- Expertise in database performance tuning and data modeling.
- Experienced in securing Hadoop clusters with Kerberos and integrating them with LDAP/AD at the enterprise level.
- Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
- Hands-on experience in Apache Spark creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames (see the PySpark sketch after this summary).
- Experience in data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS and performed real-time analytics on the incoming data.
- Good understanding of Scrum methodologies, Test Driven Development and Continuous integration.
- Experience working with Java/J2EE, JDBC, ODBC, JSP, Eclipse, JavaBeans, EJB, and Servlets.
- Expert in developing web page interfaces using JSP, Java Swing, and HTML.
- Excellent understanding of JavaBeans and the Hibernate framework for implementing model logic that interacts with relational databases.
- Experience using IDEs such as Eclipse, NetBeans, PyCharm, Spyder, and Visual Studio Code, and build tools such as Maven.
- Experience using Alteryx to automate data pipeline workflows.
- Hands-on experience with GCP: BigQuery, GCS, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, and Dataproc.
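A minimal PySpark sketch of the RDD-to-DataFrame workflow referenced above. This is illustrative only: the column names and sample records are hypothetical and not drawn from any project data.

```python
# Minimal PySpark sketch: create an RDD, apply a transformation and an
# action, then convert the RDD to a DataFrame. Sample data is hypothetical.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("rdd-to-dataframe-example").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a small in-memory collection.
events = sc.parallelize([
    ("user_1", "click", 3),
    ("user_2", "view", 1),
    ("user_1", "view", 2),
])

# Transformation: keep only click events; Action: count them.
clicks = events.filter(lambda rec: rec[1] == "click")
print("click events:", clicks.count())

# Convert the RDD to a DataFrame by mapping each tuple to a Row.
df = events.map(lambda rec: Row(user=rec[0], action=rec[1], amount=rec[2])).toDF()
df.groupBy("action").count().show()

spark.stop()
```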
TECHNICAL SKILLS
Big Data and Hadoop: HDFS, MapReduce, Spark, YARN, Hive, Pig, Scala, Kafka, Flume, Tez, Impala, Solr, Oozie, Zookeeper, Apache Airflow, Databricks, Alteryx
Hadoop Distribution: Hortonworks, Cloudera, EMR
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud Computing Tools: Amazon AWS, GCP
Languages: Java/J2EE, Python, SQL, PL/SQL, Pig Latin, HiveQL, UNIX Shell Scripting, Bash
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB, Tomcat
Application Servers: WebLogic, WebSphere, JBoss, Tomcat
Databases: Oracle, MySQL, SQL Server, Snowflake, Teradata, Redshift
Operating Systems: UNIX, Windows, Linux, CentOS
Build Tools: Jenkins, Maven, ANT, Terraform, Gradle
Development methodologies: Agile/Scrum, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Chicago IL
Big Data Engineer
Responsibilities:
- Worked on the architecture, design, and implementation of high-performance, large-volume data integration processes, databases, storage, and other back-end services in fully virtualized environments, and stored the processed data in AWS S3.
- Administered, optimized, troubleshot, monitored, supported, and managed Hadoop clusters and their jobs.
- Performed ETL using AWS Glue, PySpark and SQL.
- Used AWS Athena to query data directly from AWS S3.
- Built a system to analyze column names from all tables and identify columns containing personal information in Oracle, Teradata, and other on-premises RDBMS databases as well as S3.
- Built a data dictionary and optimized SQL queries for the data pipeline.
- Built data pipelines in Airflow on GCP for ETL jobs using various Airflow operators (see the Airflow sketch at the end of this section).
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Loaded BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities.
- Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the Dockerized application from AWS to GCP.
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
- Created Snowpipes for continuous data loading.
- Redesigned the Views in Snowflake to increase the performance
- Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
- Worked on partitioning of Kafka messages and setting up replication factors in the Kafka cluster.
- Worked on various data normalization jobs for new data ingested into Redshift
- Created AWS Data Pipeline jobs to perform transformations on S3 data and load it into Redshift and/or S3.
- Created a new process flow using Redshift SQL scripts to extract, transform, and load data from different sources into the Amazon Redshift database.
- Developed scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and Teradata
- Integrated Spark Streaming with Kafka to load data into an HDFS location.
- Good experience in writing Spark applications using Python.
- Automated Alteryx workflow using excel files for the finance team.
- Built a pipeline using Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS (a minimal sketch follows this list).
- Extensively worked on Spark with Python, creating DataFrames using the Spark SQLContext for faster data processing.
- Used Spark with Python to create DataFrames on client request data for computing analytics, and saved the results as text files for analytical report generation.
- Created an analytical report in Tableau Desktop on the number of client requests received per day.
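A hedged sketch of the Kafka-to-HDFS streaming pipeline described above, written with Spark Structured Streaming in Python. The broker address, topic name, and HDFS paths are placeholders, and the job assumes the spark-sql-kafka connector is available on the Spark classpath.

```python
# Illustrative Kafka -> Spark -> HDFS stream. Broker, topic, and paths are
# placeholders, not values from the project.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-stream").getOrCreate()

# Read a stream of records from a Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "client_requests")
       .option("startingOffsets", "latest")
       .load())

# Kafka keys/values arrive as bytes; cast to strings before writing out.
messages = raw.select(col("key").cast("string"), col("value").cast("string"))

# Continuously append the stream to HDFS, with a checkpoint for recovery.
query = (messages.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/client_requests")
         .option("checkpointLocation", "hdfs:///checkpoints/client_requests")
         .outputMode("append")
         .start())

query.awaitTermination()
```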
Environment: Linux, AWS RDS, S3, Lambda, EC2, Python (NumPy, Pandas), Hadoop, Kafka, Spark (Python), Hive, Pig, HTML, Tableau, Jira, Confluence, AWS Glue, Athena, Alteryx, Teradata, GCP Cloud Functions, BigQuery, Dataflow, Cloud Storage, GKE, Snowflake, Redshift.
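A hedged sketch of the kind of Airflow ETL pipeline on GCP referenced in the section above. The project, bucket, dataset, and table names are placeholders, and the DAG assumes the apache-airflow-providers-google package is installed.

```python
# Illustrative Airflow DAG: load files from GCS into a BigQuery staging table,
# then run a transform query. All resource names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="gcs_to_bigquery_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the day's files from GCS into a staging table.
    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="example-landing-bucket",
        source_objects=["exports/{{ ds }}/*.csv"],
        destination_project_dataset_table="example_project.staging.daily_export",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform the staging table into a reporting table with SQL.
    transform = BigQueryInsertJobOperator(
        task_id="transform_to_reporting",
        configuration={
            "query": {
                "query": """
                    SELECT client_id, COUNT(*) AS request_count
                    FROM `example_project.staging.daily_export`
                    GROUP BY client_id
                """,
                "destinationTable": {
                    "projectId": "example_project",
                    "datasetId": "reporting",
                    "tableId": "daily_request_counts",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_staging >> transform
```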
Confidential, Radnor, Pennsylvania
Big Data Developer
Responsibilities:
- Hands-on experience with the AWS platform and its features, including EC2, VPC, EBS, AMI, RDS, CloudWatch, CloudFormation, Auto Scaling, CloudFront, IAM, and S3.
- Experience creating custom VPCs, subnets, EC2 instances, ELBs, and security groups.
- Ensured optimum efficiencies for the utilization of cloud services
- Used different AWS and partner services such as S3, Databricks, and Snowflake to monitor the campaign and resolve issues.
- Improved existing database applications in Snowflake.
- Worked on creating a Data Navigator portal to provide an overview of data loads and data quality using Python and Snowpipe, improving the efficiency of analysis.
- Documented the types and structure of the business data required for the project in Snowflake.
- Migrated the on-premises database structure to the Confidential Redshift data warehouse.
- Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used a JSON schema to define table and column mappings from S3 data to Redshift (see the COPY sketch at the end of this section).
- Used Expectation-Maximization, K-Means clustering, MFCC, and confusion matrix algorithms.
- Implemented Kafka/Spark Streaming pipelines for data ingestion using StreamSets Data Collector (SDC).
- Worked with the Cloudera Distribution of Hadoop (CDH) and its MapReduce cluster.
- Implemented a data interface to get customer information using a RESTful API.
- Imported data from different sources such as S3 and the local file system into Spark RDDs.
- Responsible for analyzing large datasets and deriving customer usage patterns by developing MapReduce programs in Python (a Hadoop Streaming sketch follows this list).
- Experienced in implementing static and dynamic partitioning in Hive.
- Performed batch operations on each batch of records using the Spark Evaluator.
- Worked on analyzing the Hadoop cluster and different big data analytics tools, including Pig, Hive, and Sqoop.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
- Extensively used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports, and created Sqoop jobs keyed on the last saved value.
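A hedged sketch of a Python MapReduce (Hadoop Streaming) job in the spirit of the usage-pattern analysis above. The record layout and field positions are hypothetical; in practice the mapper and reducer would be passed to the hadoop-streaming jar as separate scripts.

```python
# Illustrative Hadoop Streaming job: count events per customer from
# tab-delimited log records. Field positions are assumed, not project-specific.
import sys


def mapper(lines):
    """Emit (customer_id, 1) for every well-formed record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:            # assume customer_id is the 2nd column
            print(f"{fields[1]}\t1")


def reducer(lines):
    """Sum counts per customer; input arrives sorted by key."""
    current_key, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")


if __name__ == "__main__":
    # Local test: python usage.py map < logs.tsv | sort | python usage.py reduce
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```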
Environment: Kafka, Spark, Elasticsearch, Hadoop, MapReduce, Python, Spark Evaluator, CDH, Sqoop, Hive, Tableau, SDC, AWS, RDBMS, scikit-learn, Linux, Databricks, machine learning algorithms, Flask, Ajax, Snowflake, Redshift.
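A hedged sketch of the S3-to-Redshift load with a JSON column mapping referenced in the section above. The cluster endpoint, credentials, bucket, table, and IAM role ARN are placeholders; the referenced JSONPaths file would map JSON attributes to table columns (for example, {"jsonpaths": ["$.Id", "$.Name"]}).

```python
# Illustrative Redshift COPY from S3 using a JSONPaths mapping file.
# Connection details and S3/IAM identifiers below are hypothetical.
import psycopg2

COPY_SQL = """
    COPY analytics.salesforce_accounts
    FROM 's3://example-bucket/salesforce/accounts/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
    FORMAT AS JSON 's3://example-bucket/config/accounts_jsonpaths.json'
    TIMEFORMAT 'auto'
    TRUNCATECOLUMNS;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)
try:
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)   # Redshift pulls the files directly from S3
finally:
    conn.close()
```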
Confidential, Salt Lake City, UT
Hadoop Developer
Responsibilities:
- Responsible for developing efficient MapReduce programs on the AWS cloud to process more than 20 years' worth of claim data and detect and separate fraudulent claims.
- Developed MapReduce programs of medium to complex scope from scratch.
- Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Played a key role in setting up a 40-node Hadoop cluster utilizing Apache MapReduce, working closely with the Hadoop administration team.
- Responsible for designing and managing the Sqoop jobs that uploaded the data from Oracle to HDFS and Hive.
- Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
- Used Flume to collect log data with error messages across the cluster.
- Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
- Played a key role in the installation and configuration of various Hadoop ecosystem tools such as Hive, Pig, and HBase.
- Successfully loaded files from Teradata into HDFS and loaded data from HDFS into Hive.
- Experience in using Zookeeper and Oozie for coordinating the cluster and scheduling workflows
- Installed the Oozie workflow engine and scheduled it to run data- and time-dependent Hive and Pig jobs.
- Designed and developed Dashboards for Analytical purposes using Tableau.
- Analyzed Hadoop log files using Pig scripts to track errors.
- Provided upper management with daily updates on project progress, including the classification levels in the data.
Environment: Java, Hadoop, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Teradata.
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured the Hadoop cluster.
- Worked on a Hortonworks cluster, which provides an open-source platform based on Apache Hadoop for analyzing, storing, and managing big data.
- Worked with analysts to determine and understand business requirements.
- Loaded and transformed large data sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
- Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest customer data and financial histories into HDFS for analysis
- Used MapReduce and Flume to load, aggregate, store and analyze web log data from different web servers
- Created MapReduce programs to handle semi-structured and unstructured data such as XML, JSON, and Avro data files, as well as sequence files for log files.
- Involved in submitting and tracking MapReduce jobs using the JobTracker.
- Experience writing Pig Latin scripts for data cleansing, ETL operations, and query optimization of existing scripts.
- Wrote Hive UDFs to sort struct fields and return complex data types.
- Created Hive tables from JSON data using data serialization frameworks such as Avro.
- Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources.
- Integrated Hive tables with HBase to perform row-level analytics.
- Developed Oozie workflows for daily incremental loads that Sqoop data from Teradata and Netezza and import it into Hive tables.
- Involved in performance tuning using different execution engines such as Tez.
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files
- Implemented daily cron jobs that automated parallel tasks for loading data into HDFS using AutoSys and Oozie coordinator jobs.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using the MRUnit testing library.
- Supported the operations team in Hadoop cluster maintenance activities, including commissioning and decommissioning nodes and performing upgrades.
- Provided technical solutions and assistance to all development projects.
Environment: Hortonworks, Java, Hadoop, HDFS, MapReduce, Tez, Hive, Pig, Oozie, Sqoop, Flume, Teradata, Netezza, Tableau
Confidential
Java Developer
Responsibilities:
- Involved in designing the Project Structure, System Design and every phase in the project.
- Responsible for developing platform related logic and resource classes, controller classes to access the domain and service classes.
- Developed UI using HTML, JavaScript, and JSP, and developed Business Logic and Interfacing components using Business Objects, XML, and JDBC.
- Designed the user interface and implemented validation checks using JavaScript.
- Managed connectivity using JDBC for querying/inserting & data management including triggers and stored procedures.
- Involved in Technical Discussions, Design, and Workflow.
- Participated in requirements gathering and analysis.
- Developed Unit Testing cases using JUnit Framework.
- Implemented the data access using Hibernate and wrote the domain classes to generate the Database Tables.
- Involved in the design of JSPs and Servlets for navigation among the modules.
- Designed cascading style sheets and the XML portions of the Order Entry and Product Search modules, and implemented client-side validations with JavaScript.
- Involved in implementation of view pages based on XML attributes using normal Java classes.
- Involved in integration of App Builder and UI modules with the platform.
Environment: Hibernate, Java, JAXB, JUnit, XML, UML, Oracle 11g, Eclipse, Windows XP.