
Senior Big Data Engineer Resume

Chesterfield, MO

SUMMARY:

  • 7+ years of expertise in Big Data using the Hadoop framework, spanning analysis, design, development, testing, documentation, deployment and integration with SQL and Big Data technologies.
  • Expertise in major Hadoop ecosystem components such as HDFS, YARN, Map Reduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper and Hue.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.
  • Developed data set processes for data modeling and data mining; recommended ways to improve data reliability, efficiency and quality.
  • Hands-on experience with other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB and Route 53.
  • Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS and Azure Blob storage.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Implemented a log producer in Scala that watches application logs, transforms incremental log entries and sends them to a Kafka and Zookeeper based log collection platform.
  • Good knowledge of NoSQL databases such as HBase, Cassandra and MongoDB.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of Map Reduce and Spark processing frameworks.
  • Created and maintained various shell and Python scripts to automate processes; optimized Map Reduce code and Pig scripts and performed performance tuning and analysis.
  • Good working experience on Spark (spark streaming, spark SQL) with Scala and Kafka. Worked on reading multiple data formats on HDFS using Scala.
  • Expert in AngularJS; worked on AngularJS features such as two-way binding, custom directives, controllers, filters, services and project architecture, and on ReactJS features such as components, lifecycle methods and unidirectional data flow using the Flux architecture.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
  • Hands-on experience in installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, Zookeeper, Oozie, Hive, Sqoop and Pig.
  • Worked on Spark SQL, created DataFrames by loading data from Hive tables, prepared the data and stored it in AWS S3 (see the sketch after this list).
  • Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
  • Installed Hadoop, Map Reduce and HDFS on AWS and developed multiple Map Reduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Extensive experience with Hadoop ecosystem components such as Map Reduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper and Flume.
  • Hands-on use of Spark and Scala API's to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala.
  • Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
  • Extensive usage of Azure Portal, Azure PowerShell, Storage Accounts, Certificates and Azure Data Management.
  • Wrote Map Reduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data and used Spark DataFrame operations to perform required validations on the data.
  • Extensive knowledge of RDBMS such as Oracle, Microsoft SQL Server and MySQL.
  • Extensive experience working on various databases and database script development using SQL and PL/SQL.
  • Utilized SQOOP, Kafka, Flume and Hadoop File system APIs for implementing data ingestion pipelines
  • Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
  • Cloudera certified developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop and Map Reduce.
  • Good understanding and knowledge of NoSQL databases such as MongoDB, HBase and Cassandra, as well as Azure data stores and PostgreSQL.
  • Experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
  • Excellent programming skills with experience in Java, C, SQL and Python Programming.
  • Worked in various programming languages using IDEs such as Eclipse, NetBeans and IntelliJ, along with tools such as PuTTY and Git.
  • Experienced in working within the SDLC using Agile and Waterfall methodologies.
  • Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
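
The Spark SQL work noted above followed a Hive-to-S3 prep-data pattern. Below is a minimal PySpark sketch of that flow, offered as an illustration only; the table name, columns and bucket path are placeholders, not objects from the actual projects.

```python
# Minimal PySpark sketch of the Hive-to-S3 prep-data flow described above.
# Table, column and bucket names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-s3-prep")
    .enableHiveSupport()          # allows reading managed/external Hive tables
    .getOrCreate()
)

# Load a Hive table into a DataFrame (hypothetical table name).
orders = spark.table("warehouse_db.orders")

# Example prep step: drop bad rows and add a derived column.
prep = (
    orders
    .filter(F.col("order_amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)

# Persist the prepared data to S3 as Parquet (hypothetical bucket/prefix;
# assumes the hadoop-aws connector is available).
prep.write.mode("overwrite").parquet("s3a://example-bucket/prep/orders/")
```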

TECHNICAL SKILLS:

Languages: SQL, PL/SQL, Python, Java, Scala, C, HTML, Unix/Linux shell scripting

Data Modeling Tools: ERwin, PowerDesigner, Embarcadero ER Studio, IBM Rational Software Architect, MS Visio, Star Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables

ETL Tools: Matillion (AWS Redshift), Alteryx, Informatica PowerCenter, Ab Initio

Big Data: HDFS, Map Reduce, Spark, Airflow, Yarn, NiFi, HBase, Hive, Pig, Flume, Sqoop, Kafka, Oozie, Hadoop, Zookeeper, Spark SQL.

Concepts and Methods: Business Intelligence, Data Warehousing, Data Modeling, Requirement Analysis

RDBMS: Oracle 9i/10g/11g/12c, Teradata, MySQL, MS SQL Server

NoSQL: MongoDB, HBase, Cassandra

Cloud Platform: Microsoft Azure, AWS (Amazon Web Services), Snowflake

Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss

Other Tools: Azure Databricks, Azure Data Explorer, Azure HDInsight, Power BI

Operating Systems: UNIX, Windows, Linux

PROFESSIONAL EXPERIENCE:

Confidential, Chesterfield, MO

Senior Big Data Engineer

Responsibilities:

  • Performed data manipulation on extracted data using Python Pandas.
  • Worked with subject matter experts and the project team to identify, define, collate, document and communicate the data migration requirements.
  • Built custom Tableau / SAP BusinessObjects dashboards for Salesforce that accept parameters from Salesforce and show the relevant data for the selected object.
  • Hands-on Ab Initio ETL, data mapping, transformation and loading in a complex, high-volume environment.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
  • Developed Map Reduce programs using Apache Hadoop to analyze big data as per requirements.
  • Used Zookeeper and Oozie operational services to coordinate clusters and schedule workflows.
  • Used HBase to store the majority of the data, which needs to be partitioned by region.
  • Created Hive, Pig, SQL and HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Developed a Python script to load CSV files into S3; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Experienced in building a data warehouse on the Azure platform using Azure Databricks and Data Factory.
  • Implemented data integration for large-scale system software with Hadoop ecosystem components such as HBase, Sqoop, Zookeeper, Oozie, Hive and Pig.
  • Designed and implemented Sqoop incremental and delta imports on tables without primary keys or date columns from Teradata and SAP HANA, appending directly into the Hive warehouse.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Used Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Wrote Map Reduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Involved in Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
  • Used Oozie to automate data processing and data loading into the Hadoop Distributed File System.
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
  • Develop best practice, processes, and standards for effectively carrying out data migration activities. Work across multiple functional projects to understand data usage and implications for data migration.
  • Prepare data migration plans including migration risk, milestones, quality and business sign-off details.
  • Used Sqoop to move data between HDFS and various RDBMS sources.
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules (see the first sketch after this list).
  • Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop and Spark jobs.
  • Implemented a Python codebase for branch management over Kafka features.
  • Developed PIG scripts to transform the raw data into intelligent data as specified by business users.
  • Worked on retrieving data from the file system to S3 using Spark commands.
  • Configured Zookeeper and worked on Hadoop High Availability with the Zookeeper failover controller, adding support for a scalable, fault-tolerant data solution.
  • Used HBase/Phoenix to support front end applications that retrieve data using row keys
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Developed Scala scripts using both Data frames/SQL and RDD/Map reduce in Spark for Data Aggregation, queries and writing data back into OLTP system through SQOOP.
  • Used OOZIE Operational Services for batch processing and scheduling workflows dynamically.
  • Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.
  • Developed spark code and spark-SQL/streaming for faster testing and processing of data.
  • Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store the results.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala (a PySpark rendering of this pattern follows this list).
  • Involved in debugging, monitoring and troubleshooting issues.
  • Developed highly scalable and reliable data engineering solutions for moving data efficiently across systems using WhereScape (ETL tool).
  • Analyzed data, identified anomalies, and provided usable insights to customers.
  • Ensured accuracy and integrity of data through analysis, testing and profiling using Ataccama.
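
The Databricks transformation work above combined PySpark with Spark SQL. The sketch below shows that shape under assumed inputs: the sample data, rule thresholds and output path are illustrative, not the production business rules.

```python
# Minimal sketch of a PySpark + Spark SQL business-rule transformation of the
# kind run in Azure Databricks; data, thresholds and paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("business-rules").getOrCreate()

# The real source would be an ADLS path or a Databricks table; a tiny sample here.
claims = spark.createDataFrame(
    [("C1", 1200.0, "NEW"), ("C2", 80.0, "NEW"), ("C3", 500.0, "REOPENED")],
    ["claim_id", "amount", "status"],
)
claims.createOrReplaceTempView("claims")

# Express the business rule in Spark SQL (illustrative thresholds).
scored = spark.sql("""
    SELECT claim_id,
           amount,
           status,
           CASE
               WHEN status = 'REOPENED' OR amount > 1000 THEN 'MANUAL_REVIEW'
               ELSE 'AUTO_APPROVE'
           END AS routing
    FROM claims
""")

# On Databricks this would typically land in a Delta table; Parquet keeps the
# sketch portable.
scored.write.mode("overwrite").parquet("/tmp/claims_scored")
```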
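
The streaming ingestion above was written in Scala; the sketch below renders the same Kafka-to-HDFS pattern in PySpark. Broker addresses, the topic name and the paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# PySpark sketch of consuming a Kafka topic and persisting the stream to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
    .option("subscribe", "app-logs")                                  # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string before storing.
decoded = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("value"),
    "timestamp",
)

query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/app-logs")          # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/app-logs")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```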

Environment: Python, Hadoop, Azure, Databricks, Data Factory, Data Lake, Data Storage, Teradata, Unix, DB2, PL/SQL, MS SQL, Ab Initio ETL, Data Mapping, Spark, Tableau, Nebula Metadata, SQL Server, Scala, Git.

Confidential, Fremont, CA

Big Data Engineer

Responsibilities:

  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and csv file datasets into data frames using PySpark.
  • Researched and downloaded jars for Spark-avro programming.
  • Involved in converting Map Reduce programs into Spark transformations using Spark RDD's using Scala and Python.
  • Configured Zookeeper to coordinate the servers in clusters and to maintain data consistency, which is important for decision-making in the process.
  • Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams to HBase.
  • Strong experience using HDFS, Map reduce, Hive, Spark, Sqoop, Oozie, and HBase.
  • Developed a PySpark program that writes dataframes to HDFS as avro files.
  • Responsible for developing a data pipeline using Spark, Scala and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
  • Involved in designing and deploying a Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
  • Imported documents into HDFS and HBase and created HAR files.
  • Proposed an automated system using shell scripts to run the Sqoop jobs.
  • Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
  • Built Big Data analytical framework for processing healthcare data for medical research using Python, Java, Hadoop, Hive and Pig. Integrated R scripts with Map reduce jobs.
  • Experienced with Scala and Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, pair RDDs and Spark on YARN.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Transferred streaming data from different data sources into HDFS and NoSQL databases using Apache Flume; handled cluster coordination through Zookeeper.
  • Involved in creating HiveQL on HBase tables and importing efficient work order data into Hive tables
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate dataframes from the Avro-formatted raw layer and writes them to data-service-layer internal tables in ORC format.
  • Good knowledge of Amazon AWS concepts such as the EMR and EC2 web services, which provide fast and efficient processing of big data.
  • Experienced in troubleshooting errors in the HBase shell/API, Pig, Hive and Map Reduce.
  • Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Migrated Map reduce jobs to Spark jobs to achieve better performance.
  • Played a key role installing and configuring various Big Data ecosystem tools such as Elastic Search, Logstash, Kibana, Kafka, and Cassandra.
  • In charge of PySpark code, creating dataframes from tables in data service layer and writing them to a Hive data warehouse.
  • Experienced on Hadoop/Hive on AWS, using both EMR and non-EMR-Hadoop in EC2.
  • Good exposure to Map Reduce programming using Java, Pig Latin scripting, distributed applications and HDFS.
  • Configured the settings that allow Airflow to communicate with its PostgreSQL database.
  • Developed Airflow DAGs in Python by importing the Airflow libraries (see the sketch after this list).
  • Performed data cleaning, feature scaling and feature engineering using the pandas and NumPy packages in Python, and built models using deep learning frameworks.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java
  • Developed Spark application that uses Kafka Consumer and Broker libraries to connect to Apache Kafka and consume data from the topics and ingest them into Cassandra.
  • Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions
  • Decommissioned and added nodes in the clusters for maintenance.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Added new users and user groups as per requests from the client.
  • Worked on tickets opened by users regarding various incidents and requests.
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as Map Reduce jobs.
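
The Airflow work above scheduled ingestion and transformation steps as DAGs. Below is a minimal, assumption-laden sketch of such a DAG using Airflow 2-style imports; the DAG id, schedule and the scripts invoked are hypothetical.

```python
# Minimal Airflow DAG sketch; dag_id, schedule and commands are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingest_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    # Land the day's files (hypothetical extraction script).
    extract = BashOperator(
        task_id="extract",
        bash_command="python /opt/jobs/extract_to_hdfs.py --date {{ ds }}",
    )

    # Run the Spark transformation on the landed data (hypothetical job).
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit /opt/jobs/transform.py --date {{ ds }}",
    )

    extract >> transform
```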

Environment: Spark, AWS, Redshift, EC2, Lambda, S3, CloudWatch, CloudFormation, IAM, Auto Scaling, Security Groups, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Cloudera, Oracle 11g/10g, PL/SQL, Unix.

Confidential, Fountain Valley, CA

Big Data Engineer

Responsibilities:

  • Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Used Spark and Hive to implement the transformations needed to join the daily ingested data to historic data.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
  • Developed a Python script to call REST APIs and extract data to AWS S3.
  • Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase and HDFS.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Scheduled nightly batch jobs using Oozie to perform schema validation and IVP transformations at larger scale, taking advantage of the power of Hadoop.
  • Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts, UDFs using both Data frames/SQL/Data sets and RDD in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Used Oozie and Zookeeper operational services for coordinating cluster and scheduling workflows.
  • Developed data pipeline using Spark, Hive, Pig and HBase to ingest customer behavioral data and financial histories into Hadoop cluster for analysis.
  • Setup and benchmarked Hadoop/HBase clusters for internal use.
  • Developed Map Reduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Created Sqoop scripts to import/export user profile data from RDBMS to S3 data lake.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, Data Frames and pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Experience in job workflow scheduling and monitoring tools such as Oozie, and good knowledge of Zookeeper for coordinating the servers in clusters and maintaining data consistency.
  • Experience with open-source Kafka, Zookeeper and Kafka Connect.
  • Developed customized UDFs in Python to extend Hive and Pig Latin functionality (see the sketch after this list).
  • Loaded data into HBase using both bulk and non-bulk loads.
  • Strong experience in core Java, Scala, SQL, PL/SQL and RESTful web services.
  • Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
  • Used Zookeeper to provide coordination services to the cluster. Experienced in managing and reviewing Hadoop log files.
  • Developed Sqoop jobs for performing incremental loads from RDBMS into HDFS and further applied Spark transformations
  • Worked on designing the Map Reduce and YARN flow, writing Map Reduce scripts, performance tuning and debugging.
  • Involved in HBase setup and in storing data into HBase, which is used for further analysis.
  • Worked in AWS environment for development and deployment of Custom Hadoop Applications.
  • Assisted the operations support team with transactional data loads by developing SQL*Loader and Unix scripts.
  • Implemented Python script to call the Cassandra Rest API, performed transformations and loaded the data into Hive.
  • Extensively worked in Python and built the custom ingest framework.
  • Developed Map Reduce jobs for data cleanup in Python and C#.
  • Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
  • Created Cassandra tables to store various formats of data coming from different sources.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
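
The Python UDF work above extended Hive and Pig; one common way to do this is Hive's TRANSFORM streaming interface, where Hive pipes tab-separated rows through a script on stdin/stdout. The sketch below assumes a hypothetical two-column input (user id, amount) and an illustrative bucketing rule.

```python
#!/usr/bin/env python
# Sketch of a Python "UDF" used via Hive's TRANSFORM streaming interface.
# Hive sends rows as tab-separated text on stdin and reads the same format back.
# The column layout and bucketing rule are illustrative only.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue                      # skip malformed rows
    user_id, raw_amount = fields[0], fields[1]
    try:
        amount = float(raw_amount)
    except ValueError:
        amount = 0.0                  # default bad numerics to zero
    bucket = "HIGH" if amount > 1000 else "LOW"
    print("\t".join([user_id, str(amount), bucket]))
```

In Hive such a script would typically be shipped with ADD FILE and invoked through a SELECT TRANSFORM(...) USING 'python <script>' AS (...) clause over the source table; the exact query depends on the table in question.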

Environment: Hadoop YARN, Spark, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop, Impala, Tableau, Talend, Oozie, Control-M, Java, AWS S3, Oracle, Linux.

Confidential

Hadoop Developer/ Data Engineer

Responsibilities:

  • Worked on development of data ingestion pipelines using the Talend ETL tool and bash scripting with big data technologies including but not limited to Hive, Impala, Spark and Kafka.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Supported Map Reduce programs running on the cluster. Involved in loading data from the UNIX file system to HDFS.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Worked on integrating Hive-HBase tables; a search engine application was built on top of the HBase tables.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Wrote Map Reduce job using Pig Latin. Involved in ETL, Data Integration and Migration.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib (see the sketch after this list).
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.
  • Developed Scala scripts using both DataFrames/SQL and RDD/Map Reduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
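
The PySpark/MLlib modeling above can be summarized with the sketch below: a tiny in-memory sample stands in for the real ingested data, and the schema, features and classifier choice are assumptions for illustration, not the actual models built.

```python
# Minimal PySpark MLlib sketch: assemble features and fit a classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny in-memory sample standing in for the real ingested dataset.
df = spark.createDataFrame(
    [(34.0, 2.0, 1.0), (12.0, 0.0, 0.0), (56.0, 5.0, 1.0), (8.0, 1.0, 0.0)],
    ["monthly_spend", "num_orders", "label"],
)

# Combine the raw columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["monthly_spend", "num_orders"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

# Fit the pipeline and inspect predictions on the training sample.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("features", "label", "prediction").show()
```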

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Confidential

Data Engineer

Responsibilities:

  • Involved in requirements analysis and the design of an object-oriented domain model.
  • Involved in detailed documentation and in writing functional specifications for the module.
  • Built reports and report models using SSRS to enable end user report builder usage.
  • Created Excel charts and pivot tables for the Ad-hoc Data pull.
  • Created Column Store indexes on dimension and fact tables in the OLTP database to enhance read operation.
  • Built a GUI that prompts the user to enter personal information, charity items to donate, and delivery options.
  • Developed a fully functioning C# program that connects to SQL Server Management Studio and integrates information users enter with pre-existing information in the database.
  • Implemented SQL functions to receive user information from front-end C# GUIs and store it in the database.
  • Deployed Web, presentation, and business components on Apache Tomcat Application Server.
  • Developed PL/SQL procedures for different use case scenarios
  • Apache ANT was used for the entire build process.
  • Collected and aggregated large amounts of log data and staged the data in HDFS for further analysis (see the sketch after this list).
  • Experience in managing and reviewing Hadoop Log files.
  • Implemented an inverse nibble substitution function that executes nibble substitution in reverse order.
  • Worked on report writing using SQL Server Reporting Services (SSRS) and in creating various types of reports like table, matrix, and chart report, web reporting by customizing URL Access.
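
The log staging noted above moved local log files into HDFS for analysis. One plausible shape for that step, sketched under assumptions (placeholder paths, and the hdfs CLI available on the host), is a small Python wrapper around the hdfs command:

```python
# Sketch of staging local log files into a dated HDFS directory via the hdfs CLI.
# Paths are placeholders; assumes the hdfs client is installed and configured.
import glob
import subprocess
from datetime import date

LOCAL_GLOB = "/var/log/app/*.log"                     # hypothetical source files
HDFS_DIR = f"/data/raw/logs/{date.today():%Y-%m-%d}"  # hypothetical target directory

# Create the target directory (no error if it already exists).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)

for path in glob.glob(LOCAL_GLOB):
    # -f overwrites a partial copy left by a previous failed run.
    subprocess.run(["hdfs", "dfs", "-put", "-f", path, HDFS_DIR], check=True)
    print(f"staged {path} -> {HDFS_DIR}")
```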

Environment: Hadoop, SQL, SSRS, SSIS, OLTP, PL/SQL, Oracle 9i, Log4j, ANT, Clear-case, Windows.
