
Senior Big Data Engineer Resume


Columbus, OH

SUMMARY

  • 8+ years of experience in data engineering, data pipeline design, development, and implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Experience with other Hadoop ecosystem tools such as ZooKeeper, Oozie, and Impala.
  • Built and maintained environments on Azure IaaS and PaaS.
  • Experience developing Kafka producers and consumers that stream millions of events per second (a sketch follows this list).
  • Hands-on experience installing and configuring Cloudera Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop, and Pig.
  • Installed Hadoop, MapReduce, and HDFS on AWS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to support scalable, optimized analytics for business users.
  • Hands-on experience in developing web applications and RESTful web services and APIs using Python, Flask and Django.
  • Experienced in big data integration and analytics based on Hadoop, PySpark, and NoSQL databases such as HBase and MongoDB.
  • Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables. Good knowledge of streaming applications using Apache Kafka.
  • Worked with relational SQL and NoSQL stores, including Oracle, Hive, and HBase, using Sqoop for data transfer.
  • Designed and executed Oozie workflows that scheduled Sqoop and Hive actions to extract, transform, and load data.
  • Migrated databases to Azure SQL and performed the associated performance tuning.
  • Experienced with Hadoop/Hive on AWS, using both EMR and self-managed Hadoop on EC2.
  • Cloudera Certified Developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop, and MapReduce.
  • Generated JSON scripts and wrote UNIX shell scripts to invoke Sqoop import/export jobs.
  • Exploratory Data Analysis and Data wrangling with R and Python.
  • Involved in designing and deploying Hadoop clusters and big data analytics tools, including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra, on the Hortonworks distribution.
  • Good experience in Python software development (libraries: Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, pandas DataFrames, network, urllib2, and MySQLdb for database connectivity).
  • Experience developing MapReduce programs on Apache Hadoop to analyze big data per requirements.
  • Used Bash and Python scripting to automate ETL and schedule ETL jobs in Oozie.
  • Used AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (a boto3 sketch follows this list).
  • Hands-on ETL experience ingesting data from RDBMS sources (SQL Server, MySQL, DB2) into Hive and HDFS using Sqoop.
  • Wrote MapReduce code to parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Good knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.
  • Experienced in implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Good experience with the Oozie framework and automating daily import jobs.
  • Implemented continuous integration/continuous delivery best practices using Azure DevOps, ensuring code versioning.
  • Worked with real-time data processing and streaming techniques using Spark streaming and Kafka
  • Pipeline development skills with Apache Airflow, Kafka, and NiFi
  • Extensive use of open-source languages: Perl, Python, and Scala.
  • Migrated projects from Cloudera Hadoop Hive storage to Azure Data Lake Store to satisfy Confidential's digital transformation strategy.
  • Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
  • Used the Spark DataFrame API in Scala for data analysis.
  • Implemented data access jobs using Pig, Hive, Tez, Solr, Accumulo, HBase, and Storm.
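
The Kafka producer/consumer work noted above can be pictured with a minimal kafka-python sketch; the broker address, topic name, and event payload below are hypothetical placeholders, not a production configuration.

    # Minimal Kafka producer/consumer sketch using the kafka-python client.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKERS = ["localhost:9092"]   # hypothetical broker list
    TOPIC = "app-events"           # hypothetical topic

    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=5,               # small batching window for throughput
    )
    producer.send(TOPIC, {"event": "page_view", "user_id": 42})
    producer.flush()

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="event-readers",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.offset, message.value)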
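
The EMR/S3/CloudWatch bullet above (see the boto3 sketch reference) is illustrated below; the cluster ID, bucket name, prefix, and metric choice are assumptions for illustration only.

    # Illustrative boto3 checks for an EMR cluster, S3 job output, and a CloudWatch metric.
    from datetime import datetime, timedelta
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    s3 = boto3.client("s3", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Check the state of a running EMR cluster (placeholder cluster ID).
    cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")
    print(cluster["Cluster"]["Status"]["State"])

    # List job output files written to S3 (placeholder bucket/prefix).
    objects = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="spark/output/")
    for obj in objects.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Pull the IsIdle metric for the cluster over the last hour.
    metric = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    print(metric["Datapoints"])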

TECHNICAL SKILLS

Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server

Languages: Perl, MATLAB, JSON, Scala, Python (NumPy, SciPy, pandas, Gensim, Keras), Shell Scripting

NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Web Design Tools: HTML, CSS, JSP, jQuery, XML

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL*Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos

Databases: Microsoft SQL Server, MySQL, Oracle, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Columbus, OH

Senior Big Data Engineer

Responsibilities:

  • Worked on implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Imported data from sources such as HDFS and HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data in HDFS.
  • Created YAML files for each data source, including Glue table stack creation.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Experienced in configuring and administering Hadoop clusters on major distributions such as Apache Hadoop and Cloudera.
  • Involved in developing Python APIs to dump array structures in the processor at the failure point for debugging. Using Chef, deployed and configured Elasticsearch, Logstash, and Kibana (ELK) for log analytics, full-text search, and application monitoring, integrated with AWS Lambda and CloudWatch.
  • Transformed date-related data into an application-compatible format by developing Apache Pig UDFs.
  • Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
  • Integrated Kafka with Spark Streaming for real time data processing.
  • Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
  • Proposed an automated system that uses shell scripts to run Sqoop jobs.
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
  • Documented requirements, including existing code to be implemented using Spark, Hive, HDFS, HBase, and Elasticsearch.
  • Transformed and aggregated data for analysis by implementing workflow management of Sqoop, Hive, and Pig scripts.
  • Involved in designing and deploying Hadoop clusters and big data analytics tools, including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra, on the Hortonworks distribution.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Designed and developed user interfaces and client displays using JavaScript and CSS; troubleshot various issues in Python code and fixed them with code enhancements, using Python libraries such as Pyjamas.
  • Used Apache NiFi to copy data from local file system to HDP.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
  • Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed APIs in Python with SQLAlchemy for ORM along with MongoDB, documented the APIs in Swagger, and deployed the application through Jenkins. Developed RESTful APIs using Python Flask and SQLAlchemy data models and ensured code quality by writing unit tests with pytest (a sketch follows this list).
  • Developed Kafka consumer API in Scala for consuming data from Kafka topics.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Developed a strategy for Full load and incremental load using Sqoop.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables and connected to Tableau for generating interactive reports using Hive server2.
  • Used Sqoop to move data between HDFS and various RDBMS sources.
  • Developed a Python script to call REST APIs and extract data to AWS S3 (a sketch follows this list).
  • Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Participating in the Distributed Computing Initiative using Hadoop, Hive & PIG implementations. The initiative was to build a Data Fabric platform within the organization to enable parallel computation and analysis of large files emerging from the trading desks.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as in NoSQL databases such as HBase and Cassandra (a PySpark sketch of the Kafka-to-HDFS flow follows this list).
  • Integrated Oozie with Hue and scheduled workflows for multiple Hive, Pig, and Spark jobs.
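
The Kafka-to-HDFS streaming flow referenced above was built in Scala; the sketch below is an illustrative PySpark analogue using Structured Streaming, with placeholder broker, topic, and HDFS paths.

    # Illustrative PySpark analogue: stream Kafka events to HDFS as Parquet.
    # Requires the spark-sql-kafka connector package on the Spark classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "app-events")                    # placeholder topic
        .option("startingOffsets", "latest")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/app_events")            # placeholder path
        .option("checkpointLocation", "hdfs:///checkpoints/app_events")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()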
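
The REST-API-to-S3 extraction bullet above can be sketched with requests and boto3; the endpoint URL, pagination scheme, bucket, and key layout are hypothetical.

    # Pull paginated records from a REST API and land them in S3 as JSON files.
    import json
    from datetime import datetime

    import boto3
    import requests

    API_URL = "https://api.example.com/v1/orders"   # placeholder endpoint
    BUCKET = "my-data-bucket"                        # placeholder bucket

    def extract_to_s3(page_size: int = 500) -> None:
        s3 = boto3.client("s3")
        page = 1
        while True:
            resp = requests.get(API_URL, params={"page": page, "size": page_size}, timeout=30)
            resp.raise_for_status()
            records = resp.json()
            if not records:
                break
            key = f"raw/orders/{datetime.utcnow():%Y/%m/%d}/page_{page}.json"
            s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
            page += 1

    if __name__ == "__main__":
        extract_to_s3()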
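
The Flask/SQLAlchemy API work above is illustrated with a minimal sketch, assuming Flask-SQLAlchemy with an in-memory SQLite database; the model, route, and test are illustrative, not the production code.

    # Minimal Flask + Flask-SQLAlchemy endpoint with a pytest-style test.
    from flask import Flask, jsonify
    from flask_sqlalchemy import SQLAlchemy

    app = Flask(__name__)
    app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///:memory:"
    db = SQLAlchemy(app)

    class Job(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        name = db.Column(db.String(80), nullable=False)
        status = db.Column(db.String(20), default="pending")

    @app.route("/jobs/<int:job_id>")
    def get_job(job_id):
        job = Job.query.get_or_404(job_id)
        return jsonify({"id": job.id, "name": job.name, "status": job.status})

    # pytest-style test using Flask's test client
    def test_get_job():
        with app.app_context():
            db.create_all()
            db.session.add(Job(id=1, name="daily_load"))
            db.session.commit()
        client = app.test_client()
        assert client.get("/jobs/1").get_json()["name"] == "daily_load"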

Environment: Big Data, Hadoop, AWS, Oracle, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka, SAS, SQL, MDM, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig, Sqoop, MS Access.

Confidential, MI

Big Data Engineer

Responsibilities:

  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Synchronized unstructured and structured data using Pig and Hive according to the business requirements.
  • Migrated several on-premises solutions to the Azure cloud, covering infrastructure and network cloud integration (IaaS).
  • Implemented a distributed messaging queue integrated with Cassandra using Apache Kafka and ZooKeeper.
  • Developed microservices by creating REST APIs and used them to access data from different suppliers and to gather network traffic data from servers. Wrote and executed MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.
  • Involved in developing REST web services for sending and receiving data from external interfaces in JSON format.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
  • Experience in Big Data Analytics and design in Hadoop ecosystem using Map Reduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
  • Built an Oozie pipeline that performs several actions: moving files, Sqooping data from the source Teradata or SQL systems, exporting it into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as in NoSQL databases such as HBase and Cassandra.
  • Used Azure Databricks and DataFrames for transformation, loaded Parquet and SQL tables, and exposed Hive external tables to data scientists (a sketch follows this list).
  • Experienced in troubleshooting errors in the HBase shell/API, Pig, Hive, and MapReduce.
  • Knowledge of job workflow scheduling and locking tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Involved in implementing and integrating NoSQL databases such as HBase and Cassandra.
  • Knowledge and experience in job work-flow scheduling and monitoring tools like Oozie and Zookeeper.
  • Implemented big data analytics and advanced data science techniques to identify trends, patterns, and discrepancies in petabytes of data using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Ran Apache Hadoop, CDH, and MapR distributions on Elastic MapReduce (EMR) on EC2.
  • Performed fork actions wherever there was scope for parallel processing to optimize data latency.
  • Implemented data ingestion and handled clusters for real-time processing using Apache Kafka.
  • Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs.
  • Utilized SQOOP, Kafka, Flume and Hadoop File system APIs for implementing data ingestion pipelines
  • Used the Spark DataFrame API in Scala for data analysis.
  • Setup and benchmarked Hadoop/HBase clusters for internal use.
  • Performed data analysis using regression, data cleaning, Excel VLOOKUP, histograms, and the Toad client, and presented the analysis with suggested solutions for investors.
  • Implemented Spark-Kafka streaming to pick up data from Kafka and feed it into the Spark pipeline.
  • Rapid model creation in Python using pandas, NumPy, scikit-learn, and Plotly for data visualization; these models were then implemented in SAS, where they interface with MS SQL databases and are scheduled to update on a regular basis (a sketch follows this list).
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Developed MapReduce programs for data analysis and data cleaning.
  • Worked on implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Excellent understanding of Hadoop architecture and the underlying framework, including storage management.
  • Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
  • Developed a Python script to run SQL queries in parallel for the initial load of data into the target table (a sketch follows this list). Involved in loading data from the edge node to HDFS using shell scripting and assisted in designing the overall ETL strategy.
  • Developed reusable objects such as PL/SQL program units and libraries, database procedures, functions, and triggers to be used by the team while satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
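
The Azure Databricks transformation bullet above is sketched in PySpark below; the mount points, database, table name, and schema are assumed placeholders.

    # Read raw data, write curated Parquet, and expose a Hive external table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    raw = spark.read.json("/mnt/raw/sales/")            # hypothetical mount point
    curated = (
        raw.withColumn("sale_date", to_date(col("sale_ts")))
           .filter(col("amount") > 0)
           .select("order_id", "customer_id", "amount", "sale_date")
    )

    curated.write.mode("overwrite").partitionBy("sale_date").parquet("/mnt/curated/sales/")

    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.sales (
            order_id STRING, customer_id STRING, amount DOUBLE
        )
        PARTITIONED BY (sale_date DATE)
        STORED AS PARQUET
        LOCATION '/mnt/curated/sales/'
    """)
    spark.sql("MSCK REPAIR TABLE analytics.sales")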
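
The parallel initial-load bullet above is sketched with concurrent.futures, assuming a DB-API style connect() factory with qmark ("?") placeholders (as in pyodbc); the table, key ranges, and connection details are hypothetical.

    # Run ranged SQL extracts in parallel for an initial load.
    from concurrent.futures import ThreadPoolExecutor

    PARTITIONS = [(0, 999_999), (1_000_000, 1_999_999), (2_000_000, 2_999_999)]

    QUERY = """
        SELECT * FROM source_schema.orders
        WHERE order_id BETWEEN ? AND ?
    """

    def load_partition(connect, bounds):
        """Run one ranged query and return the row count fetched."""
        low, high = bounds
        with connect() as conn:
            cursor = conn.cursor()
            cursor.execute(QUERY, (low, high))
            rows = cursor.fetchall()
            # ... write rows to the target table or a staging file here ...
            return len(rows)

    def run_initial_load(connect, workers=3):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            counts = list(pool.map(lambda b: load_partition(connect, b), PARTITIONS))
        return sum(counts)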
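
The rapid-modeling bullet above is sketched with pandas and scikit-learn; the input file, feature columns, and target are hypothetical.

    # Quick model sketch: load a CSV, split, fit, and score.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("claims_sample.csv")                 # hypothetical input
    features = ["age", "tenure_months", "monthly_spend"]  # hypothetical columns
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["churned"], test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))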

Environment: HDFS, Hive, Spark, Azure, Linux, Kafka, Python, Stonebranch, Cloudera, Oracle 11g/10g, PL/SQL, UNIX, JSON and Parquet file formats.

Confidential, New York, NY

Big Data Engineer

Responsibilities:

  • Involved in HBase setup and storing data into HBase for further analysis.
  • Application development using Hadoop ecosystem components such as Spark, Kafka, HDFS, Hive, Oozie, and Sqoop.
  • Experience developing customized UDFs in Python to extend Hive and Pig Latin functionality (a sketch follows this list).
  • Involved in Sqoop implementation, which helps load data from various RDBMS sources into Hadoop systems and vice versa.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
  • Created APIs, database models, and views in Python to build a responsive web application. Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that were the main source of data for both customers and the internal customer service team.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Designed and developed an automated process for data movement using shell scripting.
  • Developed a workflow in Oozie to automate the tasks of loading data into NiFi and pre-processing it with Pig.
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Performed data manipulation on extracted data using Python pandas (a sketch follows this list).
  • Created SAS ODS reports using SAS EG, SAS SQL, and OLAP Cubes.
  • Implemented Spark Scala code for data validation in Hive.
  • Designed and implemented big data ingestion pipelines to ingest multi-petabyte data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, storing results in efficient formats such as Parquet.
  • Implemented automated workflows for all jobs using Oozie and shell scripts.
  • Implemented a Python codebase for branch management over Kafka features.
  • Utilized Sqoop, Kafka, Flume, and Hadoop File System APIs for implementing data ingestion pipelines.
  • Worked successfully with output results from the Kafka server.
  • Worked with subject matter experts and the project team to identify, define, collate, document, and communicate data migration requirements.
  • Extensively worked with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and worked on optimizing Hive queries.
  • Designed Oozie workflows for job scheduling and batch processing.
  • In-depth knowledge of importing and exporting data from databases using Sqoop.
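
The Python UDF bullet above is sketched as a Hive TRANSFORM streaming script, which is one common way to extend Hive with Python: Hive streams tab-separated rows to stdin and reads transformed rows from stdout. The column layout and date format are assumptions.

    #!/usr/bin/env python
    # Normalize dates like 03/25/2019 to ISO format for a (user_id, raw_date) stream.
    import sys
    from datetime import datetime

    for line in sys.stdin:
        user_id, raw_date = line.rstrip("\n").split("\t")
        try:
            iso_date = datetime.strptime(raw_date, "%m/%d/%Y").strftime("%Y-%m-%d")
        except ValueError:
            iso_date = "\\N"   # Hive's NULL marker for unparseable dates
        print(user_id + "\t" + iso_date)

Such a script would typically be shipped with ADD FILE and invoked from Hive through a SELECT TRANSFORM(user_id, raw_date) USING 'python normalize_dates.py' AS (user_id, event_date) clause.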
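
The pandas data-manipulation bullet above is illustrated with a short sketch; the file name and columns are placeholders.

    # Clean extracted order data and aggregate monthly spend per customer.
    import pandas as pd

    orders = pd.read_csv("extracted_orders.csv", parse_dates=["order_ts"])

    cleaned = (
        orders.dropna(subset=["customer_id"])
              .assign(order_month=lambda d: d["order_ts"].dt.to_period("M"))
              .groupby(["customer_id", "order_month"], as_index=False)["amount"]
              .sum()
              .rename(columns={"amount": "monthly_spend"})
    )
    cleaned.to_parquet("monthly_spend.parquet", index=False)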

Environment: Hadoop, Python, Kafka, Spark, Sqoop, HDFS, MapReduce, Scala, MongoDB, HBase, Jenkins, AWS, Lambda, Talend, SAS, UNIX, Oracle

Confidential

Data Engineer

Responsibilities:

  • Developed reusable objects such as PL/SQL program units and libraries, database procedures, functions, and triggers to be used by the team while satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Utilized PyUnit, the Python unit testing framework, for all Python applications; rewrote an existing application as a Python module to deliver data in a specific format; and developed Python batch processors to consume and produce various feeds (a sketch follows this list).
  • Worked on developing data ingestion pipelines using the Talend ETL tool and Bash scripting with big data technologies including Hive, Impala, Spark, and Kafka.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Developed Spark scripts to import large files from Amazon S3 buckets.
  • Developed shell scripts for running Hive scripts in Hive and Impala.
  • Made extensive use of the Python Matplotlib package and Tableau to visualize and graphically analyze data; performed data pre-processing and split the identified dataset into training and test sets using other Python libraries.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
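
The PyUnit bullet above is illustrated with a minimal unittest sketch against a hypothetical feed-reformatting helper.

    # Minimal PyUnit (unittest) example for a batch-processing helper.
    import unittest

    def reformat_record(record: dict) -> str:
        """Flatten a feed record into a pipe-delimited line."""
        return "|".join(str(record[k]) for k in ("id", "symbol", "price"))

    class ReformatRecordTest(unittest.TestCase):
        def test_reformats_fields_in_order(self):
            record = {"id": 7, "symbol": "ABC", "price": 10.5}
            self.assertEqual(reformat_record(record), "7|ABC|10.5")

        def test_missing_field_raises(self):
            with self.assertRaises(KeyError):
                reformat_record({"id": 7})

    if __name__ == "__main__":
        unittest.main()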

Environment: Scala, HDFS, Python, AWS, YARN, Hive, Sqoop, Flume, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting.

Confidential

Data Engineer

Responsibilities:

  • Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-Daily Deal items to assess the possibility of increasing ROI.
  • Analyzed data and provided insights with R Programming and Python Pandas
  • Suggested improvements and modified existing BI components (reports, stored procedures).
  • Understood the business requirements thoroughly and developed a test strategy based on the business rules.
  • Prepared a test plan to ensure the QA and development phases ran in parallel.
  • Developed PL/SQL procedures for different use case scenarios
  • Involved in post-production support and testing; used JUnit for unit testing of the module.
  • Experienced in developing UNIX shell scripts and Perl scripts to execute scripts and manipulate files and directories.
  • Worked on predictive analytics use-cases using Python language.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on formats such as text and CSV files (a PySpark sketch follows this list).
  • Responsible for building scalable distributed data solutions using DataStax Cassandra.
  • Implemented the DAO layer using iBATIS and wrote queries for persisting demand core banking related information from the backend system using a query tool.
  • Used Ajax to communicate with the server to get the asynchronous response.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
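
The Spark SQL work on text/CSV files noted above was written in Scala; the sketch below is an illustrative PySpark analogue with a placeholder file path, schema, and query.

    # Read a CSV file, register a temp view, and run a Spark SQL aggregation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-analysis").getOrCreate()

    deals = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/data/input/daily_deals.csv")   # placeholder path
    )
    deals.createOrReplaceTempView("daily_deals")

    top_categories = spark.sql("""
        SELECT category, COUNT(*) AS listings, AVG(price) AS avg_price
        FROM daily_deals
        GROUP BY category
        ORDER BY listings DESC
        LIMIT 10
    """)
    top_categories.show()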

Environment: Python, SQL, PL/SQL, Oracle, Windows, UNIX, SOAP, Jasper Reports.
