Data Engineer Resume
Needham, MA
PROFESSIONAL SUMMARY:
- Skilled in databases, data management, analytics, data processing, data cleansing, data modeling, and data-driven projects, including Online Transaction Processing (OLTP).
- Skilled in big data system architecture, ETL pipelines, and real-time analytics systems, including machine learning algorithms, slicing and dicing OLAP cubes, and drilling into tabular models.
- Proficient in various distributions and platforms such as the Apache Hadoop ecosystem, Microsoft Azure, and Databricks Spark; working knowledge of AWS and Hortonworks/Cloudera.
- Expert in bucketing, partitioning, multi-threaded computing, and streaming (Python, PySpark).
- Expert in Performance Optimization and Query Tuning (MS SQL)
- Adept at project management methodologies such as Waterfall (Rational Rose) and Scrum/Agile (sprints, epics, stories), with good knowledge of SOLID patterns and working knowledge of technical reports and documentation.
- Proficient web developer using various frameworks, including Node.js, Express, Vue.js, and .NET Core.
SKILLS:
APACHE: Apache Ant, Apache Flume, Apache Hadoop, Apache Oozie, Apache Sqoop, HDFS, Apache YARN, Presto, Apache Hive, Apache HBase, Apache Kafka, Apache Spark, Apache Airflow, Apache ZooKeeper, Cassandra, Cloudera/Hortonworks, MapR, MapReduce (Python), Apache NiFi (ETL), Apache JMeter, NGINX
MICROSOFT / AWS: MS SQL, MS SSIS / DQS (Data Tools), SSAS Tabular, OLAP, Azure Synapse, Azure Cosmos DB Emulator, Azure Databricks, AWS Redshift, AWS EMR, AWS DynamoDB, AWS EC2
SCRIPTING: PySpark, Spark SQL, HiveQL, MapReduce (Python), MLlib, Python scikit-learn, R (RevoScaleR), SQL/DAX, Python Anaconda, Flask REST APIs, Node.js / Vue.js, .NET C# 4.5 / Core 3.0 Web Authentication API, XML/XSLT, UNIX, shell scripting, Linux, FTP, SSH
OPERATING SYSTEMS: Unix/Linux, Docker, Windows 10, Ubuntu, macOS
FILE FORMATS: Parquet, Avro, JSON, ORC, text, CSV, XML, SOAP
DISTRIBUTIONS: Cloudera/Hortonworks (CDH 4/5, HDP 2.5/2.6), Azure HDInsight, AWS EMR
DATA PROCESSING (COMPUTE) ENGINES: Apache Spark, Spark Streaming, Flume, Kafka, Sqoop, Pentaho Data Integration, Azure Databricks, AWS Kinesis
DATA VISUALIZATION TOOLS: MS SSRS, Power BI, Pentaho CE, QlikView, Matplotlib, Plotly (Falcon SQL Client), Streamlit, Zeppelin
DATABASES: Microsoft SQL Server, MySQL, PostgreSQL 9.5, Amazon Redshift, DynamoDB, Presto SQL engine, Apache Cassandra, Apache Hive, MongoDB (NoSQL), Amazon RDS; database normal forms and data warehouse models, including tabular and star/snowflake schemas, slowly changing dimensions, and bus matrix (Kimball) architecture.
SOFTWARE: Oracle VirtualBox, Eclipse, Apache Airflow, Workbench, DBeaver, Falcon SQL Client, PyCharm, RStudio, Visual Studio Code, Azure Data Studio, DAX Studio, Microsoft Visual Studio, Excel, QlikView, Power BI, SSRS, Acunetix XSS, Microsoft Project, Microsoft Word, PowerPoint, Git, Trello, Slack, Rational Rose XDE, and technical documentation.
WORK HISTORY:
Data Engineer
Confidential, Needham, MA
Responsibilities:
- Automated, configured, and deployed instances in AWS, Azure, and data center environments; familiar with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
- Created automated Python scripts to convert data from different sources and generate ETL pipelines.
- Extensively used Hive optimization techniques such as partitioning, bucketing, map joins, and parallel execution (see the Hive tuning sketch after this list).
- Implemented solutions for ingesting data from various sources and processing data at rest using big data technologies such as Hadoop, MapReduce, HBase, and Hive.
- Designed, developed, and maintained software installation shell scripts.
- Configured a 10-node cluster for processing live data.
- Designed data extraction from different databases and scheduled Oozie workflows to execute daily tasks.
- Developed distributed query agents for performing distributed queries against Hive
- Loaded data from sources such as HDFS and HBase into Spark DataFrames and implemented in-memory computations to generate output responses (see the HDFS-to-DataFrame sketch after this list).
- Monitored resources such as Amazon database instances, CPU, and memory using CloudWatch.
- Collaborated on a Hadoop cluster (CDH) and reviewed log files of all daemons.
- Used Spark SQL to achieve faster results than Hive during data analysis.
- Created Hive external tables and designed data models in Hive.
- Developed multiple Spark Streaming and batch Spark jobs using Python
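Below is a minimal Hive tuning sketch in PySpark illustrating the partitioning, bucketing, and map-join (broadcast) techniques referenced above; the table names, column names, and bucket count are hypothetical placeholders, not details from the original project.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Hive-enabled session; table and column names below are hypothetical.
spark = (SparkSession.builder
         .appName("hive-tuning-sketch")
         .enableHiveSupport()
         .getOrCreate())

events = spark.table("raw_events")     # hypothetical fact-like source table
dim_users = spark.table("dim_users")   # hypothetical small dimension table

# Partition by date and bucket by the join key: date filters prune partitions,
# and joins on user_id against a table bucketed the same way avoid a full shuffle.
(events.write
    .partitionBy("event_date")
    .bucketBy(32, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("events_opt"))

# Broadcast join: the small dimension table is shipped to every executor,
# the Spark analogue of a Hive map join.
result = (spark.table("events_opt")
          .where("event_date = '2020-01-01'")
          .join(broadcast(dim_users), "user_id")
          .groupBy("segment")
          .count())
result.show()
```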
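The HDFS-to-DataFrame sketch below shows loading data from HDFS into a Spark DataFrame and running an in-memory aggregation; the HDFS paths and columns are likewise placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-to-dataframe-sketch").getOrCreate()

# Hypothetical HDFS location; Parquet carries the schema with the data.
orders = spark.read.parquet("hdfs:///data/warehouse/orders")

# Cache so the aggregation below (and any follow-up queries) run against in-memory data.
orders.cache()

daily_revenue = (orders
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue"),
                      F.countDistinct("customer_id").alias("customers")))

daily_revenue.write.mode("overwrite").parquet("hdfs:///data/marts/daily_revenue")
```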
Cloud Data Engineer
Confidential, Orlando, FL
Responsibilities:
- Wrote Hive queries to analyze data in the Hive warehouse using Hue.
- Evaluated and proposed new tools and technologies to meet the needs of the organization.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
- Developed multiple Spark Streaming and batch Spark jobs in Python on AWS.
- Worked with the AWS IAM console to create custom users and groups.
- Learned many technologies on the job as per the requirement of the project.
- Developed communication standards.
- Configured Spark and Spark SQL for faster testing and processing of data stored in HDFS.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation (see the cleansing sketch after this list).
- Parsed JSON into DataFrames using explicit StructType schemas (see the JSON schema sketch after this list).
- Implemented a query parser, planner, and optimizer to override Hive's native query execution, using replicated logs combined with indexes and supporting full relational SQL queries, including joins.
- Transferred data between the Hadoop ecosystem and structured storage in a MySQL RDBMS using Sqoop.
- Wrote complex queries against Apache Hive on Hortonworks.
- Loaded data from servers to AWS S3 buckets.
- Configured bucket permissions and bucket policies.
- Utilized Spark with AWS EMR for data pipeline automation.
- Migrated data between database platforms on AWS, such as from a local SQL Server instance to Amazon RDS and EMR Hive.
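Below is a minimal JSON schema sketch showing how JSON can be parsed into a typed DataFrame with an explicit StructType schema; the field names and S3 prefix are illustrative assumptions, not details from the original feed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder.appName("json-structtype-sketch").getOrCreate()

# Declaring the schema up front avoids a costly inference pass over the raw JSON
# and surfaces malformed records early; these fields are placeholders.
schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("user_id", LongType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("event_ts", StringType(), nullable=True),
])

events = (spark.read
          .schema(schema)
          .option("mode", "PERMISSIVE")              # keep bad records as nulls instead of failing
          .json("s3://example-bucket/raw/events/"))  # hypothetical S3 prefix

events.printSchema()
```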
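The cleansing sketch that follows shows the kind of validation, cleansing, transformation, and custom aggregation pass described above, writing the curated result to a hypothetical S3 prefix; the column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

raw = spark.read.parquet("s3://example-bucket/raw/orders/")   # hypothetical input path

cleaned = (raw
           .dropDuplicates(["order_id"])                               # drop replayed records
           .filter(F.col("amount").isNotNull() & (F.col("amount") >= 0))  # basic validation
           .withColumn("order_date", F.to_date("order_ts"))            # normalize the timestamp
           .withColumn("country", F.upper(F.trim("country"))))         # standardize a text field

# Custom aggregation for the reporting layer.
summary = (cleaned
           .groupBy("order_date", "country")
           .agg(F.sum("amount").alias("revenue"),
                F.count("*").alias("orders")))

(summary.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-bucket/curated/orders_summary/"))
```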
Hadoop Engineer
Confidential, San Francisco, CA
Responsibilities:
- Installed and configured a full Kafka cluster, including topics and replicas.
- Created and managed topics in Kafka.
- Configured replication factors and partitions.
- Managed and monitored consumer groups in Kafka.
- Wrote Python scripts to pull data from REST-based APIs using the requests library and feed it to a Kafka producer (see the producer sketch after this list).
- Performed ETL to Hadoop file system (HDFS) and wrote Hive UDFs.
- Ingested data from Spark DataFrames into HBase.
- Performed aggregation and windowing functions with SQL
- Migrated applications from MapReduce to Spark to improve performance.
- Benchmarked Cassandra against HBase for fast ingestion.
- Processed terabytes of data in real time using Spark Streaming.
- Conducted and managed code reviews.
- Collaborated on sprint planning and backlog grooming.
- Developed unit tests to validate the functionality of Spark applications.
- Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.
- Wrote producer and consumer scripts in Python to process JSON responses.
- Wrote streaming applications with Spark Streaming and Kafka (see the streaming sketch after this list).
- Developed JDBC/ODBC connectors between Hive and Spark to transfer newly populated DataFrames from MS SQL Server.
- Built Hive views on top of source data tables
- Provisioned and secured the Hive Metastore.
- Involved in loading data from the UNIX file system to HDFS.
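Below is a minimal producer sketch of the REST-to-Kafka flow described above, using the requests library and the kafka-python client; the endpoint URL, topic name, and broker address are assumptions for illustration only.

```python
import json

import requests
from kafka import KafkaProducer  # kafka-python client

# Hypothetical endpoint and broker; real values would come from configuration.
API_URL = "https://api.example.com/v1/events"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

response = requests.get(API_URL, timeout=10)
response.raise_for_status()

for record in response.json():
    # Key by a stable id (when present) so related events land in the same partition.
    key = str(record.get("id", "")).encode("utf-8")
    producer.send("events-topic", key=key, value=record)

producer.flush()
```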
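The streaming sketch that follows consumes the same hypothetical topic with Spark Structured Streaming; it assumes the spark-sql-kafka connector package is available on the cluster, and the payload schema is a placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

# Schema of the JSON payload produced above; the fields are placeholders.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
          .option("subscribe", "events-topic")
          .load())

# Kafka delivers the payload as bytes; cast to string and parse with the schema.
parsed = (stream
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (parsed
         .groupBy("id")
         .agg(F.sum("amount").alias("total"))
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```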
Data Engineer
Confidential, Los Angeles, CA
Responsibilities:
- Participated in planning meetings and assisted with documentation and communication.
- Moved on-premises data repositories to the cloud using AWS to take advantage of reduced cost and improved scalability.
- Implemented all SCD types using server and parallel jobs (see the SCD Type 2 sketch after this list); extensively applied error handling, testing, debugging, performance tuning of targets, sources, and transformation logic, and version control to promote the jobs.
- Involved in loading and transforming large sets of structured, semi-structured and unstructured data.
- Involved in loading data from UNIX file system to HDFS.
- Developed ETLs to pull data from various sources and transform it for reporting applications using PL/SQL
- Extracted data from different databases and scheduled Oozie workflows to execute tasks daily.
- Used Sqoop to efficiently transfer data between relational databases and HDFS, and used Flume to stream log data from servers; successfully loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
- Captured data and imported it into HDFS, using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
- Used ZooKeeper to provide coordination services to the cluster.
- Used the Oozie workflow system to orchestrate pipeline steps and execute jobs in a timely manner.
- Moved data between Oracle and HDFS in both directions.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
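Below is a condensed SCD Type 2 sketch in PySpark, in the spirit of the slowly changing dimension work described above; the table names, keys, date types, and the single tracked attribute (address) are hypothetical simplifications, not details from the original jobs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().appName("scd2-sketch").getOrCreate()

# Hypothetical tables: the current dimension and the incoming staging batch
# (effective/end dates assumed to be date-typed).
dim = spark.table("dim_customer")   # customer_id, address, eff_date, end_date, is_current
stg = spark.table("stg_customer")   # customer_id, address, load_date

cur = dim.filter(F.col("is_current") == 1)

# Keys whose tracked attribute changed in this batch.
changed = (stg.join(cur, "customer_id")
              .where(stg["address"] != cur["address"])
              .select(stg["customer_id"], stg["load_date"]))

# 1) Expire the current versions of changed keys.
expired = (cur.join(changed, "customer_id")
              .withColumn("end_date", F.col("load_date"))
              .withColumn("is_current", F.lit(0))
              .drop("load_date"))

# 2) Insert new current versions: changed keys plus keys never seen before.
new_keys = stg.join(dim.select("customer_id"), "customer_id", "left_anti")
inserts = (stg.join(changed.select("customer_id"), "customer_id")
              .unionByName(new_keys)
              .withColumn("eff_date", F.col("load_date"))
              .withColumn("end_date", F.lit(None).cast("date"))
              .withColumn("is_current", F.lit(1))
              .drop("load_date"))

# 3) Carry over everything that was not expired (history plus unchanged current rows).
flagged = changed.select("customer_id").withColumn("chg", F.lit(1))
untouched = (dim.join(flagged, "customer_id", "left")
                .where(F.col("chg").isNull() | (F.col("is_current") == 0))
                .drop("chg"))

# Write to a new table (or swap/rename) rather than overwriting the table being read.
result = untouched.unionByName(expired).unionByName(inserts)
result.write.mode("overwrite").saveAsTable("dim_customer_scd2_out")
```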