Sr. Big Data Engineer Resume
Dallas, TX
SUMMARY
- Highly dedicated, inspiring and expert Sr. Data Engineer with 7+ years of IT industry experience exploring various technologies, tools and databases such as Big Data, AWS, S3, Snowflake, Hadoop, Hive, Spark, Python, Sqoop, CDL (Cassandra), Teradata, Tableau, SQL, PL/SQL and Ab Initio (ACE).
- 4+ years of comprehensive experience in Big Data processing using Hadoop and its ecosystem (MapReduce, Pig, Hive, Sqoop, Flume, Spark).
- Experience in developing Spark programs for batch and real-time processing, including Spark Streaming applications for real-time use cases.
- Proficient in developing data transformation and other analytical applications in Spark and Spark SQL.
- Proficient in using Apache Spark (PySpark) for data processing (Spark SQL), machine learning (ML, MLlib) with big data, Kafka for Big Data processing, and Scala functional programming.
- Experienced in performing Exploratory Data Analysis (EDA) using Python libraries such as NumPy, Pandas and Matplotlib.
- Experience collaborating with developers as a team lead in designing data access, Extract, Transform and Load (ETL) processes, data models and database architecture using data warehousing concepts.
- Good understanding of Azure Big Data technologies such as Azure Data Lake Analytics, Azure Data Lake Store and Azure Data Factory; created a POC for moving data from flat files and SQL Server using U-SQL jobs.
- Good knowledge of scalable, secure cloud architecture based on Amazon Web Services (leveraging AWS EMR clusters, EC2, S3, etc.).
- Hands-on experience creating solution-driven dashboards by developing different chart types, including heat maps, geo maps, symbol maps, pie charts, bar charts, tree maps, Gantt charts, line charts, scatter plots and histograms, in Tableau Desktop, QlikView and Power BI.
- Expertise in designing and developing ETL DataStage server, parallel and sequence jobs to populate data into data warehouses and data marts.
- Expert in developing SSIS/DTS packages to extract, transform and load (ETL) data into data warehouses/data marts from heterogeneous sources.
- Experience with all stages of the SDLC and the Agile development model, from requirements gathering through deployment and production support.
- Excellent at monitoring databases, servers and jobs using the open-source tools Prometheus and Grafana.
- Good understanding of business logic and ability to work well both as part of a team and individually.
PROFESSIONAL EXPERIENCE
Sr. Big Data Engineer
Confidential, Dallas, TX
Responsibilities:
- Responsible for creating on-demand tables on S3 files with Lambda functions and AWS Glue in Python and PySpark (a sketch of this pattern appears after this list).
- Responsible for analyzing business requirements, estimating tasks and preparing mapping design documents for Confidential Point of Sale (POS) and Direct (digital) sales across all GOEs.
- Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDFs, Pig, Sqoop and Spark.
- Worked on cloud administration on Microsoft Azure, including configuring virtual machines and storage accounts.
- Developed Spark applications in Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that consumes data from Kafka in near real time and persists it to Cassandra (see the streaming sketch after this list).
- Consumed XML messages from Kafka and processed the XML files using Spark Streaming to capture UI updates.
- Used Spark SQL to load JSON data, create schema RDDs and load them into Hive tables, and handled structured data using Spark SQL.
- Worked on designing, building, deploying and maintaining MongoDB.
- Designed SSIS packages to bring data from existing OLTP databases into the new data warehouse using various transformations and tasks such as Sequence Container, Script, For Loop and Foreach Loop Container, Execute SQL/Package, Send Mail, File System, Conditional Split, Data Conversion, Derived Column, Lookup, Merge Join, Union All, and OLE DB and Excel sources and destinations, with multiple data flow tasks.
- Developed an ETL framework using Spark and Hive (including daily runs, error handling, and logging) to deliver useful data.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data, automated using Oozie.
- Improved the performance of SSIS packages by implementing parallel execution, removing unnecessary sorting, and using optimized queries and stored procedures.
- Developed accurate integration of the machine learning models required by the application architecture and wrote SQL queries for data extraction and management.
- Wrote multiple MapReduce programs for data extraction, transformation and aggregation from multiple file formats, including XML, JSON, CSV and other compressed file formats.
- Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format, using Hive UDFs.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Developed Pig UDFs to manipulate data according to business requirements, built custom Pig loaders, and implemented various requirements using Pig scripts.
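A minimal sketch of the on-demand table pattern referenced above, assuming an S3 put-event trigger; the Glue database, table prefix, and column schema are illustrative placeholders, not names from the project:

```python
# Hypothetical sketch: register an S3 prefix as an on-demand Glue table when a new
# file lands. Database, table, and column names are illustrative.
import os
import boto3

glue = boto3.client("glue")

DATABASE = os.environ.get("GLUE_DATABASE", "pos_raw")        # assumed database name
TABLE_PREFIX = os.environ.get("TABLE_PREFIX", "pos_sales_")  # assumed table prefix

def lambda_handler(event, context):
    # S3 put-event notification: pull bucket and key of the newly arrived file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    prefix = key.rsplit("/", 1)[0]

    table_name = TABLE_PREFIX + prefix.replace("/", "_")
    glue.create_table(
        DatabaseName=DATABASE,
        TableInput={
            "Name": table_name,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                "Columns": [  # illustrative schema
                    {"Name": "store_id", "Type": "string"},
                    {"Name": "sale_ts", "Type": "timestamp"},
                    {"Name": "amount", "Type": "double"},
                ],
                "Location": f"s3://{bucket}/{prefix}/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )
    return {"table": table_name}
```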
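And a minimal sketch of the Kafka-to-Cassandra streaming path, assuming Spark Structured Streaming with the spark-sql-kafka and DataStax spark-cassandra-connector packages on the classpath; the topic, keyspace, table, and schema are illustrative:

```python
# Hypothetical sketch: consume learner events from Kafka and persist them to Cassandra
# via foreachBatch. Topic, keyspace, and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, TimestampType

spark = (SparkSession.builder
         .appName("learner-stream")
         .config("spark.cassandra.connection.host", "cassandra-host")  # assumed host
         .getOrCreate())

schema = (StructType()
          .add("learner_id", StringType())
          .add("event_type", StringType())
          .add("event_ts", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed brokers
          .option("subscribe", "learner-events")               # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # batch write through the DataStax Spark-Cassandra connector
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_events")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/chk/learner-events")
         .start())
query.awaitTermination()
```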
Environment: Hadoop, HDFS, Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark, MapReduce, PL/SQL, Azure Data Factory (ADF), Power BI Desktop/Server, Azure SQL Database, Azure Databricks, SSIS, Python, Java, Oracle 12c, MySQL, NoSQL, MongoDB, Cassandra
Sr. Data Engineer
Confidential, York, PA
Responsibilities:
- Responsible for implementing the models along with the architects so that the information represented in the heterogeneous source systems is available for analytical reporting by the Big Data COE teams.
- Responsible for analyzing production failures and fixing them within the prescribed SLAs.
- Worked on optimizing multiple production jobs that process millions of records every day.
- Gathered an understanding of the various viewership source systems used by ATT and performed source data analysis to provide feasibility studies and estimates for ETL/Big Data feeds for the use case under consideration.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Worked closely with downstream teams using EVP's data and enhanced the platform to suit their requirements.
- Designed and developed jobs using UNIX, Scala, Informatica, Hive, Spark, Pig, Sqoop and TWS to pull viewership data into the HDFS ecosystem and provide business-ready extracts to downstream users.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB (a sketch of this pattern follows this list).
- Designed and developed Sqoop jobs to ingest customer subscription data into HDFS for a data integration project.
- Designed and developed ingestion/curation/extraction/publish jobs to convert HDFS ORC files to Parquet and send them to Amazon S3 (see the sketch after this list).
- Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
- Analyzed, strategized and implemented the Azure migration of applications and databases to the cloud.
- Troubleshot and identified performance, connectivity and other issues for applications hosted on the Azure platform.
- Involved in a real-time Kafka streaming solution for app health data using the Data Router pub/sub framework hosted on a Kafka cluster.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Developed a framework to publish incremental data from HDFS to an AWS S3 bucket.
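A minimal sketch of the fine-grained S3 access pattern mentioned above, assuming entitlements are kept in a DynamoDB table and access is granted through short-lived pre-signed URLs; the table, key, and bucket names are illustrative:

```python
# Hypothetical sketch: a Lambda checks an entitlement record in DynamoDB before
# handing back a short-lived pre-signed URL for the requested S3 object.
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
ENTITLEMENTS = dynamodb.Table("s3_object_entitlements")  # assumed table name

def lambda_handler(event, context):
    user = event["user_id"]
    bucket = event["bucket"]
    key = event["key"]

    # look up whether this user is entitled to this object
    item = ENTITLEMENTS.get_item(Key={"user_id": user, "object_key": key}).get("Item")
    if not item or not item.get("allowed", False):
        return {"statusCode": 403, "body": "access denied"}

    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=300,  # 5-minute access window
    )
    return {"statusCode": 200, "url": url}
```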
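And a minimal sketch of the ORC-to-Parquet publish job, with assumed HDFS and S3 paths and an illustrative partition column:

```python
# Hypothetical sketch: read curated ORC data from HDFS, convert to Parquet, and
# publish to S3. Paths and the partition column are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-to-parquet-publish").getOrCreate()

source = "hdfs:///data/curated/viewership"      # assumed HDFS location
target = "s3a://publish-bucket/viewership/"     # assumed S3 bucket

df = spark.read.orc(source)

(df.write
   .mode("overwrite")
   .partitionBy("event_date")   # illustrative partition column
   .parquet(target))

spark.stop()
```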
Environment: Hive, Pig, Spark, Sqoop, Oozie, AWS, EC2, S3, Lambda, Auto Scaling, Elasticsearch, CloudWatch, CloudFormation, DynamoDB, Informatica 10.x, Unix shell scripting, PL/SQL, Oracle Exadata/12c/11g, Toad 8.6, SQL Assistant, TWS, SQL*Plus, WinSQL 6.1, WinSCP 3.7
Data Engineer
Confidential, Milpitas, CA
Responsibilities:
- Responsible for ingesting data from various APIs and writing modules to store data in S3 buckets (a sketch of this pattern follows this list).
- Supported continuous storage in AWS using Elastic Block Store, S3 and Glacier; created volumes and configured snapshots for EC2 instances.
- Responsible for writing unit tests and deploying production-level code using Git version control.
- Built data pipelines that involved ingesting data from disparate data sources into a unified platform.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
- Worked on the Hadoop ecosystem with PySpark on Amazon EMR and Databricks.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into MongoDB for further analysis.
- Implemented custom serializers to perform encryption using the DES algorithm.
- Developed Collections in MongoDB and performed aggregations on the collections.
- Used Spark SQL to load JSON data, create schema RDDs and load them into Hive tables, and handled structured data using Spark SQL.
- Used Spark SQL to load data into Hive tables and wrote queries to fetch data from these tables.
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
- Created HBase tables, used HBase sinks, and loaded data into them to perform analytics using Tableau.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline whose output is written to the Glue Catalog and can be queried from Athena.
- Created HBase tables and column families to store the user event data.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Configured, monitored, and optimized Flume agents to capture web logs from the VPN server and land them in the Hadoop data lake.
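A minimal sketch of the API-to-S3 ingestion pattern referenced above, assuming a REST endpoint that returns JSON; the URL, bucket, and key layout are illustrative placeholders:

```python
# Hypothetical sketch: pull records from a REST API and land them in S3 as JSON.
import json
import datetime as dt

import boto3
import requests

API_URL = "https://api.example.com/v1/events"   # assumed endpoint
BUCKET = "raw-ingest-bucket"                    # assumed bucket

def ingest_to_s3():
    s3 = boto3.client("s3")
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    # partition the landing path by ingestion date
    key = f"events/dt={dt.date.today():%Y-%m-%d}/batch.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key

if __name__ == "__main__":
    print(f"wrote s3://{BUCKET}/{ingest_to_s3()}")
```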
Environment: Spark 1.6, HBase 1.2, Python 3.4, PySpark, HDFS, Flume 1.6, Cloudera Manager, SQL, GitHub, Linux, Spark SQL, Kafka, Sqoop 1.4.6, AWS.
Data Engineer
Confidential, Brooklyn, NY
Responsibilities:
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the sketch after this list).
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from various streaming sources such as Kafka, Flume and JMS.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Analyzed the SQL scripts and designed the solution to be implemented using Scala.
- Developed analytical components using Scala, Spark and Spark Streaming.
- Developed UDFs in Java for Hive and Pig and worked on reading multiple data formats on HDFS using Scala.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data.
- Installed, configured and maintained Apache Hadoop clusters for application development along with Hadoop tools like Hive, Pig, HBase, Flume, Oozie, ZooKeeper and Sqoop.
- Created S3 buckets, managed policies for them, and utilized S3 and Glacier for storage and backup on AWS.
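A minimal sketch of the Hive-to-Spark conversion pattern mentioned in the first bullet, shown here in PySpark for consistency with the other sketches (the project itself used Scala); the table and column names are illustrative:

```python
# Hypothetical sketch: rewrite a Hive aggregation query as Spark DataFrame transformations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# Original Hive query being converted (illustrative):
#   SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
#   FROM sales WHERE order_date >= '2016-01-01'
#   GROUP BY region ORDER BY revenue DESC;

sales = spark.table("sales")   # assumed Hive table
result = (sales
          .filter(F.col("order_date") >= "2016-01-01")
          .groupBy("region")
          .agg(F.count("*").alias("orders"),
               F.sum("amount").alias("revenue"))
          .orderBy(F.col("revenue").desc()))

result.show()
```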
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, ZooKeeper, Teradata, PL/SQL, MySQL, HBase, AWS, DataStage, ETL (Informatica/SSIS).
Software Engineer
Confidential, Charlotte, NC
Responsibilities:
- Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
- Utilized Spark SQL to extract and process data by parsing it with Datasets or RDDs in HiveContext, using transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Extended the capabilities of DataFrames using user-defined functions in Scala.
- Resolved missing fields in DataFrame rows using filtering and imputation.
- Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
- Trained analytical models with Spark ML estimators including linear regression, decision trees, logistic regression, and k-means (a pipeline sketch follows this list).
- Performed pre-processing on a dataset prior to training, including standardization and normalization.
- Created pipelines combining transformations, estimators, and evaluation of analytical models.
- Evaluated model accuracy by dividing data into training and test datasets and computing metrics using evaluators.
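A minimal sketch of that workflow, assuming a hypothetical training table with numeric feature columns and a binary label; the table and column names are illustrative:

```python
# Hypothetical sketch: assemble and scale features, fit a logistic regression inside
# a Pipeline, and score it on a held-out split with a binary-classification evaluator.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline").enableHiveSupport().getOrCreate()

df = spark.table("training_data")   # assumed table with numeric features and a label

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])

# hold out a test set, fit on the rest, then compute AUC on the held-out data
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"test AUC = {auc:.3f}")
```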
Environment: Spark 2.0.0, Spark MLlib, Spark ML, Hive 2.1.0, Sqoop 1.99.7, Flume 1.6.0, HBase 1.2.3, MySQL 5.1.73, Scala 2.11.8, Shell Scripting, Tableau 10.0, Agile