
Sr. Data Engineer Resume


Minneapolis, MN

SUMMARY

  • 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing across the full cycle in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data (see the PySpark sketch at the end of this summary).
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience in developing MapReduce programs on Apache Hadoop for analyzing big data as per requirements.
  • Experience in developing web applications using Python, Django, C++, XML, CSS, HTML, JavaScript, and jQuery.
  • Strong experience in Teradata, Informatica, Python, and UNIX shell scripting for processing large volumes of data from varied sources and loading them into databases such as Teradata and Oracle.
  • Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, Data Munging, and Machine Learning.
  • Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Proficient in handling complex processes using SAS/Base, SAS/SQL, SAS/STAT, SAS/Graph, and SAS/ODS.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Utilized Agile and Scrum methodology for team and project management.
  • Extensive experience in ETL tools like Teradata Utilities, Informatica, Oracle.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Good experience in big data integration using Informatica BDM and Talend BDI.
  • Imported data into Hive tables using BDM and built ETL processes with BDM from an on-premises Oracle DB to S3 buckets and from S3 to Redshift.
  • Proficient in SQL across several dialects (MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Modeling for fact and dimension tables) using Analysis Services.
  • Experienced in working with JIRA for project management, Git for source code management, Jenkins for continuous integration, and Crucible for code reviews.
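A minimal PySpark sketch of the kind of data cleansing and analysis work summarized above. The input path, column names, and filter thresholds are hypothetical placeholders rather than details from any specific engagement.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Placeholder path and column names -- illustrative assumptions only.
    spark = SparkSession.builder.appName("claims_cleansing").getOrCreate()

    claims = spark.read.parquet("/data/raw/claims")

    # Basic cleansing: de-duplicate, standardize strings, drop invalid amounts.
    clean = (
        claims.dropDuplicates(["claim_id"])
        .withColumn("state", F.upper(F.trim(F.col("state"))))
        .filter(F.col("amount") > 0)
    )

    # Simple analysis: total and average claim amount per state.
    summary = (
        clean.groupBy("state")
        .agg(F.sum("amount").alias("total_amount"),
             F.avg("amount").alias("avg_amount"))
        .orderBy(F.desc("total_amount"))
    )
    summary.show(20, truncate=False)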

TECHNICAL SKILLS

Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, Zookeeper

Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP

Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions

Spark Components: RDD, Spark SQL (DataFrames and Datasets), and Spark Streaming

Cloud Infrastructure: AWS, Azure, GCP

Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL databases (HBase, MongoDB)

Scripting & Query Languages: Shell scripting, SQL

Version Control: CVS, SVN, ClearCase, Git

Build Tools: Maven, SBT

Containerization Tools: Kubernetes, Docker, Docker Swarm

Reporting & Development Tools: Power BI, SAS, Tableau, JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux, UNIX, Google Cloud Shell

PROFESSIONAL EXPERIENCE

Confidential, Minneapolis, MN

Sr. Data Engineer

Responsibilities:

  • Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting data from FTP servers and data warehouses.
  • Developed Scala-based Spark applications for data cleansing, event enrichment, data aggregation, de-normalization, and data preparation for machine learning and reporting teams to consume.
  • Worked on troubleshooting Spark applications to make them more fault tolerant.
  • Integrated Jenkins with tools such as Maven (build), Git (repository), SonarQube (code verification), and Nexus (artifact repository); implemented CI/CD automation by creating Jenkins pipelines programmatically, architecting Jenkins clusters, and scheduling daytime and overnight builds to support development needs.
  • Programmatically created CI/CD pipelines in Jenkins using Groovy scripts and Jenkinsfiles, integrating a variety of enterprise tools and testing frameworks into Jenkins for fully automated pipelines that move code from dev workstations all the way to the production environment.
  • Worked on Docker Hub, Docker Swarm, and Docker container networking, creating image files primarily for middleware installations and domain configurations. Evaluated Kubernetes for Docker container orchestration.
  • Used AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM for build configuration and troubleshooting during server migration from physical hosts to the cloud.
  • Developed REST APIs in Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
  • Analyzed SQL scripts and designed the solutions to implement using PySpark.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a Python producer sketch follows this list).
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
  • Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other capabilities.
  • Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Worked extensively with Sqoop for importing data from Oracle.
  • Created batch scripts to retrieve data from AWS S3 storage and apply the appropriate transformations in Scala using the Spark framework.
  • Involved in creating Hive tables and loading and analyzing data using Hive scripts. Implemented Partitioning, Dynamic Partitions, and Buckets in Hive.
  • Developed a Python script using REST APIs to transfer and extract data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Good experience with continuous Integration of application using Bamboo.
  • Designed and documented operational problem resolutions following standards and procedures using JIRA.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process, and wrote Python scripts to automate the extraction of weblogs using Airflow DAGs.
  • Developed an ETL pipeline in Python to collect data from the Redshift data warehouse.
  • Used MongoDB to store data in JSON format and developed and tested many features of a dashboard using Python, Bootstrap, CSS, and JavaScript.
  • Created workflows and mappings using Informatica ETL and worked with transformations such as Lookup, Source Qualifier, Update Strategy, Router, Sequence Generator, Aggregator, Rank, Stored Procedure, Filter, Joiner, and Sorter.
  • Worked on SSIS, creating the interfaces between the front-end application and the SQL Server database, and between the legacy database and the SQL Server database in both directions.
  • Good hands-on participation in the development and modification of SQL stored procedures, functions, views, indexes, and triggers.
  • Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
  • Used Databricks for encrypting data using server-side encryption.
  • Used Delta Lake, an open-source storage layer that brings reliability to data lakes.
  • Experience with Snowflake Virtual Warehouses.
  • Responsible for ingesting large volumes of IoT data into Kafka.
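A minimal sketch of the REST-to-Kafka producer pattern referenced in this list, using the kafka-python client. The endpoint URL, broker address, and topic name are hypothetical placeholders.

    import json
    import requests
    from kafka import KafkaProducer

    # Placeholder endpoint, broker, and topic -- illustrative assumptions only.
    API_URL = "https://api.example.com/events"
    BOOTSTRAP_SERVERS = ["localhost:9092"]
    TOPIC = "events-raw"

    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP_SERVERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_events():
        """Pull one page of events from the REST API and publish each record to Kafka."""
        response = requests.get(API_URL, timeout=30)
        response.raise_for_status()
        for record in response.json():
            producer.send(TOPIC, value=record)
        producer.flush()

    if __name__ == "__main__":
        publish_events()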

Environment: AWS, Azure, Agile, Jenkins, EMR, Spark, Hive, S3, Athena, Sqoop, Kafka, HBase, Redshift, ETL, Pig, Oozie, Spark Streaming, Docker, Kubernetes, Hue, Scala, Python, Apache NiFi, Git, Microservices, Snowflake.

Confidential, Eagan, MN

Sr. Data Engineer/ Big Data Engineer

Responsibilities:

  • Involved in working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Synapse, SQL DB, and SQL DWH).
  • Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL, and cloud sources (Blob Storage, Azure SQL DB).
  • Created ADF pipelines using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
  • Performed data cleansing and applied transformations using Databricks and Spark data analysis.
  • Extensively used Databricks notebooks for interactive analysis with Spark APIs.
  • Used Delta Lake merge, update, and delete operations to enable complex use cases.
  • Developed Spark Scala scripts for data mining and performed transformations on large datasets to deliver ongoing insights and reports.
  • Implemented scalable microservices to handle concurrency and high traffic. Optimized existing Scala code and improved cluster performance.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Reduced access time by refactoring data models and streamlining queries, and implemented a Redis cache to support Snowflake.
  • Extensive knowledge of data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
  • Developed a Python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
  • Created several types of data visualizations using Python and Tableau.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Involved in creating database components such as tables, views, and triggers using T-SQL to structure and maintain data effectively.
  • Conducted statistical analysis on healthcare data using Python and various tools.
  • Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
  • Worked with the data science team on preprocessing and feature engineering, and helped move machine learning algorithms into production.
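A minimal PySpark sketch of the CSV-to-Hive ORC load described above. The schema, landing path, and table name are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Placeholder schema, path, and table name -- illustrative assumptions only.
    spark = (
        SparkSession.builder
        .appName("csv_to_hive_orc")
        .enableHiveSupport()
        .getOrCreate()
    )

    schema = StructType([
        StructField("event_id", StringType(), True),
        StructField("event_ts", StringType(), True),
        StructField("amount", DoubleType(), True),
    ])

    # Read CSV files that share this schema; malformed rows are dropped.
    df = (
        spark.read
        .option("header", "true")
        .option("mode", "DROPMALFORMED")
        .schema(schema)
        .csv("/data/landing/events/*.csv")
    )

    # Append into a Hive-managed ORC table.
    df.write.format("orc").mode("append").saveAsTable("analytics.events_orc")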

Environment: Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Hadoop 2.x, Spark v2.0.2, NLP, Redshift, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Talend, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Ranger, Git, Microservices, K-Means, KNN.

Confidential, Kansas City, MO

Big Data Engineer

Responsibilities:

  • Configured Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS and HBase (a PySpark Structured Streaming sketch follows this list).
  • Worked on development of data ingestion pipelines using Talend (ETL tool) and bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into Cassandra.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store HBase for data access and analysis.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Used various Spark Transformations and Actions for cleansing the input data.
  • Used Python and Django for creating graphics, XML processing, data exchange, and business logic implementation.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to HBase.
  • Processed real-time streaming data using Kafka and Flume integrated with the Spark Streaming API.
  • Consumed JSON messages using Kafka and processed the JSON file using Spark Streaming to capture UI updates.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed functional Scala programs for streaming data; gathered JSON and XML data and passed it to Flume.
  • Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Wrote and executed various MySQL database queries from Python using MySQL Connector/Python and the MySQLdb package.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
  • Worked on utilizing AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore.
  • Involved in continuous integration of applications using Jenkins.
  • Worked on design and development of Informatica mappings and workflows to load data into the staging area, data warehouse, and data marts in SQL Server and Oracle.
  • Involved in the development of Informatica mappings and preparation of design documents (DD), technical design documents (TDD), and user acceptance testing (UAT) documents.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Configured Spark Streaming to consume Kafka streams and store the data in HDFS.
  • Worked extensively with Sqoop for importing metadata from MySQL and assisted in exporting analyzed data to relational databases using Sqoop.
  • Involved in the migration from on-premises to the Azure cloud.
  • Worked on Hive optimization techniques using joins and subqueries, and used various functions to improve the performance of long-running jobs.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
  • Optimized HiveQL by using Spark as the execution engine.
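A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described in this list. The broker address, topic, schema, and output paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector is on the classpath (the original work may have used the older DStream API).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder broker, topic, schema, and paths -- illustrative assumptions only.
    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType(), True),
        StructField("action", StringType(), True),
        StructField("ts", StringType(), True),
    ])

    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "ui-events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers the payload as bytes; parse the JSON value into columns.
    events = (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/ui_events")
        .option("checkpointLocation", "hdfs:///checkpoints/ui_events")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()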

Environment: Azure HDInsight, Apache Spark, Apache Kafka, EMR, Scala, Talend, PySpark, HBase, Hive, Sqoop, Flume, Informatica, Glue, Hadoop, NiFi, HDFS, Oozie, MySQL, Oracle 10g, UNIX, Shell

Confidential

Data Engineer

Responsibilities:

  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
  • Implemented Partitioning, Dynamic Partitions, and Buckets in Hive (see the sketch after this list).
  • Installed and configured Sqoop to import and export data between Hive and relational databases.
  • Administered large Hadoop environments, including cluster setup, build and support, performance tuning, and monitoring in an enterprise environment.
  • Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
  • Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Used Python & SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.
  • Designed and developed data mapping procedures and the ETL process (data extraction, data analysis, and loading) for integrating data using R programming.
  • Created different Pig scripts and executed them through shell scripts.
  • Loaded data into HDFS from different data sources such as Oracle and DB2 using Sqoop and loaded it into Hive tables.
  • Designed and developed Pig Latin scripts and Pig command line transformations for data joins and custom processing of MapReduce outputs.
  • Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Set up alerting and monitoring using Stackdriver in GCP.
  • Designed and implemented large-scale distributed solutions in the AWS and GCP clouds.
  • Monitored Hadoop cluster health through MCS and worked on NoSQL databases including HBase.
  • Used Hive, created Hive tables, and was involved in data loading and writing Hive UDFs; worked with the Linux server admin team in administering the server hardware and operating system.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
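A minimal sketch of Hive partitioning and bucketing of the kind referenced in this list, issued as HiveQL through PySpark. The database, table, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Placeholder database, table, and column names -- illustrative assumptions only.
    spark = (
        SparkSession.builder
        .appName("hive_partitioning_demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow dynamic partitioning so partition values come from the data itself.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # Partitioned and bucketed target table stored as ORC.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.orders_part (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        CLUSTERED BY (customer_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert: order_date values in the source drive the partitions.
    spark.sql("""
        INSERT INTO TABLE sales.orders_part PARTITION (order_date)
        SELECT order_id, customer_id, amount, order_date
        FROM sales.orders_staging
    """)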

Environment: Hadoop YARN, Zookeeper, Spark 1.6, Spark Streaming, Spark SQL, Scala, Pig, Python, Hive, Sqoop, MapReduce, NoSQL, HBase, Tableau, Java, AWS S3, Oracle 12c, Linux

Confidential

Data & Reporting Analyst

Responsibilities:

  • Performed data transformations like filtering, sorting, and aggregation using Pig
  • Created Sqoop jobs to import data from SQL, Oracle, and Teradata into HDFS.
  • Created Hive tables to push the data to MongoDB.
  • Wrote complex aggregate queries in Mongo for report generation (a pymongo sketch follows this list).
  • Developed scripts to run scheduled batch cycles using Oozie and present data for reports.
  • Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
  • Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored in efficient formats such as Parquet and loaded into Amazon S3 using the Spark Scala API.
  • Implemented automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
  • Performed data validation and transformation using Python and Hadoop streaming.
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and into partitioned Hive tables.
  • Developed bash scripts to bring TLOG files from the FTP server and then process and load them into Hive tables.
  • Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop data lake.
  • Extensively used the DB2 database to support SQL processing.
  • Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
  • Overwrote the Hive data with HBase data daily to get fresh data every day and used Sqoop to load data from DB2 into the HBase environment.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience in using Spark Shell and Spark Streaming.
  • Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
  • Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
  • Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
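A minimal pymongo sketch of the kind of aggregate query used for report generation above. The connection string, database, collection, and field names are hypothetical placeholders.

    from pymongo import MongoClient

    # Placeholder connection string, database, collection, and field names --
    # illustrative assumptions only.
    client = MongoClient("mongodb://localhost:27017")
    orders = client["reporting"]["ticket_sales"]

    # Aggregate completed ticket sales per region and day for the report extract.
    pipeline = [
        {"$match": {"status": "COMPLETED"}},
        {"$group": {
            "_id": {"region": "$region", "day": "$sale_date"},
            "total_amount": {"$sum": "$amount"},
            "ticket_count": {"$sum": 1},
        }},
        {"$sort": {"_id.day": 1, "total_amount": -1}},
    ]

    for row in orders.aggregate(pipeline):
        print(row["_id"]["day"], row["_id"]["region"],
              row["ticket_count"], row["total_amount"])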

Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, UNIX, Shell Scripting.
