
Big Data Developer Resume


TX

SUMMARY

  • 7+ years of professional experience as a Big Data Engineer/Developer in analysis, design, development, implementation, maintenance, and support, with a focus on Big Data, Hadoop development, and ecosystem analytics.
  • Hands-on experience across the Hadoop ecosystem, including ETL, HDFS, MapReduce, PySpark, Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Zookeeper; also worked with Spark SQL, Spark Streaming, and AWS services such as EMR, S3, Airflow, Glue, and Redshift.
  • Excellent understanding of Hadoop architecture, Hadoop daemons, and components such as HDFS, YARN, Resource Manager, Node Manager, Name Node, Data Node, and the MapReduce programming paradigm.
  • Experience with AWS cloud services such as EC2, VPC, S3, Glue, EMR, Redshift, CloudWatch, and Lambda, as well as ETL. Experience working with both streaming and batch data processing using multiple technologies.
  • Hands-on experience using Sqoop to import and export data between HDFS and relational database systems as part of ETL workflows.
  • Worked extensively on Hive and Pig scripts: loaded data from OLAP/OLTP systems into HDFS, created Hive tables, and wrote Pig scripts implementing CDC logic to process data in the Hadoop environment.
  • Hands-on experience building, scheduling, and monitoring workflows using Apache Airflow with Python (see the Airflow sketch after this list).
  • Hands-on experience developing data pipelines using Spark components, Spark SQL, Spark Streaming.
  • Hands-on experience with Kafka and Spark Streaming as a messaging layer, consuming data from Kafka topics and loading the data for real-time reporting.
  • Expertise in using Kafka as a messaging system to implement real-time streaming solutions; implemented Sqoop for large data transfers between RDBMS and HDFS/HBase/Hive.
  • Experience building high-throughput ETL pipelines and high-performance data lakes.
  • Worked on ETL processes, data mining, and web reporting features for data warehouses using Business Objects.
  • Worked on data pipelines that transfer and process several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive, and Impala.
  • Strong understanding of Data Modelling and experience with Data Cleansing, Data Profiling and Data analysis.
  • Knowledge of building data pipelines and performing various operations on Amazon Web Services.
  • Proficient in shell and Bash scripting.
  • Experience spinning up EMR clusters and storing data in S3.
  • Worked across the complete Software Development Life Cycle in an Agile model.
  • Excellent team player, able to work on both the development and maintenance phases of a project.
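
A minimal sketch of the Airflow-with-Python workflow pattern referenced above, assuming Airflow 2.x import paths; the DAG id, task names, paths, and bucket are hypothetical placeholders, not taken from any specific project.

```python
# Hypothetical daily ingestion DAG: validate landing files, then copy them to S3.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_landing_files(**context):
    # Placeholder check; real logic would inspect the landing directory or bucket.
    print("validating landing files for", context["ds"])


with DAG(
    dag_id="daily_ingestion_example",          # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    validate = PythonOperator(
        task_id="validate_landing_files",
        python_callable=validate_landing_files,
    )
    load_to_s3 = BashOperator(
        task_id="load_to_s3",
        bash_command="aws s3 cp /data/landing s3://example-bucket/raw/ --recursive",
    )
    validate >> load_to_s3
```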

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, Impala, Sqoop, HBase, Flume, Oozie, Zookeeper

Hadoop Distributions: Cloudera, Hortonworks, Apache.

Cloud Environments: Amazon RDS, Amazon Redshift, Oracle, SQL Server, MySQL, MS Access, Teradata, S3, EC2, Lambda, AWS EMR

Operating Systems: Linux, Windows

Languages: Python, SQL, Scala, Java

Databases: Oracle, SQL Server, MySQL, HBase, MongoDB, RedShift, DynamoDB

ETL Tools: Informatica

Report & Development Tools: Eclipse, IntelliJ Idea, Visual Studio Code, Jupyter Notebook, Tableau, Power BI.

Development/Build Tools: Maven, Gradle

Repositories: GitHub, SVN.

Scripting Languages: Bash/shell scripting (Linux/Unix)

Methodology: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, TX

Big Data Developer

Responsibilities:

  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
  • Worked on migrating existing SQL data and reporting feeds to Hadoop.
  • Developed the different ingestion pipeline components using Spark and Python.
  • Created Hive tables to store the processed results in Avro and ORC formats.
  • Wrote script files for processing data and loading it into HDFS.
  • Developed UNIX shell/Python scripts to create reports from Hive data.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Developed Python APIs to dump array structures from the processor at the failure point for debugging; handled web application concerns including UI security, logging, and backend services.
  • Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs (see the PySpark sketch at the end of this section).
  • Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub, virtualizing the Dev and Test environment servers with Docker and automating configuration through containerization.
  • Integrated services such as Bitbucket, AWS CodePipeline, and AWS Elastic Beanstalk to create a deployment pipeline.
  • Created S3 buckets in the AWS environment to store files, some of which serve static content for a web application.
  • Good knowledge of creating and launching EC2 instances from Linux, Ubuntu, RHEL, and Windows AMIs; wrote shell scripts to bootstrap instances.
  • Used IAM to create roles, users, and groups and implemented MFA to provide additional security for the AWS account and its resources; used AWS ECS and EKS for Docker image storage and deployment.
  • Extensively involved in the design phase and delivered design documents. Experience in the Hadoop ecosystem with HDFS, Hive, Zookeeper, and Spark with Python.
  • Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
  • Worked on AWS CLI, Auto Scaling, and CloudWatch monitoring creation and updates.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Worked on HBase table setup and a shell script to automate the ingestion process.
  • Analyzed data using low-level Spark APIs such as RDDs and DStreams with Python.
  • Performed complex mathematical, statistical analysis using Spark Streaming.
  • Refactored data quality code to reduce the number of tasks.
  • Worked extensively on Teradata performance optimization, bringing queries that spooled out or never finished down to seconds or minutes using various Teradata optimization strategies.
  • Experienced with developing ETL streams using Informatica and Teradata.
  • Migrated ETL from on-prem Informatica to the cloud using Informatica DEI and converted Teradata BTEQ scripts to Python/SQL for execution in DEI.
  • Experience with the workflow schedulers Zookeeper and Control-M to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
  • Worked extensively on Sqoop for importing and exporting data from relational databases like Oracle to Hadoop ecosystem.
  • Developed Spark-SQL scripts in Hortonworks for data extraction.
  • Coordinating and leading ETL (Extract-Transform-Load) Development strategy for implementation of the Design requirement.
  • Extracted, transformed, and loaded data from source systems to on-premises data storage and processed the data in Hortonworks.

Environment: Hadoop, Linux, MySQL, HDFS, YARN, Impala, Hive, Sqoop, Spark, Python, Oozie, MapReduce, Hadoop Data Lake, Hortonworks, AWS, Databricks, ADLS Gen2.
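
A minimal PySpark sketch of the Hive-to-Spark pattern described above: read a Hive table, apply the equivalent of a SQL aggregation as DataFrame transformations, and persist the result as an ORC-backed Hive table. Database, table, and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-spark-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Equivalent of: SELECT customer_id, SUM(amount) AS total_amount
#                FROM staging.transactions WHERE txn_date = '2023-01-01'
#                GROUP BY customer_id
daily_totals = (
    spark.table("staging.transactions")          # hypothetical Hive table
    .where(F.col("txn_date") == "2023-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Store the processed results as an ORC-format Hive table.
(
    daily_totals.write
    .format("orc")
    .mode("overwrite")
    .saveAsTable("analytics.daily_customer_totals")
)
```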

Confidential, Tampa, FL

Data Engineer

Responsibilities:

  • Extracted, transformed, and loaded data from source systems to AWS data storage services using a combination of T-SQL, Spark SQL, and U-SQL.
  • Worked with ETL tools including Talend Data Integration, Talend Big Data, Pentaho Data Integration, and Informatica.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
  • Worked on migrating existing SQL data and reporting feeds to Hadoop.
  • Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
  • Worked on AWS CLI, Auto Scaling, and CloudWatch monitoring creation and updates.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Worked on HBase table setup and a shell script to automate the ingestion process.
  • Created PySpark data frames to bring data from DB2 to Amazon S3.
  • Translated business requirements into maintainable software components and understood their impact (technical and business).
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
  • Performed optimization and troubleshooting, and integrated test cases into the CI/CD pipeline using Docker images.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Ensured that quality standards were defined and met.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Migrated on-prem Informatica ETL processes to the AWS cloud and Snowflake.
  • Implemented a CI/CD (Continuous Integration and Continuous Delivery) pipeline for code deployment.
  • Reviewed components developed by team members.
  • Worked on Spark Streaming to consume ongoing data from Kafka and store the streamed data in HDFS (see the streaming sketch at the end of this section).
  • Worked on the development of tools that automate AWS server provisioning, automated application deployments, and basic failover among regions through the AWS SDKs.
  • Translated business needs for a workflow tool and a CRM system in an Agile model to create UAT test cases, and reviewed SIT test cases.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python and SQL queries.
  • Strong analytical and conceptual skills in database design and modeling in Teradata.
  • Proficient in Teradata performance tuning, from design architecture to application development.
  • Optimized high-volume tables in Teradata using various join index techniques, secondary indexes, join strategies, and hash distribution methods.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Developed scripts to create external tables and update partitioning information on a daily basis.
  • Streamlined Hadoop jobs and workflow operations using AutoSys workflows, scheduled through AutoSys on a monthly basis.
  • Good experience logging defects in Jira and AWS DevOps tools.
  • Converted MapReduce algorithms into Spark transformations and actions by creating RDDs and pair RDDs.
  • Built reusable Hive UDF libraries for business requirements, enabling users to apply these UDFs in Hive queries.
  • Involved in converting Hive/SQL queries into Spark functionality and analyzing them using the Scala API.
  • Responsible for developing scalable distributed data solutions using Hadoop.
  • Built Spark DataFrames to process huge amounts of structured data.
  • Used JSON to represent complex data structures within a MapReduce job.
  • Stored and preprocessed logs and semi-structured content on HDFS using MapReduce and imported them into the Hive warehouse.

Environment: Hadoop, Apache Spark, Spark SQL, DataFrames, Scala, HDFS, Hive, Oozie, Kafka, Oracle, Python/PySpark, Sqoop, HBase, Shell Scripting, Glue, S3, Core Java, Cassandra, Toad, and Linux.
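
A minimal Structured Streaming sketch of the Kafka-to-HDFS flow described above, assuming the Spark Kafka connector package is on the classpath; broker addresses, the topic name, and HDFS paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs-example").getOrCreate()

# Consume a Kafka topic as a streaming DataFrame.
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
    .option("subscribe", "server-logs")                              # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for storage.
log_lines = raw_stream.select(
    F.col("value").cast("string").alias("log_line"),
    F.col("timestamp"),
)

# Append the streamed records to HDFS with checkpointing.
query = (
    log_lines.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/server_logs")
    .option("checkpointLocation", "hdfs:///checkpoints/server_logs")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```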

Confidential, Nashville, TN

Data Engineer

Responsibilities:

  • Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Developed code to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark in Python (see the sketch at the end of this section).
  • Developed data ingestion, preprocessing, and post-ingestion transformations from various data sources such as Oracle, SQL Server, and Teradata using Sqoop and the Teradata connector, and loaded the data into Hive as ORC tables.
  • Used the Oozie scheduler to automate the pipeline workflows.
  • Written Spark programs in Python and ran Spark jobs on YARN.
  • Developed complex and multi-step data ingestion pipeline using Spark.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames with Python and Spark SQL.
  • Monitoring YARN applications. Troubleshoot and resolve cluster related system problems.
  • Assessed existing EDW (enterprise data warehouse) technologies and methods to ensure the EDW/BI architecture meets the needs of the business and enterprise and allows for growth.
  • Captured data from existing databases that provide MySQL interfaces using Sqoop.
  • Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Used Teradata Profiler to analyze data.
  • Coded using Teradata analytical functions and Teradata BTEQ SQL, and wrote UNIX scripts to validate, format, and execute the SQL in the UNIX environment.
  • Reduced Teradata space usage by optimizing tables, adding compression where appropriate, and ensuring optimal column definitions.
  • Involved in writing Unix/Linux Shell Scripting for scheduling jobs and for writing Hive scripts.
  • Developed Scripts and automated data management from end to end and sync up between all the clusters.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Assisted in exporting data into Cassandra and writing column families to provide fast listing outputs.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Spark jobs that extract and transform data.
  • Used Zookeeper for providing coordinating services to the cluster.

Environment: Apache Hadoop, HDFS, Hive, Python, Sqoop, Spark, Cloudera CDH5, Oracle, MySQL, Cassandra, Tableau.
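
A minimal PySpark sketch of the SQL Server-to-Hive ingestion described above: read a table over JDBC and land it in the Hive warehouse as ORC. The JDBC URL, credentials, and table names are hypothetical placeholders, and the Microsoft SQL Server JDBC driver is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sqlserver-to-hive-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a SQL Server table over JDBC (placeholder connection details).
orders = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Land the data in the Hive warehouse as ORC so downstream Hive views can use it.
orders.write.format("orc").mode("overwrite").saveAsTable("staging.orders")
```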
