Data Engineer Resume
Durham, NC
SUMMARY
- Over 12 years of experience as an Application, Database and Big Data developer, including 4+ years of experience in web application development using Hadoop and related Big Data technologies within the medical, pharmaceutical and recruitment industries
- Hands-on experience in big data and application development using Oracle, MongoDB, J2EE, the Cloudera and Hortonworks Hadoop ecosystem technologies, and distributed systems
- CCA175 Cloudera Certified Spark and Hadoop developer
- Experience with Hadoop APIs (Spark Scala, PySpark) and its ecosystem components (HDFS, HBase, Hive, Impala, Sqoop, Flume, Oozie, Zookeeper, Pig, Spark, Hue)
- Good knowledge of Hadoop architecture and its components, such as HDFS, MapReduce, Job Tracker, Task Tracker, Name Node and Data Node
- Experience extending Pig and Hive functionality with custom UDFs for data analysis and file processing, by running Pig Latin scripts and using Hive Query Language
- Experience working in fast-paced Agile environments, including Scrum, XP and TDD. Good exposure to all phases of the SDLC (analysis, development, testing and implementation), unit testing, Git, and continuous integration and delivery using Jenkins
- Hands-on development in RDBMS (Oracle, SQL Server) and NoSQL databases (MongoDB, HBase), Amazon AWS (EMR, S3, CLI, Redshift, Glue, Lambda, Kinesis), data warehousing, ETL, ELK and Unix shell scripting
- Extensive professional experience in application support, testing and investigating complex data-related issues. Proficient in the ITIL best-practices framework, in addition to performance tracking and evaluation
- Experience analyzing log files for Hadoop and ecosystem services and debugging issues
TECHNICAL SKILLS
Programming Technologies: Java, Scala, Python, SAS Base, JavaScript, Bash, C-Shell, Perl, R, jQuery, J2EE, JSP, Servlet, EJB, Spring Boot, Struts, JDBC, Web Services (SOAP, WSDL), REST API
Data platform and Data Science: Oracle, Postgres, Redshift, RDS, MongoDB, HBase, DynamoDB, MySQL, MS SQL Server, ELK, SAS, Machine Learning, SparkML, Scikit-learn, TensorFlow, Mahout
OS/Web/Cloud Platforms: Linux, UNIX, AWS, GCP, Windows Server, IIS, Apache, Node.js, CentOS
Hadoop ecosystem technologies: HortonWorks (2.6.5.0), Cloudera (5.8), HDFS, MapReduce, Spark, PySpark, Scala, Hive, Sqoop, Pig, HBase, RDD, SparkSQL, DataFrame, Flume, ZooKeeper, Kafka, Impala, Oozie, Hue, Spark Streaming, Storm, Ambari, YARN, Avro, Flink
Tools: Eclipse, Talend, PyCharm, CA Erwin, PDI, RStudio, Toad, NiFi, Tableau, QlikView
Testing Hadoop: MRUnit testing, Quality Center, Hive testing
Project management/DevOps: Jira, AWS DevOps tools, Git, Jenkins, Maven, Ant
PROFESSIONAL EXPERIENCE
Confidential, Durham NC
Data Engineer
Responsibilities:
- Performed bulk and incremental loads of large-scale data migrated from different sources to a data warehouse
- Developed modules for data procurement and aggregation using Spark SQL on AWS EMR
- Redesigned the data transformation pipeline, moving it from Pentaho/Postgres to AWS Glue/Redshift
- Collected business requirements to set data mapping rules for proper data transfer from data source to data target, and used ETL tools (AWS Glue/Pentaho) to load data into the staging tables in the data marts
- Built event-driven triggers for AWS Lambda functions and Glue jobs calling REST APIs, creating a fully automated data cataloging and ETL pipeline to transform the data
- Handled column-oriented file formats (Parquet, ORC) for data storage
- Implemented bulk and delta load processing using Glue, Spark SQL and DataFrames (see the sketch after this list)
- Integrated the application with the Redshift data mart using PySpark
- Fetched data from various upstream applications and made it available for reporting in Redshift.
- Created Python and UNIX shell scripts while interacting with different AWS services.
- Visualized data from the Confidential data mart API in Qlik Sense
- Performed and supported daily data loads pushed to the AWS data lake
- Implemented industry-level best practices in defining and designing the application architecture using different AWS technologies and methodologies
- Extensively used Agile methodology as the Organization Standard to implement ETL and cloud data warehouse best practices
- Analyzed business requirements and cross-verified them against the functionality and features of SQL/NoSQL databases (HBase, DynamoDB, Cassandra, Redshift) to determine the optimal database to migrate the data to
- Hands-on experience with Redshift Spectrum and the AWS Athena query service for reading and analyzing data from S3 in different file formats and compression codecs
- Performed data analysis of large datasets using Python (pandas, NumPy, multiprocessing and other data processing libraries); multiprocessing reduced data processing time on AWS EC2
- Implemented CI/CD pipelines for serverless AWS Glue ETL applications, as part of a DevOps role, using AWS developer tools
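Below is a minimal PySpark sketch of the bulk/delta load pattern described above. The dataset name (orders), the updated_at watermark column and the S3 paths are hypothetical placeholders, not the production Glue job.

```python
# Minimal delta-load sketch in PySpark (hypothetical paths, table and column names).
# The full source is read once for the bulk load; incremental (delta) runs filter
# on an `updated_at` watermark and append partitioned Parquet output to S3.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("delta-load-sketch").getOrCreate()

SOURCE_PATH = "s3://example-raw-bucket/orders/"        # assumed input location
TARGET_PATH = "s3://example-curated-bucket/orders/"    # assumed output location

def load(last_watermark=None):
    df = spark.read.parquet(SOURCE_PATH)
    if last_watermark is not None:
        # Delta load: keep only rows changed since the previous run.
        df = df.where(F.col("updated_at") > F.lit(last_watermark))
    (df.withColumn("load_date", F.current_date())
       .write.mode("append")
       .partitionBy("load_date")
       .parquet(TARGET_PATH))

# Bulk load on the first run, then delta loads driven by a stored watermark.
load()                         # initial bulk load
load("2019-06-01 00:00:00")    # subsequent delta load (example watermark)
```

In practice the watermark would be persisted (for example in a control table or the Glue job bookmark) between runs so that only changed rows are appended.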
Environment: AWS (EMR 5.23, Lambda, Glue, Athena, API Gateway, S3, Redshift, Elastic Beanstalk, DynamoDB, CloudWatch, CloudTrail, SNS, Kinesis, DMS, RDS), Pentaho Data Integration (PDI), Hadoop 2.8.5, Scala, Java, Spark 2.4.3, Kafka, Sqoop, Hive, HBase, Impala, Presto, Zookeeper, Pig, CI/CD AWS DevOps, Jira, Postgres, REST API, Cloudera v5.13, CA Erwin Data Modeler, Cassandra, Python 3.7 (pandas, NumPy), PySpark, Agile environment, data warehouse, data mart, QlikView, Qlik Sense, Alation, Tableau, Zeppelin, Jupyter, Linux (RedHat 7), SAP BDO, MongoDB, PHP, Apache Server, JavaScript
Confidential
Application and Big Data developer
Responsibilities:
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop suitable programs.
- Engineered ETL data pipelines for clickstream data from Adobe (Omniture) data dumps into HDFS; after cleansing, transformed it into a structured format and stored it in the data warehouse to recommend market expansion by visualizing national bundles' sales in Tableau
- Integrated pipelines to upload, process and analyze data from web logs and job search logs to provide the marketing team with weekly performance insights on posted jobs
- Reduced the completion time of an ETL load process from 20 hours to 4 hours using Spark jobs and Sqoop cron jobs, which shortened the customer business review process by 40% through a self-service Tableau solution
- Implemented procedures to move log files generated from various sources to HDFS for further processing through Flume 1.5.2
- Hands-on writing Hive and Pig (0.16) jobs and extending their functionality using UDFs, UDTFs and UDAFs
- Imported and exported data from MongoDB using the Hadoop connector, and from different RDBMSs such as MySQL and Oracle into HDFS (2.7.3), Hive (1.2.1) and HBase (1.1.2) using Sqoop (1.4.6)
- Experienced in transferring data from different data sources into HDFS using Kafka (1.0.0)
- Practical exposure on Hortonworks and Cloudera distributions
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning and buckets; loaded data and wrote Hive queries (see the sketch after this list)
- Used Oozie (4.2.0) operational services for batch processing and scheduling workflows dynamically, and created UDFs to store specialized data structures in HBase and Cassandra
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing
- Involved in working with Spark on top of YARN for interactive and batch analysis
- Good understanding of the DAG cycle for the entire Spark application flow via the Spark application Web UI
- Worked with various HDFS file formats such as Avro, SequenceFile, ORC and JSON
- Created tables using Impala and wrote queries against data stored in HBase
- Mentored analysts and the test team in writing Hive queries
- Experience with Hadoop shell commands, writing MapReduce programs, and verifying, managing and reviewing Hadoop log files
- Experience analyzing log files for Hadoop and ecosystem services and finding the root cause of issues
- Optimized Hive queries and improved performance by configuring Hive query parameters
- Implemented test scripts to support test driven development and continuous integration.
- Worked on the ORC file format, bucketing and partitioning for Hive performance enhancement and storage improvement
- Used MongoDB and Oracle as datastores for message persistence
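A minimal sketch of the partitioned Hive external table pattern referenced above, run through PySpark with Hive support. The table, column and path names (weblogs, staging_weblogs, hdfs:///data/weblogs) are hypothetical placeholders, not the actual project schema.

```python
# Sketch: Hive external table on ORC files with dynamic partitioning,
# executed via PySpark with Hive support (all names are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table backed by ORC files in HDFS, partitioned by load date.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///data/weblogs'
""")

# Allow dynamic partitioning so each load_date value lands in its own partition.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Load from an assumed staging table already registered in the metastore.
spark.sql("""
    INSERT INTO TABLE weblogs PARTITION (load_date)
    SELECT user_id, url, ts, CAST(to_date(ts) AS STRING) AS load_date
    FROM   staging_weblogs
""")
```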
Technologies: HDFS, MapReduce, Spark, PySpark, Scala, Hive, Sqoop, Pig, HBase, RDD, SparkSQL, DataFrame, Flume, Kafka, Oozie, Ambari, HortonWorks HDP, Tableau, MongoDB, Shell scripting, Cassandra, Zookeeper
Database Administration and Developer
Confidential
Responsibilities:
- Migrated Oracle 11g to Amazon Web Services (AWS); ran database setup projects on AWS using EC2 instances and EBS volumes, and set up NAT instances and bastion hosts for connecting to the EC2 instances
- Optimized PL/SQL procedures, functions and packages using advanced database features like global temporary tables, table functions, collections, bulk loading techniques to improve performance.
- Developed database packages, stored procedures and triggers using PL/SQL; optimized SQL queries (hints); created and managed cron jobs using UNIX shell scripting; created indexes; and used Oracle features such as Import/Export, SQL*Loader, collections and bulk loading techniques to improve the performance of data loading and retrieval
- Performed daily DBA tasks (shell scripts, monitoring, tuning and troubleshooting queries and DB issues) on Oracle and MongoDB (see the monitoring sketch after this list)
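As an illustration of the daily monitoring tasks above, here is a minimal Python sketch of a tablespace-usage check. It assumes the cx_Oracle driver, a hypothetical DSN and an account with access to DBA views; the actual checks were shell-script based and scheduled via cron.

```python
# Sketch: daily tablespace-usage check against Oracle (hypothetical connection details).
import cx_Oracle

# Assumed host, port and service name for illustration only.
DSN = cx_Oracle.makedsn("db-host.example.com", 1521, service_name="ORCLPDB1")

def tablespace_usage(user: str, password: str):
    """Return (tablespace_name, used_pct) rows above an 85% usage threshold."""
    query = """
        SELECT tablespace_name, ROUND(used_percent, 1) AS used_pct
        FROM   dba_tablespace_usage_metrics
        WHERE  used_percent > 85
        ORDER  BY used_percent DESC
    """
    with cx_Oracle.connect(user=user, password=password, dsn=DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

if __name__ == "__main__":
    # Credentials are placeholders; a real cron job would read them from a vault or env vars.
    for name, pct in tablespace_usage("monitor_user", "secret"):
        print(f"WARNING: tablespace {name} is {pct}% full")
```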
Technologies: Oracle, MongoDB, Toad, Python, PyCharm, Linux CentOS, Shell scripting (cron jobs), MongoDB Compass, Studio 3T, JSON, MapReduce, JavaScript, AWS EC2, PL/SQL
Java Developer
Confidential
Responsibilities:
- Developed, enhanced and supported the Niche Network platform, which manages more than 40 private-label job boards for associations and publications
- Developed and customized the pay-per-post job posting flow on the Niche J2EE application, creating a new sales lead capture system that delivers scrubbed and qualified leads, based on on-site activity, into the sales funnel, increasing leads by 200%
- Developed and implemented procedures for encrypting sensitive information (users' passwords) at the database (Oracle and MongoDB) and application levels
- Implemented test cases and performed unit testing using JUnit
- Developed a batch post-process for incoming XML FileFeed requests from the web service API to create new jobs, which resulted in better and faster integration with ATSs
- Achieved Payment Card Industry (PCI) compliance by redesigning and implementing a new e-commerce customer flow, which resulted in higher customer trust and safety
- Worked on refactoring the existing code for better maintainability, scalability and efficiency
Technologies: Agile, Java 8, Struts, Spring Boot, JUnit, Python, J2EE, XML, XSLT, jQuery, JavaScript, HTML, CSS3, EJB, JSP, JDBC, Servlet, REST API, Oracle, MongoDB, PL/SQL, Git, AWS (EC2, S3), NetBeans, PyCharm, GlassFish, Apache, Omniture, Toad, Web Services, Jenkins, Bash (cron jobs), ELK Stack (Elasticsearch, Logstash, Kibana)
Confidential
Clinical trials Data manager
Responsibilities:
- Solid understanding of Phase I, II, and III of clinical trials from study start to database lock for RDC and Paper studies, including database design and clinical data management process
- Development/Testing of Case Report Forms, Annotated CRFs, Edit Check Specification, Completion Guidelines and Data Handling Plan for paper or electronic studies, AE reconciliation, and database lock
- Integrated, reviewed and reconciled uploaded data with Oracle Clinical and ensured readiness for analysis with SAS
- Knowledge of CDISC, GCP, ICH and FDA regulatory requirements applied to clinical data management.
- Develop and maintain general data management standard operating procedures (SOPs) as well as study-specific SOPs and working practices related to the data management needs of the projects.
- Track and process all SAEs according to the sponsor's regulatory requirements
Technologies: SAS 9.3, Oracle Clinical v4.5, SAS/MACRO, SAS/STAT, SAS/GRAPH, SAS/ODS, SAS/CONNECT and SAS/ACCESS, Windows, Linux