
Big Data Developer Resume


Houston, TX

SUMMARY

  • 10+ years of experience in the IT industry, with extensive experience in the Hadoop stack, big data technologies, AWS, Java, Python, Scala, RDBMS, ETL, and GIS.
  • More than 4 years of hands-on experience using the Spark framework with Scala/Python.
  • Good experience building Python REST APIs.
  • 3+ years of experience in ETL development (2+ years with Pentaho Data Integration).
  • Good experience working with AWS S3, Amazon Athena, AWS Glue, Presto, etc.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
  • Developed Spring Boot applications with microservices and deployed them to AWS on EC2 instances.
  • Strong experience working with HDFS, MapReduce, Spark, AWS EMR, Hive, Impala, Pig, Sqoop, Flume, Kafka, NIFI, Oozie, HBase, MSSQL and Oracle.
  • Good knowledge of geospatial data, building custom maps, and administering PostGIS databases.
  • Good understanding of distributed systems, HDFS architecture, internal working details of MapReduce and Spark processing frameworks.
  • Good working knowledge on Snowflake and Teradata databases.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Created PySpark jobs to bring data from DB2 to Amazon S3 (a minimal sketch follows this list).
  • Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
  • Used Hadoop and Palantir Foundry to push messages for business statistical analysis of customer-related information.
  • Azure Cloud Engineer / DevOps Engineer with 12+ years of sound experience supporting, troubleshooting, scaling, and maintaining enterprise-level applications, with a strong technical background in Confidential products and technologies such as Azure Cloud Services, C#, Windows Server, SQL Server, IIS, and SharePoint.
  • 3+ years of hands-on experience with Azure and a strong understanding of Azure capabilities and limitations, primarily in the IaaS space.
  • Architecture and implementation experience with medium-to-complex on-premise-to-Azure migrations.
  • Provided guidance to development teams working on PySpark as an ETL platform.
  • Worked on Continuous Integration (CI)/Continuous Delivery (CD) pipelines for Azure Cloud Services using Chef.
  • Hands-on experience with high-availability methodologies for Azure Cloud and SQL Server 2014 AlwaysOn Availability Groups (AOAG).
  • Worked on building Hadoop clusters on both Cloudera and Hortonworks distributions.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Hands on experience on Unified Data Analytics with Databricks, Databricks Workspace User Interface, Managing Databricks Notebooks, Delta Lake with Python, Delta Lake with Spark SQL.
  • Good understanding of Spark Architecture with Databricks, Structured Streaming. Setting Up AWS and Microsoft Azure with Databricks, Databricks Workspace for Business Analytics, Manage Clusters in Databricks, Managing the Machine Learning Lifecycle.
  • Tableau experience for data analytics and visualization.
  • Good exposure to performance tuning hive queries, MapReduce jobs, spark jobs.
  • Experienced with different relational databases such as Teradata, Oracle, and SQL Server.
  • Experience in designing and developing applications leveraging MongoDB.
  • Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, relational data ingestion.
  • Migration of on-premise data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Optimized PySpark jobs to run on Kubernetes clusters for faster data processing.
  • Experience with Palantir technologies such as Slate and Contour, using Mesa as the programming language.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
  • Good understanding of various compression techniques used in Hadoop processing, such as Gzip, Snappy, and LZO.
  • Worked with Elasticsearch as a tool for addressing big data search problems.
  • Expertise in importing and exporting data from/to traditional RDBMS using Apache Sqoop.
  • Tuned Pig and Hive scripts by understanding their join, grouping, and aggregation behavior.
  • Extensively worked on HiveQL and join operations, writing custom UDFs, with good experience optimizing Hive queries.
  • Worked with various Hadoop distributions (Cloudera, Hortonworks, and Amazon AWS).
  • Good Understanding of Azure Internal and External Load Balancers and Networking concepts.
  • Experience automating Azure and other operational tasks.
  • Experienced in automating and scheduling Teradata SQL scripts in UNIX using Korn shell scripting.
  • Instrumental in designing, analyzing, and fine-tuning service telemetry. Experience working with Azure Active Directory (AAD).
  • Mastered different columnar file formats such as RCFile, ORC, and Parquet.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed.
  • Experience in data processing: collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
  • Hands on experience in installing, configuring and deploying Hadoop distributions in cloud environments (Amazon Web Services).
  • Good experience in optimizing MapReduce algorithms by using combiners and custom partitioners.
  • Hands on experience in NOSQL databases like HBase, Cassandra and MongoDB.
  • Responsible for estimating cluster size, monitoring, and troubleshooting Spark Databricks clusters.
  • Worked with Mesa as a programming language to perform data operations in Slate and Contour (Palantir Technologies proprietary tools).
  • Expertise in back-end/server-side Java technologies such as web services, Java Persistence API (JPA), Java Messaging Service (JMS), and Java Database Connectivity (JDBC).
  • Very good understanding of the Agile Scrum process.
  • Experience using version control tools such as Bitbucket, Git, and SVN.
  • Optimized Hive queries using best practices and the right parameters, working with technologies such as Hadoop, YARN, Python, and PySpark.
  • Good knowledge of Oracle and MSSQL and excellent SQL query-writing skills.
  • Performed performance tuning and productivity improvement activities.
  • Hands-on experience in developing microservices and deploying them in Docker.
  • Experience creating scripts for data modeling and data import and export. Extensive experience in deploying, managing, and developing MongoDB clusters. Experience writing JavaScript for DML operations with MongoDB.
  • Experienced working on large volumes of data using Teradata SQL and Base SAS programming.
  • Experienced in extracting data from mainframe flat files and converting them into Teradata tables using SAS PROC IMPORT, PROC SQL, etc.
  • Provided data integration and data ingestion for the use cases required for Palantir analytics.
  • Extensive use of use case diagrams, use case models, and sequence diagrams using Rational Rose.
  • Creation, configuration, and monitoring of shard sets. Analyzed the data to be sharded and chose shard keys to distribute data evenly. Architecture and capacity planning for MongoDB clusters. Implemented scripts for MongoDB import, export, dump, and restore.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design. Created multiple databases with sharded collections, choosing shard keys based on requirements. Experience managing MongoDB environments from availability, performance, and scalability perspectives.
  • Created various types of indexes on different collections to achieve good performance in MongoDB.
  • Proactive in time management and problem solving; self-motivated with good analytical skills.
  • Analytical and organizational skills with the ability to multitask and meet deadlines.
  • Excellent interpersonal skills in areas such as teamwork, communication and presentation to business users or management teams.
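
For illustration only, the following is a minimal PySpark sketch of the DB2-to-S3 ingestion mentioned in this list; the host, table, credentials, and bucket names are hypothetical placeholders, and the DB2 JDBC driver and S3 credentials are assumed to be configured on the cluster.

    # Minimal sketch (hypothetical names): pull a DB2 table over JDBC and land it in S3 as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("db2-to-s3-ingest").getOrCreate()

    # Read the source table from DB2 via the JDBC data source.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:db2://db2-host:50000/SALESDB")   # hypothetical host/database
              .option("dbtable", "SALES.ORDERS")                    # hypothetical table
              .option("user", "etl_user")
              .option("password", "****")
              .option("driver", "com.ibm.db2.jcc.DB2Driver")
              .load())

    # Light cleansing before landing the data.
    clean = orders.dropDuplicates(["ORDER_ID"]).filter("ORDER_STATUS IS NOT NULL")

    # Write to S3 as Parquet, partitioned for downstream Athena/Hive queries.
    (clean.write
          .mode("overwrite")
          .partitionBy("ORDER_DATE")
          .parquet("s3a://example-data-lake/raw/orders/"))          # hypothetical bucket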

TECHNICAL SKILLS

Big Data Ecosystems: Hadoop, MapReduce, Spark, AWS, EMR, NiFi, HDFS, HBase, Pig, Impala, Hive, Sqoop, PySpark, Oozie, Kafka, Flume, and Tableau

ETL: Pentaho Data Integration, Talend, Informatica.

Spark Streaming Technologies: Spark Streaming, Storm

Scripting Languages: Shell, Python, JavaScript.

Programming Languages: Java, Python, Scala, SQL, PL/SQL, Teradata V2R5

Databases: PostgreSQL, Oracle, MSSQL, MySQL

Tools: Eclipse, IntelliJ, GIT, JIRA, MS Visual Studio, Net Beans, Tableau, Pentaho PDI, Talend, Informatica

Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, Houston TX

Big Data Developer

Responsibilities:

  • Developed Spark applications using Python utilizing Data frames and Spark SQL API for faster processing of data.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.
  • Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Used Hadoop and Palantir Foundry to push messages for business statistical analysis of customer-related information.
  • Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Developed Java code that creates Elasticsearch mappings before data is indexed.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Wrote several Teradata SQL queries using Teradata SQL Assistant for ad hoc data pull requests.
  • Created Teradata objects such as tables and views.
  • Extensively worked on converting Oracle scripts into Teradata scripts.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames and Python.
  • Involved in Migrating Objects from Teradata to Snowflake.
  • Hands-on experience in developing microservices and deploying them in Docker.
  • Installed MongoDB RPMs and tar files and prepared YAML config files.
  • Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, relational data ingestion.
  • Migration of on-premise data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
  • Used different tools for data integration with different databases and Hadoop.
  • Provided data integration and data ingestion for the use cases required for Palantir analytics.
  • Responsible for implementing the Informatica CDC logic to process the delta data.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Created Informatica mappings, wrote UNIX shell scripts, and modified PL/SQL scripts.
  • Analyzed the SQL scripts and designed the solution for implementation in Scala.
  • Created multiset, temporary, derived, and volatile tables in the Teradata database.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Designed, deployed, maintained, and led the implementation of cloud solutions using Confidential Azure and underlying technologies.
  • Migrated services from on-premise to Azure cloud environments. Collaborated with development and QA teams to maintain high-quality deployments.
  • Designed Client/Server telemetry adopting latest monitoring techniques.
  • Worked on Continuous Integration (CI)/Continuous Delivery (CD) pipelines for Azure Cloud Services using Chef.
  • Configured Azure Traffic Manager to build routing for user traffic. Drove operational efforts to migrate all legacy services to a fully virtualized infrastructure.
  • Implemented HA deployment models with Azure Classic and Azure Resource Manager.
  • Configured Azure Active Directory and managed users and groups.
  • Used MongoDB tools such as MongoDB Compass, Atlas, Ops Manager, and Cloud Manager.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication and schema design.
  • Wrote Mesa code to support the data transformation and data integration required for Palantir analytics use cases.
  • Developed Python REST APIs to feed the web applications (a minimal sketch follows this list).
  • Utilized ODBC connectivity to Teradata via MS Excel to retrieve data automatically from the Teradata database.
  • Designed and developed various ad hoc reports for different business teams (Teradata and Oracle SQL, MS Access, MS Excel).
  • Extracted DEM and DSM data from various image format files.
  • Extensive experience with search engines such as Apache Lucene, Apache Solr, and Elasticsearch.
  • Developed custom maps with custom data and hosted them on a web server.
  • Developed transformations and custom aggregations to feed data to GIS applications.
  • Used various Teradata index techniques to improve query performance.
  • Deployed Spring Boot-based microservices in Docker and Amazon EC2 containers using Jenkins.
  • Worked with Splunk and ELK stack for creating monitoring and analytics solutions.
  • Developed Microservices using Spring MVC, Spring Boot, and Spring Cloud.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Used a microservices architecture, with Spring Boot-based services interacting through REST.
  • Worked with Elasticsearch, an open-source, scalable full-text search engine built on Apache Lucene.
  • Provided L1 and L2 support for Palantir use case business users across the state.
  • Enhanced the existing flows that feed data to GIS applications.
  • Created Docker containers with custom requirements.
  • Experience writing JavaScript for DML operations with MongoDB.
  • Extracted and transformed data to and from PostgreSQL.
  • Wrote SAS scripts to read tab-delimited flat files and convert them into Teradata tables using FastLoad techniques.
  • ETL transformations for geospatial data.
  • Data validation and quality checks for data consumed by Palantir use cases.
  • Built interactive documentation for the API scripts.
  • Developed Spring Boot applications with microservices and deployed them to AWS on EC2 instances.
  • Database administrator for PostgreSQL.
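
As a minimal illustration of the Python REST API work noted in this list, the sketch below uses Flask with a PostgreSQL backend; the route, table, and connection details are hypothetical.

    # Minimal Flask sketch (hypothetical route/table): a read-only endpoint feeding a web application.
    from flask import Flask, jsonify
    import psycopg2

    app = Flask(__name__)

    @app.route("/api/v1/assets/<int:asset_id>", methods=["GET"])
    def get_asset(asset_id):
        # Hypothetical PostgreSQL connection; in practice credentials come from configuration/secrets.
        conn = psycopg2.connect("dbname=gisdb user=api_user password=**** host=localhost")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT id, name, status FROM assets WHERE id = %s", (asset_id,))
                row = cur.fetchone()
        finally:
            conn.close()
        if row is None:
            return jsonify({"error": "not found"}), 404
        return jsonify({"id": row[0], "name": row[1], "status": row[2]})

    if __name__ == "__main__":
        app.run(port=8080)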

Environment: Hadoop, HDFS, Hive, Kafka, Sqoop, PySpark, Shell Scripting, Azure Traffic Manager, Spark, AWS EMR, Teradata SQL Assistant 7.0, Linux CentOS, AWS S3, Cassandra, Java, Scala, Eclipse, Azure Cloud Services, Azure Data Factory (ADF) V2, Teradata V2R6, Maven, Agile, Palantir Foundry.

Confidential, Marlborough MA

Sr. Integration Developer (Hadoop and ETL)

Responsibilities:

  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Created end-to-end Spark applications to perform various data cleansing, validation, transformation, and summarization activities on user behavioral data.
  • Used the Spark SQL and DataFrame APIs extensively to build Spark applications.
  • Created custom FTP adaptors to pull sensor data from FTP servers to HDFS directly using the HDFS File System API.
  • Migration of code to QA and PROD, which includes UNIX shell scripts, source files, and Informatica mapping and parameter files.
  • Developed a scalable distributed data solution using Hadoop on a 30-node cluster using AWS cloud to run analysis on 25+ Terabytes of customer usage data.
  • Experience in dealing with Windows Azure IaaS - Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, Express Route, Traffic Manager, VPN, Load Balancing, Application Gateways, Auto-Scaling.
  • Expertise in Confidential Azure Cloud Services (PaaS & IaaS).
  • Expertise in Azure infrastructure management
  • Experience in managing Azure Storage Accounts.
  • Good at managing hosting plans for Azure infrastructure and implementing and deploying workloads on Azure virtual machines (VMs).
  • Experience with DevOps tools like Kubernetes.
  • Transformed and analyzed data using PySpark and Hive based on ETL mappings.
  • Working knowledge of Azure Fabric, microservices, IoT, and Docker containers in Azure.
  • Developed workflow jobs in Sqoop and Flume to extract/export data from IBM MQ and MySQL.
  • Worked extensively on building Hadoop Clusters.
  • Created Kafka streams to capture and broadcast data, and built live stream transformations to normalize data and store it in HDFS (a minimal sketch follows this list).
  • Implemented a real-time log analytics pipeline using Confluent Kafka, Storm, and Elasticsearch.
  • Built Azure environments by deploying Azure IaaS Virtual machines (VMs) and Cloud services (PaaS).
  • Developed multiple extract, transform, and load (ETL) functionalities with the Pentaho Data Integration tool.
  • Tableau experience for data analytics and visualization.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Stored the processed data by using low-level Java APIs to ingest data directly into HBase and HDFS.
  • Experience in writing Spark applications for Data validation, cleansing, transformations and custom aggregations.
  • Imported data from different sources into Spark RDD for processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Developed PySpark programs, created DataFrames, and worked on transformations.
  • Worked on installing cluster, commissioning & decommissioning of Data Node, Name Node high availability, capacity planning, and slots configuration.
  • Developed Spark applications for the entire batch processing by using Scala.
  • Utilized the Spark DataFrame and Spark SQL APIs extensively for all processing.
  • Experience in managing and reviewing Hadoop log files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark concepts.
  • Experience with Hive partitioning and bucketing, performing joins on Hive tables, and utilizing Hive SerDes such as RegEx, JSON, and Avro.
  • Exported the analyzed data to relational databases using Sqoop to generate reports for the BI team.
  • Executed tasks for upgrading cluster on the staging platform before doing it on production cluster.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters.
  • Installed and configured various components of Hadoop ecosystem.
  • Optimized Hive analytics SQL queries, created tables/views, wrote custom UDFs, and implemented Hive-based exception processing.
  • Involved in moving data from relational databases and legacy tables to HDFS and HBase tables using Sqoop, and vice versa.
  • Replaced the default Derby metastore for Hive with MySQL.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed.
  • Supported in setting up QA environment and updating configurations for implementing scripts with Pig.
  • Configured Fair Scheduler to provide fair resources to all the applications across the cluster.
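
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced in this list; the broker, topic, schema, and paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the cluster.

    # Minimal sketch (hypothetical names): consume a Kafka topic, normalize the JSON payload,
    # and write it to HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    event_schema = StructType([
        StructField("device_id", StringType()),
        StructField("reading", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical brokers
           .option("subscribe", "sensor-events")                 # hypothetical topic
           .load())

    # Kafka delivers the payload as bytes; cast to string and parse the JSON body.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(from_json(col("json"), event_schema).alias("e"))
                 .select("e.*"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/sensor_events/")              # hypothetical HDFS path
             .option("checkpointLocation", "hdfs:///checkpoints/sensor_events/")
             .outputMode("append")
             .start())

    query.awaitTermination()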

Environment: Cloudera 5.13, Cloudera Manager, Ambari, Hortonworks, AWS S3, Hue, Spark, Kafka, HBase, HDFS, Hive, Pig, Sqoop, PySpark, MapReduce, DataStax, IBM DataStage 8.1 (Designer, Director, Administrator), flat files, Oracle 11g/10g, Windows NT, UNIX shell scripting, PDI (Pentaho Data Integration), Microsoft SQL Server Management Studio, GIT, JIRA.

Confidential, Camp Hill PA

Hadoop Developer

Responsibilities:

  • Developed simple to complex MapReduce jobs in Java for processing and validating data (an illustrative sketch follows this list).
  • Developed a data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze operational data.
  • Developed MapReduce and Spark jobs to summarize and transform raw data.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Real-time streaming of data using Spark with Kafka.
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elasticsearch.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
  • Exported the analysed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
  • Analysed the data by performing Hive queries (Hive QL) and running Pig scripts (Pig Latin) to study customer behaviour.
  • Used Hive to analyse the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
  • Created HBase tables and column families to store the user event data.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
  • Used Impala to read, write and query the Hadoop data in Hive.
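
The MapReduce validation jobs in this list were written in Java; purely as an illustrative equivalent, here is a minimal Hadoop Streaming version in Python of a record-validation job (the field delimiter and expected record width are hypothetical).

    # mapper.py -- emit "valid"/"invalid" per record based on a simple field-count check.
    import sys

    EXPECTED_FIELDS = 12  # hypothetical record width

    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")  # hypothetical delimiter
        status = "valid" if len(fields) == EXPECTED_FIELDS else "invalid"
        print(f"{status}\t1")

    # reducer.py -- sum the counts per key emitted by the mapper (input arrives sorted by key).
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

    # Submitted with Hadoop Streaming, for example (jar path varies by distribution):
    #   hadoop jar hadoop-streaming.jar -input /data/raw -output /data/validated \
    #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py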

Environment: Hortonworks, Hadoop, Ambari, HDFS, HBase, Pig, Hive, MapReduce, Sqoop, Flume, ETL, REST, Java, Scala, PL/SQL, Oracle 11g, Unix/Linux, GIT, JIRA.

Confidential

Java Developer

Responsibilities:

  • Responsible for coordinating on-site and offshore development teams in various phases of the project.
  • Involved in developing dynamic JSPs and performing page validations using JavaScript.
  • Involved in database schema design and review meetings.
  • Designed a nightly build process for updating the catalogue and intimating the user of the pending authorization.
  • Used automated test scripts and tools to test the application in various phases.
  • Coordinated with Quality Control teams to fix issues that were identified.
  • Work involved extensive usage of HTML, CSS, JavaScript and Ajax for client-side development and validations.
  • Used parsers for the conversion of XML files to java objects and vice versa.
  • Developed screens using XML documents and XSL.
  • Developed Client programs for consuming the Web services published by the Country Defaults Department which keeps in track of the information regarding life span, inflation rates, retirement age, etc. using Apache Axis.
  • Developed Java beans and JSPs using Spring and JSTL tag libraries for supplements.
  • Developed EJBs, servlets, and JSP files to implement business rules and security options using IBM WebSphere.
  • Involved in creating tables, stored procedures in SQL for data manipulation and retrieval using SQL Server, Oracle and DB2.

Environment: Java/J2EE, HTML, Ajax, Servlets, JSP, SQL, JavaScript, CSS, XML, SOAP, Windows, Unix, Tomcat Server, Spring MVC, Hibernate, JDBC, Agile, Git, SVN.
