Data Engineer Resume
Birmingham, AL
SUMMARY
- 9+ years of experience interpreting and analyzing complex datasets and providing business insights using the Hadoop ecosystem, with expertise in data mining, data acquisition and data validation.
- Good understanding of Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies including Waterfall and Agile.
- Professional experience in Big Data and the Hadoop ecosystem, along with data analysis, data mining, data acquisition and data validation in the Banking, Insurance, Supply Chain, Investment and Healthcare industries.
- Expertise in using Python and SQL for Data Engineering and Data Modeling.
- Experience in building reliable ETL processes and data pipelines for batch and real-time streaming using SQL, Python, Spark, Spark Streaming, Databricks, Sqoop, Hive, AWS, Azure, NiFi, Luigi, Oozie and Kafka.
- Responsible for designing and building new data models and schemas using Python and SQL.
- Experience working with NoSQL database technologies, like Cassandra, MongoDB and HBase.
- Experience using Snowflake, Hadoop distributions such as Cloudera and Hortonworks, Amazon AWS (EC2, EMR, RDS, Redshift, DynamoDB, Snowball) and Databricks (Data Factory, notebooks, etc.).
- Built Spark jobs using PySpark to perform ETL on data in an S3 data lake (see the sketch at the end of this summary).
- Experience in creating reports and dashboards in visualization tools like Tableau, Spotfire and PowerBI.
- Experience with Software development tools and platforms like Jenkins, CI/CD, GIT, JIRA.
- Strong working experience with ingestion, storage, processing and analysis of Big Data.
- Successfully loaded files to HDFS from Oracle, SQL Server and Teradata using Sqoop.
- Worked with Sqoop to import and export data between HDFS/Hive and databases such as MySQL and Oracle.
- Experience with cloud configuration in Amazon Web Services (AWS).
- Demonstrated experience in delivering data and analytics solutions leveraging AWS, Azure or similar cloud data lakes.
- Experience working with structured and unstructured data in various file formats such as Avro, XML, JSON, sequence files, ORC and Parquet.
- Experience with the Oozie workflow engine to automate and parallelize Hadoop MapReduce and Pig jobs.
- Designed and implemented end-to-end data pipelines to extract, cleanse, process and analyze huge amounts of behavioral data and log data.
- Good experience working with various data analytics and big data services in AWS Cloud like EMR, Redshift, S3, Athena, Glue etc.,
- Experienced in developing production-ready Spark applications using Spark RDD APIs, DataFrames, Spark SQL and Spark Streaming APIs.
- Worked extensively on fine-tuning Spark applications to improve performance and troubleshooting failures in Spark applications.
- Strong experience in using Spark Streaming, Spark SQL and other Spark components such as accumulators, broadcast variables, different levels of caching and optimization techniques for Spark jobs.
- Proficient in importing/exporting data from RDBMS to HDFS using Sqoop.
- Used Hive extensively to perform various data analytics required by business teams.
- Solid experience working with various data formats such as Parquet, ORC, Avro and JSON.
- Experience automating end-to-end data pipelines with strong resilience and recoverability.
- Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing.
- Expertise in developing Scala and Python applications, with good working knowledge of Java.
- Skilled in Tableau Desktop and MicroStrategy for data visualization, Reporting and Analysis.
- Experience in working with databases such as Oracle, SQL Server and MySQL.
- Extensive experience with ETL and Query tools for Big Data like Pig Latin and HiveQL.
- Extensive experience in ETL process consisting of data transformation, data sourcing, mapping, conversion and loading.
- Proficient in creating schema objects such as attributes, facts, hierarchies and transformations, and in building application objects on them using MicroStrategy Desktop.
- Experienced in working with Azure cloud services (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Storage Explorer).
- Experience in developing data pipelines using Kafka, Spark and Hive to ingest, transform and analyze data.
- Experience in Data modeling and connecting Cassandra from Spark and saving summarized data frame to Cassandra.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs and Spark on YARN.
- Developed applications using Scala, Spark SQL and MLlib along with Kafka and other tools as required, and deployed them on the YARN cluster.
- Adequate knowledge and working experience in Agile & Waterfall methodologies.
- Developed and maintained web applications using the Tomcat and IBM WebSphere web servers.
- Experience in job workflow scheduling and monitoring tools like Oozie and NiFi.
- Experience in front-end technologies like HTML, CSS, HTML5, CSS3 and AJAX.
- Experience in building high performance and scalable solutions using various Hadoop ecosystem tools like Pig, Hive, Sqoop, Spark, Solr and Kafka.
- Extensive experience in working with Oracle, MS SQL Server, DB2, MySQL RDBMS databases.
- Well versed and hands-on experience with version control tools like GIT, CVS and SVN.
- Expert in implementing advanced procedures like text analytics and processing using Apache Spark written in Scala.
- Good knowledge in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Responsible for deploying scripts to a GitHub version control repository and deploying the code using Jenkins.
- Thorough knowledge of the systems development lifecycle and the systems engineering role throughout the lifecycle.
- Familiarity with the DB2 LUW DB2Readlog API.
- Exceptional ability to work in an interdisciplinary team of information technology professionals.
- Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Spark SQL in Scala.
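As context for the PySpark and S3 data lake bullets above, the following is a minimal sketch of such an ETL job; the bucket names, paths and column names are hypothetical placeholders, and the cluster is assumed to have the hadoop-aws/s3a connector and credentials configured.

    # Minimal PySpark ETL sketch for an S3 data lake (illustrative paths/columns).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

    # Read raw JSON events from a hypothetical landing bucket.
    raw = spark.read.json("s3a://example-raw-bucket/events/")

    # Basic cleansing: drop duplicates and null keys, derive a partition column.
    cleaned = (raw.dropDuplicates(["event_id"])
                  .filter(F.col("event_id").isNotNull())
                  .withColumn("event_date", F.to_date("event_ts")))

    # Write curated data back to the lake as partitioned Parquet.
    (cleaned.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet("s3a://example-curated-bucket/events/"))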
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Spark, Kafka, Storm and Zookeeper
Languages: C, Java, Python, Scala, J2EE, PL/SQL, Pig Latin, HiveQL, Unix shell scripts
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL, RMI, JMS, Java Script, JSP, Servlets, EJB, JSF, JQuery
Frameworks: MVC, Struts, Spring, Hibernate
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud: AWS, Azure
Operating Systems: HP-UNIX, RedHat Linux, Ubuntu Linux and Windows XP/Vista/7/8
Web Technologies: HTML, DHTML, XML, AJAX, WSDL, Bootstrap, JSON
Web servers: Apache Tomcat, WebLogic, JBoss
Databases: Oracle, DB2, SQL Server, MySQL, Teradata
Tools and IDEs: Eclipse, NetBeans, Toad, Maven, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DB Visualizer
Version control: SVN, CVS, GIT
SDLC Methodology: Agile, Waterfall
Data Warehouse: Amazon Redshift, Microsoft Azure, Snowflake, PostgreSQL, Teradata
Data visualization: Tableau, Spotfire, Power BI
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential - Birmingham, AL
Responsibilities:
- Work closely with multiple teams to gather requirements and maintain relationships with those that are heavy users of data for analytics.
- Designed ETL architecture to load raw data from different sources in different formats, perform preprocessing such as filtering, deduplication and transformation, and store the results in the Hadoop cluster.
- Developed a PySpark framework implementing the ETL architecture to ingest raw data and store structured data in the Hadoop cluster.
- Used PySpark functions and Spark SQL DataFrames to increase performance by writing user-defined functions (UDFs); see the sketch after this list.
- Stored and retrieved data from data-warehouses using Amazon Redshift.
- Mapped column-based data for transformation according to business requirements and stored it in Parquet format.
- Developed unit test cases in Python covering all possible scenarios to avoid errors across the end-to-end pipeline.
- Experienced with AWS, where the cluster was built on EC2 instances, data was stored in S3 and Athena serverless query services were used.
- Built CI/CD pipelines using Azure Data Factory to pull source code from the Git repo, build and release the application, and push to Azure Artifacts.
- Pulled data from mainframe DB2 and ADABAS using Apache Sqoop and stored it in AWS S3 to create the data lake for Apache Spark, Athena and Redshift.
- Performed database migration from Oracle to Azure Data Lake Store (ADLS) using Azure Data Factory.
- Wrote Terraform scripts for creating resources, app services and storage accounts on the Azure cloud.
- Built a CI pipeline using Terraform scripts for building Azure Data Factory pipelines.
- Worked on an Amazon Redshift and AWS solution to load data, create data models and run BI on it.
- Started using Apache NiFi to copy data from the local file system to HDFS.
- Automated the process of running queries on a daily basis using Hive and Spark from the data stored in HDFS after executing the ETL process.
- Performed data mapping between source and target systems, logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data.
- Implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files using Spark.
- Created OLAP cubes by getting data from various Data Sources like SQL Server, Flat Files and deployed on the Dev Environment.
- Involved in developing Python scripts and using SSIS, Informatica and other ETL tools for extraction, transformation and loading of data into the data warehouse.
- Designed a prototype using Real-Time Data Integration on Informatica Cloud.
- Applied data warehousing solutions while working with database technologies such as Snowflake and Teradata.
- Worked on connecting Snowflake and Cassandra databases to the Amazon EMR file system for storage in S3.
- Worked extensively on troubleshooting MicroStrategy Web reports and optimizing the SQL using VLDB properties.
- Created MicroStrategy interactive dashboards/reports (including supported schema and application objects), Scheduled demos to end users and collected feedback and provided enhancements in the next sprint.
- Ingested user behavioral data from external servers such as FTP servers and S3 buckets on a daily basis using custom input adapters.
- Read a variety of databases from Azure Databricks over JDBC connections using Scala, Python and PySpark and saved the data in ADL.
- Successfully ingested batch files and tables from supply chain warehouse databases (OMS, TMS, WMS, PKMS, YARD VIEW, STELLA, BOLD360), mostly in Oracle and DB2; converted them to Parquet and Delta files, then transformed and loaded them into Azure Data Lake (ADL) from Azure Databricks using Scala and PySpark, supplying the server/host details, user name, password and JDBC driver JAR for each source database.
- Created Sqoop scripts to import/export user profile data from RDBMS to S3 Data Lake.
- Developed various Spark applications using Scala to perform enrichments of user behavioral (clickstream) data merged with user profile data.
- Involved in data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for downstream model learning and reporting.
- Utilized the Spark Scala API to implement batch processing of jobs.
- Troubleshooting Spark applications for improved error tolerance.
- Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines.
- Developed workbooks and worksheets in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly with quick filters and parameters for on-demand information to make business decisions.
- Created scalable Tableau reports, visualizations and dashboards for the Marketing and Finance teams.
- Created a Kafka producer API to send live-stream data into various Kafka topics.
- Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.
- Utilized Spark in Memory capabilities, to handle large datasets.
- Used broadcast variables in Spark, effective and efficient joins, transformations and other capabilities for data processing.
- Experienced in working with EMR cluster and S3 in AWS cloud.
- Creating Hive tables, loading and analyzing data using hive scripts. Implemented Partitioning, Dynamic Partitions, Buckets in Hive.
- Evaluated business requirements and prepared detailed design documents following project guidelines and SLAs, which required procuring data from all the upstream data sources and developing programs.
- Data files are retrieved via various data transmission protocols such as Sqoop, NDM, SFTP and DMS; these files are then validated by various Spark control jobs written in Scala.
- Spark RDDs are created for all the data files and then transformed into cash-only transaction RDDs.
- The filtered cash-only RDDs are aggregated and curated based on the business rules and CTR requirements, converted into DataFrames, and saved as temporary Hive tables for intermediate processing.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Python and in NoSQL databases such as HBase and Cassandra.
- Implemented real-time analytics on Cassandra data using the Thrift API.
- Designed column families in Cassandra, ingested data from RDBMS, performed transformations and exported the data to Cassandra.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database and SQL Data Warehouse environment; experience in DWH/BI project implementation using Azure Data Factory.
- Developed ETL jobs and pipelines using tools such as IBM DataStage, Apache NiFi, Azure Data Factory and Databricks.
- Utilized Python/PySpark in Databricks notebooks when creating ETL pipelines in Azure Data Factory.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- The RDDs and DataFrames undergo various transformations and actions and are stored in HDFS as Parquet files and in HBase for auto-generating CTRs.
- Experience in custom transformation process design via Microsoft Azure, Snowflake, PostgreSQL and Teradata using Data Factory and automation pipelines.
- Extensively used Azure services such as Azure Data Factory and Logic Apps for ETL, to move data between databases, Blob Storage, HDInsight HDFS and Hive tables.
- Developed Spark scripts by using Scala and Python shell commands as per the requirement.
- Maintained and administered HDFS through the Hadoop Java API, shell scripting and Python.
- Worked on Python scripts to analyze customer data.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
- Developed monitoring and notification tools using Python.
- Wrote Python routines to log into the websites and fetch data for selected options.
- Used Collections in Python for manipulating and looping through different user defined objects.
- Wrote and tested Python scripts to create new data files for Linux server configuration using a Python template tool.
- Wrote shell scripts to automate the jobs in UNIX.
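The UDF bullet above can be illustrated with a minimal PySpark sketch; the DataFrame, column names and masking rule are hypothetical, not taken from the actual project.

    # Minimal sketch: defining and applying a PySpark UDF on a Spark SQL DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("4111111111111111", 120.50), ("5500000000000004", 75.00)],
        ["card_number", "amount"],
    )

    # Python UDF that keeps only the last four digits of a card number.
    def mask_card(card):
        return "*" * (len(card) - 4) + card[-4:] if card else None

    mask_card_udf = udf(mask_card, StringType())

    masked = df.withColumn("card_masked", mask_card_udf("card_number"))
    masked.show(truncate=False)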
Environment: Cloudera Distribution Hadoop, Tableau, Spark, HDFS, Python, Hive, HBase, HiveQL, Sqoop, Java, Scala, Unix, IntelliJ, Azure, Autosys, Maven
Hadoop Developer / Big Data
Confidential - Charlotte, NC
Responsibilities:
- Imported and exported data between databases and HDFS using Sqoop.
- Created Data Lake as a Data Management Platform for Hadoop.
- Using Amazon Web Services (AWS) for storage and processing of data in cloud.
- Used Talend and DMX-h to extract data from other sources into HDFS and transform the data.
- Integrating data from various vendors into Hadoop, data processing and creating Hive tables in Hadoop.
- Developed connections for the Tableau application to core and peripheral data sources such as flat files, Microsoft Excel, Tableau Server, Amazon Redshift, Microsoft SQL Server, Google Analytics, Power BI, etc. to analyze complex data.
- Implemented Spark using Python and Spark SQL for faster processing and testing of the data (see the sketch after this list).
- Worked on AWS S3 and Redshift for storage of data collected from web scraping and API using python to analyze and predict the impact of financial news on stock and business.
- Worked on Kafka messaging platform for real-time transactions and streaming of data from APIs and databases to Reporting tools for analysis.
- Experience in production delivery using Big data technologies like Kafka, Hadoop, Hive, and HBase.
- Design and development of SQL Server T-SQL scripts and functions for data import, data export, data conversion, and data cleansing.
- Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
- Load the data into Spark RDD and perform in-memory data computation to generate the output as per the requirements.
- Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to staging areas and finally transfer into data marts and Created and deployed parameterized reports using SSRS and Tableau from the Data marts.
- Used DDL Triggers, Stored procedures to check the Data Integrity and verification at early stages before calling them.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
- Worked on performance tuning of Spark application to improve performance.
- Performed real-time streaming of data using Spark with Kafka; responsible for handling streaming data from web server console logs.
- Worked on different file formats like Text, Sequence files, Avro, Parquet, JSON, XML files and Flat files using Map Reduce Programs.
- Developed daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
- Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
- Analyzed the SQL scripts and designed the solution to implement using Spark.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
- Work with cross functional consulting teams within the data science and analytics team to design, develop and execute solutions to derive business insights and solve client's operational and strategic problems.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used HiveQL queries to query data in Hive tables and loaded data into HBase.
- Started using Apache NiFi to copy data from the local file system to HDP.
- Involved in creating workflows to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Using Apache Kafka for Streaming purpose.
- Involved in developing shell scripts and automating data management for end-to-end integration work.
- Developed a predictive analytics product using Apache Spark and SQL/HiveQL.
- Moving data in and out to Hadoop File System Using Talend Big Data Components.
- Developed MapReduce programs for parsing data and loading it into HDFS.
- Built reusable Hive UDF libraries for business requirements, which enabled users to use these UDFs in Hive querying.
- Automating and scheduling the Sqoop jobs in a timely manner using Unix Shell Scripts.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Pig.
- Working with JSON, XML file formats.
- Used HBase and NoSQL databases to store the majority of the data, which needs to be divided based on region.
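As a minimal sketch of the Spark SQL processing referenced above (the file path, schema and query are hypothetical placeholders):

    # Minimal sketch: loading a flat file into a DataFrame and querying it with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

    # Read a delimited flat file with a header row (hypothetical path).
    txns = (spark.read
                 .option("header", "true")
                 .option("inferSchema", "true")
                 .csv("/data/landing/transactions.csv"))

    # Register a temporary view so the data can be queried with plain SQL.
    txns.createOrReplaceTempView("transactions")

    daily_totals = spark.sql("""
        SELECT txn_date, account_id, SUM(amount) AS total_amount
        FROM transactions
        GROUP BY txn_date, account_id
    """)

    # Persist the aggregate for downstream reporting.
    daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")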
Environment: Hadoop, MapReduce, Talend, Hive QL, Oracle, Cloudera, HDFS, HIVE, HBase, Java, Tableau, PIG, Sqoop, UNIX, Spark, Scala, JSON, AWS.
Hadoop Developer/ Spark
Confidential - St. Louis, MO
Responsibilities:
- Worked on the large-scale Hadoop Yarn cluster for distributed data processing analyzing using Spark, Hive, and HBase.
- Responsible for data migration and handling large datasets using Partitions, Spark in-memory capabilities, broadcasts in Spark, efficient joins, Transformations during ingestion process itself.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Datasets, RDDs and Spark on YARN.
- Consumed data from Kafka topics using Spark Streaming and also built Spark applications for batch data to feed downstream cyber security applications.
- Developed a PySpark API for CDC data using Spark Streaming libraries for processing the data.
- Created PySpark Flinger jobs to make data available to the MDH web UI and to produce data quality reports for BI applications like Tableau and MicroStrategy.
- Responsible to Test and Debug the existing code to enhance performance by using the latest libraries in Apache Spark and Kafka.
- Worked with Hive on Spark to create tables, load TBs of historical data and create a NiFi ETL process for daily updates from BAC S3.
- Responsible for performance tuning of the PySpark CDC application: setting the right batch interval, the correct level of parallelism and memory tuning, persisting into Hive on S3.
- Responsible for promoting the code through the CI/CD process across development, test and production environments on schedule; provided follow-up production support when needed.
- Responsible for the entire Presto cluster setup on 200+ nodes in the BAC cloud, including the disaster recovery plan and high availability, and created CDP catalogs with Apache Ranger.
- Responsible for maintaining the on-premises database management systems on MySQL, Netezza, DB2 and Teradata.
- Write Presto SQL queries for data analysis to meet the business requirements and writing Hive scripts to extract, transform, validate and load the data into other databases for BI applications like Tableau and MicroStrategy for reporting.
- Responsible for building a data pipeline using NiFi, Kafka, Spark and Hive to ingest, transform and analyze behavioral data to identify individual risk rates from the Insider Threat Monitoring Tool.
- Involved in creating Data Lake by extracting customer's data from various data sources to HDFS which include data from csv, databases, and log data from servers.
- Configured Spark Streaming to get ongoing information from Kafka and store the streamed data to HDFS (see the sketch after this list).
- Used various Spark transformations and actions for cleansing the input data.
- Developed shell scripts to generate the hive create statements from the data and load the data into the table.
- Wrote Map Reduce jobs using Java API and Pig Latin.
- Loaded data into Hive using Sqoop and used Hive QL to analyze the partitioned and bucketed data, executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business specification logic.
- Developed Spark applications by using Scala and Python and implemented Apache Spark for data processing from various streaming sources.
- Imported data from SQL Server to HDFS using Python based on the Sqoop framework.
- Exported data from HDFS to MySQL using Python based on the HAWQ framework.
- Developed Java applications that parse the mainframe report into CSV files; another application compares the data from SQL Server against the mainframe report and generates a rip file.
- Documented the technical design as well as the production support document.
- Involved in creating workflow for the Tidal (Workflow coordinator for Waddell & Reed).
- Created Hive external table with partitions and bucketing to load incremental data coming from SQL server.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Created Hive tables, loaded them with data and wrote Hive queries that run internally in MapReduce.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs.
- Involved in designing and developing non-trivial ETL processes within Hadoop using tools like Pig, Sqoop, Flume and Oozie.
- Used DML statements to perform different operations on Hive tables.
- Developed Hive queries for creating foundation tables from stage data.
- Used Pig as an ETL tool to do transformations, event joins, filtering and some pre-aggregations.
- Developed Hive custom Java UDF's for further transformations on data.
- Performed Hive performance tuning at all phases.
- Involved in modifying existing Sqoop and HAWQ frameworks to read a JSON property file and perform load/unload operations from HDFS to MySQL.
- Developed Pig scripts to validate the count of data between Sqoop and Hawq loads.
- Developed Java MapReduce custom counters to track the records processed by the MapReduce job.
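A minimal Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker address, topic and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

    # Minimal sketch: consuming a Kafka topic with Spark Structured Streaming
    # and persisting the stream to HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker1:9092")
                   .option("subscribe", "events")
                   .option("startingOffsets", "latest")
                   .load())

    # Kafka delivers key/value as binary; cast the value to string for downstream parsing.
    events = stream.select(F.col("value").cast("string").alias("payload"),
                           F.col("timestamp"))

    query = (events.writeStream
                   .format("parquet")
                   .option("path", "hdfs:///data/streams/events")
                   .option("checkpointLocation", "hdfs:///checkpoints/events")
                   .outputMode("append")
                   .start())

    query.awaitTermination()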
Environment: Hadoop 2.6, Spark, Scala, Hive, Pig, MapReduce, Pivotal HD 3.0, HAWQ, MySQL 8.4, SQL Server 2014/2012, Java, Sqoop, Python, Tidal, Spring XD, JSON.
Hadoop Developer
Confidential - San Rafael, CA
Responsibilities:
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and Pair RDDs.
- Developed Spark scripts by using Java, and Python shell commands as per the requirement.
- Involved in ingesting data received from various relational database providers onto HDFS for analysis and other big data operations (see the JDBC sketch after this list).
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using SQLContext.
- Performed analysis on implementing Spark using Scala.
- Responsible for creating, modifying topics (Kafka Queues) as and when required with varying configurations involving replication factors and partitions.
- Extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
- Created and imported various collections and documents into MongoDB and performed various actions like query, project, aggregation, sort and limit.
- Experience with creating script for data modelling and data import and export. Extensive experience in deploying, managing and developing MongoDB clusters.
- Experience in migrating HiveQL into Impala to minimize query response time.
- Creating Hive tables to import large data sets from various relational databases using Sqoop and export the analyzed data back for visualization and report generation by the BI team.
- Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and Map Reduce) and move the data inside and outside of HDFS.
- Collected data from various Flume agents deployed on various servers using a multi-hop flow.
- Used Flume to collect log data from different sources and transfer the data to Hive tables using different SerDes, storing it in JSON, XML and sequence file formats.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries and writing data back into the OLTP system through Sqoop.
- Developed Pig scripts, Pig UDFs, Hive scripts and Hive UDFs to analyze HDFS data.
- Maintained the cluster securely using Kerberos and kept the cluster up and running at all times.
- Implemented optimization and performance testing and tuning of Hive and Pig.
- Developed a data pipeline using Kafka to store data into HDFS.
- Worked on reading multiple data formats on HDFS using Scala.
- Wrote shell scripts and Python scripts for job automation.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files, XML files, and Databases.
- Supported various reporting teams and experience with data visualization tool Tableau.
- Implemented data quality in the ETL tool Talend and have good knowledge of data warehousing and ETL tools like IBM DataStage, Informatica and Talend.
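A minimal sketch of the JDBC ingestion pattern referenced above; the JDBC URL, credentials, table and output path are hypothetical, and the matching JDBC driver JAR is assumed to be on the Spark classpath.

    # Minimal sketch: pulling a table from a relational source over JDBC with PySpark
    # and landing it on HDFS as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-ingest-sketch").getOrCreate()

    customers = (spark.read
                      .format("jdbc")
                      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")
                      .option("dbtable", "SALES.CUSTOMERS")
                      .option("user", "etl_user")
                      .option("password", "REPLACE_ME")
                      .option("fetchsize", "10000")
                      .load())

    # Land the snapshot on HDFS for downstream Hive/Spark analysis.
    customers.write.mode("overwrite").parquet("hdfs:///data/raw/sales/customers")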
Environment: Cloudera, HDFS, Hive, HQL scripts, Map Reduce, Java, HBase, Pig, Sqoop, Kafka, Impala, Shell Scripts, Python Scripts, Spark, Scala, Oozie.
Hadoop Developer
Confidential
Responsibilities:
- Experience in Importing and exporting data into HDFS and Hive using Sqoop.
- Developed Flume Agents for loading and filtering the streaming data into HDFS.
- Experienced in handling data from different data sets, join them and pre-process using Pig join operations.
- Moved bulk data into HBase using MapReduce integration.
- Developed MapReduce programs to clean and aggregate the data.
- Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
- Developed different kind of custom filters and handled predefined filters on HBase data using API.
- Strong understanding of Hadoop eco system such as HDFS, Map Reduce, HBase, Zookeeper, Pig, Hadoop streaming, Sqoop, Oozie and Hive.
- Implemented counters on HBase data to count total records in different tables.
- Experienced in handling Avro data files by passing schema into HDFS using Avro tools and Map Reduce.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV, etc.
- Implemented secondary sorting to sort reducer output globally in Map Reduce.
- Implemented a data pipeline by chaining multiple mappers using ChainMapper.
- Created Hive dynamic partitions to load time-series data (see the sketch after this list).
- Experience with the CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
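A minimal sketch of loading time-series data into a Hive table with dynamic partitions from PySpark; the table and column names are hypothetical, and Hive support is assumed to be enabled for the Spark session.

    # Minimal sketch: Hive dynamic partitioning driven from PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partition-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow non-strict dynamic partitioning for this session.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS metrics_daily (
            sensor_id STRING,
            reading   DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
    """)

    # Stage some sample time-series rows; the partition column must come last.
    staging = spark.createDataFrame(
        [("s1", 20.5, "2020-01-01"), ("s2", 21.3, "2020-01-02")],
        ["sensor_id", "reading", "event_date"],
    )
    staging.createOrReplaceTempView("metrics_staging")

    # Partitions are derived dynamically from the event_date values.
    spark.sql("""
        INSERT OVERWRITE TABLE metrics_daily PARTITION (event_date)
        SELECT sensor_id, reading, event_date
        FROM metrics_staging
    """)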
Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, CSV, Avro data files.