
Sr. PySpark Developer Resume


PROFESSIONAL SUMMARY:

  • Over 8 years of IT experience in software development, Big Data technologies, and analytical solutions, including 1+ years of hands-on experience in the development and design of Java and related frameworks and 2+ years' experience in design, architecture, and data modeling as a database developer.
  • Over 4 years' experience as a Hadoop Developer with good knowledge of the Hadoop framework, the Hadoop Distributed File System, parallel processing, and Hadoop ecosystem components: HDFS, MapReduce, Hive, Pig, Python, HBase, Sqoop, Hue, Oozie, Impala, and Spark.
  • Built and deployed industrial-scale data lakes on both on-premise and cloud platforms.
  • Excellent understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Experienced in handling different file formats such as text files, Avro data files, sequence files, XML, and JSON files.
  • Extensively worked on Spark Core, numeric RDDs, pair RDDs, DataFrames, and caching for developing Spark applications.
  • Expertise in deploying Hadoop and YARN and integrating Spark with Cassandra.
  • Experience and expertise in ETL, data analysis, and designing data warehouse strategies.
  • Good knowledge of Apache NiFi for automating data movement between different Hadoop systems.
  • Upgraded Cloudera CDH clusters to 5.x and worked with Hortonworks. Installed, upgraded, and maintained Cloudera Hadoop software, Cloudera clusters, and Cloudera Navigator.
  • Good exposure to HBase, a column-oriented NoSQL database.
  • Extensive experience writing custom MapReduce programs for data processing and UDFs for both Hive and Pig in Java. Extensively worked on the MRv1 and MRv2 Hadoop architectures.
  • Strong experience analyzing large data sets by writing PySpark scripts and Hive queries.
  • Experience using Python packages such as xlrd, NumPy, pandas, SciPy, and scikit-learn, and IDEs such as Spyder, Anaconda, Jupyter, and IPython.
  • Extensive experience working with structured data using HiveQL, join operations, and custom UDFs, and experienced in optimizing Hive queries.
  • Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
  • Experience importing and exporting data with Sqoop between HDFS and relational databases.
  • Experience with Apache Flume for collecting, aggregating, and moving large volumes of data from sources such as web servers and telnet sources.
  • Experience with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce, Hive, Spark jobs.
  • Involved in moving all log files generated from various sources to HDFS and Spark for further processing.
  • Excellent understanding and knowledge of NoSQL databases like MongoDB, HBase, and Cassandra.
  • Experience in implementing Kerberos authentication protocol in Hadoop for data security.
  • Experience in dimensional, logical, and physical data modeling.
  • Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
  • Experience testing MapReduce programs using MRUnit, JUnit, Ant, and Maven.
  • Experienced in working with scheduling tools such as UC4, Cisco Tidal enterprise scheduler, or Autosys.
  • Adequate knowledge and working experience in Agile & Waterfall methodologies.
  • Great team player and quick learner with effective communication, motivation, and organizational skills, combined with attention to detail and a focus on business improvement.

TECHNICAL SKILLS

Hadoop Ecosystem: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Sqoop, ZooKeeper, Flume, Impala, Hue, Oozie, Cloudera Manager, Accumulo, Spark, and MRUnit.

Analytics Software: R, SAS, MATLAB

NoSQL: MongoDB, Cassandra

Databases: MS SQL Server 2000/2005/2008/2012, MySQL, Oracle 9i/10g, MS Access, Teradata V2R5

Languages: Java 8, Java (JDK 1.4/1.5/1.6), C/C++, SQL, Teradata SQL, PL/SQL.

Operating Systems: Windows Server 2000/2003/2008, Windows XP/Vista, Mac OS, UNIX, Linux

Java Technologies: Servlets, JavaBeans, JDBC, JNDI, JTA, JPA, EJB

Frameworks: Jakarta Struts 1.1, JUnit, JTest, LDAP, Scalatra, CXF, Sinatra, Spray

IDEs & Utilities: Eclipse, Maven, NetBeans.

SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export & Import (DTS).

Web Technologies: ASP.NET, HTML, XML

Testing & Case Tools: Bugzilla, QuickTest Pro (QTP) 9.2, Selenium, Quality Center, TestLink, JUnit, Log4j, Rational ClearCase, Ant.

Business Intelligence Tools: Tableau, Pentaho, QlikView, MicroStrategy, BusinessObjects

ETL Tools: Informatica, InfoSphere, Talend

Methodologies: Agile, UML, Design Patterns

PROFESSIONAL EXPERIENCE

Confidential

Sr. PySpark Developer

Role & Responsibilities:

  • Gained experience working with the MapR distribution, which is well suited to healthcare domains.
  • Performed gap analysis and worked collaboratively with the Configuration, Claims, and Members teams on HEDIS measurements to build an automation framework in Python that automates report generation and integrates with MapReduce.
  • Loaded data into HBase using bulk loads and the HBase API. Created HBase tables and used various HBase filters to store variable data formats coming from different portfolios.
  • Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation in PySpark.
  • Developed a Python metaprogram applying business rules to the data that automatically creates, spawns, monitors, and then terminates customized programs as dictated by the events and problems detected within the data stream.
  • Developed Python code using pandas DataFrames to read data from Excel, process it, and write the processed data to a result file consumed by HBase and MapReduce jobs (see the pandas sketch after this list).
  • Developed Spark code using Scala and Spark SQL for batch processing of data. Utilized the in-memory processing capability of Apache Spark through Spark SQL and Spark Streaming, using PySpark and Scala scripts.
  • Created PySpark scripts to load data from source files into RDDs, create DataFrames from the RDDs, perform transformations and aggregations, and collect the output of the process (see the PySpark sketch after this list).
  • Used Apache Spark DataFrames/RDDs to apply business transformations and HiveContext objects to perform read/write operations.
  • Involved in performance tuning and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Implemented partitioning, dynamic partitions, and buckets in Hive and analyzed the partitioned and bucketed data to compute various metrics for reporting.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs. Used HCatalog to access Hive table metadata from MapReduce.
  • Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Spark, Hive, and Sqoop) as well as system-specific jobs (such as Python programs and shell scripts).
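Below is a minimal pandas sketch of the Excel-to-result-file step described above. The file, sheet, and column names are illustrative assumptions, not the actual project artifacts.

```python
import pandas as pd

# Read the source workbook (file, sheet, and column names are illustrative).
claims = pd.read_excel("hedis_measures.xlsx", sheet_name="claims")

# Apply a simple business rule, then aggregate per measure.
claims["is_compliant"] = claims["numerator"] >= claims["denominator"]
summary = claims.groupby("measure_id", as_index=False)["is_compliant"].mean()

# Write the processed result file consumed by downstream HBase/MapReduce jobs.
summary.to_csv("hedis_summary.csv", index=False)
```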
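And a minimal PySpark sketch of the load, transform, aggregate, and persist flow described above, assuming delimited source files and a Hive-enabled SparkSession (the newer equivalent of the HiveContext mentioned earlier). Paths, schemas, and table names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("claims-aggregation")
         .enableHiveSupport()
         .getOrCreate())

# Load raw pipe-delimited source files into an RDD, then build a DataFrame.
raw = spark.sparkContext.textFile("/data/raw/claims/*.txt")
rows = raw.map(lambda line: line.split("|")) \
          .map(lambda f: (f[0], f[1], float(f[2])))
claims = spark.createDataFrame(rows, ["member_id", "state", "amount"])

# Business transformation and aggregation.
totals = (claims.filter(F.col("amount") > 0)
                .groupBy("state")
                .agg(F.sum("amount").alias("total_amount")))

# Persist the result to a Hive table for downstream reporting.
totals.write.mode("overwrite").saveAsTable("reporting.claim_totals")
```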

Environment: Hadoop, MapR, HDFS, MapReduce, Hive, HBase, ZooKeeper, Oozie, Spark, Sqoop, Python, PySpark, Scala, pandas, NumPy.

Confidential, Houston, TX

Hadoop and Spark Developer

Role & Responsibilities:

  • Involved in the Connected Innovation and Big Data Analytics enterprise priority for global BP.
  • Designed and created an HDFS data lake by drawing relationships between different sources of data from various systems.
  • Developed a data lake architecture capable of consuming, processing, and storing logs from device sensors.
  • Established the data lake on the Hadoop platform to create an enterprise-grade ecosystem that enables analytics and global reporting, resulting in data-driven decision making, flexible management, financial reporting, and data governance across all business units.
  • Designed and deployed Hadoop clusters and various Big Data analytic tools including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Apache Spark, and Impala on the Hortonworks distribution.
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries and Pig scripts.
  • Created analytics and reports from the data using HiveQL.
  • Implemented various data import and export jobs into HDFS and Hive using Sqoop.
  • Used Apache NiFi to copy data from the local file system to HDFS.
  • Created and transformed RDDs and DataFrames using Spark.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS using Scala (see the streaming sketch after this list).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, PySpark, and Scala (see the Hive-to-Spark sketch after this list).
  • Analyzed the SQL scripts and designed the solution for implementation using PySpark. Knowledge of handling Hive queries using Spark SQL integrated with a Spark environment implemented in Scala.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using PySpark, DataFrames/SQL, and RDD/MapReduce in Spark 1.3 for data aggregation and queries, and wrote data back into the OLTP system directly or through Sqoop.
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Performed performance optimization on Spark/Scala; diagnosed and resolved performance issues.
  • Implemented the Fair Scheduler on the JobTracker with appropriate parameters to share cluster resources among users' MapReduce jobs. Involved in creating Hive tables, loading them with data, and writing Hive queries to analyze the data. Worked on tuning the performance of Pig queries.
  • Involved in loading data from the Linux file system to HDFS. Imported and exported data into HDFS and Hive using Sqoop. Imported streaming logs and aggregated the data into HDFS through Flume.
  • Experience processing unstructured data using Pig and Hive. Extensively used Pig for data cleansing and deduplication. Created partitioned tables in Hive. Managed and reviewed Hadoop log files. Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed bash scripts to pull Tlog files from the FTP server and process them for loading into Hive tables.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Created UDFs to calculate the pending payment for a given Residential or Small Business customer and used them in Pig and Hive scripts. Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Experience with Agile development processes and practices.
  • Working in agile, successfully completed stories related to ingestion, transformation and publication of data on time.
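A minimal sketch of the Kafka-to-HDFS streaming path described above. The project work was done in Scala with Spark Streaming; this PySpark version uses the newer Structured Streaming API for illustration and assumes the spark-sql-kafka connector is available. The broker, topic, and HDFS paths are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the sensor topic (broker and topic names are illustrative).
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "sensor-logs")
          .load())

# Keep the raw payload as a string and continuously append it to HDFS.
query = (stream.selectExpr("CAST(value AS STRING) AS payload")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streaming/sensor_logs")
         .option("checkpointLocation", "hdfs:///checkpoints/sensor_logs")
         .start())

query.awaitTermination()
```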
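And a sketch of converting a HiveQL query into equivalent Spark transformations, as described above; the database, table, and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# A Hive-style query executed through Spark SQL.
sql_result = spark.sql("""
    SELECT device_id, COUNT(*) AS readings
    FROM sensor_db.readings
    WHERE status = 'OK'
    GROUP BY device_id
""")

# The same logic rewritten as DataFrame transformations.
df_result = (spark.table("sensor_db.readings")
             .filter(F.col("status") == "OK")
             .groupBy("device_id")
             .agg(F.count("*").alias("readings")))
```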

Environment: Cloudera, Hive, MapReduce, Agile, Sqoop, NiFi, Flume, Oozie, Spark, Pig, Scala, Linux, Java 8, Python, PySpark, bash, UNIX Shell Scripting and Big Data

Confidential, Holmdel, NJ

Hadoop Developer

Role & Responsibilities:

  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Worked within and across Agile teams to design, develop, test and support technical solutions across a full-stack of development tools and technologies.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB, Cassandra.
  • Experience with Cassandra, with ability to drive the evaluation and potential implementation of it as a new platform.
  • Implemented analytical engines that pull data from API data sources and then present data back as either an API or persist it back into a NoSQL platform.
  • Involved in moving all log files generated from various sources to HDFS and Spark for further processing.
  • Involved in the requirement and design phases to implement a streaming Lambda Architecture for real-time streaming using Spark.
  • Implemented and released the first version of the IoT cloud platform, using Apache Storm, MQTT, HBase, MongoDB, and Java EE.
  • Implemented a distributed messaging queue to integrate with Cassandra using ZooKeeper.
  • Experienced in using the Avro data serialization system to handle Avro data files in MapReduce programs.
  • Designed, implemented, tested, and debugged ETL mappings and workflows.
  • Developed ETL routines to source data from client source systems and target the data warehouse.
  • Configured the ETL tool and ensured full and incremental loads ran successfully, applying change data capture (CDC) concepts (see the incremental-load sketch after this list).
  • Used Python and shell scripting to automate ETL jobs and tasks.
  • Developed the Product Catalog and Reporting datamart databases with supporting ETLs.
  • Implemented ETL processes for the warehouse and designed and implemented code for migrating data to the data lake using Spark.
  • Collected data from distributed sources into Avro models, applied transformations and standardizations, and loaded the data into Hive for further processing.
  • Built Platfora Hadoop multi-node cluster test labs using Hadoop distributions (CDH 4/5, Apache Hadoop, MapR, and Hortonworks), Hadoop ecosystem components, virtualization, and Amazon Web Services components.
  • Installed, Upgraded and Maintained Cloudera Hadoop-based software.
  • Experience with hardening Cloudera Clusters, Cloudera Navigator and Cloudera Search.
  • Managed running jobs, scheduled Hadoop jobs, configured the Fair Scheduler, and handled Impala query scheduling.
  • Extensively worked on Impala, comparing its processing time with Apache Hive for batch applications in order to adopt Impala in the project.
  • Extensively used Impala to read, write, and query Hadoop data in HDFS.
  • Developed workflows using custom MapReduce, Pig, Hive and Sqoop.
  • Built reusable Hive UDF libraries for business requirements, enabling users to apply these UDFs in Hive queries.
  • Performed troubleshooting and fixed and deployed many Python bug fixes for the two main applications that were the primary source of data for both customers and the internal customer service team.
  • Comprehensive knowledge in understanding different components of Spark framework.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Participated in the development and implementation of the Cloudera Hadoop environment.
  • Integrated Hadoop Security with Active Directory by implementing Kerberos for authentication and Sentry for authorization.
  • Used the Struts validation framework for form-level validation.
  • Wrote test cases in JUnit for unit testing of classes.
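A minimal PySpark sketch of the full/incremental-load idea with a change-data-capture style high-water mark, as referenced above. The control table, staging table, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("incremental-load")
         .enableHiveSupport()
         .getOrCreate())

# High-water mark from the previous run (control table/columns are illustrative).
last_loaded = (spark.table("etl_control.load_audit")
               .agg(F.max("loaded_through"))
               .collect()[0][0])

# Pull only the rows changed since the last load from the staged source extract.
delta = (spark.table("staging.orders")
         .filter(F.col("updated_at") > F.lit(last_loaded)))

# Append the delta to the warehouse table and advance the high-water mark.
delta.write.mode("append").saveAsTable("warehouse.orders")
(delta.agg(F.max("updated_at").alias("loaded_through"))
      .write.mode("append").saveAsTable("etl_control.load_audit"))
```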

Environment: Hadoop, HDFS, Pig, Agile, Cloudera, Accumulo, MongoDB, Sqoop, Scala, Storm, Python, Spark, MQTT, Kerberos, Impala, XML, Ant 1.6, Perl, Java 8, JavaScript, JUnit 3.8, Avro, Hue.

Confidential, Jacksonville, Florida

Hadoop Developer

Roles and Responsibilities:

  • Installed and configured Hadoop and the Hadoop stack on a 16-node cluster.
  • Worked on analyzing the Hadoop cluster using different Big Data analytic tools including Pig, Hive, and MapReduce.
  • Worked on debugging and performance tuning of Hive and Pig jobs.
  • Developed data access libraries that bring MapReduce, graph, and RDBMS data to users of Scala, Java, and Python.
  • Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDFs, Pig, Sqoop, ZooKeeper, and Mahout.
  • Designed and deployed AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and OpsWorks.
  • Used the Amazon S3 object storage service to store and retrieve media files such as images, and Amazon CloudWatch to monitor the application and store logging information (see the boto3 sketch after this list).
  • Involved in writing a Java API for AWS Lambda to manage some of the AWS services.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Implemented Cluster Coordination services through Zookeeper.
  • Worked on Cisco Tidal Enterprise Scheduler (TES) which is a friendlier alternative to Oozie, the native Hadoop scheduler. Built-in Cisco TES connectors to Hadoop components eliminate manual steps such as writing Sqoop code to download data to HDFS and executing a command to load data to Hive.
  • Installed, configured, and operated data integration and analytic tools (Informatica, Chorus, SQLFire, and GemFire XD) for business needs.
  • Worked with TEXT, AVRO, PARQUET, and SEQUENCE file formats.
  • Developed scripts to automate routine DBA tasks (e.g., refreshes, backups, vacuuming).
  • Installed and configured Hive and wrote Hive UDFs that helped spot market trends.
  • Used Hadoop Streaming to process terabytes of data in XML format.
  • Involved in loading data from UNIX file system to HDFS.
  • Designed, developed, unit tested, and supported ETL mappings and scripts for data marts using Talend.
  • Implemented dynamic schema validation and custom data cleansing within Talend processes.
  • Designed and developed Talend jobs to load metadata into a SQL Server database.
  • Developed complex Talend ETL jobs to load data from files to HDFS, HDFS to Hive, Hive to Oracle, and Oracle to the data mart.
  • Experienced with REST APIs based on frameworks such as Scalatra, CXF, Sinatra, Spray.
  • Involved in developing RESTful API services using the Python Flask framework (see the Flask sketch after this list).
  • Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
  • Experienced with Hive customization, i.e., UDFs, UDTFs, and UDAFs.
  • Experienced with Python-Hive integration including pandas, NumPy, and SciPy.
  • Experienced with cloud infrastructures such as AWS (EC2, S3, EMR) and OpenStack.
  • Experienced in working with scheduling tools such as UC4, Cisco Tidal enterprise scheduler, or Autosys.
  • Worked with BI teams in generating the reports in Tableau and designing ETL workflows on Pentaho.
  • Expert in data visualization development, using Tableau to create complex, intuitive dashboards.
  • Expert in designing customized interactive dashboards in Tableau using marks, actions, filters, parameters, security concepts, calculations, and relationships.
  • Extensive knowledge of creating Tableau scorecards, bar graphs, and dashboards using stacked bars, geographical maps, scatter plots, and Gantt charts via the Show Me functionality.
  • Exported the analyzed data to relational databases using Sqoop for visualization using Tableau and to generate reports for BI team.
  • Worked extensively with advanced analysis features such as actions, calculations, parameters, background images, maps, trend lines, statistics, and log axes, and used groups, hierarchies, and sets to create detail-level summary reports and dashboards using Tableau's advanced capabilities.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
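A small Python sketch of the S3 and CloudWatch usage described above. The project code involved a Java API; this boto3 version is for illustration only, and the bucket, key, and metric names are assumptions.

```python
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

# Store and retrieve a media file (bucket and key names are illustrative).
s3.upload_file("banner.png", "media-assets-bucket", "images/banner.png")
s3.download_file("media-assets-bucket", "images/banner.png", "/tmp/banner.png")

# Publish a custom application metric to CloudWatch for monitoring.
cloudwatch.put_metric_data(
    Namespace="MediaApp",
    MetricData=[{
        "MetricName": "ImagesUploaded",
        "Value": 1,
        "Unit": "Count",
    }],
)
```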
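And a minimal Flask sketch of the RESTful API work mentioned above; the resource name, routes, and in-memory store are hypothetical stand-ins for the real service.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store standing in for the real backend (illustrative only).
reports = {}

@app.route("/reports/<report_id>", methods=["GET"])
def get_report(report_id):
    report = reports.get(report_id)
    if report is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(report)

@app.route("/reports/<report_id>", methods=["PUT"])
def put_report(report_id):
    reports[report_id] = request.get_json()
    return jsonify({"status": "saved"}), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```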

Environment: CDH4 with Hadoop 1.x, HDFS, Pig, Cloudera, AWS Lambda, Hive, HBase, ZooKeeper, MapReduce, Java, Sqoop, Oozie, Linux, UNIX Shell Scripting and Big Data, Python, Flask, Cisco Tidal Enterprise Scheduler, OpenStack, Pentaho, Tableau, Talend

Confidential

SQL/JAVA Developer

Role & Responsibilities:

  • Involved in complete requirement analysis, design, coding and testing phases of the project.
  • Implemented the project according to the Software Development Life Cycle (SDLC).
  • Documented the data flow and the relationships between various entities.
  • Actively participated in gathering of User Requirement and System Specification.
  • Created new logical and physical database designs to fit the new business requirements and implemented them using SQL Server.
  • Created Clustered and Non-Clustered Indexes for improved performance.
  • Created Tables, Views and Indexes on the Database, Roles and maintained Database Users.
  • Developed new Stored Procedures, Functions, and Triggers.
  • Developed JavaScript behavior code for user interaction.
  • Used HTML, JavaScript, and JSP and developed UI.
  • Used JDBC to manage connectivity for inserting and querying data, including stored procedures and triggers.
  • Designed the logical and physical data models, generated DDL scripts, and wrote DML scripts for the SQL Server database.
  • Implemented application using JSP, Spring MVC, Spring IOC, Spring Annotations, Spring AOP, Spring Transactions, Hibernate.
  • Transformed project data requirements into project data models using Erwin.
  • Involved in logical and physical designs and transforms logical models into physical implementations.
  • Enhanced existing data model based on the requirements, maintained data models in Erwin Model Manager.
  • Part of a team which is responsible for metadata maintenance and synchronization of data from database.
  • Involved in the design and coding of the data capture templates, presentation and component templates.
  • Developed data models for the data mart and the OLTP database.
  • Involved in end-to-end development of the DataMart and normalization of the original data sets.
  • Hands-on writing Teradata BTEQ scripts to load data into the DataMart tables.
  • Built Control-M jobs to schedule the DataMart jobs and load the tables in QA and production servers.
  • Automated the load process of datamart tables from end to end.
  • Involved in peer review and data validation activities in DataMart.
  • Wrote production implementation documents and provided Production support when required. Implemented MS SQL Server Analysis Services setup, tuning, cube partitioning, dimension design including hierarchical and slowly changing dimensions.
  • Designed snowflake and star schemas following a dimensional modeling approach.
  • Designed the dimensional model to support business processes and answer complex business questions.
  • Provided assistance to development teams on Tuning Data, Indexes and Queries.
  • Developed an API to write XML documents from database.
  • Used JavaScript to design the user interface and perform validation checks.
  • Developed JUnit test cases and validated user input using regular expressions in JavaScript as well as on the server side.
  • Developed complex SQL stored procedures, functions and triggers.
  • Mapped business objects to database using Hibernate.
  • Wrote SQL queries, stored procedures and database triggers as required on the database objects.
  • Performed analysis and design with UML and Rational Rose.
  • Created class diagrams, sequence diagrams, and collaboration diagrams.
  • Used the MVC architecture.
  • Worked on Jakarta Struts open framework.
  • Wrote Spring configuration for the defined beans and the properties to be injected into them using Spring's dependency injection.
  • Implemented an FTP utility program for recursively copying the contents of an entire directory, up to two levels deep, from a remote location using socket programming.
  • Implemented a reliable socket interface using a sliding-window protocol, similar to TCP stream sockets, over an unreliable UDP channel, and later tested it using the FTP utility program (see the simplified sketch after this list).
  • Strong domain knowledge of TCP/IP with expertise in socket programming and the IP security domain (IPsec, TLS, SSL, VPNs, firewalls, and NATs).
  • Built reliable communication between the source and destination using socket programming.
  • Hands-on experience writing Spring RESTful web services using JSON/XML.
  • Developed the Spring Features like Spring MVC, Spring DAO, Spring Boot, Spring Batch, Spring Security.
  • Used AngularJS, HTML5, and CSS3; all HTML and DHTML was accomplished through AngularJS directives.
  • Developed servlets to handle requests for account activity.
  • Developed Controller Servlets and Action Servlets to handle the requests and responses.
  • Developed Servlets and created JSP pages for viewing on a HTML page.
  • Developed the front end using JSP.
  • Developed various EJBs to handle business logic.
  • Designed and developed numerous session beans deployed on the WebLogic Application Server.
  • Implemented Database interactions using JDBC with back-end Oracle.
  • Worked on Database designing, Stored Procedures, and PL/SQL.
  • Created triggers and stored procedures using PL/SQL.
  • Written queries to get the data from the Oracle database using SQL.
  • Implemented Backup and Recovery of the databases.
  • Actively participated in User Acceptance Testing, and Debugging of the system.
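A simplified Python analogue of the reliable-transfer idea described above, using stop-and-wait acknowledgements over UDP rather than a full sliding window, with the matching receiver omitted. The host, port, and packet framing are illustrative assumptions, not the original project's implementation.

```python
import socket

def send_file(path, host="127.0.0.1", port=9000, chunk_size=1024, timeout=1.0):
    """Send a file over UDP, retransmitting each chunk until it is acknowledged."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    seq = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            packet = seq.to_bytes(4, "big") + chunk  # 4-byte sequence header
            while True:
                sock.sendto(packet, (host, port))
                try:
                    ack, _ = sock.recvfrom(4)
                    if int.from_bytes(ack, "big") == seq:
                        break            # chunk acknowledged, move on
                except socket.timeout:
                    continue             # packet or ACK lost, retransmit
            if not chunk:                # an empty chunk signals end of file
                break
            seq += 1
    sock.close()
```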

Environment: Java, Spring, XML, Hibernate, SQL Server, Maven 2, JUnit, J2EE, Servlets, JSP, Struts, Spring RESTful Web Services, NATs, Oracle, TOAD, WebLogic Server, AngularJS, HTML5, CSS3, HTML/DHTML, dimensional/logical/physical data modeling, Windows 2000 Advanced Server, Windows 2000/XP, MS SQL Server 2000, IIS, MS Visual Studio.
