Job Seekers, please send resumes to resumes@hireitpeople.com
Detailed Job Description:
- 5+ years of total development experience with Big Data technologies: Hadoop, Hive, Spark, Scala, Python/PySpark, AWS, and other cloud-related technologies.
- Independent/lead developer who can work with minimal supervision.
- Solid understanding of distributed system fundamentals.
- Solid understanding of Hadoop security, familiarity with Kerberos and keytabs, and hands-on experience running Spark, Hive, Oozie, and Kafka on a Kerberized cluster.
- Experience developing, troubleshooting, diagnosing, and performance-tuning distributed batch and real-time data pipelines using Spark/PySpark at scale.
- Develop scalable and reliable data solutions to move data across systems from multiple sources in real time (NiFi, Kafka) as well as in batch mode (Sqoop).
- Demonstrated professional experience with components of the Big Data ecosystem: Spark/Spark Streaming, Hive, Kafka/KSQL, and Hadoop (or a similar NoSQL ecosystem), and with orchestrating these pipelines in a production system using Oozie or similar tools.
- Construct data staging layers and fast real-time systems to feed BI applications and machine learning algorithms.
- Strong software engineering skills with Python or Scala/Java.
- Knowledge of some flavor of SQL (MySQL, Oracle, Hive, Impala), including the fundamentals of data modeling and performance.
- Skills in real-time streaming applications.
- Experienced in data engineering, with a good understanding of data warehouses, data lakes, data modeling, parsing, data wrangling, cleansing, transformation, and sanitization.
- Agile work experience; experience building CI/CD pipelines using Jenkins, Git, Artifactory, Ansible, etc.
- Hands-on development experience with Scala and Python on Spark 2.0, including Spark internals and Spark job performance tuning.
- Good understanding of YARN, the Spark UI, Spark and Hadoop resource management, and efficient Hadoop storage mechanisms.
- Good understanding of and experience with performance tuning in cloud environments for complex software projects, mainly around large scale and low latency.
- AWS knowledge is essential, with good working experience in AWS technologies: EMR, S3, cluster management, and Airflow automation on AWS; Snowflake knowledge is a plus.
- AWS developer certification or Spark certification is an advantage.
- Expert in data analysis in Python (NumPy, SciPy, scikit-learn, pandas, etc.).
- Strong UNIX shell scripting experience to support data warehousing solutions.
- Process-oriented, focused on standardization, streamlining, and implementation of delivery best practices.
- Excellent problem-solving and analytical skills; excellent verbal and written communication skills.
- Proven teamwork in multi-site/multi-geography organizations.
- Ability to multi-task and function efficiently in a fast-paced environment.
- Strong background in Scala or Java; experience with streaming technologies such as Flink, Kafka, Kinesis, and Firehose; experience with EMR, Spark, Parquet, and Airflow.
- Excellent interpersonal skills, ability to handle ambiguity and learn quickly.
- Exposure to data architecture & governance is helpful.
- A degree in Computer Science or a related technical field, or equivalent work experience.
Minimum years of experience*: 5+