Labelling scientific data is expensive, time-consuming and sometimes dangerous, yet the majority of scientific prediction relies on supervised learning. Most large scientific data sets are structured (tabular) data sets, which may have partial labels or none at all. Self-supervised and semi-supervised methods have been slow to gain adoption in the community because scientific domains rely heavily on model interpretation to guide the next phase of research, making most black-box methods undesirable. TabNet may offer a way to overcome these issues while also providing higher accuracy than the alternatives used in fields such as chemistry, physics and biology. TabNet is a high-performance, interpretable deep learning architecture for tabular data that uses sequential attention to provide interpretability and improve efficiency, and it also supports self-supervised learning on tabular data. In this project you will implement and improve the TabNet algorithm for scientific data sets in Python, test its performance against conventional alternatives using benchmarks, and compare the results (accuracy and computational efficiency) on real-world scientific examples involving forward prediction and inverse design.
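The sequential attention mentioned above works by producing a sparse feature-selection mask at each decision step; in the TabNet paper these masks are computed with sparsemax rather than softmax, so irrelevant features receive exactly zero weight and each step's feature choices can be read off directly. A minimal pure-Python sketch of sparsemax (illustrative only; a real implementation would operate on batched framework tensors with an autograd-compatible backward pass):

```python
def sparsemax(z):
    """Project the score vector z onto the probability simplex,
    producing a sparse distribution (many entries exactly zero)."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    k_max, cum_at_k = 0, 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        # Support condition: 1 + k * z_(k) > sum of the top-k scores
        if 1.0 + k * z_k > cumulative:
            k_max, cum_at_k = k, cumulative
    tau = (cum_at_k - 1.0) / k_max  # threshold subtracted from every score
    return [max(z_i - tau, 0.0) for z_i in z]

# Unlike softmax, low-scoring features get exactly zero weight,
# which is what makes the per-step feature masks interpretable.
print(sparsemax([1.0, 0.0, -1.0]))  # -> [1.0, 0.0, 0.0]
print(sparsemax([0.5, 0.5]))        # -> [0.5, 0.5]
```

The output always sums to one, so each step's mask can be read as a distribution over input features.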
Implement TabNet for scientific data sets and demonstrate the advantages of the method, pipeline and results. Produce a Python module for a library for general use by machine learning researchers and scientists, and prepare a draft scientific publication.
Python programming and experience in data science and machine learning are essential (such as COMP3720, COMP4660, COMP4670, COMP6670, COMP8420). Familiarity with scikit-learn, Keras, PyTorch or TensorFlow is desirable.
This is a 24 credit point project.
self-supervised learning, deep learning, data science, python