With the ever-increasing amount of research data, the question arises of where this data comes
from and what it is about. The aim of this thesis is to provide an overview of the topic and
address issues caused by the rapid growth of data by examining existing standards and
attempting to develop new methods for recreating provenance information from data. These
methods are applied to different use cases, and the specifics of data similarity and metadata
extraction are explored.