Ontology Based Information Extraction in the License Domain
Abstract
All computer users need to deal with End User License Agreements (EULAs). Every time we install software or sign up for a web service, we are expected to read and accept such a legal agreement. For most users this is only a slightly annoying step in the process, and we have been conditioned over many years simply to accept these texts unwittingly. These texts are often long and filled with legal jargon, and hence almost impossible for an interested layperson to understand. In this thesis I have explored the use of common natural language processing and knowledge extraction techniques in the domain of EULAs and license agreements. My project has included the development of an artifact that uses these techniques and then makes the data available through semantic technology. It extracts document structure, named entities, binary relations and definitions. I have built a classifier that uses topic modeling to find binary relations. The topics are then used by the classifier to decide which topic a given binary relation belongs to. I have also experimented with the use of text search in ontologies to try to find the realization of a given binary relation in a specified ontology. The artifact is run on a specific EULA, and I evaluate the knowledge extracted by each of the techniques investigated. I have not tried to find the best existing implementation of a technique, but have instead evaluated the kind of data extracted and the specific needs that arise in the domain of licenses. The extraction and representation of the structure of the license was a success, and I have used that extraction as a basis for a vocabulary that describes my extracted data. All extractions are related directly back to the text from which they were extracted. This is because of the legal document's role in a judicial system. As the text decides the outcome in court, it is important to keep a reference back to the source document.
Because of this, my system can be viewed as a system that semantically enriches a text, but without reasoning about higher levels of knowledge. I conclude that extracting knowledge using common NLP and knowledge extraction tools is feasible and opens up research into its use in document summarization and in facilitating comprehension of such legal texts. I also conclude that my classifier for binary relations has weak performance, but I list a set of changes and prerequisites that would warrant further experimentation. Finally, I conclude that special steps must be taken in the construction of our ontologies for my experiment with using the built-in comments and labels in an ontology to be viable.
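
The idea of searching an ontology's labels and comments for a given binary relation can be illustrated with a minimal sketch. It is not the implementation used in the thesis. It assumes a toy in-memory ontology and a plain token-overlap score, and every IRI, label, comment and function name in it is hypothetical.

```python
import re
from dataclasses import dataclass


# Hypothetical toy ontology: every IRI, label and comment below is invented
# for illustration and is not taken from the thesis or from a real ontology.
@dataclass(frozen=True)
class OntologyProperty:
    iri: str
    label: str    # plays the role of an rdfs:label annotation
    comment: str  # plays the role of an rdfs:comment annotation


TOY_ONTOLOGY = [
    OntologyProperty(
        "ex:grantsLicense",
        "grants license",
        "The licensor grants the user a license to use the software.",
    ),
    OntologyProperty(
        "ex:prohibitsAction",
        "prohibits",
        "The agreement prohibits the user from performing an action.",
    ),
    OntologyProperty(
        "ex:terminatesAgreement",
        "terminates",
        "A party terminates the agreement.",
    ),
]


def tokens(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search_properties(relation_phrase, ontology):
    """Return IRIs whose label or comment shares a token with the phrase.

    Results are ordered by the number of shared tokens (highest first),
    with ties broken alphabetically by IRI.
    """
    query = tokens(relation_phrase)
    scored = []
    for prop in ontology:
        overlap = query & (tokens(prop.label) | tokens(prop.comment))
        if overlap:
            scored.append((len(overlap), prop.iri))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [iri for _, iri in scored]


print(search_properties("licensor grants license", TOY_ONTOLOGY))
print(search_properties("reverse engineer", TOY_ONTOLOGY))
```

In this sketch the second query returns no candidates, because none of the labels or comments contains its words. A search of this kind can only succeed when the ontology's annotations use the vocabulary of the license text.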