Scientific Publications and Theses

Both at university and during my time at the Fraunhofer Institute for Secure Information Technology (SIT), I produced scientific work, which is listed here.

This page distinguishes between scientific publications, which have undergone peer review and appeared at international conferences, and theses and seminar papers written at university.

Scientific Publications

Sensitive and Personal Data: What Exactly Are You Talking About?
Authors: M. Kober, J. Samhi, S. Arzt, T. F. Bissyandé, J. Klein
Published 2023 in: IEEE/ACM 10th International Conference on Mobile Software Engineering and Systems (MOBILESoft)
The publication can be viewed here.

Mobile devices are pervasively used for a variety of tasks, including the processing of sensitive data in mobile apps. While in most cases access to this data is legitimate, malware often targets sensitive data and even benign apps collect more data than necessary for their task. Therefore, researchers have proposed several frameworks to detect and track the use of sensitive data in apps, so as to disclose and prevent unauthorized access and data leakage. Unfortunately, a review of the literature reveals a lack of consensus on what sensitive data is in the context of technical frameworks like Android. Authors either provide an intuitive definition or an ad-hoc definition, derive their definition from the Android permission model, or rely on previous research papers which may or may not define sensitive data. In this paper, we provide an overview of existing definitions of sensitive data in the literature and in legal frameworks. We further provide a sound definition of sensitive data derived from the definition of personal data of several legal frameworks. To help the scientific community further advance in this field, we publicly provide a list of sensitive sources from the Android framework, thus starting a community project leading to a complete list of sensitive API methods across different frameworks and programming languages.

Negative Results of Fusing Code and Documentation for Learning to Accurately Identify Sensitive Source and Sink Methods: An Application to the Android Framework for Data Leak Detection
Authors: J. Samhi, M. Kober, A. K. Kabore, S. Arzt, T. F. Bissyandé, J. Klein
Published 2023 in: 30th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
The publication can be viewed here.

Apps on mobile phones manipulate all sorts of data, including sensitive data, leading to privacy-related concerns. Recent regulations like the European GDPR provide rules for the processing of personal and sensitive data, like that no such data may be leaked without the consent of the user. Researchers have proposed sophisticated approaches to track sensitive data within mobile apps, all of which rely on specific lists of sensitive SOURCE and SINK API methods. The data flow analysis results greatly depend on these lists' quality. Previous approaches either used incomplete hand-written lists that quickly became outdated or relied on machine learning. The latter, however, leads to numerous false positives, as we show. This paper introduces CODOC, a tool that aims to revive the machine-learning approach to precisely identify privacy-related SOURCE and SINK API methods. In contrast to previous approaches, CODOC uses deep learning techniques and combines the source code with the documentation of API methods. First, we propose novel definitions that clarify the concepts of sensitive SOURCE and SINK methods. Second, based on these definitions, we build a new ground truth of Android methods representing sensitive SOURCE, SINK, and NEITHER (i.e., no source or sink) methods that is used to train our classifier. We evaluate CODOC and show that, on our validation dataset, it achieves a precision, recall, and F1 score of 91% in 10-fold cross-validation, outperforming the state-of-the-art SUSI when used on the same dataset. However, similarly to existing tools, we show that in the wild, i.e., with unseen data, CODOC performs poorly and generates many false positive results. Our findings, together with time-tested results of previous approaches, suggest that machine-learning models for abstract concepts such as privacy fail in practice despite good lab results. To encourage future research, we release all our artifacts to the community.
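For readers unfamiliar with the reported metrics, here is a minimal, self-contained sketch (independent of CODOC and its dataset) of how precision, recall, and F1 score are computed for a binary classifier such as a SOURCE-vs-NEITHER predictor:

```python
# Minimal sketch (not the paper's code): precision, recall, and F1
# for binary predictions, where label 1 marks a sensitive method.

def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In k-fold cross-validation, these values are computed on each held-out fold and averaged; the "in the wild" result in the paper shows exactly why such lab averages can still overstate real-world performance.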

Towards Automatically Generating Security Analyses from Machine-Learned Library Models
Authors: M. Kober, S. Arzt
Published 2021 in: Computer Security – ESORICS 2021
The publication can be viewed here.

Automatic code vulnerability scanners identify security antipatterns in application code, such as insecure uses of library methods. However, current scanners must regularly be updated manually with new library models, patterns, and corresponding security analyses. We propose a novel, two-phase approach called Mod4Sec for automatically generating static and dynamic code analyses targeting vulnerabilities based on library (mis)usage. In the first phase, we automatically infer semantic properties of libraries at the method and parameter level with supervised machine learning. In the second phase, we combine these models with high-level security policies. We present preliminary results from the first phase of Mod4Sec, where we identify security-relevant methods with per-category F1 scores between 0.81 and 0.93.
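As a loose illustration of the first phase (this is not Mod4Sec's actual model, which uses supervised machine learning), one can picture the classification task as deciding, per library method, whether it is security-relevant; the rule and hint list below are invented for the sketch:

```python
# Toy stand-in for the "identify security-relevant methods" task:
# flag methods by hand-picked name features. Mod4Sec learns such
# properties from data instead of using a fixed keyword list.

SECURITY_HINTS = ("encrypt", "decrypt", "cipher", "password", "random", "key")

def is_security_relevant(method_name: str) -> bool:
    """Return True if the method name contains a security-related hint."""
    name = method_name.lower()
    return any(hint in name for hint in SECURITY_HINTS)
```

The point of learning these properties instead of hard-coding them is exactly the maintenance problem named in the abstract: keyword lists, like hand-written scanner rules, go stale as libraries evolve.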

Theses and Seminar Papers

Digital Oblivion in Online Social Networks: A Necessity or Just Nice to Have?
Master's thesis (2019) at the Ruhr-Universität Bochum.
The thesis can be viewed here.

Digital technology makes forgetting difficult. The term digital oblivion describes the transfer of forgetting into the digital world. The first contribution of this thesis is an overview of the arguments for and against digital oblivion found in the literature. Whether digital oblivion should be implemented is controversial: both sides argue that the absence or presence of forgetting mechanisms introduces censorship, restricts freedom of speech, and endangers democracy. The second contribution of this thesis is to answer whether the absence of forgetting mechanisms in online social networks (OSN) is a problem for users. This question was answered by conducting a user study with 250 participants. Users would appreciate tools implementing digital oblivion in OSN to take action against data spread about them against their will, to check whether their content is offline after they delete their account, and as an optional feature to delete their content automatically after a fixed time. Users do not want their content to be deleted automatically. Tools implementing several facets of digital oblivion would be appreciated and considered helpful by OSN users.

Cross-Device Tracking
Seminar paper (2018) at the Ruhr-Universität Bochum.
The seminar paper can be viewed here.

Text-Fehlerkorrektur durch word2vec mit numerischen Klassifikations-Schranken (Text Error Correction with word2vec Using Numerical Classification Thresholds)
Bachelor's thesis (2016) at the Universität Passau.
The thesis can be viewed here.

This thesis investigates automatic methods for identifying digitization errors in historical newspaper articles digitized with OCR. By identifying the erroneous words, the manual effort of correcting such texts is meant to be reduced.
Errors are detected with several binary, threshold-based classifiers. The classifiers are either based on the Word2Vec model itself or operate on the Word2Vec word vectors.
To evaluate the classifiers, two Word2Vec models are trained: one on a domain-specific and one on a domain-independent training corpus.
The experiments show that the tested classifiers are not suitable for identifying erroneous words. The domain-independent model contains more words in its vocabulary than the domain-specific model and therefore achieves better results. Beyond that, no notable differences emerged between the domain-specific and the domain-independent model.
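A minimal sketch of one such threshold-based classifier (toy vectors and an invented threshold; the thesis uses real Word2Vec models trained on newspaper corpora): a word is flagged as a suspected OCR error if it is missing from the model's vocabulary, or if its best cosine similarity to any other vocabulary word stays below a threshold.

```python
import math

# Toy sketch of a threshold-based error classifier over word vectors.
# The embeddings and the threshold are illustrative placeholders.

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def flag_ocr_error(word, embeddings, threshold=0.5):
    """Flag a word as a suspected OCR error if it is out of vocabulary
    or its best similarity to any other vocabulary word is below the
    threshold."""
    if word not in embeddings:
        return True
    vec = embeddings[word]
    best = max(
        (cosine(vec, other) for w, other in embeddings.items() if w != word),
        default=0.0,
    )
    return best < threshold
```

The vocabulary-size effect reported above also falls out of this design: out-of-vocabulary words are always flagged, so a model with a smaller vocabulary necessarily produces more (often false) error flags.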

Die Vermessung der Welt
Seminar paper (2014) on Carl Friedrich Gauß and Alexander von Humboldt, written for the seminar "Mathematik in Filmen und Serien" ("Mathematics in Films and Series") at the Universität Passau.
The seminar paper can be viewed here.