Funding for research with Azure services

Microsoft Azure is a platform that provides a range of cloud services, such as virtual servers or artificial intelligence services. Employees of the University of Vienna can use these services for research through the ZID, for a fee and on special terms. More information about Microsoft Azure

To support research activities in Azure, the ZID is offering funding for the summer semester of 2024. A total of 22,000.00 euros is available. Up to 4,000.00 euros are awarded per project.

Projects with one of the following characteristics are given priority when funding is awarded:

  • they use Azure services for which the ZID does not offer alternative IT services
  • they work with hybrid approaches (combined use of Azure services with local infrastructure)

Funded research projects

Funding amount 4,000.00 euros

  • Marina Dütsch | FLEXWEB

    Organisational unit: Institut für Meteorologie und Geophysik

    Abstract:

    Flexpart (FLEXible PARTicle dispersion model) is a numerical model that simulates the dispersion of gases and aerosols in the atmosphere. The model is being further developed at the Institut für Meteorologie und Geophysik and is used in various international and national research projects. Use cases include determining greenhouse gas emissions, tracking the transport of microplastics, and dispersion calculations for nuclear incidents (e.g. for the CTBTO).

    To use Flexpart, it has to be installed and run on a computer or supercomputer. This comes with hurdles: not all scientists have access to a supercomputer, and technical problems often arise during installation or execution. In this project we therefore want to develop a Flexpart web service (FLEXWEB) that lets users run Flexpart through a website.

    The project is intended as a pilot for a later operational service. Flexpart will compute trajectories on a Kubernetes cluster in the cloud and make the results easily accessible to users. Once a simulation has finished, the output files will be made available for download and displayed graphically. In this way we hope to simplify access to Flexpart for scientists worldwide.

    Flexpart development at the University of Vienna

  • Wolfgang Klas | FactCheck

    Organisational unit: Forschungsgruppe Multimedia Information Systems, Fakultät für Informatik

    Abstract:

    FactCheck is an internal research project of the Research Group Multimedia Information Systems, Faculty of Computer Science, that aims to compare and signal conflicts in information available on the Web. This information, which may be available in textual form (e.g., paragraphs in an HTML document) or multimedia form (e.g., news segments in video form), shall be extracted using a combination of approaches from the Semantic Web (e.g., structured data) and state-of-the-art AI technologies and concepts (e.g., named entity recognition or entity linking). The comparison processes for this information will be partially driven by human intelligence and human feedback, which is why approaches for user identities and user management (e.g., Azure Entra ID) will also be investigated. For the deployment of the FactCheck prototype(s), a hybrid approach is considered, which allows for the use of both scalable Azure services (e.g., cognitive services like AI Video Indexer and user management) as well as available on-premises infrastructure (e.g., VMs or databases) at the University of Vienna to achieve suitable tradeoffs in terms of security, privacy, and costs. To keep the deployment highly flexible and modular, parts of this deployment may be containerized, thus simplifying deployment on both Azure and local infrastructure. 

  • Oliver Wieder | Revolutionizing Olfactory Perception Mapping: A Contrastive Learning Graph Neural Network Approach

    Organisational unit: Department für Pharmazeutische Wissenschaften

    Abstract:

    This project proposes a groundbreaking approach to understanding olfactory perceptions by developing a novel computational model that maps chemical structures to olfactory characteristics. Leveraging the advanced techniques of contrastive learning and graph neural networks (GNNs), the project aims to overcome the limitations of current olfactory perception studies, which predominantly rely on subjective human olfactory tests. The core objective is to create a GNN model that accurately represents the complex geometries and properties of small molecules in an embedding space. This space will then be used to fine-tune an odor classifier, significantly enhancing its predictive accuracy. A key innovation of this project is the integration of attention mechanisms to elucidate the role of functional groups in odor perception, a facet largely unexplored in existing research. A significant outcome of this project will be the development of an interactive online dashboard. This platform will enable industry professionals and researchers to visualize and interact with the olfactory map, inputting their compounds and receiving insights into their olfactory characteristics. This tool is expected to have substantial applications in various industries, particularly in the development of products like mosquito repellants. Backed by promising literature in the fields of contrastive learning of small molecules and deep-learning approaches to odor mapping, this project stands on the cusp of a significant breakthrough in olfactory science. It promises not only to advance our fundamental understanding of how chemical structures translate into olfactory experiences but also to transform industries that rely on these insights.

Funding amount 2,000.00 euros

  • Abert Claas | Very Largescale Distributed Micromagnetic Research Tools

    Organisational unit: Institut Physik Funktioneller Materialien

    Abstract:

    In the context of the FWF standalone project "Very Largescale Distributed Micromagnetic Research Tools" (P 34671), we are developing algorithms for the distributed solution of micromagnetic problems on multi-GPU systems. First tests on our group-owned workstation with 4xA100 Nvidia GPUs as well as on the VSC5 nodes with 2xA100 GPUs show promising results. However, in order to perform a comprehensive scaling study, we ask for GPU computing hours on the Azure cluster, which features fat GPU nodes with 8xA100 Nvidia GPUs. Our planned study requires 100 hours of the largest GPU VM instance "ND96amsr A100 v4" and will allow us to investigate the scaling of our algorithm both on single instances and on distributed multi-GPU instances. Hence, we request funding of 100 × $32.70 = $3,270.00 in order to carry out our numerical study.
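The budget arithmetic in this abstract can be reproduced with a few lines of Python. This is only an illustrative sketch: the hours and the hourly rate are the figures quoted in the abstract, while the constant and function names are made up for this example.

```python
from decimal import Decimal

# Figures quoted in the abstract above.
HOURS_REQUESTED = Decimal("100")    # planned hours on the largest GPU VM instance
HOURLY_RATE_USD = Decimal("32.7")   # hourly price stated in the abstract


def compute_cost(hours: Decimal, rate: Decimal) -> Decimal:
    """Return hours * rate, rounded to whole cents."""
    return (hours * rate).quantize(Decimal("0.01"))


total_usd = compute_cost(HOURS_REQUESTED, HOURLY_RATE_USD)
print(f"Requested budget: {total_usd} USD")
```

Using `Decimal` instead of floating point keeps the monetary result exact to the cent.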

  • Xin Huang | selscape: Automated and Distributed Pipelines for Investigating the Landscape of Natural Selection from Large-scale Genomic Datasets

    Organisational unit: Department für Evolutionäre Anthropologie

    Abstract:

    Natural selection plays a pivotal role in evolutionary processes. With the increasing availability of genomic datasets across various species and populations, studying the genomic imprints of natural selection is crucial for understanding evolutionary histories and conserving biodiversity. However, the burgeoning size of these datasets, coupled with the plethora of computational tools available, can overwhelm researchers, especially given the limited computing resources often available for exploring the numerous modes of natural selection. Here, we aim to implement a curated suite of established software tools for detecting and quantifying signals and intensities of natural selection within large-scale genomic datasets. Our proposed pipelines offer a comprehensive, automated analysis workflow, from data preparation to result visualization. Designed for implementation using Snakemake, a versatile workflow management system, these pipelines ensure scalable and reproducible analysis across diverse computing environments, including high-performance computing clusters and cloud infrastructures. Initially developed on our local Life Science Compute Cluster (LiSC), we plan to extend and test these pipelines for cloud deployment via Azure Batch, which provides native support for Snakemake. Our intermediate goal is to apply these pipelines to the UK Biobank dataset, the largest whole-genome dataset in the world, comprising 500,000 genomes. We aim to benchmark our pipelines and investigate the landscape of natural selection within British populations. Finally, the implementation of this workflow on cloud infrastructures can be utilized for analyzing massive genomic datasets from various species, offering new insights into how natural selection shapes the biodiversity of our world.

  • Dylan Paltra | MULTIREP – Multidimensional Representation: Enabling An Alternative Research Agenda on the Citizen-Politician Relationship

    Organisational unit: Institut für Staatswissenschaft

    Abstract:

    The “MULTIREP” project aims to enable an alternative approach to studying the citizen-politician relationship. It focuses primarily on how citizens conceptualize representation. A mixed-methods approach combines qualitative methods (focus groups and one-to-one interviews with citizens) and quantitative methods in five countries (approx. 2,000 respondents in each), focusing on natural language processing approaches. In a multinational and multilingual mass survey in five countries, including 10,000 participants, we want to improve on current survey methodology by analyzing respondents’ answers in real time to provide tailored probing questions. We will use several cloud computing instances during the data collection, accessed from the survey platform via web services. To evaluate respondents’ answers, we will implement several NLP algorithms such as language detection, mBERT, and Flesch’s reading ease score, among others. After the data collection, we want to examine the survey answers through different language models like mBERT and our implementation of a large language model (Llama) to classify citizens’ text answers. These models must be additionally trained and fine-tuned based on existing models for our use case. For this, cloud computing instances are necessary, especially with GPU; otherwise, the computation costs would be very high. Llama especially requires a GPU instance. Additionally, we might access the Microsoft Translator Services depending on the developments in our research process. We aim to classify citizens’ answers to our open-ended questions. Here, we want to categorize how citizens conceptualize different dimensions of representation. Additionally, we would like to access Azure’s speaker recognition service to transcribe our focus group and one-to-one interviews. This is standard practice when applying qualitative methods.

    To the best of our knowledge, the real-time evaluation of survey answers by machine learning algorithms has yet to be adopted in current social science research. Therefore, the implications and contributions of this work could be far-reaching, as a successful implementation of our study through functions offered by Azure would open up new avenues in survey implementation for both respondents and researchers. The delivery of the survey through these means would mimic a human-assisted interaction in the questioning and prompting phases of the survey, which would be far more expensive to achieve through traditional channels of computer-assisted web or telephone interviewing. Finally, it would enhance our analytical capabilities on mass-collected open-ended data to a new standard for social science research.
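Flesch’s reading ease score, one of the measures named in this abstract, is defined for English text as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The following is a minimal sketch of that formula, not the project's implementation. The function names and the vowel-group syllable counter are assumptions made for illustration, and the heuristic is rough and English-only.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease for English text; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease("Representation encompasses multidimensional conceptualizations.")
print(f"simple answer: {simple:.1f}, dense answer: {dense:.1f}")
```

A survey platform could use such a score to decide whether an open-ended answer warrants a follow-up probe. A multilingual survey would need language-specific variants of the formula and a better syllable counter.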

  • Miguel Angel Rios Gaona | Controlled Machine Translation with Large Language Models for the Technical Domain

    Organisational unit: Zentrum für Translationswissenschaft

    Abstract:

    Current state-of-the-art Neural Machine Translation (NMT) models and Large Language Models (LLMs) have shown promising results on machine translation of high-resource language pairs [5, 3]. However, in a high-risk and low-resource domain, like the technical domain (e.g. clinical notes or engineering manuals), the accurate translation of terminology and correct document structure are crucial for exchanging information across international healthcare providers or researchers [6]. Moreover, the introduction of terminology and document structure constraints into neural models via instructions is currently an open problem [4, 11]. One example is controlling generation so that MT output uses the correct medical terms, length, or grammar compared to human translations. Our goal is to incorporate terminology and document structure constraints into an LLM. We plan to add a dictionary of technical terms and in-domain technical data as instructions for fine-tuning a pre-trained model based on FLAN-T5 [11] or LLaMA [10]. We will study different strategies for adding dictionaries and constraints into LLMs, e.g. source constraints and instruction fine-tuning [4, 11]. We will test the proposed model on the English-German and German-English language pairs with medical and scientific paper abstracts [6, 1]. We will evaluate with automatic metrics [7, 8] and in-house human experts [9]. We plan to use one A100 40 GB GPU or V100 32 GB GPU for tuning our proposed model and comparing it with related work. We require GPUs to develop our model, NMT baselines, and instruction fine-tune related work (e.g. FLAN-T5).


    Project timeline:

    • NMT and LLM baselines, 01.02.24 to 01.03.24
    • LLM instruction fine-tuning, 02.03.24 to 01.07.24
    • Manual error annotation, 15.06.24 to 31.07.24
    • Draft paper, 01.06.24 to 15.08.24
    • Project report, 01.08.24 to 30.09.24


    Project outcomes:

    • Paper submitted to a peer-reviewed publication;
    • Project report;
    • Open source code and models.


    References:

    1. Alam, M., Kvapilíková, I., Anastasopoulos, A., Besacier, L., Dinu, G., Federico, M., Gallé, M., Jung, K.W., Koehn, P., & Nikoulina, V. (2021). Findings of the WMT Shared Task on Machine Translation Using Terminologies. Conference on Machine Translation.

    2. Alves, D.M., Guerreiro, N.M., Alves, J., Pombal, J.P., Rei, R., Souza, J.G., Colombo, P., & Martins, A. (2023). Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning. ArXiv, abs/2310.13448.

    3. Bawden, R., & Yvon, F. (2023). Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. European Association for Machine Translation Conferences/Workshops.

    4. Exel, M., Buschbeck-Wolf, B., Brandt, L., & Doneva, S. (2020). Terminology-Constrained Neural Machine Translation at SAP. EAMT.

    5. Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F.B., Wattenberg, M., Corrado, G.S., Hughes, M., & Dean, J. (2017). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339-351.

    6. Neves, M.L., Jimeno-Yepes, A., Névéol, A., Grozea, C., Siu, A., Kittner, M., & Verspoor, K.M. (2018). Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets. WMT.

    7. Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. ACL.

    8. Rei, R., Stewart, C.A., Farinha, A.C., & Lavie, A. (2020). COMET: A Neural Framework for MT Evaluation. EMNLP.

    9. Rios, M., Chereji, R., Secară, A., & Ciobanu, D. (2023). Quality Analysis of Multilingual Neural Machine Translation Systems and Reference Test Translations for the English-Romanian language pair in the Medical Domain. European Association for Machine Translation Conferences.

    10. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. ArXiv, abs/2302.13971.

    11. Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., & Le, Q.V. (2021). Finetuned Language Models Are Zero-Shot Learners. ArXiv, abs/2109.01652.

  • Petro Tolochko | Determining Scientific Uncertainty in Academic Publications

    Organisational unit: Institut für Publizistik- und Kommunikationswissenschaft

    Abstract:

    Misleading scientific information is increasingly discussed as one of the most pressing challenges to science (Druckman, 2022; Swire-Thompson and Lazer, 2022; West and Bergstrom, 2021). Its threat to planetary and human health “has reached crisis proportions” (West and Bergstrom, 2021, p. 1), and its impact on societies’ and individuals’ reactions to the COVID-19 pandemic has led the WHO to declare an “infodemic” (John, 2020). Research on misleading scientific information is heavily focused on social media (e.g., Renstrom, 2022). However, given that most people still only come in contact with science through its media portrayals (Schäfer et al., 2019), the misrepresentation of scientific information in news coverage might be even more problematic. One central aspect in which scientific findings are misrepresented is the failure to convey uncertainty (Druckman, 2022; Dumas-Mallet et al., 2018; Swire-Thompson and Lazer, 2022). Uncertainty is inherent to the self-correcting nature of science, and scientific findings are always limited by scientists’ decisions regarding sampling and statistical analyses (Gustafson and Rice, 2020). However, the uncertainty of scientific information is often misrepresented in news coverage (Dumas-Mallet et al., 2018; Sumner et al., 2016), and findings are frequently simplified and presented as certain, suggesting causal relationships where researchers describe correlation (Haber et al., 2018). While media logic plays a crucial role in this misrepresentation of scientific information, scholars urge us to acknowledge that the roots might also lie within science (West and Bergstrom, 2021). There are indications that misrepresentation of uncertainty already occurs in scientific articles or related press releases (West and Bergstrom, 2021; Haber et al., 2018). The failure to convey uncertainty has detrimental consequences for science communication. It can leave people misinformed about scientific issues. 
    For example, they might overestimate the effectiveness of new medical discoveries (Dumas-Mallet et al., 2018). Alternatively, it might distort public perceptions of the scientific process. While most scientists understand that uncertainty is an inherent part of the scientific process and there are no “hard facts,” only degrees of plausibility (e.g., Russell, 2013), an average person might not. This misunderstanding might further be exacerbated by overly “deterministic” coverage of scientific evidence in the media. Furthermore, when findings initially presented as certain are not replicated later on (Dumas-Mallet et al., 2018), it might have detrimental effects on people’s trust in science. Thus far, there is little empirical evidence on the prevalence of uncertainty in science and science communication. Specifically, there is no systematic analysis of how the communication of scientific (un)certainty differs across a) different scientific disciplines and b) platforms of science communication (i.e., academic studies, press releases, news coverage). A large amount of data needs to be analyzed to fill these gaps. Thus, in this study, we will develop an automated method of measuring the concept of “uncertainty” in texts. We will then use this method to analyze the prevalence of (un)certainty in a large sample of scientific studies, their related press releases, and news coverage. We select studies from all major research disciplines. The contribution of our study is thus three-fold: first, it would be the first to provide a large-scale, comprehensive analysis of the role of (un)certainty in science communication, adding a comparative perspective across disciplines and platforms. Second, by linking scientific studies and their related press releases and news coverage, we will create a unique dataset that will be used to explain at what stages of science communication (study, press release, news coverage) the degree of (un)certainty changes.
    Lastly, the measurement of (un)certainty will be a valuable tool in future research as the concept is of high relevance in science communication and other fields such as crisis communication (Sellnow and Seeger, 2021; O’Malley, 2012) and political science (e.g., Manski, 2013).

Application requirements

The applicant must:

  • have an active employment relationship with the University of Vienna and an active u:account
  • be entitled to order Microsoft 365 via the self-service portal
  • accept the data protection provisions and terms of use for Microsoft Azure; see the Servicedesk form Microsoft Azure bestellen (Order Microsoft Azure)

Funding conditions

  • The Coordination Digital Transformation team of the ZID decides on the funding. If necessary, it consults peer reviewers.
  • The funding amount granted per project is deducted monthly, on a pro rata basis, from the Azure costs incurred over the period of use until 31.07.2024.
  • Costs that exceed the granted funding amount or are incurred after the funding ends must be covered by a cost centre available to the project.
  • The ZID is responsible for setting up the project environment in Azure, onboarding, and assigning user permissions. Support with the technical implementation of the project is not offered.
  • Personnel resources are explicitly not funded.
  • After the funding ends, the Azure environment provided and the resources it contains remain available to users. Continued use of the services is possible and encouraged.
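The monthly pro rata deduction in the funding conditions can be read in more than one way. The sketch below shows one possible reading and is not the ZID's actual accounting procedure. It assumes that the grant is split into equal monthly shares over the six project months (01.02.–31.07.2024), that an unused share is not carried over, and that any excess goes to the project's cost centre. The function name and the monthly cost figures are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")


def split_monthly_costs(grant_eur, monthly_costs_eur):
    """Return a list of (covered_by_grant, charged_to_cost_centre) per month.

    Assumption: the grant is divided into equal monthly shares, and a share
    that is not used up in one month is not carried over to the next.
    """
    if not monthly_costs_eur:
        raise ValueError("at least one month of costs is required")
    share = (grant_eur / len(monthly_costs_eur)).quantize(CENT, rounding=ROUND_DOWN)
    breakdown = []
    for cost in monthly_costs_eur:
        covered = min(cost, share)
        breakdown.append((covered, cost - covered))
    return breakdown


# Hypothetical Azure costs for February to July 2024, in euros.
example_costs = [Decimal(x) for x in ("200.00", "500.00", "333.33", "0.00", "410.50", "90.00")]
result = split_monthly_costs(Decimal("2000.00"), example_costs)
for month, (covered, charged) in enumerate(result, start=2):
    print(f"2024-{month:02d}: grant covers {covered} EUR, cost centre pays {charged} EUR")
```

Under a different reading, for example with unused shares carried over to later months, the split between grant and cost centre would change.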

Timeline

  • 01.11.–31.12.2023: Application for funding
  • 01.–14.01.2024: Internal review of applications and any follow-up questions
  • From 16.01.2024: Announcement of the funded projects by e-mail
  • 17.01.–31.01.2024: Setup of the Azure environments by the ZID, onboarding of users
  • 01.02.–31.07.2024: Project execution
  • 01.08.–30.09.2024: Submission of project reports

Apply for funding

The application deadline for funding has passed.

Contact

If you have questions about the funding, please use the Servicedesk form Anfrage zu Microsoft Azure (Enquiry about Microsoft Azure).