Center for Artificial Intelligence and Cybersecurity – AIRI

Evaluation of Language Models over Croatian Newspaper Texts

28.11.2017

Statistical language modeling comprises techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularity of a language. This paper presents the basic characteristics of statistical language models, reviews their use in a wide range of speech and language applications, explains their formal definition, and describes the different types of language models. A detailed overview of n-gram and class-based models (as well as their combinations) is given chronologically, by type and complexity of the models, and with respect to their use in different NLP applications for different natural languages. The proposed experimental procedure compares three types of statistical language models: n-gram models based on words, categorical models based on automatically determined categories, and categorical models based on POS tags. We propose a language model for contemporary Croatian texts, together with a procedure for determining the best n-gram and the optimal number of categories, which leads to a significant decrease in language model perplexity, estimated on the Croatian News Agency (HINA) article corpus. Using different language models estimated from the HINA corpus, we show experimentally that category-based models describe the natural language better than word-based models. Beyond Croatian, these findings apply to similar highly inflectional languages with rich morphology and free sentence word order.
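To make the notions of an n-gram model and perplexity concrete, the following is a minimal sketch of a word-based bigram model. It is not the procedure used in the paper: the add-one (Laplace) smoothing, the boundary markers, the function names, and the toy sentences are all assumptions chosen for illustration. Perplexity is computed as 2 raised to the negative average log2 probability per predicted token, and every test word is assumed to occur in the training vocabulary.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary markers


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    # Predictable vocabulary: every seen type except BOS, which is never predicted.
    vocab_size = len(unigrams) - 1
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(sentences, unigrams, bigrams):
    """Perplexity of the model on the given tokenized sentences."""
    log_sum, n_events = 0.0, 0
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log2(bigram_prob(prev, word, unigrams, bigrams))
            n_events += 1
    return 2 ** (-log_sum / n_events)


# Made-up toy data, not taken from the HINA corpus.
train = [["vlada", "je", "odlučila"], ["vlada", "je", "najavila"]]
uni, bi = train_bigram(train)
print(perplexity(train, uni, bi))
```

A category-based model of the kind compared in the paper would replace the word contexts with category labels (automatically induced classes or POS tags); lower perplexity on held-out text then indicates a better description of the language.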

Authors: Slobodan Beliga, Ivo Ipšić, Sanda Martinčić-Ipšić
Journal: Information Technology and Control
Publishing date: 15.11.2017