Croatian Language Corpus

The Croatian Language Corpus (CLC; Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Background

The CLC was initially funded from May 2005 as a sub-project of the research program Riznica (Croatian Language Repository) by the Ministry of Science, Education and Sports of the Republic of Croatia (MZOŠ), project no. 0212010. In a second development phase, beginning in 2007, the further extension and development of the CLC was embedded in the research program The Croatian Language Repository (CLR), awarded by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012[1]). The CLR is a research program (PI Dunja Brozović Rončević) that subsumes numerous independent research projects making use of the CLC, so the corpus is developed mainly as a by-product of those projects. Currently Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development.

Goals

One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels: lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources from the Croatian standard language, several corpora from different developmental stages of Croatian are also being created, including digitizations of manuscripts and Croatian dictionaries.

Format and Availability

From the outset, the collected and digitized texts in the CLC have been annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently, approximately 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the Philologic[2] interface (see The ARTFL Project,[3] Department of Romance Languages and Literatures, The University of Chicago). It is virtualized into various sub-corpora, and individual or specific definitions of sub-corpora can be provided on demand.
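
As an illustration of what working with TEI P5 XML can look like, the following minimal Python sketch extracts word tokens and their lemmas from a small TEI fragment. The sample document, the function name `extract_tokens`, and the use of `<w>` elements with a `lemma` attribute are assumptions made for this example; the actual markup used in the CLC files is not described here and may differ.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# TEI P5 documents place their elements in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample invented for this sketch; not taken from the CLC.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical sample</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc><p>Invented for this example</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s><w lemma="pas">Psi</w> <w lemma="lajati">laju</w><pc>.</pc></s>
      </p>
    </body>
  </text>
</TEI>
"""


def extract_tokens(tei_xml: str) -> list[tuple[Optional[str], Optional[str]]]:
    """Return (surface form, lemma) pairs for every <w> element in document order."""
    root = ET.fromstring(tei_xml)
    return [(w.text, w.get("lemma")) for w in root.iterfind(".//tei:w", TEI_NS)]


if __name__ == "__main__":
    for form, lemma in extract_tokens(SAMPLE_TEI):
        print(form, lemma)
```

Punctuation marked with `<pc>` is skipped in this sketch, so only word tokens are returned; whether punctuation counts toward a corpus token total depends on the counting convention in use.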

Content

The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of Croatian began to take its final shape, i.e. from the second half of the 19th century onward.

The CLC consists of:

  • fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
  • non-fiction
  • scientific publications from various domains and university textbooks
  • school books
  • translated literature by outstanding Croatian translators
  • online journals and newspapers
  • books from the pre-standardization period of Croatian that have been adapted to present-day standard Croatian

Cooperation

The realization of the CLC was made possible in cooperation with:

References

  1. Ćavar and Brozović Rončević, 2012
  2. Philologic
  3. "The ARTFL Project". Archived from the original on 2009-12-04. Retrieved 2011-05-22.