Armenian resources – Hossep Dolatian

This list provides general tools and resources on Modern Armenian. I focus on tools and resources that would be useful for linguists and computer scientists. This includes treebanks, digitized lexicons, and more. For more resources on Classical Armenian, see Calfa and the LACIM presentation on Armenian tools.

Contents

1 Transliteration and orthography tools
2 Audio
3 Lexicons, paradigms, and transcriptions
4 Treebanks
5 Corpora and unannotated texts
- 5.1 Eastern Armenian
- 5.2 Western Armenian
6 General NLP tools

Transliteration and orthography tools

To get automatic transliterations, see translitteration.com. The website supports common transliteration or romanization systems for Classical, Western, and Eastern Armenian.

To convert between the reformed and traditional orthographies, you can use the following:

Arak29’s converter, which can be downloaded.
The Instigate Training Center’s conversion tool.

Audio

It’s relatively hard to find transcribed or even un-transcribed (but organized) audio recordings of Armenian. Some useful ones include:

Wiktionary

The Armenian entries of English Wiktionary include audio files for thousands of lemmas, which you can find in this Wiktionary category. You can download some of them at Lingua Libre or Wikimedia.

Field recordings

Different linguists have made field recordings of Armenian speech for different purposes. Some have transcriptions of some type (transliteration or phonetic/phonemic). Few are available online:

Recordings of different Armenian dialects from the UCLA Phonetics Lab archive. A subset seems analyzed in the VoxAngeles project.
Skopeteas, Hovhannisyan, and Brokmann’s archive on Eastern Armenian
My Iranian Armenian archive (temporary)

General archives

There are collections of general recordings of Armenian, some of which are transcribed in some form.

ReRooted archive of Syrian Armenian testimonies. Most entries have an SRT file with an orthographic transcription, so the data is time-aligned. A corrected version is on GitHub.
Fleurs has 60 speakers of Eastern Armenian reading 10 hours of Wikipedia sentences. The data has an orthographic transcription. Each sound file is a separate sentence.
Various (free) audiobooks at Grqaser and Lsel65
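
Because an SRT file pairs each transcribed line with a start and end timestamp, it can be turned into simple (start, end, text) records with a few lines of Python. The sketch below assumes the standard SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines). The names `Cue`, `parse_srt`, and `SAMPLE` are hypothetical and made up for illustration; they are not part of any archive listed here.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcription text for this interval


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Split SRT content into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        timing = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if timing is None:
            continue
        start, end = lines[timing].split("-->")
        text = " ".join(l.strip() for l in lines[timing + 1:] if l.strip())
        # Only the first token after the arrow is the end timestamp.
        cues.append(Cue(_to_seconds(start), _to_seconds(end.split()[0]), text))
    return cues


# Hand-written toy sample, not taken from any real recording.
SAMPLE = (
    "1\n00:00:01,000 --> 00:00:02,500\nԲարեւ ձեզ\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nառաջին տող\nերկրորդ տող\n"
)
```

To use it on a real file, read the file as UTF-8 and pass its contents to `parse_srt`; the resulting intervals could then be used to cut the audio into sentence-sized clips.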

Lexicons, paradigms, and transcriptions

Both dialects

Wiktionary

English Wiktionary contains around 15K Armenian lemmas as of Jan 1 2022.

Most of these lemmas include IPA pronunciations in both standard dialects, and conjugation/declension paradigms in Eastern.
The IPA transcription system for Armenian is handled via Wiktionary modules.
Paradigms are handled by Wiktionary templates.
Templates for the Western paradigms are currently in development.
Wiktionary extractors (wiktextract, wikipron) work well with these Armenian lemmas.
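
As a rough illustration of working with extractor output, the sketch below pulls IPA transcriptions for Armenian entries out of a wiktextract-style JSON Lines dump. It assumes each line is a JSON object with `word`, `lang_code`, and a `sounds` list whose items may carry an `ipa` key, and that Modern Armenian uses the code `hy`; check these field names against the dump you actually download. The function name `armenian_ipa` and the sample entries are hypothetical.

```python
import json
from typing import Iterable, Iterator


def armenian_ipa(lines: Iterable[str]) -> Iterator[tuple[str, list[str]]]:
    """Yield (word, [ipa, ...]) for entries whose lang_code is 'hy'.

    Assumes wiktextract-style JSON Lines input (one JSON object per line).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("lang_code") != "hy":
            continue
        # Keep only sound records that actually carry a transcription.
        ipas = [s["ipa"] for s in entry.get("sounds", []) if "ipa" in s]
        if ipas:
            yield entry["word"], ipas


# Hand-written toy entries that only mimic the assumed layout.
SAMPLE_LINES = [
    json.dumps(
        {"word": "տուն", "lang_code": "hy",
         "sounds": [{"ipa": "[tun]"}, {"audio": "example.ogg"}]},
        ensure_ascii=False,
    ),
    json.dumps({"word": "house", "lang_code": "en",
                "sounds": [{"ipa": "/haʊs/"}]}, ensure_ascii=False),
    "",
]
```

With a real dump, pass an open file handle (opened with `encoding="utf-8"`) instead of `SAMPLE_LINES`, since the function only needs an iterable of lines.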

Armenian Wiktionary has substantially more lemmas. But common Wiktionary extraction software has trouble with the site because of formatting issues.

Repositories

There are various websites that link to other existing dictionaries of Armenian. Unfortunately, most of these dictionaries are available only in a PDF form, whether OCR-ed or un-OCR-ed. Some utilize a search option or a GUI. But few provide simple text versions of their dictionaries.

Eastern Armenian

EANC’s source code: The source code for the Eastern Armenian National Corpus was recently made public on Bitbucket and GitHub. The source code includes paradigms and a lexicon, and is open-source (obviously). You have to do some scripting magic to convert the lexicons into a more readable tabular format. This is the current version that I have (from circa Dec 2020).
UD tools: The makers of the UD treebank for Eastern Armenian also made various tools available, including a tokenizer and stemmer/lemmatizer.
Calfa provides a proprietary analyzer.
Armenian English Dictionary for MacOS
Various text file versions of Eastern Armenian dictionaries at bararan-hay.

Western Armenian

ArmenianVerbs (Boyacioglu & Dolatian 2020) is a repository of around 3K Western Armenian verbs, along with complete verbal paradigms. The data is based on Boyacioglu 2010, a textbook on verbal paradigms.
The Apertium morphological analyzer for Western Armenian provides lexicons and paradigms.

Treebanks

There are two main Universal Dependencies treebanks for Armenian along with some smaller ones. They are documented on the Universal Dependencies page. To download the Armenian treebanks (including both the corpus and annotation), you have to download all the UD treebanks.
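
UD treebanks are distributed as CoNLL-U files: ten tab-separated columns per token, `#` comment lines, and a blank line between sentences. Once the treebanks are downloaded, a minimal reader along the following lines can load them without extra dependencies. This is only a sketch: `read_conllu`, `FIELDS`, and the toy sentence are hypothetical and hand-written, and the annotation in the sample is not taken from the Armenian treebanks.

```python
from typing import Iterator

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str) -> Iterator[list[dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS.

    Multiword-token ranges (IDs like 1-2) and empty nodes (IDs like 1.1)
    are skipped, so only ordinary syntactic words are returned.
    """
    sentence: list[dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Toy sentence with illustrative annotation only.
_ROWS = [
    ["1", "Նա", "նա", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "կարդում", "կարդալ", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "է", "եմ", "AUX", "_", "_", "2", "aux", "_", "_"],
]
SAMPLE = "# text = Նա կարդում է\n" + "\n".join("\t".join(r) for r in _ROWS) + "\n\n"
```

For anything beyond quick inspection, a dedicated CoNLL-U library is probably a better choice, since it will also handle the feature and dependency columns in a structured way.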

Corpora and unannotated texts

Eastern Armenian

The Eastern Armenian National Corpus (EANC) is the largest extant annotated corpus of Eastern Armenian. See this introduction video for more.
Vortan has a set of unannotated corpora.
Haybook contains a list of various sites that have unannotated texts. Most of them are in PDF form, though.
Grqaser has various books in OCR-ed PDF format, and some audio material.
Grahavak has lists of repositories where you can find texts and (sometimes OCR-ed) PDFs.

Western Armenian

Nooj annotated corpus on Western Armenian.
Scraped unannotated corpus on Western Armenian. This was used for evaluating the Apertium analyzer.

General NLP tools

For general deep learning tools for processing Armenian text, the YerevaNN research group provides a wealth of tools and datasets. There is likewise a repo with an assortment of Armenian NLP tools.

Calfa provides (affordable) OCR services for Armenian manuscripts and books.

Avetisyan & Broneske 2023 is a survey paper on Armenian NLP with useful citations and hyperlinks.

Tesseract supports Armenian OCR, and so does OCRmyPDF, which is built on top of Tesseract.
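
As a small sketch of how this might be scripted, the Python wrapper below builds and runs an OCRmyPDF command with Tesseract's Armenian language pack, whose code is `hye`. It assumes that `ocrmypdf` and the `hye` traineddata are already installed on your system; the function names `ocrmypdf_command` and `run_ocr` are hypothetical, and the file names in the usage note are placeholders.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def ocrmypdf_command(src: PathLike, dst: PathLike, lang: str = "hye") -> list[str]:
    """Build the argument list for: ocrmypdf -l <lang> <src> <dst>."""
    return ["ocrmypdf", "-l", lang, str(src), str(dst)]


def run_ocr(src: PathLike, dst: PathLike, lang: str = "hye") -> None:
    """Run OCRmyPDF on src, writing a searchable PDF to dst.

    Raises RuntimeError if the ocrmypdf executable is not on PATH, and
    subprocess.CalledProcessError if OCRmyPDF exits with an error.
    """
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(ocrmypdf_command(src, dst, lang), check=True)
```

For example, `run_ocr("scan.pdf", "scan_ocr.pdf")` would add an Armenian text layer to a scanned PDF. Multiple languages can be combined with a plus sign in the language argument (for example `hye+eng`) for mixed-language documents.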