Poor lemmatization in Apple's Natural Language framework

I did a quick experiment to check the accuracy of lemmatization in Apple's Natural Language framework and the results are quite poor.

I wonder if I am doing something wrong or if the framework is really that bad.

For the experiment, I used code straight from Apple's documentation (which also gets repeated in the few online examples I could find).

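import NaturalLanguage

// `text` is one of the sample paragraphs defined further down.
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
// Walk the string word by word and print each token next to its lemma.
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma, options: options) { tag, tokenRange in
    print("\(text[tokenRange]): \(tag?.rawValue ?? "NO LEMMA")")
    return true // keep enumerating
}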

To test the output, I took a paragraph from a Euronews article, which is available in multiple languages.

The English version seems accurate, but in English, most words coincide with their lemma, so it's not a great benchmark.

I'm running the code in an Xcode Playground, on macOS 10.14.6. I tried both macOS and iOS as the platform for the playground, which makes no difference.

let text = "For the possible necessity of short-time work I want to make sure to build an incentive with connecting it to training. And I want Germany to be able to implement short-time work faster in case of a fast recession of the economic situation because of global economic risks."

// Output

For: for
the: the
possible: possible
necessity: necessity
of: of
short: short
time: time
work: work
I: I
want: want
to: to
make: make
sure: sure
to: to
build: build
an: an
incentive: incentive
with: with
connecting: connect
it: it
to: to
training: training
And: and
I: I
want: want
Germany: Germany
to: to
be: be
able: able
to: to
implement: implement
short: short
time: time
work: work
faster: fast
in: in
case: case
of: of
a: a
fast: fast
recession: recession
of: of
the: the
economic: economic
situation: situation
because: because
of: of
global: global
economic: economic
risks: risk

I then tried Italian, which is my native language, so I can verify it easily. Here I started to see some problems.

let text = "Voglio assicurare che siano creati degli incentivi nel settore del lavoro a orario ridotto, collegati con periodi di training. E voglio che la Germania sia in grado di incrementare questo tipo di offerta lavorativa in modo veloce, in caso di recessione dell'economia per non farsi travolgere dai rischi che a livello globale subiremmo."

// Output

Voglio: volersi
assicurare: assicurare
che: che
siano: essersi
creati: crearsi
degli: degli
incentivi: incentivo
nel: nel
settore: settore
del: del
lavoro: lavoro
a: a
orario: orario
ridotto: ridotto
collegati: collegarsi
con: con
periodi: periodo
di: di
training: training
E: e
voglio: volersi
che: che
la: la
Germania: Germania
sia: essersi
in: in
grado: grado
di: di
incrementare: incrementare
questo: questo
tipo: tipo
di: di
offerta: offerta
lavorativa: lavorativo
in: in
modo: modo
veloce: veloce
in: in
caso: caso
di: di
recessione: recessione
dell'economia: economia
per: per
non: non
farsi: farsi
travolgere: travolgere
dai: dai
rischi: rischio
che: che
a: a
livello: livello
globale: globale
subiremmo: subire

Here some verbs get strange lemmas: "volersi", "essersi", and "crearsi" are not the correct infinitives of these verbs (I would expect "volere", "essere", and "creare"). For some reason, they come out in a reflexive form, even though in most of the sample sentence these verbs are not used reflexively.

But it's when I try with Russian (which I speak at an intermediate level) that things really fall apart.

let text = "Чтобы быть готовыми к потенциальному появлению необходимости в краткосрочной работе, я хочу создать возможности для обучения. Я хочу, чтобы Германия могла быстро выполнять работу в самые сжатые сроки в случае быстрого спада экономики из-за нависших над ней глобальных рисков."

// Output

Чтобы: чтобы
быть: быть
готовыми: NO LEMMA
к: к
потенциальному: NO LEMMA
появлению: NO LEMMA
необходимости: NO LEMMA
в: в
краткосрочной: NO LEMMA
работе: NO LEMMA
я: я
хочу: NO LEMMA
создать: NO LEMMA
возможности: NO LEMMA
для: для
обучения: NO LEMMA
Я: я
хочу: NO LEMMA
чтобы: чтобы
Германия: Германия
могла: мочь
быстро: NO LEMMA
выполнять: NO LEMMA
работу: NO LEMMA
в: в
самые: самый
сжатые: NO LEMMA
сроки: NO LEMMA
в: в
случае: случай
быстрого: NO LEMMA
спада: спад
экономики: NO LEMMA
из: из
за: за
нависших: NO LEMMA
над: над
ней: ней
глобальных: NO LEMMA
рисков: риск

Here most words produce no lemma. These are not rare words, and in my opinion, they should not be hard to lemmatize either (but I am no NLP expert).

For example, the word "быстро" (fast, quickly) is among the 300 most common Russian words and is the adverb derived from the adjective "быстрый". That's definitely a word I would expect a lemmatizer to recognize. The same goes for "хочу", which means "I want".

The Italian output already leaves me perplexed, but the Russian one is definitely unusable.

Am I doing something wrong, or is Apple's framework really that bad?

1 answer

  • answered 2019-08-19 07:11 0x5050

    Lemmatization is a language-dependent process. The same model for lemmatization will probably not give accurate results across multiple languages. The lemmatizer in question is probably using English as default.

    You can try setting the correct language for the lemmatizer using the setLanguage function in the NLTagger class. The details of the API are here.
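
    As a minimal sketch of that suggestion, the snippet below pins the tagger to a specific language with `setLanguage(_:range:)` before enumerating lemmas. The helper name `lemmas(in:language:)` and the short Russian sample are my own, not from the question, and I have not verified that this changes the output: missing lemmas could also mean the word is simply not covered for that language.

    ```swift
    import NaturalLanguage

    // Hypothetical helper (not part of the framework): returns each word
    // token together with its lemma, or nil when the tagger has none.
    func lemmas(in text: String, language: NLLanguage) -> [(token: String, lemma: String?)] {
        let tagger = NLTagger(tagSchemes: [.lemma])
        tagger.string = text

        let fullRange = text.startIndex..<text.endIndex
        // Tell the tagger which language to assume instead of letting it guess.
        tagger.setLanguage(language, range: fullRange)

        var results: [(token: String, lemma: String?)] = []
        let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
        tagger.enumerateTags(in: fullRange, unit: .word, scheme: .lemma, options: options) { tag, tokenRange in
            results.append((token: String(text[tokenRange]), lemma: tag?.rawValue))
            return true // keep enumerating
        }
        return results
    }

    // Usage: same print format as in the question.
    let russianSample = "Я хочу, чтобы Германия могла быстро выполнять работу."
    for (token, lemma) in lemmas(in: russianSample, language: .russian) {
        print("\(token): \(lemma ?? "NO LEMMA")")
    }
    ```

    Comparing this output with the output of the unmodified code should show whether the language setting is the culprit.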

    You can predict the language of a given text using:

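    import Foundation

    // The .language scheme lets the tagger identify the language of its string.
    let tagger = NSLinguisticTagger(tagSchemes: [.language], options: 0)
    tagger.string = "NSLinguisticTagger provides text processing APIs."
    let language = tagger.dominantLanguage // optional language code string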
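
    Since the question uses the Natural Language framework rather than NSLinguisticTagger, here is a sketch of the same check with `NLLanguageRecognizer`. The sample string is an excerpt of the Italian paragraph from the question; the variable names are illustrative. I am not stating what it prints, since that depends on the recognizer.

    ```swift
    import NaturalLanguage

    let sample = "Voglio assicurare che siano creati degli incentivi nel settore del lavoro a orario ridotto."

    // Convenience call: the single most likely language, or nil if undetermined.
    if let detected = NLLanguageRecognizer.dominantLanguage(for: sample) {
        print("Detected language code: \(detected.rawValue)")
    } else {
        print("Could not determine the language")
    }

    // Instance API: ask for up to three ranked guesses with their probabilities.
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(sample)
    let hypotheses = recognizer.languageHypotheses(withMaximum: 3)
    for (language, probability) in hypotheses.sorted(by: { $0.value > $1.value }) {
        print("\(language.rawValue): \(probability)")
    }
    ```

    The resulting `NLLanguage` value can be passed straight to the tagger's `setLanguage(_:range:)`.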