Extraction of Transliteration Pairs from Parallel Corpora Using a Statistical Transliteration Model
Chun-Jen Lee (1, 2), Jason S. Chang (2), Jyh-Shing Roger Jang (2)

(1) Telecommunication Labs., Chunghwa Telecom Co., Ltd., Chungli, Taiwan. cjlee@cht.com.tw
(2) Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan. {jschang, jang}@cs.nthu.edu.tw

Abstract

This paper describes a framework for modeling the machine transliteration problem. The parameters of the proposed model are automatically acquired through statistical learning from a bilingual proper name list. Unlike previous approaches, the model does not involve the use of either a pronunciation dictionary for converting source words into phonetic symbols or manually assigned phonetic similarity scores between source and target words. We also report how the model is applied to extract proper names and corresponding transliterations from parallel corpora. Experimental results show that the average rates of word and character precision are 93.8% and 97.8%, respectively.


Keywords: Transliteration pair; Transliteration model; Parallel corpora; Statistical learning; Machine transliteration

1 Introduction

Machine transliteration is very important for research in natural language processing (NLP), such as machine translation (MT), cross-language information retrieval (CLIR), question answering (QA), and bilingual lexicon construction. Proper nouns are not often found in existing bilingual dictionaries. Thus, it is difficult to handle transliteration only via simple dictionary lookup. Unfamiliar personal names, place names, and technical terms are especially difficult for human translators to transliterate correctly. In CLIR, the accuracy of transliteration greatly affects the retrieval performance. Recently, much research has been done on machine transliteration for many language pairs, such as English/Arabic [24, 1], English/Chinese [3, 28, 18], English/Japanese [12], and English/Korean [16, 11, 21]. Machine transliteration is classified into two types based on transliteration direction. Transliteration is the process that converts an original proper name in the source language into an approximate phonetic equivalent in the target language, whereas back-transliteration is the reverse process that converts the transliteration back into its original proper name. Lee and Choi [16] proposed an automatic learning procedure for English-to-Korean transliteration with limited evaluation. Chen et al. [3] proposed a method for Chinese-to-English back-transliteration. In that heuristic approach, letters commonly
shared between a Romanized Chinese word and an original English word are considered. The model is also enhanced with pronunciation rules. Knight and Graehl [12] explored a generative model for Japanese-to-English back-transliteration based on the source-channel framework. Stalls and Knight [24] extended that approach to Arabic-to-English back-transliteration. Wan and Verspoor [28] proposed a method for English-to-Chinese place name transliteration based on heuristic rules for relationships between English phonemes and the Chinese phonetic system. Kang and Choi [11] proposed a method based on decision trees to learn transliteration and back-transliteration rules between English and Korean. Lin and Chen [18] proposed a learning algorithm for Chinese-to-English back-transliteration using both a pronunciation dictionary and a speech synthesis system to generate the pronunciation of an English proper name. Oh and Choi [21] presented an English-to-Korean transliteration model using a pronunciation dictionary and contextual rules. Al-Onaizan and Knight [1] presented a spelling-based model for Arabic-to-English named entity transliteration. Most of the above approaches require a pronunciation dictionary for converting a source word into a sequence of pronunciations. However, words with unknown pronunciations may cause problems for transliteration. In addition, Chen et al. [3] and Oh and Choi [21] used a language-dependent penalty function to measure the similarity between a proper name and its corresponding transliteration. For learning the rules of transliteration and back-transliteration, Kang and Choi [11] used a language-dependent penalty function to perform phonetic
alignment between pairs of English words and Korean transliterations. Wan and Verspoor [28] also used handcrafted heuristic mapping rules. This may lead to problems when porting to other language pairs. In recent years, much research has focused on the study of automatic bilingual lexicon construction based on bilingual corpora. Proper names and corresponding transliterations can often be found in parallel corpora or topic-related bilingual comparable corpora. However, as noted by Tsuji [26], many previous methods [6, 13, 30, 20, 25] dealt with this problem based on the frequencies of words appearing in corpora, an approach which cannot be effectively applied to low-frequency words, such as transliterated words. In this paper, we present a framework for extracting English and Chinese transliterated word pairs based on the proposed statistical machine transliteration model to overcome the problem. Compared with previous approaches, the method proposed in this paper requires no pronunciation dictionary for converting source words into phonetic symbols. Additionally, the parameters of the model are automatically learned from a bilingual proper name list using the Expectation Maximization (EM) algorithm [7]. No manually assigned phonetic similarity scores between bilingual name pairs are required. Moreover, the learning approach is unsupervised except for the use of seed constraints based on phonetic knowledge to accelerate the convergence of EM training. To capture grapheme-level string mapping more precisely, a mapping scheme based on transliteration units, instead of individual characters, is adopted in
this study. Furthermore, how the model can be applied to the extraction of proper names and transliterations from parallel corpora is described. The remainder of the paper is organized as follows: Section 2 gives an overview of machine transliteration and describes the proposed approach. Section 3 describes how the model is applied to the extraction of transliterated target words from parallel texts. The experimental setup and a quantitative assessment of performance are presented in Section 4. Concluding remarks are made in Section 5.

2 Statistical Machine Transliteration Model

In this section, we first give an overview of machine transliteration and briefly illustrate our approach with an example. A formal description of the proposed transliteration model and a parameter estimation procedure based on the EM algorithm will be presented in Section 2.2 and Section 2.3, respectively.

2.1 Overview of the Noisy Channel Model

One can consider machine transliteration as a noisy channel, as illustrated in Figure 1. Briefly, the language model generates a source proper name E, and the transliteration model converts the proper name E into a target transliteration C.
Language Model  --->  E  --->  Transliteration Model  --->  C
     P(E)                            P(C|E)

Figure 1. The noisy channel model in machine transliteration.

Throughout the rest of the paper, we assume that E is written in English, while C is written in Chinese. Since Chinese and English are not in the same language family, there is no simple or direct way of performing mapping and comparison. One feasible solution is to adopt a Chinese Romanization system (see "http://www.romanization.com/index.html" and "http://www.edepot.com/taoroman.html") to represent the pronunciation of each Chinese character. Among the many Romanization systems for Chinese, Wade-Giles and Hanyu Pinyin are the most widely used. The Wade-Giles system is commonly adopted in Taiwan today and has traditionally been popular among Western scholars. For this reason, we use the Wade-Giles system to Romanize Chinese characters. However, the proposed approach is equally applicable to other Romanization systems. The language model gives the prior probability P(E), which can be modeled using maximum likelihood estimation. As for the transliteration model P(C|E), we can approximate it by decomposing E and the Romanization of C into transliteration units (TUs). A TU is defined as a sequence of characters transliterated as a
group. For English, a TU can be a monograph, a digraph, or a trigraph [29]. For Chinese, a TU can be a syllable initial, a syllable final, or a whole syllable [2], represented by Romanized characters. To illustrate how the approach works, we take the TU alignment in Figure 2 as an example. The English name "Smith" can be segmented into four TUs, "S-m-i-th," and aligned with the Romanized Chinese transliteration "Shihmissu" as depicted in Figure 2.

    S  -> Shih
    m  -> m
    i  -> i
    th -> ssu

Figure 2. TU alignment between English and Chinese Romanized character sequences.

Intuitively, the probability of the transliteration given "Smith" can be approximated by the following equation based on TU decomposition:

$$P(C \mid \text{Smith}) \approx P(\text{Shihmissu} \mid \text{Smith}) \approx P(\text{Shih} \mid \text{S})\,P(\text{m} \mid \text{m})\,P(\text{i} \mid \text{i})\,P(\text{ssu} \mid \text{th}), \qquad (1)$$

where "Shihmissu" is the Wade-Giles Romanization of the Chinese transliteration of "Smith."

A formal description of this approximation scheme will be given in the next subsection.
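The TU decomposition of Eq. (1) amounts to a simple product of TU-pair probabilities. A minimal sketch; the probability values below are illustrative placeholders, not numbers from the paper (which learns them by EM, Section 2.3):

```python
# Hypothetical TU-pair probabilities P(v|u); the paper estimates these via EM.
TU_PROB = {("S", "Shih"): 0.4, ("m", "m"): 0.9, ("i", "i"): 0.8, ("th", "ssu"): 0.3}

def transliteration_prob(alignment):
    """Approximate P(C|E) as the product of P(v_i|u_i) over aligned TU pairs (Eq. 1)."""
    p = 1.0
    for u, v in alignment:
        p *= TU_PROB.get((u, v), 0.0)  # unseen pairs get probability 0
    return p

# "Smith" segmented as S-m-i-th, aligned with Shih-m-i-ssu:
smith_alignment = [("S", "Shih"), ("m", "m"), ("i", "i"), ("th", "ssu")]
print(transliteration_prob(smith_alignment))
```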
2.2 Formal Description: Statistical Transliteration Model (STM)

A proper name E with l characters and a Romanized transliteration C with n characters are denoted by $e_1^l$ and $c_1^n$, respectively. Assume that the number of aligned TUs for (E, C) is N, and let $M = \{m_1, m_2, \ldots, m_N\}$ be an alignment candidate, where $m_j$ is the match type of the j-th TU. The match type is defined as a pair of the lengths of the TUs in the two languages. For instance, in the case of ("Smith," "Shihmissu"), N is 4, and M is {1-4, 1-1, 1-1, 2-3}. We write E and C as follows:

$$\begin{cases} E = e_1^l = u_1^N = u_1, u_2, \ldots, u_N, \\ C = c_1^n = v_1^N = v_1, v_2, \ldots, v_N, \end{cases} \qquad (2)$$

where $u_i$ and $v_j$ are the i-th TU of E and the j-th TU of C, respectively. Then the probability of C given E, P(C|E), is formulated as follows:

$$P(C \mid E) = \sum_M P(C, M \mid E) = \sum_M P(C \mid M, E)\, P(M \mid E). \qquad (3)$$

Theoretically, Eq. (3) is computed over all possible alignments M. To reduce the computational cost, one alternative is to replace the summation with the single best alignment. Therefore, the process of finding the most probable transliteration C* for a given E can be formulated as:

$$C^* = \arg\max_C \max_M P(C \mid M, E)\, P(M \mid E) = \arg\max_C \max_M P(C \mid M, E)\, P(M). \qquad (4)$$

We can approximate $P(C \mid M, E)\, P(M)$ as follows:

$$P(C \mid M, E)\, P(M) = P(v_1^N \mid u_1^N)\, P(m_1, m_2, \ldots, m_N) \approx \prod_{i=1}^{N} P(v_i \mid u_i)\, P(m_i). \qquad (5)$$

Therefore, we have

$$C^* = \arg\max_C \max_M \prod_{i=1}^{N} P(v_i \mid u_i)\, P(m_i). \qquad (6)$$

Then, the transliteration score function for C, given E, is formulated as

$$\mathrm{Score}_{\mathrm{STM}}(C) \equiv \max_M \log \prod_{i=1}^{N} P(v_i \mid u_i)\, P(m_i) = \max_M \sum_{i=1}^{N} \big( \log P(v_i \mid u_i) + \log P(m_i) \big). \qquad (7)$$

Let S (i, j ) be the maximum accumulated log score between the first i characters of E and the first j characters of C. Then, Score STM (C ) = S (l , n) , the maximum accumulated log score among all possible alignment paths of E with length l and of C with length n, can be computed using a dynamic programming strategy, as shown in the following:
Step 1 (Initialization):

$$S(0, 0) = 0. \qquad (8)$$

Step 2 (Recursion):

$$S(i, j) = \max_{h,k} \left[ S(i-h,\, j-k) + \log P(c_{j-k}^{\,j} \mid e_{i-h}^{\,i}) + \log P(h, k) \right], \quad 0 \le i \le l,\ 0 \le j \le n, \qquad (9)$$

where $\log P(h, k)$ is defined as the log score of the match type "h-k", which corresponds to the last term in Eq. (7).

Step 3 (Termination):

$$\mathrm{Score}_{\mathrm{STM}}(C) = S(l, n). \qquad (10)$$

In practice, the values of h and k are limited to a small set.
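The dynamic program of Eqs. (8)-(10) can be sketched as follows. The parameter tables are illustrative placeholders (the paper estimates them by EM, Section 2.3), and the match types are restricted to a small lookup table, as in practice:

```python
import math

# Illustrative log-parameters (placeholders, not values from the paper):
LOG_TU = {("S", "Shih"): math.log(0.4), ("m", "m"): math.log(0.9),
          ("i", "i"): math.log(0.8), ("th", "ssu"): math.log(0.3)}
LOG_MATCH = {(1, 4): math.log(0.1), (1, 1): math.log(0.5), (2, 3): math.log(0.1)}
NEG_INF = float("-inf")

def score_stm(e, c, max_h=3, max_k=5):
    """Compute Score_STM(C) = S(l, n) by dynamic programming (Eqs. 8-10)."""
    l, n = len(e), len(c)
    S = [[NEG_INF] * (n + 1) for _ in range(l + 1)]
    S[0][0] = 0.0                                   # Step 1: initialization
    for i in range(l + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF                          # Step 2: recursion over match types h-k
            for h in range(0, min(max_h, i) + 1):
                for k in range(0, min(max_k, j) + 1):
                    if h == 0 and k == 0:
                        continue
                    lp_tu = LOG_TU.get((e[i - h:i], c[j - k:j]), NEG_INF)
                    lp_mt = LOG_MATCH.get((h, k), NEG_INF)
                    prev = S[i - h][j - k]
                    if lp_tu > NEG_INF and lp_mt > NEG_INF and prev > NEG_INF:
                        best = max(best, prev + lp_tu + lp_mt)
            S[i][j] = best
    return S[l][n]                                  # Step 3: termination

print(score_stm("Smith", "Shihmissu"))
```

With these toy parameters, the only finite-scoring path is exactly the S/Shih, m/m, i/i, th/ssu segmentation from Figure 2.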

2.3 Estimation of Model Parameters

In the following, we describe the iterative procedure for re-estimating $P(v_j \mid u_i)$ and $P(m_i)$. We first define the following functions:

count(u_i, v_j) = the number of occurrences of the aligned pair (u_i, v_j) in the training set;

count(u_i) = the number of occurrences of u_i in the training set;

count(h, k) = the total number of occurrences of match type "h-k" in the training set.

Therefore, $P(v_j \mid u_i)$ can be estimated as follows:

$$P(v_j \mid u_i) = \frac{\mathrm{count}(u_i, v_j)}{\mathrm{count}(u_i)}. \qquad (11)$$

Similarly, $P(h, k)$ can be estimated as follows:

$$P(h, k) = \frac{\mathrm{count}(h, k)}{\sum_i \sum_j \mathrm{count}(i, j)}. \qquad (12)$$
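Eqs. (11) and (12) are plain relative-frequency estimates over aligned TU pairs. A minimal sketch; the toy pairs below are illustrative, not training data from the paper:

```python
from collections import Counter

def estimate_params(aligned_pairs):
    """Relative-frequency estimates of P(v|u) (Eq. 11) and P(h,k) (Eq. 12)."""
    pair_count = Counter(aligned_pairs)                             # count(u_i, v_j)
    u_count = Counter(u for u, _ in aligned_pairs)                  # count(u_i)
    mt_count = Counter((len(u), len(v)) for u, v in aligned_pairs)  # count(h, k)
    total = sum(mt_count.values())
    p_tu = {(u, v): c / u_count[u] for (u, v), c in pair_count.items()}
    p_mt = {hk: c / total for hk, c in mt_count.items()}
    return p_tu, p_mt

pairs = [("S", "Shih"), ("m", "m"), ("i", "i"), ("th", "ssu"), ("S", "Shih"), ("S", "Hsi")]
p_tu, p_mt = estimate_params(pairs)
print(p_tu[("S", "Shih")], p_mt[(1, 1)])
```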

Because count(u_i, v_j) is unknown initially, a reasonable approach to obtaining a rough estimate of the parameters of the transliteration model is to constrain the TU alignments of a word pair (E, C) to within a position distance δ [16]. Assume that $u_i = e_p^{\,p+h-1}$ and $v_j = c_q^{\,q+k-1}$; then the aligned pair (u_i, v_j) is constrained as follows:

$$\left|\, p - \frac{q \times l}{n} \,\right| < \delta \quad \text{and} \quad \left|\, (p+h-1) - \frac{(q+k-1) \times l}{n} \,\right| < \delta, \qquad (13)$$

where l and n are the lengths of the source word E and the target word C, respectively. To accelerate the convergence of the model training process and reduce the number of noisy TU aligned pairs (u_i, v_j), we restrict the combination of TU pairs to limited patterns in the beginning. Based on the assumption that the articulatory representations of phonemes are very similar across languages, we consider the phonemes of TUs to be independent of the underlying languages. In this approach, the similarities of the phonemes of TUs are initially classified based on phonetic knowledge. Only consonant TU pairs with the same or similar phonemes can be matched. An English consonant can also be matched with a Chinese syllable beginning with the same or a similar phoneme. An English semivowel TU can be matched either with a Chinese consonant or vowel with the same or a similar phoneme, or with a Chinese syllable beginning with the same or a similar phoneme. In the initialization phase, P(h, k) is set to a uniform distribution, as shown in the following:

$$P(h, k) = \frac{1}{T}, \qquad (14)$$

where T is the total number of match types allowed. Based on the EM algorithm with Viterbi decoding [8], the iterative parameter estimation procedure is described as follows:
Step 1 (Initialization):

Use Eq. (13) to generate likely TU alignment pairs. Calculate the initial model parameters, P (v j | ui ) and P(h, k ) , using Eq. (11) and Eq. (14), respectively.
Step 2 (Expectation):

Based on the current model parameters, find the best Viterbi path for each E and C word pair in the training set.
Step 3 (Maximization):

Based on all the TU alignment pairs obtained in Step 2, calculate the new model parameters using Eq. (11) and Eq. (12). Replace the model parameters with the new model parameters. If a stopping criterion or a predefined number of iterations is reached, then stop the training procedure. Otherwise, go back to Step 2.
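The seed filter of Eq. (13) used in Step 1 can be sketched as a simple predicate over TU positions. Positions are 1-based, and delta is a tunable threshold (the value below is our choice for illustration):

```python
def within_distance(p, q, h, k, l, n, delta):
    """Eq. (13): allow aligning u_i = e_p..e_{p+h-1} with v_j = c_q..c_{q+k-1}
    only if both TU endpoints stay within delta positions after rescaling
    target positions by the length ratio l/n."""
    start_ok = abs(p - q * l / n) < delta
    end_ok = abs((p + h - 1) - (q + k - 1) * l / n) < delta
    return start_ok and end_ok

# "Smith" (l=5) vs. "Shihmissu" (n=9):
print(within_distance(1, 1, 1, 4, 5, 9, delta=3))  # "S" vs. "Shih": True
print(within_distance(1, 9, 1, 1, 5, 9, delta=3))  # far-apart pair: False
```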

In the first iteration, TUs in English and Chinese are constrained based on phonetic knowledge. However, in the subsequent iterations, the whole training process is run in a totally unsupervised manner. Therefore, some new TUs are automatically discovered from the training data within the constraints of match types, as demonstrated in Section 4.
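The whole procedure amounts to hard (Viterbi) EM. The sketch below is deliberately tiny and is ours, not the paper's implementation: it only considers match types 1-1 and 1-2, floors unseen parameters at a small constant instead of applying the phonetic seed constraints, and works on Romanized toy strings:

```python
import math
from collections import Counter

def viterbi_align(e, c, p_tu, p_mt, match_types=((1, 1), (1, 2))):
    """E-step helper: best-scoring TU alignment of (e, c) under current parameters."""
    score = {(0, 0): 0.0}
    back = {(0, 0): None}
    for i in range(len(e) + 1):
        for j in range(len(c) + 1):
            if (i, j) not in score:
                continue
            for h, k in match_types:
                u, v = e[i:i + h], c[j:j + k]
                if len(u) < h or len(v) < k:
                    continue
                s = (score[(i, j)] + math.log(p_tu.get((u, v), 1e-6))
                     + math.log(p_mt.get((h, k), 1e-6)))
                if s > score.get((i + h, j + k), float("-inf")):
                    score[(i + h, j + k)] = s
                    back[(i + h, j + k)] = (i, j, u, v)
    pairs, cell = [], (len(e), len(c))  # backtrack from the full-coverage cell
    while back.get(cell):
        i, j, u, v = back[cell]
        pairs.append((u, v))
        cell = (i, j)
    return list(reversed(pairs))

def em_train(word_pairs, iterations=3):
    """Hard EM: alternate Viterbi alignment (E-step) with the relative-frequency
    re-estimates of Eqs. (11) and (12) (M-step)."""
    p_tu, p_mt = {}, {}  # unseen parameters fall back to the 1e-6 floor above
    for _ in range(iterations):
        aligned = []
        for e, c in word_pairs:
            aligned.extend(viterbi_align(e, c, p_tu, p_mt))
        pair_count = Counter(aligned)
        u_count = Counter(u for u, _ in aligned)
        mt_count = Counter((len(u), len(v)) for u, v in aligned)
        total = sum(mt_count.values())
        p_tu = {(u, v): cnt / u_count[u] for (u, v), cnt in pair_count.items()}
        p_mt = {hk: cnt / total for hk, cnt in mt_count.items()}
    return p_tu, p_mt

p_tu, p_mt = em_train([("lin", "lin"), ("li", "li")])
print(p_tu[("l", "l")], p_mt[(1, 1)])
```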

3 Extraction of Transliterations from Parallel Corpora

The proposed transliteration model can be applied to the tasks of the extraction of bilingual name and transliteration pairs [14], and back-transliteration [15]. These tasks become more challenging for language pairs with different sound systems, such as Chinese/English, Japanese/English, and Arabic/English. For clarity of the paper, we focus on the extraction of
English-Chinese name and transliteration pairs. However, the proposed framework is easily extendable to other language pairs.

3.1 Overall Process

For the purpose of extracting name and transliteration pairs from parallel corpora, a sentence alignment procedure is applied first to align parallel texts at the sentence level. Then, we use a part of speech tagger to identify proper nouns in the source text. After that, the machine transliteration model is applied to isolate the transliteration in the target text. In general, the proposed transliteration model can be further augmented by linguistic processing, which will be described in more detail in the next subsection. The overall process is summarized in Figure 3.

[Figure 3 schematic:

  Bilingual Corpus -> Sentence Alignment -> Source-Language Sentences + Target-Language Sentences
  Preprocess: Proper-Name Word Extraction -> Source Words
  Main Process: Linguistic Processing + Transliteration Model -> Proper Names: Source & Target Words]

Figure 3. The overall process for extracting name and transliteration pairs from parallel corpora.

An excerpt from the magazine Scientific American [5] is given in the following:

Source language sentence:

“Rudolf Jaenisch, a cloning expert at the Whitehead Institute for Biomedical Research at the Massachusetts Institute of Technology, concurred:”

Target language sentence:

[The corresponding Chinese sentence is illegible in this copy.]

In the above excerpt, three English proper nouns, "Jaenisch," "Whitehead," and "Massachusetts," were identified from the results of tagging. Utilizing Eq. (7) and Viterbi decoding, we found that the target word Romanized as "huaihaite" most likely corresponded to "Whitehead." The other word pair, (Jaenisch, "chiehnihsi"), can also be extracted through a similar process. However, the third word pair, (Massachusetts, "masheng"), failed to be extracted by the proposed approach. The reason is that "masheng" is an abbreviation of "masachusaichou," a well-established popular translated name of "Massachusetts." Therefore, the proposed model is incapable of resolving abbreviations of this kind. In order to retrieve the transliteration for a given proper noun, we need to keep track of the optimal TU decoding sequence associated with the given Chinese term for each word pair under the proposed method. It can easily be obtained by backtracking the best Viterbi path [19]. For the name-transliteration pair (Whitehead, "huaihaite") mentioned above, the alignments of the TU matching pairs via the Viterbi path are illustrated in Figure 4 and Figure 5.

[Figure 4: TU matching pairs along the Viterbi path for "Whitehead" (Romanized transliteration "huaihaite"):

    Wh -> hu,  i -> a,  t -> (null),  e -> i,  h -> h,  (null) -> a,  ea -> i,  d -> te

with match types 2-2, 1-1, 1-0, 1-1, 1-1, 0-1, 2-1, 1-2; the surrounding characters of the sentence align to null English TUs (match type 0-1).]

Figure 4. The alignments of the TU matching pairs via the Viterbi path.

[Figure 5: the corresponding alignment path through the dynamic-programming lattice.]

Figure 5. The Viterbi alignment path. In this example, the word "Whitehead" is decomposed into seven TUs, "Wh-i-t-e-h-ea-d," and aligned with the Romanization "huaihaite" of the transliteration.


3.2 Linguistic Processing

Some language-dependent knowledge can be integrated to further improve the performance, especially when we focus on specific language pairs.
3.2.1 Linguistic Processing Rule 1 (R1):

Some source words have both translations and transliterations, which are equally acceptable and can be used interchangeably. For example, the translation and the transliteration of the source word "England" are 英國 (Romanized "Yingkuo") and 英格蘭 (Romanized "Yingkolan"), respectively, as shown in Figure 6. Since the proposed model is designed specifically for transliteration, such cases may cause problems. One way to overcome this limitation is to handle these cases by using a list of commonly used proper names and translations. A portion of the list is shown in Table 1.

[Figure 6: "England vs. 英國" -- "The Spanish Armada sailed to England in 1588." (the Chinese sentence uses the translation 英國); "England vs. 英格蘭" -- "England is the only country coterminous with Wales." (the Chinese sentence uses the transliteration 英格蘭).]

Figure 6. Examples of mixed usages of translation and transliteration.


Table 1. A portion of the list for translation.

Source Word    Target Word  |  Source Word   Target Word
Afghanistan    阿富汗        |  England       英國
America        美國          |  France        法國
Asia           亞洲          |  Greece        希臘
Canada         加拿大        |  India         印度
China          中國          |  Spanish       西班牙
Christ         基督          |  Yugoslavia    南斯拉夫

3.2.2 Linguistic Processing Rule 2 (R2):

From error analysis of the aligned results of the training set, we have found that the proposed approach suffers from fluid TUs, such as "t," "d," "tt," "dd," "te," and "de." Sometimes they are omitted during transliteration, and sometimes they are transliterated as Chinese characters. For instance, "d" is usually transliterated as one of a few Chinese characters corresponding to the Chinese TU "te." The English TU "d" is transliterated as such a character in "Clifford" but left out in "Radford." This phenomenon causes problems; in the example shown in Figure 7, the TU "d" in "David" is mistakenly matched up with a character beyond the correct transliteration.

[Figure 7: for the sentence "A boy by the name of David.", the extraction aligns "Davi" with the transliteration (Romanized "Ta-wei") but mistakenly matches the final "d" with the following character (Romanized "te").]

Figure 7. Example of transliterated word extraction for "David."


Similarly, the English TU "s" or "se" is likely to misalign with the character whose TU is "shih," as in the sentence "Athens was one of the most powerful city-states of ancient Greece." See Figure 8 for more details.

[Figure 8: the extraction aligns "Athen" with the transliteration (Romanized "Ya-tien") but mistakenly matches the final "s" with the following character (Romanized "shih").]

Figure 8. Example of transliterated word extraction for "Athens."

However, the problem caused by fluid TUs can be partly overcome by adding more linguistic constraints in the post-processing phase. We calculate the Chinese character distributions of proper nouns from the corpus. A small set of Chinese characters is often used for transliteration. Therefore, it is possible to improve the performance by pruning extra tailing characters, which do not belong to the transliterated character set, from the transliteration candidates. For instance, the probability of the extra trailing character in the example above being used in transliteration is very low; therefore, the correct transliteration (Romanized "Tawei") for the source word "David" can be extracted by removing that character. We denote this strategy as Rule 2 (R2).


3.3 Work Flow of Integrating Linguistic and Statistical Information

Combining the linguistic processing and transliteration model, we present the algorithm for transliteration extraction as follows:
Step 1: Look up the translation list as stated in R1. If the translation of a source word appears both in the entry of the translation list and in the aligned target sentence (or paragraph), then pick the translation as the target word. Otherwise, go to Step 2.
Step 2: Pass the source word and its aligned target sentence (or paragraph) through

the proposed model to extract the target word. Once this is done, go to Step 3.
Step 3: Apply linguistic processing R2 to remove superfluous tailing characters from

the extracted transliterations.

After the above steps are completed, the performance of source-target word extraction is significantly improved.
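The three steps above can be sketched as one function. `TRANSLATION_LIST`, `stm_extract`, and `prune_tail` are illustrative stand-ins (with toy behavior) for the R1 list, the Viterbi-based extractor of Section 3.1, and the R2 rule:

```python
TRANSLATION_LIST = {"England": "Yingkuo"}  # R1 list, Romanized for illustration

def stm_extract(source_word, target_sentence):
    """Stand-in for Viterbi-based extraction (Section 3.1): fixed toy output."""
    return "tawei te"

def prune_tail(candidate):
    """Stand-in for R2: drop a spurious trailing syllable."""
    return candidate[:-3] if candidate.endswith(" te") else candidate

def extract_target(source_word, target_sentence):
    entry = TRANSLATION_LIST.get(source_word)               # Step 1: R1 lookup
    if entry and entry in target_sentence:
        return entry
    candidate = stm_extract(source_word, target_sentence)   # Step 2: STM model
    return prune_tail(candidate)                            # Step 3: R2 pruning

print(extract_target("England", "... Yingkuo ..."))  # -> Yingkuo (via R1)
print(extract_target("David", "... tawei te ..."))   # -> tawei (STM + R2)
```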

4 Experiments

In this section, we focus on the setup for the experiments and a performance evaluation of the proposed model applied to extract bilingual word pairs from parallel corpora.


4.1 Experimental Setup

Several corpora were collected to estimate the parameters of the proposed models and to evaluate the performance of the proposed approach. The corpus T0 for training consisted of 2,430 pairs of English names and transliterations in Chinese. The training corpus, composed of a bilingual proper name list, was collected from the "Handbook of English Name Knowledge" edited by Huai [10]. The bilingual proper name list consists of first names, last names, and nicknames. For example, (Adolf, "Ataofu") and (Adelaide, "Atelaite") are first names, (Abbey, "Api") and (Adela, "Atela") are last names, and (Archie, "Aerhchi") and (Allie, "Ali") are nicknames, for males and females, respectively. Some first names are also used as last names. For instance, "Abel" can be either a first name or a last name. In the experiment, three sets of parallel-aligned texts [4], P1, P2, and P3, were prepared to evaluate the performance of the proposed methods. P1 consisted of 500 bilingual examples from the English-Chinese version of the Longman Dictionary of Contemporary English (LDOCE) [22]. P2 consisted of 300 aligned sentences from Scientific American, USA and Taiwan editions. P3 consisted of 300 aligned sentences from the Sinorama Corpus [23].

Footnote 2: Scientific American: "http://www.sciam.com" (USA edition) and "http://www.sciam.com.tw" (Taiwan edition).

Table 2. Some samples from the training set T0. Source words (the Chinese target words are illegible in this copy): Abe, Abbey, Abbot, Archer, Adolf, Adolphus, Adela, Adelaide, Arden, Albert, Alfonso, Alfie, Alf, Algy, Algernon, Alma, Almeric, Archie, Alva, Alphonsus, Alphonso, Afra, Avril, Agnes, Argus, Agatha, Acton, Arkwright, Arabella, Alaric, Alasdair, Alastair, Alethea, Alonzo, Ariadne, Allegra, Alister, Allie, Arlene, Alan, Aloys, Aloysius, Amadeus, Amabel, Amanda, Amelia, Arms, Armstrong, Anastasia, Arno.

In the experiment, we dealt with personal and place names as well as their transliterations from the parallel corpora. The performance of transliteration extraction was evaluated based on the precision rates of transliteration words or characters. For simplicity, we considered each proper name in the source sentence in turn and determined its corresponding transliteration independently. Table 3 shows some examples from the testing set P1.


Table 3. Some bilingual examples from the testing set P1 (English side shown; the aligned Chinese sentences are illegible in this copy).

He is a (second) Caesar in speech and leadership.
Hamlet kills the king in Act 5 Scene 2.
Can you adduce any reason at all for his strange behaviour, Holmes?
To see George, of all people, in the Ritz Hotel!
He has 2 caps for playing cricket for England.
They appointed him to catch all the rats in Hamelin.
Burlington Arcade is a famous shopping passage in London.
The architecture of ancient Greece.
Drink Rossignol, the aristocrat of table wines!
Cleopatra was bitten by an asp.
I shall soon be leaving for an assignment in India.
Our plane stopped at London (airport) on its way to New York.
Schoenberg used atonality in the music of his middle period.
This tune is usually attributed to J. S. Bach.
Now that this painting has been authenticated as a Rembrandt, it's worth 10 times as much as I paid for it!
Byron awoke one morning to find himself famous.

4.2 TUs for English and Chinese

The proposed model is based on TUs, which are more linguistically motivated than individual characters. Table 4 lists some of the most frequently occurring English TUs of length 1 to 3. Table 5 lists some of the most frequently occurring Chinese TUs. Table 6 shows some English-Chinese TU-mapping probabilities automatically estimated from all of the training data.

Table 4. Some high-frequency English TUs.

Length of English TU u   High-Frequency TUs
1                        a, e, i, n, l, s, o, r, d, t
2                        er, ie, ar, ll, th, or, ch, tt, ck, ph
3                        lle, sch

Table 5. Some high-frequency Chinese TUs.

Length of Chinese TU v   High-Frequency TUs
1                        i, a, l, n, o, t, e, p, m, u
2                        te, ei, ai, ch, ko, hs, ng, ao, pu, fu
3                        ssu, erh, ieh, chi, hsi
4                        shih
5                        chieh

Table 6. English-Chinese TU-mapping probabilities.

u    v    P(v|u)   |   u    v    P(v|u)
ae   h    0.272    |   ei   i    0.900
ae   ei   0.571    |   eu   yu   0.785
ai   i    0.214    |   ew   u    0.500
ai   a    0.500    |   ey   i    0.998
ar   e    0.250    |   f    f    0.586
au   a    0.794    |   ff   f    0.733
au   o    0.772    |   ff   fu   0.266
aw   ao   0.545    |   g    ko   0.350
aw   o    0.454    |   g    ch   0.345

The automatic learning process resulted in mostly regular monographs and digraphs found in pronunciation dictionaries, such as the Longman Pronunciation Dictionary (LPD) [29], including "rh" and "au." However, it also learned additional TUs, such as "cq" in the personal names "Jacqueline" and "Jacquetta." For example, after the second iteration of EM training, the most likely TU alignment sequence of the name pair (Jacqueline, Chiehkueilin) is shown in Figure 9.

J -> ch,  a -> ieh,  cq -> k,  u -> u,  e -> ei,  l -> l,  i -> i,  ne -> n

Figure 9. TU alignment of the name pair (Jacqueline, Chiehkueilin).

It should be noted that an original word may have more than one transliteration. For instance, the English name "Beaufort" has several possible Chinese transliterations, Romanized as "Paofu," "Paofo," "Pufu," and "Paofote." The TUs of the word "Beaufort" were automatically and dynamically constructed and aligned with their corresponding transliteration TUs via the proposed model. The results are shown in Figure 10.

B -> p,  ea -> a,      u -> o,  f -> f,  or -> u,  t -> (null)   ("Paofu")
B -> p,  ea -> a,      u -> o,  f -> f,  or -> o,  t -> (null)   ("Paofo")
B -> p,  ea -> (null), u -> u,  f -> f,  or -> u,  t -> (null)   ("Pufu")
B -> p,  ea -> a,      u -> o,  f -> f,  or -> o,  t -> te       ("Paofote")

Figure 10. TU alignment of "Beaufort" and corresponding transliterations.

Although Knight and Graehl [12] applied EM to automatically learn the similarities of English-Japanese name pairs, English words and Japanese katakana words have to be converted into English sounds and Japanese sounds, respectively, via pronunciation dictionaries. Each English sound can map to one or more Japanese sounds. Compared with their study, one of the advantages of our approach is that we do not have to find the exact pronunciations via dictionary lookup or various grapheme-to-phoneme rules. To be more specific, a set of often-used Chinese characters for transliteration was selected from the
collected corpora. Although many Chinese characters have more than one pronunciation, we found that almost all the characters used for transliteration have unique pronunciations. For those Chinese characters not used for transliteration, we chose the most frequently used pronunciation instead. Since we focus on transliterated words, we do not apply any Chinese pronunciation-disambiguation algorithm to decide the exact pronunciation of each character. Thus, the Romanization of Chinese characters can be conducted directly via table lookup instead of using a pronunciation dictionary. Moreover, to accelerate the convergence of EM training and reduce noisy TU pairs in grapheme-level string mapping, we adopt a many-to-many mapping under the constraints of a limited set of match types based on phonetic knowledge. The maximum lengths of English and Chinese TUs are 3 and 5, respectively. Table 7 shows the match types and English and Chinese TUs obtained in our experiments.
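Because almost every transliteration character has a unique pronunciation, Romanization reduces to direct table lookup. A minimal sketch with a three-entry Wade-Giles fragment (the table entries are our illustration, not reproduced from the paper):

```python
# Tiny Wade-Giles lookup fragment (illustrative entries).
WADE_GILES = {"史": "shih", "密": "mi", "斯": "ssu"}

def romanize(word):
    """Concatenate per-character Wade-Giles syllables via direct table lookup."""
    return "".join(WADE_GILES[ch] for ch in word)

print(romanize("史密斯"))  # -> "shihmissu"
```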

Table 7. Examples for each match type. (The examples for the 0–1 and 1–0 types contain Chinese characters that are not recoverable from this copy and are omitted.)

Match Type   TU Pair Examples
1–1          (r, l), (y, i), (m, m)
1–2          (j, ch), (f, fu), (d, te)
1–3          (s, ssu), (l, erh), (r, erh)
1–4          (s, shih)
2–0          (gh, ε)
2–1          (bb, p), (ey, i), (mm, m)
2–2          (dg, ch), (wh, hu), (ck, ko)
2–3          (le, erh), (re, erh), (ce, ssu)
2–4          (ce, shih)
2–5          (ge, chieh)
3–2          (sch, hs)
3–3          (lle, erh)
3–4          (sch, shih)
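The match-type constraints bound the search over grapheme-level segmentations. A Viterbi-style decoder under these constraints can be sketched as below; this is a simplified illustration, not the paper's implementation. The TU-pair probabilities are hypothetical stand-ins for the EM-trained parameters, and ε-side TUs are written as empty strings.

```python
import math
from functools import lru_cache

# Allowed (English length, Chinese length) match types, with English
# TUs up to 3 letters and Romanized Chinese TUs up to 5 letters.
MATCH_TYPES = {(0, 1), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
               (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5),
               (3, 2), (3, 3), (3, 4)}

# Hypothetical TU-pair probabilities standing in for EM-trained values.
TU_PROB = {("b", "pa"): 0.9, ("a", ""): 0.5, ("ch", "ha"): 0.8}

def best_alignment(eng, rom, default_prob=1e-6):
    """Return (log-probability, TU pairs) of the best segmentation of an
    English word and a Romanized Chinese word, considering only the
    allowed match types.  Unseen TU pairs get a small default probability."""

    @lru_cache(maxsize=None)
    def search(i, j):
        if i == len(eng) and j == len(rom):
            return 0.0, ()
        best = (-math.inf, ())
        for de, dc in MATCH_TYPES:
            if i + de <= len(eng) and j + dc <= len(rom):
                pair = (eng[i:i + de], rom[j:j + dc])
                tail_score, tail = search(i + de, j + dc)
                score = math.log(TU_PROB.get(pair, default_prob)) + tail_score
                if score > best[0]:
                    best = (score, (pair,) + tail)
        return best

    return search(0, 0)

score, pairs = best_alignment("bach", "paha")
print(pairs)  # -> (('b', 'pa'), ('a', ''), ('ch', 'ha')) under the toy parameters
```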

4.3 Evaluation Metric

In the experiment, the performance of transliteration extraction was evaluated based on precision and recall rates at the word and character levels. Since we considered exactly one proper name in the source language and one transliteration in the target language at a time, the word recall rate was the same as the word precision rate:

Word Precision (WP) = (number of correctly extracted words) / (number of correct words).

The character-level recall and precision rates were defined as follows:

Character Precision (CP) = (number of correctly extracted characters) / (number of extracted characters),
Character Recall (CR) = (number of correctly extracted characters) / (number of correct characters).
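These definitions can be computed directly from parallel lists of extracted and reference transliterations. The sketch below is a simplified illustration with hypothetical data; counting character overlap position-insensitively is an assumption here, since the paper does not spell out the character-matching procedure.

```python
def evaluate(extracted, reference):
    """Compute word precision (WP), character precision (CP), and
    character recall (CR) per the definitions above.  `extracted` and
    `reference` are parallel lists, one transliteration per source
    proper name, so word recall equals word precision."""
    correct_words = sum(e == r for e, r in zip(extracted, reference))
    wp = correct_words / len(reference)

    # A character counts as "correctly extracted" if it occurs in both
    # the extracted and the reference word (multiset overlap; a
    # simplifying assumption about the matching procedure).
    correct_chars = extracted_chars = reference_chars = 0
    for e, r in zip(extracted, reference):
        extracted_chars += len(e)
        reference_chars += len(r)
        rest = list(r)
        for ch in e:
            if ch in rest:
                rest.remove(ch)
                correct_chars += 1
    cp = correct_chars / extracted_chars
    cr = correct_chars / reference_chars
    return wp, cp, cr

# Hypothetical example: 2 of 3 words extracted exactly right.
wp, cp, cr = evaluate(["kaisa", "shihmissu", "chungss"],
                      ["kaisa", "shihmissu", "chungssu"])
print(round(wp, 3), round(cp, 3), round(cr, 3))  # -> 0.667 1.0 0.955
```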

4.4 Experimental Results and Discussion

In the experiment of extracting transliterations from the data set P1, the STM model achieved, on average, a word precision rate of 86%, a character precision rate of 94.4%, and a character recall rate of 96.3%. As Table 8 shows, the performance could be further improved by means of simple statistical and linguistic processing.


Table 8. The experimental results of transliterated word extraction for P1, P2, and P3.

Test Set                   Method       WP      CP      CR
P1 (LDOCE)                 STM          86.0%   94.4%   96.3%
                           STM+R1       88.6%   95.4%   97.7%
                           STM+R2       90.8%   97.4%   95.9%
                           STM+R1+R2    94.2%   98.3%   97.7%
P2 (Scientific American)   STM          90.7%   96.9%   97.3%
                           STM+R1       92.7%   97.6%   97.9%
                           STM+R2       92.0%   97.8%   97.3%
                           STM+R1+R2    94.0%   98.3%   97.9%
P3 (Sinorama)              STM          86.7%   94.2%   96.1%
                           STM+R1       89.0%   94.9%   96.8%
                           STM+R2       87.7%   95.8%   94.9%
                           STM+R1+R2    93.0%   96.5%   96.7%

Table 9 shows some examples of Chinese transliterated words, correctly extracted using the STM model, from P1. Although, the STM model failed in some cases, most of these problems could be overcome through the addition of simple linguistic processing, as shown in Table 10. The error in the case of “Quirk” occurred because “Quirk” is much closer to “§J?M
??

(kohoko)” than to “?_§J

(KoKo),” based on phonetic similarity. In this case, the Chinese

transliteration plainly cannot be correctly extracted. Similar problems, due to similarities at the grapheme level, occurred with the name pairs (Tom, ???i “tangmu”) and (John, ?ù??

“yuehhan”), as shown in Table 10. It is obvious that a collection of commonly used or highly

31

varying transliterations can be incrementally added to a lookup list to further improve the system performance.


Table 9. Some examples of Chinese transliterations, correctly extracted by the STM model, from P1. (The Chinese characters are not recoverable from this copy; only the Romanizations of the extracted transliterations are shown.)

Bilingual Sentence (English side)                                 Extracted Transliteration
He is a second Caesar in speech and leadership.                   kaisa
In this case I'm acting for my friend Mr. Smith.                  shihmissu
What's your alibi for being late this time, Jones?                chungssu
Can you adduce any reason at all for his strange behaviour,
Holmes?                                                           fuerhmossu
They appointed him to catch all the rats in Hamelin.              hanmulin
Drink Rossignol, the aristocrat of table wines!                   lohsino
Cleopatra was bitten by an asp.                                   koliaopeitela
Schoenberg used atonality in the music of his middle period.      sangpoko
If you have to change trains in London, you may be able to
book through to your last station.                                luntun
This tune is usually attributed to J. S. Bach.                    paha
Byron awoke one morning to find himself famous.                   pailun
You must have kissed the Blarney Stone to be able to talk
like that!                                                        pulani
Quirk and Greenbaum collaborated on the new grammar.              kolinpang


Table 10. Some examples of possible Chinese transliterations extracted by the proposed approaches. ("*" means the Chinese transliterated word was not correctly extracted. The Chinese characters and the per-method columns for STM, STM+R1, STM+R2, and STM+R1+R2 are not fully recoverable from this copy; each row lists the Romanized candidate outputs observed across the four methods.)

Bilingual Sentence (English side)                                 Extracted Candidates
David, as you know, writes dictionaries.                          taweite*, tawei
The Mediterranean Sea bathes the sunny shores of Italy.           teitalihaian*, tichunghai
You have borne yourself bravely in this battle,
Lord Faulconbridge.                                               fokenpolichueh*, fokenpoli
Ancient Rome and Greece.                                          chihsi*, hsila
Jane is blossoming out into a beautiful girl.                     chen, cheni
Tom likes to boss younger children about.                         tang* (all methods)
Quirk and Greenbaum collaborated on the new grammar.              kohoko* (all methods)
John seems to have made a real conquest of Janet.
They're always together.                                          chen* (all methods)


We also performed the same experiments on the data sets P2 and P3; the results are shown in Table 8. Although the performance of the STM approach on the data sets P1 and P3 is worse than that on P2, the integrated scheme (STM+R1+R2) clearly exhibits considerable robustness in extracting transliterated words from data sets in various domains. The results in Table 11 show that the average rates of word and character precision for the test sets are around 93.8% and 97.8%, respectively.

Table 11. The average rates of transliterated word extraction for the overall corpora.

Method       WP      CP      CR
STM          87.5%   95.1%   96.6%
STM+R1       89.8%   95.9%   97.5%
STM+R2       90.3%   97.1%   96.1%
STM+R1+R2    93.8%   97.8%   97.5%
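One plausible way to obtain such overall rates from per-corpus results is to pool the underlying counts (micro-averaging), so that larger corpora weigh more heavily than in a simple mean of the three percentages. The counts below are hypothetical, chosen only so that the pooled word precision reproduces the reported 93.8% for illustration; the paper's actual corpus sizes are not given here.

```python
# Pool per-corpus counts and compute an overall word precision
# (micro-average).  These counts are hypothetical illustrations,
# not the paper's actual corpus sizes.
corpora = {
    "P1": {"correct_w": 471, "total_w": 500},
    "P2": {"correct_w": 188, "total_w": 200},
    "P3": {"correct_w": 279, "total_w": 300},
}
correct = sum(c["correct_w"] for c in corpora.values())
total = sum(c["total_w"] for c in corpora.values())
print(f"overall WP = {correct / total:.1%}")  # -> overall WP = 93.8%
```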

Compared with previous work, the proposed approach has three advantages. First, it learns the parameters of the model automatically from a list of bilingual name pairs, without using a pronunciation dictionary or grapheme-to-phoneme rules for the source words. Second, the framework is easy to port to other language pairs as long as some transliteration training data is available. Third, the approach matches TUs in the two languages directly, thereby accelerating the matching process by skipping the grapheme-to-phoneme phase.


5 Conclusions

A new statistical modeling approach to the machine transliteration problem has been presented in this paper. The parameters of the model are automatically learned from a bilingual proper name list using the EM algorithm. Moreover, the model is applicable to the extraction of proper names and transliterations. The proposed method can be easily extended to other language pairs that have different sound systems without the assistance of pronunciation dictionaries. Experimental results indicate that high precision and recall rates can be achieved by the proposed method.

References

[1] Yaser Al-Onaizan and Kevin Knight, Translating named entities using monolingual and bilingual resources, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 2002, pp. 400-408. [2] Yuen Ren Chao, A Grammar of Spoken Chinese, University of California Press, Berkeley, 1968. [3] Hsin-Hsi Chen, Sheng-Jie Huang, Yung-Wei Ding, and Shih-Chung Tsai, Proper name translation in cross-language information retrieval, in: Proceedings of 17th COLING and 36th ACL, 1998, pp. 232-236. [4] Thomas C. Chuang, Geeng Neng You, and Jason S. Chang, Adaptive bilingual sentence


alignment, Lecture Notes in Artificial Intelligence 2499 (2002) 21-30. [5] Jose B. Cibelli, Robert P. Lanza, Michael D. West, and Carol Ezzell, What Clones, Scientific American, January 2002. (http://www.sciam.com) [6] Ido Dagan, Kenneth W. Church, and William A. Gale, Robust bilingual word alignment for machine aided translation, in: Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1993, pp. 1-8. [7] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society 39 (1) (1977) 1-38. [8] G. D. Forney, The Viterbi algorithm, Proceedings of the IEEE 61 (1973) 268-278. [9] Patrick A. V. Hall and Geoff R. Dowling, Approximate string matching, ACM Computing Surveys 12 (1980) 381-402. [10] Lu Huai, Handbook of English Name Knowledge, 1st edition, ISBN 7-5012-0144-7/Z.10, 1989.

[11] Byung-Ju Kang and Key-Sun Choi, Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval, International Journal of Computer Processing of Oriental Languages, 14 (2) (2001) 109-131. [12] Kevin Knight and Jonathan Graehl, Machine transliteration, Computational Linguistics,


24 (4) (1998) 599-612. [13] Julian Kupiec, An algorithm for finding noun phrase correspondences in bilingual corpora, in: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), Columbus, Ohio, 1993, pp. 17-22. [14] Chun-Jen Lee and Jason S. Chang, Acquisition of English-Chinese transliterated word pairs from parallel-aligned texts using a statistical machine transliteration model, in: Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, Canada, 2003, pp. 96-103. [15] Chun-Jen Lee, Jason S. Chang and Jyh-Shing Roger Jang, A statistical approach to Chinese-to-English back-transliteration, in: Proceedings of the 17th Pacific Asia Conference on Language, Information, and Computation (PACLIC), Singapore, 2003, pp. 310-318. [16] Jae Sung Lee and Key-Sun Choi, A statistical method to generate various foreign word transliterations in multilingual information retrieval system, in: Proceedings of the 2nd International Workshop on Information Retrieval with Asian Languages (IRAL'97), Tsukuba, Japan, 1997, pp. 123-128. [17] V. I. Levenshtein, Binary codes capable of correcting spurious insertions and deletions of ones, Problems of Information Transmission 1 (1965) 8-17. [18] Wei-Hao Lin and Hsin-Hsi Chen, Backward transliteration by learning phonetic


similarity, in: CoNLL-2002, Sixth Conference on Natural Language Learning, Taipei, Taiwan, 2002. [19] Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1st edition, 1999. [20] I. Dan Melamed, Automatic construction of clean broad-coverage translation lexicons, in: Proceedings of the 2nd Conference of the Association for Machine Translation in the Americas (AMTA'96), Montreal, Canada, 1996. [21] Jong-Hoon Oh and Key-Sun Choi, An English-Korean transliteration model using pronunciation and contextual rules, in: Proceedings of the 19th International Conference on Computational Linguistics (COLING), Taipei, Taiwan, 2002, pp. 758-764. [22] P. Proctor, Longman English-Chinese Dictionary of Contemporary English, Longman Group (Far East) Ltd., Hong Kong, 1988. [23] Sinorama, Sinorama Magazine, http://www.greatman.com.tw/sinorama.htm, 2002. [24] Bonnie Glover Stalls and Kevin Knight, Translating names and technical terms in Arabic text, in: Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages, 1998. [25] Frank Z. Smadja, Kathleen McKeown, and Vasileios Hatzivassiloglou, Translating collocations for bilingual lexicons: a statistical approach, Computational Linguistics 22 (1) (1996) 1-38.


[26] Keita Tsuji, Automatic extraction of translational Japanese-KATAKANA and English word pairs from bilingual corpora, International Journal of Computer Processing of Oriental Languages 15 (3) (2002) 261-279. [27] Ellen M. Voorhees and Dawn M. Tice, The TREC-8 question answering track evaluation, in: Proceedings of the Eighth Text Retrieval Conference (TREC-8), 1999. [28] Stephen Wan and Cornelia Maria Verspoor, Automatic English-Chinese name transliteration for development of multilingual resources, in: Proceedings of 17th COLING and 36th ACL, 1998, pp. 1352-1356. [29] J. C. Wells, Longman Pronunciation Dictionary (New Edition), Addison Wesley Longman, Inc., 2001. [30] Dekai Wu and Xuanyin Xia, Learning an English-Chinese lexicon from a parallel corpus, in: Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA), 1994, pp. 206-213.


