
Dissertations & Theses

Detection of foreign words and names in written text
by Ahmed, Bashir U., D.P.S., Pace University, 2005, 172 pages; AAT 3172339

Abstract (Summary)

Tremendous research effort went into natural language processing and understanding during the last half of the 20th century, yet it remains one of the most challenging problems. Nevertheless, many companies have commercialized language processing applications such as Text-to-Speech (TTS) systems, domain-specific Automatic Speech Recognition (ASR) systems, Machine Translation (MT) systems, and limited-domain, speaker-dependent Speech-to-Speech (STS) prototype systems. While the quality of all natural language processing applications has improved steadily, they remain far from perfect.

Due to globalization and the widespread use of the Internet, occurrences of foreign entities--foreign words, names, locations, and events--in written text are becoming commonplace. This further complicates the automatic processing of natural language text. Identification of such foreign entities is important for several reasons. For example, TTS systems usually fail to pronounce such entities as they would be pronounced in their native language, because they try to sound them out using the lexicon of the base language. These foreign entities also cause problems for MT programs, where translation is initially done word by word.

This dissertation simplifies the Naïve Bayesian, N-gram-based text-classification algorithm by taking its logarithm, names the result the Cumulative Frequency Addition (CFA) algorithm, and applies it to three text processing tasks: language identification, name-to-nationality identification, and detection of foreign words and names. In the language identification task, CFA yields 100% accuracy on strings longer than 150 characters. In the name-to-nationality task, it yields 86% accuracy on a 14-country database and 96% on a 7-country database within the top three choices. Identifying the nationality, or at least the language group, of a person from his/her name can be important not only for proper pronunciation but also for forensic studies and national security. Finally, in the task of detecting foreign words we obtain 66.9% accuracy. This is the first study to apply natural language processing techniques to the latter two tasks.
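
The abstract does not give the exact CFA formulation, so the following is only a minimal sketch of the general idea it describes: a Naïve Bayesian classifier over character N-grams, evaluated in log space so that per-N-gram scores are added rather than multiplied. The trigram size, the add-one smoothing, the function names, and the toy training sentences are all assumptions made for illustration and are not taken from the dissertation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=3):
    """Build per-language n-gram counts from {language: training_text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def score(text, counts, vocab_size, n=3):
    """Sum the log-probabilities of the text's n-grams under one language.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    n-grams from producing log(0).
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size))
        for g in char_ngrams(text, n)
    )


def classify(text, models, n=3):
    """Return the language whose model gives the highest summed log score."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab)
    return max(models, key=lambda lang: score(text, models[lang], vocab_size, n))


# Toy training text, invented for illustration only (hypothetical data).
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
MODELS = train(TOY_SAMPLES)
```

With this toy data, `classify("the dog and the fox", MODELS)` selects `"english"` and `classify("el perro y el gato", MODELS)` selects `"spanish"`. Working in log space turns the product of N-gram probabilities into a running sum, which avoids numeric underflow on long strings. Whether CFA adds log-probabilities or some other normalized frequency is not stated in the abstract.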

Indexing (document details)

Advisor: Tappert, Charles
School: Pace University
School Location: United States -- New York
Keyword(s): Foreign words, Written, Name detection, Machine translation, Pattern recognition, Natural language
Source: DAI-A 66/04, p. 1203, Oct 2005
Source type: Dissertation
Subjects: Information systems, Communication, Artificial intelligence, Linguistics
Publication Number: AAT 3172339
ISBN: 9780542090608
Document URL: http://proquest.umi.com/pqdlink?did=913516661&Fmt=7&clientId=79356&RQT=309&VName=PQD
ProQuest document ID: 913516661


 

Copyright © 2009 ProQuest LLC. All rights reserved.