Natural Language Processing (NLP): a research field focussed on creating software systems with knowledge about natural (human) language
Interdisciplinary: makes use of theories from Linguistics and adopts an Engineering approach
Aimed at human-like understanding of language (but not yet there)
> [!tip] Contributing disciplines
> Linguistics: formal models of language, linguistic knowledge
> Computer Science: representations, efficient processing, state machines, parsing algorithms, probabilistic models, dynamic programming, machine learning
> Mathematics: formal automata theory, computational modelling
> Psychology: psychologically plausible modelling of language use

Many types of ambiguity:
Phonological
multiple interpretations due to how it sounds: some words or phrases sound alike, so recognition yields several candidate interpretations
Lexical
multiple interpretations due to a word having multiple senses (e.g., "bank" as a river bank or a financial institution)
Syntactic
due to a word having more than one possible part of speech (e.g., "book" as a noun or a verb)
due to prepositional phrase attachment (e.g., "I saw the man with the telescope")
Semantic
multiple possible interpretations unless knowledge of the world is available
> [!danger] Two major approaches to NLP
> * Symbolic
>     * Rule- and dictionary-based systems
>     * Captures linguistic knowledge in rules written by experts
> * Statistical/Machine learning-based
>     * Data-driven
>     * Use of large amounts of (labelled) textual data to train systems and **discover patterns**

Comparison
Symbolic:
❎ Shortage of experts
❎ Laborious rule writing, dictionary preparation
❎ Domain adaptation problematic
✅ Results can be interpreted
✅ Good when labelled data is hard to obtain
Statistical/Machine learning-based:
✅ Can generalise well on unseen examples
❎ Need people for labelling
❎ Time-consuming and laborious labelling
❎ Must retrain for new domain
❎ Often cannot inspect/change models
✅ Good where dictionaries are unavailable
NLP Pipelines
A 'complete' NLP system is usually a pipeline of components
Sentence Segmentation
Tokenisation
Parsing
Information
Sentence segmentation (split)
Tokenisation (split)
Named entity recognition (combine)
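The split/split/combine idea above can be sketched as a chain of small functions. This is only an illustrative sketch: the function names are hypothetical, the regular expressions are deliberately naive, and merging runs of capitalised tokens is a crude stand-in for real named entity recognition, not an actual NER method.

```python
import re

def segment_sentences(text):
    """Split: break the text after '.', '!' or '?' followed by whitespace (naive)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Split: words and single punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def combine_capitalised(tokens):
    """Combine: merge runs of two or more capitalised tokens into one candidate entity.

    Assumption for illustration only; a real system would use a trained NER model.
    """
    spans, run = [], []
    for tok in tokens:
        if tok[:1].isupper():
            run.append(tok)
        else:
            if len(run) > 1:
                spans.append(" ".join(run))
            run = []
    if len(run) > 1:
        spans.append(" ".join(run))
    return spans

def run_pipeline(text):
    """Feed the output of each component into the next one."""
    results = []
    for sentence in segment_sentences(text):
        tokens = tokenise(sentence)
        results.append((sentence, tokens, combine_capitalised(tokens)))
    return results

for sentence, tokens, entities in run_pipeline(
    "Alan Turing was born in London. He studied mathematics!"
):
    print(sentence, tokens, entities)
```

Each stage only sees the output of the previous one, so an error made early (for example a wrong sentence boundary) propagates to every later component.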
Why is NLP challenging?
Natural languages evolve:
new words appear constantly
Syntactic rules are flexible
Ambiguity is inherent
Why Machine Learning for NLP
Traditional rule-based artificial intelligence (symbolic AI):
requires expert knowledge to engineer the rules
not flexible enough to adapt to multiple languages, domains, applications
Learning from data (machine learning) adapts:
to evolution: just learn from new data
to different applications: just learn with the appropriate target representation
Sentence Segmentation
Sentence detection is done before the text is tokenised
Task:
Determination of boundaries between sentences
Sentences are used in subsequent NLP tasks
Is it enough to detect the full stop?
Could be an end-of-sentence (EOS) marker
Or an end-of-abbreviation marker
Or both
Usually the first step in Text Mining, because it splits unstructured text into basic processing units
Variation in delimiters
Typical: ".", "!", "?"
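To see why detecting the delimiter alone is not enough, here is a minimal sketch that treats every typical delimiter as an EOS marker. The function name `naive_split` and the example sentence are made up for illustration.

```python
import re

def naive_split(text):
    """Treat every '.', '!' or '?' followed by whitespace as a sentence end."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Dr. Jones joined MySocialMedia Inc. She starts on Monday."
for sentence in naive_split(text):
    print(sentence)
# Dr.
# Jones joined MySocialMedia Inc.
# She starts on Monday.
```

The full stop after "Dr" is only an abbreviation marker, so the first split is wrong, while the one after "Inc" marks both an abbreviation and the end of the sentence, so that split happens to be right.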
> [!todo] Approaches
> * Regular expressions (patterns)
> * Dictionaries (e.g., abbreviation lists)
> * Hand-crafted rules (e.g., to check whether the word following an EOS delimiter starts with an uppercase character)
> * Statistical and ML approaches
> * Hybrid approaches

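
A small hybrid of the first three approaches might look like the sketch below: a regular expression finds candidate delimiters, a dictionary filters out abbreviations, and a hand-crafted rule checks the case of the following word. The abbreviation lists are tiny, hypothetical examples, and the whole function is an assumption made for illustration rather than a reference implementation.

```python
import re

# Hypothetical, deliberately tiny dictionaries (assumptions for this sketch).
TITLES = {"dr.", "mr.", "mrs.", "prof."}                 # rarely end a sentence
OTHER_ABBREVIATIONS = {"inc.", "ltd.", "e.g.", "i.e."}   # may or may not end one

EOS_CANDIDATE = re.compile(r"[.!?]$")  # pattern: word ends with a delimiter

def split_sentences(text):
    """Split text into sentences using a pattern, dictionaries and one rule."""
    sentences, current = [], []
    words = text.split()
    for i, word in enumerate(words):
        current.append(word)
        if not EOS_CANDIDATE.search(word):
            continue
        next_word = words[i + 1] if i + 1 < len(words) else None
        lower = word.lower()
        if lower in TITLES:
            # Dictionary: a title is followed by a name, not a new sentence.
            is_eos = False
        elif lower in OTHER_ABBREVIATIONS:
            # Rule: accept the boundary only if the next word is capitalised.
            is_eos = next_word is None or next_word[0].isupper()
        else:
            is_eos = True
        if is_eos:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences(
    "Dr. Jones joined MySocialMedia Inc. She starts on Monday. "
    "He met them, e.g. yesterday."
))
```

The uppercase rule still fails when an abbreviation is followed by a capitalised word inside the same sentence, which is one reason statistical and hybrid approaches are used as well.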
Examples of useful rules or features
First character after potential EOS char
● Should be uppercase? Problematic for some languages, e.g. German (all nouns are capitalised)
● Permissible chars after potential EOS, e.g. lowercase characters?
Abbreviations
● titles not likely to occur at EOS (e.g., Dr. Jones)
● company indicators could occur at EOS (e.g., MySocialMedia Inc.)
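For a statistical or ML approach, clues like these are typically turned into features for each potential EOS character. The sketch below shows one way this might be done; the feature names, the word lists and the function `eos_candidate_features` are all hypothetical, and no classifier is included.

```python
import re

# Hypothetical word lists (assumptions for this sketch).
TITLES = {"dr", "mr", "mrs", "prof"}          # not likely to occur at EOS
COMPANY_INDICATORS = {"inc", "ltd", "corp"}   # could occur at EOS

def eos_candidate_features(text):
    """Return one feature dict per potential EOS character in the text."""
    features = []
    for match in re.finditer(r"[.!?]", text):
        pos = match.start()
        before = re.search(r"(\w+)$", text[:pos])      # word just before the delimiter
        prev_word = before.group(1).lower() if before else ""
        next_char = text[pos + 1:].lstrip()[:1]        # first non-space char after it
        features.append({
            "delimiter": match.group(),
            "prev_word": prev_word,
            "prev_is_title": prev_word in TITLES,
            "prev_is_company_indicator": prev_word in COMPANY_INDICATORS,
            "next_is_upper": next_char.isupper(),
            "next_is_lower": next_char.islower(),
            "at_end_of_text": next_char == "",
        })
    return features

for feats in eos_candidate_features("Dr. Jones joined MySocialMedia Inc. She left."):
    print(feats)
```

A classifier trained on labelled boundaries could then weigh these features, for example learning that a title before the delimiter argues against EOS while a company indicator followed by an uppercase character argues for it.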