COMP 6714 Info Retrieval and Web Search: Week 1 Notes

I could cry: this is the only course I actually managed to follow this week.

What is IR (Information Retrieval)?

It is not the same thing as search, and it is not database querying.

The field of computer science most involved with R&D (research and development) for search is information retrieval (IR).

Collection: a set of documents, assumed to be static for the moment.
Goal: retrieve documents with information that is relevant to the user's information need and that helps the user complete a task.

market cap: market capitalization, i.e. total market value

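In the mid-1990s, most data was already unstructured, yet most of the industry's money was in structured databases such as Oracle, Microsoft SQL Server, and IBM DB2.

By 2019 there was even more unstructured data, and more money was being spent on unstructured data than on structured data (a more recent example is ChatGPT).

This makes information retrieval more important than it used to be.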

A database record (a tuple in a relational database) is usually made up of well-defined fields (attributes). Querying a database is easy because the fields have well-defined semantics; querying text (documents) is harder.

Comparing the query text with the document text and deciding what counts as a good match is the core issue of information retrieval.
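To make the contrast concrete, here is a minimal sketch comparing a field-based query with a naive text match. The records, the documents, and the `overlap_score` helper are all made up for illustration; counting shared words is just one simple stand-in for "matching", not the method taught in the course.

```python
# Structured data: each field has a well-defined meaning,
# so a query is an exact comparison on those fields.
employees = [
    {"name": "Alice", "dept": "Sales", "salary": 70000},
    {"name": "Bob", "dept": "Sales", "salary": 50000},
    {"name": "Carol", "dept": "Engineering", "salary": 90000},
]
high_paid_sales = [
    e["name"] for e in employees
    if e["dept"] == "Sales" and e["salary"] > 60000
]


def overlap_score(query, document):
    """Hypothetical scoring: number of distinct words shared by query and document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)


# Unstructured text: there are no fields, only words.
docs = [
    "the bank raised interest rates",           # on topic, shares all query words
    "we walked along the river bank",           # off topic, but shares "bank"
    "lenders increased the cost of borrowing",  # on topic, shares no words
]
query = "bank interest rates"
scores = [overlap_score(query, d) for d in docs]

print(high_paid_sales)
print(scores)
```

The field query has one unambiguous answer. The word-overlap score, on the other hand, gives some credit to the river-bank sentence and none to the third sentence, even though the third one is about the same topic as the query. Deciding what a good match is takes more than comparing strings.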

IR is more than just text and web search (although those are the core of this course).

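Ad-hoc search: find relevant documents for an arbitrary text query.
Filtering: aka information dissemination; identify the relevant user profiles for a new document (for example, if you tell your social media app that you like anime, it will probably push anime content to you later).
Classification: identify the relevant labels for a document.
Question answering: give a specific answer to a question.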

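Accuracy is not an information retrieval term, and it is misleading here: almost every document in a collection is irrelevant to any given query, so a system that retrieves nothing at all would still score a very high accuracy. We evaluate retrieval with Precision and Recall instead.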

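Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Of the documents you retrieved, how many are actually relevant?
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
Of all the relevant documents, how many did you retrieve?
So precision is measured against the retrieved set, while recall is measured against the relevant documents in the collection.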
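- tp: true positive (relevant, and retrieved)
- fp: false positive (not relevant, but retrieved)
- fn: false negative (relevant, but not retrieved)
- tn: true negative (not relevant, and not retrieved)
  All the "true" ones are the good stuff; all the "false" ones are what you don't want.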
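
In terms of these four counts, precision is tp / (tp + fp) and recall is tp / (tp + fn), while accuracy would be (tp + tn) / (tp + fp + fn + tn). The sketch below computes all three for a toy query. The `evaluate` function, the document ids, and the choice to report 0.0 when a denominator is zero are assumptions made for illustration, not something from the lecture.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, accuracy) for a single query.

    All three arguments are iterables of document ids (hypothetical data).
    """
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # relevant and retrieved
    fp = len(retrieved - relevant)               # not relevant but retrieved
    fn = len(relevant - retrieved)               # relevant but not retrieved
    tn = len(collection - retrieved - relevant)  # not relevant, not retrieved

    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(collection) if collection else 0.0
    return precision, recall, accuracy


# Toy collection: 1000 documents, only 4 of them relevant to the query.
collection = range(1000)
relevant = {1, 2, 3, 4}

# A system that returns three documents, two of them relevant.
p, r, a = evaluate({1, 2, 50}, relevant, collection)
print(f"precision={p:.3f} recall={r:.3f} accuracy={a:.3f}")

# A system that returns nothing at all.
p0, r0, a0 = evaluate(set(), relevant, collection)
print(f"precision={p0:.3f} recall={r0:.3f} accuracy={a0:.3f}")
```

In this toy setup the system that returns nothing still gets an accuracy of 0.996, almost the same as the 0.997 of the system that actually finds half of the relevant documents. Precision and recall separate the two clearly, which is why accuracy is not used here.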