Oracle Text Index Update Mechanism

Applies To
All Users
Summary

This document explains how DML (updates, inserts and deletes) works for columns that are indexed by Oracle Text. A normal user or developer would not be expected to know all this information, but it is likely to be useful to anyone doing performance tuning on a Text system, or looking for bottlenecks during the DML / reindexing process.


Solution
 
Table of Contents
The Extensibility Framework
Index Tables
Inserts
Deletes
Updates
Commit Time
Query Time
Index Synchronization
Implications
Notes

The Extensibility Framework

To understand Text indexing, it is necessary to know a little about Oracle's Extensibility Framework, around which the Oracle Text architecture is built.

The Extensibility Framework allows developers to build their own index types. The kernel knows the names of the various user-provided routines to handle these indexes, but nothing about the underlying structure of the indexes.

The developer must provide a specified set of routines. For example, ODCIIndexInsert, ODCIIndexUpdate, and ODCIIndexDelete are the function definitions for creating, modifying and removing an index entry respectively (ODCI stands for Oracle Data Cartridge Interface). The routines are generally passed the ROWID for the row that is changing.

These user-provided functions are called by the kernel as and when necessary to do the index processing. In normal usage, extensibility functions run in a separate address space from the kernel, running under the EXTPROC program (communicating via interprocess communication, or IPC). However, certain modules written internally within Oracle are described as "trusted" callouts, and run within the kernel address space (such trusted callouts are not available to normal developers using the extensibility framework).

In Releases 8.1.5 and 8.1.6, the query functions and the DML calls are trusted, but the actual indexing code (which runs at SYNC time) executes in the EXTPROC space. In 8.1.7, all calls are trusted, and no EXTPROC process is required.
Index Tables

The Oracle Text index consists of four tables, referred to as the $I, $K, $N and $R tables respectively. The tables exist within the schema of the index owner, and have names concatenated from "DR$", the name of the index, and the suffix (e.g. "$I").
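The naming rule above can be sketched as a small helper. This is only an illustration of the concatenation described in the text; the function name and the index name MYINDEX are made up for the example.

```python
def text_index_tables(index_name: str) -> dict[str, str]:
    """Build the four underlying table names for a Text index.

    Follows the rule in the text: "DR$" + index name + suffix.
    """
    suffixes = ("$I", "$K", "$N", "$R")
    return {suffix: f"DR${index_name}{suffix}" for suffix in suffixes}


# MYINDEX is a hypothetical index name used only for illustration.
for suffix, table in text_index_tables("MYINDEX").items():
    print(suffix, "->", table)
```

The tables live in the index owner's schema, so in practice the names would be qualified with that schema.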
$I The "Token" table

This table consists of all the tokens (words) that have been indexed, together with a binary representation of the documents they occur in, and their positions within those documents. Each document is represented by an internal DOCID value.
$K The DOCID mapping table

This is an index-organized table (IOT) which maps internal DOCID values to external ROWID values. Each row in the table consists of a single DOCID/ROWID pair. The IOT allows for rapid retrieval of DOCID given the corresponding ROWID value.
$R The ROWID mapping table

This is designed for the opposite lookup from the $K table - fetching a ROWID when you know the DOCID value. Given that ROWIDs are a fixed length (14 bytes), and DOCIDs are allocated sequentially, it is possible to write all rowids into a binary structure and find the rowid for any specific docid by reading the 14 bytes starting at position ( 1 + (DOCID*14) ).

In practice, this binary structure is split over several rows in the $R table to prevent any single row getting too large, but this makes no difference to the principle.
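The fixed-length lookup can be illustrated with a short sketch. It takes the position formula exactly as stated above (a 1-based position of 1 + DOCID*14) and treats the whole structure as a single byte string, ignoring the split across several $R rows. The sample rowid values are invented placeholders, not real Oracle rowids.

```python
ROWID_LEN = 14  # fixed rowid length in bytes, as stated in the text


def rowid_position(docid: int) -> int:
    """1-based byte position of a docid's slot: 1 + (DOCID * 14)."""
    return 1 + docid * ROWID_LEN


def read_rowid(blob: bytes, docid: int) -> bytes:
    """Read the 14-byte slot for a docid from the binary structure."""
    start = rowid_position(docid) - 1  # convert 1-based position to 0-based index
    return blob[start:start + ROWID_LEN]


# Placeholder 14-byte "rowids" occupying slots 0..3 (illustrative values only).
blob = b"".join(f"ROWID-{n:08d}".encode() for n in range(4))

print(rowid_position(2))
print(read_rowid(blob, 2))
```

No search is needed: the docid alone determines where to read, which is why this structure suits the DOCID-to-ROWID direction.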
$N Negative row table

This contains a list of deleted DOCID values, which is used (and cleaned up) by the index optimization process.
Inserts

When a new record is inserted into a table with a Text index, the appropriate index creation routine is called by the kernel. The index creation routine creates a row in the DR$PENDING table (owned by CTXSYS), containing the rowid of the new row. No other processing is done at this time, so the indexes have not yet been updated to reflect the new information (this is done at SYNC time - see later).

Note that the kernel does not check for NULL values - an entry will be made in DR$PENDING even if the data value to be indexed by Text is null.

The row inserted in DR$PENDING is in the same commit unit as the new data inserted into the base table, so they will both be either committed or rolled back together.

Under certain circumstances, extra SQL statements will be executed to load the index information cache - see note 1.

An extra step is needed when there is already a similar row in DR$PENDING. In this case, there will be a unique index violation, and a row is inserted into CTXSYS.DR$WAITING instead. The reasoning behind this is that the row in DR$PENDING may already be in the middle of being processed by an index sync. If so, we must be sure that the data is reindexed again at a later date.
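The insert path can be modelled with a toy queue. This is a sketch of the behaviour described above, not Oracle code: the class and method names are invented, and the assumption that the uniqueness key is the (index id, rowid) pair is ours, since the text only says "a similar row".

```python
class PendingQueues:
    """Toy stand-in for CTXSYS.DR$PENDING and CTXSYS.DR$WAITING."""

    def __init__(self):
        self.pending = set()  # unique entries, like the unique index on DR$PENDING
        self.waiting = []     # overflow entries, like DR$WAITING

    def on_insert(self, index_id, rowid):
        """Queue a rowid for indexing; no NULL check and no indexing happens here."""
        key = (index_id, rowid)  # assumed uniqueness key (not stated in the text)
        if key in self.pending:
            # Stands in for the unique index violation: the pending row may
            # already be being processed by a sync, so queue it again separately.
            self.waiting.append(key)
            return "DR$WAITING"
        self.pending.add(key)
        return "DR$PENDING"


queues = PendingQueues()
print(queues.on_insert(1, "rowid-A"))  # first request for this row
print(queues.on_insert(1, "rowid-A"))  # same row queued again before a sync
```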
Deletes

When an indexed row is deleted, the corresponding row in the $K table is immediately deleted. At the same time, a row (containing the index id and docid) is inserted in the DR$DELETE table owned by CTXSYS, and a row (containing just the DOCID) is inserted into the $N table.

These three changes are committed when the user commits the delete.

Removing the row from $K means that functional lookups in the index will not find the deleted row (see Note 2). Adding a row into DR$DELETE means that normal index lookups will not find it either (see Query Time, later), and enables the commit callback to null out the corresponding rowid in the $R table. The row in the $N table will be used during index optimization to remove unwanted DOCIDs from the $I table.

A final stage is to register a "commit callback" for this index. This is an instruction to the kernel to call a specific Text routine at commit time. There only needs to be one such callback per index, so if one is already registered for this index within this transaction, there is no need to do this.

Note: if the data item is inserted and deleted in the same commit unit, then there will be no row in the $K at the start of this process. In that case, there is no need to go through the rest of the process.
Updates

Updates are basically treated as a delete followed by an insert. The record is deleted as in the section above, then the rowid for the record is inserted into DR$PENDING (or DR$WAITING, if a similar row is already pending) as described in the Inserts section.
Commit Time

At commit time, our "commit callback" will be invoked, getting passed the internal index id for the index to be updated.

The callback will fetch all the docids from DR$DELETE for the index id in question.

For each docid, the callback will perform a LOB piecewise update of the $R table, setting the rowid string to nulls.

It will then delete all the rows from DR$DELETE for this index, and deregister the callback.
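The delete and commit steps can be put together in a toy model. Everything here is an in-memory stand-in under stated assumptions: the class name, its attributes and the placeholder rowids are invented, $R is modelled as one byte array using the position rule from the Index Tables section, and the $N entry is modelled as a docid, in line with the description of $N as a list of deleted DOCID values.

```python
ROWID_LEN = 14  # fixed rowid length in bytes, as stated earlier


class ToyTextIndex:
    """In-memory sketch of $K, $R, $N and DR$DELETE for one index."""

    def __init__(self, index_id, rows):
        # rows maps docid -> 14-byte placeholder rowid
        self.index_id = index_id
        self.k = {rowid: docid for docid, rowid in rows.items()}  # $K
        self.r = bytearray((max(rows) + 1) * ROWID_LEN)           # $R
        for docid, rowid in rows.items():
            self.r[docid * ROWID_LEN:(docid + 1) * ROWID_LEN] = rowid
        self.n = []                       # $N: deleted docids
        self.dr_delete = []               # DR$DELETE: (index id, docid)
        self.callback_registered = False  # at most one callback per index

    def delete_row(self, rowid):
        docid = self.k.pop(rowid, None)   # $K row is removed immediately
        if docid is None:
            return                        # never synced: nothing else to do
        self.dr_delete.append((self.index_id, docid))
        self.n.append(docid)
        self.callback_registered = True

    def commit(self):
        """Commit callback: null out $R slots, clear DR$DELETE, deregister."""
        if not self.callback_registered:
            return
        for _, docid in self.dr_delete:
            start = docid * ROWID_LEN     # 0-based form of position 1 + docid*14
            self.r[start:start + ROWID_LEN] = bytes(ROWID_LEN)
        self.dr_delete.clear()
        self.callback_registered = False


index = ToyTextIndex(7, {1: b"ROWID-00000001", 2: b"ROWID-00000002"})
index.delete_row(b"ROWID-00000001")
print(index.dr_delete, index.n)  # queued work before commit
index.commit()
print(bytes(index.r[14:28]))     # slot for docid 1 after commit
```

Note that $R is untouched until commit, which is why uncommitted deletes need the extra DR$DELETE check at query time.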
Query Time

There are two sorts of index lookup used in Oracle Text - normal and functional lookups. The normal lookup effectively says "give me all the rowids that satisfy my text criteria", whereas the functional lookup says "does this row satisfy my text criteria?" The first of these (normal lookup) fetches a set of docids from the $I table, then uses the $R table to convert them to rowid values.

If our current session has deleted a record, but not committed the delete, then the $R table will not yet have been modified. Therefore, during a normal lookup, the index lookup code must check DR$DELETE, and remove any unwanted DOCID values that it finds in this table before converting these values to rowids using the $R table.

This ONLY applies to records modified in our own session - if other sessions have made modifications but not committed, those modifications are invisible to us anyway. And once they have committed, the $R table will have had the old DOCIDs nulled out.

In the case of a functional lookup, there is no need for any special processing. Functional lookup uses the $K table, and this table is updated as soon as the record is changed.
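The two lookup styles can be contrasted in a small sketch. The function names, the dictionary-based $I and $K stand-ins and the placeholder rowids are all assumptions made for illustration; the real $I table stores a binary encoding rather than plain docid lists.

```python
ROWID_LEN = 14  # fixed rowid length in bytes, as stated earlier


def normal_lookup(token, i_table, r_blob, session_deletes):
    """Return rowids for a token: $I gives docids, $R converts them to rowids.

    session_deletes holds docids this session has in DR$DELETE (uncommitted).
    """
    rowids = []
    for docid in i_table.get(token, []):
        if docid in session_deletes:
            continue  # deleted by this session, $R not yet updated
        start = docid * ROWID_LEN  # 0-based form of position 1 + docid*14
        rowid = bytes(r_blob[start:start + ROWID_LEN])
        if rowid != bytes(ROWID_LEN):  # committed deletes were nulled out
            rowids.append(rowid)
    return rowids


def functional_lookup(token, rowid, i_table, k_table):
    """Answer "does this row match?" using $K to map rowid -> docid."""
    docid = k_table.get(rowid)
    return docid is not None and docid in i_table.get(token, [])


i_table = {"oracle": [1, 2, 3]}
# Slot 0 unused, docid 2 already nulled by a committed delete.
r_blob = bytes(14) + b"ROWID-00000001" + bytes(14) + b"ROWID-00000003"
k_table = {b"ROWID-00000001": 1}  # docid 3's $K row was removed by our delete

print(normal_lookup("oracle", i_table, r_blob, session_deletes={3}))
print(functional_lookup("oracle", b"ROWID-00000003", i_table, k_table))
```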
Index Synchronization

Index synchronization (sync, for short) occurs when a user executes the SQL statement ALTER INDEX indexname REBUILD ONLINE PARAMETERS ('sync') or, in 8.1.6 or later, calls a PL/SQL sync routine.

Sync looks in DR$PENDING and DR$WAITING for rowids of records to be updated. Rowids from these two tables are combined.

For each rowid, a new DOCID value is assigned. The data is indexed via the indexing pipeline (which will not be covered in detail here) and the resulting token, DOCID, and word position information will be inserted into the $I table. A new row is inserted in the $K table containing the DOCID/ROWID pair, and the $R data is extended via a LOB piecewise write to the correct 18 character string.
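The sync steps can be sketched as follows. This is a toy model only: the function name and data structures are invented, whitespace splitting stands in for the real indexing pipeline, the duplicate removal when combining the two queues is our assumption (the text only says the rowids are combined), and 14-byte placeholder rowids are used so the $R slots match the layout described in the Index Tables section.

```python
ROWID_LEN = 14  # slot size used by this sketch, matching the $R description


def sync(pending, waiting, documents, index):
    """Index every queued rowid and empty both queues.

    documents maps rowid -> text; index holds toy "i", "k", "r" structures
    and the next docid to allocate.
    """
    rowids = list(dict.fromkeys(pending + waiting))  # combine, drop duplicates
    for rowid in rowids:
        docid = index["next_docid"]  # each rowid gets a new docid
        index["next_docid"] += 1
        words = documents[rowid].lower().split()  # stand-in for the pipeline
        for position, token in enumerate(words, start=1):
            index["i"].setdefault(token, []).append((docid, position))
        index["k"][rowid] = docid  # new DOCID/ROWID pair in $K
        end = (docid + 1) * ROWID_LEN
        if len(index["r"]) < end:  # extend $R, then write the slot
            index["r"].extend(bytes(end - len(index["r"])))
        index["r"][docid * ROWID_LEN:end] = rowid
    pending.clear()
    waiting.clear()


index = {"i": {}, "k": {}, "r": bytearray(), "next_docid": 1}
pending = [b"ROWID-00000001"]
waiting = [b"ROWID-00000001", b"ROWID-00000002"]
documents = {b"ROWID-00000001": "Oracle Text", b"ROWID-00000002": "text index"}

sync(pending, waiting, documents, index)
print(index["k"])
print(index["i"]["text"])
```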
Implications

When a record is deleted, the index change is immediate. That is, your own session will no longer find anything in that record from the moment you make the change, and other users will not find it as soon as you have committed.

Inserts - and by implication updates - are different. The new information will not be visible to text searches until an index sync has occurred.

The most important effect of this is on updates. If you make a minor alteration to a document, then that document effectively becomes invisible to searches until an index sync occurs. Application developers should bear this in mind.
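The visibility gap for updates can be demonstrated with a deliberately simplified model. The class below is hypothetical and collapses all the index tables into one dictionary; it only captures the timing described above, where the delete half of an update takes effect at once and the insert half waits for sync.

```python
class ToySearchIndex:
    """Minimal model of when a row is visible to text searches."""

    def __init__(self):
        self.base = {}        # base table: rowid -> text
        self.indexed = {}     # searchable entries: rowid -> set of tokens
        self.pending = set()  # rowids waiting for sync

    def insert(self, rowid, text):
        self.base[rowid] = text
        self.pending.add(rowid)         # not searchable until sync

    def update(self, rowid, text):
        self.indexed.pop(rowid, None)   # delete half: takes effect immediately
        self.base[rowid] = text
        self.pending.add(rowid)         # insert half: deferred to sync

    def sync(self):
        for rowid in self.pending:
            self.indexed[rowid] = set(self.base[rowid].lower().split())
        self.pending.clear()

    def search(self, token):
        return sorted(r for r, tokens in self.indexed.items() if token in tokens)


idx = ToySearchIndex()
idx.insert("r1", "quarterly report")
idx.sync()
print(idx.search("report"))   # visible after the first sync

idx.update("r1", "quarterly report v2")
print(idx.search("report"))   # invisible until the next sync
idx.sync()
print(idx.search("report"))   # visible again
```

An application that updates documents frequently may therefore want to schedule syncs with this gap in mind.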
Notes
Note 1 - loading the index detail cache

The first time that the Oracle kernel deals with an Oracle Text index, it will load an internal cache with various information about the index - such as the filter used, the section groups, stopwords, etc. For this reason, in a trace you will sometimes see a bunch of extra SQL statements dealing with tables such as DR$INDEX, DR$INDEX_OBJECT, etc. You can see this by creating a new index, then performing an insert to the table with SQL_TRACE switched on. Note that this only happens the first time the index is used after creation, or after restarting the database. Subsequent, separate sessions do not need to reload this information - it is available in the SGA to all sessions.

It may seem a little odd that information like stopwords is loaded during an insert, when it is only needed during indexing or querying, but the logic seems to be that since we have to find some information about the index, we may as well fetch it all.
Note 2 - removing rows from $K on delete

Actually, when a record has really been deleted, there is no way we will ever get to do a functional lookup anyway. For the kernel to do a functional lookup, it has to find the row via some other criteria, and if it has been deleted then that's not going to happen. However, when we do an update, this is described as a "delete followed by insert", and in this case the physical row has not been deleted, so the argument does apply.
