LeetCode 146, OrderedDict, and lru_cache

Table of Contents

- leetcode146
- Is dict ordered?
- lru_cache
- OrderedDict VS lru_cache

leetcode146

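LeetCode 146 asks you to implement an LRU cache. In Python, the most direct approach is OrderedDict, a data structure from the standard library's collections module. As the name suggests, it is a dictionary that remembers the order of its keys and lets you rearrange them. Compared with dict, it adds a move_to_end method and a popitem that accepts a last argument: popitem can remove either the first or the last item, and move_to_end can move a given key to either the first or the last position. That makes the approach clear, because the structure matches the LRU idea exactly: on every insert or access, use move_to_end to move the item to the front; when the cache is full, use popitem to drop the item at the back. The reference code follows, and everything falls out naturally: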

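from collections import OrderedDict


class LRUCache:

    def __init__(self, capacity: int):
        self.capacity = capacity
        # Most recently used key sits at the front, least recently used at the back.
        self.container = OrderedDict()

    def get(self, key: int) -> int:
        if key in self.container:
            # A hit makes this key the most recently used one.
            self.container.move_to_end(key, last=False)
        return self.container.get(key, -1)

    def put(self, key: int, value: int) -> None:
        if key not in self.container and len(self.container) == self.capacity:
            # Cache is full: evict the least recently used item (the last one).
            self.container.popitem()
        self.container[key] = value
        self.container.move_to_end(key, last=False)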

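Does it pass? It does. But intuition says the most direct approach is rarely the optimal one. The real difficulty of this problem is implementing the cache with a hash table plus a doubly linked list, which is also how the standard library's cache, functools.lru_cache, is implemented. So is that the optimal solution? Before answering, one more question needs settling: is dict ordered? And if it is, how does it differ from OrderedDict?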

Is dict ordered?

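The answer is yes: dict is ordered, in the sense that it remembers insertion order and iterates in that order. For example, items() returns entries in the order they were inserted. This ordering arrived with Python 3.6 (as a CPython implementation detail; it became a language guarantee in 3.7). Before 3.6, dict was a traditional hash table, which has no order. Starting with 3.6, the implementation switched to a compact-array layout that saves memory, is more efficient, and records insertion order; this layout is what made dict ordered. There is a mailing-list email that describes the compact-array scheme concisely, and I have also written an article analyzing the CPython dict implementation in detail.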
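To make the remaining differences from OrderedDict concrete, here is a small sketch. The variable names are only illustrative, and it assumes Python 3.7 or later, where dict ordering is guaranteed. It shows that a plain dict keeps insertion order, while reordering, order-sensitive equality, and popping from the front are specific to OrderedDict.

```python
from collections import OrderedDict

# A plain dict iterates in insertion order.
d = {}
d["b"] = 1
d["a"] = 2
d["c"] = 3
dict_order = list(d)                      # ['b', 'a', 'c']

# Plain dicts ignore order when compared; OrderedDicts do not.
dicts_equal = {"x": 1, "y": 2} == {"y": 2, "x": 1}
ordered_equal = OrderedDict(x=1, y=2) == OrderedDict(y=2, x=1)

# Reordering is only available on OrderedDict.
od = OrderedDict(d)
od.move_to_end("b")                       # order is now a, c, b
od.move_to_end("c", last=False)           # order is now c, a, b
after_moves = list(od)

# OrderedDict.popitem can take from either end; dict.popitem only from the end.
oldest = od.popitem(last=False)           # removes the first item
newest = od.popitem()                     # removes the last item
```

So even with an ordered dict, OrderedDict is still the one that can rearrange keys cheaply, which is exactly what an LRU cache needs.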

lru_cache

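functools.lru_cache is Python's out-of-the-box LRU cache. It is a decorator that is applied directly to the function whose results should be cached. Since a standard implementation exists, we might as well study it. lru_cache is a parameterized decorator that lets you specify the cache size; the core logic lives in the _lru_cache_wrapper function, whose source is as follows: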

def _lru_cache_wrapper(user_function, maxsize, typed, _CacheInfo):
    # Constants shared by all lru cache instances:
    sentinel = object()          # unique object used to signal cache misses
    make_key = _make_key         # build a key from the function arguments
    PREV, NEXT, KEY, RESULT = 0, 1, 2, 3   # names for the link fields

    cache = {}
    hits = misses = 0
    full = False
    cache_get = cache.get    # bound method to lookup a key or return None
    cache_len = cache.__len__  # get cache size without calling len()
    lock = RLock()           # because linkedlist updates aren't threadsafe
    root = []                # root of the circular doubly linked list
    root[:] = [root, root, None, None]     # initialize by pointing to self

    if maxsize == 0:

        def wrapper(*args, **kwds):
            # No caching -- just a statistics update
            nonlocal misses
            misses += 1
            result = user_function(*args, **kwds)
            return result

    elif maxsize is None:

        def wrapper(*args, **kwds):
            # Simple caching without ordering or size limit
            nonlocal hits, misses
            key = make_key(args, kwds, typed)
            result = cache_get(key, sentinel)
            if result is not sentinel:
                hits += 1
                return result
            misses += 1
            result = user_function(*args, **kwds)
            cache[key] = result
            return result

    else:

        def wrapper(*args, **kwds):
            # Size limited caching that tracks accesses by recency
            nonlocal root, hits, misses, full
            key = make_key(args, kwds, typed)
            with lock:
                link = cache_get(key)
                if link is not None:
                    # Move the link to the front of the circular queue
                    link_prev, link_next, _key, result = link
                    link_prev[NEXT] = link_next
                    link_next[PREV] = link_prev
                    last = root[PREV]
                    last[NEXT] = root[PREV] = link
                    link[PREV] = last
                    link[NEXT] = root
                    hits += 1
                    return result
                misses += 1
            result = user_function(*args, **kwds)
            with lock:
                if key in cache:
                    # Getting here means that this same key was added to the
                    # cache while the lock was released.  Since the link
                    # update is already done, we need only return the
                    # computed result and update the count of misses.
                    pass
                elif full:
                    # Use the old root to store the new key and result.
                    oldroot = root
                    oldroot[KEY] = key
                    oldroot[RESULT] = result
                    # Empty the oldest link and make it the new root.
                    # Keep a reference to the old key and old result to
                    # prevent their ref counts from going to zero during the
                    # update. That will prevent potentially arbitrary object
                    # clean-up code (i.e. __del__) from running while we're
                    # still adjusting the links.
                    root = oldroot[NEXT]
                    oldkey = root[KEY]
                    oldresult = root[RESULT]
                    root[KEY] = root[RESULT] = None
                    # Now update the cache dictionary.
                    del cache[oldkey]
                    # Save the potentially reentrant cache[key] assignment
                    # for last, after the root and links have been put in
                    # a consistent state.
                    cache[key] = oldroot
                else:
                    # Put result in a new link at the front of the queue.
                    last = root[PREV]
                    link = [last, root, key, result]
                    last[NEXT] = root[PREV] = cache[key] = link
                    # Use the cache_len bound method instead of the len() function
                    # which could potentially be wrapped in an lru_cache itself.
                    full = (cache_len() >= maxsize)
            return result

    def cache_info():
        """Report cache statistics"""
        with lock:
            return _CacheInfo(hits, misses, maxsize, cache_len())

    def cache_clear():
        """Clear the cache and cache statistics"""
        nonlocal hits, misses, full
        with lock:
            cache.clear()
            root[:] = [root, root, None, None]
            hits = misses = 0
            full = False

    wrapper.cache_info = cache_info
    wrapper.cache_clear = cache_clear
    return wrapper

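The user_function parameter is the function to be cached and maxsize is the cache size. typed can be ignored here, and _CacheInfo is just the named tuple used to report cache statistics, so it can be ignored too. The focus below is on the caching logic.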
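Before digging into the internals, it may help to see the behavior from the outside. The sketch below uses a hypothetical function `square` and a deliberately tiny `maxsize=2` so that an eviction is easy to observe; `calls` is only there to record when the real function actually runs.

```python
from functools import lru_cache

calls = []  # records every real invocation of the wrapped function


@lru_cache(maxsize=2)
def square(n):
    calls.append(n)
    return n * n


square(1)   # miss
square(2)   # miss, the cache is now full
square(1)   # hit, 1 becomes the most recently used key
square(3)   # miss, evicts 2 (the least recently used key)
square(2)   # miss again, because 2 was evicted

info = square.cache_info()
```

Here `info` is the statistics tuple built by the `cache_info` function from the listing, with `hits`, `misses`, `maxsize`, and `currsize` fields.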

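When maxsize is 0, caching is disabled. When maxsize is None, the cache is unbounded and the LRU cache degenerates into a plain hash table. The case to focus on is when maxsize is neither 0 nor None, that is, the final else branch, from line 42 of the listing onward.
make_key builds a cache key from the call arguments, and cache_get then looks up the entry stored under that key. The underlying data structure is a dict, and cache_get is simply its get method. From there, three cases arise, analyzed in turn: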

Cache hit

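On a hit, link is not None. Following the LRU idea, the hit element must be moved to the tail of the linked list, which is lines 50-56 of the listing. The cache keeps a circular doubly linked list whose root node is root. Each element has to store a pointer to the previous element, a pointer to the next element, and its own key and value. No dedicated data structure is used for this: each element is a list of length 4 whose items are, in order, the previous element, the next element, the key, and the value.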

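The operation takes three steps:

1. Unlink the hit element from the list.
2. Point the tail node's next pointer at the hit element, and point the root node's previous pointer at the hit element.
3. Point the hit element's previous pointer at the old tail node and its next pointer at the root node.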

This completes the update for the hit element. Each update touches only a fixed number of pointers, so the time complexity is O(1).
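The three steps can be tried in isolation. The following is a minimal sketch that reuses the same four-slot list layout as the listing; the helper names `new_root`, `append`, `move_to_tail`, and `keys` are made up for illustration and do not exist in functools.

```python
PREV, NEXT, KEY, RESULT = 0, 1, 2, 3   # same field layout as the listing


def new_root():
    """Create an empty circular list: the root points to itself."""
    root = []
    root[:] = [root, root, None, None]
    return root


def append(root, key, result):
    """Link a new element in just before root, i.e. at the tail."""
    last = root[PREV]
    link = [last, root, key, result]
    last[NEXT] = root[PREV] = link
    return link


def move_to_tail(root, link):
    """Apply the three steps from the text to an existing element."""
    link_prev, link_next = link[PREV], link[NEXT]
    link_prev[NEXT] = link_next        # step 1: unlink the element
    link_next[PREV] = link_prev
    last = root[PREV]
    last[NEXT] = root[PREV] = link     # step 2: old tail and root point at it
    link[PREV] = last                  # step 3: it points back at both
    link[NEXT] = root


def keys(root):
    """Walk the list from oldest to newest and collect the keys."""
    out = []
    node = root[NEXT]
    while node is not root:
        out.append(node[KEY])
        node = node[NEXT]
    return out


root = new_root()
links = {k: append(root, k, k.upper()) for k in "abc"}
before = keys(root)
move_to_tail(root, links["a"])
after = keys(root)
```

No loop is involved in `move_to_tail`, only a fixed number of assignments, which is where the O(1) bound comes from.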

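Cache miss, cache not full

On a miss with room left in the cache, the new element is simply linked in at the tail of the doubly linked list (lines 91-93 of the listing).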

Cache miss, cache full

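On a miss with a full cache, an eviction is needed (lines 70-88 of the listing). From the analysis above, the most recently used elements sit at the tail of the doubly linked list, so the element to evict is the one right after root. In principle this takes two steps: delete the least recently used element, then link the new element into the list. To simplify this, the code uses a sentinel node: root is the sentinel, and its key and value are empty. With the sentinel, the list manipulation reduces to two steps: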

1. Store the new key and value in the current root node, then make root refer to the next node.
2. Clear the key and value of the new root node.

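In other words, no node is actually removed from or added to the list; root just advances by one position. Once the list has been updated, the evicted key is deleted from the hash table and the new key is added, which completes the LRU operation.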
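The root rotation is easier to follow on a tiny example. This is a sketch under the same list layout as the listing, without the lock or the reference-keeping details; the helpers `append`, `replace_oldest`, and `keys` are hypothetical names used only here.

```python
PREV, NEXT, KEY, RESULT = 0, 1, 2, 3   # same field layout as the listing


def append(root, cache, key, result):
    """Miss with room left: link a new element in at the tail."""
    last = root[PREV]
    link = [last, root, key, result]
    last[NEXT] = root[PREV] = cache[key] = link


def replace_oldest(root, cache, key, result):
    """Miss with a full cache: reuse the sentinel instead of allocating."""
    oldroot = root
    oldroot[KEY] = key                 # the old sentinel now holds the new entry
    oldroot[RESULT] = result
    root = oldroot[NEXT]               # the oldest element becomes the sentinel
    oldkey = root[KEY]
    root[KEY] = root[RESULT] = None    # clear it so it is an empty sentinel
    del cache[oldkey]                  # then fix up the hash table
    cache[key] = oldroot
    return root                        # the caller must keep the new root


def keys(root):
    """Walk the list from oldest to newest and collect the keys."""
    out = []
    node = root[NEXT]
    while node is not root:
        out.append(node[KEY])
        node = node[NEXT]
    return out


root = []
root[:] = [root, root, None, None]
cache = {}
for k in ("a", "b"):
    append(root, cache, k, k.upper())

old_sentinel = root
root = replace_oldest(root, cache, "c", "C")
order = keys(root)
```

After the call, the list object that used to be the sentinel is the entry for "c", and the list object that used to hold "a" is the new sentinel, so the number of nodes never changes.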

OrderedDict VS lru_cache

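lru_cache clearly has some substance to it. So how is OrderedDict implemented, and how does an LRU cache built on OrderedDict compare? The answer is in the OrderedDict source. Reading it shows that the OrderedDict approach differs little from lru_cache, because OrderedDict works by adding a doubly linked list to the class. Both are essentially a hash table plus a doubly linked list, so their operations are similar as well. For this problem the difference between them is small, which means the OrderedDict solution is in fact also optimal. It just tends not to be allowed in interviews, where you are expected to write the doubly linked list by hand.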

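In practice, the difference between the two is that lru_cache is a production-ready solution, whereas OrderedDict is only a building block that still needs some DIY work. Seen this way, OrderedDict does not seem to have much of a role left, and that is largely true: now that dict is ordered too, OrderedDict matters far less than it used to, and for an LRU cache lru_cache can be used directly. One more point: lru_cache is thread-safe (its linked list is updated under a lock, although the wrapped function itself runs outside the lock), whereas with OrderedDict the developer has to control concurrent access manually in multithreaded code. That reduces the importance of OrderedDict even further.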
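As a rough idea of what that DIY work might involve, here is a sketch of an OrderedDict-based cache guarded by a lock. The class name `LockedLRUCache` is hypothetical, and the design (one `threading.Lock` around every operation) is just one simple option, not a tuned implementation. Note that it keeps the most recently used key at the back, the opposite convention from the LeetCode solution earlier.

```python
import threading
from collections import OrderedDict


class LockedLRUCache:
    """Illustrative LRU cache: OrderedDict plus a single lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=-1):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)          # most recently used goes to the back
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)   # evict the least recently used key

    def __len__(self):
        with self._lock:
            return len(self._data)


cache = LockedLRUCache(2)
cache.put(1, 1)
cache.put(2, 2)
first = cache.get(1)      # hit, key 1 becomes most recently used
cache.put(3, 3)           # over capacity, key 2 is evicted
evicted = cache.get(2)    # miss
```

Holding the lock for the whole of `get` and `put` keeps the reorder-then-read and insert-then-evict sequences atomic, at the cost of serializing all access.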