python字典中字段重复性的分析~~

在python编程中，为了高效处理数据，我们经常需要对字典中的字段进行分析，找出最小不重复字段，显然这不是人干的活，得代码来~~

python 复制代码

def right_align_truncate(field, width):
    """右对齐，如果长度超过width则截取最后width个字符"""
    if len(field) > width:
        return field[-width:]                         # 截取最后width个字符
    else:
        return ' ' * (width - len(field)) + field     # 填充空格直到宽度为width
    #endif
#end def 

if __name__ == '__main__':

    class_type_dict = {}      # 载入你实际的字典

    total_bytes = 0
    count = 0
    max_bytes = 0
    min_bytes = float('inf')
    max_str = ""
    min_str = ""

    for value in class_type_dict.values():
        
        # 本例检查字典中list型值的第二个字段
        field = value[1]
        byte_len = len(field)
        total_bytes += byte_len
        count += 1
        
        if byte_len > max_bytes:
            max_bytes = byte_len
            max_str = field
        #endif
        
        if byte_len < min_bytes:
            min_bytes = byte_len
            min_str = field
        #endif
        
    #next
    
    avg_bytes = total_bytes / count

    print(f"平均字节数: {avg_bytes:.1f}")
    print(f"最大字节数: {max_bytes}, 对应字段: {max_str}")
    print(f"最小字节数: {min_bytes}, 对应字段: {min_str}\n")

    # 检查宽度从7降到2
    for width in range(max_bytes, 1, -1):

        aligned_dict = {}
        duplicates = {}

        for key, value in class_type_dict.items():
            
            if len(value) > 1:
                
                field = value[1]
                aligned = right_align_truncate(field, width)

                if aligned in aligned_dict:
                    if aligned not in duplicates:
                        duplicates[aligned] = [aligned_dict[aligned], field]
                    else:
                        duplicates[aligned].append(field)
                    #endif
                else:
                    aligned_dict[aligned] = field
                #endif
            
            #endif
            
        #next

        print(f"宽度={width}: 总字段数={len(aligned_dict)}, 重复组数={len(duplicates)}")
        
        if duplicates:
            
            print("重复示例:")
            count = 0
            
            for aligned, fields in duplicates.items():
                print(f"  右对齐值: '{aligned}' 来自: {fields}")
            #next
            
            break
            
        #endif
        
    #next

以上代码，博主需要分析的字典案例输出如下:

python 复制代码

平均字节数: 4.0
最大字节数: 10, 对应字段: 消费电子零部件及组装
最小字节数: 1, 对应字段: 铝

宽度=10: 总字段数=257, 重复组数=0
宽度=9: 总字段数=257, 重复组数=0
宽度=8: 总字段数=257, 重复组数=0
宽度=7: 总字段数=257, 重复组数=0
宽度=6: 总字段数=257, 重复组数=0
宽度=5: 总字段数=257, 重复组数=0
宽度=4: 总字段数=254, 重复组数=3
重复示例:
  右对齐值: '芯片设计' 来自: ['数字芯片设计', '模拟芯片设计']
  右对齐值: '电零部件' 来自: ['家电零部件', '风电零部件']
  右对齐值: '专用设备' 来自: ['锂电专用设备', '其他专用设备']

因此本例中，博主最小需要取的字段宽度为5。