2024 ,pgvector如何使你龙年识别真龙

前文

AI识别必定是未来的主流技术,数据库作为后端的关键核心服务,必须紧跟时代潮流。pgvector向量数据库横空出世,它可以把物体以多种特征变量存储在数据库里面,并且通过简单的SQL实现,对相似物体进行比较确认。

基于postgresql的一个扩展插件,与postgresql强强联手可以实现向量数据的存储、计算。

imgsmlr同样也是基于postgresql的一个扩展插件,它可以把非结构化文件转向成识别二进制数据,二进制再转换成特征数据,最后为转换成二进制数据做铺垫。流程如图,结果是我们从图片中识别出来的真龙, 千万图片丛中过,我只找属于我自己的真龙。

环境准备

安装 postgreSQL14

我的环境安装包如下,注意libs和devel都必须要有

csharp 复制代码
[root@server128 ~]# rpm -qa |grep -i postgres
postgresql14-14.10-1PGDG.rhel7.x86_64
postgresql-libs-9.2.24-9.el7_9.x86_64
postgresql14-libs-14.10-1PGDG.rhel7.x86_64
postgresql14-server-14.10-1PGDG.rhel7.x86_64
postgresql14-devel-14.10-1PGDG.rhel7.x86_64

安装imgsmlr

ini 复制代码
git clone --depth 1 https://github.com/postgrespro/imgsmlr  
  
  
cd imgsmlr/  
USE_PGXS=1 make   执行报错,见下图 
 
USE_PGXS=1 make install  
​

解决方法如下,在/usr/pgsql-14/include/server/access/gist.h的259行位置加入以下注释,即可以成功编译通过USE_PGXS=1 make

arduino 复制代码
​
#ifndef FALSE
#define FALSE   (0)
#endif
​
#ifndef TRUE
#define TRUE    (!FALSE)
#endif

安装pgvector

bash 复制代码
git clone  https://github.com/pgvector/pgvector.git
cd pgvector
make
make install

准备语句

csharp 复制代码
#img_raw装载图片
create table  img_raw(
rawid int primary key,
info   text,
ts timestamp,
raw  bytea
);
​
#img_vec装载图片的特征数据
create table  img_vec(
id  int  primary key  references img_raw(rawid),
patt   pattern,
sig signature
);
​
# 基于img_vec表构建gist
create index on   img_vec  using  gist(sig);
​
​
​
#插入图片信息
​
insert into img_raw values (1, 'long1,png', now(), pg_read_binary_file('/img/long1.jpg'));  
insert into img_raw values (2, 'long2,jpg', now(), pg_read_binary_file('/img/long2.jpg'));  
insert into img_raw values (3, 'long3,jpg', now(), pg_read_binary_file('/img/long3.jpg'));  
insert into img_raw values (4, 'long4,jpg', now(), pg_read_binary_file('/img/long4.jpg'));  
insert into img_raw values (5, 'long5,jpg', now(), pg_read_binary_file('/img/long5.jpg'));  
insert into img_raw values (6, 'long6,jpg', now(), pg_read_binary_file('/img/long6.jpg'));  
insert into img_raw values (7, 'long7,jpg', now(), pg_read_binary_file('/img/long7.jpg'));  
insert into img_raw values (8, 'long8,jpg', now(), pg_read_binary_file('/img/long8.jpg'));  
insert into img_raw values (9, 'long9,jpg', now(), pg_read_binary_file('/img/long9.jpg'));  
insert into img_raw values (10, 'long10,jpg', now(), pg_read_binary_file('/img/long10.jpg'));  
insert into img_raw values (11, 'long11,jpg', now(), pg_read_binary_file('/img/long11.jpg')); 
​
insert into img_raw values (12, 'chang1,jpg', now(), pg_read_binary_file('/img/chang1.jpg')); 
insert into img_raw values (13, 'hu1,jpg', now(), pg_read_binary_file('/img/hu1.jpg')); 
insert into img_raw values (14, 'mao1,jpg', now(), pg_read_binary_file('/img/mao1.jpg')); 
insert into img_raw values (15, 'she1,jpg', now(), pg_read_binary_file('/img/she1.jpg')); 
insert into img_raw values (16, 'tian1,jpg', now(), pg_read_binary_file('/img/tian1.jpg')); 
insert into img_raw values (17, 'wo1,jpg', now(), pg_read_binary_file('/img/wo1.jpg')); 
insert into img_raw values (18, 'xiong1,jpg', now(), pg_read_binary_file('/img/xiong1.jpg')); 
insert into img_raw values (19, 'tu1,jpg', now(), pg_read_binary_file('/img/tu1.jpg')); 
insert into img_raw values (20, 'tu2,jpg', now(), pg_read_binary_file('/img/tu2.jpg')); 
insert into img_raw values (21, 'tu3,jpg', now(), pg_read_binary_file('/img/tu3.jpg'));
​
​
#将图片提取所有特征保存img_vec表里面
insert into img_vec select rawid, jpeg2pattern(raw), pattern2signature(jpeg2pattern(raw)) from img_raw;
​
#img_vec表添加向量embedding
alter table img_vec add column embedding vector;
​
​
# img_vec表的特征数据转化为向量数据
update img_vec set embedding = replace(replace(sig::text, '(', '{'), ')', '}')::float[]::vector;  
​
​
​
select   
    t1.id,  
    t2.info,  
    embedding <-> (select embedding from img_vec where id=9) as "欧氏相似距离"  
  from img_vec t1 join img_raw t2 on t1.id=t2.rawid  
    order by embedding <-> (select embedding from img_vec where id=9)
    
    
    
​

实践:福龙与狗 、熊、兔、蜗牛、植物射手

用自己拍照21张,针对福龙从不同角度拍了10张,与及其它各式小动物共11张,在postgresql转换成向量数据后,用使用欧氏算法、内积算法、余弦算法来进行计算

为了证明这个是龙,看它的标签

vbnet 复制代码
以long9作为标准图,计算其它图片与它的相似度
​
#用使用欧氏算法,<->来进行计算
postgres=# select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <-> (select embedding from img_vec where id=9) as "欧氏相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <-> (select embedding from img_vec where id=9);
 id |    info    |    欧氏相似距离
----+------------+--------------------
  9 | long9,jpg  |                  0
  4 | long4,jpg  | 0.8131946015532012
 14 | mao1,jpg   |  0.873532153956551
  5 | long5,jpg  | 0.9437004404760417
  3 | long3,jpg  |  1.006888032114946
  2 | long2,jpg  | 1.0538826025705688
 17 | wo1,jpg    | 1.0584338866296883
  8 | long8,jpg  | 1.1284268427144581
  1 | long1,png  | 1.1660293246028335
 10 | long10,jpg | 1.1743752827626244
  7 | long7,jpg  | 1.1849544245168537
 11 | long11,jpg | 1.2636647526394387
 18 | xiong1,jpg |   1.28623207769336
 16 | tian1,jpg  | 1.2928207635465576
  6 | long6,jpg  |  1.415787966613027
 15 | she1,jpg   |  1.472345617909073
 12 | chang1,jpg | 1.5688576102868548
 13 | hu1,jpg    | 1.5765029578406513
 19 | tu1,jpg    | 1.6046681136433938
 21 | tu3,jpg    | 1.9848909683454863
 20 | tu2,jpg    |  1.996557491652661
(21 rows)
​
​
#内积算法  <#>来进行计算
postgres=# select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <#> (select embedding from img_vec where id=9) as "内积相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <#> (select embedding from img_vec where id=9)  ;
 id |    info    |    内积相似距离
----+------------+---------------------
  9 | long9,jpg  |  -9.793889999389648
 15 | she1,jpg   |   -9.56818962097168
  5 | long5,jpg  |  -9.520377159118652
 10 | long10,jpg |  -9.336055755615234
  4 | long4,jpg  |   -9.27934455871582
  8 | long8,jpg  |  -9.063756942749023
 20 | tu2,jpg    |  -9.029650688171387
  1 | long1,png  |  -8.906010627746582
 11 | long11,jpg |  -8.879608154296875
  3 | long3,jpg  |  -8.859291076660156
  2 | long2,jpg  |  -8.828414916992188
  7 | long7,jpg  |  -8.590826034545898
  6 | long6,jpg  |   -8.51888656616211
 14 | mao1,jpg   |  -8.475362777709961
 17 | wo1,jpg    | -7.8920063972473145
 12 | chang1,jpg |  -7.692102432250977
 16 | tian1,jpg  |  -7.567811965942383
 13 | hu1,jpg    |    -6.4823317527771
 18 | xiong1,jpg |  -6.195792198181152
 19 | tu1,jpg    |  -5.272591590881348
 21 | tu3,jpg    |    -4.3834547996521
(21 rows)
​
#余弦算法  <=> 来进行计算
postgres=# select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <=> (select embedding from img_vec where id=9) as "余弦相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <=> (select embedding from img_vec where id=9);
 id |    info    |     余弦相似距离
----+------------+----------------------
  9 | long9,jpg  |                    0
  4 | long4,jpg  |    0.034229251128992
 14 | mao1,jpg   |  0.03767662614530476
 18 | xiong1,jpg | 0.039896219019039614
  5 | long5,jpg  |  0.04453997228870388
  3 | long3,jpg  |  0.05313360470652673
 17 | wo1,jpg    |  0.05427983467295261
  2 | long2,jpg  |  0.05828036923202573
  8 | long8,jpg  |  0.06559011022972738
 10 | long10,jpg |  0.06853261342194783
  1 | long1,png  |  0.07069949063544745
  7 | long7,jpg  |  0.07420155647983717
 19 | tu1,jpg    |  0.07621919722304038
 11 | long11,jpg |  0.08243281628522736
 16 | tian1,jpg  |  0.08686050822976776
 15 | she1,jpg   |  0.09882522468462784
  6 | long6,jpg  |   0.1048961893306749
 13 | hu1,jpg    |  0.12904889503224415
 12 | chang1,jpg |  0.13378521110088515
 20 | tu2,jpg    |  0.17567972259493136
 21 | tu3,jpg    |   0.1793031852332665
(21 rows)
​
​
​

上面的计算结果不理想,反复查看,最后发现long9右边有鼓起的红包,特征收集钱包的信息,我把long9改成long6,重新运行欧氏相似距离计算,这样发现前5名都是long的收集结果,而且long6与long9离得老远!

实践:龙与猫

我在网络上随机摘了6个猫的图,以及5个龙的图,这下子可以确定照片来自于不同的拍摄工具、不同的照片格式、不同的编码方式、不同的分辨率、不同的大小。

vbnet 复制代码
​
​
postgres=#   select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <-> (select embedding from img_vec where id=9) as "欧氏相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <-> (select embedding from img_vec where id=9)
postgres-#     limit 20;
 id |   info   |    欧氏相似距离
----+----------+--------------------
  9 | d3,jpg   |                  0
  7 | d1,jpeg  | 0.7647856568512812
  8 | d2,jpeg  | 1.2676685459702477
  5 | cat4,png | 1.3949615071226016
  6 | cat5,jpg |  1.464445502504794
  3 | cat3,jpg | 1.7297577441389917
 11 | d5,jpg   |  2.441273726871975
  1 | cat1,png |  2.449192466777335
  2 | cat2,jpg |  2.623407880151245
 10 | d4,jpeg  | 2.8941867638822063
  4 | cat4,jpg |  4.956624524660588
(11 rows)
​
postgres=#   select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <#> (select embedding from img_vec where id=9) as "内积相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <#> (select embedding from img_vec where id=9)
postgres-#     limit 20;
 id |   info   |    内积相似距离
----+----------+---------------------
 11 | d5,jpg   | -13.741656303405762
 10 | d4,jpeg  | -12.781399726867676
  1 | cat1,png | -11.762970924377441
  2 | cat2,jpg |  -9.446988105773926
  5 | cat4,png |  -8.874711036682129
  9 | d3,jpg   |   -8.14848518371582
  4 | cat4,jpg |  -8.092750549316406
  7 | d1,jpeg  |   -7.85388708114624
  3 | cat3,jpg |  -6.499306678771973
  8 | d2,jpeg  |  -5.019601821899414
  6 | cat5,jpg |  -4.994093418121338
(11 rows)
​
postgres=#   select
postgres-#     t1.id,
postgres-#     t2.info,
postgres-#     embedding <=> (select embedding from img_vec where id=9) as "余弦相似距离"
postgres-#   from img_vec t1 join img_raw t2 on t1.id=t2.rawid
postgres-#     order by embedding <=> (select embedding from img_vec where id=9)
postgres-#     limit 20;
 id |   info   |     余弦相似距离
----+----------+----------------------
  9 | d3,jpg   |                    0
  7 | d1,jpeg  |  0.03589936156946816
 11 | d5,jpg   | 0.042836060244666685
  8 | d2,jpeg  |  0.05975859196041422
  5 | cat4,png |  0.08507819244719383
  1 | cat1,png |   0.1087175221334512
 10 | d4,jpeg  |  0.11832410643368074
  6 | cat5,jpg |  0.12351988272197956
  3 | cat3,jpg |   0.1869637379378707
  2 | cat2,jpg |   0.2117638205685053
  4 | cat4,jpg |   0.5035052630227678
(11 rows)

上面结论来看,这来自四面八方不同的图片使用三种算法,余弦距离计算的功能最好,前面三名都是龙有关的

结论

  • imgsmlr和pgvector是好东西,但是要注意图片的特征值的抽取,例如同样的手机拍摄 ,因为 识别目标旁边放的东西,向量化计算时会大打折扣。

  • imgsmlr和pgvector虽好,但是没有可以监测的调试环境,如何改变特征值?如何在内部对图像进行观测?目前还没有这个功能

  • pgvector以后在AI大模型领域必定大有作为,目前识别虽有误差,但是识别有80%以上的通过 , 以后进一步的发展, 通过后端与前端的合力,必定可以识别100%的真龙。

    最后预祝各位同学龙年遇真龙,龙年大乐!

相关推荐
Channing Lewis17 分钟前
sql server如何创建表导入excel的数据
数据库·oracle·excel
秃头摸鱼侠18 分钟前
MySQL安装与配置
数据库·mysql·adb
UGOTNOSHOT22 分钟前
每日八股文6.3
数据库·sql
行云流水行云流水1 小时前
数据库、数据仓库、数据中台、数据湖相关概念
数据库·数据仓库
John Song1 小时前
Redis 集群批量删除key报错 CROSSSLOT Keys in request don‘t hash to the same slot
数据库·redis·哈希算法
写bug写bug1 小时前
手把手教你使用JConsole
java·后端·程序员
苏三说技术1 小时前
给你1亿的Redis key,如何高效统计?
后端
IvanCodes1 小时前
七、Sqoop Job:简化与自动化数据迁移任务及免密执行
大数据·数据库·hadoop·sqoop
tonexuan1 小时前
MySQL 8.0 绿色版安装和配置过程
数据库·mysql
JohnYan1 小时前
工作笔记- 记一次MySQL数据移植表空间错误排除
数据库·后端·mysql