PgSQL,字段打乱脚本
需求场景
- 随着AI场景的广泛应用,或是数据分级分类开发等场景的需求,需要在保证数据安全的同时,又尽可能的保留数据的原有特征以供AI进行数据学习分析或是外部人员进行数据治理的脚本开发。
- 最简单且常用的方法即是将数据表的各个字段进行随机打乱,再进行重新排序并截取其部分数据。可在保证单个字段保持其自有特征和内部逻辑的情况下,实现数据记录的内容混淆,保证数据内容的不外泄。
代码逻辑
- 生成自增序列,序列长度与表记录数一致
- 依次将随机数与各个目标字段进行拼接
- 依次对随机数进行递增/递减排序,进而打乱目标字段
- 截取期望数量的记录数,并对各个字段进行自增序列编号
- 利用自增序列关联各个目标字段,形成最终的乱序目标表
代码实现
总体代码
select
xzcfxwbh.xzcfxwbh,
jlwybs.jlwybs,
xm.xm
from
(select row_number() over() as abc from spfjbasepop.b01_hyxxb) abc
left join
(select xzcfxwbh as xzcfxwbh,row_number() over() as rn from
(select xzcfxwbh,random() as rand from spfjbasepop.b01_xzcfxwxx order by rand desc )
limit 1000
) xzcfxwbh
on abc.abc=xzcfxwbh.rn
left join
(select jlwybs as jlwybs,row_number() over() as rn from
(select jlwybs,random() as rand from spfjbasepop.b01_xzcfxwxx order by rand desc )
limit 1000
) jlwybs
on abc.abc=jlwybs.rn
left join
(select xm as xm,row_number() over() as rn from
(select xm,random() as rand from spfjbasepop.b01_xzcfxwxx order by rand desc )
limit 1000
) xm
on abc.abc=xm.rn
代码分解
- 生成自增序列
select
row_number() over() as abc
from spfjbasepop.b01_hyxxb
- 拼接随机数并排序
select
xzcfxwbh
,random() as rand
from spfjbasepop.b01_xzcfxwxx
order by rand desc
- 为乱序数据编号
select
xzcfxwbh as xzcfxwbh
,row_number() over() as rn
from
(select xzcfxwbh,random() as rand from spfjbasepop.b01_xzcfxwxx order by rand desc )
limit 1000
自动化脚本
库名 | 表名 | 字段名 | 限制字段数 | 最终脚本 | |
---|---|---|---|---|---|
spfjbasepop | B01_XZCFXWXX | XZCFXWBH | 1000 | =CONCAT("left join (select ",C2," as ",C2,",row_number() over() as rn from(select ",C2,",random() as rand from ",A2,".",B2," order by rand desc )limit ",$D$2," ) ",C2," on abc.abc=",C2,".rn") | =LOWER(TEXTJOIN(" ",TRUE,CONCAT("select ",TEXTJOIN(",",TRUE,F:F)," from (select row_number() over() as abc from spfjbasepop.B01_HYXXB) abc "),E:E)) |