使用Arrow管理数据

在之前的数据挖掘:是时候更新一下TCGA的数据了推文中,保存TCGA的数据就是使用Arrow格式,因为占空间小,读写速度快,多语言支持(我主要使用的3种语言都支持)

Format

https://arrow.apache.org

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Language Supported

Arrow's libraries implement the format and provide building blocks for a range of use cases, including high performance analytics. Many popular projects use Arrow to ship columnar data efficiently or as the basis for analytic engines.

Libraries are available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

Ecosystem

Apache Arrow is software created by and for the developer community. We are dedicated to open, kind communication and consensus decisionmaking. Our committers come from a range of organizations and backgrounds, and we welcome all to participate with us.

R

install.packages("arrow")

library(arrow)

write iris to iris.arrow and compressed by zstd

arrow::write_ipc_file(iris,'iris.arrow', compression = "zstd",compression_level=1)

read iris.arrow as DataFrame

iris=arrow::read_ipc_file('iris.arrow')

python

conda install -y pandas pyarrow

import pandas as pd

read iris.arrow as DataFrame

iris=pd.read_feather('iris.arrow')

write iris to iris.arrow and compressed by zstd

iris.to_feather('iris.arrow',compression='zstd', compression_level=1)

Julia

using Pkg

Pkg.add("Arrow","DataFrames")

using Arrow, DataFrames

read iris.arrow as DataFrame

iris = Arrow.Table("iris.arrow") |> DataFrame

write iris to iris.arrow, using 8 threads and compressed by zstd

Arrow.write("iris.arrow",iris,compress=:zstd,ntasks=8)

相关推荐
IT龟苓膏3 小时前
Redis 数据类型底层原理:SDS、quicklist、intset、skiplist、Bitmap、HyperLogLog 一篇讲清
数据库·redis·skiplist
流星白龙4 小时前
【MySQL高阶】19.变更缓冲区,自适应哈希索引,日志缓冲区
数据库·windows·mysql
晴天¥4 小时前
Oracle中的监听配置与管理(动态、静态监听配置对比以及listener.ora和tnsnames.ora)
数据库·oracle
瀚高PG实验室5 小时前
python连接HGDB超时
数据库·瀚高数据库·highgo
闪电悠米6 小时前
黑马点评-Redisson-01_why_redisson
java·服务器·网络·数据库·缓存·wpf
Counter-Strike大牛6 小时前
SpringBoot2.7.10+MyBatisPlus实现MySQL+DM双数据库切换
数据库·mysql
dllxhcjla6 小时前
Redis
数据库·redis·缓存
睡不醒男孩0308236 小时前
数据库高可用运维实操指南:基于CLup的PostgreSQL生产环境自动化管理
运维·数据库·postgresql
神仙别闹6 小时前
基于Python + SQL server 实现(GUI)原神圣遗物管理与角色数值模拟系统
java·数据库·python
Crazy_eater7 小时前
Mysql(6)--基础查询
数据库·mysql