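Managing Data with Arrow

In the earlier post "Data Mining: Time to Update the TCGA Data", I saved the TCGA data in Arrow format because the files are small, reads and writes are fast, and the format is supported across languages (all three languages I mainly use support it).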

Format

https://arrow.apache.org

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
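To make "columnar" and "zero-copy" more concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the idea, not Arrow's actual implementation: the toy values, the column names, and the use of `array` and `memoryview` are all assumptions chosen for the example.

```python
from array import array

# Row-oriented layout: one Python tuple per record (toy values).
rows = [(5.1, 3.5), (4.9, 3.0), (4.7, 3.2)]

# Column-oriented layout: one contiguous typed buffer per field.
columns = {
    "sepal_length": array("d", [r[0] for r in rows]),
    "sepal_width": array("d", [r[1] for r in rows]),
}

# A memoryview exposes the same buffer without copying it,
# and slicing a memoryview does not copy either.
view = memoryview(columns["sepal_length"])
tail = view[1:]

# A write to the underlying buffer is visible through the view,
# which shows that the view shares memory with the column.
columns["sepal_length"][2] = 9.9

# An aggregate over one field only touches that field's buffer.
mean_width = sum(columns["sepal_width"]) / len(columns["sepal_width"])
```

Because each field lives in its own contiguous buffer, an operation on one column never has to walk over the others, which is roughly why columnar layouts suit analytic workloads.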

Language Supported

Arrow's libraries implement the format and provide building blocks for a range of use cases, including high performance analytics. Many popular projects use Arrow to ship columnar data efficiently or as the basis for analytic engines.

Libraries are available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

Ecosystem

Apache Arrow is software created by and for the developer community. We are dedicated to open, kind communication and consensus decisionmaking. Our committers come from a range of organizations and backgrounds, and we welcome all to participate with us.

R

install.packages("arrow")

library(arrow)

Write iris to iris.arrow, compressed with zstd:

arrow::write_ipc_file(iris, "iris.arrow", compression = "zstd", compression_level = 1)

Read iris.arrow back as a data frame:

iris <- arrow::read_ipc_file("iris.arrow")

Python

conda install -y pandas pyarrow

import pandas as pd

Read iris.arrow as a DataFrame:

iris = pd.read_feather("iris.arrow")

Write iris to iris.arrow, compressed with zstd:

iris.to_feather("iris.arrow", compression="zstd", compression_level=1)

Julia

using Pkg

Pkg.add(["Arrow","DataFrames"])

using Arrow, DataFrames

Read iris.arrow as a DataFrame:

iris = Arrow.Table("iris.arrow") |> DataFrame

Write iris to iris.arrow, compressed with zstd and using 8 tasks (ntasks=8):

Arrow.write("iris.arrow", iris; compress=:zstd, ntasks=8)
