Managing Data with Arrow

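In the earlier post "Data Mining: Time to Update the TCGA Data", I saved the TCGA data in the Arrow format because the files are small, reads and writes are fast, and the format is supported across many languages (including all three that I mainly use).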

Format

https://arrow.apache.org

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Supported Languages

Arrow's libraries implement the format and provide building blocks for a range of use cases, including high performance analytics. Many popular projects use Arrow to ship columnar data efficiently or as the basis for analytic engines.

Libraries are available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

Ecosystem

Apache Arrow is software created by and for the developer community. We are dedicated to open, kind communication and consensus decision-making. Our committers come from a range of organizations and backgrounds, and we welcome all to participate with us.

R

install.packages("arrow")

library(arrow)

Write iris to iris.arrow, compressed with zstd:

arrow::write_ipc_file(iris, "iris.arrow", compression = "zstd", compression_level = 1)

Read iris.arrow into a data frame:

iris <- arrow::read_ipc_file("iris.arrow")

Python

conda install -y pandas pyarrow

import pandas as pd

Read iris.arrow into a DataFrame:

iris = pd.read_feather("iris.arrow")

Write iris to iris.arrow, compressed with zstd:

iris.to_feather("iris.arrow", compression="zstd", compression_level=1)

Julia

using Pkg

Pkg.add(["Arrow","DataFrames"])

using Arrow, DataFrames

Read iris.arrow into a DataFrame:

iris = Arrow.Table("iris.arrow") |> DataFrame

Write iris to iris.arrow, compressed with zstd and using up to 8 concurrent write tasks:

Arrow.write("iris.arrow", iris; compress=:zstd, ntasks=8)
