Installing Polars #
Polars is a library, and installing it is as simple as invoking the package manager of the corresponding programming language.
pip install polars
# Or, for older CPUs that do not support Advanced Vector Extensions 2 (AVX2):
pip install polars-lts-cpu
cargo add polars -F lazy
# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["lazy", ...] }
Big index #
By default, Polars dataframes are limited to 2^32 (~4.3 billion) rows. Enabling the big index extension raises this limit to 2^64 (~18 quintillion) rows:
pip install polars-u64-idx
cargo add polars -F bigidx
# Or Cargo.toml
[dependencies]
polars = { version = "x", features = ["bigidx", ...] }
Legacy CPUs #
To install Polars for Python on an older CPU that does not support Advanced Vector Extensions (AVX), run:
pip install polars-lts-cpu
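As a rough way to tell which wheel applies, you can inspect the CPU flags the kernel reports. Below is a minimal, stdlib-only sketch; it is Linux-specific (it reads `/proc/cpuinfo`), and `has_avx` is a helper name invented for this example, not part of Polars:

```python
# Hedged helper: detect whether this CPU advertises AVX support,
# to decide between `polars` and `polars-lts-cpu`.
# Linux-only, since it reads /proc/cpuinfo.
from pathlib import Path


def has_avx() -> "bool | None":
    """Return True/False if AVX support can be determined, else None."""
    cpuinfo = Path("/proc/cpuinfo")
    if not cpuinfo.exists():
        return None  # e.g. macOS/Windows: no /proc filesystem
    flag_lines = [
        line for line in cpuinfo.read_text().splitlines()
        if line.startswith("flags")
    ]
    if not flag_lines:
        return None
    return "avx" in flag_lines[0].split()


if __name__ == "__main__":
    avx = has_avx()
    if avx is False:
        print("No AVX detected: pip install polars-lts-cpu")
    elif avx is True:
        print("AVX detected: pip install polars")
    else:
        print("Could not determine AVX support on this platform")
```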
Importing Polars #
To use the Polars library, simply import it into your project:
import polars as pl
use polars::prelude::*;
Feature flags #
The commands above install the core of Polars onto your system. Depending on your use case, however, you may want to install optional dependencies as well. These are made optional to minimize the footprint; the corresponding flags differ per programming language. Throughout the user guide, features that require an extra dependency are explicitly called out.
Python #
# Example
pip install 'polars[numpy,fsspec]'
All #
| Flag | Description |
|---|---|
| all | Install all optional dependencies. |
GPU #
| Flag | Description |
|---|---|
| gpu | Run queries on NVIDIA GPUs. |
Note
For more detailed instructions and prerequisites, see the page on GPU support.
互操作性 #
| Flag | Description |
|---|---|
| pandas | Convert data to and from pandas dataframes/series. |
| numpy | Convert data to and from NumPy arrays. |
| pyarrow | Convert data to and from PyArrow tables/arrays. |
| pydantic | Convert data from Pydantic models to Polars. |
Excel #
| Flag | Description |
|---|---|
| calamine | Read from Excel files with the calamine engine. |
| openpyxl | Read from Excel files with the openpyxl engine. |
| xlsx2csv | Read from Excel files with the xlsx2csv engine. |
| xlsxwriter | Write to Excel files with the XlsxWriter engine. |
| excel | Install all supported Excel engines. |
数据库 #
| Flag | Description |
|---|---|
| adbc | Read from and write to databases with the Arrow Database Connectivity (ADBC) engine. |
| connectorx | Read from databases with the ConnectorX engine. |
| sqlalchemy | Write to databases with the SQLAlchemy engine. |
| database | Install all supported database engines. |
Cloud #
| Flag | Description |
|---|---|
| fsspec | Read from and write to remote file systems. |
Other I/O #
| Flag | Description |
|---|---|
| deltalake | Read from and write to Delta tables. |
| iceberg | Read from Apache Iceberg tables. |
Other #
| Flag | Description |
|---|---|
| async | Collect LazyFrames asynchronously. |
| cloudpickle | Serialize user-defined functions. |
| graph | Visualize LazyFrames as a graph. |
| plot | Plot dataframes through the plot namespace. |
| style | Style dataframes through the style namespace. |
| timezone | Timezone support; only needed on Windows. |
Rust #
# Cargo.toml
[dependencies]
polars = { version = "0.26.1", features = ["lazy", "temporal", "describe", "json", "parquet", "dtype-datetime"] }
The opt-in features are:
- Extra data types:
  - `dtype-date`
  - `dtype-datetime`
  - `dtype-time`
  - `dtype-duration`
  - `dtype-i8`
  - `dtype-i16`
  - `dtype-u8`
  - `dtype-u16`
  - `dtype-categorical`
  - `dtype-struct`
- `lazy` - Lazy API:
  - `regex` - Use regexes in column selection.
  - `dot_diagram` - Create dot diagrams from lazy logical plans.
- `sql` - Pass SQL queries to Polars.
- `streaming` - Be able to process datasets that are larger than RAM.
- `random` - Generate arrays with randomly sampled values.
- `ndarray` - Convert from DataFrame to ndarray.
- `temporal` - Conversions between Chrono and Polars for temporal data types.
- `timezones` - Activate timezone support.
- `strings` - Extra string utilities for `StringChunked`:
  - `string_pad` - For `pad_start`, `pad_end`, `zfill`.
  - `string_to_integer` - For `parse_int`.
- `object` - Support for generic ChunkedArrays called `ObjectChunked<T>` (generic over `T`). These are downcastable from Series through the Any trait.
- Performance related:
  - `nightly` - Several nightly-only features, such as SIMD and specialization.
  - `performant` - More fast paths; slower compile times.
  - `bigidx` - Activate this feature if you expect more than 2^32 rows. This allows Polars to scale up way beyond that by using `u64` as an index. Polars will be a bit slower with this feature activated, as many data structures are less cache-efficient.
  - `cse` - Activate the common subplan elimination optimization.
- IO related:
  - `serde` - Support for serde serialization and deserialization. Can be used for JSON and more serde-supported serialization formats.
  - `serde-lazy` - Support for serde serialization and deserialization. Can be used for JSON and more serde-supported serialization formats.
  - `parquet` - Read Apache Parquet format.
  - `json` - JSON serialization.
  - `ipc` - Arrow's IPC format serialization.
  - `decompress` - Automatically infer the compression of CSV files and decompress them. Supported compression formats: gzip, zlib, zstd.
- DataFrame operations:
  - `dynamic_group_by` - Group by based on a time window instead of predefined keys. Also activates rolling-window group-by operations.
  - `sort_multiple` - Allow sorting a dataframe on multiple columns.
  - `rows` - Create dataframes from rows and extract rows from dataframes. Also activates `pivot` and `transpose` operations.
  - `join_asof` - ASOF join, to join on nearest keys instead of an exact equality match.
  - `cross_join` - Create the Cartesian product of two dataframes.
  - `semi_anti_join` - SEMI and ANTI joins.
  - `row_hash` - Utility to hash dataframe rows to `UInt64Chunked`.
  - `diagonal_concat` - Diagonal concatenation, combining different schemas.
  - `dataframe_arithmetic` - Arithmetic between dataframes and other dataframes or series.
  - `partition_by` - Split into multiple dataframes partitioned by groups.
- Series/expression operations:
  - `is_in` - Check for membership in Series.
  - `zip_with` - Zip two Series/ChunkedArrays.
  - `round_series` - Round underlying float types of Series.
  - `repeat_by` - Repeat an element in an array a number of times specified by another array.
  - `is_first_distinct` - Check if an element is the first unique value.
  - `is_last_distinct` - Check if an element is the last unique value.
  - `checked_arithmetic` - Checked arithmetic, returning `None` on invalid operations.
  - `dot_product` - Dot/inner product on Series and expressions.
  - `concat_str` - Concatenate string data in linear time.
  - `reinterpret` - Utility to reinterpret bits to signed/unsigned.
  - `take_opt_iter` - Take from a Series with `Iterator<Item=Option<usize>>`.
  - `mode` - Return the most frequently occurring value(s).
  - `cum_agg` - `cum_sum`, `cum_min`, and `cum_max` aggregations.
  - `rolling_window` - Rolling window functions, like `rolling_mean`.
  - `interpolate` - Interpolate `None` values.
  - `extract_jsonpath` - Run jsonpath queries on `StringChunked`.
  - `list` - List utils:
    - `list_gather` - Take sublists by multiple indices.
  - `rank` - Ranking algorithms.
  - `moment` - Kurtosis and skew statistics.
  - `ewma` - Exponential moving average windows.
  - `abs` - Get absolute values of Series.
  - `arange` - Range operation on Series.
  - `product` - Compute the product of a Series.
  - `diff` - `diff` operation.
  - `pct_change` - Compute change percentages.
  - `unique_counts` - Count unique values in expressions.
  - `log` - Logarithms for Series.
  - `list_to_struct` - Convert `List` to `Struct` data types.
  - `list_count` - Count elements in lists.
  - `list_eval` - Apply expressions over list elements.
  - `cumulative_eval` - Apply expressions over cumulatively increasing windows.
  - `arg_where` - Get indices where a condition holds.
  - `search_sorted` - Find indices where elements should be inserted to maintain order.
  - `offset_by` - Add an offset to dates that takes months and leap years into account.
  - `trigonometry` - Trigonometric functions.
  - `sign` - Compute the element-wise sign (positive, negative, or zero) of a Series.
  - `propagate_nans` - `NaN`-propagating min/max aggregations.
- DataFrame pretty formatting:
  - `fmt` - Activate DataFrame formatting.