load_optimized_csv

Load a CSV file and return a memory-optimized DataFrame.

Functions

Name                Description
load_optimized_csv  Load a CSV as a memory-optimized DataFrame with type downcasting

load_optimized_csv

load_optimized_csv.load_optimized_csv(
    filepath,
    nrows=None,
    usecols=None,
    no_sparse_cols=None,
    no_downcast_cols=None,
    no_category_cols=None,
    sparse_threshold=0.3,
    category_threshold=0.7,
    **kwargs,
)

Load a CSV as a memory-optimized DataFrame with type downcasting and categorical/sparse conversions.

The function chooses a chunk size from the file size and the available system memory, then processes each chunk in three steps: downcasting numeric dtypes, converting low-cardinality string columns to categorical (see category_threshold), and converting columns with a high proportion of zeros to sparse (see sparse_threshold). The processed chunks are concatenated into a single memory-optimized DataFrame with a RangeIndex.
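
The per-chunk steps described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not the implementation of load_optimized_csv: the helpers optimize_chunk and load_in_chunks are hypothetical, the chunk size is fixed by the caller rather than derived from file size and memory, and the threshold comparisons are assumed to be inclusive.

```python
import io

import pandas as pd
from pandas.api import types as ptypes


def optimize_chunk(chunk, sparse_threshold=0.3, category_threshold=0.7):
    """Hypothetical helper: downcast, categorize, and sparsify one chunk."""
    out = {}
    for name, col in chunk.items():
        if ptypes.is_integer_dtype(col):
            col = pd.to_numeric(col, downcast="integer")
        elif ptypes.is_float_dtype(col):
            col = pd.to_numeric(col, downcast="float")
        elif ptypes.is_object_dtype(col) or ptypes.is_string_dtype(col):
            # Low cardinality: few distinct values relative to the row count.
            if len(col) and col.nunique() / len(col) <= category_threshold:
                col = col.astype("category")
        # Mostly-zero numeric columns are stored sparsely with 0 as fill value.
        if ptypes.is_numeric_dtype(col) and not ptypes.is_bool_dtype(col):
            if len(col) and (col == 0).mean() >= sparse_threshold:
                col = col.astype(pd.SparseDtype(col.dtype, fill_value=0))
        out[name] = col
    return pd.DataFrame(out, index=chunk.index)


def load_in_chunks(source, chunksize, **kwargs):
    """Hypothetical stand-in: read fixed-size chunks, optimize, concatenate."""
    reader = pd.read_csv(source, chunksize=chunksize, **kwargs)
    parts = [optimize_chunk(chunk) for chunk in reader]
    return pd.concat(parts, ignore_index=True)


# Small in-memory CSV: 8 rows, a mostly-zero flag, and a two-valued status.
csv_text = "id,flag,status\n" + "".join(
    f"{i},{1 if i % 4 == 0 else 0},{'open' if i % 2 else 'closed'}\n"
    for i in range(1, 9)
)
df = load_in_chunks(io.StringIO(csv_text), chunksize=4)
print(df.dtypes)
```

One caveat of this simplified approach: if different chunks produce different category sets, pd.concat falls back to a non-categorical dtype for that column, so a real implementation would need to reconcile categories (for example with pandas.api.types.union_categoricals) or re-apply the conversion after concatenation.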

Parameters

Name                Type         Default   Description
filepath            str          required  Path to the CSV file to load.
nrows               int          None      Maximum number of rows to read. If None, reads all rows.
usecols             list of str  None      Columns to read. If None, reads all columns.
no_sparse_cols      list of str  None      Columns to exclude from sparse conversion.
no_downcast_cols    list of str  None      Columns to exclude from dtype downcasting.
no_category_cols    list of str  None      Columns to exclude from categorical conversion.
sparse_threshold    float        0.3       Minimum proportion of zeros required to convert a column to sparse. Must be between 0 and 1, inclusive.
category_threshold  float        0.7       Maximum ratio of unique values to total values for a string column to be converted to categorical. Must be between 0 and 1, inclusive.
**kwargs                         {}        Additional keyword arguments passed to pandas.read_csv (e.g., sep, encoding, parse_dates).
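
To make the two thresholds concrete, the following sketch computes the zero proportion and the unique-value ratio for a few sample columns in plain Python and compares them with the default values. The helper names zero_fraction and unique_ratio are hypothetical, and the inclusive comparisons (>= and <=) are an assumption; the table only says "minimum" and "maximum".

```python
def zero_fraction(values):
    """Share of entries equal to zero (hypothetical helper)."""
    return sum(1 for v in values if v == 0) / len(values)


def unique_ratio(values):
    """Distinct values divided by total values (hypothetical helper)."""
    return len(set(values)) / len(values)


clicks = [0, 0, 5, 0, 2, 0, 0, 1, 0, 0]                # 7 of 10 entries are zero
status = ["open", "closed", "open", "open", "closed"]  # 2 distinct values of 5
user_id = ["u1", "u2", "u3", "u4", "u5"]               # every value is distinct

sparse_threshold = 0.3    # default from the table
category_threshold = 0.7  # default from the table

# 0.7 >= 0.3, so clicks would qualify for sparse storage.
clicks_sparse = zero_fraction(clicks) >= sparse_threshold
# 0.4 <= 0.7, so status would qualify for categorical conversion.
status_categorical = unique_ratio(status) <= category_threshold
# 1.0 > 0.7, so user_id would stay a plain string column.
user_id_categorical = unique_ratio(user_id) <= category_threshold

print(clicks_sparse, status_categorical, user_id_categorical)
```

Raising sparse_threshold or lowering category_threshold makes the conversions more conservative, which is what the example at the end of this page does with 0.6 and 0.3.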

Returns

Type          Description
pd.DataFrame  A memory-optimized DataFrame with:
              - Numeric columns downcast to the smallest sufficient dtype
              - Low-cardinality string columns converted to categorical
              - High-zero columns converted to SparseDtype
              - A RangeIndex set as the index
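
The effect of these conversions on memory can be checked with DataFrame.memory_usage(deep=True). The sketch below does not call load_optimized_csv; it applies two of the listed conversions by hand to a synthetic frame (the column names and data are made up) and compares the totals.

```python
import pandas as pd

# Synthetic data: small integers stored as int64 and a two-valued string column.
raw = pd.DataFrame(
    {
        "count": pd.Series(range(1000), dtype="int64") % 100,
        "status": ["open", "closed"] * 500,
    }
)

optimized = raw.copy()
# Values 0..99 fit in int8, the smallest signed integer dtype.
optimized["count"] = pd.to_numeric(optimized["count"], downcast="integer")
# Two distinct strings become integer codes plus a small category table.
optimized["status"] = optimized["status"].astype("category")

raw_bytes = raw.memory_usage(deep=True).sum()
optimized_bytes = optimized.memory_usage(deep=True).sum()
print(f"raw: {raw_bytes} bytes, optimized: {optimized_bytes} bytes")
print(optimized.dtypes)
```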

Raises

Type                      Description
FileNotFoundError         If filepath does not exist.
ValueError                If the file is not a valid CSV, if sparse_threshold or category_threshold is outside [0, 1], or if usecols contains columns not present in the CSV.
TypeError                 If arguments are of incorrect types.
pd.errors.EmptyDataError  If the CSV file is empty or contains only headers.
MemoryError               If the final DataFrame exceeds available memory.
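
Several of these conditions can be checked before any data is read. The following is a minimal sketch of such pre-flight checks, assuming a hypothetical helper validate_args; the order of the checks and the error messages are illustrative and may differ from what load_optimized_csv actually does.

```python
import os


def validate_args(filepath, sparse_threshold=0.3, category_threshold=0.7):
    """Hypothetical pre-flight checks mirroring the documented exceptions."""
    if not isinstance(filepath, str):
        raise TypeError(f"filepath must be str, got {type(filepath).__name__}")

    thresholds = {
        "sparse_threshold": sparse_threshold,
        "category_threshold": category_threshold,
    }
    for name, value in thresholds.items():
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number, got {type(value).__name__}")
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    if not os.path.exists(filepath):
        raise FileNotFoundError(f"No such file: {filepath}")


# Example: an out-of-range threshold is reported before the file is touched.
try:
    validate_args("large_dataset.csv", sparse_threshold=1.5)
except ValueError as exc:
    print(f"rejected: {exc}")
```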

Examples

>>> from csvplus.load_optimized_csv import load_optimized_csv
>>> df = load_optimized_csv(
...     "large_dataset.csv",
...     nrows=100000,
...     usecols=["id", "value", "category", "status"],
...     no_sparse_cols=["id"],
...     no_downcast_cols=["value"],
...     no_category_cols=["id"],
...     sparse_threshold=0.6,
...     category_threshold=0.3,
... )
>>> df.info(memory_usage="deep")