optimize_dataframe.optimize_dataframe

optimize_dataframe.optimize_dataframe(df)

This function creates an optimized copy of a pandas DataFrame by applying a series of memory-reduction strategies while preserving the original data.

It serves as the main wrapper function for the package and coordinates multiple optimization steps:

  • Numeric columns are downcast to smaller, safe data types where possible.
  • Low-cardinality string columns are converted to the pandas ‘category’ dtype.
  • Special columns (e.g., IDs, coordinates, high-cardinality text) are identified and reported but not modified.

The original DataFrame is never mutated; all operations are performed on a copy.
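The copy-then-convert flow described above can be approximated with plain pandas. The sketch below is illustrative only and is not the package's implementation: the function name `sketch_optimize` and the `max_unique_ratio` threshold are hypothetical, and the special-column detection step is omitted.

```python
import pandas as pd


def sketch_optimize(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Illustrative only; name and threshold are assumptions, not the package API."""
    out = df.copy()  # work on a copy so the caller's DataFrame is never mutated
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Pick the smallest integer dtype that can hold every value.
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # Downcast to float32 when pandas considers the values close enough.
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_string_dtype(s):
            # Convert to category only when few distinct values repeat often.
            if len(s) > 0 and s.nunique() / len(s) <= max_unique_ratio:
                out[col] = s.astype("category")
    return out


example = pd.DataFrame({
    "region": ["US", "CA", "US", "US"],
    "quantity": [1, 2, 3, 4],
    "price": [0.5, 1.5, 2.25, 3.0],
})
optimized = sketch_optimize(example)
```

The cardinality rule here (distinct values divided by row count) is one common heuristic; the package may use a different criterion.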

Parameters

Name Type Description Default
df pd.DataFrame The input pandas DataFrame to be optimized. required

Returns

Name Type Description
pd.DataFrame A new DataFrame with optimized data types and reduced memory usage.

Notes

  • This function acts as a wrapper that calls lower-level helper functions such as numeric and categorical optimizers.
  • The optimization process is designed to be transparent and reproducible; a summary of changes and memory savings may be printed.
  • This function prioritizes safety over aggressiveness and avoids modifying columns that could lead to unexpected behavior.
  • The optimized DataFrame should behave the same as the original in most downstream analysis. The known differences are minor float precision changes when floats are downcast (for example, float64 to float32) and the dtype-specific behavior of columns converted to ‘category’.
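
The memory savings and float precision effects mentioned in these notes can be checked independently with standard pandas calls. This is a minimal sketch; `memory_saved` is a hypothetical helper and is not part of the package.

```python
import pandas as pd


def memory_saved(before: pd.DataFrame, after: pd.DataFrame) -> tuple[int, float]:
    """Return (bytes saved, percent saved). Hypothetical helper for illustration."""
    bytes_before = int(before.memory_usage(deep=True).sum())
    bytes_after = int(after.memory_usage(deep=True).sum())
    percent = 100.0 * (bytes_before - bytes_after) / bytes_before if bytes_before else 0.0
    return bytes_before - bytes_after, percent


# Memory effect of an integer downcast on a 1000-row column.
wide = pd.DataFrame({"quantity": pd.Series(range(1000), dtype="int64") % 100})
narrow = wide.astype({"quantity": "int8"})
saved_bytes, saved_percent = memory_saved(wide, narrow)

# Precision effect of a float64 -> float32 downcast: 9.99 has no exact
# float32 representation, so a small rounding error appears.
prices = pd.Series([10.5, 12.0, 9.99, 11.25])
prices32 = prices.astype("float32")
max_abs_error = float((prices - prices32.astype("float64")).abs().max())
```

Comparing `max_abs_error` against the tolerance of the downstream analysis is a quick way to decide whether a float downcast is acceptable for a given column.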

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "status": ["pending", "shipped", "pending"],
...     "quantity": [1, 2, 3]
... })
>>> optimized_df = optimize_dataframe(df)
>>> optimized_df.dtypes["status"]
CategoricalDtype(categories=['pending', 'shipped'], ordered=False)
>>> optimized_df.dtypes["quantity"]
dtype('int8')
>>> df = pd.DataFrame({
...     "region": ["US", "CA", "US", "US"],
...     "price": [10.5, 12.0, 9.99, 11.25]
... })
>>> optimized_df = optimize_dataframe(df)
>>> optimized_df.dtypes["region"]
CategoricalDtype(categories=['CA', 'US'], ordered=False)
>>> optimized_df.dtypes["price"]
dtype('float32')