optimize_dataframe.optimize_dataframe
optimize_dataframe.optimize_dataframe(df)This function creates an optimized copy of a pandas DataFrame by applying a series of memory-reduction strategies while preserving the original data.
It serves as the main wrapper function for the package and coordinates multiple optimization steps: - Numeric columns are downcast to smaller, safe data types where possible - Low-cardinality string columns are converted to pandas ‘category’ dtype - Special columns (e.g., IDs, coordinates, high-cardinality text) are identified and reported but not modified
The original DataFrame is never mutated; all operations are performed on a copy.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The input pandas DataFrame to be optimized. | required |
Returns
| Name | Type | Description |
|---|---|---|
| pd.DataFrame | A new DataFrame with optimized data types and reduced memory usage. |
Notes
- This function acts as a wrapper that calls lower-level helper functions such as numeric and categorical optimizers.
- The optimization process is designed to be transparent and reproducible; a summary of changes and memory savings may be printed.
- This function prioritizes safety over aggressiveness and avoids modifying columns that could lead to unexpected behavior.
- The optimized DataFrame should behave identically to the original in downstream analysis, aside from potential minor float precision changes if numeric downcasting is applied.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
... "status": ["pending", "shipped", "pending"],
... "quantity": [1, 2, 3]
... })
>>> optimized_df = optimize_dataframe(df)
>>> optimized_df.dtypes["status"]
CategoricalDtype(categories=['pending', 'shipped'], ordered=False)
>>> optimized_df.dtypes["quantity"]
dtype('int8')>>> df = pd.DataFrame({
... "region": ["US", "CA", "US", "US"],
... "price": [10.5, 12.0, 9.99, 11.25]
... })
>>> optimized_df = optimize_dataframe(df)
>>> optimized_df.dtypes["region"]
CategoricalDtype(categories=['CA', 'US'], ordered=False)
>>> optimized_df.dtypes["price"]
dtype('float32')