Text Utilities

Welcome to textutils

textutils is a lightweight Python package that provides a small collection of utility functions for basic text processing and manipulation. The package is designed to be simple, beginner-friendly, and easy to integrate into data analysis or general Python workflows where quick text operations are needed without the overhead of large NLP libraries.

Contributors

  • Mehmet Imga
  • Shi Fan Jin
  • Aidan Hew
  • Sidharth Malik

Installation

$ pip install -i https://test.pypi.org/simple/ textutils-dsci524

Package Overview

This package includes the following functions:

  • word_count(text: str) -> int Counts the number of words in a given string. The function handles empty strings and raises appropriate errors for invalid inputs.

  • remove_punctuation(text: str) -> str Removes punctuation characters from a string and returns the cleaned text while preserving spacing and alphanumeric characters.

  • most_common_word(text: str) -> str Identifies and returns the most frequently occurring word in a string. The function ignores punctuation and is case-insensitive by default, with optional case-sensitive matching.

  • reverse_text(text: str) -> str Reverses the input string by words (the default) or by characters and returns the result. The function validates input types and handles edge cases such as empty strings.

Quick Usage Examples

from textutils.textutils import (
    word_count,
    remove_punctuation,
    most_common_word,
    reverse_text,
)

word_count("Hello world!")  # returns 2
remove_punctuation("Hello, world!")  # returns "Hello world"
most_common_word("apple banana apple orange")  # returns "apple"
reverse_text("textutils", mode="char")  # returns "slitutxet"

Detailed Usage Examples

word_count

Count the number of words in a string. Handles extra spaces and empty input gracefully.

from textutils.textutils import word_count

# Example 1: Simple sentence
text = "Data science is fun"
print(word_count(text))  # Output: 4

# Example 2: Extra spaces between words
messy = "  This   is   a   test   "
print(word_count(messy))  # Output: 4

# Example 3: Empty string
print(word_count(""))  # Output: 0

# Example 4: String with only whitespace
print(word_count("     "))  # Output: 0

# Example 5: Real-world use case – counting words in user input
comment = "I really enjoyed using this package!"
num_words = word_count(comment)
print(num_words)  # Output: 6
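
The outputs above are consistent with plain whitespace splitting. The following is a minimal sketch of that idea, not the package's actual implementation: the name sketch_word_count is hypothetical, and raising TypeError for non-string input is an assumption, since this README only promises "appropriate errors".

```python
def sketch_word_count(text):
    """Hypothetical stand-in for word_count (not the package's code)."""
    if not isinstance(text, str):
        # Assumption: the exact exception type is not stated in the README.
        raise TypeError("text must be a string")
    # str.split() with no arguments splits on runs of whitespace and
    # ignores leading/trailing whitespace, so "" and "     " both give [].
    return len(text.split())


print(sketch_word_count("  This   is   a   test   "))  # 4
print(sketch_word_count(""))  # 0
```

Under this approach a token such as "package!" counts as one word, because punctuation attached to a word does not create a new token.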

remove_punctuation

Remove all punctuation from text while preserving letters, numbers, spaces, and emojis.

from textutils.textutils import remove_punctuation

# Example 1: Basic sentence
text = "Hello, World! How are you?"
result = remove_punctuation(text)
print(result)  # Output: "Hello World How are you"

# Example 2: Text with multiple punctuation marks
messy_text = "Wait... What?! That's amazing!!!"
clean_text = remove_punctuation(messy_text)
print(clean_text)  # Output: "Wait What Thats amazing"

# Example 3: Preserves numbers and emojis
mixed = "Sale: 50% off! Ends soon! 🎉"
print(remove_punctuation(mixed))  # Output: "Sale 50 off Ends soon 🎉"

# Example 4: Real-world use case - cleaning text data for analysis
reviews = [
    "Great product! 5/5 stars!!!",
    "Terrible... would NOT recommend.",
    "It's okay, nothing special."
]
clean_reviews = [remove_punctuation(r) for r in reviews]
print(clean_reviews)
# Output: ['Great product 55 stars', 'Terrible would NOT recommend', 'Its okay nothing special']
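
For comparison, similar behavior can be approximated with the standard library alone. This is a sketch under assumptions, not the package's implementation: sketch_remove_punctuation is a hypothetical name, and it treats only the ASCII characters in string.punctuation as punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def sketch_remove_punctuation(text):
    """Hypothetical stand-in for remove_punctuation (not the package's code)."""
    if not isinstance(text, str):
        # Assumption: the exception type is not stated in the README.
        raise TypeError("text must be a string")
    # Characters outside string.punctuation (letters, digits, spaces,
    # emojis) pass through unchanged.
    return text.translate(_PUNCT_TABLE)


print(sketch_remove_punctuation("Sale: 50% off! Ends soon! 🎉"))
# Sale 50 off Ends soon 🎉
```

Because string.punctuation is ASCII-only, this sketch leaves characters such as curly quotes or dashes in place; whether the package removes those is not stated here.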

most_common_word

Identify the most common word in a given text.

from textutils.textutils import most_common_word

# Example 1: Basic sentence (case-insensitive by default)
most_common_word("Hello. Hello. hello. How's your day?")  # Output: 'hello'

# Example 2: Case-sensitive matching
most_common_word("Hello. Hello. hello. How's your day?", True)  # Output: 'Hello'

# Example 3: Tie; the word that appears first is returned
most_common_word("apple banana apple banana")  # Output: 'apple'

# Example 4: Single word
most_common_word("hello")  # Output: 'hello'
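
The case handling and tie-breaking shown above can be sketched with re and collections.Counter. This is an illustration only, not the package's implementation: the function name, the case_sensitive parameter name, and the ValueError for input without words are all assumptions.

```python
import re
from collections import Counter


def sketch_most_common_word(text, case_sensitive=False):
    """Hypothetical stand-in for most_common_word (not the package's code)."""
    if not case_sensitive:
        text = text.lower()
    # Drop apostrophes first so "How's" stays one token ("Hows"),
    # then pull out runs of word characters, which skips punctuation.
    words = re.findall(r"\w+", text.replace("'", ""))
    if not words:
        # Assumption: behavior for input without words is not documented.
        raise ValueError("text contains no words")
    counts = Counter(words)
    # Counter keeps insertion order and max() returns the first maximal
    # item it sees, so ties go to the word that appeared first.
    return max(counts, key=counts.get)


print(sketch_most_common_word("apple banana apple banana"))  # apple
```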

reverse_text

Reverse text by words (the default) or by characters, selected with the mode argument.

from textutils.textutils import reverse_text

# Example 1: Basic sentence (default word mode)
text = "Hello World"
result = reverse_text(text)
print(result)  # Output: "World Hello"

# Example 2: Explicit word-based reversal
sentence = "Data science is fun"
reversed_words = reverse_text(sentence, mode="word")
print(reversed_words)  # Output: "fun is science Data"

# Example 3: Character-based reversal
char_text = "Hello World"
reversed_chars = reverse_text(char_text, mode="char")
print(reversed_chars)  # Output: "dlroW olleH"

# Example 4: Extra spaces between words are collapsed in word mode
messy_spacing = "Hello    World   again"
print(reverse_text(messy_spacing))
# Output: "again World Hello"

# Example 5: Real-world use case – reversing text for simple transformations
messages = [
    "Machine learning is powerful",
    "Python makes data analysis easier",
    "Reproducibility matters"
]

reversed_messages = [reverse_text(m, mode="word") for m in messages]
print(reversed_messages)
# Output:
# ['powerful is learning Machine',
#  'easier analysis data makes Python',
#  'matters Reproducibility']
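
The two modes can be sketched with split/join and slicing. This is a hedged illustration, not the package's implementation: sketch_reverse_text is a hypothetical name, and the exception types for bad input or an unknown mode are assumptions.

```python
def sketch_reverse_text(text, mode="word"):
    """Hypothetical stand-in for reverse_text (not the package's code)."""
    if not isinstance(text, str):
        # Assumption: the exception type is not stated in the README.
        raise TypeError("text must be a string")
    if mode == "word":
        # split() drops extra whitespace, so words are rejoined
        # with single spaces in reverse order.
        return " ".join(reversed(text.split()))
    if mode == "char":
        # A slice with step -1 reverses the string character by character.
        return text[::-1]
    # Assumption: how the package reacts to other mode values is unknown.
    raise ValueError(f"unknown mode: {mode!r}")


print(sketch_reverse_text("Hello    World   again"))  # again World Hello
print(sketch_reverse_text("Hello World", mode="char"))  # dlroW olleH
```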

Development Setup

To set up the development environment locally using conda:

  1. Clone the repository:
     git clone https://github.com/UBC-MDS/DSCI_524_group34_textutils.git
     cd DSCI_524_group34_textutils
  2. Create and activate the conda environment:
     conda env create -f environment.yml
     conda activate textutils
  3. Install the package in editable mode:
     pip install -e .

Running Tests

To run the full test suite locally:

  pytest

Documentation

Package documentation is generated using quartodoc and deployed automatically to GitHub Pages via GitHub Actions.

To build the documentation locally:

  quarto render docs

The deployed documentation can be found at: https://ubc-mds.github.io/DSCI_524_group34_textutils/

Relationship to the Python Ecosystem

Python has several powerful text-processing libraries such as:

  • re for regular expressions

  • nltk and textblob for advanced natural language processing

While these libraries provide extensive functionality, they can be unnecessarily complex for simple text manipulation tasks. textutils is intended to complement existing tools by offering a minimal, lightweight alternative for common text operations that do not require full NLP pipelines.

Continuous Integration and Deployment

This project uses GitHub Actions for:

  • Continuous integration (running tests and style checks on pushes and pull requests)

  • Continuous deployment to TestPyPI on pushes to the main branch

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.