liwca.count

liwca.count(texts: Iterable[str] | Series, dx: DataFrame, *, tokenizer: Callable[[str], list[str]] | None = None, as_percentage: bool = True, precision: int | None = None, return_words: Literal[False] = False) → DataFrame
liwca.count(texts: Iterable[str] | Series, dx: DataFrame, *, tokenizer: Callable[[str], list[str]] | None = None, as_percentage: bool = True, precision: int | None = None, return_words: Literal[True]) → tuple[DataFrame, DataFrame]

Count LIWC dictionary categories across documents (pure Python).

Parameters:
  • texts (Iterable of str or Series) – The documents to analyse. Each element is a single document string.

  • dx (DataFrame) – A LIWC dictionary DataFrame as returned by liwca.read_dx. Index contains dictionary terms (may include * wildcards), columns are category names, values are binary (0/1).

  • tokenizer (Callable, optional) – A function str -> list[str] used to split each document into lowercase tokens. Defaults to a regex tokenizer that preserves contractions ("don't" → ["don't"]).

  • as_percentage (bool, optional) – If True (default), return category values as a percentage of total word count per document (matching LIWC’s default output). If False, return raw category counts.

  • precision (int, optional) – If set, round category value columns to this many decimal places. Only applies when as_percentage=True. The "WC" column is never rounded.

  • return_words (bool, optional) – If True, return a tuple (categories, words) where words is a documents × tokens DataFrame holding per-word counts (or percentages) for every dictionary token that appeared in the corpus. Wildcard entries are expanded to the actual corpus tokens that matched (e.g., recall* → recalled, recalling, …). The same as_percentage and precision settings apply to both DataFrames. Default False.
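A custom tokenizer only needs to be a plain function mapping a string to a list of lowercase tokens. A minimal sketch of a contraction-preserving tokenizer (the default pattern is internal to liwca, so the regex below is illustrative, not the actual default):

```python
import re

# Runs of letters, optionally joined by one apostrophe, so that
# contractions like "don't" survive as single tokens. Illustrative
# only; liwca's built-in default tokenizer may differ in detail.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def simple_tokenizer(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())

simple_tokenizer("Don't panic, it's fine.")
# ["don't", 'panic', "it's", 'fine']
```

Pass it via the keyword argument: liwca.count(texts, dx, tokenizer=simple_tokenizer).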

Returns:

When return_words=False (default): a documents × categories DataFrame. Index matches the input order (or the Series index if a Series was passed). Columns are the dictionary category names. An additional "WC" column contains the total word count for each document.

When return_words=True: a tuple (categories, words) where categories is the DataFrame described above and words is a documents × tokens DataFrame with one column per matched dictionary token plus a "WC" column.

Return type:

DataFrame or tuple of DataFrame
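The counting semantics can be sketched in plain Python: tokenize, match each token against the dictionary terms (a trailing * acts as a wildcard), sum category hits, and normalise by word count. This is an illustrative re-implementation of the behaviour described above, with toy terms and category names, not liwca's actual code:

```python
import fnmatch

def count_doc(tokens: list[str],
              dx_terms: dict[str, set[str]]) -> tuple[int, dict[str, float]]:
    """dx_terms maps a dictionary term (possibly ending in *) to its categories.

    Returns (word count, {category: percentage of words}).
    """
    wc = len(tokens)
    counts: dict[str, int] = {}
    for tok in tokens:
        for term, cats in dx_terms.items():
            # A trailing * makes the term a prefix wildcard.
            hit = fnmatch.fnmatch(tok, term) if term.endswith("*") else tok == term
            if hit:
                for cat in cats:
                    counts[cat] = counts.get(cat, 0) + 1
    # as_percentage=True semantics: express counts per 100 words.
    return wc, {cat: 100 * n / wc for cat, n in counts.items()}

count_doc(["she", "recalled", "the", "threat"],
          {"recall*": {"cogproc"}, "threat": {"threat"}})
# (4, {'cogproc': 25.0, 'threat': 25.0})
```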

Examples

>>> import liwca
>>> dx = liwca.fetch_threat()
>>> texts = [
...     "This is a grave threat to our safety.",
...     "All is calm today.",
... ]
>>> liwca.count(texts, dx)
Category  WC  threat
0          8    12.5
1          4     0.0

Get per-word contributions:

>>> cats, words = liwca.count(texts, dx, return_words=True)
>>> words.columns.tolist()
['WC', 'grave', 'threat']
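When as_percentage=True and the dictionary weights are binary 0/1 (as described above), each category value is 100 * matches / WC, so raw match counts can be recovered from the percentage output. A quick sketch of that inversion, using hypothetical row values:

```python
# A percentage-mode output row, with hypothetical values.
row = {"WC": 8, "threat": 12.5}

# Invert value = 100 * matches / WC to get the raw match count,
# assuming binary 0/1 category weights.
counts = {cat: round(val * row["WC"] / 100)
          for cat, val in row.items() if cat != "WC"}
# counts == {"threat": 1}
```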