bulwark.checks module

Each function in this module should:

  • take a pd.DataFrame as its first argument, with optional additional arguments,

  • make an assertion about the pd.DataFrame, and

  • return the original, unaltered pd.DataFrame
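As a minimal sketch of this contract, the check below takes a pd.DataFrame first, asserts something about it, and returns it unchanged. The name has_min_rows and its min_rows parameter are hypothetical and not part of bulwark.

```python
import pandas as pd


def has_min_rows(df, min_rows=1):
    """Hypothetical check: assert that df has at least min_rows rows."""
    # 1. The pd.DataFrame is the first argument; min_rows is an optional extra.
    # 2. Make an assertion about the pd.DataFrame.
    assert len(df) >= min_rows, f"expected at least {min_rows} rows, got {len(df)}"
    # 3. Return the original, unaltered pd.DataFrame.
    return df


df = pd.DataFrame({"a": [1, 2, 3]})
result = has_min_rows(df, min_rows=2)        # passes and hands df back
piped = df.pipe(has_min_rows, min_rows=2)    # same check inside a method chain
```

Returning the original df is what allows a check written this way to sit inside a DataFrame.pipe chain.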

bulwark.checks.custom_check(df, check_func, *args, **kwargs)[source]

Asserts that check_func(df, *args, **kwargs) is true.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • check_func (function) – A function taking df, *args, and **kwargs. It should raise an AssertionError if the check does not pass.

Returns

Original df.
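The calling convention described above can be sketched as follows. Both custom_check_sketch and all_positive are hypothetical names used for illustration; this is a simplified stand-in under those assumptions, not bulwark's own code.

```python
import pandas as pd


def custom_check_sketch(df, check_func, *args, **kwargs):
    """Simplified stand-in for custom_check (an assumption, not bulwark's code)."""
    # check_func is expected to raise AssertionError itself when the check fails.
    check_func(df, *args, **kwargs)
    # The original df is returned so the call can be chained.
    return df


def all_positive(df, column):
    """Hypothetical user-defined check function."""
    assert (df[column] > 0).all(), f"{column} has non-positive values"


df = pd.DataFrame({"a": [1, 2, 3]})
result = custom_check_sketch(df, all_positive, "a")  # extra args are forwarded
```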

bulwark.checks.has_columns(df, columns, exact_cols=False, exact_order=False)[source]

Asserts that df has the given columns.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list or tuple) – Columns that are expected to be in df.

  • exact_cols (bool) – Whether or not columns need to be the only columns in df.

  • exact_order (bool) – Whether or not columns need to be in the same order as the columns in df.

Returns

Original df.
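A rough plain-pandas illustration of the columns and exact_cols parameters is given below. The function has_columns_sketch is hypothetical, it leaves out exact_order, and it is not bulwark's implementation.

```python
import pandas as pd


def has_columns_sketch(df, columns, exact_cols=False):
    """Illustrative stand-in for has_columns, without exact_order."""
    # Every expected column must be present in df.
    missing = [col for col in columns if col not in df.columns]
    assert not missing, f"missing columns: {missing}"
    if exact_cols:
        # With exact_cols, df may not contain any other columns.
        extra = [col for col in df.columns if col not in columns]
        assert not extra, f"unexpected columns: {extra}"
    return df


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
result = has_columns_sketch(df, ["a", "b"])  # passes: both columns are present
```

With exact_cols=True the same call would fail in this sketch, because df also contains column c.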

bulwark.checks.has_dtypes(df, items)[source]

Asserts that df’s columns have the expected dtypes.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • items (dict) – Mapping of columns to dtype.

Returns

Original df.
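The shape of the items mapping can be illustrated with plain pandas. The function has_dtypes_sketch is a hypothetical stand-in, and the comparison it uses is an assumption about how such a check could be written, not bulwark's implementation.

```python
import pandas as pd


def has_dtypes_sketch(df, items):
    """Illustrative stand-in for has_dtypes."""
    # Collect every column whose actual dtype differs from the expected one.
    mismatched = {
        col: str(df[col].dtype)
        for col, expected in items.items()
        if df[col].dtype != expected
    }
    assert not mismatched, f"unexpected dtypes: {mismatched}"
    return df


# dtypes are set explicitly so the example does not depend on platform defaults.
df = pd.DataFrame({
    "a": pd.Series([1, 2], dtype="int64"),
    "b": pd.Series([1.5, 2.5], dtype="float64"),
})
result = has_dtypes_sketch(df, {"a": "int64", "b": "float64"})
```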

bulwark.checks.has_no_infs(df, columns=None)[source]

Asserts that there are no np.infs in df.

This is a convenience wrapper for has_no_x.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list) – A subset of columns to check for np.infs.

Returns

Original df.

bulwark.checks.has_no_nans(df, columns=None)[source]

Asserts that there are no np.nans in df.

This is a convenience wrapper for has_no_x.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list) – A subset of columns to check for np.nans.

Returns

Original df.

bulwark.checks.has_no_neg_infs(df, columns=None)[source]

Asserts that there are no -np.infs in df.

This is a convenience wrapper for has_no_x.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list) – A subset of columns to check for -np.infs.

Returns

Original df.
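Positive and negative infinity are distinct values, which is why has_no_infs and has_no_neg_infs exist side by side. The plain-pandas sketch below only illustrates that distinction; it does not call bulwark.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.inf], "b": [2.0, -np.inf]})

# Per-column flags: does the column contain +inf / -inf?
has_pos_inf = df.isin([np.inf]).any()
has_neg_inf = df.isin([-np.inf]).any()

# Column "a" holds only +inf and column "b" holds only -inf,
# so each flag is set for exactly one column.
```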

bulwark.checks.has_no_nones(df, columns=None)[source]

Asserts that there are no Nones in df.

This is a convenience wrapper for has_no_x.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list) – A subset of columns to check for Nones.

Returns

Original df.

bulwark.checks.has_no_x(df, values=None, columns=None)[source]

Asserts that there are no user-specified values in df’s columns.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • values (list) – A list of values to check for in the pd.DataFrame.

  • columns (list) – A subset of columns to check for values.

Returns

Original df.
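How values and columns interact can be sketched in plain pandas. The function has_no_x_sketch is hypothetical, and treating columns=None as "all columns" is an assumption made for this illustration.

```python
import pandas as pd


def has_no_x_sketch(df, values=None, columns=None):
    """Illustrative stand-in for has_no_x."""
    values = [] if values is None else values
    # Assumption: columns=None means every column is checked.
    columns = df.columns if columns is None else columns
    bad_columns = [col for col in columns if df[col].isin(values).any()]
    assert not bad_columns, f"columns containing {values}: {bad_columns}"
    return df


df = pd.DataFrame({"a": [1, 2, -1], "b": [4, 5, 6]})
result = has_no_x_sketch(df, values=[-1], columns=["b"])  # passes: b has no -1
```

Checking every column instead would fail here, since column a contains -1.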

bulwark.checks.has_set_within_vals(df, items)[source]

Asserts that all given values are found in columns’ values.

In other words, the given values in the items dict should all be a subset of the values found in the associated column in df.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • items (dict) – Mapping of columns to values expected to be found within them.

Returns

Original df.

Examples

The following check will pass, since df['a'] contains each of 1 and 2:

>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_set_within_vals(df, items={"a": [1, 2]})
   a  b
0  1  a
1  2  b
2  3  c

The following check will fail, since df['b'] doesn’t contain each of "a" and "d":

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_set_within_vals(df, items={"a": [1, 2], "b": ["a", "d"]})
Traceback (most recent call last):
    ...
AssertionError: The following column: value pairs are missing: {'b': ['d']}

bulwark.checks.has_unique_index(df)[source]

Asserts that df’s index is unique.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

Returns

Original df.

bulwark.checks.has_vals_within_n_std(df, n=3)[source]

Asserts that every value is within n standard deviations of its column’s mean.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • n (int) – Number of standard deviations from the mean.

Returns

Original df.
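The condition can be illustrated with plain pandas. The function within_n_std_sketch is hypothetical; the use of the sample standard deviation (pandas' default for DataFrame.std) and an inclusive comparison are assumptions, not confirmed details of bulwark.

```python
import pandas as pd


def within_n_std_sketch(df, n=3):
    """Illustrative stand-in for has_vals_within_n_std on numeric columns."""
    means = df.mean()
    stds = df.std()  # pandas default: sample standard deviation (ddof=1)
    # Compare each value's distance from its column mean to n column stds.
    within = (df - means).abs() <= n * stds
    assert within.all().all(), f"values more than {n} std from the column mean"
    return df


df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
result = within_n_std_sketch(df, n=3)  # passes: the largest deviation is 2
```

With n=1 the same data fails in this sketch, because the column's standard deviation is about 1.58 and the values 1 and 5 lie 2 away from the mean.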

bulwark.checks.has_vals_within_range(df, items=None)[source]

Asserts that the values in each given column of df fall within that column’s (low, high) range.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • items (dict) – Mapping of columns (col) to a (low, high) tuple (v) that df[col] is expected to be between.

Returns

Original df.

Examples

The following check will pass, since df['a'] contains values between 0 and 3:

>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_vals_within_range(df, items={'a': (0, 3)})
   a  b
0  1  a
1  2  b
2  3  c

The following check will fail, since df['b'] contains 'c', which is outside of the specified range:

>>> ck.has_vals_within_range(df, items={'a': (0, 3), 'b': ('a', 'b')})
Traceback (most recent call last):
    ...
AssertionError: ('Outside range', 0    False
1    False
2     True
Name: b, dtype: bool)

bulwark.checks.has_vals_within_set(df, items=None)[source]

Asserts that the values in each given column of df are a subset of the values given for that column in items.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • items (dict) – Mapping of columns (col) to array-like of values (v) that df[col] is expected to be a subset of.

Returns

Original df.
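The subset relationship can be sketched in plain pandas. The function within_set_sketch is a hypothetical stand-in, not bulwark's implementation.

```python
import pandas as pd


def within_set_sketch(df, items):
    """Illustrative stand-in for has_vals_within_set."""
    for col, allowed in items.items():
        # Values in df[col] that are not among the allowed values.
        outside = df.loc[~df[col].isin(allowed), col].tolist()
        assert not outside, f"{col} has values outside the set: {outside}"
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})
result = within_set_sketch(df, {"a": [1, 2, 3, 4], "b": ["x", "y"]})  # passes
```

This is the reverse direction of has_set_within_vals: here the column's values must all appear in the given set, while there the given values must all appear in the column.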

bulwark.checks.is_monotonic(df, items=None, increasing=None, strict=False)[source]

Asserts that the df is monotonic.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • items (dict) – Mapping of columns to conditions (increasing, strict). E.g. {'col_a': (None, False), 'col_b': (None, False)}

  • increasing (bool, None) – None checks for either increasing or decreasing monotonicity.

  • strict (bool) – Whether the comparison should be strict, meaning two values in a row being equal should fail.

Returns

Original df.

Examples

The following check will pass, since each column matches its monotonicity requirements:

>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({"incr_strict": [1, 2, 3, 4],
...                    "incr_not_strict": [1, 2, 2, 3],
...                    "decr_strict": [4, 3, 2, 1],
...                    "decr_not_strict": [3, 2, 2, 1]})
>>> items = {
...     "incr_strict": (True, True),
...     "incr_not_strict": (True, False),
...     "decr_strict": (False, True),
...     "decr_not_strict": (False, False)
... }
>>> ck.is_monotonic(df, items=items)
   incr_strict  incr_not_strict  decr_strict  decr_not_strict
0            1                1            4                3
1            2                2            3                2
2            3                2            2                2
3            4                3            1                1

All of the same cases will also pass if increasing=None, since only one of increasing or decreasing monotonicity is then required:

>>> ck.is_monotonic(df, increasing=None, strict=False)
   incr_strict  incr_not_strict  decr_strict  decr_not_strict
0            1                1            4                3
1            2                2            3                2
2            3                2            2                2
3            4                3            1                1

The following check will fail, displaying a list of which (row, column)s caused the issue:

>>> df2 = pd.DataFrame({'not_monotonic': [1, 2, 3, 2]})
>>> ck.is_monotonic(df2, increasing=True, strict=False)
Traceback (most recent call last):
    ...
AssertionError: [(3, 'not_monotonic')]

bulwark.checks.is_same_as(df, df_to_compare, **kwargs)[source]

Asserts that two pd.DataFrames are equal.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • df_to_compare (pd.DataFrame) – A second pd.DataFrame.

  • **kwargs (dict) – Keyword arguments passed through to pandas’ assert_frame_equal.

Returns

Original df.
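Since the keyword arguments are passed through to pandas’ assert_frame_equal, the example below calls that pandas function directly to show the kind of option they can carry. It does not call bulwark; check_dtype is a documented assert_frame_equal parameter.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

ints = pd.DataFrame({"a": [1, 2]})
floats = pd.DataFrame({"a": [1.0, 2.0]})

# Equal values but different dtypes: the default comparison fails.
try:
    assert_frame_equal(ints, floats)
    strict_passed = True
except AssertionError:
    strict_passed = False

# check_dtype=False relaxes the comparison, so this call does not raise.
assert_frame_equal(ints, floats, check_dtype=False)
```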

bulwark.checks.is_shape(df, shape)[source]

Asserts that df is of a known row x column shape.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • shape (tuple) – Shape of df as (n_rows, n_columns). Use None or -1 if you don’t care about a specific dimension.

Returns

Original df.
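The wildcard behaviour of None and -1 can be sketched as follows. The function is_shape_sketch is hypothetical and only mirrors the description above; it is not bulwark's implementation.

```python
import pandas as pd


def is_shape_sketch(df, shape):
    """Illustrative stand-in for is_shape."""
    for actual, expected in zip(df.shape, shape):
        # None or -1 means "any size is fine for this dimension".
        if expected is None or expected == -1:
            continue
        assert actual == expected, f"expected shape {shape}, got {df.shape}"
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
exact = is_shape_sketch(df, (3, 2))        # both dimensions checked
any_rows = is_shape_sketch(df, (None, 2))  # only the column count is checked
any_rows_too = is_shape_sketch(df, (-1, 2))
```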

bulwark.checks.multi_check(df, checks, warn=False)[source]

Asserts that all checks pass.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • checks (dict) – Mapping of check functions to parameters for those check functions.

  • warn (bool) – Indicates whether an error should be raised or only a warning notification should be displayed. Default is to error.

Returns

Original df.
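One way to read the checks mapping is sketched below. Everything here is an assumption made for illustration: multi_check_sketch, has_min_rows, and has_column are hypothetical names, the parameters are taken to be a dict of keyword arguments, and the warning is issued through Python's warnings module. None of this is confirmed as bulwark's behaviour.

```python
import warnings

import pandas as pd


def multi_check_sketch(df, checks, warn=False):
    """Illustrative stand-in for multi_check."""
    errors = []
    for check, params in checks.items():
        try:
            check(df, **params)  # assumption: params is a dict of kwargs
        except AssertionError as err:
            errors.append(f"{check.__name__}: {err}")
    if errors:
        message = "; ".join(errors)
        if warn:
            warnings.warn(message)  # report the failures but keep going
        else:
            raise AssertionError(message)
    return df


def has_min_rows(df, min_rows):
    """Hypothetical check function."""
    assert len(df) >= min_rows, f"fewer than {min_rows} rows"
    return df


def has_column(df, column):
    """Hypothetical check function."""
    assert column in df.columns, f"missing column {column}"
    return df


df = pd.DataFrame({"a": [1, 2, 3]})
checks = {has_min_rows: {"min_rows": 2}, has_column: {"column": "a"}}
result = multi_check_sketch(df, checks)  # both checks pass
```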

bulwark.checks.none_missing(df, columns=None)[source]

Deprecated: Replaced with has_no_nans

bulwark.checks.one_to_many(df, unitcol, manycol)[source]

Asserts that a many-to-one relationship is preserved between two columns.

For example, a retail store will have distinct departments, each with several employees. If each employee may only work in a single department, then the relationship of the department to the employees is one to many.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • unitcol (str) – The column that encapsulates the groups in manycol.

  • manycol (str) – The column that must remain unique in the distinct pairs between manycol and unitcol.

Returns

Original df.
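The department and employee example can be written out in plain pandas. The function one_to_many_sketch is a hypothetical stand-in that encodes the rule "each manycol value belongs to a single unitcol value"; it is not bulwark's implementation.

```python
import pandas as pd


def one_to_many_sketch(df, unitcol, manycol):
    """Illustrative stand-in for one_to_many."""
    # Count how many distinct unitcol values each manycol value is paired with.
    units_per_many = df.groupby(manycol)[unitcol].nunique()
    offenders = units_per_many[units_per_many > 1].index.tolist()
    assert not offenders, f"{manycol} values with several {unitcol}s: {offenders}"
    return df


df = pd.DataFrame({
    "department": ["toys", "toys", "garden"],
    "employee": ["ann", "bob", "cat"],
})
result = one_to_many_sketch(df, unitcol="department", manycol="employee")
```

If an employee appeared under two departments, the sketch would raise an AssertionError naming that employee.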

bulwark.checks.unique(df, columns=None)[source]

Asserts that columns in df only have unique values.

Parameters
  • df (pd.DataFrame) – Any pd.DataFrame.

  • columns (list) – A subset of columns to check for uniqueness of row values.

Returns

Original df.
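A per-column uniqueness check can be sketched with pandas' Series.is_unique. The function unique_sketch is hypothetical, and treating columns=None as "all columns" is an assumption made for this illustration.

```python
import pandas as pd


def unique_sketch(df, columns=None):
    """Illustrative stand-in for unique."""
    # Assumption: columns=None means every column is checked.
    columns = df.columns if columns is None else columns
    duplicated = [col for col in columns if not df[col].is_unique]
    assert not duplicated, f"columns with repeated values: {duplicated}"
    return df


df = pd.DataFrame({"id": [1, 2, 3], "group": ["x", "x", "y"]})
result = unique_sketch(df, columns=["id"])  # passes: id has no repeats
```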

bulwark.checks.unique_index(df)[source]

Deprecated: Replaced with has_unique_index

bulwark.checks.within_n_std(df, n=3)[source]

Deprecated: Replaced with has_vals_within_n_std

bulwark.checks.within_range(df, items=None)[source]

Deprecated: Replaced with has_vals_within_range

bulwark.checks.within_set(df, items=None)[source]

Deprecated: Replaced with has_vals_within_set