bulwark.checks module
Each function in this module should:
take a pd.DataFrame as its first argument, with optional additional arguments,
make an assertion about the pd.DataFrame, and
return the original, unaltered pd.DataFrame.
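As a minimal sketch of this contract, the function below is a hypothetical check (has_positive_values is not part of bulwark) written with plain pandas. It takes the DataFrame first, asserts something about it, and returns it untouched, which is what lets a check sit inside a method chain.

```python
import pandas as pd


def has_positive_values(df, column):
    """Hypothetical check that follows the three-part contract above."""
    # Make an assertion about the DataFrame...
    assert (df[column] > 0).all(), f"{column} contains non-positive values"
    # ...and return the original, unaltered DataFrame.
    return df


df = pd.DataFrame({"a": [1, 2, 3]})

# Because the DataFrame comes back unchanged, the check can be placed in a
# method chain without affecting the result.
result = df.pipe(has_positive_values, "a").assign(b=lambda d: d["a"] * 2)
```

Returning the same object rather than a copy keeps the check free of side effects on the pipeline.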
-
bulwark.checks.custom_check(df, check_func, *args, **kwargs)
Asserts that check_func(df, *args, **kwargs) is true.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
check_func (function) – A function taking df, *args, and **kwargs. It should raise an AssertionError if the check does not pass.
- Returns
Original df.
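The sketch below illustrates the kind of check_func this entry describes. Both custom_check_sketch and total_is_at_most are hypothetical names: the first is a simplified stand-in written for this example, not bulwark's implementation, and it assumes only what the parameter description states (check_func raises AssertionError on failure, and the original df is returned).

```python
import pandas as pd


def custom_check_sketch(df, check_func, *args, **kwargs):
    """Hypothetical stand-in for illustration, not bulwark's code."""
    # check_func is expected to raise AssertionError on failure.
    check_func(df, *args, **kwargs)
    return df


def total_is_at_most(df, column, limit):
    """Hypothetical check_func: fails if the column sum exceeds limit."""
    assert df[column].sum() <= limit, f"sum of {column} exceeds {limit}"


df = pd.DataFrame({"a": [1, 2, 3]})

# Passing case: the sum is 6, so the original df comes back.
checked = custom_check_sketch(df, total_is_at_most, "a", limit=10)

# Failing case: the AssertionError raised by check_func propagates.
try:
    custom_check_sketch(df, total_is_at_most, "a", limit=5)
    failed = False
except AssertionError:
    failed = True
```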
-
bulwark.checks.has_columns(df, columns, exact_cols=False, exact_order=False)
Asserts that df has columns.
- Parameters
- Returns
Original df.
-
bulwark.checks.has_dtypes(df, items)
Asserts that df has the dtypes given in items.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
items (dict) – Mapping of columns to dtype.
- Returns
Original df.
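To make the shape of items concrete, here is a plain-pandas sketch of comparing a column-to-dtype mapping against a DataFrame. The helper dtype_mismatches is hypothetical and only approximates what a dtype check needs to compute; it is not taken from bulwark.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),
    "c": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
})


def dtype_mismatches(df, items):
    """Hypothetical helper: columns whose dtype differs from the mapping."""
    return {
        col: str(df[col].dtype)
        for col, expected in items.items()
        if df[col].dtype != expected
    }


# Every column matches its expected dtype, so nothing is reported.
ok = dtype_mismatches(df, {"a": "int64", "c": "float64"})

# Column "a" holds integers, so expecting floats reports its actual dtype.
bad = dtype_mismatches(df, {"a": "float64"})
```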
-
bulwark.checks.has_no_infs(df, columns=None)
Asserts that there are no np.infs in df.
This is a convenience wrapper for has_no_x.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
columns (list) – A subset of columns to check for np.infs.
- Returns
Original df.
-
bulwark.checks.has_no_nans(df, columns=None)
Asserts that there are no np.nans in df.
This is a convenience wrapper for has_no_x.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
columns (list) – A subset of columns to check for np.nans.
- Returns
Original df.
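The effect of restricting a check to a subset of columns can be sketched with plain pandas. This is an illustration of the idea only, using DataFrame.isna rather than any bulwark call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Looking at every column finds the NaN in "a"...
nan_counts_all = df.isna().sum()

# ...while restricting the check to a subset of columns ignores it.
columns = ["b"]
nan_counts_subset = df[columns].isna().sum()
```

Note that np.nan never compares equal to itself, which is why the sketch uses isna() instead of an equality test.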
-
bulwark.checks.has_no_neg_infs(df, columns=None)
Asserts that there are no -np.infs in df.
This is a convenience wrapper for has_no_x.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
columns (list) – A subset of columns to check for -np.infs.
- Returns
Original df.
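Positive and negative infinity are distinct values, so a check for one says nothing about the other. The plain-pandas sketch below shows the distinction; it is an illustration and does not call bulwark.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, -np.inf, 3.0], "b": [np.inf, 2.0, 3.0]})

# Per-column flags for each kind of infinity.
has_pos_inf = df.eq(np.inf).any()
has_neg_inf = df.eq(-np.inf).any()
```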
-
bulwark.checks.has_no_nones(df, columns=None)
Asserts that there are no Nones in df.
This is a convenience wrapper for has_no_x.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
columns (list) – A subset of columns to check for Nones.
- Returns
Original df.
-
bulwark.checks.has_no_x(df, values=None, columns=None)
Asserts that there are no user-specified values in df’s columns.
-
bulwark.checks.has_set_within_vals(df, items)
Asserts that all given values are found in columns’ values.
In other words, the given values in the items dict should all be a subset of the values found in the associated column in df.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
items (dict) – Mapping of columns to values expected to be found within them.
- Returns
Original df.
Examples
The following check will pass, since df['a'] contains each of 1 and 2:
>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_set_within_vals(df, items={"a": [1, 2]})
   a  b
0  1  a
1  2  b
2  3  c
The following check will fail, since df['b'] doesn't contain each of "a" and "d":
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_set_within_vals(df, items={"a": [1, 2], "b": ["a", "d"]})
Traceback (most recent call last):
    ...
AssertionError: The following column: value pairs are missing: {'b': ['d']}
-
bulwark.checks.has_unique_index(df)
Asserts that df’s index is unique.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
- Returns
Original df.
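For reference, the underlying condition can be inspected directly with pandas. The sketch below uses Index.is_unique and Index.duplicated to show what a non-unique index looks like; it does not call bulwark.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "x"])

# The label "x" appears twice, so the index is not unique.
is_unique = df.index.is_unique

# duplicated() flags every repeat after the first occurrence.
duplicates = df.index[df.index.duplicated()].tolist()
```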
-
bulwark.checks.has_vals_within_n_std(df, n=3)
Asserts that every value is within n standard deviations of its column’s mean.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
n (int) – Number of standard deviations from the mean.
- Returns
Original df.
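The condition described above can be worked through with plain pandas. This is a sketch under one assumption: it uses pandas' default sample standard deviation (ddof=1), which may or may not match what the library computes.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10.0] * 9 + [100.0],           # one obvious outlier
    "b": [float(i) for i in range(10)],  # evenly spread, no outliers
})
n = 2

# Distance of every value from its column mean, compared with n column
# standard deviations (sample std, ddof=1, is an assumption here).
distance = (df - df.mean()).abs()
outside = distance.gt(n * df.std(), axis="columns")

# (row, column) pairs that lie more than n standard deviations away.
offenders = [(row, col) for col in df.columns for row in df.index[outside[col]]]
```

With only a handful of rows, a single extreme value inflates the standard deviation enough that it may still fall within a large n, so small frames rarely trip this kind of check.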
-
bulwark.checks.has_vals_within_range(df, items=None)
Asserts that the values in df’s columns fall within the given ranges.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
items (dict) – Mapping of columns (col) to a (low, high) tuple (v) that df[col] is expected to be between.
- Returns
Original df.
Examples
The following check will pass, since df['a'] contains values between 0 and 3:
>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
>>> ck.has_vals_within_range(df, items={'a': (0, 3)})
   a  b
0  1  a
1  2  b
2  3  c
The following check will fail, since df['b'] contains 'c', which is outside of the specified range:
>>> ck.has_vals_within_range(df, items={'a': (0, 3), 'b': ('a', 'b')})
Traceback (most recent call last):
    ...
AssertionError: ('Outside range', 0    False
1    False
2     True
Name: b, dtype: bool)
-
bulwark.checks.has_vals_within_set(df, items=None)
Asserts that each column’s values are a subset of the values given in items.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
items (dict) – Mapping of columns (col) to an array-like of values (v) that df[col] is expected to be a subset of.
- Returns
Original df.
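This check runs in the opposite direction from has_set_within_vals: here the column's values must all appear in the given set. The plain-pandas sketch below illustrates that direction with Series.isin; the helper unexpected_values is hypothetical and not part of bulwark.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
items = {"a": [1, 2, 3, 4], "b": ["x", "y"]}


def unexpected_values(df, items):
    """Hypothetical helper: column values that are not in the allowed set."""
    found = {
        col: df.loc[~df[col].isin(allowed), col].tolist()
        for col, allowed in items.items()
    }
    # Keep only the columns that actually contain unexpected values.
    return {col: vals for col, vals in found.items() if vals}


# "a" is fine (the extra allowed value 4 does not matter); "b" contains "z".
unexpected = unexpected_values(df, items)
```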
-
bulwark.checks.is_monotonic(df, items=None, increasing=None, strict=False)
Asserts that the df is monotonic.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
items (dict) – Mapping of columns to (increasing, strict) conditions, e.g. {'col_a': (None, False), 'col_b': (None, False)}.
increasing (bool, None) – None checks for either increasing or decreasing monotonicity.
strict (bool) – Whether the comparison should be strict, meaning two values in a row being equal should fail.
- Returns
Original df.
Examples
The following check will pass, since each column matches its monotonicity requirements:
>>> import bulwark.checks as ck
>>> import pandas as pd
>>> df = pd.DataFrame({"incr_strict": [1, 2, 3, 4],
...                    "incr_not_strict": [1, 2, 2, 3],
...                    "decr_strict": [4, 3, 2, 1],
...                    "decr_not_strict": [3, 2, 2, 1]})
>>> items = {
...     "incr_strict": (True, True),
...     "incr_not_strict": (True, False),
...     "decr_strict": (False, True),
...     "decr_not_strict": (False, False)
... }
>>> ck.is_monotonic(df, items=items)
   incr_strict  incr_not_strict  decr_strict  decr_not_strict
0            1                1            4                3
1            2                2            3                2
2            3                2            2                2
3            4                3            1                1
All of the same cases will also pass if increasing=None, since only one of increasing or decreasing monotonicity is then required:
>>> ck.is_monotonic(df, increasing=None, strict=False)
   incr_strict  incr_not_strict  decr_strict  decr_not_strict
0            1                1            4                3
1            2                2            3                2
2            3                2            2                2
3            4                3            1                1
The following check will fail, displaying a list of which (row, column)s caused the issue:
>>> df2 = pd.DataFrame({'not_monotonic': [1, 2, 3, 2]})
>>> ck.is_monotonic(df2, increasing=True, strict=False)
Traceback (most recent call last):
    ...
AssertionError: [(3, 'not_monotonic')]
-
bulwark.checks.is_same_as(df, df_to_compare, **kwargs)
Asserts that two pd.DataFrames are equal.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
df_to_compare (pd.DataFrame) – A second pd.DataFrame.
**kwargs (dict) – Keyword arguments passed through to pandas’ assert_frame_equal.
- Returns
Original df.
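Since the keyword arguments are forwarded to pandas' assert_frame_equal, it helps to see what such an argument changes. The sketch below calls pandas.testing.assert_frame_equal directly (not bulwark) and uses check_dtype=False as one example of a keyword that could be passed through.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# By default, differing dtypes (int64 vs. float64) make the frames unequal.
try:
    assert_frame_equal(left, right)
    strict_equal = True
except AssertionError:
    strict_equal = False

# With check_dtype=False only the values are compared, so this passes.
try:
    assert_frame_equal(left, right, check_dtype=False)
    relaxed_equal = True
except AssertionError:
    relaxed_equal = False
```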
-
bulwark.checks.is_shape(df, shape)
Asserts that df is of a known row x column shape.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
shape (tuple) – Shape of df as (n_rows, n_columns). Use None or -1 if you don’t care about a specific dimension.
- Returns
Original df.
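The wildcard behaviour described for shape can be sketched as a small comparison. shape_matches is a hypothetical helper written for this illustration, assuming that None or -1 in either position matches any size.

```python
import pandas as pd


def shape_matches(actual, expected):
    """Hypothetical helper: None or -1 in `expected` matches any size."""
    return all(
        want in (None, -1) or got == want
        for got, want in zip(actual, expected)
    )


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # shape (3, 2)

exact = shape_matches(df.shape, (3, 2))         # both dimensions checked
any_columns = shape_matches(df.shape, (3, None))  # only the row count checked
any_rows = shape_matches(df.shape, (-1, 2))     # only the column count checked
wrong_rows = shape_matches(df.shape, (4, None))   # row count differs
```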
-
bulwark.checks.one_to_many(df, unitcol, manycol)
Asserts that a many-to-one relationship is preserved between two columns.
For example, a retail store will have distinct departments, each with several employees. If each employee may only work in a single department, then the relationship of the department to the employees is one to many.
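The department and employee example can be sketched with plain pandas. The helper below is hypothetical, and it assumes that unitcol names the "one" side (department) and manycol the "many" side (employee); the parameters are not documented here, so treat that reading as an assumption.

```python
import pandas as pd


def many_values_with_several_units(df, unitcol, manycol):
    """Hypothetical helper: manycol values linked to more than one unitcol value."""
    counts = df.groupby(manycol)[unitcol].nunique()
    return counts[counts > 1].index.tolist()


good = pd.DataFrame({
    "department": ["toys", "toys", "garden"],
    "employee": ["ann", "bob", "cat"],
})

# "ann" now appears in two departments, which breaks the relationship.
bad = pd.DataFrame({
    "department": ["toys", "toys", "garden", "garden"],
    "employee": ["ann", "bob", "cat", "ann"],
})

good_violations = many_values_with_several_units(good, "department", "employee")
bad_violations = many_values_with_several_units(bad, "department", "employee")
```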
-
bulwark.checks.unique(df, columns=None)
Asserts that columns in df only have unique values.
- Parameters
df (pd.DataFrame) – Any pd.DataFrame.
columns (list) – A subset of columns to check for uniqueness of row values.
- Returns
Original df.
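As a rough illustration with plain pandas, the sketch below tests each column separately with Series.is_unique. Reading the check as per-column uniqueness (rather than uniqueness of whole rows across the listed columns) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Oslo", "Bergen"]})

# Checking every column finds the repeated city...
non_unique_all = [col for col in df.columns if not df[col].is_unique]

# ...while restricting the check to a subset of columns does not.
columns = ["id"]
non_unique_subset = [col for col in columns if not df[col].is_unique]
```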