Bulwark’s Documentation¶
Bulwark is a package for convenient property-based testing of pandas dataframes, supported for Python 3.5+.
Documentation: https://bulwark.readthedocs.io/en/latest/index.html
This project was heavily influenced by the no-longer-supported Engarde library by Tom Augspurger (thanks for the head start, Tom!), which itself was modeled after the R library assertr.
Why?¶
Data are messy, and pandas is one of the go-to libraries for analyzing tabular data. In the real world, data analysts and scientists often feel like they don’t have the time or energy to think of and write tests for their data. Bulwark’s goal is to let you check that your data meets your assumptions of what it should look like at any (and every) step in your code, without making you work too hard.
Usage¶
Bulwark comes with checks for many of the common assumptions you might want to validate for the functions that make up your ETL pipeline, and lets you toss those checks as decorators on the functions you’re already writing:
import bulwark.decorators as dc

@dc.IsShape((-1, 10))
@dc.IsMonotonic(strict=True)
@dc.HasNoNans()
def compute(df):
    # complex operations to determine result
    ...
    return result_df
Still want to have more robust test files? Bulwark’s got you covered there, too, with importable functions.
import bulwark.checks as ck
# Pass the check itself to pipe; pipe supplies df as the first argument.
df.pipe(ck.has_no_nans)
Won’t I have to go clean up all those decorators when I’m ready to go to production? Nope - just toggle the built-in “enabled” flag available for every decorator.
@dc.IsShape((3, 2), enabled=False)
def compute(df):
    # complex operations to determine result
    ...
    return result_df
What if the test I want isn’t part of the library?
Use the built-in CustomCheck to plug in your own custom function!
import numpy as np
import pandas as pd
import bulwark.decorators as dc

def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df

# Define df first: the decorator arguments below are evaluated when append_a_df is defined.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})

@dc.CustomCheck(len_longer_than, df=df, l=6)
def append_a_df(df, df2):
    return df.append(df2, ignore_index=True)

append_a_df(df, df2)
What if I want to run a lot of tests and want to see all the errors at once?
You can use the built-in MultiCheck. It will collect all of the errors and either display a warning message or throw an exception, based on the warn flag.
You can even use custom functions with MultiCheck:
import numpy as np
import pandas as pd
import bulwark.checks as ck
import bulwark.decorators as dc

def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df

# `checks` takes a dict of function: dict of params for that function.
# Note that those function params EXCLUDE df.
# Also note that when you use MultiCheck, there's no need to use CustomCheck - just feed in the function.
@dc.MultiCheck(checks={ck.has_no_nans: {"columns": None},
                       len_longer_than: {"l": 6}},
               warn=False)
def append_a_df(df, df2):
    return df.append(df2, ignore_index=True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})
append_a_df(df, df2)
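The collect-then-report behaviour described above can be sketched in plain Python. This is only an illustration of the idea under stated assumptions, not Bulwark's implementation: `run_all_checks` and `has_no_none_values` are hypothetical names, and a plain list stands in for a DataFrame so the sketch runs without pandas.

```python
import warnings


def run_all_checks(df, checks, warn=False):
    """Run every check, collect the failures, then warn or raise once (sketch only)."""
    errors = []
    for check_func, params in checks.items():
        try:
            check_func(df, **params)
        except AssertionError as err:
            # Keep going so that later checks still get a chance to run.
            errors.append("{}: {}".format(check_func.__name__, err))
    if errors:
        message = "\n".join(errors)
        if warn:
            warnings.warn(message)
        else:
            raise AssertionError(message)
    return df


def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df


def has_no_none_values(df):
    # Hypothetical stand-in that works on a plain list; not a Bulwark check.
    if any(value is None for value in df):
        raise AssertionError("None values found.")
    return df


try:
    run_all_checks([1, None, 3], {has_no_none_values: {}, len_longer_than: {"l": 6}})
except AssertionError as err:
    print(err)  # one line per failed check
```

Mapping each function to its own parameter dict mirrors the `checks` argument shown above, which is why the parameters exclude df.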
See the examples for more advanced usage.
Contributing¶
Bulwark is always looking for new contributors! We work hard to make contributing as easy as possible, and previous open source experience is not required! Please see contributing.md for how to get started.
Thank you to all our past contributors, especially these folks:
Changelog¶
[0.5.1] - 2019-08-29
Changed
- Remove unnecessary six dependency
[0.5.0] - 2019-08-18
Added
- Add support for old Engarde function names with deprecation warnings for v0.7.0.
- Add ability to check bulwark version with bulwark.__version__
- Add status badges to README.md
- Add Sphinx markdown support and single-source readme, changelog.
Changed
- Upgrade Development Status to Beta (from Alpha)
- Update gitignore for venv
- Update contributing documentation
- Single-sourced project version
[0.4.2] - 2019-07-28
Changed
- Hotfix to allow import bulwark to work.
[0.4.1] - 2019-07-26
Changed
- Hotfix to allow import bulwark to work.
[0.4.0] - 2019-07-26
Added
- Add has_no_x, has_no_nones, and has_set_within_vals.
Changed
- has_no_nans now checks only for np.nans and not also None. Checking for None is available through has_no_nones.
[0.3.0] - 2019-05-30
Added
- Add exact_order param to has_columns
Changed
- Hotfix for reversed has_columns error messages for missing and unexpected columns
- Breaking change to has_columns parameter name exact, which is now exact_cols
[0.2.0] - 2019-05-29
Added
- Add has_columns check, which asserts that the given columns are contained within the df or exactly match the df’s columns.
- Add changelog
Changed
- Breaking change to rename unique_index to has_unique_index for consistency
[0.1.2] - 2019-01-13
Changed
- Improve code base to automatically generate decorators for each check
- Hotfix multi_check and unit tests
[0.1.1] - 2019-01-12
Changed
- Hotfix to setup.py for the sphinx.setup_command.BuildDoc requirement.
[0.1.0] - 2019-01-12
Changed
- Breaking change to rename unique_index to has_unique_index for consistency
Quickstart¶
Bulwark is designed to be easy to use and easy to add checks to code while you’re writing it.
First, install Bulwark:
pip install bulwark
Next, import bulwark. You can use either the function versions of the checks or the decorator versions. By convention, import either or both of these as follows:
import bulwark.checks as ck
import bulwark.decorators as dc
If you’ve chosen to use decorators to interact with the checks (the recommended method for checks to be run on each function call), you can write a function for your project like normal, but with your chosen decorators on top:
import bulwark.decorators as dc
import pandas as pd

@dc.HasNoNans()
def add_five(df):
    return df + 5

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
add_five(df)
You can stack multiple decorators on top of each other, in which case the first failed check raises an assertion error, or use the built-in MultiCheck to collect all of the errors and raise them at once.
See the examples for more advanced usage.
Design¶
It’s important that Bulwark not get in your way. Your task is hard enough without a bunch of assertions cluttering up the logic of the code. And yet, it does help to explicitly state the assumptions fundamental to your analysis. Decorators provide a nice compromise.
Checks¶
Each check:
- takes a pd.DataFrame as its first argument, with optional additional arguments,
- makes an assertion about the pd.DataFrame, and
- returns the original, unaltered pd.DataFrame.
If the assertion fails, an AssertionError is raised and Bulwark tries to print out some helpful information about where the failure occurred.
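As a minimal sketch of this contract (not taken from Bulwark's source), a hypothetical check named `has_min_rows` could look like the following. It relies only on `len()`, so it works on a pd.DataFrame as well as on any other sized object.

```python
def has_min_rows(df, n_rows):
    """Assert that df has at least n_rows rows (hypothetical check, for illustration).

    Args:
        df: A pd.DataFrame, or any object that supports len().
        n_rows: Minimum number of rows expected.

    Returns:
        The original, unaltered df, so the check can be chained with df.pipe().

    Raises:
        AssertionError: If df has fewer than n_rows rows.
    """
    if len(df) < n_rows:
        raise AssertionError(
            "Expected at least {} rows, found {}.".format(n_rows, len(df)))
    return df
```

Returning the same object that came in is what lets a check sit in the middle of a pipeline without changing its result.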
Decorators¶
Each check has an auto-magically-generated associated decorator. The decorator simply marshals arguments, allowing you to make your assertions outside the actual logic of your code. Besides making it quick and easy to add checks to a function, decorators also come with bonus capabilities, including the ability to enable/disable the check as well as switch from raising an error to logging a warning.
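To make the relationship between a check and its decorator concrete, here is a rough sketch of how a check could be wrapped with `enabled` and `warn` switches. It is an illustration only, not Bulwark's generated code: `as_decorator` and `LenLongerThan` are hypothetical names, the sketch assumes the check is applied to the decorated function's return value, and a plain list stands in for a DataFrame.

```python
import functools
import warnings


def as_decorator(check_func):
    """Turn a check into a decorator factory (illustrative sketch, not Bulwark's code)."""
    def factory(*check_args, enabled=True, warn=False, **check_kwargs):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                if not enabled:
                    return result  # check switched off, e.g. in production
                try:
                    check_func(result, *check_args, **check_kwargs)
                except AssertionError as err:
                    if not warn:
                        raise
                    warnings.warn(str(err))  # report the failure without stopping
                return result
            return wrapper
        return decorator
    return factory


def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df


LenLongerThan = as_decorator(len_longer_than)


@LenLongerThan(6, enabled=False)
def make_rows():
    return [1, 2, 3]


make_rows()  # the check is disabled, so the short result is returned unchecked
```

Keeping the flags on the decorator rather than inside the check leaves the check itself a plain function that can still be called directly or through df.pipe.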
Examples¶
Coming soon!
API¶
bulwark.checks |
Each function in this module should: |
bulwark.decorators |
How to Contribute¶
First off, thank you for considering contributing to bulwark!
It’s thanks to people like you that we continue to have a high-quality, updated and documented tool.
There are a few key ways to contribute:
- Writing new code (checks, decorators, other functionality)
- Writing tests
- Writing documentation
- Supporting fellow developers on StackOverflow.com.
No contribution is too small! Please submit as many fixes for typos and grammar bloopers as you can!
Regardless of which of these options you choose, this document is meant to make contribution more accessible by codifying tribal knowledge and expectations. Don’t be afraid to ask questions if something is unclear!
Workflow¶
- Set up Git and a GitHub account
- Bulwark follows a forking workflow, so next fork and clone the bulwark repo.
- Set up a development environment.
- Create a feature branch. Pull requests should be limited to one change only, where possible. Contributing through short-lived feature branches ensures contributions can get merged quickly and easily.
- Rebase on master and squash any unnecessary commits. We do not squash on merge, because we trust our contributors to decide which commits within a feature are worth breaking out.
- Always add tests and docs for your code. This is a hard rule; contributions with missing tests or documentation can’t be merged.
- Make sure your changes pass our CI. You won’t get any feedback until it’s green unless you ask for it.
- Once you’ve addressed review feedback, make sure to bump the pull request with a short note, so we know you’re done.
Each of these abbreviated workflow steps has additional instructions in sections below.
Development Practices and Standards¶
- Follow PEP-8 and Google’s docstring format.
- The only exception to PEP-8 is that line length can be up to 100 characters.
- Use underscores to separate words in non-class names. E.g. n_samples rather than nsamples.
- Don’t ever use wildcard imports (from module import *). It’s considered to be a bad practice by the official Python recommendations. The reasons it’s undesirable are that it pollutes the namespace, makes it harder to identify the origin of code, and, most importantly, prevents using a static analysis tool like pyflakes to automatically find bugs.
- Any new module, class, or function requires unit tests and a docstring. Test-Driven Development (TDD) is encouraged.
- Don’t break backward compatibility. In the event that an interface needs redesign to add capability, a deprecation warning should be raised in future minor versions, and the change will only be merged into the next major version release.
- Semantic line breaks are encouraged.
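Several of these standards can be seen together in one short sketch: underscore-separated names, a Google-style docstring, and an old name kept alive behind a deprecation warning instead of being removed outright. The code is illustrative and not Bulwark's own: `deprecated_alias` is a hypothetical helper, the body of `has_unique_index` is simplified, and the removal version passed in is an example value.

```python
import functools
import warnings


def has_unique_index(df):
    """Assert that the index of df contains no duplicates (simplified for illustration).

    Args:
        df: A pd.DataFrame, or any object with an iterable ``index`` attribute.

    Returns:
        The original, unaltered df.

    Raises:
        AssertionError: If any index label appears more than once.
    """
    index_labels = list(df.index)
    if len(set(index_labels)) != len(index_labels):
        raise AssertionError("The index contains duplicate labels.")
    return df


def deprecated_alias(new_func, old_name, removal_version):
    """Return a wrapper that warns about old_name, then calls new_func."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{} is deprecated and will be removed in {}; use {} instead.".format(
                old_name, removal_version, new_func.__name__),
            DeprecationWarning,
            stacklevel=2)
        return new_func(*args, **kwargs)
    return wrapper


# The old name keeps working, but tells callers to migrate.
unique_index = deprecated_alias(has_unique_index, "unique_index", "v0.7.0")
```

Passing `stacklevel=2` makes the warning point at the caller's line rather than at the wrapper, which is usually the more useful location for someone migrating their code.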
Set up Git and a GitHub Account¶
- If you don’t already have a GitHub account, you can register for free.
- If you don’t already have Git installed, you can follow these git installation instructions.
Fork and Clone Bulwark¶
You will need your own fork to work on the code. Go to the Bulwark project page and hit the Fork button.
Next, you’ll want to clone your fork to your machine:
git clone https://github.com/your-user-name/bulwark.git bulwark-dev
cd bulwark-dev
git remote add upstream https://github.com/ZaxR/bulwark.git
Set up a Development Environment¶
Bulwark supports Python 3.5+. For your local development version of Python it’s recommended to use version 3.5 within a virtual environment to ensure newer features aren’t accidentally used.
Within your virtual environment, you can easily install an editable version of bulwark along with its tests and docs requirements with:
pip install -e '.[dev]'
At this point you should be able to run/pass tests and build the docs:
python -m pytest
cd docs
make html
To avoid committing code that violates our style guide, we strongly advise you to install pre-commit hooks, which will cause your local commit to fail if our style guide was violated:
pre-commit install
You can also run them anytime (as our tox does) using:
pre-commit run --all-files
You can also use tox to run CI in all of the appropriate environments locally, as our cloud CI will:
tox
# or, use the -e flag for a specific environment. For example:
tox -e py35
Create a Feature Branch¶
To add a new feature, create a feature branch off of the master branch:
git checkout master
git checkout -b feature/<feature_name_in_snake_case>
Rebase on Master and Squash¶
If you are new to rebase, there are many useful tutorials online, such as Atlassian’s. Feel free to follow your own workflow, though if you have a default git editor set up, interactive rebasing is an easy way to go about it:
git checkout feature/<feature_name_in_snake_case>
git rebase -i master
Create a Pull Request to the master branch¶
Create a pull request to the master branch of Bulwark. Tests will be triggered to run via Travis CI. Check that your PR passes CI, since it won’t be reviewed for inclusion until it passes all steps.
For Maintainers¶
Steps for maintainers are largely the same, with a few additional steps before releasing a new version:
Update version in bulwark/project_info.py, which updates three spots: setup.py, bulwark/__init__.py, and docs/conf.py.
Update the CHANGELOG.md and the main README.md (as appropriate).
Rebuild the docs in your local version to verify how they render using:
pip install -e ".[dev]"
sphinx-apidoc -o ./docs/_source ./bulwark -f
cd docs
make html
Test distribution using TestPyPI with Twine:
# Installation
python3 -m pip install --user --upgrade setuptools wheel
python3 -m pip install --user --upgrade twine
# Build/Upload dist and install library
python3 setup.py sdist bdist_wheel
python3 -m twine upload --repository-url https://test.pypi.org/legacy/ dist/*
pip install --index-url https://test.pypi.org/simple/ bulwark
Releases are indicated using git tags. Create a tag locally for the appropriate commit in master, and push that tag to GitHub. Travis’s CD is triggered on tags within master:
git tag -a v<#.#.#> <SHA-goes-here> -m "bulwark version <#.#.#>"
git push origin --tags