# boilerdata


Data processing pipeline for a nucleate pool boiling apparatus.

## Overview

The data process graph (shown below) depicts the overall data process, and allows individual steps to be defined independently as Python scripts, Jupyter notebooks, or even in other languages like Matlab. The process is defined in dvc.yaml as a series of "stages", each with its own dependencies and outputs. This structure allows the data process graph to be constructed directly from the dvc.yaml file, separating the concerns of data management, process development, and pipeline orchestration. This project reflects the application of data science best practices to modern engineering workflows.
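
To illustrate this structure, here is a minimal sketch of what a single stage entry in dvc.yaml might look like. The stage name and script path follow the layout described under Usage, but the dependency and output paths are placeholders rather than the actual contents of this repository's dvc.yaml:

```yaml
# Hypothetical excerpt of dvc.yaml, illustrating one stage.
stages:
  pipeline:
    cmd: python src/boilerdata/stages/pipeline.py # script implementing this stage
    deps:
      - src/boilerdata/stages/pipeline.py
      - data/runs.csv # placeholder: output of the upstream "runs" stage
    outs:
      - data/results.csv # placeholder: result consumed by downstream stages
```

DVC uses the deps and outs declared for each stage to construct the dependency graph shown at the bottom of this page, and dvc repro re-runs only the stages whose dependencies have changed.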

## Usage

If you would like to adopt this approach to processing your own data, you may clone this repository and begin swapping configs and logic for your own, or use a similar architecture for your data processing. To run a working example with some actual data from this study, perform the following steps:

  1. Clone this repository and open it in your terminal or IDE (e.g. git clone https://github.com/blakeNaccarato/boilerdata.git boilerdata).

  2. Navigate to the cloned directory in a terminal window (e.g. cd boilerdata).

  3. Create a Python 3.10 virtual environment (e.g. py -3.10 -m venv .venv on Windows w/ Python 3.10 installed from python.org).

  4. Activate the virtual environment (e.g. .venv/scripts/activate on Windows).

  5. Run pip install --editable . to install the boilerdata package in an editable fashion. This step may take a while.

  6. Delete the top-level data and config directories, then copy the config and data folders from tests/data to the root directory.

  7. Copy the .propshop folder from tests/data/.propshop to your user folder (e.g. C:/Users/<you>/.propshop on Windows).

  8. Run dvc repro metrics to execute the data process up to that stage.

The data process should run the following stages: axes, modelfun, runs, parse_benchmarks, pipeline, and metrics. Some stages are skipped because we asked DVC to run only the stages needed to produce metrics (the example data doesn't currently include the literature data). You may inspect the actual code that runs during these stages in src/boilerdata/stages, e.g. pipeline.py contains the logic for the pipeline stage. This example happens to use Python scripts, but you could define a stage in dvc.yaml that instead runs Matlab scripts, or any arbitrary action. This approach allows the data process to be reliably reproduced over time, and to be easily modified and extended in a collaborative effort.
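
As a sketch of that flexibility, a hypothetical stage running a Matlab script rather than Python could be declared like this (the script and data paths are illustrative only, not files in this repository):

```yaml
# Hypothetical dvc.yaml stage running Matlab instead of Python.
stages:
  fit_curves:
    cmd: matlab -batch "fit_curves" # any shell command can serve as a stage
    deps:
      - scripts/fit_curves.m # placeholder Matlab script
      - data/curves # placeholder input data
    outs:
      - data/fits.csv # placeholder output tracked by DVC
```

Because DVC tracks only each stage's command, dependencies, and outputs, the language used to implement a given stage doesn't affect the rest of the pipeline.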

There are other details to this process, such as hosting the contents of the data folder in a Google Cloud bucket (or, alternatively, on Google Drive). This reflects the need to store data, especially large datasets, outside of the repository and to access it in an authenticated fashion.

## Data process graph

The data process graph below is derived from the structure of the code itself and is generated automatically by DVC. This self-documenting approach improves reproducibility and reduces documentation overhead.

```mermaid
flowchart TD
  node1["axes"]
  node2["data\benchmarks.dvc"]
  node3["data\curves.dvc"]
  node4["data\literature.dvc"]
  node5["data\plotter.dvc"]
  node6["literature"]
  node7["metrics"]
  node8["modelfun"]
  node9["originlab"]
  node10["parse_benchmarks"]
  node11["pipeline"]
  node12["runs"]
  node1-->node8
  node1-->node10
  node1-->node12
  node2-->node10
  node3-->node12
  node4-->node6
  node5-->node9
  node6-->node9
  node8-->node11
  node10-->node11
  node11-->node7
  node11-->node9
  node12-->node11
```