Data Science

Bioremediation of Industrial Pollutants

Machine learning and optimization for fungal degradation of petroleum hydrocarbons

Approach: Forward + Inverse

Mapping: Many-to-One

Core Challenge: Inverse Problem

The Problem

Total petroleum hydrocarbons (TPH) are among the most pervasive and toxic industrial soil pollutants: they clog soil pores, disrupt ecosystems, and resist natural breakdown. Mycoremediation, the use of fungi to decompose these contaminants, offers a sustainable alternative to conventional cleanup methods, but its effectiveness depends on a complex interplay of substrate composition (hardwood, softwood, wheat bran, wheat straw, miscanthus, compost), fungal species, and mineral supplements (macro-, meso-, and micronutrients).

The company maintained a digital twin of its bioremediation process: a mathematical model simulating fungal degradation dynamics under varying environmental conditions. The forward problem, predicting degradation outcomes from a given composition, was tractable with this model. The company's real need was the inverse: given desired remediation targets, which environmental conditions should be engineered? This inverse problem is fundamentally harder because the forward mapping is many-to-one. Different substrate and supplement compositions can produce nearly identical degradation profiles, so the mapping is non-injective and the inverse is ill-posed: a given target generally has no unique solution.
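The many-to-one property can be illustrated with a toy forward model. The function below is a hypothetical stand-in, not the company's digital twin: its name, its two inputs, and its coefficients are assumptions chosen only so that two different substrate mixes produce the same output.

```python
def toy_forward(hardwood: float, wheat_straw: float) -> float:
    """Hypothetical stand-in for the digital twin (not the real model).

    Maps two substrate fractions to a degradation fraction in [0, 1).
    """
    # Both substrates feed one shared "available carbon" pool, so the model
    # only sees their weighted sum, not the individual amounts.
    carbon = 0.5 * hardwood + 0.25 * wheat_straw
    # Saturating response: more carbon helps, with diminishing returns.
    return carbon / (carbon + 0.25)


mix_a = (0.5, 0.0)  # hardwood only (illustrative values)
mix_b = (0.0, 1.0)  # wheat straw only (illustrative values)

out_a = toy_forward(*mix_a)
out_b = toy_forward(*mix_b)

# Two different inputs, one output: the forward map is not injective,
# so a target value alone cannot identify a unique composition.
print(mix_a != mix_b, out_a == out_b)  # prints: True True
```

Any composition on the line `0.5 * hardwood + 0.25 * wheat_straw = 0.25` gives the same result here, which is why a direct inverse of such a model is not well defined.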

Approach

Working at OpenHub in Belgium alongside a globally distributed team of engineers and scientists, we structured the project in two phases.

The first phase focused on the forward problem: replicating and extending the company's digital twin with supervised learning. After reducing the more than 15 input variables through Morris sensitivity analysis and SelectKPercentile, we benchmarked several model architectures and found that a Multi-Layer Perceptron (MLP) best captured the nonlinear substrate-to-degradation relationships. SHAP analysis and clustering of the predictions then gave the company a clear picture of which inputs mattered most and how different formulations grouped into performance tiers.

The second phase attacked the inverse problem directly. Because the forward mapping is many-to-one, simply inverting a trained model does not work. We framed it instead as a constrained optimization: given a trained forward model and a set of desired degradation targets, use Bayesian Optimization to search the substrate and supplement space for compositions that satisfy those targets. This let us fix known substrates, enforce output constraints, and systematically explore the remaining degrees of freedom.
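The inverse search can be sketched as a small Bayesian Optimization loop: a Gaussian-process surrogate is fitted to the losses seen so far, and the next composition is chosen by expected improvement. This is a minimal, dependency-free illustration and not the project's implementation; the source does not say which library or settings were used. The forward model `toy_forward`, the function `inverse_design`, the fixed hardwood fraction, the target of 0.5, and the kernel length scale are all hypothetical.

```python
import math
import random

JITTER = 1e-6  # small diagonal term that keeps the kernel matrix well conditioned


def toy_forward(hardwood, wheat_bran, nitrogen):
    """Hypothetical stand-in for the trained forward model (not the real MLP)."""
    carbon = 0.5 * hardwood + 0.25 * wheat_bran
    return carbon / (carbon + 0.25) * (0.4 + 0.6 * nitrogen)


def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two points."""
    return math.exp(-0.5 * sum((p - q) ** 2 for p, q in zip(a, b)) / length ** 2)


def cholesky(mat):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Solve low @ x = b by forward substitution."""
    x = []
    for i, row in enumerate(low):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def solve_upper(low, b):
    """Solve low.T @ x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / low[i][i]
    return x


def expected_improvement(mu, sigma, best):
    """Expected improvement for minimisation under a Gaussian posterior."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def inverse_design(target, fixed_hardwood, n_init=5, n_iter=20, n_cand=100, seed=0):
    """Search (wheat_bran, nitrogen) in [0, 1]^2 so the forward model hits target."""
    rng = random.Random(seed)

    def loss(x):
        # Hardwood is held fixed; only the two free variables are searched.
        return (toy_forward(fixed_hardwood, x[0], x[1]) - target) ** 2

    xs = [(rng.random(), rng.random()) for _ in range(n_init)]
    ys = [loss(x) for x in xs]
    for _step in range(n_iter):
        # Standardise losses so the unit-variance GP prior is on the right scale.
        mean = sum(ys) / len(ys)
        std = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys)) or 1.0
        zs = [(y - mean) / std for y in ys]
        # Fit the GP surrogate: K = L L^T, alpha = K^-1 z.
        k_mat = [[rbf(a, b) + (JITTER if i == j else 0.0)
                  for j, b in enumerate(xs)] for i, a in enumerate(xs)]
        low = cholesky(k_mat)
        alpha = solve_upper(low, solve_lower(low, zs))
        # Score random candidates by expected improvement; evaluate the best one.
        best_z, best_score, x_next = min(zs), -1.0, None
        for _cand in range(n_cand):
            cand = (rng.random(), rng.random())
            k_star = [rbf(x, cand) for x in xs]
            mu = sum(k * a for k, a in zip(k_star, alpha))
            v = solve_lower(low, k_star)
            sigma = math.sqrt(max(1.0 - sum(t * t for t in v), 1e-12))
            score = expected_improvement(mu, sigma, best_z)
            if score > best_score:
                best_score, x_next = score, cand
        xs.append(x_next)
        ys.append(loss(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best], ys


best_x, best_loss, history = inverse_design(target=0.5, fixed_hardwood=0.5)
print("wheat_bran=%.3f nitrogen=%.3f loss=%.2e" % (best_x[0], best_x[1], best_loss))
```

Because the toy forward map is many-to-one, different seeds may return different compositions that meet the same target. Sampling candidates only inside the unit square is how this sketch enforces input bounds; output constraints could be added as penalty terms in `loss`.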

Results

The forward models captured the nonlinear relationships between substrate compositions and TPH degradation metrics, with the MLP providing reliable predictions across the output space. Feature importance analysis identified the substrates and mineral supplements with the greatest influence on remediation performance, and the clustering analysis revealed distinct performance tiers, allowing substrate combinations to be classified into high, medium, and low effectiveness groups.

For the inverse problem, Bayesian Optimization converged to viable solutions in constrained scenarios and reached the target degradation metrics when supplement variables were included in the search space.

The work gave the company a data-driven framework for designing remediation treatments, moving from trial-and-error experimentation toward guided, AI-assisted formulation of substrate compositions.
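The grouping of formulations into high, medium, and low tiers can be sketched with a one-dimensional k-means over predicted degradation values. The source does not name the clustering algorithm, so k-means is an assumption here, and the function names and the `predicted` values are made up for illustration rather than taken from project results.

```python
def kmeans_1d(values, k=3, n_iter=50):
    """Plain 1-D k-means; returns the k centroids."""
    ordered = sorted(values)
    # Deterministic start: centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[int(i * (len(ordered) - 1) / (k - 1))] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def assign_tiers(values, centroids, names=("low", "medium", "high")):
    """Label each value with the tier of its nearest centroid."""
    order = sorted(range(len(centroids)), key=lambda j: centroids[j])
    tier_of = {j: names[rank] for rank, j in enumerate(order)}
    return [tier_of[min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))]
            for v in values]


# Hypothetical predicted degradation fractions for nine formulations.
predicted = [0.12, 0.15, 0.18, 0.45, 0.50, 0.55, 0.82, 0.85, 0.90]
tiers = assign_tiers(predicted, kmeans_1d(predicted))
print(tiers)
```

With well-separated values like these, the three groups map directly onto the low, medium, and high tiers. Real predictions would typically be clustered on several output metrics at once rather than on a single value.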