# ruff: noqa: E402, I001
Physics Consistency Filter: Legacy Database vs. PERLA Pipeline
This notebook evaluates data quality by testing the fundamental physics relationship for solar cell power conversion efficiency: PCE = (FF × VOC × JSC) / Pin. We compare two datasets from the Perovskite Solar Cell Database in NOMAD: the legacy human-curated entries and the new PERLA LLM-extracted entries.
The analysis quantifies the fraction of legacy database entries that fail this physics consistency check (absolute tolerance of 0.2 percentage points). The PERLA pipeline, by contrast, enforces the check as a validation requirement, so only physically consistent entries are accepted into the database.
Implications
This comparison reveals important differences in data quality between the two datasets:
Legacy Database: Entries that fail the physics consistency check may originate from errors in the manual data curation or from inconsistencies in the source publications themselves, such as reporting errors, calculation mistakes, or unit mismatches. The fraction of failing entries reflects the challenges inherent in literature-reported data, regardless of the curation method.
PERLA Pipeline: By enforcing physics-based filters during LLM extraction, PERLA automatically excludes entries that fail consistency checks. This automated validation ensures that only physically coherent data enters the database, improving data reliability while maintaining scalability. It also helps exclude entries in which the parameters of different devices were mixed, which can happen when a paper reports multiple solar cells.
The results demonstrate that we can dramatically reduce these inconsistencies by having this check in the extraction pipeline.
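To make the idea of an in-pipeline filter concrete, here is a minimal sketch of such a check applied to extracted records. It is not the PERLA implementation: the function name `is_physically_consistent`, the record keys, and the choice to reject incomplete records are all assumptions made for illustration, and the sample records are made up.

```python
def is_physically_consistent(record, atol=0.2):
    """Compare reported PCE (%) with FF * VOC * JSC / Pin * 100 (SI inputs).

    Hypothetical helper; the keys "ff", "voc", "jsc", "pin", "pce" are
    illustrative and do not correspond to NOMAD field names.
    """
    required = ("ff", "voc", "jsc", "pin", "pce")
    if any(record.get(key) is None for key in required):
        return False  # assumption: incomplete records cannot be validated
    if record["pin"] <= 0:
        return False  # avoid division by zero for unphysical intensities
    pce_calc = record["ff"] * record["voc"] * record["jsc"] / record["pin"] * 100.0
    return abs(pce_calc - record["pce"]) <= atol


# Made-up extracted records (JSC in A/m², Pin in W/m², PCE in %)
records = [
    {"ff": 0.80, "voc": 1.00, "jsc": 200.0, "pin": 1000.0, "pce": 16.0},  # consistent
    {"ff": 0.80, "voc": 1.00, "jsc": 200.0, "pin": 1000.0, "pce": 19.5},  # PCE of another device
    {"ff": 0.80, "voc": None, "jsc": 200.0, "pin": 1000.0, "pce": 16.0},  # missing VOC
]
accepted = [r for r in records if is_physically_consistent(r)]
print(f"accepted {len(accepted)} of {len(records)} records")
```

The second record shows how a mixed-parameter extraction is caught: the J-V parameters belong to one device and the PCE to another, so the recomputed value is several percentage points off.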
Methodology
The physics consistency check validates that the reported power conversion efficiency (PCE) matches the calculated efficiency from measured parameters:
$$\text{PCE} = \frac{\text{FF} \times V_{\text{OC}} \times J_{\text{SC}}}{P_{\text{in}}}$$
Where:
- FF = Fill Factor (dimensionless, 0-1)
- VOC = Open Circuit Voltage (V)
- JSC = Short Circuit Current Density (A/m², SI units)
- Pin = Incident Power Density (W/m², SI units; typically 1000 W/m² for standard test conditions)
FF, VOC, JSC, and Pin are in SI units as stored in NOMAD results, and PCE is stored in percent, so (FF × VOC × JSC) / Pin × 100 gives the calculated PCE in percent. Units: (V × A/m²) / (W/m²) × 100 = (W/m²) / (W/m²) × 100 = %. We use an absolute tolerance of ±0.2 percentage points to account for rounding and measurement precision. Entries that fail this check may indicate data entry errors, unit mismatches, or measurement inconsistencies.
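As a worked instance of the formula, the sketch below evaluates the check for a single device. The numeric values and the helper name `pce_from_jv` are illustrative assumptions, not database entries. The conversion from mA/cm² to A/m² (a factor of 10) is included because papers usually report JSC in mA/cm².

```python
import math


def pce_from_jv(ff, voc_v, jsc_a_per_m2, pin_w_per_m2=1000.0):
    """Return PCE in percent from SI inputs (hypothetical helper)."""
    return ff * voc_v * jsc_a_per_m2 / pin_w_per_m2 * 100.0


# Made-up device: FF = 0.75, VOC = 1.10 V, JSC = 23.0 mA/cm²
jsc_si = 23.0 * 10.0  # 1 mA/cm² = 10 A/m²
pce_calc = pce_from_jv(0.75, 1.10, jsc_si)

# Absolute tolerance of 0.2 percentage points, with no relative term
passes_ok = math.isclose(19.0, pce_calc, rel_tol=0.0, abs_tol=0.2)
passes_bad = math.isclose(20.5, pce_calc, rel_tol=0.0, abs_tol=0.2)

print(f"calculated PCE = {pce_calc:.3f} %")
print(f"reported 19.0 % passes: {passes_ok}")
print(f"reported 20.5 % passes: {passes_bad}")
```

With these numbers the calculated PCE is 0.75 × 1.10 × 230 / 1000 × 100 = 18.975%, so a reported value of 19.0% lies within tolerance while 20.5% does not.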
from plotly_theme import register_template, set_defaults # type: ignore
register_template()
set_defaults()
The dataset for this analysis
The query used to create this parquet file includes only solar cells with a registered illumination intensity around 1-sun conditions. It excludes entries where the illumination intensity could not be registered.
# Load the data from the parquet file into a DataFrame
import pandas as pd
df = pd.read_parquet('perovskite_solar_cell_database.parquet')
# Set a source_database column: if name_of_person_entering_the_data is 'LLM Extraction', use 'LLM Extracted', else 'Manual Entry'
df['source_database'] = df['data.ref.name_of_person_entering_the_data'].apply(
lambda x: 'LLM Extracted' if x == 'LLM Extraction' else 'Manual Entry'
)
from plotly_theme import DEFAULT_COLORWAY
SOURCE_ORDER = ['Manual Entry', 'LLM Extracted']
COLOR_MAP = dict(zip(SOURCE_ORDER, DEFAULT_COLORWAY))
# Print summary statistics for results.properties.optoelectronic.solar_cell.illumination_intensity
# Overall distribution first
print("="*80)
print("OVERALL DISTRIBUTION")
print("="*80)
print(df['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
print()
# Check for None/NaN values
none_count = df['results.properties.optoelectronic.solar_cell.illumination_intensity'].isna().sum()
print(f"Number of None/NaN entries: {none_count}")
print(f"Percentage: {none_count/len(df)*100:.2f}%")
print()
# Check for zero values
zero_count = (df['results.properties.optoelectronic.solar_cell.illumination_intensity'] == 0).sum()
print(f"Number of zero entries: {zero_count}")
print(f"Percentage: {zero_count/len(df)*100:.2f}%")
print()
# print how many are not 1000 W/m2 (excluding None/NaN and zeros)
not_1000 = df[
(df['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000) &
(df['results.properties.optoelectronic.solar_cell.illumination_intensity'].notna()) &
(df['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 0)
]
print(f"Number of entries not at 1000 W/m2 (excluding None/NaN and zeros): {len(not_1000)}")
print(f"Percentage: {len(not_1000)/len(df)*100:.2f}%")
print()
# print the distribution values for the illumination intensity for each source_database
print("="*80)
print("BREAKDOWN BY SOURCE DATABASE")
print("="*80)
for source in SOURCE_ORDER:
    subset = df[df['source_database'] == source]
    print(f"\n{source}:")
    print(subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
    print()
    # Check for None/NaN values
    none_count = subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].isna().sum()
    print(f"Number of None/NaN entries for {source}: {none_count}")
    print(f"Percentage: {none_count/len(subset)*100:.2f}%")
    print()
    # Check for zero values
    zero_count = (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] == 0).sum()
    print(f"Number of zero entries for {source}: {zero_count}")
    print(f"Percentage: {zero_count/len(subset)*100:.2f}%")
    print()
    # print how many are not 1000 W/m2 (excluding None/NaN and zeros)
    not_1000 = subset[
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000) &
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].notna()) &
        (subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 0)
    ]
    print(f"Number of entries not at 1000 W/m2 for {source} (excluding None/NaN and zeros): {len(not_1000)}")
    print(f"Percentage: {len(not_1000)/len(subset)*100:.2f}%")
    print()
================================================================================
OVERALL DISTRIBUTION
================================================================================
count    48745.000000
mean      1004.623427
std        278.436899
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of None/NaN entries: 2383
Percentage: 4.66%

Number of zero entries: 8
Percentage: 0.02%

Number of entries not at 1000 W/m2 (excluding None/NaN and zeros): 449
Percentage: 0.88%

================================================================================
BREAKDOWN BY SOURCE DATABASE
================================================================================

Manual Entry:
count    43032.000000
mean       999.496165
std        191.401147
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of None/NaN entries for Manual Entry: 75
Percentage: 0.17%

Number of zero entries for Manual Entry: 8
Percentage: 0.02%

Number of entries not at 1000 W/m2 for Manual Entry (excluding None/NaN and zeros): 403
Percentage: 0.93%

LLM Extracted:
count     5713.000000
mean      1043.243480
std        619.607578
min        100.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      10000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of None/NaN entries for LLM Extracted: 2308
Percentage: 28.77%

Number of zero entries for LLM Extracted: 0
Percentage: 0.00%

Number of entries not at 1000 W/m2 for LLM Extracted (excluding None/NaN and zeros): 46
Percentage: 0.57%
# Print the illumination intensity distribution for each source_database
for source in SOURCE_ORDER:
    subset = df[df['source_database'] == source]
    print(f"Distribution for {source}:")
    print(subset['results.properties.optoelectronic.solar_cell.illumination_intensity'].describe())
    print()
    # Count entries not at 1000 W/m2. NaN and zero entries are included in
    # this count, because NaN != 1000 evaluates to True.
    not_1000 = subset[subset['results.properties.optoelectronic.solar_cell.illumination_intensity'] != 1000]
    print(f"Number of entries not at 1000 W/m2 for {source}: {len(not_1000)}")
    print()
Distribution for Manual Entry:
count    43032.000000
mean       999.496165
std        191.401147
min          0.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      18000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of entries not at 1000 W/m2 for Manual Entry: 486

Distribution for LLM Extracted:
count     5713.000000
mean      1043.243480
std        619.607578
min        100.000000
25%       1000.000000
50%       1000.000000
75%       1000.000000
max      10000.000000
Name: results.properties.optoelectronic.solar_cell.illumination_intensity, dtype: float64

Number of entries not at 1000 W/m2 for LLM Extracted: 2354
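The counts from this cell are larger than those from the earlier breakdown because the comparison `!= 1000` evaluates to True for NaN, so missing and zero intensities are counted as "not at 1000 W/m2" (403 + 75 + 8 = 486 for Manual Entry; 46 + 2308 + 0 = 2354 for LLM Extracted). A minimal sketch with made-up values, not taken from the dataset, shows the difference between the two ways of counting:

```python
import numpy as np
import pandas as pd

# Made-up intensities in W/m²: two at 1 sun, one off-standard, one zero, one missing
intensity = pd.Series([1000.0, 1000.0, 500.0, 0.0, np.nan])

# Naive comparison: NaN != 1000 is True, so the missing value is counted
naive_count = int((intensity != 1000).sum())

# Strict comparison: exclude missing and zero values explicitly
strict_mask = (intensity != 1000) & intensity.notna() & (intensity != 0)
strict_count = int(strict_mask.sum())

print(f"naive: {naive_count}, strict: {strict_count}")
```

Only the strict count isolates measurements that were deliberately taken at a non-standard intensity.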
import numpy as np
import plotly.graph_objects as go
# columns we REQUIRE to be present
required_cols = [
'results.properties.optoelectronic.solar_cell.fill_factor',
'results.properties.optoelectronic.solar_cell.short_circuit_current_density',
'results.properties.optoelectronic.solar_cell.open_circuit_voltage',
'results.properties.optoelectronic.solar_cell.efficiency',
'results.properties.optoelectronic.solar_cell.illumination_intensity',
]
# drop rows where ANY required value is missing
df_clean = df.dropna(subset=required_cols).copy()
# alias for readability
# Units from NOMAD results (all SI):
# ff: dimensionless (0-1)
# jsc: A/m² (SI)
# voc: V (SI)
# pce: % (percentage)
# illumination_intensity: W/m² (SI)
ff = df_clean['results.properties.optoelectronic.solar_cell.fill_factor']
jsc = df_clean[
'results.properties.optoelectronic.solar_cell.short_circuit_current_density'
]
voc = df_clean['results.properties.optoelectronic.solar_cell.open_circuit_voltage']
pce = df_clean['results.properties.optoelectronic.solar_cell.efficiency']
illumination = df_clean['results.properties.optoelectronic.solar_cell.illumination_intensity']
# compute expected PCE using correct formula
# PCE (%) = (FF × V_OC [V] × J_SC [A/m²]) / P_in [W/m²] × 100
# Units: (dimensionless × V × A/m²) / (W/m²) × 100 = (W/m²) / (W/m²) × 100 = %
df_clean['pce_calc'] = (ff * voc * jsc) / illumination * 100
# isclose check (absolute tolerance only; rtol=0 disables NumPy's default relative term)
df_clean['pce_isclose'] = np.isclose(
    pce,
    df_clean['pce_calc'],
    rtol=0,
    atol=0.2,
)
summary = (
df_clean.groupby('source_database')['pce_isclose']
.agg(fraction='mean', n='size')
.reindex(['Manual Entry', 'LLM Extracted'])
)
# Calculate mismatch data for panel b
df_mismatch = df_clean[~df_clean['pce_isclose']].copy()
df_mismatch['pce_diff'] = abs(df_mismatch['pce_calc'] - pce[df_mismatch.index])
df_mismatch['pce_diff_percent'] = (df_mismatch['pce_diff'] / pce[df_mismatch.index]) * 100
from plotly.subplots import make_subplots
# Create figure with subplots
fig = make_subplots(
rows=1, cols=2,
subplot_titles=('', ''),
horizontal_spacing=0.15,
column_widths=[0.45, 0.55]
)
# PANEL A: Bar chart
bar_colors = [COLOR_MAP[src] for src in summary.index]
fig.add_trace(
go.Bar(
x=summary.index,
y=summary['fraction'],
text=[f'{frac:.1%}<br>n={n}' for frac, n in zip(summary['fraction'], summary['n'])],
textposition='inside',
textfont=dict(size=16, color='white', family='Arial'),
marker=dict(color=bar_colors),
showlegend=False,
),
row=1, col=1
)
# PANEL B: Scatter plot with improved styling
for source in SOURCE_ORDER:
    subset = df_mismatch[df_mismatch['source_database'] == source]
    fig.add_trace(
        go.Scatter(
            x=subset['results.properties.optoelectronic.solar_cell.efficiency'],
            y=subset['pce_calc'],
            mode='markers',
            name=source,
            marker=dict(
                color=COLOR_MAP[source],
                size=6,
                # opacity=0.9,
                line=dict(color='white', width=1.0)
            ),
            showlegend=True,
        ),
        row=1, col=2
    )
# Add diagonal line to panel b
fig.add_trace(
go.Scatter(
x=[0, 26],
y=[0, 26],
mode='lines',
line=dict(color='gray', dash='dash', width=1.5),
showlegend=False,
hoverinfo='skip'
),
row=1, col=2
)
# Update axes for panel a
fig.update_xaxes(
# title_text='Data Source',
title_font=dict(size=16, family='Arial'),
tickfont=dict(size=16, family='Arial'),
showgrid=False,
row=1, col=1
)
fig.update_yaxes(
title_text='Fraction Passing Consistency Check',
title_font=dict(size=16, family='Arial'),
tickformat='.0%',
tickfont=dict(size=16, family='Arial'),
range=[0, 1.05],
showgrid=True,
gridcolor='rgba(200, 200, 200, 0.3)',
griddash='dot',
row=1, col=1
)
# Update axes for panel b
fig.update_xaxes(
title_text='Reported PCE (%)',
title_font=dict(size=16, family='Arial'),
tickfont=dict(size=16, family='Arial'),
range=[0, 26],
showgrid=True,
gridcolor='rgba(200, 200, 200, 0.3)',
griddash='dot',
row=1, col=2
)
fig.update_yaxes(
title_text='Calculated PCE (%)',
title_font=dict(size=16, family='Arial'),
tickfont=dict(size=16, family='Arial'),
range=[0, 26],
showgrid=True,
gridcolor='rgba(200, 200, 200, 0.3)',
griddash='dot',
row=1, col=2
)
# Add Nature-style panel labels
fig.add_annotation(
text='<b>a</b>',
xref='x domain', yref='y domain',
x=-0.15, y=1.05,
xanchor='left', yanchor='bottom',
font=dict(size=18, family='Arial', color='black'),
showarrow=False,
row=1, col=1
)
fig.add_annotation(
text='<b>b</b>',
xref='x2 domain', yref='y2 domain',
x=-0.15, y=1.05,
xanchor='left', yanchor='bottom',
font=dict(size=18, family='Arial', color='black'),
showarrow=False,
row=1, col=2
)
# Update overall layout
fig.update_layout(
plot_bgcolor='white',
paper_bgcolor='white',
font=dict(family='Arial', size=12),
legend=dict(
x=0.535,
y=0.98,
xanchor='left',
yanchor='top',
bgcolor='rgba(255, 255, 255, 0.8)',
font=dict(size=16, family='Arial')
),
width=700,
height=400,
margin=dict(t=60, b=80, l=80, r=80)
)
fig.show(renderer="notebook")
# Export combined figure to PDF (Nature quality)
fig.write_image('physics_filter_combined.pdf', scale=1, width=700, height=500)
print("Figure exported to: physics_filter_combined.pdf")
Figure exported to: physics_filter_combined.pdf