Query the Perovskite Database in NOMAD
This notebook retrieves data from the Perovskite Database in NOMAD and stores it as a parquet file. ⚠️ Note: this query may take up to an hour to complete due to API rate limits.
Initialize the NOMAD query
Import the async helpers, apply nest_asyncio so the event loop can run inside Jupyter, define the archive sections to retrieve, and build a filter that isolates perovskite solar cell entries measured near one-sun illumination.
from time import time

import nest_asyncio
from nomad.client.archive import ArchiveQuery

nest_asyncio.apply()

# Define which sections to retrieve from the archive
required = {
    'results': '*',
    'data': '*',
}
# Build the query to find perovskite solar cells measured under approximately one-sun conditions
query = ArchiveQuery(
    query={
        'and': [
            {
                'results.properties.optoelectronic.solar_cell.illumination_intensity': {
                    'gte': 600.0,  # constrain to around one-sun illumination conditions
                    'lte': 1200.0,
                }
            },
            {
                'section_defs.definition_qualified_name:all': [
                    'perovskite_solar_cell_database.schema.PerovskiteSolarCell'
                ]
            },
        ]
    },
    required=required,
    page_size=50000,
    results_max=60000,
)
Estimate matching entries
Trigger async_fetch to learn how many archive entries satisfy the filter. This count guides the download size used in the following step.
number_of_entries = (
    await query.async_fetch()
)  # pass a number n to fetch only the first n entries: async_fetch(n)
Download the dataset
Download the full set of matching archives asynchronously and record how long the transfer takes for future reference.
start_time = time()
results = await query.async_download(
    number_of_entries
)  # pass a number n to download only the first n entries: async_download(n)
end_time = time()
print(f'Downloaded in {end_time - start_time:.1f} seconds.')
Convert results to a dataframe
Append the downloaded batch to the query object's internal cache and materialize a pandas dataframe for inspection inside the notebook.
query._entries_dict.append(results) # this is only needed in async mode.
df = query.entries_to_dataframe()
df
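Once materialized, the flattened archive columns can be explored with standard pandas tooling. As a minimal sketch, assuming the dataframe uses dotted archive paths as column names (the exact names depend on the NOMAD schema version), the solar-cell columns can be isolated with `DataFrame.filter`:

```python
import pandas as pd

# Toy stand-in for the downloaded dataframe; the real column names come from
# the NOMAD archive paths and may differ between schema versions (assumption).
df = pd.DataFrame({
    'results.properties.optoelectronic.solar_cell.efficiency': [18.2, 21.5],
    'results.properties.optoelectronic.solar_cell.illumination_intensity': [1000.0, 980.0],
    'results.material.chemical_formula_descriptive': ['MAPbI3', 'FAPbI3'],
})

# Select only columns whose name contains 'solar_cell'
solar_cell_cols = df.filter(like='solar_cell')
print(solar_cell_cols.shape)  # (2, 2)
```

The same `filter(like=...)` pattern works for any other archive subsection, e.g. `like='material'`.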
Persist a local copy
Store the dataframe as perovskite_solar_cell_database.parquet so the curated dataset can be reused outside the NOMAD client session.
df.to_parquet('perovskite_solar_cell_database.parquet', index=False)