Query the Perovskite Database in NOMAD
This notebook retrieves data from the Perovskite Database in NOMAD and stores it as a parquet file. ⚠️ Note: this query may take up to an hour to complete due to API rate limits.
Initialize the NOMAD query
Import the async helpers, apply nest_asyncio so the event loop can run inside Jupyter, define the archive sections to retrieve, and build a filter that isolates perovskite solar cell entries measured near one-sun illumination.
from time import time

import nest_asyncio
from nomad.client.archive import ArchiveQuery

nest_asyncio.apply()

# Define which sections to retrieve from the archive
required = {
    'results': '*',
    'data': '*',
}
# Build the query to find perovskite solar cells measured under approximately one-sun conditions
query = ArchiveQuery(
    query={
        'and': [
            {
                'results.properties.optoelectronic.solar_cell.illumination_intensity': {
                    'gte': 600.0,  # constrain to around one-sun illumination conditions
                    'lte': 1200.0,
                }
            },
            {
                'section_defs.definition_qualified_name:all': [
                    'perovskite_solar_cell_database.schema.PerovskiteSolarCell'
                ]
            },
        ]
    },
    required=required,
    page_size=50000,
    results_max=60000,
)
Estimate matching entries
Trigger async_fetch to learn how many archive entries satisfy the filter. This count guides the download size used in the following step.
number_of_entries = (
    await query.async_fetch()
)  # pass a number n to fetch only the first n entries: async_fetch(n)
Download the dataset
Download the full set of matching archives asynchronously and record how long the transfer takes for future reference.
start_time = time()
results = await query.async_download(
    number_of_entries
)  # pass a number n to download only the first n entries: async_download(n)
end_time = time()
print(f'Downloaded in {end_time - start_time:.1f} seconds.')
Convert results to a dataframe
Append the downloaded batch to the query object's internal cache and materialize a pandas dataframe for inspection inside the notebook.
query._entries_dict.append(results) # this is only needed in async mode.
df = query.entries_to_dataframe()
df
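Once materialized, the flattened archive columns can be explored with standard pandas tooling. As a minimal sketch, assuming the dataframe uses dotted archive paths as column names (the exact names depend on the NOMAD schema version), the solar-cell columns can be isolated with `DataFrame.filter`:

```python
import pandas as pd

# Toy stand-in for the downloaded dataframe; the real column names come from
# the NOMAD archive paths and may differ between schema versions (assumption).
df = pd.DataFrame({
    'results.properties.optoelectronic.solar_cell.efficiency': [18.2, 21.5],
    'results.properties.optoelectronic.solar_cell.illumination_intensity': [1000.0, 980.0],
    'results.material.chemical_formula_descriptive': ['MAPbI3', 'FAPbI3'],
})

# Select only columns whose name contains 'solar_cell'
solar_cell_cols = df.filter(like='solar_cell')
print(solar_cell_cols.shape)  # (2, 2)
```

The same `filter(like=...)` pattern works for any other archive subsection, e.g. `like='material'`.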
Persist a local copy
Store the dataframe as perovskite_solar_cell_database.parquet so the curated dataset can be reused outside the NOMAD client session.
df.to_parquet('perovskite_solar_cell_database.parquet', index=False)