eegunity.modules.batch

Public API

eegunity.modules.batch.eeg_batch.EEGBatch class

class eegunity.modules.batch.eeg_batch.EEGBatch(main_instance)

Bases: _UDatasetSharedAttributes, EEGBatchMixinEpoch

This is a key module of the UnifiedDataset class, focused on batch processing. The EEGBatch class has the same attributes as the UnifiedDataset class and defines the functions related to EEG batch processing.

batch_process(con_func, app_func, is_patch, result_type=None, execution_mode=None)

Process each row of locator based on conditions specified in con_func and apply app_func accordingly. This function handles both list and dataframe return types, ensuring the result aligns with the original locator’s rows based on the is_patch flag.

Parameters:

con_func (Callable) – A function that takes a row of locator and returns True or False to determine if app_func should be applied to that row. The input is a single row from the locator, which you can access like a dictionary. For example, to read the file path attribute, use: file_path = row['File Path']
app_func (Callable) – A function that processes a row of locator and returns the result. The input is the same as for con_func.
is_patch (bool) – If True, the returned list length or dataframe rows will match the locator’s row count, using placeholder elements as needed.
result_type ({'series', 'value', None}, optional) – Specifies the expected return type of app_func results. Can be “series”, “value”, or None (case insensitive). Defaults to None.
execution_mode ({'thread', 'process', None}, optional) – Selects the concurrency backend used when num_workers > 0. When None (default), the method always runs sequentially regardless of the num_workers setting - use this for lightweight or shared-state operations. 'thread' uses a ThreadPoolExecutor and is suited to I/O-bound workloads (file reads, network calls). 'process' uses a ProcessPoolExecutor for CPU-bound workloads; closures are handled transparently via cloudpickle. Defaults to None.

Returns:

The processed results, either as a list or dataframe, depending on result_type and app_func return type and consistency. Returns None if result_type is None.

Return type:

Union[None, list, pd.DataFrame]

Raises:

ValueError – If result_type is not one of the expected values.
ValueError – If execution_mode is not one of 'thread', 'process', or None.

Note

This method is essential when designing a custom processing pipeline for the dataset. Ensure that con_func and app_func are compatible with the structure of the locator. If using is_patch, consider the implications on the data integrity. When execution_mode is None, num_workers is ignored and execution is always sequential; this is the safe default for methods that have not yet been classified as I/O- or CPU-bound.
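The relationship between execution_mode and num_workers described above can be sketched with the standard library. This is only an illustration under stated assumptions, not the EEGBatch implementation: run_rows and its num_workers argument are hypothetical names, and the 'process' branch is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_rows(rows, func, execution_mode=None, num_workers=0):
    """Hypothetical helper: apply func to every row and keep the row order."""
    if execution_mode not in (None, "thread", "process"):
        raise ValueError("execution_mode must be 'thread', 'process', or None")
    # None always means sequential execution, whatever num_workers says.
    if execution_mode is None or num_workers <= 0:
        return [func(row) for row in rows]
    if execution_mode == "thread":
        # Threads suit I/O-bound work such as file reads or network calls.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            return list(pool.map(func, rows))  # map() preserves input order
    # A 'process' branch would use ProcessPoolExecutor in the same way; it is
    # omitted because worker processes need picklable, importable functions.
    raise NotImplementedError("process mode is not part of this sketch")


rows = [{"File Path": "a.edf"}, {"File Path": "b.edf"}]
print(run_rows(rows, lambda r: r["File Path"].upper(),
               execution_mode="thread", num_workers=2))
```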

Examples

>>> from eegunity import UnifiedDataset
>>> u_ds = UnifiedDataset(***)
>>> # example1: sequential (default)
>>> new_locator = u_ds.eeg_batch.batch_process(con_func, app_func, is_patch=True, result_type='series')
>>> print(new_locator)
>>> # example2: threaded I/O-bound
>>> a_list = u_ds.eeg_batch.batch_process(con_func, app_func, is_patch=True, result_type='value',
...                                        execution_mode='thread')
>>> print(a_list)
>>> # example3: CPU-bound multiprocessing
>>> u_ds.eeg_batch.batch_process(con_func, app_func, is_patch=True, result_type=None,
...                              execution_mode='process')
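The examples above leave con_func and app_func undefined. The sketch below shows one plausible pair and imitates the effect of is_patch on a plain list of dictionaries that stands in for the locator. The helper batch_process_sketch, the 'standard_data' file type value, and the use of None as the placeholder element are assumptions made for illustration, not the library's behaviour.

```python
def con_func(row):
    # Select only rows with a given file type ('standard_data' is a made-up value).
    return row["File Type"] == "standard_data"


def app_func(row):
    # Any per-row computation; here the sampling rate is simply doubled.
    return row["Sampling Rate"] * 2


def batch_process_sketch(locator_rows, con_func, app_func, is_patch):
    """Hypothetical stand-in that mimics the documented is_patch semantics."""
    results = []
    for row in locator_rows:
        if con_func(row):
            results.append(app_func(row))
        elif is_patch:
            results.append(None)  # assumed placeholder for skipped rows
    return results


locator_rows = [
    {"File Path": "a.edf", "File Type": "standard_data", "Sampling Rate": 250},
    {"File Path": "b.txt", "File Type": "unknown", "Sampling Rate": 0},
    {"File Path": "c.edf", "File Type": "standard_data", "Sampling Rate": 500},
]
patched = batch_process_sketch(locator_rows, con_func, app_func, is_patch=True)
compact = batch_process_sketch(locator_rows, con_func, app_func, is_patch=False)
print(patched)  # [500, None, 1000]
print(compact)  # [500, 1000]
```

With is_patch=True the result stays aligned with the locator rows, which is what makes it safe to write back as a column.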

set_metadata(col_name, value)

Set the specified metadata column in the locator to the given list of values. This function is generally used to modify dataset metadata directly.

Parameters:

col_name (str) – The name of the column to be set, such as File Path, Domain Tag, File Type, Data Shape, Channel Names, Number of Channels, Sampling Rate, Duration, or Completeness Check.
value (list) – The list of values to set in the column. Its length must match the number of rows in the dataframe.

Return type:

None

Raises:

ValueError – If the length of the value list does not match the number of rows in the dataframe.
TypeError – If the input types are not as expected (e.g., col_name is not a string or value is not a list).

Note

Ensure that the provided value list contains valid entries for the specified column type.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.set_metadata('Sampling Rate', [250, 250, 250])

sample_filter(channel_number=None, sampling_rate=None, duration=None, completeness_check=None, domain_tag=None, file_type=None)

Filters the ‘locator’ dataframe based on the given criteria. This function is typically used to select the data file according to specified requirements. For advanced filtering, refer to the batch_process() method.

Parameters:

channel_number (Union[Tuple[int, int], List[int], None], optional) – A tuple or list with (min, max) values to filter the “Number of Channels” column. If None, this criterion is ignored. Defaults to None.
sampling_rate (Union[Tuple[float, float], List[float], None], optional) – A tuple or list with (min, max) values to filter the “Sampling Rate” column. If None, this criterion is ignored. Defaults to None.
duration (Union[Tuple[float, float], List[float], None], optional) – A tuple or list with (min, max) values to filter the “Duration” column. If None, this criterion is ignored. Defaults to None.
completeness_check (str, optional) – A string that can be ‘Completed’, ‘Unavailable’, or ‘Acceptable’ to filter the “Completeness Check” column. The check is case-insensitive. If None, this criterion is ignored. Defaults to None.
domain_tag (str, optional) – A string to filter the “Domain Tag” column. If None, this criterion is ignored. Defaults to None.
file_type (str, optional) – A string to filter the “File Type” column. If None, this criterion is ignored. Defaults to None.

Return type:

None

Raises:

ValueError – If any of the input parameters are not in the expected format (e.g., invalid tuples or strings).

Note

This method modifies the ‘locator’ dataframe in place based on the provided filters.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.sample_filter(completeness_check='Completed')
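How the (min, max) criteria combine can be sketched on plain dictionaries. This is an illustration only: the helper names are hypothetical, only three of the criteria are covered, and treating both bounds as inclusive is an assumption that the text above does not confirm.

```python
def in_range(value, bounds):
    """True when bounds is None or min <= value <= max (inclusive: an assumption)."""
    if bounds is None:
        return True
    low, high = bounds
    return low <= value <= high


def row_matches(row, channel_number=None, sampling_rate=None,
                completeness_check=None):
    """Hypothetical predicate that combines the given criteria with AND."""
    if not in_range(row["Number of Channels"], channel_number):
        return False
    if not in_range(row["Sampling Rate"], sampling_rate):
        return False
    if completeness_check is not None:
        # Case-insensitive comparison, as documented for completeness_check.
        if row["Completeness Check"].lower() != completeness_check.lower():
            return False
    return True


rows = [
    {"Number of Channels": 32, "Sampling Rate": 250.0,
     "Completeness Check": "Completed"},
    {"Number of Channels": 64, "Sampling Rate": 1000.0,
     "Completeness Check": "Unavailable"},
]
kept = [r for r in rows
        if row_matches(r, channel_number=(16, 64), sampling_rate=(100, 500),
                       completeness_check="completed")]
print(len(kept))  # 1
```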

save_as_other(output_path, domain_tag=None, format='fif', preserve_events=True, get_data_row_params=None, overwrite=False, miss_bad_data=False)

Save data in the specified format (‘fif’ or ‘csv’) to the given output path. To save as an HDF5 file, use export_h5Dataset() instead, because HDF5 is generally used to store the whole dataset.

Parameters:

output_path (str) – The directory path where the converted files will be saved. If the path does not exist, a FileNotFoundError is raised.
domain_tag (str, optional) – Optional filter to save only the files with a matching ‘Domain Tag’. If None, all files are processed.
format (str, optional) – The format to save the data in. Supported formats are ‘fif’ and ‘csv’. If an unsupported format is provided, a ValueError is raised. Defaults to ‘fif’.
preserve_events (bool, optional) – If True, event markers will be included in the CSV file, and metadata will be adjusted. Defaults to True.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row() for data retrieval.
overwrite (bool, optional) – If True, existing files with the same name will be overwritten. If False, a new file name with an incremented suffix (e.g., “_raw(1).fif”) will be created to avoid overwriting. Defaults to False.

Returns:

This method modifies internal state in-place and does not return any value.

Return type:

None

Raises:

FileNotFoundError – If the output path does not exist.
ValueError – If the format is not ‘fif’ or ‘csv’.

Note

Ensure that the output_path is accessible and has the necessary write permissions.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.save_as_other('/path/to/output', domain_tag='example', format='fif', overwrite=False)
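The incremented-suffix behaviour described for overwrite=False can be sketched as follows. The helper next_free_path and the 'sub01_raw' stem are hypothetical; the sketch only reproduces the "_raw(1).fif" naming pattern mentioned above and makes no claim about how the library builds its file names.

```python
import os
import tempfile


def next_free_path(directory, stem, ext=".fif"):
    """Return directory/stem.ext, or stem(1).ext, stem(2).ext, ... if taken."""
    candidate = os.path.join(directory, f"{stem}{ext}")
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{stem}({counter}){ext}")
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    first = next_free_path(tmp, "sub01_raw")
    open(first, "w").close()  # simulate a file that already exists
    second = next_free_path(tmp, "sub01_raw")
    first_name = os.path.basename(first)
    second_name = os.path.basename(second)

print(first_name)   # sub01_raw.fif
print(second_name)  # sub01_raw(1).fif
```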

process_mean_std(domain_mean=True, pick_type_params={'eeg': True, 'eog': False, 'meg': False, 'stim': False}, miss_bad_data=False)

Process the mean and standard deviation for EEG data across different channels and optionally compute domain-level statistics.

This function calculates the mean and standard deviation for all EEG channels, both combined and individually. It can also aggregate the results by domain if domain_mean is set to True.

Parameters:

domain_mean (bool, optional) – If True (default), the function aggregates the results by domain tags. Each domain contains the mean and standard deviation across all related EEG channels. If False, the function calculates and stores individual mean and standard deviation for each EEG recording.
pick_type_params (dict, optional) – Additional keyword arguments passed to mne.pick_types(). This allows users to pass extra parameters required by the mne.pick_types function seamlessly. For details on the parameters, refer to the mne.pick_types() function in the MNE-Python documentation. Defaults to {‘eeg’:True, ‘meg’:False, ‘stim’:False, ‘eog’:False}.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.

Returns:

The function updates the instance by setting the “MEAN STD” column with the calculated mean and standard deviation values. If domain_mean is True, it computes domain-aggregated statistics; otherwise, it stores per-channel results.

Return type:

None

Raises:

ValueError – If inconsistent channel names or numbers are found within a domain when domain_mean is True.

Note

Ensure that the EEG data is properly formatted and that all necessary channels are present before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.process_mean_std(domain_mean=True)
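The difference between per-channel and combined statistics for a single recording can be sketched with the standard library. The toy recording, the helper mean_std, and the choice of the population standard deviation are assumptions for illustration; domain-level aggregation is not shown.

```python
from statistics import fmean, pstdev


def mean_std(samples):
    """Mean and population standard deviation of a flat list of samples."""
    return fmean(samples), pstdev(samples)


# Toy recording: channel name -> samples (hypothetical values).
recording = {"C3": [1.0, 3.0], "C4": [2.0, 6.0]}

# Individual statistics, one (mean, std) pair per channel.
per_channel = {name: mean_std(values) for name, values in recording.items()}

# Combined statistics over the samples of all channels together.
all_samples = [v for values in recording.values() for v in values]
combined = mean_std(all_samples)

print(per_channel["C3"])  # (2.0, 1.0)
print(per_channel["C4"])  # (4.0, 2.0)
print(combined[0])        # 3.0
```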

format_channel_names(format_type='EEGUnity', miss_bad_data=False)

Format channel names in the dataset and update the ‘Channel Names’ column.

This function processes each row in the dataset, checks the ‘Channel Names’ column, and applies a formatting function to standardize the channel names. The function utilizes the batch_process method to apply the formatting to each row of locator, and the updated channel names are then saved back to the ‘Channel Names’ column.

Parameters:

format_type (str, optional) – The format for channel names: ‘EEGUnity’ (default) or ‘normal’. If set to ‘EEGUnity’, channel names are formatted as “type:name”, such as ‘eeg:C3’, ‘eeg:Cz’, ‘stim:stim1’, which stores the channel type in the locator rather than changing the source data. If set to ‘normal’, only the formatted channel names are stored, such as ‘C3’, ‘Cz’, ‘stim1’.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.

Returns:

The function modifies the dataset in place by updating the ‘Channel Names’ column.

Return type:

None

Raises:

KeyError – If the ‘Channel Names’ column is missing from the dataset.

Note

Ensure that the dataset is properly loaded and contains the ‘Channel Names’ column before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.format_channel_names()
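The two naming styles differ only in the "type:" prefix, which the following sketch makes concrete. The helpers to_eegunity and to_normal are hypothetical; the real method also standardizes the names themselves and determines the channel type on its own, neither of which is reproduced here.

```python
def to_eegunity(name, ch_type="eeg"):
    """Prefix a plain channel name with its type, e.g. 'C3' -> 'eeg:C3'."""
    return f"{ch_type}:{name}"


def to_normal(name):
    """Drop a leading 'type:' prefix if present, e.g. 'eeg:C3' -> 'C3'."""
    return name.split(":", 1)[-1]


names = ["C3", "Cz"]
prefixed = [to_eegunity(n) for n in names] + [to_eegunity("stim1", "stim")]
print(prefixed)                          # ['eeg:C3', 'eeg:Cz', 'stim:stim1']
print([to_normal(n) for n in prefixed])  # ['C3', 'Cz', 'stim1']
```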

filter(output_path, filter_type='bandpass', l_freq=None, h_freq=None, notch_freq=None, auto_adjust_h_freq=True, picks='all', miss_bad_data=False, get_data_row_params=None, filter_params=None, notch_filter_params=None)

Apply filtering to the data, supporting low-pass, high-pass, band-pass, and notch filters.

Parameters:

output_path (str) – Path to save the filtered file.
filter_type ({'lowpass', 'highpass', 'bandpass', 'notch'}, optional) – Type of filter to apply. Defaults to ‘bandpass’.
l_freq (float, optional) – Low cutoff frequency for the filter (used in high-pass or low-frequency band-pass filters). Defaults to None.
h_freq (float, optional) – High cutoff frequency for the filter (used in low-pass or high-frequency band-pass filters). Defaults to None.
notch_freq (float, optional) – Frequency for the notch filter. Defaults to None.
auto_adjust_h_freq (bool, optional) – Whether to automatically adjust the high cutoff frequency to fit the Nyquist frequency. Defaults to True.
picks (str, optional) – Channels to be used for filtering. Defaults to ‘all’.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row() for data retrieval.
filter_params (dict, optional) – Additional parameters for mne_raw.filter().
notch_filter_params (dict, optional) – Additional parameters for mne_raw.notch_filter().

Returns:

The function modifies the dataset in place.

Return type:

None
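The idea behind auto_adjust_h_freq can be sketched as a clamp against the Nyquist frequency, which is half the sampling rate. The helper adjust_h_freq and its 1 Hz safety margin are assumptions for illustration; the rule the library applies may differ.

```python
def adjust_h_freq(h_freq, sfreq, margin=1.0):
    """Keep h_freq strictly below the Nyquist frequency (sfreq / 2).

    margin is a hypothetical safety gap in Hz, not a documented value.
    """
    nyquist = sfreq / 2.0
    if h_freq is None or h_freq < nyquist:
        return h_freq  # already valid, or no high cutoff requested
    return nyquist - margin


print(adjust_h_freq(40.0, sfreq=250.0))   # 40.0
print(adjust_h_freq(150.0, sfreq=250.0))  # 124.0
```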

ica(output_path, miss_bad_data=False, get_params=None, ica_params=None, fit_params=None, apply_params=None)

Apply ICA (Independent Component Analysis) to the specified file in the dataset.

This method applies ICA to clean the EEG data using parameters passed through ica_params, fit_params, and apply_params. Please refer to the official documentation for mne.preprocessing.ICA, ica.fit(), and ica.apply() for the complete list of available parameters.

Documentation links:

- mne.preprocessing.ICA: https://mne.tools/stable/generated/mne.preprocessing.ICA.html
- ICA.fit: https://mne.tools/stable/generated/mne.preprocessing.ICA.html#fitting-ica
- ICA.apply: https://mne.tools/stable/generated/mne.preprocessing.ICA.html#applying-ica

Parameters:

output_path (str) – Path to save the processed file after applying ICA.
miss_bad_data (bool, optional) – Whether to skip bad data files and continue processing the next one. Defaults to False.
get_params (dict, optional) – Additional parameters passed to eegunity.module_eeg_parser.eeg_parser.get_data_row.
ica_params (dict, optional) – Additional parameters passed to mne.preprocessing.ICA, such as n_components, method, etc.
fit_params (dict, optional) – Additional parameters passed to ica.fit(), such as picks, decim, etc.
apply_params (dict, optional) – Additional parameters passed to ica.apply(), such as exclude, include, etc.

Returns:

Updates the file path in the dataset locator after ICA is applied.

Return type:

None

Raises:

ValueError – If the output path is invalid or if the specified parameters are inconsistent.

Note

Ensure that the input data is properly formatted and that all necessary parameters are specified before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.ica('/path/to/save/', ica_params={'n_components': 20}, fit_params={'picks': 'eeg'})

resample(output_path, miss_bad_data=False, resample_params=None, get_data_row_params=None)

Resample the data using MNE’s resampling functionality and save the processed data.

Parameters:

output_path (str) – The path where the resampled file will be saved.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
resample_params (dict, optional) – Additional parameters to be passed to the mne_raw.resample() function.
get_data_row_params (dict, optional) – Additional parameters to be passed to the get_data_row() function.

Returns:

The function modifies the dataset in place by saving the resampled data.

Return type:

None

Raises:

Exception – If an error occurs during resampling and miss_bad_data is set to False, the error will be raised.

Note

Ensure that the output path is accessible and that the input data is properly formatted before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.resample('/path/to/save/', resample_params={'sfreq': 256})

align_channel(output_path, channel_order, min_num_channels=1, miss_bad_data=False, get_data_row_params=None)

Adjust the channel order and perform interpolation on the data.

This method realigns the EEG data channels based on the provided channel_order. It utilizes get_data_row() for retrieving the data. Additional parameters can be passed to get_data_row() via get_data_row_params. For more information on available options, refer to the get_data_row() function in this documentation.

Parameters:

output_path (str) – The path where the adjusted file will be saved.
channel_order (list) – The desired order of channels, provided as a list.
min_num_channels (int, optional) – The minimum number of channels required for alignment. Defaults to 1.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
get_data_row_params (dict, optional) – Additional keyword arguments to be passed to get_data_row() for data fetching. This allows fine-tuning the data retrieval process.

Returns:

The function modifies the dataset in place by saving the adjusted data.

Return type:

None

Raises:

ValueError – If any invalid channels are found in the provided channel_order or if the number of matching channels is below min_num_channels.

Note

Ensure that the output path is accessible and that the provided channel order is valid before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.align_channel('/path/to/save/', channel_order=['C3', 'C4', 'O1'], min_num_channels=3)
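The role of channel_order and min_num_channels can be sketched without any signal data. The helper plan_alignment is hypothetical: it only separates the requested channels into those present in a recording and those missing (the ones that would need interpolation), and uses exact name matching as an assumption. Validation of channel names and the interpolation itself are not shown.

```python
def plan_alignment(available, channel_order, min_num_channels=1):
    """Split channel_order into channels found in the recording and missing ones."""
    present = set(available)
    matched = [ch for ch in channel_order if ch in present]
    missing = [ch for ch in channel_order if ch not in present]
    if len(matched) < min_num_channels:
        raise ValueError(
            f"only {len(matched)} matching channels, "
            f"need at least {min_num_channels}"
        )
    return matched, missing


matched, missing = plan_alignment(
    available=["O1", "C4", "Fz", "C3"],
    channel_order=["C3", "C4", "O1", "O2"],
    min_num_channels=3,
)
print(matched)  # ['C3', 'C4', 'O1']
print(missing)  # ['O2']
```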

normalize(output_path, norm_type='sample-wise', miss_bad_data=False, domain_mean=True, get_data_row_params=None)

Normalize the data.

This method normalizes the EEG data based on the specified normalization type. It can either perform sample-wise normalization or aggregate by domain mean, depending on the provided parameters.

Parameters:

output_path (str) – The path where the normalized file will be saved.
norm_type (str) – The type of normalization to perform. It can be: - ‘channel-wise’: Normalize each channel individually based on its mean and standard deviation. - ‘sample-wise’: Normalize all channels based on a common mean and standard deviation.
miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
domain_mean (bool, optional) – If True (default), the function aggregates the results by domain tags. Each domain contains the mean and standard deviation across all related EEG channels. If False, the function calculates and stores individual mean and standard deviation for each EEG recording.
get_data_row_params (dict, optional) – Additional keyword arguments passed to get_data_row(). This allows users to pass extra parameters required by the get_data_row function seamlessly. For details on the parameters, refer to the get_data_row() function in this documentation.

Returns:

The function modifies the dataset in place by saving the normalized data.

Return type:

None

Raises:

ValueError – If the specified normalization type is invalid or if there are issues with the input data.

Note

Ensure that the output path is accessible and that the input data is properly formatted before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.normalize('/path/to/save/', norm_type='channel-wise', domain_mean=True)
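The two norm_type options can be contrasted on a toy recording. In this sketch the statistics are computed from the recording itself with the population standard deviation, which is an assumption; normalize_sketch is a hypothetical helper, and the domain_mean option is not covered.

```python
from statistics import fmean, pstdev


def zscore(values, mean, std):
    """Standard z-score: subtract the mean, divide by the standard deviation."""
    return [(v - mean) / std for v in values]


def normalize_sketch(recording, norm_type):
    """recording maps channel name -> list of samples (hypothetical layout)."""
    if norm_type == "channel-wise":
        # Each channel uses its own mean and standard deviation.
        return {ch: zscore(vals, fmean(vals), pstdev(vals))
                for ch, vals in recording.items()}
    if norm_type == "sample-wise":
        # All channels share one mean and one standard deviation.
        flat = [v for vals in recording.values() for v in vals]
        mean, std = fmean(flat), pstdev(flat)
        return {ch: zscore(vals, mean, std) for ch, vals in recording.items()}
    raise ValueError(f"unknown norm_type: {norm_type!r}")


recording = {"C3": [1.0, 3.0], "C4": [2.0, 6.0]}
channel_wise = normalize_sketch(recording, "channel-wise")
sample_wise = normalize_sketch(recording, "sample-wise")
print(channel_wise)  # {'C3': [-1.0, 1.0], 'C4': [-1.0, 1.0]}
```

Channel-wise scaling removes amplitude differences between channels, while sample-wise scaling preserves them.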

epoch_for_pretraining(output_path, seg_sec, resample=None, overlap=0.0, exclude_bad=True, baseline=None, miss_bad_data=False, get_data_row_params=None, resample_params=None, epoch_params=None)

Processes raw EEG data by creating epochs for pretraining, with optional resampling and event segmentation.

Parameters:

output_path (str) – Path to save the preprocessed epoch data in .npy format.
seg_sec (float) – Segment length in seconds for each epoch.
resample (Optional[int], optional) – New sampling rate. If specified, raw data will be resampled.
overlap (float, optional) – Fraction of overlap between consecutive segments (0.0 means no overlap).
exclude_bad (bool, optional) – If True, drops epochs marked as bad.
baseline (Tuple[Optional[float], float], optional) – Baseline correction period, represented as a tuple (start, end). Default is (None, 0).
miss_bad_data (bool, optional) – If True, skips files with errors instead of raising an exception.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row() for data retrieval.
resample_params (dict, optional) – Additional parameters passed to raw_data.resample().
epoch_params (dict, optional) – Additional parameters passed to mne.Epochs().

Returns:

The function modifies the dataset in place by saving the processed epoch data.

Return type:

None

Raises:

ValueError – If the segment length is invalid or if any specified parameters are inconsistent.

Note

Ensure that the output path is accessible and that the input data is properly formatted before calling this method.

Examples

>>> from eegunity import UnifiedDataset
>>> unified_dataset = UnifiedDataset(***)
>>> unified_dataset.eeg_batch.epoch_for_pretraining('/path/to/save/', seg_sec=2.0, resample=256)
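How seg_sec and overlap translate into sample windows can be sketched as below. The helper segment_bounds is hypothetical, and dropping a trailing partial segment is an assumption; the method itself builds its epochs through MNE.

```python
def segment_bounds(n_samples, sfreq, seg_sec, overlap=0.0):
    """Return (start, stop) sample indices of fixed-length segments.

    overlap is the fraction shared by consecutive segments (0.0 = none).
    """
    if seg_sec <= 0:
        raise ValueError("seg_sec must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    seg_len = int(round(seg_sec * sfreq))
    if seg_len < 1:
        raise ValueError("segment is shorter than one sample")
    # Distance between consecutive segment starts.
    step = max(1, int(round(seg_len * (1.0 - overlap))))
    return [(start, start + seg_len)
            for start in range(0, n_samples - seg_len + 1, step)]


print(segment_bounds(1024, sfreq=256, seg_sec=2.0))
# [(0, 512), (512, 1024)]
print(segment_bounds(1024, sfreq=256, seg_sec=2.0, overlap=0.5))
# [(0, 512), (256, 768), (512, 1024)]
```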

get_events(miss_bad_data=False, get_data_row_params=None)

Extract events and log them in the data rows.

This method processes each data row by applying the get_data_row() and extract_events() functions. Additional parameters can be passed to get_data_row() via get_data_row_params.

Parameters:

miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row() for data retrieval.

Raises:

Exception – If miss_bad_data is False, an exception is raised on processing errors.

Return type:

None

Note

Please refer to the documentation of get_data_row() and extract_events() for detailed descriptions of the available parameters.

get_meta(miss_bad_data=False, get_data_row_params=None)

Extract events AND kernel-written metadata (description, channel names) in one pass.

This method extends get_events() by also persisting two additional columns back into the locator after the kernel has been applied:

description — the full raw.info["description"] string, which kernels use to store eegunity_description JSON (age, sex, device …).
Channel Names — post-kernel channel name list (kernels may rename channels, so this overrides the pre-kernel names stored during directory scan).

Use this method instead of calling get_events() followed by a separate per-file reload for subject metadata extraction.

Parameters:

miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next one if an error occurs. Defaults to False.
get_data_row_params (dict, optional) – Additional parameters forwarded to _get_data_row().

Return type:

None

infer_units(miss_bad_data=False, get_data_row_params=None)

Infer the units of each channel and record them in the corresponding data row.

Parameters:

miss_bad_data (bool, optional) – Whether to skip the current file and continue processing the next file when an error occurs. Defaults to False.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row(). These allow for more flexible data processing during the inference process.

Raises:

Exception – If an error occurs during file processing and miss_bad_data is set to False.

Return type:

None

Note

This method applies a custom function to each row in the dataframe to infer the units for each channel based on the raw MNE data. The function handles errors gracefully if miss_bad_data is True.

get_quality(miss_bad_data=False, method='shady', ica_params=None, save_name='scores', get_data_row_params=None)

Process the data quality of EEG files by calculating quality scores for each row in the dataset.

Parameters:

miss_bad_data (bool, optional) – If True, skips rows that contain bad data without raising an error. If False, raises an exception when encountering bad data.
get_data_row_params (dict, optional) – Additional parameters passed to the get_data_row() function. This allows fine-tuning of parameters such as unit conversion, data normalization, etc. For details, refer to the get_data_row() function documentation.
method (str)
ica_params (Dict | None)
save_name (str)

Returns:

The function modifies the dataset in place by updating quality scores for each row.

Return type:

None

replace_paths(old_prefix, new_prefix)

Replace the old_prefix of file paths in the dataset with new_prefix (in place). This function is generally used in the context of multi-server or multi-user coordination.

Parameters:

old_prefix (str) – The old path prefix to be replaced.
new_prefix (str) – The new path prefix to replace the old one.

Returns:

This method modifies internal state in-place and does not return any value.

Return type:

None
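The prefix replacement can be sketched on plain strings. The helper replace_prefix and the example mount points are hypothetical; leaving non-matching paths unchanged is an assumption about the behaviour.

```python
def replace_prefix(path, old_prefix, new_prefix):
    """Swap old_prefix for new_prefix only when path starts with old_prefix."""
    if path.startswith(old_prefix):
        return new_prefix + path[len(old_prefix):]
    return path  # paths under other roots are left alone (assumed)


paths = ["/mnt/server_a/eeg/sub01.edf", "/home/user/local/sub02.edf"]
moved = [replace_prefix(p, "/mnt/server_a", "/data/shared") for p in paths]
print(moved)
# ['/data/shared/eeg/sub01.edf', '/home/user/local/sub02.edf']
```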

export_h5Dataset(output_path, name='EEGUnity_export', get_data_row_params=None, miss_bad_data=False, pipeline=None)

Export the dataset in HDF5 format to the specified output path.

This export format is used by large-brain-model pipelines such as LaBraM.

This function processes all files in the dataset, ensuring that each file is stored in a separate group with its own dataset and attributes.

The exported HDF5 file contains a root group named by name (default "EEGUnity_export"). Each source file is stored as one subgroup (group name = source basename). Every subgroup contains: dataset "eeg", dataset "info", and "eeg" attributes "rsFreq" and "chOrder".

Parameters:

output_path (str) – The directory path where the exported HDF5 files will be saved. A FileNotFoundError is raised if the path does not exist.
name (str) – The name of the HDF5 file. Must be a string. The default value is ‘EEGUnity_export’. Raises a TypeError if the value provided is not a string.
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row() for data retrieval.
miss_bad_data (bool, optional) – If True, skips rows that contain bad data without raising an error. If False, raises an exception when encountering bad data.
pipeline (callable, optional) – A user-supplied preprocessing function applied to each raw recording after loading: raw = pipeline(raw).

Returns:

The function does not return any value.

Return type:

None

Raises:

FileNotFoundError – If the output path does not exist.
FileExistsError – If the HDF5 file already exists in the specified output path.
ValueError – If channel configurations or counts are inconsistent within a domain tag.
TypeError – If the name parameter is not a string.

auto_domain()

Automatically modify the ‘Domain Tag’ of each row based on ‘Sampling Rate’ and channel names.

This function processes each row in the dataset and updates the ‘Domain Tag’ by appending the ‘Sampling Rate’ and a unique encoded representation of the channel names. The channel names are retrieved using get_data_row() to ensure accuracy, and parameters can be passed to get_data_row() via get_data_row_params.

The ‘Domain Tag’ is updated in the format: f"{row['Domain Tag']}-{row['Sampling Rate']}Hz-{ch_enc(channel_names)}".

The function utilizes the batch_process method to apply these modifications across the dataset.

Returns:

The function modifies the dataset in place by updating the ‘Domain Tag’ column.

Return type:

None

Raises:

KeyError – If the required columns (‘Domain Tag’, ‘Sampling Rate’) are missing.

Examples

>>> unified_dataset.eeg_batch.auto_domain(get_data_row_params={'preload': True})

get_file_hashes(data_stream=False)

Generate and store unique file identifiers for EEG data files.

Two hashing modes are supported, selected by data_stream:

File mode (data_stream=False, default)

Computes SHA-256 over the raw bytes of each file on disk. The digest is format-dependent: the same EEG signal stored in two different formats will produce different hashes. Results are written to the "Source Hash" column.

Data-stream mode (data_stream=True)

Computes SHA-256 over a format-independent fingerprint of the decoded EEG signal, intended to detect recordings that have been re-published across different datasets or under different file formats.

Fingerprint construction:

Load the EEG signal via get_data_row() (MNE Raw, SI units [V]).
Sort channel names alphabetically and reorder the data matrix accordingly, so that channel-order differences do not affect the hash.
Extract three fixed-length blocks of the signal matrix: the head (first block_sec seconds), the mid (centred block), and the tail (last block_sec seconds), where block_sec = 5. If the recording is shorter than 3 x block_sec the entire matrix is used.
Round each sample to six decimal places to absorb minor floating-point conversion differences between formats.
Compute SHA-256 over the raw bytes of the resulting NumPy array.

Files whose data stream cannot be loaded (unsupported format, missing file, etc.) are represented by None. Results are written to the "Data Hash" column.

Parameters:

data_stream (bool, optional) – When False (default) hash raw file bytes and write to "Source Hash". When True hash the decoded EEG signal fingerprint and write to "Data Hash".

Returns:

Results are written in-place to the locator via set_metadata().

Return type:

None

Raises:

FileNotFoundError – File mode only: raised when the file at row['File Path'] does not exist.
IOError – File mode only: raised when the file cannot be read.

Examples

>>> u_ds.eeg_batch.get_file_hashes()                  # file hash
>>> u_ds.eeg_batch.get_file_hashes(data_stream=True)  # signal hash
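Both hashing modes can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: file_sha256 and signal_fingerprint are hypothetical helpers, the signal is a dict of plain Python lists rather than an MNE Raw object, and samples are serialized with struct instead of as a NumPy array, so the digests will not match those produced by get_file_hashes().

```python
import hashlib
import struct


def file_sha256(path, chunk_size=1 << 20):
    """File mode: SHA-256 over the raw bytes on disk, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def signal_fingerprint(channels, sfreq, block_sec=5, decimals=6):
    """Data-stream mode: channels maps name -> equal-length list of samples."""
    names = sorted(channels)  # alphabetical order removes channel-order effects
    n_samples = len(channels[names[0]])
    block = int(block_sec * sfreq)
    if n_samples < 3 * block:
        spans = [(0, n_samples)]  # short recording: use everything
    else:
        mid_start = (n_samples - block) // 2
        spans = [(0, block),                        # head
                 (mid_start, mid_start + block),    # centred mid block
                 (n_samples - block, n_samples)]    # tail
    digest = hashlib.sha256()
    for name in names:
        for start, stop in spans:
            # Rounding absorbs small float differences between file formats.
            rounded = [round(v, decimals) for v in channels[name][start:stop]]
            digest.update(struct.pack(f"<{len(rounded)}d", *rounded))
    return digest.hexdigest()


a = {"C3": [0.5, 0.25], "C4": [1.0, 2.0]}
b = {"C4": [1.0, 2.0], "C3": [0.5 + 1e-9, 0.25]}  # reordered, tiny float drift
print(signal_fingerprint(a, sfreq=1) == signal_fingerprint(b, sfreq=1))  # True
```

Because only the head, mid, and tail blocks are hashed, two long recordings that differ solely outside those blocks would share a fingerprint in this sketch.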

get_file_sizes()

Compute and store the on-disk size of each EEG data file in the locator.

For every row in the locator (regardless of Completeness Check status), the size of the file at File Path is retrieved via a single os.stat() call and written to the "File Size" column as an integer number of bytes. Files that cannot be found or accessed are represented by -1.

The method delegates row-level iteration and optional parallelism to batch_process(), so the global num_workers setting of the parent UnifiedDataset is honoured automatically. Using multiple workers is beneficial when the dataset is stored on a network-mounted filesystem where each stat() incurs network latency.

Returns:

Results are written in-place to the "File Size" column of the shared locator.

Return type:

None

Notes

File sizes are reported in bytes (int). A value of -1 indicates that the file was not found or could not be stat-ed.

To convert units in pandas after calling this method:

df = u_ds.get_locator()
df["File Size KB"] = df["File Size"] / 1_024
df["File Size MB"] = df["File Size"] / 1_024 ** 2
df["File Size GB"] = df["File Size"] / 1_024 ** 3

Examples

>>> from eegunity import UnifiedDataset
>>> u_ds = UnifiedDataset(dataset_path="/path/to/dataset")
>>> u_ds.eeg_batch.get_file_sizes()
>>> locator = u_ds.get_locator()
>>> print(locator[["File Path", "File Size"]].head())
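The per-file lookup with its -1 fallback can be sketched as a single os.stat() call wrapped in an OSError handler. The helper name file_size_or_minus_one is hypothetical; the sketch mirrors the documented behaviour but is not the library's code.

```python
import os
import tempfile


def file_size_or_minus_one(path):
    """Size in bytes from one os.stat() call, or -1 if the file is unreachable."""
    try:
        return os.stat(path).st_size
    except OSError:  # covers missing files and permission errors
        return -1


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "sub01.edf")
    with open(existing, "wb") as fh:
        fh.write(b"\x00" * 2048)
    size_existing = file_size_or_minus_one(existing)
    size_missing = file_size_or_minus_one(os.path.join(tmp, "missing.edf"))

print(size_existing)  # 2048
print(size_missing)   # -1
```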

epoch_by_event(*args, **kwargs)

Deprecated: This method is no longer maintained and will be removed in a future release. Use epoch_by_event_hdf5() instead, which produces a single HDF5 file with significantly faster I/O and smaller storage.

Raises:

NotImplementedError – Always. Migrate to epoch_by_event_hdf5.

Return type:

None

epoch_by_event_hdf5(output_path, exclude_bad=True, file_name_prefix='EpochData', miss_bad_data=False, include_events=None, format_version='v2', get_data_row_params=None, resample_params=None, epoch_params=None, pipeline=None)

Batch process EEG data to create epochs based on events and save as HDF5.

v2 format (default) — flat array layout optimised for PyTorch random access and storage efficiency:

data array: (N, n_ch, n_times) float32, gzip-1, chunk per epoch.
epoch_meta/source_group: source file name for each epoch.
epoch_meta/event_code: integer class code for each epoch.
source_meta/{group}/: per-file attrs + pickled mne.Info.
Root attrs include label_map (JSON: code → event name).

v1 format — legacy file-per-group layout (deprecated, will be removed in a future release).
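
Since label_map is stored as JSON, and JSON object keys are always strings, the integer codes in epoch_meta/event_code need a string-to-int conversion before they can be looked up. A minimal sketch, assuming the attribute has already been read into a Python str; the helper name and the example mapping are made up for illustration.

```python
import json


def decode_label_map(label_map_json):
    """Turn the JSON label map (code -> event name) into a dict keyed by int."""
    raw = json.loads(label_map_json)
    # JSON object keys are always strings, so convert them back to ints.
    return {int(code): name for code, name in raw.items()}


# Made-up example values, not taken from a real file.
label_map = decode_label_map('{"1": "left_hand", "2": "right_hand"}')
event_codes = [2, 1, 1]
event_names = [label_map[code] for code in event_codes]
```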

Parameters:

output_path (str) – Directory to save the processed epochs.
exclude_bad (bool, optional) – Whether to exclude bad epochs. Default is True.
file_name_prefix (str, optional) – Filename prefix for the HDF5 file. Default is 'EpochData'.
miss_bad_data (bool, optional) – Whether to skip files with processing errors. Default is False.
include_events (list of str, optional) – Whitelist of event names to include. If None, all events are saved. Use this to exclude noise events (e.g. 'Start of a trial').
format_version (str, optional) – 'v2' (default, recommended) or 'v1' (deprecated).
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row().
resample_params (dict, optional) – Parameters for resampling. Must include sfreq for target rate.
epoch_params (dict, optional) – Additional parameters for mne.Epochs.
pipeline (callable, optional) – A user-supplied preprocessing function applied to each raw recording after loading and before resampling: raw = pipeline(raw). If pipeline itself performs resampling and resample_params is also provided, the final resample (from resample_params) will still be applied afterwards — resample_params always runs last. To avoid double-resampling, either omit resample_params when the pipeline already resamples, or ensure both target the same sampling frequency.

Raises:

ValueError – If recordings have inconsistent channel counts. Use eeg_batch.auto_domain() + group_by_domain() to split them first, then call this function on each sub-dataset separately.

Return type:

None
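
A pipeline callable receives the loaded raw object and must return it. The sketch below uses a small stand-in class (FakeRaw, hypothetical) in place of mne.io.Raw so that it runs without MNE, and mimics the documented order: pipeline first, then the resample driven by resample_params. The run_stages helper is also hypothetical; it only illustrates the ordering, not the library's internals.

```python
class FakeRaw:
    """Stand-in for mne.io.Raw that only records which steps were applied."""

    def __init__(self, sfreq):
        self.sfreq = sfreq
        self.history = []

    def filter(self, l_freq, h_freq):
        self.history.append(f"filter {l_freq}-{h_freq} Hz")
        return self

    def resample(self, sfreq):
        self.history.append(f"resample {self.sfreq} -> {sfreq} Hz")
        self.sfreq = sfreq
        return self


def my_pipeline(raw):
    """User-supplied preprocessing: band-pass only, no resampling."""
    raw.filter(l_freq=1.0, h_freq=40.0)
    return raw  # the callable must return the raw object


def run_stages(raw, pipeline=None, resample_params=None):
    """Illustrates the documented order: pipeline, then resample_params."""
    if pipeline is not None:
        raw = pipeline(raw)
    if resample_params is not None:
        raw = raw.resample(**resample_params)
    return raw


raw = run_stages(FakeRaw(sfreq=1000.0), my_pipeline, {"sfreq": 250.0})
print(raw.history)
```

Keeping resampling out of the pipeline, as here, leaves resample_params as the only place where the sampling rate changes and avoids the double-resampling case described for the pipeline parameter.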

epoch_by_long_event(*args, **kwargs)#

Deprecated: This method is no longer maintained and will be removed in a future release. Use epoch_by_long_event_hdf5() instead, which produces a single HDF5 file with significantly faster IO and smaller storage.

Raises: NotImplementedError – Always. Migrate to epoch_by_long_event_hdf5.
Return type: None

epoch_by_long_event_hdf5(output_path, overlap, file_name_prefix='EpochData', exclude_bad=True, miss_bad_data=False, include_events=None, format_version='v2', get_data_row_params=None, resample_params=None, epoch_params=None, pipeline=None)#

Batch process EEG data to create epochs from long-duration events (with overlap) and save as HDF5.

See epoch_by_event_hdf5 for documentation of the v2 file format.

Parameters:

output_path (str) – Directory to save the processed epochs.
overlap (float) – Overlap between consecutive segments (0.0 ≤ overlap < 1.0).
file_name_prefix (str, optional) – Filename prefix for the HDF5 file. Default is 'EpochData'.
exclude_bad (bool, optional) – Whether to exclude bad epochs. Default is True.
miss_bad_data (bool, optional) – Whether to skip files with processing errors. Default is False.
include_events (list of str, optional) – Whitelist of event names to include. None keeps all events.
format_version (str, optional) – 'v2' (default, recommended) or 'v1' (deprecated).
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row().
resample_params (dict, optional) – Parameters for resampling.
epoch_params (dict, optional) – Additional parameters for mne.Epochs.
pipeline (callable, optional) – Preprocessing function applied after loading and before resampling: raw = pipeline(raw). See epoch_by_event_hdf5 for details on the pipeline/resample_params ordering.

Raises:

ValueError – If recordings have inconsistent channel counts. Use eeg_batch.auto_domain() + group_by_domain() first.

Return type:

None

epoch_by_segmentation_hdf5(output_path, exclude_bad=True, file_name_prefix='EpochData', miss_bad_data=False, format_version='v2', get_data_row_params=None, resample_params=None, segment_params=None, epoch_params=None, pipeline=None)#

Batch process EEG data to create epochs by sliding-window segmentation and save as HDF5.

See epoch_by_event_hdf5 for documentation of the v2 file format.

Parameters:

output_path (str) – Directory to save the processed epochs.
exclude_bad (bool, optional) – Whether to exclude bad epochs. Default is True.
file_name_prefix (str, optional) – Filename prefix for the HDF5 file. Default is 'EpochData'.
miss_bad_data (bool, optional) – Whether to skip files with processing errors. Default is False.
format_version (str, optional) – 'v2' (default, recommended) or 'v1' (deprecated).
get_data_row_params (dict, optional) – Additional parameters passed to get_data_row().
resample_params (dict, optional) – Parameters for resampling.
segment_params (dict, optional) – Must include 'segment_length' (seconds) and 'overlap' (0–1).
epoch_params (dict, optional) – Additional parameters for mne.Epochs.
pipeline (callable, optional) – Preprocessing function applied after loading and before resampling: raw = pipeline(raw). See epoch_by_event_hdf5 for details on the pipeline/resample_params ordering.

Raises:

ValueError – If recordings have inconsistent channel counts. Use eeg_batch.auto_domain() + group_by_domain() first.

Return type:

None
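
To make the segment_length and overlap settings concrete, the sketch below computes where sliding windows would start. It assumes that overlap is the fraction of each window shared with the next one, so the hop between windows is segment_length * (1 - overlap); the function name is hypothetical and the library's exact rounding and edge handling are not documented here.

```python
def window_starts(n_samples, sfreq, segment_length, overlap):
    """Start indices (in samples) of sliding windows over one recording.

    Assumes `overlap` is the fraction of each window shared with the next,
    so the hop between windows is segment_length * (1 - overlap).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must satisfy 0.0 <= overlap < 1.0")
    win = int(round(segment_length * sfreq))
    if win <= 0:
        raise ValueError("segment_length must cover at least one sample")
    hop = max(1, int(round(win * (1.0 - overlap))))
    # Only windows that fit entirely inside the recording are kept.
    return list(range(0, n_samples - win + 1, hop))


# 10 s of data at 100 Hz, 2 s windows, 50 % overlap -> a new window every 1 s.
starts = window_starts(n_samples=1000, sfreq=100.0, segment_length=2.0, overlap=0.5)
```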

process_epochs(output_path, long_event=False, overlap=0, use_hdf5=True, file_name_prefix='EpochData', exclude_bad=True, miss_bad_data=False, get_data_row_params=None, resample_params=None, epoch_params=None)#

Unified interface for processing epochs.

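Method selection rule:

long_event=False, use_hdf5=False -> epoch_by_event
long_event=False, use_hdf5=True -> epoch_by_event_hdf5
long_event=True, use_hdf5=False -> epoch_by_long_event
long_event=True, use_hdf5=True -> epoch_by_long_event_hdf5

epoch_by_event and epoch_by_long_event are documented above as deprecated and always raising NotImplementedError, so the use_hdf5=False combinations are expected to fail; prefer use_hdf5=True.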

Parameters:

output_path (str) – Directory to save the processed epochs.
long_event (bool, optional) – Whether to process long-duration events. If True, overlap must be provided. Default is False.
overlap (float, optional) – Overlap ratio for long events (0.0 <= overlap < 1.0). Required if long_event is True. Default is 0 (non-overlap).
use_hdf5 (bool, optional) – Whether to save the results in HDF5 format. If you are working with deep learning, especially large models, we strongly recommend using this interface (use_hdf5=True) for faster processing. Default is True.
file_name_prefix (str, optional) – Filename prefix for HDF5 saving (used only if use_hdf5 is True). Default is 'EpochData'.
exclude_bad (bool, optional) – Whether to exclude bad epochs. Default is True.
miss_bad_data (bool, optional) – Whether to skip files with processing errors. Default is False.
get_data_row_params (dict, optional) – Additional parameters for data retrieval via get_data_row().
resample_params (dict, optional) – Parameters for resampling the raw data (see mne.io.Raw.resample()).
epoch_params (dict, optional) – Additional parameters for creating epochs.

Returns:

The method processes and saves the epochs by calling the appropriate underlying method.

Return type:

None
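
The selection rule above amounts to a lookup on the pair (long_event, use_hdf5). A minimal sketch of that lookup, with a hypothetical table and helper that return method names only and call nothing in eegunity:

```python
# (long_event, use_hdf5) -> name of the method that handles the call
METHOD_TABLE = {
    (False, False): "epoch_by_event",
    (False, True): "epoch_by_event_hdf5",
    (True, False): "epoch_by_long_event",
    (True, True): "epoch_by_long_event_hdf5",
}


def select_method(long_event=False, use_hdf5=True):
    """Return the method name chosen by the selection rule (illustrative only)."""
    return METHOD_TABLE[(bool(long_event), bool(use_hdf5))]
```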

Note

method_mixin_epoch is an internal implementation module and is not listed as a separate top-level API page.

Supporting modules#

eegunity.modules.batch.eeg_scores_modified_mne module#

eegunity.modules.batch.eeg_scores_modified_mne.compute_quality_score_mne(raw, ica_params=None)[source]#

Compute a data quality score using MNE’s built-in artifact detection methods.

Parameters:

raw (mne.io.Raw) – The EEG raw data.
method (str) – The method to use for artifact detection. Options are “ica” and “maxwell”.
plot (bool) – Whether to plot artifact scores and diagnostics.

Returns:

A dictionary containing the quality score, artifact ratio, and individual artifact counts.

Return type:

dict

eegunity.modules.batch.eeg_scores_shady module#

eegunity.modules.batch.eeg_scores_shady.butter_bandpass(lowcut, highcut, fs, order=4)[source]#

Design a Butterworth bandpass filter.

Parameters:

lowcut (float) – The lower cutoff frequency of the filter.
highcut (float) – The upper cutoff frequency of the filter.
fs (float) – The sampling frequency of the input signal.
order (int, optional) – The order of the Butterworth filter (default is 4).

Returns:

b (ndarray) – The numerator (b) coefficients of the IIR filter.
a (ndarray) – The denominator (a) coefficients of the IIR filter.

Note

This function designs a bandpass filter using a Butterworth design. The cutoff frequencies are normalized by the Nyquist frequency, which is half the sampling frequency. The filter coefficients are returned as arrays suitable for use with scipy.signal.lfilter or similar filtering functions.
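
The Nyquist normalization mentioned in the note is plain arithmetic and can be sketched without SciPy. The helper name normalized_band and its range check are assumptions made for this example, not part of the module.

```python
def normalized_band(lowcut, highcut, fs):
    """Express band edges as fractions of the Nyquist frequency (fs / 2)."""
    nyquist = 0.5 * fs
    if not 0.0 < lowcut < highcut < nyquist:
        raise ValueError("require 0 < lowcut < highcut < fs / 2")
    return lowcut / nyquist, highcut / nyquist


# Alpha band (8-13 Hz) at a 250 Hz sampling rate.
low, high = normalized_band(lowcut=8.0, highcut=13.0, fs=250.0)
# Values in this form are what a design routine such as scipy.signal.butter
# expects when no sampling rate is passed to it, e.g.:
#   b, a = scipy.signal.butter(order, [low, high], btype="band")
```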

eegunity.modules.batch.eeg_scores_shady.butter_bandpass_filter(data, lowcut, highcut, fs, order=4)[source]#

Apply a bandpass filter to the data array.

Parameters:

data (ndarray) – Data to filter, with channels as rows.
lowcut (float) – The low frequency cut-off for the filter.
highcut (float) – The high frequency cut-off for the filter.
fs (int) – The sampling frequency of the data.
order (int, optional) – The order of the filter (default is 4).

Returns:

y – The filtered data.

Return type:

ndarray

Note

This function uses a Butterworth bandpass filter designed with the specified cut-off frequencies and order. The filter is applied using zero-phase filtering with scipy.signal.filtfilt, which ensures that the filtered signal is not phase-shifted.

eegunity.modules.batch.eeg_scores_shady.plot_radar_chart(scores, score_names, title='EEG Scores')[source]#

Plot a radar chart for the given scores.

Parameters:

scores (list of float) – List of scores to plot. Each score represents a metric for evaluating EEG quality.
score_names (list of str) – Names corresponding to the scores. These names describe the metrics for evaluating EEG quality.
title (str, optional) – Title for the radar chart (default is ‘EEG Scores’).

Return type:

None

Note

This function creates a radar chart using the provided scores and score names. The radar chart is displayed using matplotlib.

eegunity.modules.batch.eeg_scores_shady.calculate_general_amplitude_score(data)[source]#

Calculate a general amplitude score based on the proportion of signal amplitudes that fall within a specific range.

Parameters: data (ndarray) – 2D array of EEG data where rows represent channels and columns represent amplitudes at each time point.
Returns: The average score across all channels, normalized to 100.
Return type: float

Note

This function calculates the general amplitude score for each channel by counting the number of amplitudes within a specific range (-100 to 100) and dividing it by the total number of amplitudes. The scores are then averaged across all channels to obtain the final general amplitude score.
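
The calculation in the note can be sketched as below. It is written with plain Python lists so that it runs without NumPy, and it assumes that the range bounds are inclusive; the function name is hypothetical and the module's own implementation may differ.

```python
def general_amplitude_score(data, low=-100.0, high=100.0):
    """Mean per-channel percentage of samples with low <= x <= high.

    `data` is a sequence of channels, each a sequence of amplitudes.
    Inclusive bounds are an assumption of this sketch.
    """
    per_channel = []
    for channel in data:
        in_range = sum(1 for x in channel if low <= x <= high)
        per_channel.append(in_range / len(channel))
    return 100.0 * sum(per_channel) / len(per_channel)


# Channel 1 has 3 of 4 samples in range, channel 2 has 2 of 4.
score = general_amplitude_score([[10, -50, 250, 90], [0, 120, -130, 40]])
```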

eegunity.modules.batch.eeg_scores_shady.calculate_highest_amplitude_score(data, channel_indices)[source]#

Calculate the highest amplitude score for specific channels within the Alpha band.

Parameters:

data (ndarray) – The EEG data in the Alpha band, expected to be a 2D array where rows represent channels and columns represent amplitudes at each time point.
channel_indices (list of int) – List of indices for the channels of interest.

Returns:

The calculated highest amplitude score, normalized to 100.

Return type:

float

Note

This function first filters the data for the specified channels of interest. It then calculates the maximum amplitude for each channel. The amplitudes are sorted in descending order, and scoring is applied based on the percentile of each channel’s amplitude. The final score is the average of these scores, normalized to 100.

eegunity.modules.batch.eeg_scores_shady.calculate_dominant_frequency(signal, fs)[source]#

Calculate the dominant frequency of a signal.

Parameters:

signal (ndarray) – The signal array. This should be a 1D array representing the time-series data of the EEG signal.
fs (int) – The sampling frequency of the signal, representing the number of data points collected per second.

Returns:

The dominant frequency of the signal, which is the frequency component that has the highest amplitude in the Fourier Transform of the signal.

Return type:

float

Note

This function computes the Fast Fourier Transform (FFT) of the input signal and identifies the frequency with the highest amplitude. The dominant frequency is determined from the positive frequency components of the FFT.
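
The idea in the note can be illustrated with a direct discrete Fourier transform. The sketch below is O(n^2) and uses only the standard library for clarity; in practice an FFT routine such as numpy.fft.rfft would be used. Skipping the DC bin and the function name are assumptions of this example.

```python
import cmath
import math


def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest-magnitude positive-frequency DFT bin.

    Uses a direct O(n^2) DFT for clarity; excluding the DC bin is an
    assumption of this sketch.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    # Bin k corresponds to k * fs / n hertz.
    return best_bin * fs / n


fs = 100
tone = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]  # 10 Hz, 1 s
peak = dominant_frequency(tone, fs)
```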

eegunity.modules.batch.eeg_scores_shady.calculate_symmetry_score(data, channels_left, channels_right, fs)[source]#

Calculate the symmetry score between two sets of channels.

Parameters:

data (ndarray) – The EEG data, expected to be a 2D array where rows represent channels and columns represent amplitudes at each time point.
channels_left (list of int) – List of indices for the left channels.
channels_right (list of int) – List of indices for the right channels.
fs (int) – The sampling frequency of the data.

Returns:

The symmetry score between the two sets of channels, expressed as a percentage.

Return type:

float

Note

This function filters the data for the specified left and right channels, calculates the dominant frequencies for each set of channels, and computes the correlation score between the dominant frequencies. The symmetry score is normalized to a range of 0 to 100.

eegunity.modules.batch.eeg_scores_shady.calculate_beta_sinusoidal_score(fft_data)[source]#

Calculate the beta sinusoidal score by analyzing the proportion of significant energy in the FFT data.

Parameters: fft_data (ndarray) – The FFT results of the EEG data, expected to be a 2D array where each row represents the FFT results of a channel.
Returns: The sinusoidal score as a percentage.
Return type: float

Note

This function computes the total and significant energy for each channel based on the FFT results. It applies a threshold to determine significant frequencies and calculates the score as the percentage of significant energy relative to the total energy. If the total energy is zero, the score for that channel is set to zero to avoid division errors.
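
The note does not state which threshold defines a significant frequency or how channel scores are combined, so the sketch below makes both up: a bin counts as significant when its magnitude is at least rel_threshold times the channel's peak magnitude, and channel scores are averaged. Both rules and all names are assumptions; only the energy ratio and the zero-energy guard follow the description.

```python
def significant_energy_score(fft_data, rel_threshold=0.5):
    """Mean per-channel percentage of spectral energy in 'significant' bins.

    Hypothetical rule: a bin is significant when its magnitude is at least
    rel_threshold times the channel's peak magnitude.
    """
    scores = []
    for spectrum in fft_data:
        mags = [abs(c) for c in spectrum]
        total = sum(m * m for m in mags)
        if total == 0:
            scores.append(0.0)  # guard against division by zero
            continue
        cutoff = rel_threshold * max(mags)
        significant = sum(m * m for m in mags if m >= cutoff)
        scores.append(100.0 * significant / total)
    return sum(scores) / len(scores)


# First channel: 16 of 18 energy units sit in the peak bin; second is silent.
score = significant_energy_score([[0, 4, 1, 1], [0, 0, 0, 0]])
```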

eegunity.modules.batch.eeg_scores_shady.calculate_beta_amplitude_score(beta_data, threshold=20)[source]#

Calculate Score 4 based on the percentage of beta wave samples in each channel that do not exceed the maximum amplitude threshold.

Parameters:

beta_data (ndarray) – The EEG data filtered in the beta band, expected to be a 2D array where rows represent channels and columns represent amplitudes at each time point.
threshold (float, optional) – The maximum amplitude threshold for the beta waves (in microvolts, μV). Default is 20.

Returns:

The average percentage of samples across all channels that do not exceed the threshold.

Return type:

float

Note

This function counts the number of samples in each channel that are less than or equal to the specified threshold and calculates the percentage of such samples relative to the total number of samples in that channel. The final score is the average percentage across all channels.

eegunity.modules.batch.eeg_scores_shady.calculate_theta_amplitude_score(data, threshold=30)[source]#

Calculate the percentage of data points not exceeding a specified amplitude threshold in the theta frequency band.

Parameters:

data (ndarray) – The EEG data, expected to be a 2D array where rows represent channels and columns represent amplitudes at each time point.
threshold (float, optional) – The amplitude threshold against which each data point is compared. Default is 30.

Returns:

The average percentage of data points across all channels that do not exceed the threshold.

Return type:

float

Note

This function counts the data points in the theta frequency band that do not exceed the specified threshold and calculates the percentage of such points for each channel. It returns the average percentage across all channels. If the input data is not already filtered for the theta band, apply appropriate filtering before calling this function.

eegunity.modules.batch.eeg_scores_shady.classify_channels(channels)[source]#

Classify the given channels into different groups based on their location and type.

Parameters:

channels (list) – List of channel names.

Returns:

list – List of channel indices that belong to Score 2.
list – List of left-side channel indices.
list – List of right-side channel indices.

Note

This function classifies channels based on common EEG electrode naming conventions. Channels are grouped into frontal, temporal, parietal, occipital, and auricular categories. It distinguishes between left-side and right-side channels based on the suffix of their names, where odd-numbered suffixes typically denote left-side channels and even-numbered suffixes denote right-side channels. Midline and auricular channels are included in the Score 2 classification.
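
The odd/even suffix convention described in the note can be sketched with a small helper. The name channel_side is hypothetical, and treating a trailing "z" as midline follows the usual 10-20 naming rather than anything stated about this module's code.

```python
import re


def channel_side(name):
    """Classify a 10-20 style channel name as 'left', 'right', 'midline' or None."""
    label = name.strip()
    match = re.search(r"(\d+)$", label)
    if match:
        # Odd suffix -> left hemisphere, even suffix -> right hemisphere.
        return "left" if int(match.group(1)) % 2 else "right"
    if label.lower().endswith("z"):
        return "midline"
    return None  # no numeric or 'z' suffix, e.g. a non-EEG channel


sides = {ch: channel_side(ch) for ch in ["Fp1", "F4", "Cz", "T10", "ECG"]}
```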

eegunity.modules.batch.eeg_scores_shady.compute_quality_scores_shady(mne_io_raw)[source]#

Calculate EEG scores for the given data.

Parameters: mne_io_raw (mne.io.Raw) – MNE Raw object containing EEG data.
Returns: List of EEG scores corresponding to different metrics.
Return type: list

Note

This function calculates various EEG quality scores based on different metrics. The scores are calculated using filtered data from the original EEG signals. If fewer channels are available, only selected scores are computed. Otherwise, additional metrics are included for a more comprehensive assessment.
