CoreFilter

class chemFilters.core.CoreFilter(rdkit_filter=True, pep_filter=True, silly_filter=True, bloom_filter=True, rdfilter_subset='ALL', rdfilter_output='string', std_mols=True, std_method='chembl', n_jobs=1, parallel_chunk_size=None)

Bases: object

Runs all enabled filters on a list of smiles, with the option of setting a chunk size to process the input in batches. The dataset is filtered as follows:

>>> from chemFilters.core import CoreFilter
>>> core_filter = CoreFilter() # all filters enabled by default
>>> filtered_df = core_filter(smiles, chunksize=100)
Parameters:
  • rdkit_filter (bool) – toggle applying rdkit filters to smiles. Defaults to True.

  • pep_filter (bool) – toggle applying peptide filters to smiles. Defaults to True.

  • silly_filter (bool) – toggle applying silly filters to smiles. Defaults to True.

  • bloom_filter (bool) – toggle applying bloom filters to smiles. Defaults to True.

  • rdfilter_subset (str) – subset of the rdkit filters to be applied. For the available filters, see RdkitFilters.available_filters. Defaults to “ALL”.

  • rdfilter_output (str) – output format of the rdkit filters. Available: ‘bool’ and ‘string’. Defaults to “string”.

  • std_mols (bool) – whether to standardize the mols. Defaults to True.

  • std_method (str) – standardization method to be used. Defaults to “chembl”.

  • n_jobs (int) – number of jobs to run in parallel. Defaults to 1.

  • parallel_chunk_size (int) – size of chunks for ParallelApplier. If None, auto-calculated. Defaults to None.

__init__(rdkit_filter=True, pep_filter=True, silly_filter=True, bloom_filter=True, rdfilter_subset='ALL', rdfilter_output='string', std_mols=True, std_method='chembl', n_jobs=1, parallel_chunk_size=None)

Initialize the CoreFilter class. The filters are initialized with the default parameters.

Parameters:
  • rdkit_filter (bool) – toggle applying rdkit filters to smiles. Defaults to True.

  • pep_filter (bool) – toggle applying peptide filters to smiles. Defaults to True.

  • silly_filter (bool) – toggle applying silly filters to smiles. Defaults to True.

  • bloom_filter (bool) – toggle applying bloom filters to smiles. Defaults to True.

  • rdfilter_subset (str) – subset of the rdkit filters to be applied. For the available filters, see RdkitFilters.available_filters. Defaults to “ALL”.

  • rdfilter_output (str) – output format of the rdkit filters. Available: ‘bool’ and ‘string’. Defaults to “string”.

  • std_mols (bool) – whether to standardize the mols. Defaults to True.

  • std_method (str) – standardization method to be used. Defaults to “chembl”.

  • n_jobs (int) – number of jobs to run in parallel. Defaults to 1.

  • parallel_chunk_size (int) – size of chunks for ParallelApplier. If None, auto-calculated. Defaults to None.

Return type:

None
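The constructor options above can be collected in a plain dictionary before being passed on. The sketch below is a minimal illustration, assuming only the documented keyword names and defaults. `make_config` and `build_core_filter` are hypothetical helpers, not part of chemFilters. The import is deferred so the configuration logic can be checked without the package installed.

```python
# Keyword names and defaults copied from the documented __init__ signature.
DEFAULTS = {
    "rdkit_filter": True,
    "pep_filter": True,
    "silly_filter": True,
    "bloom_filter": True,
    "rdfilter_subset": "ALL",
    "rdfilter_output": "string",
    "std_mols": True,
    "std_method": "chembl",
    "n_jobs": 1,
    "parallel_chunk_size": None,
}


def make_config(**overrides):
    """Return the documented defaults with selected options overridden.

    Hypothetical helper (not part of chemFilters): it rejects unknown
    option names before they reach the constructor.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown CoreFilter options: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def build_core_filter(**overrides):
    """Instantiate CoreFilter; requires chemFilters to be installed."""
    # Imported lazily so the rest of this sketch runs without the package.
    from chemFilters.core import CoreFilter

    return CoreFilter(**make_config(**overrides))


# Disable the peptide filter, request boolean rdkit output, use 4 jobs.
config = make_config(pep_filter=False, rdfilter_output="bool", n_jobs=4)
```

With chemFilters installed, `build_core_filter(pep_filter=False, rdfilter_output="bool", n_jobs=4)` would then return a configured instance whose `filter_smiles` method can be called on a list of smiles.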

filter_smiles(smiles, chunksize=1000)

Filter a list of smiles using the filters that are toggled on. Uses chunksize and n_jobs to process the data in chunks and in parallel.

Parameters:
  • smiles (list) – list of smiles to be filtered.

  • chunksize (int) – size of the chunks used to process the data. Set to None or a negative value to process all at once. Defaults to 1000.

Returns:

dataframe with the filtering results.

Return type:

pd.DataFrame
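The chunksize convention described for filter_smiles can be illustrated with a small stand-alone generator. This is a sketch of the documented behavior only, not the library's implementation. `iter_chunks` is a hypothetical name, and raising an error for a chunk size of 0 is an assumption, since the documentation does not say how that value is handled.

```python
def iter_chunks(smiles, chunksize=1000):
    """Yield successive slices of `smiles` holding at most `chunksize` items.

    Follows the documented convention that None or a negative value means
    "process everything at once". Illustrative only; not chemFilters code.
    """
    if chunksize is None or chunksize < 0:
        yield list(smiles)
        return
    if chunksize == 0:
        # Assumption: the documentation does not cover this case.
        raise ValueError("chunksize must be non-zero")
    for start in range(0, len(smiles), chunksize):
        yield smiles[start:start + chunksize]


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1"]

# Five molecules in chunks of two: two full chunks and one remainder.
chunks = list(iter_chunks(smiles, chunksize=2))
```

Smaller chunks lower the amount of data held in memory at once, at the cost of more per-chunk overhead. Passing None or a negative value sends the whole list through in one pass.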