credsweeper.ml_model package

Subpackages

credsweeper.ml_model.features package

Submodules

credsweeper.ml_model.ml_validator module

class credsweeper.ml_model.ml_validator.MlValidator(threshold: float | ThresholdPreset, ml_config: None | str | Path = None, ml_model: None | str | Path = None, ml_providers: str | None = None)

Bases: object

ML validation class

FAKE_CHAR = '\x01'

MAX_LEN = 128

ZERO_CHAR = '\x00'

encode(text: str, limit: int) → ndarray: Encodes prepared text to an array

encode_line(text: str, position: int): Encodes a line with balancing for position

encode_value(text: str) → ndarray: Encodes a value to an array

extract_common_features(candidates: List[Candidate]) → ndarray: Extracts features that are guaranteed to be the same for all candidates on the same line with the same value.

extract_features(candidates: List[Candidate]) → ndarray: Extracts common and unique features from a list of candidates.

extract_unique_features(candidates: List[Candidate]) → ndarray: Extracts features that can differ between candidates. The per-candidate values are joined with the OR operator.
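As an illustration of joining per-candidate features with OR, here is a minimal sketch. It assumes each candidate yields a binary feature vector of equal length; the helper name join_unique_features and the sample vectors are hypothetical and are not part of credsweeper.

```python
import numpy as np


def join_unique_features(per_candidate: list[np.ndarray]) -> np.ndarray:
    """Combine binary feature vectors with an element-wise OR (hypothetical helper)."""
    # Stack to shape (n_candidates, n_features) and treat non-zero as True.
    stacked = np.stack(per_candidate).astype(bool)
    # A feature is set in the result if any candidate has it set.
    return np.logical_or.reduce(stacked, axis=0).astype(np.float32)


# Two candidates found on the same line with the same value (made-up data).
a = np.array([1, 0, 0, 1], dtype=np.float32)
b = np.array([0, 0, 1, 1], dtype=np.float32)
joined = join_unique_features([a, b])
```

With this approach a group of candidates collapses into a single feature vector, so the model sees one input per group rather than one per candidate.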

get_group_features(candidates: List[Candidate]) → Tuple[ndarray, ndarray, ndarray, ndarray]: np.newaxis is used to add a new dimension in front, so the input will be treated as a batch
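The effect of np.newaxis on a single feature vector can be seen in a short sketch. The vector length of 8 is an arbitrary assumption chosen for illustration, not the size used by the model.

```python
import numpy as np

# A single feature vector of arbitrary length (assumed for illustration).
features = np.zeros(8, dtype=np.float32)

# np.newaxis inserts a leading axis of size 1: shape (8,) becomes (1, 8),
# which a model expecting (batch, features) reads as a batch of one.
batched = features[np.newaxis, :]

# Batches of one can then be stacked into a larger batch.
batch_of_two = np.vstack([batched, batched])
```

Here batched has shape (1, 8) and batch_of_two has shape (2, 8).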

property session: InferenceSession: session getter to prevent pickle error
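One common way a getter can avoid pickle errors is lazy creation: the unpicklable object is built on first access instead of in the constructor. The sketch below illustrates that general pattern only and is not taken from the credsweeper source; the class LazyHolder is hypothetical, and a threading.Lock stands in for the InferenceSession because it is also unpicklable.

```python
import pickle
import threading


class LazyHolder:
    """Hypothetical class that defers creation of an unpicklable resource."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self._session = None  # nothing unpicklable is stored yet

    @property
    def session(self):
        # Create the resource on first access only.
        if self._session is None:
            # Stand-in for an unpicklable object such as an inference session.
            self._session = threading.Lock()
        return self._session


holder = LazyHolder("model.onnx")  # made-up path, never opened
# Pickling works because the resource has not been created yet.
clone = pickle.loads(pickle.dumps(holder))
# Each copy builds its own resource when it is first needed.
resource = clone.session
```

Under this pattern an instance can be sent to worker processes as long as the session has not been accessed beforehand.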

validate_groups(group_list: List[Tuple[CandidateKey, List[Candidate]]], batch_size: int) → Tuple[ndarray, ndarray]

Use the ML model on a list of candidate groups.

Parameters:

group_list – List of tuples (value, group)
batch_size – Batch size for the ML model

Returns:

Boolean numpy array with decision based on the threshold, and numpy array with probability predicted by the model
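The relationship between the two returned arrays can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: fake_model is a hypothetical stand-in for the real model, validate_in_batches is a hypothetical helper, and the comparison is assumed to be probability >= threshold.

```python
import numpy as np


def fake_model(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: the row mean acts as a 'probability'."""
    return batch.mean(axis=1)


def validate_in_batches(features: np.ndarray, batch_size: int, threshold: float):
    """Score rows in chunks of batch_size, then apply the threshold (sketch)."""
    probability = np.zeros(len(features), dtype=np.float32)
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        probability[start:end] = fake_model(features[start:end])
    # Assumed decision rule: a score at or above the threshold counts as a credential.
    is_cred = probability >= threshold
    return is_cred, probability


# Three made-up feature rows, processed as one batch of two and one batch of one.
features = np.array([[0.9, 0.7], [0.1, 0.3], [0.5, 0.5]], dtype=np.float32)
is_cred, probability = validate_in_batches(features, batch_size=2, threshold=0.5)
```

Both arrays have one entry per group, in the same order as the input, so a caller can pair each decision with its probability by index.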

Module contents