Overall Architecture

CredSweeper consists of 3 main parts: Pre-processing, Scan, and ML validation.

https://raw.githubusercontent.com/Samsung/CredSweeper/main/docs/images/Architecture.png

Pre-processing

When paths to scan are given, CredSweeper collects the files under those paths and excludes files according to the lists defined in config.json.

config.json

exclude
- pattern: Regex patterns to exclude from the scan.
- containers: Lower-case extensions of container files that may be scanned with the --depth option.
- documents: Lower-case extensions of document files that may be scanned with the --doc and/or --depth option.
- extension: Lower-case extensions to exclude from the scan.
- path: Paths to exclude from the scan.
source_ext: List of extensions scanned as source files.
source_quote_ext: List of extensions scanned as source files that use quotes.
find_by_ext_list: List of extensions for which files are detected by extension alone.
check_for_literals: Boolean that controls whether to check that a line contains a string literal declaration.
line_data_output: List of line_data attributes to include in the output.
candidate_output: List of candidate attributes to include in the output.

...
"exclude": {
    "pattern": [
        ...
    ],
    "containers": [
        ".gz",
        ".zip",
        ...
    ],
    "documents": [
        ".docx",
        ".pdf",
        ...
    ],
    "extension": [
        ".7z",
        ".jpg",
        ...
    ],
    "path": [
        "/.git/",
        "/.idea/",
        ...
    ]
}
...
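
As a rough illustration of how the exclude lists could be applied, the sketch below filters a list of paths. The function name `is_excluded`, the sample pattern, and the sample paths are assumptions made for this example; they are not CredSweeper's actual API.

```python
import re
from pathlib import PurePosixPath

# Sample values only; the real lists live in config.json.
EXCLUDE = {
    "pattern": [r".*\.min\.js$"],  # hypothetical pattern
    "extension": [".7z", ".jpg"],
    "path": ["/.git/", "/.idea/"],
}


def is_excluded(path: str, exclude: dict) -> bool:
    """Return True if the path matches any of the exclude lists."""
    normalized = path.replace("\\", "/")
    # Extensions are stored in lower case, so compare in lower case.
    if PurePosixPath(normalized).suffix.lower() in exclude["extension"]:
        return True
    # Excluded path fragments are matched as substrings.
    if any(part in normalized for part in exclude["path"]):
        return True
    return any(re.match(p, normalized) for p in exclude["pattern"])


paths = ["src/app.py", "repo/.git/config", "img/LOGO.JPG", "static/app.min.js"]
to_scan = [p for p in paths if not is_excluded(p, EXCLUDE)]
```

In this sketch only `src/app.py` remains in `to_scan`; the other three paths are dropped by the path, extension, and pattern lists respectively.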

Scan

Scanning is performed on each file path and is driven by Rules. The scanning method depends on the scan type of the Rule, which is assigned when the Rule is created. There are 3 scan types: SinglePattern, MultiPattern, and PEMKeyPattern. Each scan type and its scanning method is described below.

SinglePattern
- When : The Rule has only 1 pattern.
- How : Check whether the Rule's single-line pattern is present in the line.
MultiPattern
- When : The Rule has 2 patterns.
- How : Check whether a line is part of a multi-line credential and the remaining part exists within 10 lines below.
PEMKeyPattern
- When : The Rule type is pem_key.
- How : Check whether the line’s entropy is high enough and the line has no substring of 5 identical consecutive characters (like ‘AAAAA’).
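
The MultiPattern and PEMKeyPattern checks can be sketched as follows. This is a minimal illustration, not CredSweeper's implementation: the function names and the entropy threshold of 3.0 are assumptions, and only the 10-line window and the check for 5 identical consecutive characters come from the description above.

```python
import math
import re
from collections import Counter

MULTI_LINE_WINDOW = 10   # from the description: look up to 10 lines below
ENTROPY_THRESHOLD = 3.0  # assumed value, for illustration only


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def looks_like_pem_line(line: str) -> bool:
    """PEMKeyPattern idea: high entropy and no run of 5 identical characters."""
    has_repeat = re.search(r"(.)\1{4}", line) is not None
    return shannon_entropy(line) >= ENTROPY_THRESHOLD and not has_repeat


def multi_pattern_match(lines, index, first, second) -> bool:
    """MultiPattern idea: first pattern on this line, second within the window below."""
    if not first.search(lines[index]):
        return False
    below = lines[index + 1 : index + 1 + MULTI_LINE_WINDOW]
    return any(second.search(line) for line in below)
```

A SinglePattern check reduces to a single `pattern.search(line)` call, so it is omitted here.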

Rule

Each Rule is dedicated to detecting a specific type of credential and is imported from config.yaml at runtime.

config.yaml

...
- name: API
  severity: medium
  type: keyword
  values:
  - api
  filter_type: GeneralKeyword
  use_ml: true
  validations: []
- name: AWS Client ID
...

Rule Attributes

severity

Severity

...
class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
...

confidence
- Confidence - A manually configured value that indicates how confident the Rule is that a found candidate is a credential of this type.

...
class Confidence(Enum):
    STRONG = "strong"
    MODERATE = "moderate"
    WEAK = "weak"
...

type

RuleType

...
class RuleType(Enum):
    KEYWORD = "keyword"
    PATTERN = "pattern"
    PEM_KEY = "pem_key"
    MULTI = "multi"
...

values
- keyword : The keywords you want to detect. To detect multiple keywords, separate them with | as follows : password|passwd|pwd.
- pattern : The patterns you want to detect. For more accurate detection, it is recommended to specify ?P<value> in the patterns : (?P<value>AIza[0-9A-Za-z-_]{35}).
- pem_key : Specific rule to find multiline PEM private keys.
- multi : Two patterns you want to detect. A candidate is found only if the second pattern matches nearby.
filter_type
- The type of the Filter group you want to apply. The implemented Filter groups are: GeneralKeyword, GeneralPattern, PasswordKeyword, and UrlCredentials.
use_ml
- Whether to perform ML validation. If true, ML validation is performed.
validations
- The type of validation you want to apply. The implemented validations are: GithubTokenValidation, GoogleApiKeyValidation, GoogleMultiValidation, MailchimpKeyValidation, SlackTokenValidation, SquareAccessTokenValidation, SquareClientIdValidation, and StripeApiKeyValidation.
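
To see why the ?P<value> group is recommended, the sketch below applies the example pattern with Python's `re` module: the named group lets a scanner pull out just the credential instead of the whole match. The keyword regex is a simplified assumption built for this illustration, and CredSweeper's actual keyword handling may differ.

```python
import re

# Pattern rule value taken from the example above.
pattern_rule = re.compile(r"(?P<value>AIza[0-9A-Za-z-_]{35})")

# Hypothetical, simplified regex built from the keyword value "password|passwd|pwd".
keyword = "password|passwd|pwd"
keyword_rule = re.compile(
    rf"(?P<variable>\w*(?:{keyword})\w*)\s*=\s*[\"'](?P<value>[^\"']+)[\"']",
    re.IGNORECASE,
)

line = 'my_db_password = "NUU423cds"'
match = keyword_rule.search(line)
variable, value = match.group("variable"), match.group("value")

# Dummy string with the right shape, not a real key.
fake_key = "AIza" + "x" * 35
key_match = pattern_rule.search(f'api_key = "{fake_key}"')
extracted_key = key_match.group("value")
```

Here `variable` and `value` hold the variable name and the secret from the sample line, and `extracted_key` holds only the key-shaped substring.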

Filter

Filters check the candidates detected in the previous step. If a candidate is caught by a Filter, it is removed from the candidate set. There are 21 filters and 4 filter groups. A filter group is a set of Filters, designed so that many Filters can be applied together.
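
The relationship between Filters and filter groups can be sketched as below. The two filters and the group are hypothetical stand-ins invented for this example; they are not among CredSweeper's 21 filters.

```python
from typing import Callable, List

# A filter returns True when the candidate should be dropped.
Filter = Callable[[str], bool]


def too_short(value: str) -> bool:
    """Hypothetical filter: very short values are unlikely to be secrets."""
    return len(value) < 4


def is_placeholder(value: str) -> bool:
    """Hypothetical filter: obvious placeholders such as 'xxxxxxxx' or 'changeme'."""
    return value.lower() in {"example", "changeme"} or len(set(value)) == 1


# Hypothetical filter group bundling several filters.
EXAMPLE_GROUP: List[Filter] = [too_short, is_placeholder]


def apply_filter_group(candidates: List[str], group: List[Filter]) -> List[str]:
    """Keep only the candidates that no filter in the group catches."""
    return [c for c in candidates if not any(f(c) for f in group)]


remaining = apply_filter_group(
    ["abc", "xxxxxxxx", "ChangeMe", "NUU423cds"], EXAMPLE_GROUP
)
```

Bundling filters into a group means a Rule only has to name one group in filter_type rather than list every filter it needs.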

ML validation

CredSweeper provides pre-trained ML models to filter out false credential lines. ML validation is on by default, and its sensitivity can be adjusted using --ml_threshold:

--ml_threshold FLOAT_OR_STR
   setup threshold for the ml model.
   The lower the threshold - the more credentials will be reported.
   Allowed values: float between 0 and 1, or any of ['lowest', 'low', 'medium', 'high', 'highest']
   (default: medium)

ML validation can be fully disabled by setting --ml_threshold 0:

python -m credsweeper --ml_threshold 0 ...
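
A FLOAT_OR_STR option like this can be parsed with an argparse type function. The sketch below is an assumption about how such parsing might look, not CredSweeper's own code; the function name `threshold_or_float` is made up for the example, and the numeric values behind the named presets are deliberately not reproduced.

```python
import argparse

THRESHOLD_NAMES = ("lowest", "low", "medium", "high", "highest")


def threshold_or_float(arg: str):
    """Accept a float between 0 and 1, or one of the named presets."""
    if arg in THRESHOLD_NAMES:
        return arg
    try:
        value = float(arg)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid threshold: {arg!r}")
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"threshold out of range: {value}")
    return value


parser = argparse.ArgumentParser(prog="example")
parser.add_argument("--ml_threshold", type=threshold_or_float, default="medium")

args = parser.parse_args(["--ml_threshold", "0"])
# A threshold of 0 means ML validation is disabled.
ml_enabled = args.ml_threshold != 0
```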

Our ML model architecture combines a Bidirectional LSTM with additional handcrafted features. It uses the last 50 characters of the potential credential and 91 handcrafted features to decide whether it is a real credential.

Example:

leaked_cred.py:
my_db_password = "NUU423cds"

Steps:

1. A regular expression extracts `NUU423cds` as the secret value, `my_db_password` as the variable, and `my_db_password = "NUU423cds"` as the whole line.
2. Handcrafted feature classes are instantiated from the classes in features.py using model_config.json. The instantiation process can be checked at ml_validator.py#L46. Features include: ` ` character in line: yes/no, `(` character in line: yes/no, file extension is `.c`: yes/no, etc.
3. The handcrafted features from step 2 are applied to the line, value, variable, and filename to get a feature vector of length 91.
4. `NUU423cds` is lowercased and right-padded with a special padding character to length 50. If the value is longer, the last 50 characters are selected. Only 70 symbols are used: 68 ASCII characters + 1 padding character + 1 special character for all other symbols (ml_validator.py#L29). The padded line is then one-hot encoded (ml_validator.py#L63).
5. The padded line from step 4 is fed into the Bidirectional LSTM, which produces a single vector of length 60 as output.
6. The LSTM output and the handcrafted features are concatenated into a single vector of length 151.
7. The vector from step 6 is fed into the last two Dense layers.
8. The last layer outputs a float in the range 0-1: the estimated probability that the line is a real credential.
9. The predicted probability is compared to the threshold (see the --ml_threshold CLI option), and the credential is reported if the predicted probability is greater.
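
The value-encoding step and the vector sizes can be sketched in plain Python. The alphabet below (lowercase letters, digits, and punctuation, which happens to total 68 characters) is an assumption chosen for illustration; the actual character set is defined in ml_validator.py and may differ. The function name `encode_value` is also hypothetical.

```python
import string

MAX_LEN = 50
# Assumed alphabet of 68 ASCII characters; the real set may differ.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
PAD_INDEX = len(ALPHABET)          # index 68: padding character
UNKNOWN_INDEX = len(ALPHABET) + 1  # index 69: any symbol outside the alphabet
NUM_SYMBOLS = len(ALPHABET) + 2    # 70 symbols in total
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_value(value: str):
    """Lowercase, keep the last 50 characters, right-pad, and one-hot encode."""
    tail = value.lower()[-MAX_LEN:]
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in tail]
    indices += [PAD_INDEX] * (MAX_LEN - len(indices))
    return [
        [1.0 if i == idx else 0.0 for i in range(NUM_SYMBOLS)] for idx in indices
    ]


encoded = encode_value("NUU423cds")  # 50 rows, 70 columns

# Sizes taken from the steps above: LSTM output plus handcrafted features.
lstm_output_len, feature_len = 60, 91
dense_input_len = lstm_output_len + feature_len
```

Each row of `encoded` contains exactly one 1.0, and `dense_input_len` matches the length of the concatenated vector that is passed to the Dense layers.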

https://raw.githubusercontent.com/Samsung/CredSweeper/main/docs/images/Model_with_features.png

Additional:

Handcrafted features are based on the rules described in “Secrets in Source Code” publication.

@INPROCEEDINGS{9027350,
    author={Saha, Aakanksha and Denning, Tamara and Srikumar, Vivek and Kasera, Sneha Kumar},
    booktitle={2020 International Conference on COMmunication Systems \& NETworkS (COMSNETS)},
    title={Secrets in Source Code: Reducing False Positives using Machine Learning},
    year={2020},
    pages={168-175},
    doi={10.1109/COMSNETS48256.2020.9027350}
}

The mapping between text threshold values and floats can be found at model_config.json#L2. The values are based on F-0.25, F-0.5, F-1, F-2, and F-4 scores on the CredData test set.
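
For reference, the F-beta score is the weighted harmonic mean of precision and recall: beta below 1 weights precision more heavily, and beta above 1 weights recall more heavily. The sketch below computes it for the five beta values mentioned above. The precision and recall numbers are made up purely to show the effect of beta, and which score corresponds to which named threshold is defined in model_config.json and is not reproduced here.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up precision and recall values, for illustration only.
precision, recall = 0.9, 0.6
scores = {beta: f_beta(precision, recall, beta) for beta in (0.25, 0.5, 1, 2, 4)}
```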