Baselining, on steroids!
Baseline is a cross-platform library and command-line utility that creates file-oriented baselines of your systems.
The project aims to offer an open-source alternative to the famous NSRL or HashSets and allows you to generate baselines from your own systems. Plus, it is cross-platform, so you can use it the same way whether on a Windows or a GNU/Linux system.
Currently available extractors:
- fs: Extracts filesystem-related metadata.
- hash: Computes several hashes from the entry's data (e.g. MD5, SHA-1, ssdeep).
- pe: Extracts detailed information from Portable Executable (PE) files.
Installation
Installing from PyPI
Baseline is currently not available on PyPI. The main reason is that the name is currently taken by another project.
Installing from source
Baseline uses Poetry as its packaging toolkit, so installing it from source is as simple as:
git clone https://github.com/sk4la/baseline.git
cd baseline
python3 -m pip install poetry
python3 -m poetry install
Precompiled binaries
Precompiled binaries are available in the Releases section.
Docker
Baseline is also available as a Docker image.
To pull the latest image from Docker Hub:
docker pull sk4la/baseline
See the official Docker documentation for details on how to install and use it.
Usage
The help menu for the baseline command-line utility:
Usage: baseline COMMAND [ARGS]...
Command-line utility that creates file-oriented baselines.
Options:
--ensure-administrator Ensure that the current user has administrative privileges (i.e.
is `root` or equivalent on GNU/Linux systems).
--log-file
Set the log file path (e.g. 'baseline.log'). [default:
/home/sk4la/baseline/20211031110323.25756fb1b706.log]
--logging-configuration
Set the logging configuration using a custom file. The file must
adhere to the official specification. See
https://docs.python.org/3/library/logging.config.html#logging-config-dictschema
for more details.
--monochrome Disable console output coloring. This can be useful when piping
the output to a log file.
-v, --verbose Increase the logging verbosity. Supports up to 4 occurrences of
the same option (e.g. -vvvv). [0<=x<=4]
--version Show the version and exit.
--help Show this message and exit.
Commands:
new Creates a new filesystem-based baseline.
schema Show the JSON representation of the actual schema.
The help menu for the baseline new subcommand:
Usage: baseline new
...
Creates a new filesystem-based baseline.
Options:
--comment
Add an arbitrary comment to the generated output file.
--exclude-directory
...
Exclude a specific directory from the baseline. Can be specified
multiple times (e.g. `--exclude-directory /dev
--exclude-directory /proc`).
--exclude-extractor [hash|pe|fs]
Exclude extractors. Can be specified multiple times (e.g.
`--exclude-extractor hash --exclude-extractor pe`).
--max-size
Set the maximum file size (in bytes) to inspect. [default: 5000000; x>=1]
-o, --output-file
Set the output file path (e.g. 'baseline.ndjson'). [default:
/workspaces/baseline/20211102200452.58bd60a3b16a.ndjson]
--output-file-encoding [utf-8|utf-16le]
Set the output file encoding. Only applies when writing to an
actual file. [default: utf-8]
-f, --output-format [ndjson]
Set the output format. [default: ndjson]
--partition-size
Set the partition size (i.e. number of entries per process).
[default: 200; x>=1]
--processes
Set the number of parallel processes. [default: 2; x>=1]
--recursive / --non-recursive
Whether to walk the filesystem recursively. When set, the program
will only inspect the files and directories specifies on the first
level of any included path. For example, if '/mnt/image' is
specified as an included path, then only the directory '/mnt/image'
itself and its direct children will be inspected. [default: recursive]
--remap
...
Artificially remap included paths (e.g. '/mnt/image:/'). Can be
specified multiple times (e.g. `--remap /mnt/image:/ --remap
/dev/null:/dev/void`).
--report / --no-report
Whether to show a final report at the end. [default: report]
--skip-compression Whether to skip on-the-fly compression of the resulting file.
--skip-directories Whether to skip directories.
--skip-empty Whether to skip empty entries.
--help Show this message and exit.
The help menu for the baseline schema subcommand:
Usage: baseline schema
Show the JSON representation of the actual schema.
Options:
--compact Render compact JSON instead of the default idented version.
--output-file
Set the output file path (e.g. 'schema.json').
--output-file-encoding [utf-8|utf-16le]
Set the output file encoding. Only applies when writing to an
actual file. [default: utf-8]
--help Show this message and exit.
Creating a baseline of a live system
Creating a baseline of a live system is as simple as:
baseline new
When using Baseline from a removable device, you may want to exclude its path (for example /mnt/usb) from the generated baseline:
baseline --ensure-administrator new --exclude-directory /mnt/usb
See the Usage section for a complete list of options and arguments.
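The default output format is NDJSON, that is, one JSON document per line. The following is a minimal sketch of how such a file could be read back in Python, assuming it was written uncompressed (for example with --skip-compression). The read_records helper and the path field in the sample data are hypothetical and only serve the illustration; they are not part of Baseline.

```python
import io
import json
from typing import Iterator, TextIO


def read_records(stream: TextIO) -> Iterator[dict]:
    """Yield one dictionary per non-empty NDJSON line.

    Hypothetical helper, not part of Baseline itself.
    """
    for line in stream:
        line = line.strip()
        if line:  # Skip blank lines between records.
            yield json.loads(line)


# Stand-in data; the `path` field is an assumption, not Baseline's schema.
sample = io.StringIO('{"path": "/etc/hostname"}\n\n{"path": "/etc/hosts"}\n')
records = list(read_records(sample))
print(len(records))  # prints 2
```

With a real file, the same helper could be fed a stream opened with open("baseline.ndjson", encoding="utf-8").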
Creating a baseline from a mounted image
When creating a baseline of a mounted image, you may want the baseline to represent the files as if they were read from the actual system, not the mounted image.
For example, if your image is currently mounted on /mnt/IMG-001, you can then execute the following command to remap all entries read from this path to /:
baseline new --remap /mnt/IMG-001:/ /mnt/IMG-001
You can think of this as a chroot jail.
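To make the effect of --remap more concrete, here is a small sketch of the prefix rewriting it describes. The remap_path function is a hypothetical illustration written for this README; it is not Baseline's actual implementation, which may handle edge cases differently.

```python
from pathlib import PurePosixPath


def remap_path(path: str, source: str, target: str) -> str:
    """Replace the `source` prefix of `path` with `target`.

    Hypothetical helper for illustration; not part of Baseline's API.
    """
    pure = PurePosixPath(path)
    try:
        relative = pure.relative_to(source)
    except ValueError:
        # The path is not located under `source`: leave it untouched.
        return str(pure)
    return str(PurePosixPath(target) / relative)


print(remap_path("/mnt/IMG-001/etc/passwd", "/mnt/IMG-001", "/"))  # /etc/passwd
print(remap_path("/mnt/IMG-001", "/mnt/IMG-001", "/"))             # /
print(remap_path("/var/log/syslog", "/mnt/IMG-001", "/"))          # /var/log/syslog
```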
Displaying the schema
Baseline uses a fixed schema for rendering the information. This schema is enforced using the Pydantic package and produces a heavily-typed output that can later be ingested as-is.
To print the standardized JSON schema:
baseline schema
To dump a compact version of the JSON schema to schema.min.json:
baseline schema --compact --output-file schema.min.json
The JSON schema produced by Pydantic is compatible with the specifications from JSON Schema Core, JSON Schema Validation and OpenAPI Data Types. See the official Pydantic documentation for more details.
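Once dumped, the schema is a regular JSON document and can be inspected with the standard library alone. The sketch below lists the top-level properties of a schema; the list_properties helper is hypothetical, and the sample document is a stand-in whose property names (fs, hash, pe) are assumed from the extractor names rather than taken from the real schema.

```python
import json
from typing import List


def list_properties(schema_text: str) -> List[str]:
    """Return the sorted top-level property names of a JSON Schema document."""
    schema = json.loads(schema_text)
    return sorted(schema.get("properties", {}))


# Stand-in document; the real one comes from `baseline schema`.
sample = (
    '{"title": "Record", "type": "object", '
    '"properties": {"pe": {}, "fs": {}, "hash": {}}}'
)
print(list_properties(sample))  # prints ['fs', 'hash', 'pe']
```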
Advanced usage
Building binaries
Baseline currently supports the following packaging systems:
- PyInstaller (preferred);
- Nuitka.
Although precompiled binaries are available in the Releases section, you should always build your own binaries.
To produce a binary using PyInstaller:
make pyinstaller-linux
To produce a binary using Nuitka:
make nuitka-linux
Because Nuitka is a Python compiler in its own right and translates the code to C instead of running it directly with the standard CPython interpreter, you may run into bugs or issues that are unrelated to Baseline itself.
Building the Docker image
To build the official Docker image:
make docker
Additional instructions can be added to the Dockerfile in order to customize the image.
The official Docker image is available at https://hub.docker.com/r/sk4la/baseline. You can use the FROM docker.io/sk4la/baseline:latest instruction in your own Dockerfile to derive your own image.
API
Baseline can be used from Python through the Baseline class:
from baseline.core import Baseline

with Baseline() as baseline:
    for record in baseline.compute(*[
        "/mnt/IMG-001",
        "/mnt/IMG-002",
    ]):
        print(record.json(exclude_none=True))
See the actual code for a more thorough example.
The Baseline class emits logging messages to the baseline logger, to which you can subscribe if you wish. The command-line utility displays these messages on the console by default.
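Subscribing to that logger only requires the standard logging module. The snippet below is a minimal sketch that attaches a console handler to the baseline logger; the chosen level and message format are arbitrary choices for the example, not Baseline defaults.

```python
import logging

# "baseline" is the logger name used by the Baseline class.
logger = logging.getLogger("baseline")
logger.setLevel(logging.INFO)

# Send the messages to the console with a simple, arbitrary format.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
logger.addHandler(handler)
```

Replacing StreamHandler with logging.FileHandler would send the same messages to a file instead.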
Contribute
Baseline is a work in progress, and everyone is welcome to contribute!
Writing a new extractor
In order for new extractors to be able to enrich the generated records, the global schema first needs to be updated. To do this, you must create a subclass of Pydantic's BaseModel in schema.py that references the fields that will eventually be filled by the extractor. This class will then be referenced in the schema's root Record class.
In this example, we want to extract the first 50 characters of any *.txt file. Here we arbitrarily decide that the extracted text will be stored in the content attribute and that the extractor's key will be text:
class Text(pydantic.BaseModel):
    content: str

class Record(pydantic.BaseModel):
    ...
    text: typing.Optional[Text]
We can then start to write the actual code. All extractors must inherit from the base Extractor class:
# Import the schema module itself so that the extractor class below
# does not shadow the schema's Text model.
from baseline import schema
from baseline.models import Extractor

class Text(Extractor):
    """Extracts the first 50 characters of any `.txt` file."""

    EXTENSION_FILTERS = (
        r"\.txt$",
    )
    KEY = "text"

    def run(self: object, record: schema.Record) -> None:
        with self.entry.open() as stream:
            setattr(
                record,
                self.KEY,
                schema.Text(
                    content=stream.read(50),
                ),
            )
The extractor's KEY class variable must correspond to the one that was specified in the schema's root Record class (text in this example).
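The EXTENSION_FILTERS entries are regular expressions. As a rough sketch of what the pattern used above matches, the snippet below tests a few file names with re.search. How Baseline actually applies these filters (search versus match, case sensitivity) is an assumption here, and matches_any is a hypothetical helper.

```python
import re
from typing import Iterable

EXTENSION_FILTERS = (r"\.txt$",)


def matches_any(name: str, patterns: Iterable[str]) -> bool:
    """Return True if any pattern is found in the file name.

    Hypothetical helper; Baseline's own filtering logic may differ.
    """
    return any(re.search(pattern, name) for pattern in patterns)


print(matches_any("notes.txt", EXTENSION_FILTERS))      # True
print(matches_any("notes.txt.bak", EXTENSION_FILTERS))  # False
print(matches_any("NOTES.TXT", EXTENSION_FILTERS))      # False (case-sensitive)
```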
Support
In case you encounter a problem or want to suggest a new feature, please submit a ticket.
License
Baseline is licensed under the GNU General Public License (GPL) version 3.