edm: Edge DNSTAP Minimiser
About
dnstapir-edm reads DNSTAP data and, depending on configuration, produces the
following output based on the observed messages:
- DNS queries for names considered well-known will be summarised into
histograms which are saved as parquet files. These files will then be submitted
to Core.
- DNS queries for names not considered well-known are collected into separate
parquet files for further local analysis. Here the complete message content
is saved, but the client and server IP addresses are pseudonymised via
Crypto-PAn.
- DNS queries for names that are not considered well-known and have never been
seen before by a given instance of dnstapir-edm will result in notifications
being sent to Core via MQTT messages.
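Taken together, the three cases above amount to a decision made per query name.
The shell sketch below is only a toy model of that decision, not how
dnstapir-edm is implemented: two plain text files stand in for the well-known
list and the seen-names store, the lookup is an exact string match, and the
function name classify and the variables WELL_KNOWN and SEEN are hypothetical.

```shell
#!/bin/sh
# Toy model of the per-query decision described above. NOT the real
# implementation: text files stand in for the DAWG file and the key-value store.
WELL_KNOWN=$(mktemp)   # hypothetical stand-in for well-known-domains.dawg
SEEN=$(mktemp)         # hypothetical stand-in for the seen-names store
printf '%s\n' example.com. example.org. > "$WELL_KNOWN"

# classify NAME: print which outputs a query for NAME would feed.
classify() {
    if grep -Fxq -- "$1" "$WELL_KNOWN"; then
        echo "histogram"
    elif grep -Fxq -- "$1" "$SEEN"; then
        echo "session"
    else
        printf '%s\n' "$1" >> "$SEEN"   # remember the name for next time
        echo "session new-name-notification"
    fi
}

classify example.com.      # prints: histogram
classify unknown.test.     # prints: session new-name-notification
classify unknown.test.     # prints: session
```

The second call for the same unknown name no longer triggers a notification,
which mirrors the "never been seen before by a given instance" condition.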
Usage
Running dnstapir-edm requires a TOML config file holding the Crypto-PAn secret
used for pseudonymisation, as well as a well-known-domains.dawg file, which can
be created using dnstapir-cli from
https://github.com/dnstapir/cli
Steps for a basic local-only setup
A basic setup where dnstapir-edm will listen on a unix socket for DNSTAP data and
output files to a directory structure under /tmp/dnstapir-edm but not send anything to
Core can be created like this:
make build
echo 'cryptopan-key = "mysecret"' > dnstapir-edm.toml
curl -O https://www.domcop.com/files/top/top10milliondomains.csv.zip
unzip top10milliondomains.csv.zip
dnstapir-cli dawg --standalone compile --format csv --src top10milliondomains.csv --dawg well-known-domains.dawg
dnstapir-edm run --input-unix /tmp/dnstapir-edm/input.sock --data-dir /tmp/dnstapir-edm/data --config-file dnstapir-edm.toml --well-known-domains-file well-known-domains.dawg --disable-mqtt --disable-histogram-sender
Since all communication with Core is disabled this is helpful for creating some
local parquet files to look around in.
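The literal "mysecret" key is fine for looking around locally. For anything
longer-lived, a random secret may be preferable. The sketch below writes a
config file with a random 32-hex-digit value; it assumes that any string is
accepted as cryptopan-key, as in the example above (whether dnstapir-edm places
further requirements on the key is not stated here), and the function name
write_config is hypothetical.

```shell
#!/bin/sh
# write_config FILE: write a config file holding a random 32-hex-digit secret.
# Assumption: any string is accepted as cryptopan-key, as in the example above.
write_config() {
    # 16 random bytes, printed as hex with spaces and newlines removed
    key=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
    # Create the file readable by the current user only
    ( umask 077; printf 'cryptopan-key = "%s"\n' "$key" > "$1" )
}

# Demo: write to a scratch directory and show the result
demo_dir=$(mktemp -d)
write_config "$demo_dir/dnstapir-edm.toml"
cat "$demo_dir/dnstapir-edm.toml"
```

In the setup above this would replace the echo line, e.g.
write_config dnstapir-edm.toml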
Inspecting the resulting files
For inspecting the content you can use e.g. DuckDB like so:
- For summarised histogram data
duckdb -c 'select * from "/tmp/dnstapir-edm/data/parquet/histograms/outbox/dns_histogram-2024-09-26T18-14-00Z_2024-09-26T18-15-00Z.parquet"'
- For pseudonymised session (full message) data
duckdb -c 'select * from "/tmp/dnstapir-edm/data/parquet/sessions/dns_session_block-2024-09-26T18-18-00Z_2024-09-26T18-19-00Z.parquet"'
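The file names in the commands above appear to embed the start and end
timestamps of the covered interval, so they will differ on every run. A small
helper for picking the newest file in a directory could look like the sketch
below. It relies on those timestamps sorting chronologically as plain text, and
the function name latest_parquet is hypothetical.

```shell
#!/bin/sh
# latest_parquet DIR: print the newest .parquet file directly under DIR.
# Relies on the timestamped file names sorting chronologically as plain text.
latest_parquet() {
    find "$1" -maxdepth 1 -type f -name '*.parquet' | sort | tail -n 1
}

# Example (needs duckdb and existing output files):
#   duckdb -c "select * from '$(latest_parquet /tmp/dnstapir-edm/data/parquet/histograms/outbox)'"
```

If the directory contains no parquet files yet, the function prints nothing.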
Next to the parquet directory you will also see a directory called "pebble".
This is where dnstapir-edm keeps its key-value store which is used to tell if a
query name has been seen before or not. The key-value store being used is
pebble.
Observability
dnstapir-edm exposes Prometheus metrics at 127.0.0.1:2112 and Go pprof
profiling data at 127.0.0.1:6060.
To look at the Prometheus metrics:
curl 127.0.0.1:2112/metrics
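The metrics endpoint returns the Prometheus text format, which mixes "# HELP"
and "# TYPE" comment lines with the actual samples. The sketch below keeps only
the samples for a given name prefix. The function name metric_samples is
hypothetical, and the go_ metrics in the demo are the Go client library's usual
defaults; that dnstapir-edm exposes them is an assumption.

```shell
#!/bin/sh
# metric_samples PREFIX: read Prometheus text format on stdin and print only
# the sample lines whose metric name starts with PREFIX (comments are dropped).
metric_samples() {
    grep -v '^#' | grep -- "^$1"
}

# Offline demo on a hand-written sample; against a running instance use:
#   curl -s 127.0.0.1:2112/metrics | metric_samples go_
printf '%s\n' \
    '# HELP go_goroutines Number of goroutines that currently exist.' \
    '# TYPE go_goroutines gauge' \
    'go_goroutines 12' \
    'process_open_fds 9' | metric_samples go_
```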
There are multiple types of profiling data available. Here is a CPU-centric example:
go tool pprof 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30'
Development
When working with this code at least the following tools are expected to be
run in the top-level directory prior to committing:
Building
Binary
The simplest way to get the dnstapir-edm binary is this:
make build