I was making some firewall changes last weekend, and while watching the logs I noticed that every now and then one of my hosts would try to connect to 169.254.169.254 on port 80. This was peculiar, since I don't use IPv4 link-local addresses anywhere on my network. It seemed to happen at random from all of my Linux hosts, both physical and virtual.
If the connections originated on my firewall, which runs OpenBSD, I could track down the responsible process by adding a more specific rule with "log (user)" to pf.conf(5). Unfortunately, it seems Linux dropped that ability from Netfilter somewhere in the 2.6 time frame. 😢
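For comparison, on OpenBSD the rule would look something like this (a sketch only; "egress" is pf's interface group for the default-route interface, and the rule placement would depend on the rest of the ruleset):

```
# pf.conf(5): "log (user)" makes pf record the local socket's
# owning uid/pid in the pflog header for matching packets
block out log (user) quick on egress inet proto tcp to 169.254.169.254 port 80
```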
The part that makes this a bit unusual is that the connection is guaranteed to fail: it trips a rule in my firewall that blocks bogons outbound on my WAN interface. That means the normal tools like netstat(1) and lsof(8) will only reveal anything if I somehow catch the process between its connect(2) system call and the failure. What I need is to log what is happening in real time. I could do that with something like strace(1), but strace needs the PID of the process, and the PID is exactly what I'm trying to find out.
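For completeness, the racy approach would look something like this sketch: poll for a socket sitting in SYN-SENT toward the target and hope to catch the owning process before the connect fails (-p only names other users' processes when run as root):

```
while sleep 0.2; do ss -tnp state syn-sent dst 169.254.169.254; done
```

With a blocked destination the socket lingers in SYN-SENT until the retries time out, so this can work, but it's ugly and easy to miss.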
So off I went looking for something else that might help and stumbled upon the Linux kernel audit system. The audit system has been around for a while and lets you ask the kernel to report the details of syscalls as they happen. A built-in filtering mechanism keeps you from dumping too much information or dramatically impacting performance, and the raw data is handed to a userland program over a netlink socket. By default most distributions ship auditd, which listens on that netlink socket and dumps all the messages into a log.
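Before adding rules it's worth checking what's already loaded; a couple of root-only housekeeping commands, assuming the stock auditd userland:

```
sudo auditctl -s   # kernel audit status: enabled flag, backlog size, auditd pid
sudo auditctl -l   # list the rules currently loaded
```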
Since I'm looking at an attempted TCP connection, the connect(2) system call is the one I'm interested in. I didn't know much more than that going in, but it turns out a pretty simple filter rule was all I needed.
$ sudo auditctl -a exit,always -F arch=b64 -S connect
This asks the kernel to log, on exit from the syscall, every call to connect(2). It immediately started flooding the audit log with entries like:
type=SYSCALL msg=audit(1538057732.986:13752): arch=c000003e syscall=42 success=yes exit=0 a0=12 a1=7ffc987e28c0 a2=6e a3=7f20abf93dd0 items=1 ppid=9803 pid=19584 auid=4294967295 uid=33 gid=33 euid=33 suid=33 fsuid=33 egid=33 sgid=33 fsgid=33 tty=(none) ses=4294967295 comm="apache2" exe="/usr/sbin/apache2" key=(null)
type=SOCKADDR msg=audit(1538057732.986:13752): saddr=01002F746D702F7...
type=CWD msg=audit(1538057732.986:13752): cwd="/"
type=PATH msg=audit(1538057732.986:13752): item=0 name="/tmp/passenger.PIT9MCV/agents.s/core" inode=1499146 dev=fe:01 mode=0140666 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL
type=PROCTITLE msg=audit(1538057732.986:13752): proctitle=2F7573722F7362696E2F617...
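Those saddr and proctitle values are just raw bytes, hex-encoded. In the record above, saddr starts with 0100, which is AF_UNIX followed by a pathname. For the AF_INET case the layout is struct sockaddr_in: two bytes of address family in host order (0200 on a little-endian machine), then the port and the IPv4 address in network order. It's simple enough to pick apart by hand in bash; the value below is constructed for illustration, not taken from my logs, and shows what a connect to 169.254.169.254:80 would look like:

```shell
# saddr for an AF_INET connect to 169.254.169.254:80
saddr="02000050A9FEA9FE0000000000000000"
port=$(( 16#${saddr:4:4} ))     # bytes 2-3: port, network byte order
ip="$(( 16#${saddr:8:2} )).$(( 16#${saddr:10:2} )).$(( 16#${saddr:12:2} )).$(( 16#${saddr:14:2} ))"
echo "${ip}:${port}"
```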
OK, so I'm getting closer, but obviously some of the data comes out in a packed hex format, and the fields I want aren't all on the same line, so I need to figure out how to decode and correlate it all. While searching for the message format, in hopes of writing a quick and dirty parser, I found ausearch(8), which includes the handy -i option to do exactly that.
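A minimal invocation for reference, assuming the stock auditd setup writing to the default log:

```
# -i interprets numeric and hex fields (uids, syscall numbers, the
# saddr blob) into readable text; -sc restricts the search to one syscall
sudo ausearch -i -sc connect
# delete the temporary audit rule once the hunt is over
sudo auditctl -D
```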
I fired up tcpdump(8) on the pflog(4) interface and waited for one of the 169.254.169.254 packets to be dropped. That let me find what I was looking for in the audit log... the culprit.
It turns out it was a puppet agent run. Now, I know none of my modules try to talk to that address, but puppet does a lot of things, including running facter to gather information about the system and the environment it is running in. Some cloud infrastructure has standardized on that address as the place where guest agents pick up instance metadata, so I suspected a default module probing to see if it was running on a cloud provider. A quick locate(1) and grep(1) later, and it turns out the built-in facter ec2 module does in fact try to pull metadata from 169.254.169.254.
apollo@10:16:49 1.8T ~ >locate facter
[ ... many paths elided for brevity ... ]
/usr/lib/ruby/vendor_ruby/facter
[ ... many more paths ...]
apollo@10:16:53 1.8T ~ >grep -R 169.254.169.254 /usr/lib/ruby/vendor_ruby/facter/*
/usr/lib/ruby/vendor_ruby/facter/ec2/rest.rb: DEFAULT_URI = "http://169.254.169.254/latest/meta-data/"
/usr/lib/ruby/vendor_ruby/facter/ec2/rest.rb: DEFAULT_URI = "http://169.254.169.254/latest/user-data/"
/usr/lib/ruby/vendor_ruby/facter/util/ec2.rb: url = "http://169.254.169.254:80/"
/usr/lib/ruby/vendor_ruby/facter/util/ec2.rb: # GET request for the URI http://169.254.169.254/latest/user-data/ If the
/usr/lib/ruby/vendor_ruby/facter/util/ec2.rb: uri = "http://169.254.169.254/#{version}/user-data/"
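One way to quiet these probes at the host itself, rather than letting them travel to the border firewall, is a local reject rule. This is a sketch assuming iptables userland; a REJECT with a TCP reset gives facter an immediate connection-refused instead of a long timeout:

```
iptables -A OUTPUT -d 169.254.169.254 -p tcp --dport 80 -j REJECT --reject-with tcp-reset
```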
So in the end, the Linux audit system is our friend, and there is a lot of other cool stuff in there. I ran across a post from the Slack engineering team describing how they use the audit system to alert on and challenge user actions in real time. This is also a cautionary tale that good network hygiene matters: you never know what random things you might leak out onto the Internet (or your ISP's network) if you aren't careful.
🍻