Wednesday, April 22, 2009

DansGuardian access.log: summarizing, counting, unique domains

I have a DansGuardian access.log file on a SmoothWall box. I'd like to get a list of the unique domains in use, along with a sample IP address to check on for each one.

My first effort is good as far as it goes, which is simply to alphabetize the domains and give an IP address for *someone* who has accessed each one:

awk "{ split (\$5,a,\"/\"); print \$4 \"\t\" a[3]; }" access.log | sort +1 -u


Of course, if I needed the date or time, I could add them to the print statement.
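
For example, assuming the date and time are the first two fields of each log line (they are in my log, but check yours), that would be:

awk '{ split($5, a, "/"); print $1 " " $2 "\t" $4 "\t" a[3]; }' access.log | sort -k4 -u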


But now I think to myself: what about seeing how popular each domain (the front part of the URL) is?

awk "{ split (\$5,a,\"/\"); print \$4 \"\t\" a[3]; }" access.log | sort +1 | awk '{a[$2] = $0; b[$2]++ } END {for(i in a){ print a[i] "\t" b[i]};}' | sort +1


This gives one IP address that has accessed each domain, and the total number of times that domain has been accessed. It DOES NOT mean that that particular IP address accessed the domain that many times. If I wanted that instead ...


awk "{ split (\$5,a,\"/\"); print \$4 \"\t\" a[3]; }" access.log | sort | awk '{a[$0] = $0; b[$0]++ } END {for(i in a){ print a[i] "\t" b[i]};}' | sort


Further, you can use the above to see who "hogs" the web...
awk "{ split (\$5,a,\"/\"); print \$4 \"\t\" a[3]; }" access.log | sort | awk '{ a[$0] = $0; b[$0]++ } END {for(i in a){ print a[i] "\t" b[i]};}' | sort -r -n +2 -t " "

Inside the " " Linux users would use, in vi: ctrl-v, then Tab to put the real tab character. This puts the biggest numbers on top, so piping through more or head would be ideal.

I would argue that running these one-liners is faster than most any dedicated log-analysis program, or you can use them in conjunction with whatever log-analysis program you already have.
