Friday, June 26, 2015

Hack The Cause

Author:    Sean Moriarty
Email:      smoriarty21@gmail.com
Project:    http://hackthecause.info

Background


    A few weeks ago I was asked to start working on a project involving security log analysis via machine learning.  I had already gained some experience with log streaming and analysis by running a HIDS at home, but using machine learning for analysis was new to me.  I am aware this is something that has been done before, but my long term goal is to optimize algorithms specifically for threat detection and mitigation.  Again, I know this is nothing new; I am just hoping I can do it better, without insanely expensive hardware, on an open source platform.
    The first problem that needed to be tackled before this project could really start was getting data.  I needed some form of log data that I could begin to model and write my algorithms around.  I personally find it very tough to write machine learning algorithms with no data to feed them.  I decided to use Apache Flume for streaming my log data.  If you are interested in how Flume works, take a look at my post about setting up Flume to stream log data.  The first prototype version was a static page that displayed the text "I'm a teapot".  I then wrote a quick Python script to send requests to the server at random time intervals.
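    There is nothing fancy about that script; a minimal sketch, assuming a placeholder URL for the prototype server and using the requests library, looks something like this:

    import random
    import time

    import requests

    # Placeholder address for the prototype server, not the real one.
    TARGET = "http://192.168.1.50/"

    while True:
        try:
            requests.get(TARGET, timeout=5)
        except requests.RequestException:
            pass  # the goal is just to generate log entries, not to handle errors
        # wait somewhere between 1 and 30 seconds before the next hit
        time.sleep(random.uniform(1, 30))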
    This setup was great for prototyping my environment.  It gave me a pipeline to test my Flume configuration and ensure that for each hit on the site, the log entry was sent to my Hadoop cluster and written to HBase for later retrieval and analysis.  It also provided a great chance to load test the pipeline and make sure Flume was going to be reliable.  Once we start talking about writing machine learning algorithms, though, this data will not give us very useful information.  If we want to write algorithms that will catch real bad guys, we need real data, preferably from a server with a decent amount of normal daily traffic and some attempted breaches.  Solving the problem of getting data of this nature was a little more fun.

The Coding Begins   


    After taking a nice walk and thinking about solutions to my problem, I decided I wanted to write a CTF, open it up to the world, and encourage everyone to tinker away.  I had been meaning to teach myself CodeIgniter; in the past I have used Zend heavily and I felt it was time to try something new and see how I liked it.  After a sleepless night of coding and scanning through the CodeIgniter documentation I had a working site.  The very first version released with four beginner levels.  I bought the domain, spun up a web server and did a soft release, announcing the site to a handful of friends who are into cyber security.  The soft launch went very well, with neither end of my pipeline crashing over those few days.  I then announced the CTF publicly and continued to put an hour in here and there as I had free time over the next week or so (I wish I could spend more time on it, but work is pulling me in a million different directions).  I added a few more levels a night and improved some of the existing levels based on user feedback.  The site now has nine practice challenges, with more to come as soon as I get some free time.
    On a side note, I would like to thank the first handful of people who helped test and gave feedback.  The first few days the site was live it had a few really lame levels.  The SQLi stuff was all simulated using regex, but I quickly realized how silly that was and switched to a real, live database being exploited.  Another note: I understand that the levels can be played in any order by simply hand typing the URL.  This was intentional; I want people to be able to skip around and play whatever levels they think are most challenging and fun for their personal style and interests.  I also realize that you can just use the JS console to call the function that displays the message for capturing a flag.  I will fix this eventually; I just have not had the time and it is low priority in my mind.  You get out of a site like this what you put into it.  If you just want to call the function nine times and call it a day, that is your m.o. and I won't try to stop you.  I really do appreciate the feedback; good or bad, it is amazing that people care enough to send me their two cents.  Most of the feedback has been positive, and the few super negative responses I got were either from someone expecting way too much out of a site that has under 30 hours of work put into it, or from someone not understanding a level and getting frustrated.  Even that type of negative feedback provides good insight for me; I usually go back and change hints or try to make things a little bit more obvious.  To a point, that is; I don't want every level to be super obvious, even with the hints.

The Results


    The results of this project have been much better than I could have anticipated.  Big shout out to the reddit community for spreading the word and getting involved.  I am an avid redditor myself, and it is really nice to be able to post something that others in the community try out and then tell me how they feel about it.  As of this writing the site has been up just under two weeks and has just shy of 20,000 views from 4,000 unique visitors.  Between this and the internal testing done before I had Google Analytics set up, we have just over a million entries sitting in HBase, all from real people generating real traffic.  Over the past day I have been modeling my data to prepare it for machine learning.  I will write a part two to the first post explaining that process, so keep your eyes open.

Final Notes


    Again, I can't stress enough how grateful I am to every user that tried out the site.  Keep your eyes open for updates; I usually add at least one new level every few days.  They are starting to take a bit longer on the coding end as they get more complex, but I will try my best to put work stuff aside and spend some time on the site.  Feel free to contact me with any feedback or questions.  Thanks again and, as always, happy hacking.

Sunday, June 21, 2015

Security Log Analysis And Machine Learning On Hadoop (Part 1 - Ingestion)

Author: Sean Moriarty
Email: smoriarty21@gmail.com


 For my most recent venture I have been working on using the Hadoop ecosystem to set up a sandbox for analyzing log data. I chose Hadoop for log storage for a few reasons, the most dominant being that Hadoop is what I use at work, so I grab any excuse I can to get another install under my belt. On top of this, Hadoop provides a great set of tools for streaming, storing and analyzing log data. It also provides cheap, easy scalability and storage. On the off chance that I generate terabytes of log data, I can easily expand my storage capacity by adding a new datanode to my cluster.
  For my environment I went with the Hortonworks stack for the live system and a raw Hadoop install for development. I chose Hortonworks for a few reasons. Through my work I have had to install and manage Hortonworks, MapR and Cloudera stacks. I find MapR to be by far the fastest stack, but since it is much more resource intensive than its competitors I felt it was not the right way to go. I do like Cloudera a lot, but I tend to prefer Ambari over Cloudera Manager, and I can't help but love supporting Hortonworks due to their heavy dedication to open source, maintaining high quality code and assembling a great team of engineers.
  For ingestion of the log data I use Flume. Flume is a great Apache project for real time data ingestion. It runs as agents you set up on one or more machines. Each agent consists of a source, a channel and a sink. The source is just what it sounds like: your data source. The channel is a passive store that keeps each event until it is delivered to your sink. The sink is your data's final resting place within the agent; this may mean writing your data to HDFS or passing it to another agent via Avro. For a more detailed description of Flume, see the Apache Flume user guide.



  In my setup I have two agents running. The first agent sits on the web server and listens for new entries in my logs. It then sends the data to an Avro sink, which passes it off to an agent running on my Hadoop cluster. The agent on the cluster is configured with an Avro source and an HBase sink. This means that when the agent receives data into its Avro source from the web server's agent, it writes the entry to HBase for long term storage. I also write the data into HDFS as a second archive should anything ever happen to HBase. Below is the configuration file for the agent sitting on the web server.
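  (A representative sketch; the agent name, log file path, hostname and port below are placeholders, so swap in your own values.)

    # Web server agent: tail the access log and forward entries via Avro.
    weblog.sources = tail-source
    weblog.channels = mem-channel
    weblog.sinks = avro-sink

    # Exec source running tail -F on the access log (path is a placeholder).
    weblog.sources.tail-source.type = exec
    weblog.sources.tail-source.command = tail -F /var/log/apache2/access.log
    weblog.sources.tail-source.channels = mem-channel

    # Memory channel holds each event until the sink drains it.
    weblog.channels.mem-channel.type = memory
    weblog.channels.mem-channel.capacity = 10000
    weblog.channels.mem-channel.transactionCapacity = 100

    # Avro sink ships events to the agent running on the Hadoop cluster.
    weblog.sinks.avro-sink.type = avro
    weblog.sinks.avro-sink.channel = mem-channel
    weblog.sinks.avro-sink.hostname = hadoop01.example.com
    weblog.sinks.avro-sink.port = 4141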




  The above is a basic configuration that tails a log file and listens for new entries. When a new entry is detected it is passed into memory, since that is the channel type we have specified. The data is then sent out over our Avro sink and cleared from memory. On the Hadoop side, our agent is configured to listen for the Avro event being passed, write the data that comes over the wire into memory, and send it on to our sinks. In this configuration I have two sinks: one writing the data onto HDFS as a flat file and one writing the data to HBase. The configuration for this end is shown below.
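  (Again a representative sketch; the agent name, port, HDFS path, table and column family are placeholders. Note that feeding two sinks the same events takes two channels, with the source's default replicating selector copying each event into both.)

    # Cluster agent: receive Avro events and fan them out to HDFS and HBase.
    collector.sources = avro-source
    collector.channels = hdfs-channel hbase-channel
    collector.sinks = hdfs-sink hbase-sink

    # Avro source listening for events from the web server agent.
    collector.sources.avro-source.type = avro
    collector.sources.avro-source.bind = 0.0.0.0
    collector.sources.avro-source.port = 4141
    # Default replicating selector copies each event into both channels.
    collector.sources.avro-source.channels = hdfs-channel hbase-channel

    collector.channels.hdfs-channel.type = memory
    collector.channels.hdfs-channel.capacity = 10000
    collector.channels.hbase-channel.type = memory
    collector.channels.hbase-channel.capacity = 10000

    # Flat file archive in HDFS, bucketed by day.
    collector.sinks.hdfs-sink.type = hdfs
    collector.sinks.hdfs-sink.channel = hdfs-channel
    collector.sinks.hdfs-sink.hdfs.path = /flume/weblogs/%Y-%m-%d
    collector.sinks.hdfs-sink.hdfs.fileType = DataStream
    collector.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

    # HBase sink for long term storage; the simple serializer stores the raw line.
    collector.sinks.hbase-sink.type = hbase
    collector.sinks.hbase-sink.channel = hbase-channel
    collector.sinks.hbase-sink.table = weblogs
    collector.sinks.hbase-sink.columnFamily = raw
    collector.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer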




  In a production environment you would want to add one more step to the agent that writes into HBase: serializing your data. In the interest of speed I have broken my data up in Python. The correct means of ingestion would be writing a serializer class in Java that splits log entries before they are inserted into HBase and places each field in the proper column. Ingesting your data this way saves you from having to break it up every time you want to use it. For now I have written a quick Python script that extracts all the fields from each log entry and writes the entries to Hive properly split up. This gives me a static schema, making it much easier to access my data. I then connect to Hive for creating visualizations and for querying data to pass off to a machine learning algorithm.
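  The parsing itself is nothing exotic; at its core it is just a regular expression over the access log format. A minimal sketch, assuming the Apache combined log format (the field names here are my own labels), looks something like this:

    import re

    # Apache "combined" log format: ip, identity, user, time, request, status,
    # size, referrer, user agent.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    def parse_entry(line):
        """Split one raw access log line into named fields, or return None."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    if __name__ == "__main__":
        sample = ('203.0.113.7 - - [21/Jun/2015:13:37:00 -0400] '
                  '"GET /level/3 HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
        print(parse_entry(sample))

  Each parsed entry then becomes one row in the Hive table, so everything downstream can work with clean columns instead of raw strings.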
  With a working pipeline for streaming my logs, it was time to generate some data. I decided to throw together a small CTF style game and allow the world to play. Right now it has six levels ranging from easy to moderate difficulty. This method turned out to work very well, generating a little over 1.2 million log entries in the first 4 days. The site can be found at http://hackthecause.info.


  Check out the site and give the challenges a try to have your log entries added to the data pool. In part two I will discuss modeling the data to maximize the quality of information we can get from it. I will be talking about taking our modeled data and running it through a clustering algorithm in order to find patterns and, more importantly, anomalies in our data. We will learn about looking at a user's data in a manner that gives us a feel for who they are and what they are doing in our system. All IP and related info will be blacked out in screenshots. All log data collected will be used internally only and will never be shared.
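  To give a taste of where that is headed, the idea is to turn each visitor into a small feature vector and let a clustering algorithm group the normal behavior so the outliers stand out. A rough sketch of that idea, using synthetic stand-in data and placeholder feature names rather than the real logs:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic per-visitor features standing in for the real modeled data:
    # requests per minute, distinct paths hit, and fraction of 4xx/5xx responses.
    rng = np.random.RandomState(0)
    normal = rng.normal(loc=[2.0, 5.0, 0.05], scale=[0.5, 2.0, 0.02], size=(500, 3))
    noisy = rng.normal(loc=[40.0, 80.0, 0.6], scale=[5.0, 10.0, 0.1], size=(5, 3))
    X = np.vstack([normal, noisy])

    # Cluster the visitors, then measure how far each one sits from its centroid.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    distances = kmeans.transform(X).min(axis=1)

    # Flag anything beyond the 99th percentile distance as a candidate anomaly.
    threshold = np.percentile(distances, 99)
    print("candidate anomalies:", np.where(distances > threshold)[0])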