Friday, June 26, 2015

Hack The Cause

Author:    Sean Moriarty
Project:    http:/


    A few weeks ago I was asked to start working on a project involving security log analysis via machine learning.  I had already gained a bit of experience with log streaming and analysis by running a HIDS at home, but using machine learning for analysis was new to me.  I am aware this has been done before, but my long-term goal is to optimize algorithms specifically for threat detection and mitigation.  Again, I am aware this is nothing new; I am just hoping I can do it better, without insanely expensive hardware, on an open source platform.
    The first problem that needed to be tackled before this project could really start was getting data.  I needed some form of log data that I could begin to model and write my algorithms around.  I personally find it very tough to write machine learning algorithms with no data to feed them.  I decided to use Apache Flume for streaming my log data.  If you are interested in how Flume works, take a look at my post about setting up Flume to stream log data.  The first prototype version was a static page with the text "I'm a teapot" displayed on it.  I then wrote a quick Python script to send requests to the server at random time intervals.
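The original traffic script is not shown here, so the following is only a minimal sketch of what a "requests at random intervals" generator could look like, using nothing but the Python standard library. The function names, the delay bounds and the example URL are all assumptions made for illustration.

```python
import random
import time
import urllib.error
import urllib.request


def random_delays(count, min_seconds=0.5, max_seconds=5.0, rng=random):
    """Return `count` wait times drawn uniformly from [min_seconds, max_seconds]."""
    return [rng.uniform(min_seconds, max_seconds) for _ in range(count)]


def fetch_status(url):
    """Request `url` and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still produce a log entry on the server,
        # so report the code instead of aborting the run.
        return err.code


def send_requests(url, delays, fetch=fetch_status, sleep=time.sleep):
    """Wait for each delay in turn, then request `url`; return the status codes."""
    statuses = []
    for delay in delays:
        sleep(delay)
        statuses.append(fetch(url))
    return statuses


# Example call (hypothetical address, point it at your own test server):
#   send_requests("http://localhost:8080/", random_delays(10))
```

Passing `fetch` and `sleep` in as parameters is a design choice for the sketch: it lets the loop be exercised without a live server or real waiting.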
    This setup was great for prototyping my environment.  It gave me a pipeline to test my Flume configuration and ensure that for each hit on the site the log entry was sent to my Hadoop cluster and written to HBase for later retrieval and analysis.  It also provided a great chance to load test the pipeline and ensure Flume was going to be reliable.  Once we start talking about writing machine learning algorithms, though, this data will not provide us with very useful information.  If we want to write algorithms that will catch real bad guys, we need some real data, preferably from a server with a decent amount of normal daily traffic and some attempted breaches.  Solving the problem of getting data of this nature was a little more fun.

The Coding Begins

    After taking a nice walk and thinking about solutions to my problem, I decided I wanted to write a CTF, open it up to the world and encourage everyone to tinker away.  I had been meaning to teach myself CodeIgniter; in the past I have used Zend heavily, and I felt it was time to try something new and see how I liked it.  After a sleepless night of coding and scanning through the CodeIgniter documentation I had a working site.  The very first version was released with four beginner levels.  I bought the domain, spun up a web server and did a soft release, announcing the site to a handful of friends who are into cyber security.  The soft launch went very well, with neither end of my pipeline crashing over the few days.  I then announced the CTF publicly and continued to put an hour in here and there as I had free time over the next week or so (I wish I could spend more time on it but work is pulling me a million different ways).  I added a few more levels a night and improved a few of the existing levels based on user feedback.  The site now has nine practice challenges, with more to come as soon as I get some free time.
    On a side note, I would like to thank the first handful of people who helped test and gave feedback.  For the first few days the site was live it had a few really lame levels.  The SQLi stuff was all simulated using regex, but I quickly realized how silly that was and changed to a real live database being exploited.  Another note: I understand that the levels can be played in any order by just typing the URL in by hand.  This was intentional; I want people to be able to skip around and play whatever levels they think are most challenging and fun for their personal style and interests.  I also realize that you can just use the JS console to call the function that displays the message for capturing a flag.  I will fix this eventually; I just have not had the time and it is low priority in my mind.  You get out of a site like this what you put into it.  If you just want to call the function nine times and call it a day, that is your m.o. and I won't try to stop you.  I really do appreciate the feedback, good or bad; it is amazing that people care enough to send me their two cents.  Almost all of the feedback has been positive, and the few super negative responses I got were either someone expecting way too much out of a site that has under 30 hours' worth of work put into it or someone not understanding a level and getting frustrated.  Even that type of negative feedback provides good insight for me; I usually go back and change hints or try to make things a little bit more obvious.  To a point, that is: I don't want every level to be super obvious, even with the hints.
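To illustrate why simulating SQLi with a regex falls short of a real database, here is a small sketch. It is not the site's code (the site is PHP and its levels are not shown here); the pattern, the table layout and the function names are all made up for the example, and an in-memory SQLite database stands in for the real one.

```python
import re
import sqlite3

# Hypothetical pattern standing in for a "simulated" SQLi check:
# it only recognizes the classic ' OR 1=1 payload.
SIMULATED_CHECK = re.compile(r"'\s*OR\s+1\s*=\s*1", re.IGNORECASE)


def simulated_login(password):
    """Pretend the injection worked only if the input matches the one pattern."""
    return bool(SIMULATED_CHECK.search(password))


def make_db():
    """Create a throwaway in-memory database with a single user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")
    return conn


def vulnerable_login(conn, password):
    """Deliberately vulnerable: user input is concatenated into the SQL text."""
    query = (
        "SELECT name FROM users WHERE name = 'admin' AND password = '"
        + password
        + "'"
    )
    return conn.execute(query).fetchone() is not None


conn = make_db()
for payload in ["guess", "' OR 1=1 --", "' OR 'a'='a"]:
    print(repr(payload), simulated_login(payload), vulnerable_login(conn, payload))
```

The last payload is a perfectly good injection against the real query, but the regex version rejects it because it does not match the one expected string. A real database accepts whatever actually works, which is presumably what makes the levels feel less lame.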

The Results

    The results of this project have been much better than I could have anticipated.  Big shout out to the Reddit community for spreading the word and getting involved.  I myself am an avid redditor, and it is really nice to be able to post something that others in the community try out and then let me know how they feel about it.  As of writing this, the site has been up just under two weeks.  The site at this moment has just shy of 20,000 views from 4,000 unique visitors.  Between this and the internal testing that was done before I had Google Analytics set up, we have just over a million entries sitting in HBase, all from real people generating real traffic.  Over the past day I have been modeling my data to prepare it for machine learning.  I will write a part two to the first post explaining that process, so keep your eyes open.
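The modeling step itself is left for part two, so the following is only a rough sketch of what "preparing log data for machine learning" can mean in practice. It assumes the entries are web server access log lines in Apache's combined log format, which the post does not confirm, and the chosen features are hypothetical examples rather than the ones used in this project.

```python
import re
from datetime import datetime

# Assumed input: Apache "combined" access log lines.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a dict of typed fields, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


def to_features(row):
    """Derive a few numeric features (illustrative choices only)."""
    path = row["path"]
    return {
        "hour": row["time"].hour,
        "status_class": row["status"] // 100,
        "path_length": len(path),
        # A raw or URL-encoded single quote is a crude hint of an SQLi attempt.
        "has_quote": int("'" in path or "%27" in path),
        "size": row["size"],
    }
```

Each parsed line becomes a small dictionary of numbers, which is the kind of shape most learning libraries expect once the dictionaries are stacked into rows.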

Final Notes

    Again, I can't stress enough how grateful I am for every user that tried out the site.  Keep your eyes open for updates; I usually add at least one new level every few days.  They are starting to take a bit longer on the coding end as they get more complex, but I will try my best to put work stuff aside and spend some time on the site.  Feel free to contact me with any feedback or questions.  Thanks again and, as always, happy hacking.

