Monday, May 18, 2015

I guess I can't really say it's a hobby

I can't say it's a hobby because I sure have been spending a lot of time on it (this will be the third class I actually get a certificate for in this specialization), but I really like data mining. It's another "thing" you can say I'm learning and interested in.
I know, weird, because it's mostly used by economists, bioinformatics people or computer AI people, but I have an interest in it and, well, this is my third class. It's a fair amount of math, and apparently it sits (I found out today) in a sector between engineering and computer programming.
I think there is also something topographically interesting in it for me (in a VFX point cloud sort of way?).
Hmm...fascinating, because my prof was telling me I'd have to choose between mechanics, electronics and programming, and I told him I like both electronics and programming. So it's like they sort of line up with this, too.

In any case, it's pretty challenging, so even though I'm getting a certificate for this class, I've honestly just been working through the examples. (I'm probably going to get about an 80 in this class, but I could get 100. It's an online class, so it won't affect any paper grades or anything like that; I've just been doing this on my own.) I want to understand it thoroughly, so I may retake it (one class leads to another, which leads to another, etc.). I'm also part of a study group, which hasn't started yet.

But essentially, you take data from a sample and divide it into a training set and a testing set. You work with the training set and devise an algorithm based on it that is pretty accurate (but not tuned so closely that it overfits the data, as they call it). From there, you apply it to your test set, see how well it matches up, and measure what sort of error or inaccuracy your predictions have.
That's essentially what it is: using data you already have to make predictions about data you haven't seen yet.
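Roughly, the workflow looks like the sketch below. This is just my own illustration in Python with scikit-learn (not the class's actual code), and the data here is synthetic, made up only for the example.

```python
# Sketch of the train/test workflow: split the data, fit a simple model on the
# training set, then check how its predictions hold up on the unseen test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data; in the class this would come from the provided files.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 30 percent of the sample as the test set, train on the remaining 70 percent.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # devise the algorithm on the training set

predictions = model.predict(X_test)    # apply it to the test set
print("test accuracy:", accuracy_score(y_test, predictions))
```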

Actually, years ago, Netflix ran a competition, with a million-dollar prize, to see who could improve the accuracy of its recommendation algorithm by ten percent, and it took teams years to get there. Companies use this to estimate, for example, the chances that someone will click on an ad, or to predict whether a person is likely to develop cancer.
This same concept is used to determine, say, whether your email gets filtered as spam or not.

For example, we can determine whether something is more likely to be spam by the frequency of capitalized letters; chances are that if it is almost all caps, it is spam, so we can filter messages and label them as such based on criteria like these. Or, if it over-uses certain characters (exclamation marks, say), the likelihood that it is spam is also higher.
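Just to illustrate the idea, here is a toy rule I made up; the thresholds and the characters checked are my own placeholders, not the actual filter from the class.

```python
# Toy illustration: flag a message as spam if too much of it is capitalized,
# or if it over-uses certain characters like "!" and "$".
# The thresholds below are made up for the example.
def looks_like_spam(text, caps_threshold=0.5, char_threshold=3):
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    spammy_chars = sum(text.count(c) for c in "!$")
    return caps_ratio > caps_threshold or spammy_chars >= char_threshold

print(looks_like_spam("WIN A FREE CRUISE!!! CALL NOW!!!"))      # True
print(looks_like_spam("Hi Mom, see you at dinner on Sunday."))  # False
```

A real filter would learn those cutoffs from the training data rather than hard-coding them, but the idea is the same.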

Anyways, I find it fascinating, but I'm still learning, working off of some tutorials and work that others have done, as well as reading up on my own. Again, I'm probably going to get a certificate in the class I am taking, but I'd like to get 100 percent in the future and will probably retake it before going on to further parts of the programme.

Here are some of the charts I was able to make (I cropped the X and Y labels, etc).

We were given data in basically Excel-type files. The first graph is a map of the data. The actual data was over 19,000 samples, so you set aside a portion of it (usually about 30 percent of the sample) for testing. We then split it and run different tests, slice the data, as they say, and work out how best we can separate or see patterns within the data. From there, we can determine certain things about it.
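For anyone curious, that kind of exploration looks roughly like this. Again, it's just a sketch in Python with pandas and scikit-learn; the file name and column names are placeholders, not the actual class files.

```python
# Load the data (assuming a CSV export of the Excel-type file), set aside a
# test portion, and slice the training portion to look for patterns.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("samples.csv")        # placeholder path; ~19,000 rows in the real data
print(data.shape)                        # how many samples and columns we have

# Hold out roughly 30 percent for testing, keep the rest for training.
train, test = train_test_split(data, test_size=0.3, random_state=1)

# Slice the training data by label and compare summary statistics.
print(train.groupby("label").describe())  # "label" is a placeholder column name

# A quick scatter plot of two columns, colored by label (like the cropped charts).
ax = train.plot.scatter(x="feature_1", y="feature_2",
                        c="label", colormap="viridis")
ax.figure.savefig("scatter.png")
```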

I hope this is not too nerdy. lol.

1 comment:

  1. Dang! I can see you making a fortune working as a consultant doing this!
