Data Mining: How It Works



Aragorn
1st September 2015, 12:02
We've all heard of data mining. All the big corporations do it, from IBM and Microsoft through Google and Amazon to the stock markets and the alphabet-soup agencies. The video below shows how that works, in a roughly 19-minute presentation of the new IBM LinuxONE™ mainframe, which uses Free & Open Source Software in so-called Linux Containers -- for the techies: virtualized userspaces running on top of the same kernel.

It's pretty interesting. It's also pretty scary. :belief:



https://www.youtube.com/watch?v=VWBNoIwGEjo

lcam88
1st September 2015, 12:52
Nice promo for the IBM option.

Aragorn
1st September 2015, 13:12
Nice promo for the IBM option.

Yes, it is a promotional presentation, but I felt it was important enough to share as an illustration of how data mining works. The people making it were obviously thinking about its promotional value, but they neglected its side effect: they have now given us a clear insight into how they mine all our data for use by the corporations, and how effective they are at it.

lcam88
1st September 2015, 13:41
It is interesting indeed. Apache Spark does a lot of the analysis on the live feeds. http://spark.apache.org/

Processing big data is the key here. Spark is supposed to be faster and easier to set up than Hadoop.
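
For a taste of what that looks like from the programmer's side, here is a minimal PySpark sketch -- just the classic word count, not the actual pipeline from the IBM demo, and the input path hdfs:///data/feed.txt is made up for illustration (it assumes a working Spark installation):

from pyspark import SparkContext

sc = SparkContext(appName="toy-word-count")

counts = (sc.textFile("hdfs:///data/feed.txt")          # distributed read of the input
            .flatMap(lambda line: line.lower().split()) # split every line into words
            .map(lambda word: (word, 1))                # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b)            # sum the counts per word
            .takeOrdered(20, key=lambda kv: -kv[1]))    # top 20 most frequent words

for word, n in counts:
    print(word, n)

sc.stop()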

This system isn't for your mere home-grown setup though; it is a compute-cluster solution. It probably occupies more than one 42U rack in a data center, with external storage, probably connected by fiber, as well as a huge pipe to the internet -- perhaps a 500 Mb/s symmetric full-duplex link just for the demo.

EDIT

And that data-mining demo mostly uses open-source software. Commercial software like Google Analytics is already deployed and widely used. There are a couple of companies offering Apache Spark compute power as a cloud service. Amazon has an offering, as does another company whose name I can't remember.

Mining requires a couple of components in the stack, and they are not so easy to understand and set up. These companies are attempting to commoditize big-data analysis...

Aragorn
1st September 2015, 14:37
[...] This system isn't for your mere home-grown setup though; it is a compute-cluster solution. It probably occupies more than one 42U rack in a data center, with external storage, probably connected by fiber, as well as a huge pipe to the internet -- perhaps a 500 Mb/s symmetric full-duplex link just for the demo. [...]

Actually, it's a mainframe. ;)

lcam88
1st September 2015, 14:58
...new IBM LinuxONE™ mainframe, which uses Free & Open Source Software in so-called Linux Containers -- for the techies: virtualized userspaces running on top of the same kernel.

Mainframe indeed, with "Linux containers" that define the cluster nodes...

EDIT

Linux containers -- "virtualized userspace" is something new to me. I just noticed. :)

https://linuxcontainers.org/

Aragorn
1st September 2015, 15:43
Mainframe indeed, with "Linux containers" that define the cluster nodes...

EDIT

Linux containers -- "virtualized userspace" is something new to me. I just noticed. :)

https://linuxcontainers.org/

Well, the technology isn't new. It already existed in FreeBSD as Jails, in Solaris as Zones and in IBM AIX as Workload Partitions -- not to be confused with LPARs ("Logical PARtitions"), which are a technique for carving a machine up into isolated partitions with dedicated processors and memory, each running its own operating system instance. Virtualized userspace can be seen as a more secure and isolated form of a so-called chroot jail, and it is more lightweight than a full-blown virtualization solution where you have a complete guest operating system (including the kernel) running on top of a host operating system, or -- as with Xen or VMware -- on top of a bare-metal hypervisor.
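
To make the comparison concrete, this is all a bare chroot jail really is -- a quick Python sketch of my own, not anything IBM ships; it assumes you run it as root and that /srv/jail (an arbitrary example path) already contains a minimal filesystem tree:

import os

# After chroot(), this process and its children see /srv/jail as "/" -- but there
# is no namespace, cgroup or resource isolation, which is exactly what container
# technologies add on top of this old idea.
os.chroot("/srv/jail")
os.chdir("/")                     # don't keep a working directory outside the jail
os.execv("/bin/sh", ["/bin/sh"])  # replace this process with a shell confined to the jail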

In GNU/Linux, container technology has long existed in the form of independent projects such as Linux-VServer and OpenVZ, but both were applying their own patches to the Linux kernel, so it wasn't part of the upstream source tree. However, quite a while ago as well -- albeit more recently than when VServer and OpenVZ appeared -- upstream Linux development introduced the cgroups ("control groups") framework into the kernel, and it was upon this framework that LXC ("LinuX Containers") was then developed, using some input from the OpenVZ developers.

As I understand it, the delay was not so much due to getting Linux -- the kernel -- to support containers, but rather due to the absence of a suitable userspace toolchain to manage those containers, which is something both OpenVZ and VServer had long been providing in their own distributions. And of course, given that those tools are attuned to the OpenVZ- and/or VServer-specific kernel customizations, they weren't quite going to work with the LXC framework. ;)

The Linux kernel actually supports a greater variety of virtualization solutions out of the box than any other operating system:


It has both dom0 and domU support for running on top of Xen.
It can act as a hypervisor, using either kvm (with qemu) or lguest.
It supports running entirely in userspace on top of another Linux kernel -- this is called "User Mode Linux", or UML for short.

lcam88
1st September 2015, 17:51
Indeed. I've used the chroot jail frequently; Gentoo uses it to bootstrap the installation process.

I also use KVM-type virtualization to run different OSes within a VM; qemu-kvm was the way I went. I haven't played with Xen.

I understand UML to be a way to execute a kernel in userspace... I don't do kernel development, so that hasn't been in my spotlight.

I just installed LXC (Linux Containers) in my CLI-based CentOS 7 environment and got a container up and running. It is different from chroot in that it offers control over network virtualization, CPU allocation and block-device access limits within the environment, at least as far as I can tell. There is probably a way to limit the memory footprint as well...

I can imagine all the documented options have been refined over some period of time; developing high-level management tools for LXC would require that a certain level of stability be reached. Once you consider the scope these tools would need to have -- LVM, networking and the like -- perhaps the bleeding-edge nature of how Linux develops can easily be seen as a drawback. Specialized userland tools are likely to remain CLI-based until a certain level of stability and maturity is reached. Not for the faint-hearted. <shrug/>

It is neat in that it offers a lightweight virtualized environment; it doesn't need a whole installation just for its existence... It is better than KVM as long as you don't need a different OS environment.

As a novice to LXC, and yet having a feeling about these things, a LinuxONE-type cluster _not_ based on a mainframe should be possible... The system leverages the networked nature of the services in the stack, and that aspect seems to suggest a lot of flexibility in the choice of underlying hardware.

EDIT

cgroups -- interesting and arcane. It really is the way to go; +1 for memory limits in containers via cgroups.
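
To make that "+1" concrete, here is roughly what a cgroup memory limit looks like at the filesystem level -- a sketch assuming the cgroup v1 layout that CentOS 7 mounts under /sys/fs/cgroup, root privileges, and an arbitrary group name of "demo"; LXC essentially does this for you under the hood:

import os

# Create a new group under the memory controller and cap it at 256 MiB.
cg = "/sys/fs/cgroup/memory/demo"
os.makedirs(cg, exist_ok=True)

with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
    f.write(str(256 * 1024 * 1024))

# Move the current process into the group; anything it (or its children) allocates
# beyond the limit gets reclaimed or OOM-killed by the kernel.
with open(os.path.join(cg, "tasks"), "w") as f:
    f.write(str(os.getpid()))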

EDIT 2

Integration of nodes across multiple servers is another neat problem I haven't looked at. It is essential for getting scalability out of commodity hardware. Suggestions besides LXD? (https://computing.llnl.gov/tutorials/linux_clusters/)

sandy
1st September 2015, 23:57
:holysheep: :belief: :spinning: :flag: I'm lost !! :ok: :ttr:

bsbray
2nd September 2015, 03:08
This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website, and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like which key words most often occur together.
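
Just to illustrate that last idea (my own toy sketch, not anything from the video): counting which key words occur together across a pile of documents is only a few lines of Python -- the hard part is doing it across millions of pages.

from collections import Counter
from itertools import combinations

# Toy stand-ins for crawled pages; in reality these would be thousands of sites.
documents = [
    "central banks push markets higher on stimulus hopes",
    "markets slide as central banks signal rate changes",
    "stimulus hopes fade while markets wobble",
]

stopwords = {"on", "as", "while", "the", "a"}
pair_counts = Counter()

for doc in documents:
    words = sorted({w for w in doc.lower().split() if w not in stopwords})
    # Count every unordered pair of distinct words appearing in the same document.
    pair_counts.update(combinations(words, 2))

# The most frequent co-occurring pairs hint at the dominant "perspectives".
for pair, n in pair_counts.most_common(5):
    print(pair, n)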

Aragorn
2nd September 2015, 03:50
Suggestions besides LXD?

Well, that depends on what you're looking for. If you're asking about container alternatives (for GNU/Linux), then OpenVZ (https://www.openvz.org/Main_Page) is fairly mature -- it has been around for over a decade already. If you're asking about other scalable virtualization solutions, then Xen (http://www.xenproject.org/) is pretty good, and it supports clustering and live migration -- its documentation sucks, though. If you're asking about a cheap and DIY clustering solution, try the Beowulf Cluster (https://en.wikipedia.org/wiki/Beowulf_cluster). ;)

And lastly, if you're interested in a distributed operating system, try Plan 9 from Bell Labs (https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs), which has recently been released as Free & Open Source Software. ;)





:holysheep: :belief: :spinning: :flag: I'm lost !! :ok: :ttr:

I agree that the geek factor has just been pumped up to 100% in this thread, Sandy, but so long as you got the message from the video presentation in the opening post, then I've attained my objective in posting it in the first place. ;)





This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website, and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like which key words most often occur together.

Well, I'll give you three guesses as to who the players are who can afford to either buy or lease one of these things -- actually two of them, because they'll want redundancy and fault tolerance -- and what they'll be using it for... :ninja:

bsbray
2nd September 2015, 03:54
Well, I'll give you three guesses as to who the players are who can afford to either buy or lease one of these things -- actually two of them, because they'll want redundancy and fault tolerance -- and what they'll be using it for... :ninja:

Hmmm...

NSA
MI5/6
Mossad

Am I getting close? :P

I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.

Aragorn
2nd September 2015, 04:09
Hmmm...

NSA
MI5/6
Mossad

Am I getting close? :P

I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.

Not the NSA, because they have a habit of designing their own -- they are actually in the process of building a brand-new one for several million US dollars right now -- and they are already renting rack space from Amazon to house their supercomputers. They don't necessarily have much use for a mainframe per se. They're more interested in supercomputers -- most notably for cracking encryption -- and supercomputers have also become very good at handling throughput these days. This used to be different: supercomputers were considered better for extreme number crunching, and mainframes better for information consolidation and throughput, but nowadays supercomputers blur that line.

Well, no, apart from the usual suspects among the alphabet soup agencies, I was rather hinting at the financial-economic industry. It's the materialistic and sociopathic Machiavellian's wet dream to own and utilize something like this.

bsbray
2nd September 2015, 04:39
Well, no, apart from the usual suspects among the alphabet soup agencies, I was rather hinting at the financial-economic industry. It's the materialistic and sociopathic Machiavellian's wet dream to own and utilize something like this.

I've read things before about how the stock markets are already tightly controlled, at least in the US. They go up and down at the bankers' whims, though of course they have to maintain some semblance of reality. I've seen anecdotes of large volumes of stock trying to move and being blocked for extended periods of time, apparently simply because it wasn't convenient at that moment for whoever was controlling the stocks.

Just a couple of weeks or so ago stocks shot way down on opening and they just shut trading down. Then again maybe a week ago the market dived about 1000 points within the first few minutes, and they built it back up by about 500 points before closing. The whole thing is a big scam and mind control operation as far as I see it.

lcam88
2nd September 2015, 13:07
Hmmm...

NSA
MI5/6
Mossad

Am I getting close? :P

I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.

+1 to Aragorn's response. The NSA probably has at least an order of magnitude more data to sift through as well.


This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website, and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like which key words most often occur together.

Here is the thing: I cannot think of a single use that does not have some negative aspect to it. Google has gotten as close to "free of evil" as possible with their search system. Clearly the trend, as shown in this promo, is to use computing power to gain a market advantage -- to leverage a position backed by the statistics these systems can compute. But that does not actually produce anything beneficial to anyone [else].


If you're asking about a cheap and DIY clustering solution, try the Beowulf Cluster.

This is getting close to what I was thinking about. I think a Hadoop cluster is more flexible, though, and more what I had in mind. And actually, Hadoop very closely models the Google clustering model, except that it's written in Java instead of C and C++; it's not optimized.

I also looked at D-Bus based IPC; it's not suitable, in that it is designed to coordinate interactions between programs on the same machine. LXD uses D-Bus. A cluster requires a "swarm" or "master -> slaves" type of node model, which the Hadoop and Beowulf architectures sort of have. These exist to address problems that cluster solutions face but that the mainframe hardware implementation doesn't have, as it is a single machine. Plus, the mainframe architecture reduces communication latencies to a level that hard-wired networks can't attain.
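
As a crude sketch of that "master -> slaves" shape -- on a single machine, using only Python's standard library, so it only hints at what Hadoop or a Beowulf setup actually do:

from multiprocessing import Pool

def work(chunk):
    # Each "slave" chews through its own slice of the problem independently.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    n = 1000000
    chunks = [range(i, min(i + 100000, n)) for i in range(0, n, 100000)]
    with Pool(processes=4) as pool:       # the "master" farms the chunks out to workers
        partials = pool.map(work, chunks)
    print(sum(partials))                  # ...and merges the partial results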

lcam88
2nd September 2015, 13:37
:holysheep: :belief: :spinning: :flag: I'm lost !! :ok: :ttr:

My contemplation is how to make 10, 100 or maybe 10,000 computers work together in a "divide and conquer" type strategy, using free software, to solve a complex or difficult problem.

A mainframe is a single powerful computer; the problem IBM solves is how to use all of that computational capacity in an interesting way. Many of the technological solutions for the "divide and conquer" strategy (LXC, virtualization, etc.) exist as free software in Linux and are used by IBM in their offering.

Clustering permits a single computer to become part of a coordinated whole that acts as a larger and more powerful computer system. I'm examining free software solutions that can unite smaller computers so they may be used as one much more powerful computer.

What is scary to remember, besides all of this stuff being free, is the trend we know as Moore's Law. It is an observation/thesis made by an early computer hardware engineer working for Intel, who stated that the power (computational capacity) of a computer doubles every 18 months; this was in the 1960s or '70s (Aragorn certainly will know the exact year :) ).

And now, in 2015, we are talking about freely available software that can unite these units of computational capacity -- which have been doubling every 18 months for the last 40 years -- so that 10,000 of them can work on a single task.
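
Back-of-the-envelope, taking the clean 18-month doubling at face value (it is an idealization, and the real rate has varied):

doublings = 40 * 12 / 18   # about 26.7 doublings in 40 years
growth = 2 ** doublings    # roughly 1.1e8 -- a hundred-million-fold increase
print(round(doublings, 1), f"{growth:.2e}")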

That task may be modeling nuclear events in the study of physics, fluid-dynamics calculations for F1 racing, statistical analysis of social-media trends for a market advantage, or whatever else can by nature be described by the model of complexity (https://en.wikipedia.org/wiki/Complexity).

That is the world we live in.

sandy
3rd September 2015, 02:42
Well, I do think that is a wonderful concept and goal, lcam88. It would unite peoples/communities versus corporations, and it would help offset the all-prevailing surveillance and control grid, and the ensuing transhumanistic bent the world seems to be travelling toward.

I just wish I had a better understanding, and with that more self-trust, to download Linux and replace Microsoft, but it's too scary for a basically self-taught, peck-around illiterate such as I. :)

Here's the funny thing :( I managed a training company (a pilot project funded by employment insurance in '97) that facilitated a grueling course to become a Microsoft Systems Engineer. I did not know how to turn on a computer. The instructors were great and gave me a clone computer with written instructions on emailing and searching the web, so that I would stop saying, "Oh, just print it out for me and skip email, as I'm a hard-copy lady." They caught on to my denial and sent me home with the aforementioned computer and said to have fun, and that should I screw things up, not to worry -- just bring the clone in and they would get the students to fix it... and thus that is how I learned the little I know.

The one thing I remember they used to say often is that DOS is the boss and Windows is the manager, and that they could fix most anything when they went to DOS... It gave me confidence, and soon I was sending and receiving email and surfing the net.

Today, I would be lost without this means of contact, entertainment and education so I am ever grateful even for the little I know. :winner:

lcam88
3rd September 2015, 14:11
Sandy,

There are Linux distributions you can download to a pen drive or burn to a DVD, which let you try them out without interfering with Windows.

Most of the software to do clustering is already freely available, but some of it has usage limitations which make it practical only for very specific tasks... The main obstacle is actually "finding" the hardware.

Aragorn,

Perhaps there is a different way besides clustering in the traditional sense.

http://www.mersenne.org/primes/

The organization above requires large computational capacity to be able to find large prime numbers of the form 2^n - 1, where n is some number of bits. Number 2 on the list above, for example: 2^3 - 1 = 7, represented as 111 in binary notation.

People interested in participating can download and install a bit of code that runs in spare CPU cycles. The program connects to a web service, receives a value N and then verifies whether or not 2^N - 1 is prime.

Maybe a compute-cluster type of system can be defined in a similar way? Sort of like torrent, but with a focus on the CPU cycles available to do something?
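
For the curious, the test GIMPS runs for 2^N - 1 (the Lucas-Lehmer test) is short enough to sketch in a few lines of Python; the hard part is making it fast for N in the tens of millions:

def is_mersenne_prime(n):
    """Lucas-Lehmer test: True if 2**n - 1 is prime (n must be an odd prime)."""
    m = 2 ** n - 1
    s = 4
    for _ in range(n - 2):
        s = (s * s - 2) % m
    return s == 0

# Small cases from the list: 2^3-1=7, 2^5-1=31, 2^7-1=127 are prime; 2^11-1=2047 is not.
for n in (3, 5, 7, 11, 13):
    print(n, is_mersenne_prime(n))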

Aragorn
3rd September 2015, 14:38
Aragorn,

Perhaps there is a different way besides clustering in the traditional sense.

http://www.mersenne.org/primes/

The organization above requires large computational capacity to be able to find large prime numbers of the form 2^n - 1, where n is some number of bits. Number 2 on the list above, for example: 2^3 - 1 = 7, represented as 111 in binary notation.

People interested in participating can download and install a bit of code that runs in spare CPU cycles. The program connects to a web service, receives a value N and then verifies whether or not 2^N - 1 is prime.

Maybe a compute-cluster type of system can be defined in a similar way? Sort of like torrent, but with a focus on the CPU cycles available to do something?

Actually, nVidia's CUDA technology already provides for a small-scale type of clustering, by using the GPU on the graphics adapter card for general-purpose floating-point operations. And in addition to that, a team of Belgian university students built a supercomputer a few years ago, housed in a single full-tower chassis but with multiple graphics adapter cards, each of which was used as an independent node in the cluster.
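
A small taste of that in Python, using Numba's CUDA support -- assuming an nVidia GPU and the CUDA toolkit are available; it's only the "hello world" of GPGPU, not a cluster:

import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)             # global index of this GPU thread
    if i < out.size:
        out[i] = a[i] + b[i]     # each thread handles one element

n = 1000000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros(n, dtype=np.float32)

threads = 256
blocks = (n + threads - 1) // threads
vector_add[blocks, threads](a, b, out)   # Numba copies the arrays to and from the GPU

print(out[:3], (a + b)[:3])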

One thing that's extremely suitable for this is the ARM architecture, and particularly something like the Raspberry Pi. They're small, they're inexpensive, and they're easy to come by. There are videos on YouTube of guys who've built such a supercomputer, housed in a normal desktop tower chassis. Of course, ARM is not x86, so you'd need a different binary installation, or you'd have to build everything from scratch using Gentoo or LFS.

You know, IBM, Toshiba and Sony developed an ideal clustering processor somewhere in the past decade. It was called the Cell processor (https://en.wikipedia.org/wiki/Cell_%28microprocessor%29), and it comprised a complete IBM POWER RISC core in conjunction with -- I believe -- 8 additional floating-point cores, each of those 8 cores having its own 256 KiB of local store right on the die. However, for some reason, the design never really got off the ground, except in Sony's PlayStation 3.

lcam88
3rd September 2015, 15:10
If I remember correctly... :)

The shader units in GPUs are not quite as precise as the FPU on a CPU, but they are sufficient for calculating pixel shading. In that way they are very specialized.

And in general, GPUs operate with a large amount of parallelism. They perform many similar operations that are not very dependent on each other at the same time, such as pixel shading, but that specialization of task makes them rather poor at the more general tasks a CPU typically performs. That just means the resulting computing capacity has a narrower scope of application.

I did read about the Cell processor; Intel also developed something similar. The PS3 chip actually packs 8 of those floating-point cores on the die alongside the main core (one disabled for yield, and another reserved for the system software); they have a type of "token ring"-style bus to move data from one core to the next... The idea was to create a type of processing pipeline where the output of one core serves as input to one or more other cores.

The specialization of this type of architecture detracts from general use, where workloads don't conform well to the type of problems that inspired the initial design. In a way, that is the same problem the GPU-based systems have.

System-on-chip is actually very interesting, and ARM is an interesting architecture / instruction set. x86 uses some of the strategies ARM does, but to a lesser extent; they both use microcode (hardware instructions that implement a "native" machine instruction). ARM has logic blocks that are more general-purpose, which are then used to implement these micro-ops, while x86 has logic blocks that are more specialized; that view is the basis of the idea that ARM would be more efficient, because less chip area would need to be powered on but idle than on the x86 counterpart.

In reality, there are other aspects of chip design that weigh into efficiency: production capabilities, transistor size, and so on.

lcam88
16th October 2015, 12:59
An interesting thread on Slashdot about how the NSA breaks so much standard asymmetric encryption.

http://it.slashdot.org/story/15/10/15/1957259/how-is-the-nsa-breaking-so-much-crypto

EDIT

Perhaps this is actually for the other thread on computer security. Pardon me for the error.

Dumpster Diver
2nd December 2015, 18:33
Hmmm...

NSA
MI5/6
Mossad

Am I getting close? :P

I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.

One does not have to be large to be able to apply machine-learning techniques effectively. We do "big data" modeling on what are essentially desktop computers you can buy in a computer store; $2-3K sets you up for most problems. Now, certain problems (like weather forecasting) are very large, but most "big data" issues can be handled "at home" by partitioning the data through proper data-set sampling. BUT! You must have the data in the first place. I'm "told" that the NSA excels at this.
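
For what it's worth, the "partitioning by proper sampling" part is often just a reservoir sample, so the working set fits on a desktop. A rough sketch -- the file name huge_dataset.csv is made up:

import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # keep the new item with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

# Stream a file far too big for RAM and keep only 100,000 rows for modeling.
with open("huge_dataset.csv") as f:
    subset = reservoir_sample(f, 100000)
print(len(subset))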