
Thread: Data Mining: How It Works

  1. #1
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts

    Data Mining: How It Works

    We've all heard of data mining. All the big corporations do it, from IBM and Microsoft through Google and Amazon to the stock markets and the alphabet soup agencies. The video below shows you how it works, in a roughly 19-minute presentation of the new IBM LinuxONE™ mainframe, which uses Free & Open Source Software in so-called Linux Containers -- for the techies: virtualized userspaces running on top of the same kernel.

    It's pretty interesting. It's also pretty scary.


    = DEATH BEFORE DISHONOR =

  2. The Following 6 Users Say Thank You to Aragorn For This Useful Post:

    Bob (2nd September 2015), bsbray (1st September 2015), Frances (1st September 2015), lcam88 (1st September 2015), pointessa (16th October 2015), RealityCreation (3rd September 2015)

  3. #2
    Retired Member
    Join Date
    10th June 2015
    Posts
    1,009
    Thanks
    2,129
    Thanked 3,244 Times in 922 Posts
    Nice promo for the IBM option.

  4. The Following 3 Users Say Thank You to lcam88 For This Useful Post:

    Aragorn (1st September 2015), bsbray (1st September 2015), sandy (1st September 2015)

  5. #3
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts
    Quote Originally posted by lcam88 View Post
    Nice promo for the IBM option.
    Yes, it is a promotional presentation, but I felt it was important enough to share as an illustration of how data mining works. The people making that presentation were obviously thinking about its promotional value, but they overlooked a side effect: they've now given us a clear insight into how they are mining all our data for use by the corporations, and how effective they are at it.
    = DEATH BEFORE DISHONOR =

  6. The Following 5 Users Say Thank You to Aragorn For This Useful Post:

    bsbray (1st September 2015), Frances (16th October 2015), lcam88 (1st September 2015), pointessa (16th October 2015), sandy (1st September 2015)

  7. #4
    Retired Member
    Join Date
    10th June 2015
    Posts
    1,009
    Thanks
    2,129
    Thanked 3,244 Times in 922 Posts
    It is interesting indeed. Apache Spark does a lot of the analysis of the live feeds. http://spark.apache.org/

    Processing big data is the key here. Spark is supposed to be faster and easier to set up than Hadoop.
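
    To make the "processing big data" part a bit more concrete, here's a minimal PySpark sketch -- purely illustrative; the file path and the local-master setting are placeholders of mine, not anything from the demo. It counts keyword frequencies across a pile of text files, the sort of job Spark spreads across a cluster:

    Code:
    # Illustrative PySpark job: count word frequencies in a batch of text files.
    # The input path is a made-up placeholder.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("keyword-count").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Load every text file under the (hypothetical) data directory.
    lines = sc.textFile("hdfs:///data/feeds/*.txt")

    counts = (lines.flatMap(lambda line: line.lower().split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .sortBy(lambda kv: kv[1], ascending=False))

    for word, n in counts.take(20):
        print(word, n)

    sc.stop()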

    This system isn't for your mere home-grown setup, though; it is a compute-cluster solution. It probably occupies more than one 42U rack in a data center, with external storage, probably connected by fiber, as well as a huge pipe to the internet -- perhaps a 500 Mbit/s symmetric, full-duplex link just for the demo.

    EDIT

    And that data-mining demo mostly uses open-source software. Commercial software like Google Analytics is already deployed and widely used. There are a couple of companies offering Apache Spark compute power as a cloud service. Amazon has one, as does another company whose name I can't remember.

    Mining requires a couple of components in the stack, and they are not so easy to understand and set up. These companies are attempting to commoditize big-data analysis...
    Last edited by lcam88, 1st September 2015 at 13:48.

  8. The Following 4 Users Say Thank You to lcam88 For This Useful Post:

    Aragorn (1st September 2015), bsbray (1st September 2015), Frances (16th October 2015), sandy (1st September 2015)

  9. #5
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts
    Quote Originally posted by lcam88 View Post
    [...] This system isn't for your mere home-grown setup, though; it is a compute-cluster solution. It probably occupies more than one 42U rack in a data center, with external storage, probably connected by fiber, as well as a huge pipe to the internet -- perhaps a 500 Mbit/s symmetric, full-duplex link just for the demo. [...]
    Actually, it's a mainframe.
    = DEATH BEFORE DISHONOR =

  10. The Following 2 Users Say Thank You to Aragorn For This Useful Post:

    bsbray (1st September 2015), lcam88 (1st September 2015)

  11. #6
    Retired Member
    Join Date
    10th June 2015
    Posts
    1,009
    Thanks
    2,129
    Thanked 3,244 Times in 922 Posts
    Quote Originally posted by Aragorn
    ...new IBM LinuxONE™ mainframe, which uses Free & Open Source Software in so-called Linux Containers -- for the techies: virtualized userspaces running on top of the same kernel.
    Mainframe indeed, with "Linux containers" that define the cluster nodes...

    EDIT

    Linux containers -- "virtualized userspace" is something new to me. I just noticed.

    https://linuxcontainers.org/
    Last edited by lcam88, 1st September 2015 at 15:06.

  12. The Following 2 Users Say Thank You to lcam88 For This Useful Post:

    Aragorn (1st September 2015), bsbray (1st September 2015)

  13. #7
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts
    Quote Originally posted by lcam88 View Post
    Mainframe indeed, with "Linux containers" that define the cluster nodes...

    EDIT

    Linux containers -- "virtualized userspace" is something new to me. I just noticed.

    https://linuxcontainers.org/
    Well, the technology isn't new. It already existed in FreeBSD as Jails, in Solaris as Zones and in IBM AIX as Workload Partitions -- not to be confused with LPAR ("Logical PARtitions"), which is a technique for dividing a physical machine into separate logical machines, each with its own dedicated processors, memory and operating system instance. Virtualized userspace can be seen as a more secure and isolated form of a so-called chroot jail, and as more lightweight than a full-blown virtualization solution where you have a complete guest operating system (including its kernel) running on top of a host operating system, or -- as with Xen or VMware -- on top of a bare-metal hypervisor.

    In GNU/Linux, container technology has long existed in the form of independent projects such as vServer and OpenVZ, but both applied their own patches to the Linux kernel, so the functionality wasn't part of the upstream source tree. However, quite a while ago as well -- albeit more recently than when vServer and OpenVZ appeared -- upstream Linux development introduced the cgroups ("control groups") framework into the kernel, and it was upon this framework that LXC ("LinuX Containers") was then developed, with some input from the OpenVZ developers.
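
    For the techies again, here is roughly what the cgroups mechanism looks like from userspace. This is only a bare-bones sketch, assuming a cgroup-v1 memory hierarchy mounted in the usual place and root privileges; the group name is made up:

    Code:
    # Bare-bones cgroup v1 illustration: cap a process group at 256 MiB of RAM.
    # Needs root and a mounted /sys/fs/cgroup/memory hierarchy.
    import os

    CGROUP = "/sys/fs/cgroup/memory/demo"
    os.makedirs(CGROUP, exist_ok=True)

    # Set the memory limit for the group.
    with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(str(256 * 1024 * 1024))

    # Move the current process (and everything it spawns) into the group.
    with open(os.path.join(CGROUP, "tasks"), "w") as f:
        f.write(str(os.getpid()))

    # From here on, allocations beyond the limit are refused or OOM-killed --
    # the same mechanism LXC uses under the hood.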

    As I understand it, the delay was not so much in getting Linux -- the kernel -- to support containers, but rather in the absence of a suitable userspace toolchain to manage those containers, which is something both OpenVZ and vServer had long been providing in their own projects. And of course, given that those tools are attuned to the OpenVZ- and/or vServer-specific kernel customizations, they weren't going to work with the LXC framework.

    The Linux kernel actually supports a greater variety of virtualization solutions out of the box than any other operating system:

    • It has both dom0 and domU support for running on top of Xen.
    • It can act as a hypervisor, using either kvm (with qemu) or lguest -- see the sketch below this list.
    • It supports running entirely in userspace on top of another Linux kernel -- this is called "User Mode Linux", or UML for short.
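
    And to illustrate the kvm (with qemu) case from the list above: booting a guest with KVM acceleration, driven from a script. This is illustrative only -- the disk image is a placeholder, and it assumes qemu-system-x86_64 is installed and /dev/kvm is available:

    Code:
    # Illustrative only: launch a KVM-accelerated guest through QEMU.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64",
        "-enable-kvm",              # use the kernel's KVM hypervisor support
        "-m", "1024",               # 1 GiB of guest RAM
        "-smp", "2",                # 2 virtual CPUs
        "-drive", "file=guest.qcow2,format=qcow2",   # placeholder disk image
        "-nographic",               # serial console instead of a graphical window
    ], check=True)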
    = DEATH BEFORE DISHONOR =

  14. The Following 2 Users Say Thank You to Aragorn For This Useful Post:

    bsbray (1st September 2015), lcam88 (1st September 2015)

  15. #8
    Retired Member
    Join Date
    10th June 2015
    Posts
    1,009
    Thanks
    2,129
    Thanked 3,244 Times in 922 Posts
    Indeed. I've used the chroot jail frequently; Gentoo uses it to bootstrap the installation process.

    I also use KVM-type virtualization to run different OSes within a VM. qemu-kvm was the way I went. I haven't played with Xen.

    I understand UML to be a way to execute a kernel in userspace... I don't do kernel development, so that hasn't been on my radar.

    I just installed LXC (Linux containers) in my CLI-based CentOS 7 environment and got a container up and running. It is different from chroot in that it offers control over network virtualization, CPU allocation and block-device access limits within the environment, at least as far as I can tell. There is probably a way to limit the memory footprint as well...
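
    For what it's worth, here's roughly how those knobs can be poked from a script with the LXC 1.x command-line tools as shipped on CentOS 7. Take it as a sketch rather than gospel: the container name is made up and it has to be running already.

    Code:
    # Sketch: adjust cgroup limits of a running LXC container via lxc-cgroup.
    # Container name is hypothetical; values are arbitrary examples.
    import subprocess

    def lxc_cgroup(container, key, value):
        """Set a cgroup value for a running container."""
        subprocess.run(["lxc-cgroup", "-n", container, key, value], check=True)

    name = "demo"
    lxc_cgroup(name, "memory.limit_in_bytes", "512M")   # cap RAM at 512 MiB
    lxc_cgroup(name, "cpuset.cpus", "0-1")              # pin to CPUs 0 and 1
    lxc_cgroup(name, "blkio.weight", "500")             # lower block-I/O priority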

    I can imagine all the documented options have been refined over some period of time; developing high-level management tools for LXC would require that a certain level of stability be reached. Once you consider the scope these tools would need to have -- LVM, networking and such -- the bleeding-edge nature of how Linux develops can easily be seen as a drawback. Specialized userland tools are likely to remain CLI-based until a certain level of stability and maturity is reached. Not for the faint-hearted. <shrug/>

    It is neat in that it offers a lightweight virtualized environment; it doesn't need a whole operating system installation just for its existence... It is better than KVM as long as you don't need a different OS environment.

    I'm a novice with LXC, but I have a feeling about these things: a LinuxONE-type cluster _not_ based on a mainframe should be possible... The system leverages the networked nature of the services used by the stack, and that aspect seems to suggest a lot of flexibility in the choice of underlying hardware.

    EDIT

    cgroups -- interesting and arcane. It really is the way to go; +1 for memory limits in containers via cgroups.

    EDIT 2

    Integration of nodes across multiple servers is another neat problem I haven't looked at. It is essential for achieving scalability with commodity hardware. Suggestions besides LXD? (https://computing.llnl.gov/tutorials/linux_clusters/)
    Last edited by lcam88, 1st September 2015 at 19:24.

  16. The Following 2 Users Say Thank You to lcam88 For This Useful Post:

    Aragorn (2nd September 2015), bsbray (2nd September 2015)

  17. #9
    Retired Member
    Join Date
    22nd September 2013
    Posts
    1,141
    Thanks
    15,854
    Thanked 7,406 Times in 1,137 Posts
    :spinning: I'm lost !!

  18. The Following 5 Users Say Thank You to sandy For This Useful Post:

    Aianawa (2nd September 2015), Aragorn (2nd September 2015), bsbray (2nd September 2015), genevieve (2nd September 2015), lcam88 (2nd September 2015)

  19. #10
    (account terminated) United States
    Join Date
    16th January 2015
    Location
    Au dela
    Posts
    2,901
    Thanks
    17,558
    Thanked 12,648 Times in 2,895 Posts
    This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

    Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like the most common keywords occurring together.
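
    Just to make the idea concrete, here's a toy sketch -- nothing more than an illustration, with a hard-coded stand-in for the crawled pages -- of counting which keywords occur together most often:

    Code:
    # Toy keyword co-occurrence counter; real software would add crawling,
    # stop-word filtering and much better text processing.
    from collections import Counter
    from itertools import combinations

    pages = [
        "data mining predicts consumer behaviour in real time",
        "real time data feeds drive market prediction models",
        "public opinion tracking uses data mining on social feeds",
    ]

    pair_counts = Counter()
    for text in pages:
        words = sorted(set(text.lower().split()))
        pair_counts.update(combinations(words, 2))

    for (a, b), n in pair_counts.most_common(5):
        print(a, "+", b, ":", n)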

  20. The Following 2 Users Say Thank You to bsbray For This Useful Post:

    Aragorn (2nd September 2015), lcam88 (2nd September 2015)

  21. #11
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts
    Quote Originally posted by lcam88 View Post
    Suggestions besides LXD?
    Well, that depends on what you're looking for. If you're asking about container alternatives (for GNU/Linux), then OpenVZ is fairly mature -- it has been around for over a decade already. If you're asking about other scalable virtualization solutions, then Xen is pretty good, and it supports clustering and live migration -- its documentation sucks, though. If you're asking about a cheap, DIY clustering solution, try a Beowulf cluster.
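
    To give an idea of what work on such a cluster looks like, here is a minimal MPI sketch using the mpi4py bindings -- purely illustrative, and it assumes an MPI implementation and mpi4py are installed on every node:

    Code:
    # Minimal Beowulf-style job: every rank sums its own slice of 0..999999
    # and rank 0 collects the grand total. Launch with e.g.:
    #   mpirun -np 4 python partial_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank handles every size-th number, starting at its own rank.
    total = sum(range(rank, 1000000, size))

    # Combine the partial sums onto rank 0.
    grand_total = comm.reduce(total, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum of 0..999999 =", grand_total)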

    And lastly, if you're interested in a distributed operating system, try Plan 9 from Bell Labs, which has recently been released as Free & Open Source Software.





    Quote Originally posted by sandy View Post
    :spinning: I'm lost !!
    I agree that the geek factor has just been pumped up to 100% in this thread, Sandy, but so long as you got the message from the video presentation in the opening post, then I've attained my objective in posting it in the first place.





    Quote Originally posted by bsbray View Post
    This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

    Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like the most common keywords occurring together.
    Well, I'll give you three guesses as to who the players are who can afford to either buy or lease one of these things -- actually two of them, because they'll want redundancy and fault tolerance -- and what they'll be using it for...
    = DEATH BEFORE DISHONOR =

  22. The Following 3 Users Say Thank You to Aragorn For This Useful Post:

    bsbray (2nd September 2015), lcam88 (2nd September 2015), sandy (3rd September 2015)

  23. #12
    (account terminated) United States
    Join Date
    16th January 2015
    Location
    Au dela
    Posts
    2,901
    Thanks
    17,558
    Thanked 12,648 Times in 2,895 Posts
    Quote Originally posted by Aragorn View Post
    Well, I'll give you three guesses as to who the players are who can afford to either buy or lease one of these things -- actually two of them, because they'll want redundancy and fault tolerance -- and what they'll be using it for...
    Hmmm...

    NSA
    MI5/6
    Mossad

    Am I getting close? :P

    I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.

  24. The Following 2 Users Say Thank You to bsbray For This Useful Post:

    Aragorn (2nd September 2015), lcam88 (2nd September 2015)

  25. #13
    Administrator Aragorn's Avatar
    Join Date
    17th March 2015
    Location
    Middle-Earth
    Posts
    20,308
    Thanks
    88,708
    Thanked 81,142 Times in 20,323 Posts
    Quote Originally posted by bsbray View Post
    Hmmm...

    NSA
    MI5/6
    Mossad

    Am I getting close? :P

    I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.
    Not the NSA, because they have a habit of designing their own -- they are actually in the process of building a brand-new one for several million US dollars right now -- and they are already renting rack space from Amazon to house their supercomputers. They don't necessarily have much use for a mainframe per se. They're more interested in supercomputers -- most notably for cracking encryption -- and supercomputers are also already very good at handling throughput these days. That used to be different: supercomputers were considered better for extreme number-crunching, and mainframes were considered better for information consolidation and throughput. Nowadays, supercomputers blur the line, though.

    Well, no, apart from the usual suspects among the alphabet soup agencies, I was rather hinting at the financial-economic industry. It's the materialistic and sociopathic Machiavellian's wet dream to own and utilize something like this.
    = DEATH BEFORE DISHONOR =

  26. The Following 2 Users Say Thank You to Aragorn For This Useful Post:

    bsbray (2nd September 2015), lcam88 (2nd September 2015)

  27. #14
    (account terminated) United States
    Join Date
    16th January 2015
    Location
    Au dela
    Posts
    2,901
    Thanks
    17,558
    Thanked 12,648 Times in 2,895 Posts
    Quote Originally posted by Aragorn View Post
    Well, no, apart from the usual suspects among the alphabet soup agencies, I was rather hinting at the financial-economic industry. It's the materialistic and sociopathic Machiavellian's wet dream to own and utilize something like this.
    I've read things before about how the stock markets are already tightly controlled, at least in the US. They go up and down to the bankers' whims, though of course they have to maintain some semblance of reality. I've seen anecdotes of large volumes of stock trying to move and being blocked for extended periods of time, apparently simply because it wasn't convenient at that time for whoever was controlling stocks.

    Just a couple of weeks or so ago stocks shot way down on opening and they just shut trading down. Then again maybe a week ago the market dived about 1000 points within the first few minutes, and they built it back up by about 500 points before closing. The whole thing is a big scam and mind control operation as far as I see it.

  28. The Following 2 Users Say Thank You to bsbray For This Useful Post:

    Aragorn (2nd September 2015), lcam88 (2nd September 2015)

  29. #15
    Retired Member
    Join Date
    10th June 2015
    Posts
    1,009
    Thanks
    2,129
    Thanked 3,244 Times in 922 Posts
    Quote Originally posted by bsbray View Post
    Hmmm...

    NSA
    MI5/6
    Mossad

    Am I getting close? :P

    I'd imagine the NSA probably already has something equivalent or even more sophisticated. They don't seem to have to wait for public technology, and they're already tapping everything directly from ISPs and trading with foreign governments for whatever else they can't get.
    +1 to Aragorn's response. The NSA probably has at least an order of magnitude more data to sift through as well.

    Quote Originally posted by bsbray View Post
    This is both scary and amazing. This could be put to many, many positive uses, or it could just be used to predict sociological responses to things and manipulate public opinion in real time.

    Imagine all of the online data this kind of software could synthesize and summarize, or compare and contrast between different sources, if it were taken in that direction. Instead of having to do normal online searches from website to website and still only covering a small handful of the thousands of available sites out there, similar software could be developed to look through them all rapidly, summarize their main ideas, and chart the most common perspectives by something like the most common keywords occurring together.
    Here is the thing: I cannot think of a single use that does not have some negative aspect to it. Google has gotten as close to "free of evil" as possible with their search system. Clearly the trend, as shown in this promo, has been to use computing power to gain a market advantage, leveraging a position backed by the statistics these systems can compute. But that does not actually produce anything beneficial to anyone [else].

    Quote Originally posted by Aragorn
    If you're asking about a cheap and DIY clustering solution, try the Beowulf Cluster.
    This is getting close to what I was thinking about. I think a Hadoop cluster is more flexible, though -- more what I had in mind. And actually, Hadoop very closely follows the Google clustering model, except that it's written in Java instead of C and C++, so it's not as optimized.
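
    The classic way to push Python through a Hadoop cluster is Hadoop Streaming: a mapper and a reducer that just read stdin and write stdout. Here's a rough word-count sketch -- illustrative only, and the launch command in the comment is from memory:

    Code:
    # Hadoop Streaming sketch: the same file acts as mapper or reducer
    # depending on its first argument. Typically launched with something like:
    #   hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
    #       -reducer "wordcount.py reduce" -input /data -output /out
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word on stdin.
        for line in sys.stdin:
            for word in line.strip().lower().split():
                print(word + "\t1")

    def reducer():
        # Hadoop sorts mapper output by key, so identical words arrive together.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(current + "\t" + str(count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(current + "\t" + str(count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()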

    I also looked at D-Bus-based IPC; it's not suitable, in that it is designed to coordinate interactions between programs on the same machine. LXD uses D-Bus. A cluster requires a "swarm" or "master -> slaves" type of node model, which the Hadoop and Beowulf architectures sort of have. These models address problems that cluster solutions face but that the mainframe hardware implementation doesn't have, since it is a single machine. Plus, the mainframe architecture reduces communication latencies to a level that hard-wired networks can't attain.
    Last edited by lcam88, 2nd September 2015 at 13:20.

  30. The Following 2 Users Say Thank You to lcam88 For This Useful Post:

    Aragorn (2nd September 2015), bsbray (2nd September 2015)
