Distributed architecture: What do you think of the open-source Java solution Hadoop
and its Distributed Filesystem (HDFS)?
On 2009-05-31 00:28:44, by *alexandre*, Inactive
Hadoop (backed by IBM, Yahoo, and Apache, among others) builds on the research Google published on the MapReduce model.
The next Hadoop Summit '09 conference (June 10): http://developer.yahoo.com/events/hadoopsummit09/
Who uses Hadoop?
Summary of the MapReduce model: http://labs.google.com/papers/mapreduce.html (freely usable document)
It is a bit of a grab bag, but here are some criticisms of the use of MapReduce (somewhat dated): http://www.databasecolumn.com/2008/0...step-back.html
And a short presentation of HBase's Bigtable-style tables (HBase is a Hadoop subproject): http://wiki.apache.org/hadoop-data/a..._ets_clean.pdf
See also:
The book Pro Hadoop is available from Apress
Will Sun's cloud rely on the open-source solution Hadoop?
Microsoft Azure promotes interoperability by making jdotnetservices available
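To make the MapReduce model mentioned in this post more concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the shuffle step that Hadoop performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

def map_words(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(documents, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input documents, for illustration only.
documents = {
    "doc1": "hadoop stores data in hdfs",
    "doc2": "hadoop runs mapreduce jobs on data",
}
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts["hadoop"], word_counts["data"])
```

The mapper and reducer only see one document or one key at a time, which is what allows a framework such as Hadoop to spread them over many machines.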
-
*alexandre*, Inactive
For fun, a publication by a Googler: http://www.morganclaypool.com/doi/pd...1Y200905CAC006
An excerpt that seems important to me when weighing an open, application-level solution against a particular vendor's interface tied to its hardware:
1.6.1 Storage
Disk drives are connected directly to each individual server and managed by a global distributed
file system (such as Google’s GFS [31]) or they can be part of Network Attached Storage (NAS)
devices that are directly connected to the cluster-level switching fabric. A NAS tends to be a simpler
solution to deploy initially because it pushes the responsibility for data management and integrity to
a NAS appliance vendor. In contrast, using the collection of disks directly attached to server nodes
requires a fault-tolerant file system at the cluster level. This is difficult to implement but can lower
hardware costs (the disks leverage the existing server enclosure) and networking fabric utilization
(each server network port is effectively dynamically shared between the computing tasks and the
file system). The replication model between these two approaches is also fundamentally different. A
NAS provides extra reliability through replication or error correction capabilities within each appliance,
whereas systems like GFS implement replication across different machines and consequently
will use more networking bandwidth to complete write operations. However, GFS-like systems
are able to keep data available even after the loss of an entire server enclosure or rack and may allow
higher aggregate read bandwidth because the same data can be sourced from multiple replicas.
Trading off higher write overheads for lower cost, higher availability, and increased read bandwidth
was the right solution for many of Google’s workloads. An additional advantage of having disks colocated
with compute servers is that it enables distributed system software to exploit data locality.
For the remainder of this book, we will therefore implicitly assume a model with distributed disks
directly connected to all servers.
[FIGURE 1.1: Typical elements in warehouse-scale systems: 1U server (left), 7' rack with Ethernet
switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).]
Some WSCs, including Google’s, deploy desktop-class disk drives instead of enterprise-grade
disks because of the substantial cost differential between the two. Because that data are nearly always
replicated in some distributed fashion (as in GFS), this mitigates the possibly higher fault rates of
desktop disks. Moreover, because field reliability of disk drives tends to deviate significantly from
the manufacturer’s specifications, the reliability edge of enterprise drives is not clearly established.
For example, Elerath and Shah [24] point out that several factors can affect disk reliability more
substantially than manufacturing process and design.
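The trade-off described in this excerpt (replicas on different machines cost extra write traffic but survive the loss of a server or rack) can be illustrated with a small Python sketch. Everything here is hypothetical: the cluster layout, the `place_replicas` helper, and the round-robin placement rule are illustrative assumptions, not the actual policy of GFS or HDFS.

```python
def place_replicas(block_id, racks, replication=3):
    """Pick `replication` servers for a block, each on a different rack.

    `racks` maps a rack name to its list of servers. Racks and servers are
    chosen round-robin from the block id; this is an illustrative rule only.
    """
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("not enough racks for one replica per rack")
    chosen = []
    for i in range(replication):
        rack = rack_names[(block_id + i) % len(rack_names)]
        servers = racks[rack]
        chosen.append((rack, servers[block_id % len(servers)]))
    return chosen

def available_after_rack_loss(replicas, lost_rack):
    """A block stays readable if at least one replica is outside the lost rack."""
    return any(rack != lost_rack for rack, _ in replicas)

# Hypothetical three-rack cluster.
racks = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
}
replicas = place_replicas(block_id=7, racks=racks, replication=3)
# Each write goes to every replica, so it costs `replication` copies of the data.
copies_per_write = len(replicas)
print(replicas, copies_per_write)
print(all(available_after_rack_loss(replicas, r) for r in racks))
```

With a single replica the same check fails as soon as the rack holding it is lost, which is the availability argument the excerpt makes in favor of cross-machine replication.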
on 01/06/2009 at 15:55 -
*alexandre*, Inactive
Regarding transaction support, it is defined in the description of the package:
Package org.apache.hadoop.hbase.client.transactional Description
This package provides support for atomic transactions. Transactions can span multiple regions. Transaction writes are applied when committing a transaction. At commit time, the transaction is examined to see if it can be applied while still maintaining atomicity. This is done by looking for conflicts with the transactions that committed while the current transaction was running. This technique is known as optimistic concurrency control (OCC) because it relies on the assumption that transactions will mostly not have conflicts with each other.
For more details on OCC, see the paper On Optimistic Methods for Concurrency Control by Kung and Robinson.
To enable transactions, modify hbase-site.xml to turn on the TransactionalRegionServer. This is done by setting hbase.regionserver.class to org.apache.hadoop.hbase.ipc.TransactionalRegionInterface and hbase.regionserver.impl to org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegionServer
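As a hedged illustration of the configuration step just described, the following Python sketch renders the two properties in the usual Hadoop-style `<configuration>`/`<property>` layout. The property names and values are taken from the text above; the helper name `render_hbase_site` is hypothetical, and the result should be checked against the documentation of the HBase release in use.

```python
import xml.etree.ElementTree as ET

# Property names and values as quoted in the package description above.
TRANSACTIONAL_PROPERTIES = {
    "hbase.regionserver.class":
        "org.apache.hadoop.hbase.ipc.TransactionalRegionInterface",
    "hbase.regionserver.impl":
        "org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegionServer",
}

def render_hbase_site(properties):
    """Build a <configuration> document with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

hbase_site_xml = render_hbase_site(TRANSACTIONAL_PROPERTIES)
print(hbase_site_xml)
```

The printed fragment would be merged into an existing hbase-site.xml rather than replacing it, since that file normally carries other settings as well.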
The read set claimed by a transactional scanner is determined from the start and end keys which the scanner is opened with.
Known Issues
Recovery in the face of hregion server failure is not fully implemented. Thus, you cannot rely on the transactional properties in the face of node failure.
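The optimistic concurrency control described in the package description can be sketched in a few lines of Python. This is a simplified, single-process illustration of commit-time validation, not HBase's implementation: the `Store` and `Transaction` classes are hypothetical, and the conflict rule used here (a transaction aborts when another transaction that committed during its lifetime wrote a key it read) is one common form of validation, assumed for the example.

```python
class ConflictError(Exception):
    """Raised when a transaction cannot commit without breaking atomicity."""

class Store:
    def __init__(self):
        self.data = {}
        self.commit_log = []  # write sets of committed transactions, in order

    def begin(self):
        return Transaction(self, start_index=len(self.commit_log))

class Transaction:
    def __init__(self, store, start_index):
        self.store = store
        self.start_index = start_index  # commits before this index are already visible
        self.read_set = set()
        self.writes = {}  # buffered writes, applied only at commit time

    def read(self, key):
        self.read_set.add(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation: look at transactions that committed while we were running.
        for committed_keys in self.store.commit_log[self.start_index:]:
            if committed_keys & self.read_set:
                raise ConflictError("read set overlaps a concurrent commit")
        self.store.data.update(self.writes)
        self.store.commit_log.append(set(self.writes))

store = Store()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.write("balance", t1.read("balance") + 10)
t2.write("balance", t2.read("balance") - 30)
t1.commit()  # no concurrent commit touched t1's read set
try:
    t2.commit()
    outcome = "committed"
except ConflictError:
    outcome = "aborted"  # t2 read "balance", which t1 changed meanwhile
print(store.data["balance"], outcome)
```

No lock is taken while the transactions run; the cost of a conflict is paid only at commit time, which is why the approach suits workloads where conflicts are rare.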
Quote:
Alternatively, this may all be implemented above hbase. The client
keeps track of updates, and trys to roll back using timestamps.
Problem here is if the client dies midway through we have half the
transaction committed and loose atomicity/consistency.
We will eventually want/need atomic transactions on hbase, so I'll
look into this further. Any input would be appreciated. Would be
interesting to know how/what google provides...
Moreover, this does not seem impossible to implement, since the Bigtable implementation has no data typing ...
If you dig into Google's publication on the Bigtable implementation, you can indeed read this:
Timestamps
Each cell in a Bigtable can contain multiple versions of
the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned
by Bigtable, in which case they represent “real
time” in microseconds, or be explicitly assigned by client
applications. Applications that need to avoid collisions
must generate unique timestamps themselves. Different
versions of a cell are stored in decreasing timestamp order,
so that the most recent versions can be read first.
To make the management of versioned data less onerous,
we support two per-column-family settings that tell
Bigtable to garbage-collect cell versions automatically.
The client can specify either that only the last n versions
of a cell be kept, or that only new-enough versions be
kept (e.g., only keep values that were written in the last
seven days).
In our Webtable example, we set the timestamps of
the crawled pages stored in the contents: column to
the times at which these page versions were actually
crawled. The garbage-collection mechanism described
above lets us keep only the most recent three versions of
every page.
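The versioning rules in this excerpt can be sketched as a small Python class. This is an illustration only, not Bigtable's or HBase's API: `VersionedCell` and its method names are hypothetical, timestamps are plain integers standing in for the 64-bit microsecond values described above, and overwriting a version on a timestamp collision is an assumption made for the example.

```python
class VersionedCell:
    """One cell holding several versions, newest first."""

    def __init__(self):
        self.versions = []  # list of (timestamp, value), decreasing timestamp

    def put(self, timestamp, value):
        # Assumption: a write with an existing timestamp replaces that version.
        self.versions = [(ts, v) for ts, v in self.versions if ts != timestamp]
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, at_or_before=None):
        """Return the most recent value, optionally as of a given timestamp."""
        for ts, value in self.versions:
            if at_or_before is None or ts <= at_or_before:
                return value
        return None

    def gc_keep_last(self, n):
        """Garbage-collect all but the last n versions."""
        self.versions = self.versions[:n]

    def gc_keep_newer_than(self, min_timestamp):
        """Garbage-collect versions older than min_timestamp."""
        self.versions = [(ts, v) for ts, v in self.versions if ts >= min_timestamp]

# Webtable-style usage: the timestamp is the time at which the page was crawled.
contents = VersionedCell()
for crawl_time, page in [(100, "v1"), (300, "v3"), (200, "v2"), (400, "v4")]:
    contents.put(crawl_time, page)
contents.gc_keep_last(3)  # keep only the three most recent versions
print(contents.get(), contents.get(at_or_before=250), len(contents.versions))
```

Keeping the list in decreasing timestamp order means that reading the latest version never has to scan older ones, which matches the ordering the excerpt describes.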
on 07/06/2009 at 13:41 -
*alexandre*, Inactive
For the impatient who would like a preconfigured VM with Hadoop, lighttpd, and more, Cloudera offers an Ubuntu VM with all the applications installed and configured for training:
http://www.cloudera.com/hadoop-train...irtual-machine
on 12/06/2009 at 13:53 -
*alexandre*, Inactive
I need a search engine that supports full-text search and indexing of PDF and Open XML files,
and after a long search ... I came across this: http://katta.sourceforge.net/
The Lucene search engine coupled with PDFBox (already integrated) will let me solve the problem of searching and indexing PDF files.
All that is missing is the XML parser coupled with a map/reduce task!
on 14/06/2009 at 13:36 -
*alexandre*, Inactive
Some information (from a former Google employee) on transaction support in Hadoop and the products derived from it, which include Cascading and HBase: they communicate using IPC.
What is IPC? It is simply the low-level APIs that let the operating system handle communication between the different processes of applications.
It implements the basics of concurrent programming: semaphores, rendezvous, reader-writer, and so on ...
HBase, on the other hand, handles concurrent data access in the so-called optimistic way: it relies on the assumption that very few cases will actually lead to conflicts.
on 18/06/2009 at 22:00 -
*alexandre*, Inactive
For those who are interested (still on Hadoop), here are Cloudera's slides from the latest Hadoop conference:
http://www.cloudera.com/blog/2009/06...-west-roundup/
on 24/06/2009 at 21:38 -
*alexandre*, Inactive
And for those who want to develop an application based on machine learning, this project is also built on Hadoop:
http://lucene.apache.org/mahout/
on 28/06/2009 at 14:14 -
wiztricks, Senior Eminent Expert
What more is there to say?
- W
on 30/06/2009 at 20:08 -
*alexandre*, Inactive
A bit of news: http://www.techcrunch.com/2009/10/01...udera-desktop/
and a demo of the tools:
http://www.cloudera.com/desktop
on 03/10/2009 at 11:25 -
*alexandre*, Inactive
on 03/10/2009 at 16:54