WiP NeuroDebian snapshots (based on http://snapshot.debian.org). Devels deployment only for now
http://snapshot-neuro.debian.net
neurodebian/snapshot-neuro.debian.net
Introduction
============

This is an implementation for a possible snapshot.debian.org service.
It is not yet finished; it is more a prototype/proof of concept to show
and learn what we want and can provide. So far it seems to actually work.

The goal of snapshot.d.o is to collect a history of how our archives
(the main debian.org archive, security, volatile, etc.) looked at any
point in time. It should allow users to, say, get the vim package that
was in testing on the 13th of January 2009.

Storage
-------

The way snapshot currently stores data is actually quite simple:

- The actual files' content is stored in the "farm". Here each file is
  stored under a name that depends on its content: its sha1 hashsum.
  For reasons that weasel is no longer sure still apply nowadays, the
  filesystem tree that houses those files is hashed also.

  A typical farm looks like this:

    .
    ./00
    ./00/00
    ./00/00/00000dd4ba37c55ce75c8cad71f03e85ee09a370
    ./00/03
    ./00/03/000354b82181fbb7de26f2060bda3d96bbff28d5
    ./00/03/00037ab88f67808c40f741bdc678254ce560df24
    ./00/06
    ./00/06/000648707e514afd7b7d4912f4062c13062012ba

  (not very well filled yet)

- The rest is stored in a postgresql database.

This database has a concept of an _archive_ ("debian",
"debian-security", "backports.org", etc.). Each archive has any number
of _mirrorrun_ entries. These correspond to one import of that mirror
into the database.

An actual represented filesystem tree consists of _node_s. Each node
corresponds to an item in an imported archive tree. A node is either a
_directory_, a regular _file_, or a _symlink_. No other filetypes are
currently supported, and we probably don't need much else.

A directory node stores the full path from the archive root. A regular
file knows its original (base)name, its size, and its sha1 hashsum (so
we can find it in the farm). And a symlink knows its original (base)name
and its target (i.e. what it points to).
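
The farm layout described above can be sketched in a few lines of
Python. This is only an illustration, not the project's own code: the
function names `farm_path` and `sha1_of_file` are hypothetical, and the
two-level directory scheme (first two hex characters, then the next two)
is inferred from the example listing in the Storage section.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the sha1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def farm_path(farm_root, sha1_hex):
    """Map a sha1 hex digest to its assumed location in the farm.

    Assumption: <root>/<chars 1-2>/<chars 3-4>/<full digest>, which is
    what the example listing (./00/03/000354b8...) suggests.
    """
    if len(sha1_hex) != 40:
        raise ValueError("expected a 40-character sha1 hex digest")
    return os.path.join(farm_root, sha1_hex[0:2], sha1_hex[2:4], sha1_hex)
```

For instance, `farm_path("farm", sha1_of_file("some.deb"))` would give
the place where the content of `some.deb` is expected to live, whatever
the file was called in the archive.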

Each of those nodes has a reference to its parent directory in order to
reconstruct the original filesystem structure. A node also contains
pointers to the first and the last mirrorrun it appeared in. Some files,
like Packages.gz files, will probably only exist for a single mirrorrun,
while other files, like actual packages or the root directory (/), exist
for many or even all of the mirrorruns for a given archive.

Access
------

Currently, apart from directly querying the SQL, there's a Python
implementation of a snapshotfs fuse file system. It's not really built
to scale well, but for now it's one way to access the data.

There's also a pylons web application; find it in ./web/app.

Indexing
--------

The base of the snapshot implementation is pretty agnostic as to what
kind of data we take snapshots of. It could just as well be your home
directory or the bugs database.

On top of that, there is a concept of packages. To be more precise,
there are source packages, each with a version, and there are binary
packages with versions that belong to a source package.

Source packages have files associated with them (by file hash), and so
do binary packages. In the relation for binary packages we also keep
track of which architecture the binary package is for.

This _indexer_ works either by parsing the
/indices/package-file.map.bz2 file in an ftp tree or, if that doesn't
exist, by looking at each .dsc and each .deb/.udeb file.

TODO
====

o We might need to handle the case where packages are removed from the
  archive because we aren't allowed to redistribute them. Then we could
  set all files belonging to a package to locked/unreadable/whatever.

- Another thing would be to teach the system about releases so that one
  can get "sarge r2" from the system. Again, it should be able to build
  on top of what is already there.

o A mirrorrun import should write out a "transaction log"-style flat
  file that lists all the nodes added in this mirrorrun, plus data
  identifying the mirrorrun (id, date, archive). This can then be used
  to inject this run's metadata into a mirror.

o When adding files to the farm, copy them to a .temp-hash filename, and
  only link them in place once they are there.

. farm fsck:
  o check for all the files whether their hashsum matches their name.
  - check if all of them are referenced in the database (new ones might
    not be there yet, instead being added in that transaction)
  o find list of hashes referenced in the DB but not in the farm.
  - integrate parts into one coherent tool

o Write a nice web frontend.
  o handle symlinks (in the database stat() function)
  o make breadcrumbs useful links
  o packages
  o handle XSS
    x either verify that mako escapes all the input,
    x check if it does in 0.9.7 (there is some filter config option)
    x or add some layer that copies stuff to the template context doing
      that
    * pylons 0.9.7's mako config automatically escapes any special chars
      in template variables.
  o handle 'bobby tables' as datestring in urls

o Write some scripts to help setting up a mirror. This should be easy
  given that our farm is easy to rsync and we (will) have a way to
  inject mirrorruns without having to dump/reload the entire DB.
  o keep a journal of recently added farm entries and use this for farm
    syncing. This will make syncing much faster than rsyncing the whole
    tree, which still can (and should) be done regularly as a safety net.

- snapshotfs (the fuse filesystem) doesn't handle the case where a
  source package has more than one version of a binary package of a
  given name (such as with binNMUs).
  e.g. fuse/packages/g/gtk2-engines/1:2.16.1-2

- Come up with a testsuite that checks at least the most basic things
  still work, such as importing a tree.

[ o .. done; - to do; . partially done ]

Installation
============

Dependencies
------------

Depends: ruby libdbd-pg-ruby1.8 libbz2-ruby1.8 python-yaml
         python-psycopg2 python-lockfile fuse-utils python-fuse
         uuid-runtime
DB-Depends: postgresql-plperl-8.4 postgresql-8.4-debversion
fsck-Depends: python-dev gcc
FUSE-Depends: python-fuse
Web-Depends: python-pylons
  (that is {python-pylons,python-routes,python-nose,python-paste,python-pastedeploy,python-pastescript,python-webob,python-weberror,python-beaker,python-mako,python-formencode,python-webhelpers,python-decorator,python-simplejson}/lenny-backports )
Web-Recommends: libapache2-mod-wsgi apache2

Apache config:
  required modules: expires headers

To index binary and source packages, we currently need
https://github.com/brianmario/bzip2-ruby.

Basic setup
-----------

# as psql super::

  createuser -DRS snapshot
  dropdb snapshot                     # Drop the one played with before
  createdb -O snapshot snapshot       # Create the DB
  psql -f db/db-init.sql snapshot     # Initialize DB (Access rights and languages)

# as snapshot user::

  ARCHIVE_NAME=<ARCHIVE>
  ARCHIVE_PATH=<PATH TO ARCHIVE>
  cd db
  psql -f db-create.sql snapshot
  ./upgrade ../snapshot.conf          # Introduce necessary DB devel- "upgrades"
  cd ..
  ./snapshot -c snapshot.conf -v -a "$ARCHIVE_NAME" add-archive   # Initialize the archive
  ./snapshot -s -c snapshot.conf -p "$ARCHIVE_PATH" -v -a "$ARCHIVE_NAME" import

Source Code
===========

snapshot   The snapshotting script
db/        SQL and Python scripts to initialize and upgrade the DB backend
etc/       Configuration files for Apache + cronjob configuration files
           for the master and mirror repositories
fsck/      Tools to verify the integrity of the system
fuse/      FUSE module
lib/       Base components used by various scripts
master/    Tools to be used with the master repository (e.g. import-run)
mirror/    Tools to support operation of the mirrors
web/       Pylons Web frontend

Bugs
====

node_with_ts is horribly slow:

  SELECT hash
    FROM file JOIN node_with_ts ON file.node_id = node_with_ts.node_id
   WHERE first_run <= '2009-09-15 21:32:24'
     AND last_run >= '2009-09-15 21:32:24'
     AND name = 'openssl_0.9.8g.orig.tar.gz'
     AND parent=320

takes about 10 times as long as the equivalent:

  SELECT hash
    FROM file JOIN node ON file.node_id = node.node_id
   WHERE (SELECT run FROM mirrorrun WHERE mirrorrun_id=first) <= '2009-09-15 21:32:24'
     AND (SELECT run FROM mirrorrun WHERE mirrorrun_id=last) >= '2009-09-15 21:32:24'
     AND name = 'openssl_0.9.8g.orig.tar.gz'
     AND parent=320
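
The second query form above can be tried out against a toy database.
The sketch below is only an illustration under loud assumptions: it uses
an in-memory SQLite database instead of PostgreSQL, the table layout is
a simplified guess derived from the column names in the queries (the
real schema lives in db/ and is not reproduced here), and the rows and
hashes are made up. It says nothing about the performance difference; it
only shows how a node's first/last mirrorrun pointers select the file
that was visible at a given timestamp.

```python
import sqlite3

# Simplified, hypothetical schema; "first"/"last" are quoted because
# they are keywords in recent SQLite versions.
SCHEMA = """
CREATE TABLE mirrorrun (mirrorrun_id INTEGER PRIMARY KEY, run TEXT NOT NULL);
CREATE TABLE node (
    node_id INTEGER PRIMARY KEY,
    parent  INTEGER,
    "first" INTEGER NOT NULL REFERENCES mirrorrun(mirrorrun_id),
    "last"  INTEGER NOT NULL REFERENCES mirrorrun(mirrorrun_id)
);
CREATE TABLE file (
    node_id INTEGER PRIMARY KEY REFERENCES node(node_id),
    name    TEXT NOT NULL,
    hash    TEXT NOT NULL
);
"""

# Same shape as the faster query in the Bugs section, with parameters.
QUERY = """
SELECT hash
  FROM file JOIN node ON file.node_id = node.node_id
 WHERE (SELECT run FROM mirrorrun WHERE mirrorrun_id = node."first") <= :ts
   AND (SELECT run FROM mirrorrun WHERE mirrorrun_id = node."last") >= :ts
   AND name = :name
   AND parent = :parent
"""


def build_toy_db():
    """Create an in-memory database with three runs and two file nodes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO mirrorrun VALUES (?, ?)", [
        (1, "2009-09-01 00:00:00"),
        (2, "2009-09-15 00:00:00"),
        (3, "2009-10-01 00:00:00"),
    ])
    # Node 10 was seen in runs 1..2, node 11 only in run 3.
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)",
                     [(10, 320, 1, 2), (11, 320, 3, 3)])
    conn.executemany("INSERT INTO file VALUES (?, ?, ?)", [
        (10, "example_1.0.orig.tar.gz", "a" * 40),  # made-up hash
        (11, "example_1.0.orig.tar.gz", "b" * 40),  # made-up hash
    ])
    return conn


def hashes_at(conn, ts, name, parent):
    """Return the hashes of files named `name` under `parent` visible at `ts`."""
    rows = conn.execute(QUERY, {"ts": ts, "name": name, "parent": parent})
    return [row[0] for row in rows]
```

Timestamps are stored as ISO-formatted text here, so plain string
comparison orders them correctly; PostgreSQL would use a real timestamp
type instead.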