summaryrefslogtreecommitdiffstats
path: root/fs/ocfs2
Commit message (Collapse)AuthorAgeFilesLines
* convert remaining ->clear_inode() to ->evict_inode()Al Viro2010-08-091-4/+3Star
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Make ->drop_inode() just return whether inode needs to be droppedAl Viro2010-08-092-4/+6
| | | | | | ... and let iput_final() do the actual eviction or retention Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* convert ocfs2 to ->evict_inode()Al Viro2010-08-093-11/+16
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* check ATTR_SIZE contraints in inode_change_okChristoph Hellwig2010-08-091-3/+3
| | | | | | | | | | | | | | | | | | | | | Make sure we check the truncate constraints early on in ->setattr by adding those checks to inode_change_ok. Also clean up and document inode_change_ok to make this obvious. As a fallout we don't have to call inode_newsize_ok from simple_setsize and simplify it down to a truncate_setsize which doesn't return an error. This simplifies a lot of setattr implementations and means we use truncate_setsize almost everywhere. Get rid of fat_setsize now that it's trivial and mark ext2_setsize static to make the calling convention obvious. Keep the inode_newsize_ok in vmtruncate for now as all callers need an audit for its removal anyway. Note: setattr code in ecryptfs doesn't call inode_change_ok at all and needs a deeper audit, but that is left for later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* remove inode_setattrChristoph Hellwig2010-08-092-7/+17
| | | | | | | | | | | | | | | | | | | Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* sort out blockdev_direct_IO variantsChristoph Hellwig2010-08-091-5/+4Star
| | | | | | | | | | | | Move the call to vmtruncate to get rid of accessive blocks to the callers in prepearation of the new truncate calling sequence. This was only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant was not needed anyway. Get rid of blockdev_direct_IO_no_locking and its _newtrunc variant while at it as just opencoding the two additional paramters is shorted than the name suffix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* direct-io: move aio_complete into ->end_ioChristoph Hellwig2010-07-261-1/+6
| | | | | | | | | | | | | | Filesystems with unwritten extent support must not complete an AIO request until the transaction to convert the extent has been commited. That means the aio_complete calls needs to be moved into the ->end_io callback so that the filesystem can control when to call it exactly. This makes a bit of a mess out of dio_complete and the ->end_io callback prototype even more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alex Elder <aelder@sgi.com>
* Merge branch 'upstream-linus' of ↵Linus Torvalds2010-07-1813-208/+485
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Silence gcc warning in ocfs2_write_zero_page(). jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node ocfs2: Don't duplicate pages past i_size during CoW. ocfs2: tighten up strlen() checking ocfs2: Make xattr reflink work with new local alloc reservation. ocfs2: make xattr extension work with new local alloc reservation. ocfs2: Remove the redundant cpu_to_le64. ocfs2/dlm: don't access beyond bitmap size ocfs2: No need to zero pages past i_size. ocfs2: Zero the tail cluster when extending past i_size. ocfs2: When zero extending, do it by page. ocfs2: Limit default local alloc size within bitmap range. ocfs2: Move orphan scan work to ocfs2_wq. fs/ocfs2/dlm: Add missing spin_unlock
| * ocfs2: Silence gcc warning in ocfs2_write_zero_page().Joel Becker2010-07-161-1/+1
| | | | | | | | | | | | | | ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc doesn't know that. Set ret=0 just to make gcc happy. Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactionsJan Kara2010-07-161-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OCFS2 uses t_commit trigger to compute and store checksum of the just committed blocks. When a buffer has b_frozen_data, checksum is computed for it instead of b_data but this can result in an old checksum being written to the filesystem in the following scenario: 1) transaction1 is opened 2) handle1 is opened 3) journal_access(handle1, bh) - This sets jh->b_transaction to transaction1 4) modify(bh) 5) journal_dirty(handle1, bh) 6) handle1 is closed 7) start committing transaction1, opening transaction2 8) handle2 is opened 9) journal_access(handle2, bh) - This copies off b_frozen_data to make it safe for transaction1 to commit. jh->b_next_transaction is set to transaction2. 10) jbd2_journal_write_metadata() checksums b_frozen_data 11) the journal correctly writes b_frozen_data to the disk journal 12) handle2 is closed - There was no dirty call for the bh on handle2, so it is never queued for any more journal operation 13) Checkpointing finally happens, and it just spools the bh via normal buffer writeback. This will write b_data, which was never triggered on and thus contains a wrong (old) checksum. This patch fixes the problem by calling the trigger at the moment data is frozen for journal commit - i.e., either when b_frozen_data is created by do_get_write_access or just before we write a buffer to the log if b_frozen_data does not exist. We also rename the trigger to t_frozen as that better describes when it is called. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down nodeWengang Wang2010-07-151-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set before sending DLM_MIG_LOCKRES_MSG message to the target. We are using dlm_migration_can_proceed() for that purpose. However, if the node is down, dlm_migration_can_proceed() will also return "go ahead". In this rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove the BUG_ON() that trips over this condition. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: Don't duplicate pages past i_size during CoW.Tao Ma2010-07-151-0/+6
| | | | | | | | | | | | | | | | During CoW, the pages after i_size don't contain valid data, so there's no need to read and duplicate them. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: tighten up strlen() checkingDan Carpenter2010-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is only called from one place and it's like this: dlm_register_domain(conn->cc_name, dlm_key, &fs_version); The "conn->cc_name" is 64 characters long. If strlen(conn->cc_name) were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because strlen() doesn't count the NULL character. In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes 64 character buffers. The only exception is nd_name from struct o2nm_node. Anyway I looked into it and in this case the domain string comes from osb->uuid_str in ocfs2_setup_osb_uuid(). That's 32 characters and NULL which easily fits into O2NM_MAX_NAME_LEN. This patch doesn't change how the code works, but I think it makes the code a little cleaner. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: Make xattr reflink work with new local alloc reservation.Tao Ma2010-07-121-36/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | The new reservation code in local alloc has add the limitation that the caller should handle the case that the local alloc doesn't give use enough contiguous clusters. It make the old xattr reflink code broken. So this patch udpate the xattr reflink code so that it can handle the case that local alloc give us one cluster at a time. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: make xattr extension work with new local alloc reservation.Tao Ma2010-07-121-29/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The old ocfs2_xattr_extent_allocation is too optimistic about the clusters we can get. So actually if the file system is too fragmented, ocfs2_add_clusters_in_btree will return us with EGAIN and we need to allocate clusters once again. So this patch change it to a while loop so that we can allocate clusters until we reach clusters_to_add. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
| * ocfs2: Remove the redundant cpu_to_le64.Tao Ma2010-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | In ocfs2_block_group_alloc, we set c_blkno by bg->bg_blkno. But actually bg->bg_blkno is already changed to little endian in ocfs2_block_group_fill. So remove the extra cpu_to_le64. Reported-by: Marcos Matsunaga <Marcos.Matsunaga@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2/dlm: don't access beyond bitmap sizeWengang Wang2010-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | dlm->recovery_map is defined as unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; We should treat O2NM_MAX_NODES as the bit map size in bits. This patches fixes a bit operation that takes O2NM_MAX_NODES + 1 as bitmap size. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: No need to zero pages past i_size.Joel Becker2010-07-121-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ocfs2 fills a hole, it does so by allocating clusters. When a cluster is larger than the write, ocfs2 must zero the portions of the cluster outside of the write. If the clustersize is smaller than a pagecache page, this is handled by the normal pagecache mechanisms, but when the clustersize is larger than a page, ocfs2's write code will zero the pages adjacent to the write. This makes sure the entire cluster is zeroed correctly. Currently ocfs2 behaves exactly the same when writing past i_size. However, this means ocfs2 is writing zeroed pages for portions of a new cluster that are beyond i_size. The page writeback code isn't expecting this. It treats all pages past the one containing i_size as left behind due to a previous truncate operation. Thankfully, ocfs2 calculates the number of pages it will be working on up front. The rest of the write code merely honors the original calculation. We can simply trim the number of pages to only cover the actual file data. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
| * ocfs2: Zero the tail cluster when extending past i_size.Joel Becker2010-07-086-54/+207
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2's allocation unit is the cluster. This can be larger than a block or even a memory page. This means that a file may have many blocks in its last extent that are beyond the block containing i_size. There also may be more unwritten extents after that. When ocfs2 grows a file, it zeros the entire cluster in order to ensure future i_size growth will see cleared blocks. Unfortunately, block_write_full_page() drops the pages past i_size. This means that ocfs2 is actually leaking garbage data into the tail end of that last cluster. This is a bug. We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect when a write or truncate is past i_size. They will use ocfs2_zero_extend() to ensure the data is properly zeroed. Older versions of ocfs2_zero_extend() simply zeroed every block between i_size and the zeroing position. This presumes three things: 1) There is allocation for all of these blocks. 2) The extents are not unwritten. 3) The extents are not refcounted. (1) and (2) hold true for non-sparse filesystems, which used to be the only users of ocfs2_zero_extend(). (3) is another bug. Since we're now using ocfs2_zero_extend() for sparse filesystems as well, we teach ocfs2_zero_extend() to check every extent between i_size and the zeroing position. If the extent is unwritten, it is ignored. If it is refcounted, it is CoWed. Then it is zeroed. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
| * ocfs2: When zero extending, do it by page.Joel Becker2010-07-082-64/+84
| | | | | | | | | | | | | | | | | | | | ocfs2_zero_extend() does its zeroing block by block, but it calls a function named ocfs2_write_zero_page(). Let's have ocfs2_write_zero_page() handle the page level. From ocfs2_zero_extend()'s perspective, it is now page-at-a-time. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: stable@kernel.org
| * ocfs2: Limit default local alloc size within bitmap range.Tao Ma2010-06-161-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 6b82021b9e91cd689fdffadbcdb9a42597bbe764, we increase our local alloc size and calculate how much megabytes we can get according to group size and volume size. But we also need to check the maximum bits a local alloc block bitmap can have. With a bs=512, cs=32K, local volume with 160G, it calculate 96MB while the maximum local alloc size is only 76M. So the bitmap will overflow and corrupt the system truncate log file. See bug http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262 Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: Move orphan scan work to ocfs2_wq.Tao Ma2010-06-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used to let orphan scan work in the default work queue, but there is a corner case which will make the system deadlock. The scenario is like this: 1. set heartbeat threadshold to 200. this will allow us to have a great chance to have a orphan scan work before our quorum decision. 2. mount node 1. 3. after 1~2 minutes, mount node 2(in order to make the bug easier to reproduce, better add maxcpus=1 to kernel command line). 4. node 1 do orphan scan work. 5. node 2 do orphan scan work. 6. node 1 do orphan scan work. After this, node 1 hold the orphan scan lock while node 2 know node 1 is the master. 7. ifdown eth2 in node 2(eth2 is what we do ocfs2 interconnection). Now when node 2 begins orphan scan, the system queue is blocked. The root cause is that both orphan scan work and quorum decision work will use the system event work queue. orphan scan has a chance of blocking the event work queue(in dlm_wait_for_node_death) so that there is no chance for quorum decision work to proceed. This patch resolve it by moving orphan scan work to ocfs2_wq. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * fs/ocfs2/dlm: Add missing spin_unlockJulia Lawall2010-06-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a spin_unlock missing on the error path. Unlock as in the other code that leads to the leave label. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * spin_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * spin_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Joel Becker <joel.becker@oracle.com>
* | ocfs2: update gfp/slab.h includesTejun Heo2010-06-281-1/+0Star
|/ | | | | | | | | | Implicit slab.h inclusion via percpu.h is about to go away. Make sure gfp.h or slab.h is included as necessary. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
* Merge branch 'for_linus' of ↵Linus Torvalds2010-05-301-27/+23Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Convert quota statistics to generic percpu_counter ext3 uses rb_node = NULL; to zero rb_root. quota: Fixup dquot_transfer reiserfs: Fix resuming of quotas on remount read-write pohmelfs: Remove dead quota code ufs: Remove dead quota code udf: Remove dead quota code quota: rename default quotactl methods to dquot_ quota: explicitly set ->dq_op and ->s_qcop quota: drop remount argument to ->quota_on and ->quota_off quota: move unmount handling into the filesystem quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers quota: move remount handling into the filesystem ocfs2: Fix use after free on remount read-only Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
| * quota: rename default quotactl methods to dquot_Christoph Hellwig2010-05-241-9/+9
| | | | | | | | | | | | | | | | | | Follow the dquot_* style used elsewhere in dquot.c. [Jan Kara: Fixed up missing conversion of ext2] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * quota: drop remount argument to ->quota_on and ->quota_offChristoph Hellwig2010-05-241-8/+2Star
| | | | | | | | | | | | | | | | Remount handling has fully moved into the filesystem, so all this is superflous now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappersChristoph Hellwig2010-05-241-9/+5Star
| | | | | | | | | | | | | | | | | | | | | | Instead of having wrappers in the VFS namespace export the dquot_suspend and dquot_resume helpers directly. Also rename vfs_quota_disable to dquot_disable while we're at it. [Jan Kara: Moved dquot_suspend to quotaops.h and made it inline] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * ocfs2: Fix use after free on remount read-onlyJan Kara2010-05-241-1/+7
| | | | | | | | | | | | | | | | We also have to cancel quota syncing thread on remount read only because at that moment quota is being turned off. Otherwise quota syncing thread will try to access already freed quota structures. Signed-off-by: Jan Kara <jack@suse.cz>
* | kill spurious reference to vmtruncatenpiggin@suse.de2010-05-281-2/+6
| | | | | | | | | | | | | | | | | | | | Lots of filesystems calls vmtruncate despite not implementing the old ->truncate method. Switch them to use simple_setsize and add some comments about the truncate code where it seems fitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | drop unused dentry argument to ->fsyncChristoph Hellwig2010-05-281-4/+3Star
| | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, ↵Alexey Dobriyan2010-05-251-2/+2
|/ | | | | | | | | | | | | | | | SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: replace inode uid,gid,mode initialization with helper functionDmitry Monakhov2010-05-221-8/+1Star
| | | | | | Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs: constify xattr_handlerStephen Hemminger2010-05-223-14/+14
| | | | | Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2: Fix lock inversion in quotas during umountJan Kara2010-05-212-4/+4
| | | | | | | | | We cannot cancel delayed work from ocfs2_local_free_info because that is called with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to finish. Cancel the work before acquiring dqonoff_mutex. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Use __dquot_transfer to avoid lock inversionJan Kara2010-05-211-12/+5Star
| | | | | | | | | | dquot_transfer() acquires own references to dquots via dqget(). Thus it waits for dq_lock which creates a lock inversion because dq_lock ranks above transaction start but transaction is already started in ocfs2_setattr(). Fix the problem by passing own references directly to __dquot_transfer. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Fix NULL pointer deref when writing local dquotJan Kara2010-05-213-12/+12
| | | | | | | | | | | commit_dqblk() can write quota info to global file. That is actually a bad thing to do because if we are just modifying local quota file, we are not prepared (do not hold proper locks, do not have transaction credits) to do a modification of the global quota file. So do not use commit_dqblk() and instead call our writing function directly. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Fix estimate of credits needed for quota allocationJan Kara2010-05-211-2/+3
| | | | | | | | We were missing reservation of a journal credit for modification of quota file inode when creating new dquot structure in the global quota file. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Fix quota lockingJan Kara2010-05-213-186/+214
| | | | | | | | | | | | | | | | | | | | | | | | | OCFS2 had three issues with quota locking: a) When reading dquot from global quota file, we started a transaction while holding dqio_mutex which is prone to deadlocks because other paths do it the other way around b) During ocfs2_sync_dquot we were not protected against concurrent writers on the same node. Because we first copy data to local buffer, a race could happen resulting in old data being written to global quota file and thus causing quota inconsistency after a crash. c) ip_alloc_sem of quota files was acquired while a transaction is started in ocfs2_quota_write which can deadlock because we first get ip_alloc_sem and then start a transaction when extending quota files. We fix the problem a) by pulling all necessary code to ocfs2_acquire_dquot and ocfs2_release_dquot. Thus we no longer depend on generic dquot_acquire to do the locking and can force proper lock ordering. Problems b) and c) are fixed by locking i_mutex and ip_alloc_sem of global quota file in ocfs2_lock_global_qf and removing ip_alloc_sem from ocfs2_quota_read and ocfs2_quota_write. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Avoid unnecessary block mapping when refreshing quota infoJan Kara2010-05-214-7/+24
| | | | | | | | | The position of global quota file info does not change. So we do not have to do logical -> physical block translation every time we reread it from disk. Thus we can also avoid taking ip_alloc_sem. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ocfs2: Do not map blocks from local quota file on each writeJan Kara2010-05-213-9/+36
| | | | | | | | | | There is no need to map offset of local dquot structure to on disk block in each quota write. It is enough to map it just once and store the physical block number in quota structure in memory. Moreover this simplifies locking as we do not have to take ip_alloc_sem from quota write path. Acked-by: Joel Becker <Joel.Becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* quota: unify quota init condition in setattrDmitry Monakhov2010-05-211-2/+2
| | | | | | | | | | | | | Quota must being initialized if size or uid/git changes requested. But initialization performed in two different places: in case of i_size file system is responsible for dquot init , but in case of uid/gid init will be called internally in dquot_transfer(). This ambiguity makes code harder to understand. Let's move this logic to one common helper function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
* Merge branch 'upstream-linus' of ↵Linus Torvalds2010-05-2140-1491/+2589
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits) ocfs2: Silence a gcc warning. ocfs2: Don't retry xattr set in case value extension fails. ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break ocfs2: Reset xattr value size after xa_cleanup_value_truncate(). fs/ocfs2/dlm: Use kstrdup fs/ocfs2/dlm: Drop memory allocation cast Ocfs2: Optimize punching-hole code. Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public. Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing. Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead. ocfs2: Block signals for mkdir/link/symlink/O_CREAT. ocfs2: Wrap signal blocking in void functions. ocfs2/dlm: Increase o2dlm lockres hash size ocfs2: Make ocfs2_extend_trans() really extend. ocfs2/trivial: Code cleanup for allocation reservation. ocfs2: make ocfs2_adjust_resv_from_alloc simple. ocfs2: Make nointr a default mount option ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE o2net: log socket state changes ocfs2: print node # when tcp fails ...
| * ocfs2: Silence a gcc warning.Joel Becker2010-05-191-1/+1
| | | | | | | | | | | | | | ocfs2_block_group_claim_bits() is never called with min_bits=0, but we shouldn't leave status undefined if it ever is. Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: Don't retry xattr set in case value extension fails.Tao Ma2010-05-191-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In normal xattr set, the set sequence is inode, xattr block and finally xattr bucket if we meet with a ENOSPC. But there is a corner case. So consider we will set a xattr whose value will be stored in a cluster, and there is no xattr block by now. So we will reserve 1 xattr block and 1 cluster for setting it. Now if we fail in value extension(in case the volume is almost full and we can't allocate the cluster because the check in ocfs2_test_bg_bit_allocatable), ENOSPC will be returned. So we will try to create a bucket(this time there is a chance that the reserved cluster will be used), and when we try value extension again, kernel bug happens. We did meet with it. Check the bug below. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251 This patch just try to avoid this by adding a set_abort in ocfs2_xattr_set_ctxt, so in case ENOSPC happens in value extension, we will check whether it is caused by the real ENOSPC or just the full of inode or xattr block. If it is the first case, we set set_abort so that we don't try any further. we are safe to exit directly here ince it is really ENOSPC. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency breakWengang Wang2010-05-193-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we process a dirty lockres with the lockres->spinlock taken. While during the process, we may need to lock on dlm->ast_lock. This breaks the dependency of dlm->ast_lock(lock first) and lockres->spinlock(lock second). This patch fixes the problem. Since we can't release lockres->spinlock, we have to take dlm->ast_lock just before taking the lockres->spinlock and release it after lockres->spinlock is released. And use __dlm_queue_bast()/__dlm_queue_ast(), the nolock version, in dlm_shuffle_lists(). There are no too many locks on a lockres, so there is no performance harm. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * ocfs2: Reset xattr value size after xa_cleanup_value_truncate().Tao Ma2010-05-191-7/+10
| | | | | | | | | | | | | | | | | | | | | | In ocfs2_prepare_xattr_entry, if we fail to grow an existing value, xa_cleanup_value_truncate() will leave the old entry in place. Thus, we reset its value size. However, if we were allocating a new value, we must not reset the value size or we will BUG(). This resolves oss.oracle.com bug 1247. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
| * Merge branch 'discontig-bg' of git://oss.oracle.com/git/tma/linux-2.6 into ↵Joel Becker2010-05-1914-240/+667
| |\ | | | | | | | | | ocfs2-merge-window
| | * ocfs2: enable discontig block group support.Tao Ma2010-03-181-1/+2
| | | | | | | | | | | | Signed-off-by: Tao Ma <tao.ma@oracle.com>
| | * ocfs2: Set ac_last_group properly with discontig group.Tao Ma2010-04-271-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ac_last_group is used to record the last block group we used during allocation. But the initialization process only calls ocfs2_which_suballoc_group and fails to use suballoc_loc properly. So let us do it. Another function ocfs2_test_suballoc_bit also needs fix. I have searched all the callers of ocfs2_which_suballoc_group, and all the callers notices suballoc_loc now. Signed-off-by: Tao Ma <tao.ma@oracle.com>