Contributed by dwc from the size-matters dept.
Otto Moerbeek (otto@) writes:
Last week I committed statvfs(3) support to OpenBSD 4.3-current. This is another step in large disk support, and I thought it would be nice to give an overview of the current state of affairs.
Large disks are disks that have more than 2TB capacity. Originally we had the following limitations:
- disklabels could only handle up to 2TB disks and partitions.
- filesystems could only be 1TB in size.
- the in-kernel buffer layer could only handle 32-bit disk addresses.
- the SCSI layer did not fully support 64-bit disk sector addresses.
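The 2TB figure in the list above is what 32-bit sector addressing gives with 512-byte sectors. Here is a minimal Python sketch of that arithmetic. The helper name `max_addressable_bytes` is hypothetical, and reading the 1TB filesystem limit as a signed 32-bit sector address is an assumption, not something stated in the article.

```python
SECTOR_SIZE = 512   # bytes per sector (assumed; matches the usual disk sector size)
TIB = 1 << 40       # one tebibyte in bytes

def max_addressable_bytes(address_bits, sector_size=SECTOR_SIZE, signed=False):
    """Largest capacity reachable with sector addresses of the given width."""
    # A signed address loses one bit to the sign (assumption for the 1TB case).
    usable_bits = address_bits - 1 if signed else address_bits
    return (1 << usable_bits) * sector_size

print(max_addressable_bytes(32) // TIB)               # unsigned 32-bit addresses
print(max_addressable_bytes(32, signed=True) // TIB)  # signed 32-bit addresses
print(max_addressable_bytes(64) // TIB)               # 64-bit addresses
```

The first two calls print 2 and 1, which line up with the disklabel and filesystem limits listed above.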
Step by step, all these barriers have been removed in OpenBSD 4.1, 4.2 and 4.3. FFS2 was introduced both in the GENERIC kernel and in userland, where various tools like dump(8) and fsck_ffs(8) manipulate on-disk data structures directly. The disklabel format has been adapted to allow for larger partitions and disks, and the kernel buffer layer and filesystem code have been changed to use 64-bit disk sector addresses. The SCSI layer has been changed to allow inquiry of large disks.
All this means we now support large disks, partitions and filesystems. The statvfs(3) commits were one more step: the code that retrieves disk usage and related statistics had to be adapted too. This is more involved than you'd think: struct statfs needed to be expanded to allow for the larger block and file counts, which in turn required some careful backward compatibility work. Because this was a bit tricky, it did not make OpenBSD 4.3, alas. With struct statfs extended, supporting statvfs(3) has become easy.
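For readers who have not used the interface, here is a small sketch of querying filesystem statistics through Python's `os.statvfs`, which wraps the platform's statvfs call. The helper name `disk_usage` is hypothetical; the field names (`f_frsize`, `f_blocks`, `f_bavail`, `f_files`) are the POSIX ones. This only illustrates the caller's side, not the kernel changes described above.

```python
import os

def disk_usage(path):
    """Return (total_bytes, available_bytes, total_inodes) for path's filesystem."""
    sv = os.statvfs(path)
    # Block counts are expressed in units of f_frsize (the fragment size).
    total_bytes = sv.f_frsize * sv.f_blocks
    avail_bytes = sv.f_frsize * sv.f_bavail   # space available to unprivileged users
    return total_bytes, avail_bytes, sv.f_files

total, avail, inodes = disk_usage("/")
print(f"total={total} avail={avail} inodes={inodes}")
```

On a multi-terabyte filesystem the block counts no longer fit in 32 bits, which is why the underlying structure had to grow.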
There are a few things to keep in mind when using large partitions and FFS2. In particular, checking a large filesystem requires a lot of memory, and the largest factor is the number of inodes in the filesystem. The default block and fragment sizes cause a lot of inodes to be created, so for large filesystems you want to enlarge both so that fewer inodes are created. Test things: you do not want to discover after the fact that you cannot repair a filesystem because fsck needs more than MAXDSIZE memory.
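To get a feel for how much the fragment size matters, here is a rough back-of-the-envelope sketch. It assumes that one inode is created for every four fragments of data space and that the default fragment size is 2048 bytes; both are assumptions about the newfs defaults, so check newfs(8) on your system. The helper name `estimated_inodes` is hypothetical, and the real inode count on a given filesystem will differ somewhat.

```python
def estimated_inodes(fs_bytes, frag_size, frags_per_inode=4):
    """Rough inode count, assuming one inode per frags_per_inode fragments."""
    return fs_bytes // (frag_size * frags_per_inode)

# Size of the 'a' partition on the test system in this article: sectors * 512 bytes.
fs_bytes = 8789059521 * 512

print(estimated_inodes(fs_bytes, 2048))    # assumed default fragment size
print(estimated_inodes(fs_bytes, 65536))   # fragment size used on the test system
```

Under these assumptions, the larger fragment size cuts the inode count by a factor of 32, and fsck's memory needs should shrink accordingly.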
In the future, we would like to solve this problem by allowing some sort of background file system check.
Another thing to remember: the boot loaders and the install/upgrade kernel do not know FFS2. Do not use FFS2 for any filesystem touched by the install/upgrade process (e.g. /, /usr, /tmp and /var).
Also, not all controllers actually support large disks: ami(4) for example only allows logical volumes up to 2TB. This is a hardware restriction, not a driver restriction. Other hardware/driver combinations might have their own limitations.
Here are some dmesg lines, along with bioctl, disklabel and df output from my test system:
arc0 at pci2 dev 14 function 0 "Areca ARC-1120" rev 0x00: irq 10
arc0: 8 ports, 256MB SDRAM, firmware V1.42 2006-10-13
scsibus0 at arc0: 16 targets
sd0 at scsibus0 targ 0 lun 0:SCSI3 0/direct fixed
sd0: 4291533MB, 67449 cyl, 511 head, 255 sec, 512 bytes/sec, 8789059584 sec total

$ sudo bioctl -h arc0
Volume  Status     Size  Device
arc0 0  Online     4.1T  sd0     RAID5
     0  Online     699G  0:0.0   noencl
     1  Online     699G  0:2.0   noencl
     2  Online     699G  0:3.0   noencl
     3  Online     699G  0:4.0   noencl
     4  Online     699G  0:5.0   noencl
     5  Online     699G  0:6.0   noencl
     6  Online     699G  0:7.0   noencl

$ sudo disklabel sd0
# Inside MBR partition 3: type A6 start 63 size 199114390
# /dev/rsd0c:
type: SCSI
disk: SCSI disk
label: ARC-1120-VOL#00
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 547093
total sectors: 8789059584
rpm: 10000
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

16 partitions:
#          size  offset  fstype  [fsize bsize cpg]
  a: 8789059521      63  4.2BSD   65536 65536   1
  c: 8789059584       0  unused       0     0

$ df /big
Filesystem  1K-blocks     Used       Avail       Capacity  Mounted on
/dev/sd0a   4390189376    9137728    4161542208  0%        /big
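As a quick sanity check on the numbers above, this sketch converts the sector count reported by dmesg into megabytes and confirms that it does not fit in a 32-bit sector address. The helper name `sectors_to_mib` is hypothetical.

```python
def sectors_to_mib(sectors, sector_size=512):
    """Convert a sector count to whole mebibytes."""
    return sectors * sector_size // (1 << 20)

total_sectors = 8789059584   # "sec total" from the sd0 dmesg line

print(sectors_to_mib(total_sectors))   # 4291533, matching the dmesg size
print(total_sectors > 0xFFFFFFFF)      # True: needs more than 32 bits
```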
Thanks for the great work, Otto!
By Matthew Dempsky (38.102.129.10) on
After reading otto@'s comment about ami(4) and >2TB disks, I looked at the man page and saw no other mention of this. Will known limitations like these be documented once large disks are fully supported?
Thanks!
By Otto Moerbeek (otto) on http://www.drijf.net
>
> After reading otto@'s comment about ami(4) and >2TB disks, I looked at the man page and saw no other mention of this. Will known limitations like these be documented once large disks are fully supported?
>
> Thanks!
Depends. In most cases, the data about the max size of a raid set should be documented by the vendor. Only if the driver poses special restrictions should it be documented in the driver man page, imo.
By Niall O'Higgins (69.12.154.240) niallo@niallohiggins.com on http://niallohiggins.com
Minor correction to the article - isn't statvfs a section 2 manual page?
By Otto Moerbeek (otto) on http://www.drijf.net
>
> Minor correction to the article - isn't statvfs a section 2 manual page?
Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
By Igor Sobrado (156.35.192.2) sobrado@ on
>
> Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).
By the way, thanks a lot for your excellent work on supporting large filesystems! It is a great improvement and will become more and more important in the next years as the disk sizes grow.
By Anonymous Coward (69.12.154.240) on
> >
> > Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
>
> I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).
Yes, that's what I am referring to.
By Otto Moerbeek (otto) on http://www.drijf.net
> > >
> > > Nope, I have adapted struct statfs, statvfs(3) is just a wrapper to the adapted statfs(2) call, not a syscall itself.
> >
> > I think that Niall is saying that the .Dt macro in the manual page source code shows that statvfs(3) is in section 2 (System Calls), even if the manual page resides in the right section (3, Subroutines).
>
> Yes thats what I am referring to.
>
Oh, but that has been fixed for a few days already.