PostgreSQL's fsync() surprise
Developers of database management systems are, by necessity, concerned about getting data safely to persistent storage. So when the PostgreSQL community found out that the way the kernel handles I/O errors could result in data being lost without any errors being reported to user space, a fair amount of unhappiness resulted. The problem, which is exacerbated by the way PostgreSQL performs buffered I/O, turns out not to be unique to Linux, and will not be easy to solve even there.
Craig Ringer first reported the problem to the pgsql-hackers mailing list at the end of March. In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.
What about error status reporting? One year ago, the Linux Filesystem, Storage, and Memory-Management Summit (LSFMM) included a session on error reporting, wherein it was described as "a mess"; errors could easily be lost so that no application would ever see them. Some patches merged during the 4.13 development cycle improved the situation somewhat (and 4.16 had some changes to improve it further), but there are still ways for error notifications to be lost, as will be described below. If that happens to a PostgreSQL server, the result can be silent corruption of the database.
PostgreSQL developers were not pleased. Tom Lane described it as "kernel brain damage", while Robert Haas called it "100% unreasonable". In the early part of the discussion, the PostgreSQL developers were clear enough on what they thought the kernel's behavior should be: pages that fail to be written out should be kept in memory in the "dirty" state (for later retries), and the relevant file descriptor should be put into a permanent error state so that the PostgreSQL server cannot miss the existence of a problem.
Where things go wrong
Even before the kernel community came into the discussion, though, it started to become clear that the situation was not quite as simple as it might seem. Thomas Munro reported that Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. And, as it turns out, the way that PostgreSQL handles buffered I/O complicates the picture considerably.
That mechanism was described in detail by Haas. The PostgreSQL server runs as a collection of processes, many of which can perform I/O to the database files. The job of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures. The checkpointer doesn't normally keep all of the relevant files open, so it often has to open a file before calling fsync() on it. That is where the problem comes in: even in 4.13 and later kernels, the checkpointer will not see any errors that happened before it opened the file. If something bad happens before the checkpointer's open() call, the subsequent fsync() call will return successfully. There are a number of ways in which an I/O error can happen outside of an fsync() call; the kernel could encounter one while performing background writeback, for example. Somebody calling sync() could also encounter an I/O error — and consume the resulting error status.
Haas described this behavior as failing to live up to what PostgreSQL expects.
Joshua Drake eventually moved the conversation over to the ext4 development list, bringing in part of the kernel development community. Dave Chinner quickly described this behavior as "a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation". Ted Ts'o, instead, explained why the affected pages are marked
clean after an I/O error occurs; in short, the most common cause of I/O
errors, by far, is a user pulling out a USB drive at the wrong time. If
some process was copying a lot of data to that drive, the result will be an
accumulation of dirty pages in memory, perhaps to the point that the system
as a whole runs out of memory for anything else. So those pages cannot be
kept if the user wants the system to remain usable after such an event.
Both Chinner and Ts'o, along with others, said that the proper solution is
for PostgreSQL to move to direct I/O (DIO) instead. Using DIO gives a
greater level of control over writeback and I/O in general; that includes
access to information on exactly which I/O operations might have failed.
Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution. But he also noted that getting there is "a metric ton of work" that isn't going to happen anytime soon. Meanwhile, he said, there are other programs (he mentioned dpkg) that are also affected by this behavior.
Toward a short-term solution
As the discussion went on, a fair amount of attention was paid to the idea that write failures should result in the affected pages being kept in memory, in their dirty state. But the PostgreSQL developers had quickly moved on from that idea and were not asking for it. What they really need, in the end, is a reliable way to know that something has gone wrong. Given that, the normal PostgreSQL mechanisms for dealing with errors can take over; in its absence, though, there is little that can be done.
One idea that came up a few times was to respond to an I/O error by marking the file itself (in the inode) as being in a persistent error state. Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions, including: when and how would that flag ever be cleared? So this change seems unlikely to happen.
At one point in the discussion, Ts'o mentioned that Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.
Meanwhile, Jeff Layton pondered another idea: setting a flag in the filesystem superblock when an I/O error occurs. A call to syncfs() would then clear that flag and return an error if it had been set. The PostgreSQL checkpointer could make an occasional syncfs() call as a way of polling for errors on the filesystem holding the database. Freund agreed that this might be a viable solution to the problem.
Any such mechanism will only appear in new kernels, of course; meanwhile, PostgreSQL installations tend to run on old kernels maintained by enterprise distributions. Those kernels are likely to lack even the improvements merged in 4.13. For such systems, there is little that can be done to help PostgreSQL detect I/O errors. It may come down to running a daemon that scans the system log, looking for reports of I/O errors there. Not the most elegant solution, and one that is complicated by the fact that different block drivers and filesystems tend to report errors differently, but it may be the best option available.
The next step is likely to be a discussion at the 2018 LSFMM event, which
happens to start on April 23. With luck, some sort of solution will
emerge that will work for the parties involved. One thing that will not
change, though, is the simple fact that error handling is hard to get
right.
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 18:03 UTC (Wed) by cesarb (subscriber, #6266) [Link]
What if two programs did that?
PostgreSQL calls syncfs() and receives no error indication; an error happens, but another process calls syncfs() and clears the flag; PostgreSQL calls syncfs() and receives no error indication again. Oops!
It would be better to have a per-filesystem error counter, instead of a flag: if the error count didn't increase when PostgreSQL checks it again, no error occurred in the meantime, no matter how many other processes have checked for errors.
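The difference between the flag and the counter can be modeled in a few lines of C; this is a toy model of the suggestion, not kernel code:

```c
/* Toy model of a per-filesystem error counter. Each caller keeps its
   own "last seen" cursor, so one process checking for errors cannot
   hide them from another -- unlike a single flag, which the first
   syncfs() caller would clear for everyone. */
typedef struct { unsigned long errors; } fs_state_t;

static void record_io_error(fs_state_t *fs) { fs->errors++; }

/* Returns 1 (and advances the caller's cursor) if any error occurred
   since this particular caller last checked. */
static int check_errors(const fs_state_t *fs, unsigned long *last_seen)
{
    if (fs->errors != *last_seen) {
        *last_seen = fs->errors;
        return 1;
    }
    return 0;
}
```

With the counter, both PostgreSQL's cursor and the other process's cursor independently observe that the count advanced; neither poll consumes the event for the other.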
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 18:41 UTC (Wed) by corbet (editor, #1) [Link]
The counter is probably how it will actually be implemented, from my (re)reading of the discussion. I didn't quite describe the mechanism correctly — I think. It's hard to tell for sure since no patches have actually been posted yet.
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 18:43 UTC (Wed) by jlayton (subscriber, #31672) [Link]
In practice, we'd want to keep an errseq_t in the superblock instead of a flag. That would allow us to ensure that we report an error to syncfs only once per file description. The big issue there though is that we also need another 32-bits per file description (aka struct file) to act as its "cursor" in the error stream, or we need to figure out some way to share the file->f_wb_err field that we use for fsync.
I proposed a draft patch earlier this week (which I meant to send as an RFC) that does the latter. It's based on Willy's suggestion to only report errors from the errseq_t when the fd is an O_PATH open. You can't call fsync on an O_PATH open, so that should be safe (though it is horribly non-obvious from a userland API standpoint).
Déjà vu?
Posted Apr 18, 2018 18:50 UTC (Wed) by marcH (subscriber, #57642) [Link]
- a concurrency nightmare
- an error reporting nightmare
- not so greatly supported by programming languages and systems
- critical for acceptable performance...
Error handling generally has:
- near zero test coverage
:-(
PS: the quality and clarity of this article are stunning. Made me feel once again good about paying my own subscription (as opposed to just using my company's)
Déjà vu?
Posted Apr 18, 2018 21:23 UTC (Wed) by wahern (subscriber, #37304) [Link]
Déjà vu?
Posted Apr 19, 2018 14:27 UTC (Thu) by ringerc (subscriber, #3071) [Link]
Déjà vu?
Posted Apr 20, 2018 7:46 UTC (Fri) by marcH (subscriber, #57642) [Link]
By the way Java makes a decent attempt with "Futures"
https://docs.oracle.com/javase/7/docs/api/java/util/concu...
Java was also the first language to have a formal memory model. These "performance" features may explain why Java was more successful on the server side than in embedded for which it was targeted initially.
Ugly[*] and not fun but doing the job!
[*] https://steve-yegge.blogspot.com/2006/03/execution-in-kin...
Déjà vu?
Posted May 4, 2018 4:53 UTC (Fri) by ncm (guest, #165) [Link]
They tried, bless their hearts.
Déjà vu?
Posted May 6, 2018 1:46 UTC (Sun) by marcH (subscriber, #57642) [Link]
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 20:30 UTC (Wed) by flussence (subscriber, #85566) [Link]
I don't think that's necessarily a bad thing. POSIXly correct filesystems have surprised users in unpleasant ways in the past; recall early ext4 eating people's DE config files, all because the standard had some undefined behaviour around file writes and renames.
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 21:35 UTC (Wed) by wahern (subscriber, #37304) [Link]
The fact that these issues have gone undiscovered and subsequently unattended for so long should disabuse people of the notion that there's sufficient interest or resources in supplanting POSIX as a standard. How many file systems have come and gone. ext4 is the closest thing to a de facto standard in Linux but it has survived precisely because of its simplicity and by having POSIX compliance as a guide star (intentionally or not), as opposed to chasing new ideas.
The biggest hurdle for any large or complex project is coordinating effort and maintaining focus. Standards help immensely in this regard, even flawed ones.
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 12:12 UTC (Thu) by eru (subscriber, #2753) [Link]
PostgreSQL's fsync() surprise
Posted Apr 20, 2018 10:19 UTC (Fri) by anton (subscriber, #25547) [Link]
> POSIXly correct filesystems have surprised users in unpleasant ways in the past; recall early ext4 eating people's DE config files, all because the standard had some undefined behaviour around file writes and renames.

If a standard does not define something, it's up to the implementation to do it; i.e., it's their responsibility. Sufficiently bloody-minded implementors produce unpleasant surprises, and then point to standards or benchmarks as an excuse; but as long as the standard does not require the unpleasant behaviour (in which case it would be defined, not undefined), the implementor has the choice, and therefore the responsibility. Of course, implementors who blame the standard don't want you to recognize this, and often argue as if lack of definition in the standard required them to behave unpleasantly. It doesn't.
I wonder if the "what POSIX mandates" in the article really refers to a mandate by POSIX, or another case of lack of definition that an implementor sees as a welcome opportunity for an unpleasant surprise.
PostgreSQL's fsync() surprise
Posted Apr 20, 2018 18:12 UTC (Fri) by zlynx (guest, #2285) [Link]
If everyone is expected to be nice instead of following the standards, then there's no point in the current standard and it should be replaced with the "be nice" version.
For example, there are people who expect TCP/IP to deliver their packets in the same sized chunks they were sent. These people are simply wrong. But by the "be nice" standard we'd have to write stupid networking stacks because some people expect behavior that isn't required.
Maybe it's time for a POSIX 2020 standard. But if it isn't in there, don't expect it to work like anything else.
PostgreSQL's fsync() surprise
Posted Apr 21, 2018 14:54 UTC (Sat) by anton (subscriber, #25547) [Link]
Yes, ideally standards would be complete. In practice, they tend to specify just the intersection of the behaviour of the existing implementations (in line with the requirement that a standard should standardize common practice), as well as considering various constraints on outlier systems; e.g., "We want this standard to be implementable on a system with 64KB RAM, and mandating the pleasant behaviour would cost several KB for this subfeature alone, so we leave the behaviour unspecified." And then a bloody-minded implementor for systems that use multiple GBs of RAM uses the lack of specification as justification to implement unpleasant behaviour. And don't forget that standards are decided through consensus in the committee, so it takes just a few bloody-minded implementors on the standards committee to block any progress towards pleasantness.

> If everyone is expected to be nice instead of following the standards

That's an excellent example of what I mean with "hiding behind the standard", and why I suspect that "what POSIX mandates" is in reality different from what was claimed in the discussion described in the article. If the standards do not specify what the implementation should do ("undefined behaviour" or somesuch), there is nothing in the standard that the implementation could follow, and it's the sole responsibility of the implementor to choose a particular behaviour. If, in such a situation, the implementor chooses to implement unpleasant behaviour, it's his fault, and his fault alone; the standard did not make him do it.
PostgreSQL's fsync() surprise
Posted Apr 24, 2018 16:32 UTC (Tue) by nybble41 (subscriber, #55106) [Link]
All true, of course, but "unpleasant behavior" can still be a reasonable choice. Any application which *relied* on system-specific "pleasant" behavior would necessarily be non-portable. If "pleasant" behavior is desirable then, IMHO, the right solution is to standardize the behavior so that applications can be written against the standard and not one particular implementation. In the meantime, the most productive choice when undefined behavior is detected is to complain as loudly as possible, or even terminate the process, rather than allow the application to silently continue in an undefined state. This ensures that the application developer is made aware of the issue and has both the opportunity and incentive to fix it. (However, this outcome should remain *undefined* behavior so that this can be changed in the future if and when more pleasant behavior is standardized.) Going out of one's way to make undefined behavior "pleasant" is a form of attractive nuisance, in that it tends to encourage non-portable code.
In the end, an application which relies on a specific implementation of undefined behavior, pleasant or unpleasant, is broken. A particular installation may do the right thing for certain known inputs; one may even be able to prove that it does the right thing for all possible inputs given perfect knowledge of the implementation in use on a particular system. However, the third layer of software[1]—design/logic—is missing: since the application is not in compliance with the standard, one cannot prove that it will work on any standard-compliant system, including future versions of the same system.
[1] http://www.pathsensitive.com/2018/01/the-three-levels-of-...
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 16:22 UTC (Thu) by anton (subscriber, #25547) [Link]
> All true, of course, but "unpleasant behavior" can still be a reasonable choice.

Yes, as mentioned, when implementing on a system with 64KB, you may not be able to afford the pleasantness. But we would not be discussing this topic if all cases of unpleasant behaviour were reasonable.

> Any application which *relied* on system-specific "pleasant" behavior would necessarily be non-portable.

It would be *potentially* non-portable, not necessarily. It would become actually non-portable if an unpleasant implementation appears. But so what? I am pretty keen on portability, but life's too short for unreasonably unpleasant implementations. If your program does not run in 64KB anyway, there is no need to cater to that reasonable unpleasantness; and if you want to cater to unreasonable unpleasantness, it's your time and money to waste (after all, some people write programs in Brainfuck), but I would not recommend it to anyone else.

> If "pleasant" behavior is desirable then, IMHO, the right solution is to standardize the behavior so that applications can be written against the standard and not one particular implementation.

If you think so, go ahead and work on standardizing pleasant behaviours. But as mentioned, there is the issue of constrained systems where you cannot afford the pleasantness. One solution is to specify several levels of the standard. The minimal level allows unpleasantness that is reasonable on constrained systems; a higher level specifies more pleasantness. However, if you have unreasonable implementors in the standards committee, you will be out of luck in your standardization effort.

Concerning reporting when undefined behaviour is performed, that's a relatively pleasant way to deal with the situation. It's not appropriate when the application developer actually wants to rely on a specific behaviour and does not want to "fix" it, but it certainly makes it clear that your implementation is not pleasant enough to run this application.

> In the end, an application which relies on a specific implementation of undefined behavior, pleasant or unpleasant, is broken.

No, it isn't. If it behaves as intended in a specific setting, it's working, not broken. It may be unportable, but that does not make it broken.

> since the application is not in compliance with the standard, one cannot prove that it will work on any standard-compliant system

Most programmers do not formally verify their programs, but instead test them. There is no way to prove that a program is in compliance with a standard by testing, even if the programmer intends to avoid undefined behavior. But even the few programmers that actually use formal verification for their programs cannot prove that their programs comply with most standards (e.g., POSIX), because most standards are not formally specified. So this whole proof issue is a red herring.

> including future versions of the same system.

Any system worth using (e.g., Linux) maintains in future versions the pleasantness it has supported in earlier versions.
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 17:06 UTC (Thu) by zlynx (guest, #2285) [Link]
No, because that is an unreasonable limit.
Simply because of implementation limits, ext3 serialized file and directory updates in a certain way and for many years. So people got used to it. But it never applied to ext2, XFS or FAT or literally ANY other filesystem. Not to mention BSD's UFS or Hammer2, or Apple's HFS. Heck, it didn't even apply to ext3 in certain configurations.
And then people tried to require that ext4 work the same way. And btrfs. And even wanted to go back to force XFS to work that way too.
The correct answer is to fsync() everything, which would show how bad ext3 was at that particular operation. All those fsyncs make things slower for people using ext3, but that does not mean fsync is the wrong answer. It just means ext3 was a filesystem with a terrible fsync() implementation that people got used to using.
"Pleasant behavior" is often simply what programmers have become used to. It doesn't make it correct or actually pleasant.
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 22:42 UTC (Thu) by nybble41 (subscriber, #55106) [Link]
See, you're talking about level 2 (particular implementations). Portable program *design* happens at level 3 (design/logic). If your program relies on behavior which is undefined according to the standard then it is non-portable, regardless of whether other implementations behave the same way. You can't say "this program works on any POSIX-compatible system", for example. You know that it works on Linux version X and maybe BSD version Y, but if someone puts together a new OS which follows all the relevant standards neither you nor they can be confident that your program will work on it unmodified.
> Most programmers do not formally verify their programs, but instead test them.
Formal verification in this context is a red herring. Tests are also a form of proof, albeit in the weaker courtroom-style, balance-of-evidence sense rather than the strict mathematical sense. The point is that without a standard you don't have a sound basis for reasoning "I called the function with these arguments, therefore the implementer and I both know that it should do this." Standards are how users and implementers of an API communicate. Relying on undefined behavior in your program is like speaking gibberish and expecting the listener to guess what you meant; there is a breakdown in communication, and the problem isn't on the implementer's end.
> Any system worth using (e.g., Linux) maintains in future versions the pleasantness it has supported in earlier versions.
As zlynx already explained, that is an unreasonable expectation and even Linux doesn't always operate that way.
PostgreSQL's fsync() surprise
Posted Feb 14, 2019 21:21 UTC (Thu) by dvdeug (subscriber, #10998) [Link]
> if someone puts together a new OS which follows all the relevant standards neither you nor they can be confident that your program will work on it unmodified.
Even POSIX-compatible systems aren't perfectly interchangeable. In the case of a program like PostgreSQL, it's usually important not just that it runs, but that it runs well, and POSIX cannot and does not guarantee speed constraints; even Linux alone can store its filesystems in many different ways on many different media, and some of those combinations may not work in practice for PostgreSQL.
> Standards are how users and implementers of an API communicate.
In theory, but not in reality. Most of the APIs a major program depends on are implemented by one library and have but vague descriptions of how they work outside the source code and behavior of that library. There were many Unixes before POSIX, many C and C++ compilers before the first standard was written down. Many people still depend on specialized features of GNU C, enough that several compilers have to copy those unstandardized features. Standards are wonderful if they're followed, but many are underspecified or just usually ignored. New versions of the C, C++ and Scheme standards have removed features that older standards had mandated because they were not well supported.
A huge example is the fact that most of these standards are written in English, an unstandardized language, not Lojban or even French. How can we know what a standard means if the language it is written in is unstandardized? But, for the most part, we manage.
PostgreSQL's fsync() surprise
Posted Feb 18, 2019 20:09 UTC (Mon) by nybble41 (subscriber, #55106) [Link]
True, but irrelevant. I only mentioned POSIX as an example. No one is expecting a complex project like PostgreSQL to work equally well under all POSIX-compliant operating systems; there will be other dependencies.
Regarding the first point, a POSIX-compliant program would check for malloc() errors and either recover or terminate in a well-defined way. The program is portable as long as the behavior is well-defined for all conforming implementations; this is a separate consideration from being *useful*.
>> Standards are how users and implementers of an API communicate.
> In theory, but not in reality. Most of the APIs a major program depends on are implemented by one library and have but vague descriptions of how it works outside the source code and behavior of that library.
What you are describing is a failure to communicate. Programs written this way are inherently non-portable because they are written to fit the specifics of particular implementations. Any change to an implementation can cause any program to break in unspecified ways. This is the problem which standards exist to solve. They allow implementers and users of an interface to agree on roles and responsibilities; implementers can improve their code without worrying about breaking standards-compliant users, and users know which parts of the interface they can rely on and which parts may vary from one implementation (or version) to the next.
> How can we know what a standard means if the language it is written in is unstandardized?
"How can digital logic exist when all electronic components have analog characteristics?" This is bordering on abstract philosophy in the "can two people ever truly communicate" sense, but I'll try to answer it seriously anyway: We distinguish between parts of the language we can rely on for clear communication and parts which, while perhaps useful in other contexts, fail to clearly convey our intent, and build up more complex constructs from elements of the first set. The subset of natural language used for formal standards is actually pretty tightly constrained compared to literature in general. Even so, the dependency on natural language for formal specifications is a weak point and communication does occasionally break down as a result. We have feedback mechanisms in place to detect such breakdowns and correct them by issuing clarifications or revising the standards.
PostgreSQL's fsync() surprise
Posted Feb 19, 2019 0:42 UTC (Tue) by dvdeug (subscriber, #10998) [Link]
As for which parts may vary from version to version, version 3 may adhere to an entirely different standard than version 2. The fact that there is a standard may do you no good if it's evolving rapidly along with the software.
From the other side, even if you are standards conforming, that may not be enough. A user can expect that qsort sorts, but can they expect that it does so reasonably quickly? How often can you call fsync to maintain a reasonable balance between speed and safety? That's never going to be defined by the standard, but an understanding needs to be reached by the authors of a program like PostgreSQL.
I don't believe it's a question of abstract philosophy. If standards were a tool used in some places and not others in the computer world, and at their best were understood to be insufficient to bind either implementer or user, then it would be reasonable to use unstandardized language in writing them. But if _all_ APIs should depend on standards, then writing those standards in an unstandardized language is hard to justify when, again, formal languages like Lojban, or simply standardized ones like French, exist.
PostgreSQL's fsync() surprise
Posted Feb 19, 2019 22:39 UTC (Tue) by nybble41 (subscriber, #55106) [Link]
It's not a matter of either/or. Programs should be both portable *and* useful.
> A user can expect that qsort sorts, but can they expect that it does so reasonably quickly? How often can you call fsync to maintain a reasonable balance between speed and safety? That's never going to be defined by the standard...
Why not? Standards do sometimes specify things like algorithmic complexity. C doesn't specify that for qsort(), unfortunately, but C++ does require std::sort() to be O(n log n) in the number of comparisons. What constitutes a "reasonable balance" is up to the user, but there is no reason in principle why there couldn't be a standard for "filesystems useable with PostgreSQL" which defines similar timing requirements for fsync().
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 16:29 UTC (Thu) by Wol (subscriber, #4433) [Link]
As I understand it, POSIX explicitly *avoids* specifying what happens when things go wrong, precisely because POSIX has no idea what's happened.
So a linux standard that says "this is the way we handle errors" will be completely orthogonal to POSIX. And would be a good thing ...
The trouble with POSIX is it's an old standard, that is out-of-date, and while I believe there is some effort at updating it, there is far too much undefined behaviour out there.
Cheers,
Wol
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 20:47 UTC (Wed) by willy (subscriber, #9762) [Link]
Before, we would mark the inode as having a writeback error and then at least one caller of fsync would receive that error. So if nobody other than checkpointer was calling fsync and the inode wasn't evicted from memory, checkpointer would see the error.
Now, we assume that anybody opening the file isn't interested in historical errors or they would have had the fd open. Clearly not true for PostgreSQL. What I suggested was that *if* nobody has seen the error yet, then the error is not so historical after all, and we should report it to the new opener.
As I alluded earlier, we can still lose errors this way if the inode was evicted under memory pressure. It just restores some of the earlier behaviour we had.
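A toy model (with invented names, not the actual errseq_t code) makes those semantics concrete:

```c
/* Toy model of "report unseen errors to new openers": an error
   counter plus a "seen" bit. A new opener samples a cursor at open()
   time; if an error is still unseen by everyone, the sample is
   deliberately stale, so that opener's first fsync() reports it. */
typedef struct { unsigned counter; int seen; } wb_err_t;

static void record_error(wb_err_t *e) { e->counter++; e->seen = 0; }

/* Called at open(): hand back a stale cursor if an error is unseen. */
static unsigned sample(wb_err_t *e) { return e->seen ? e->counter : 0; }

/* Called at fsync(): report -1 (think EIO) once if the counter advanced. */
static int check(wb_err_t *e, unsigned *cursor)
{
    if (e->counter != *cursor) {
        *cursor = e->counter;
        e->seen = 1;
        return -1;
    }
    return 0;
}
```

Once some opener has reported the error, later openers sample the current counter and see a clean fsync(), matching the "historical errors" behavior described above.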
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 11:12 UTC (Thu) by jlayton (subscriber, #31672) [Link]
That said, I'm not opposed to re-enabling the ability to see unreported errors that occurred prior to the open (and your scheme to do that was pretty clever) if the Pg folks think it's of value in the near term. Maybe we could make that behavior opt-in based on a sysctl or something?
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 21:20 UTC (Wed) by Sesse (subscriber, #53779) [Link]
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 21:29 UTC (Wed) by willy (subscriber, #9762) [Link]
PostgreSQL's fsync() surprise
Posted Apr 18, 2018 21:33 UTC (Wed) by Sesse (subscriber, #53779) [Link]
Failed writeback to removable devices
Posted Apr 19, 2018 6:36 UTC (Thu) by epa (subscriber, #39769) [Link]
The most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else.

That suggests that removable devices should be handled a little differently for writeback. The number of dirty pages should be more strictly capped (perhaps ten megabytes per USB drive, or some other heuristic) so that writing becomes closer to synchronous.

This is also a usability improvement: when copying files around on the hard disk, I don't care if they are still dirty pages in memory to be flushed to disk later. Once the file copy has 'completed' I can go on to the next task. But if copying to a USB stick, 99% of the time it's because you want to take the USB stick out of the computer and take the data somewhere else (or possibly to then boot from that device). Here the user really does need to wait for the data to be written to the device, and it doesn't help if the file copy dialogue box (or 'cp' command) appears to finish but the writeback happens in the background, with no indication to the user of progress so far or a notification when it completes. As the article notes, that can also lead users to remove the USB stick before the pages are flushed, thinking that the operation has completed -- and if lecturing users about this worked, we wouldn't still have the problem after twenty years.
So I suggest for the most common kinds of removable devices that can be identified as such, the kernel should keep a lid on writeback, both for the total number of dirty pages and how long they can hang around before being written out (I suggest one second is reasonable for USB sticks). For non-removable devices which can be identified as such, the kernel could be a bit more careful and not blithely clear the dirty bit on pages that couldn't be flushed because of I/O errors. Yes, I know that in principle you may not be able to tell in advance which devices are removable and which aren't, but this is more a theoretical than a practical concern: it would be sufficient to treat USB-attached drives as removable and ATA/SCSI/whatever ones as non-removable.
Failed writeback to removable devices
Posted Apr 19, 2018 7:07 UTC (Thu) by neilbrown (subscriber, #359) [Link]
Memory management is predicated on the fact that dirty pages can be cleaned, and clean pages can be freed - in deterministic time. If you mess with that, then deadlocks are just around the corner. Possibly you could set a quota of unwriteable pages and keep pages around as long as there are fewer than the limit, but I doubt that would really end up being helpful.
Failed writeback to removable devices
Posted Apr 19, 2018 9:24 UTC (Thu) by epa (subscriber, #39769) [Link]
For removable devices, the data does have to be thrown away after a write error (assuming that the error was due to unplugging the device).
Failed writeback to removable devices
Posted Apr 20, 2018 9:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Failed writeback to removable devices
Posted Apr 20, 2018 11:43 UTC (Fri) by ringerc (subscriber, #3071) [Link]
We'd receive SIGCHLD for the killed user backend worker(s)/checkpointer/etc, which would trigger crash recovery where we kill all other backends then execute redo. That's perfect. Something portable would be better, of course, but something that covers 95% of users is pretty darn good.
I was unaware of the hwpoison mechanism.
Failed writeback to removable devices
Posted Apr 19, 2018 13:20 UTC (Thu) by NRArnot (subscriber, #3033) [Link]
(Yes, they are fairly crappy by disk drive standards, but they have the advantage over any form of backup across a network that as soon as you pull the USB cable, the data is no longer vulnerable to hostile actors elsewhere on the internet. So as a secondary, last-resort backup of an entire system, they are valuable.)
Failed writeback to removable devices
Posted Apr 19, 2018 15:07 UTC (Thu) by farnz (subscriber, #17727) [Link]
Ideally, you'd be able to tune the dirty pages cap to match the throughput the target can handle in a sensible timescale - say 100 ms for removable devices. That way, your device still has a lot of data to handle compared to its throughput, but it'll block applications when they're able to generate dirty pages far faster than your device can handle - no more application finished with 60 seconds of data left to write out to your USB device.
Something like less-annoying background writeback definitely takes you in the right direction…
Failed writeback to removable devices
Posted Apr 27, 2018 10:39 UTC (Fri) by Wol (subscriber, #4433) [Link]
On a twin-core machine, load average will hit 4 or 5 or 6, and system response basically goes through the floor. Actually, the cause could well be that RAM is flooded, but whatever the cause, it's rather frustrating.
Cheers,
Wol
Failed writeback to removable devices
Posted Apr 19, 2018 16:43 UTC (Thu) by nix (subscriber, #2304) [Link]
Honestly, I suspect most of us here are doing last-ditch USB drive backups of *something*, at least.
Failed writeback to removable devices
Posted Dec 31, 2020 15:13 UTC (Thu) by andrit (guest, #143916) [Link]
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 6:52 UTC (Thu) by mjthayer (guest, #39183) [Link]
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 7:16 UTC (Thu) by neilbrown (subscriber, #359) [Link]
In other situations, errors are more likely and less fatal. Writes to NFS don't necessarily return ENOSPC immediately - you might not get that until fsync. If you use thin-provisioning then (apparently) it is possible to get IO errors which will stop happening once the admin plugs in a new device.
Quoting from the email thread:
> This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST
> make sure you don't run out of disk. Because the usual safeguard of space
> reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested
> this with nfsv3 in sync,hard,nointr mode yet, *maybe* that's safe, but I
> doubt it). The same applies to thin-provisioned storage. Just. Don't.
However, the whole point of having an OS is to hide these details. You shouldn't *have* to care what sort of filesystem or storage you are using - behavior should be predictable. Unfortunately, that isn't how it works in the real world.
PostgreSQL's fsync() surprise
Posted May 6, 2018 6:04 UTC (Sun) by ssmith32 (subscriber, #72404) [Link]
It's not really the OS's job to make unreliable hardware reliable or slow hardware performant.
And I do agree about thin provisioning... for critical data, just don't.
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 16:18 UTC (Thu) by ringerc (subscriber, #3071) [Link]
But it also needs to be able to know reliably that "all data from last successful flush is now fully flushed", so it can make decisions appropriately. Right now it turns out we can't know that.
Nobody really wants a kernel panic or database crash because we can't fsync() some random session table that gets nuked by the app every 15 minutes anyway, after all. In practice that won't happen, because the table is usually created UNLOGGED; but there are always going to be tables you don't want to lose, yet don't want the whole system to grind to a halt over either.
PostgreSQL's fsync() surprise
Posted Apr 26, 2018 18:02 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
FWIW, I don't agree that that's a useful goal. It'd be nice in theory, but it's not even remotely worth the sort of engineering effort it'd require.
> Nobody really wants a kernel panic or database crash because we can't fsync() some random session table that gets nuked by the app every 15 minutes anyway, after all.
I don't think that's a realistic concern. If your storage fails, you're screwed. Continuing to behave well in the face of failing storage would require a *LOT* of work. We'd need timeouts everywhere, we'd need multiple copies of the data etc.
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 13:26 UTC (Thu) by oseemann (subscriber, #6687) [Link]
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 14:32 UTC (Thu) by ringerc (subscriber, #3071) [Link]
I suggest taking extra care and doing extra testing if you use:
* Any sort of network block device
* Thin-provisioned storage
* multipath I/O (especially if you haven't set queue_if_no_path etc)
Also, take care not to run out of space in your file system, or test disk-exhaustion behaviour in advance, if you use NFS. Or, preferably, don't do that.
But while this is not cool, it's NOT going to be randomly corrupting PostgreSQL installations all over the place. It's also likely that PostgreSQL is far from the only thing affected.
PostgreSQL's fsync() surprise
Posted Apr 19, 2018 20:30 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
Worth noting that that'll probably have to be an opt-in configuration. Using DIO one certainly has more control and can get higher performance, but it also requires that the database be more carefully configured. But a lot of people use PostgreSQL without configuring the size of its own buffer cache at all - the OS adaptively providing a second level of caching makes that OK for a lot of scenarios. Postgres can't realistically figure out how much memory it should use on a given system. It doesn't, and shouldn't, have the information to make such a policy decision.
PostgreSQL's fsync() surprise
Posted Apr 20, 2018 14:38 UTC (Fri) by cornelio (guest, #117499) [Link]
The advantage of keeping around the most experienced filesystem developer ever.
PostgreSQL's fsync() surprise
Posted Apr 23, 2018 21:00 UTC (Mon) by helsleym (subscriber, #92730) [Link]
PostgreSQL's fsync() surprise
Posted Apr 23, 2018 21:19 UTC (Mon) by andresfreund (subscriber, #69562) [Link]
https://github.com/freebsd/freebsd/blob/master/sys/kern/v...
PostgreSQL's fsync() surprise
Posted Apr 25, 2018 14:12 UTC (Wed) by xxiao (guest, #9631) [Link]
PostgreSQL's fsync() surprise
Posted May 9, 2018 20:14 UTC (Wed) by nilsmeyer (guest, #122604) [Link]
Basically MySQL / InnoDB will manage all the buffering and try to bypass the kernel buffering as much as possible. This is why you usually try to allocate most (like 75-80%) of the memory on a MySQL server to the InnoDB buffer pool.
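As an illustrative (not prescriptive) example, a my.cnf fragment for a dedicated server with, say, 32 GB of RAM following that rule of thumb might look like this; the option names are real MySQL options, but the values are assumptions:

```ini
[mysqld]
# Give InnoDB most of the machine's RAM: it caches and writes back
# pages itself, and O_DIRECT keeps the kernel from caching a second copy.
innodb_buffer_pool_size = 24G
innodb_flush_method     = O_DIRECT
```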
PostgreSQL's fsync() surprise
Posted May 2, 2018 23:56 UTC (Wed) by gerdesj (subscriber, #5446) [Link]
PostgreSQL's fsync() surprise
Posted May 3, 2018 0:03 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
PostgreSQL's fsync() surprise
Posted May 3, 2018 6:20 UTC (Thu) by zlynx (guest, #2285) [Link]
PostgreSQL's fsync() surprise
Posted Nov 12, 2018 4:03 UTC (Mon) by immibis (guest, #105511) [Link]
PostgreSQL's fsync() surprise
Posted May 3, 2018 11:26 UTC (Thu) by james (subscriber, #1325) [Link]
That means you can only run them on systems with that sort of storage available -- which means

    dnf install package-that-uses-postgresql-as-a-database-engine

doesn't have a chance of Just Working.
PostgreSQL's fsync() surprise: Patch proposed
Posted May 3, 2018 22:09 UTC (Thu) by tech2018 (guest, #124143) [Link]
https://patchwork.kernel.org/patch/10358111/
PostgreSQL's fsync() surprise
Posted May 10, 2018 1:25 UTC (Thu) by ringerc (subscriber, #3071) [Link]