In the course of my video editing on Windows 7, I had seen two sorts of instances where I might want to change something about a person’s voice. First, it seemed I might want to modify someone’s voice so that it would not be recognizable — if, for instance, I wanted to share someone’s unique utterance while protecting their privacy. Second, I did not always like how my own voice sounded, and I wondered if autotuning or other tweaks could make it sound better. As it turned out, the efforts captured in this post were mostly focused on the first of those two issues.
DIY Voice Alteration with Audacity
Degrees of Voice Anonymization
Testing AV Voice Changer
Trying Audacity with Plug-Ins
Trying MorphVOX Pro
Cleaning Up a Morphed Voice
I went looking for software — preferably, freeware — that would be able to modify the sounds of voices already recorded and saved in audio files. This would be different from programs that would work only on a real-time basis, altering the sound of one’s voice during a phone call, for example, or during online gaming. (In my searching, the most frequently recommended tools for real-time-only purposes included MorphVOX Junior (free; 3.8 stars from 8 raters at Softpedia; 8 votes at AlternativeTo), Classic ($30; 3.3 stars from 2 raters at Softpedia), and Pro ($40; 4.0 stars from 39 raters at Softpedia); VoiceChanger; and Skype Voice Changer.)
My search for apps capable of working on previously recorded files led to numerous sources listing and opining on the best programs of that nature. Those sources included TopBestAlternatives (2016), DigiKnew (2016), AptGadget (2017), TechClickr (2016), TopTenReviews (2017), and List of Freeware (n.d.). These sites tended to favor a few such apps:
- NCH’s Voxal (free or trialware, $15-17; 3.1 stars from 24 raters at Softpedia) headed several such lists.
- AV Voice Changer Software Diamond (trialware; $100, minus possible coupon; 4.0 stars from 250 raters at Softpedia). Note that AV also offered freeware real-time voice changers.
- … as compared to e.g., Free Voice Changer (free; 1.2 stars from 5 raters at Softpedia).
TopTenReviews had a rather different list, including physical devices like the VE-5 Vocal Performer, designed to alter the sound of a singer’s voice during performance. But they did also include AV Voice Changer Software (at the top) and Voxal (in tenth place). Regarding AV, they said,
One of the strengths of the AV Voice Changer is its robust support. Besides the hundreds of optimized voices (nickvoices) built into the program, new voices are added weekly to the program’s add-on store. . . . Yet another standout feature of the AV Voice Changer is its voice morpher. This section of the voice changer software has an adjustable voice equalizer as well as a vocal enhancer. It presents a sound graph with pitch control on one axis and timbre control on the other axis. You can select either mono or stereo sound output and then move a pointer across the graph until you find the right sound modulation for your voice.
I believed TopTenReviews, but in my review of the AV Voice Changer website, I did not see any links to a store offering additional voices. (Later, by pursuing links provided when I downloaded and ran the program (below), I did find a webpage offering 75 additional Nickvoice addons.) I saw indications that AV Voice Changer allowed modifications in pitch and timbre (which it said were the key aspects of a voice) as well as information about, and means of making changes to, other aspects of a voice. I was particularly interested in their references to such concepts as Voice Beautifying, Voice Effects, Audio Effects, Brightness, Vowel Enhancer, and Voice Imitation. The Voice Comparator would apparently assist in making one voice sound more like another (e.g., making mine sound like Frank Sinatra’s).
I had previously had good experiences with NCH software, notably the Debut video capture tool, so I was also interested in Voxal, their relatively inexpensive alternative. But their webpage did not provide much detail. iVoiceSoft characterized Voxal as “basic” software for “simple purpose.” NCH, Softpedia, and FindMySoft offered screenshots that seemed to support that impression: I saw a setting for adjusting pitch, but none for timbre; and it seemed to me that the few available voices (e.g., Astronaut, Super Villain) were mostly oddities of little practical use.
DIY Voice Alteration with Audacity
There was another option. I could use an audio editing program, like Audacity, to attempt do-it-yourself manual waveform adjustments. iVoiceSoft cautioned that this approach would be very complex, and that the options would still be fairly basic. A search led to a JustAlexHalford video that seemed to confirm the results of my own firsthand inspection of Audacity, indicating that the primary adjustments available would be speed and pitch — but, again, apparently not timbre. On the other hand, Audacity plugins offered a rather stupendous array of options, so it did seem that I might eventually be able to figure out something that would work.
Another search led to technical discussions that got very quickly into Nyquist (i.e., a computer programming language for sound synthesis), among other things — and also to a Quora discussion whose participants mostly agreed that there was not, yet, a voiceover (VO) technology that could replace human speakers who had skills and/or training in audio presentation. One participant in that discussion advised that professional audio recording software (e.g., Pro Tools) might have better options than Audacity. Among the suggestions in a Pro Tools discussion, this one seemed the most useful for my purposes — and what it suggested would be feasible in Audacity:
I would have 4 tracks of the same voice. 1st unprocessed, 2nd with small amount of pitch up, 3rd with a similar amount of pitch down and the 4th with moderate pitch down. Mix the 4 to get enough subtle difference. Vary the balance of these four every so often throughout the program to stop some from getting used to your mix enough to start recognising the original voice through your processing.
The concept there, as I understood it, was to use Audacity to overlay several variously adjusted copies of the same recording. Techwalla amplified that by suggesting that the duplicate tracks could be modified, not only in pitch, but also in equalization and otherwise (with e.g., reverb and a chorus filter). A search led to various tips for other voice changes. For example, it seemed that a change between male and female could sometimes be achieved with a simple 19% change of pitch, and there seemed to be various tips on special effects (e.g., telephone or robot voice effects).
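The overlay idea can be sketched in a few lines of Python. This is only a rough illustration, not how Audacity or Pro Tools mixes tracks: it assumes the four copies have already been pitch-shifted elsewhere, that each is a list of float samples of equal length, and that the function names (`mix_tracks`, `mix_with_changing_balance`) and the weight values are hypothetical.

```python
def mix_tracks(tracks, weights):
    """Weighted sum of equal-length tracks (lists of floats in -1.0..1.0)."""
    if len(tracks) != len(weights):
        raise ValueError("need exactly one weight per track")
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("tracks must all have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalise the weights so they sum to 1; the mix then cannot be
    # louder than the loudest input sample.
    norm = [w / total for w in weights]
    return [sum(w * t[i] for w, t in zip(norm, tracks)) for i in range(length)]


def mix_with_changing_balance(tracks, schedule, segment_len):
    """Mix segment by segment, using a different balance for each segment.

    schedule is a list of weight lists, one per consecutive segment of
    segment_len samples; the last entry is reused if the audio runs longer.
    """
    out = []
    for start in range(0, len(tracks[0]), segment_len):
        idx = min(start // segment_len, len(schedule) - 1)
        chunk = [t[start:start + segment_len] for t in tracks]
        out.extend(mix_tracks(chunk, schedule[idx]))
    return out
```

Changing the weights abruptly at a segment boundary, as this sketch does, could produce an audible jump; a real mix would presumably fade from one balance to the next.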
As far as I could tell, all of these approaches would require clear designation of the voice clip to be modified, and would apply to everything within that clip. So if I had a recording of a conversation with someone, and if I wanted to modify their voice but not my own, I would have to make note of my settings, and would have to apply those settings to each portion of the conversation where the other person was speaking, without applying them to my own (or anyone else’s) part of the discussion. To simplify that, I could copy the track in which the recording appeared; remove all of the other person’s remarks from one copy of that track, while removing all of my remarks from the other; and then adjust the settings only for the track where the other person’s voice appeared. There was reportedly no way to automate that separation of my voice from theirs; it would apparently have to be done manually. The results could be weird if there was any consistent background noise (e.g., traffic): apparently it would change (e.g., become slower and deeper) in those portions containing the other person’s voice. Maybe, in some cases, it would be possible to downplay that noise and/or replace it with a separate track of consistent background noise.
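The two-copy approach described above amounts to the following, sketched here in Python under some assumptions: the audio is a list of float samples, the spans where the other person speaks have been noted by hand as (start, end) times in seconds, and `split_by_speaker` is a hypothetical name. It does nothing to detect who is speaking; that part remains manual.

```python
def split_by_speaker(samples, sample_rate, other_speaker_spans):
    """Split one recording into two same-length tracks.

    other_speaker_spans: hand-marked (start_sec, end_sec) pairs covering
    the other person's remarks. Returns (mine, theirs); each track is
    silent wherever the other track carries the audio.
    """
    mine = list(samples)
    theirs = [0.0] * len(samples)
    for start_sec, end_sec in other_speaker_spans:
        start = max(0, int(round(start_sec * sample_rate)))
        end = min(len(samples), int(round(end_sec * sample_rate)))
        for i in range(start, end):
            theirs[i] = samples[i]  # keep their remark on their track
            mine[i] = 0.0           # and silence it on mine
    return mine, theirs
```

Effects would then be applied to `theirs` only, and the two tracks added back together. Because any background noise inside the marked spans travels with `theirs`, it would be altered along with the voice, which is the weirdness noted above.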
Collectively, it did appear that an audio editing program could make significant changes in various aspects of a previously recorded voice, and that the options would grow as one became more familiar with Audacity’s many possibilities. It seemed that an Audacity approach (and, in some cases, an AV Voice Changer approach) could require a lot of work, but would not necessarily be highly technical.
Degrees of Voice Anonymization
It appeared that the real challenge — pursued with limited success in that Pro Tools discussion — would be, not to make a man sound like his sister, but rather to make him sound like a man from some other culture or region. Evidently the technology did not yet exist to do that reliably via software — not only changing the sound of his voice, but also using different terms, inserting different filler words (e.g., “like,” “uh,” “you know”), pausing at different places, pronouncing words differently, and so forth. Here, again, my searches repeatedly encountered advice to hire an actor with the desired voice qualities — or, perhaps, to request a free voice sample at Freesound.
These thoughts suggested that there were degrees of voice anonymization. The most completely anonymous voice alteration would be one that would make my voice unrecognizable even to friends and family members: they would never guess that I was the one speaking the words they were hearing. As a second best, without robotic-level speech alteration, close acquaintances would probably not connect me with a radically altered voice, unless that voice happened to use unusual words or expressions, or made idiosyncratic sounds (e.g., a certain type of laugh), that would immediately remind a friend of my speech style.
Short of that, there seemed to be a level of voice anonymization in which people who don’t know you would be very unlikely to link you with the voice they are hearing. So, for example, if I published a video with the voice of a woman speaking abstractly of love, it seemed Audacity would allow me to make her voice deeper, richer, and slower, to a point where (absent extrinsic clues) it would not occur, even to her friends, that the voice was hers. But apparently it would take AV Voice Changer to make that woman sound like Scarlett Johansson.
In contrast to that scenario, as voice modification became less sophisticated, risks to anonymity would rise. So if I didn’t use AV Voice Changer or Voxal, and also didn’t bother to do anything to the audio clip in Audacity, other than slightly adjust its pitch, I might just produce what the speaker sounds like when s/he is excited or tired, which, of course, would not be anonymous at all.
Testing AV Voice Changer
I downloaded and installed the latest demo version of the AV program, whose full current name seemed to be AVSoft Voice Changer Software Diamond. The installer indicated that I was installing AV Voice Changer Software Diamond 9.0, whereas the website and Softpedia indicated that the latest version was 9.5. This inconsistency was not entirely resolved when I ran the software: the program itself said it was version 9.5, but the pop-up reminding me that I had only 14 days in this trial said version 9.0.
Audio4Fun, the developer, offered some tutorials, but oddly seemed to be trying to prevent me from finding tutorials that I might find helpful or interesting. First, the Tutorials link from the program’s main webpage made me choose between version 8, version 9, or the Diamond Edition — not to mention separate links for Basic and Gold editions. But, as just noted, I thought I was getting Diamond 9, so apparently I was supposed to look for relevant tutorials under both version 9 and Diamond. Moreover, a glance at the version 8 tutorials suggested that some of them (e.g., “Talk like John F. Kennedy with Voice Changer Software Diamond 8.0”) could also be helpful to a user of version 9, even if the specific commands were somewhat different. (There did not appear to be any John F. Kennedy (JFK) tutorial for version 9.) And when I looked at the page for version 9 tutorials, they required me to keep clicking “Load More Tutorials” to see the full text list of maybe 25 tutorials. The page itself did not offer a title clarifying that these were tutorials specific to version 9, but I could see that the URL in my browser’s address bar did refer to “diamond-90.” I never did quite figure out how that list differed from the list of ~180 tutorials pertaining to the Diamond version of the software.
I decided to try the version 9 tutorial titled, Talk Like a Charming Lady. This one had averaged only 2.5 stars from two raters. I guessed that the poor ratings were due to poor outcomes, not to poor teaching methods: I observed that the older JFK tutorial had 4.8 stars from 21 raters. It seemed possible that the difference in results was due to a difference in feasibility: maybe it was easier to make voices sound like JFK than to make them sound like a charming lady. At any rate, both tutorials had five steps, but now I saw that those steps differed, between versions 8 and 9. For the Charming Lady tutorial, I boiled down those five steps as follows:
- Download and unzip the sample voice: a 167KB WAV file containing a one-sentence, ten-word, two-second sample of the lady’s voice.
- In AV Voice Changer Software Diamond 9.5 (referred to here as VCSD), I went into Utilities > Voice Comparator > Add. The Add button was the blue button with a plus (+) symbol on it, towards the top right corner of the window. For that button, mouseover produced a popup reading, “Add new person’s voice.”
Oops! End of tutorial! I got a popup telling me that I couldn’t do this in the trial version. So … OK, what could I try in this demo? Back in the main VCSD screen, I went to File > Open > select a WAV containing my voice. I tried dragging a couple of the controls offered there, but each time I got that same popup: “This feature is available in Full version only.” On the left side of the window, I tried Nickvoices > Male voice to > 20-40 Woman 1. That worked. The window seemed to indicate that, for this particular setting, “Tibre” (apparently a misspelling of “timbre”) would be set to 97% of the original, and Pitch to 125%. I felt the pitch was too high, almost childlike, but I couldn’t adjust it in the trial version, and I was also not allowed to play with the Voice Beautifying settings. The quality of the copy seemed much lower than that of the original. The imported clip was not particularly whiny, but this result certainly was. I tried again with a different Nickvoice: Male voice to > Sweet-Voiced Girl. Here, “Tibre” was 106% and Pitch was 120%. Again, really whiny, but at least a somewhat better sound. Trying again, the 40-60 Woman 3 nickvoice (Tibre 107%, Pitch 107%) sounded whiny and tinny, but also somewhat like a woman I actually knew.
I tried again with a different input clip, this time using the classic Clint Eastwood Do You Feel Lucky? clip from Dirty Harry. The result was not whiny. So there you are: a personal therapeutic experience for me and my voice, courtesy of VCSD: apparently I sound whiny, at least when reconstructed as a female — or at least I did on the day when I dictated that particular piece of audio, or at least that’s what VCSD made of me.
I wondered whether the program produced better results when given better-quality input. The WAV containing my own voice was recorded on a cheap voice recorder at 88kbps, while the Clint Eastwood WAV was extracted from a downloaded movie file and had a bitrate of 705kbps. Not that this really helped. On the Clint clip, the Sweet-Voiced Girl nickvoice sounded more like a boy than a girl, and the 40-60 Woman 3 nickvoice sounded like a slightly older boy. The output quality was still noticeably worse than the input. Quality was also an issue when I ran the Clint Eastwood voice through Nickvoices > For movie maker > Ghost Voice 1. In that case, in addition to the reverb that tried but failed to produce a ghostly impression, I noticed that Tibre was at 78% and Pitch at 84%. This suggested that the Tibre adjustment could be especially useful when departing more radically from the original. That seemed to be confirmed when I recast Clint as Teen Girl 1: in that case, Tibre = 92%. Playing around a bit more, I found that Voice Beautifying > Advanced led to a separate dialog offering adjustments for vowel enhancement, sound quality, and timbre (none working in the trial version) that could perhaps reduce whininess or improve output quality.
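For an uncompressed (PCM) WAV file, the bitrate is simply sample rate × bits per sample × channels, so the two figures above hint at how different the inputs were. The sketch below uses Python’s standard `wave` module; the helper names are hypothetical. A bitrate of 705kbps is what 44,100 Hz, 16-bit, mono audio would produce. The 88kbps figure would fit something like 11,025 Hz, 8-bit, mono, but that is only a guess: the actual formats of the two files were not recorded here, and if the voice recorder used a compressed format, this formula would not apply (and `wave` would refuse to open the file).

```python
import wave


def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def wav_bitrate(path_or_file):
    """Read a PCM WAV header and return its bitrate in bits per second."""
    with wave.open(path_or_file, "rb") as wav:
        return pcm_bitrate(
            wav.getframerate(),
            wav.getsampwidth() * 8,  # getsampwidth() is in bytes
            wav.getnchannels(),
        )


# Two combinations consistent with the figures quoted above (assumed, not measured):
print(pcm_bitrate(44100, 16, 1) / 1000)  # 705.6 kbps
print(pcm_bitrate(11025, 8, 1) / 1000)   # 88.2 kbps
```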
It occurred to me that maybe the quality problems were just due to the interface — that maybe an output file would sound better. So I tried to save a Clint Eastwood version to file. But once again, unfortunately, the trial program said that I would have to buy the full version to do that.
Based on this review, my feeling was that I would buy this program if I were an audio professional, because there probably would be instances where some of its tools would be useful, and because I’d probably be in the habit of acquiring various tools to play with. I really liked its interface, particularly the ability to move various sliders and see a visual representation of the effect. But at this point, I couldn’t justify spending $100 on a piece of software that I would rarely need and that might not even work very well.
Uninstallation was an unpleasant experience. I had to reboot to delete the installer. Worse, it appeared that AVSoft Voice Changer had screwed up the sound on my system. Audio would not play. I ran Control Panel > Troubleshooting > Hardware and Sound > Troubleshoot audio playback. This gave me a result: “Problem detected with Realtek High Definition Audio.” I clicked View Details. It said, “A driver (service) for this device has been disabled. An alternate driver may be providing this functionality.” An attempt to Update Driver from that location did not help. I went into Control Panel > Device Manager. It indicated problems with all of my Sound, Video and Game Controllers. First on the list was Avsoft Virtual Audio Device. That was new. Its name suggested a connection with AVSoft Voice Changer. I tried right-click > Uninstall. That worked, but the other items still had problems. Instead of making further attempts in the relatively complex steps suggested to fix the troubleshooter’s error message, I tried a Windows 7 System Restore to roll the machine back to a prior state. That worked. Now — after devoting the better part of a half-hour to this unnecessary difficulty — my system’s sound was functioning normally.
I went back to the Voxal webpage, and downloaded and installed its current version (1.38). To use it for free, the installer required me to certify that I was using the program only for home use, as distinct from business purposes. The Technical Support webpage had links for, among other things, the discussion forum and an explanation of how to use Voxal on an existing file. I don’t know why the Tech Support page didn’t seem to provide a link to the Help webpage; I found it by clicking on the Help icon in the upper right corner of the program window.
For working with an existing file, the instructions were to select the preset voice that you want to apply, then click the Process button in the toolbar and choose the audio file to be processed. The problem: there was no Process button. Eventually I figured out that I needed to be on the Tools tab, not the Voice tab. In Tools, I saw a Process File button. That prompted me for the input file and the desired output filename. The Preview File button offered a slightly easier alternative for testing purposes.
The preset voices offered in Voxal all used combinations of the program’s audio effects. Those effects, visible in the Voice tab > Edit option, were as follows: 3-band equalizer (much less flexible, by the way, than most equalizers), low pass, high pass, wah-wah, vibrato, tremolo, reverb, flanger, echo, distortion, chorus, pitch (and voice pitch) shifter, and amplify. Voxal permitted addition of other voices, through its Voice tab > New option. To configure a new voice or revise an existing one, I would select the desired voice and then choose Edit. The resulting window would allow me to add, configure, or remove the effects comprising that voice. For instance, the Cave voice used echo, reverb, and then echo again, and each of those could be individually adjusted. For example, the first echo effect in the Cave voice could be set to 40% gain, 400ms duration, while the second one might be set to 27% gain, 150ms.
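To make the idea of a voice as an ordered chain of effects concrete, here is a minimal Python sketch. It is not Voxal’s implementation: it assumes a single-repeat echo (one delayed, quieter copy added to the signal), it leaves out the reverb step entirely, and the names `add_echo`, `apply_chain`, and `CAVE_LIKE_CHAIN` are hypothetical. Only the gain and delay values come from the Cave example above.

```python
def add_echo(samples, sample_rate, gain, delay_ms):
    """Add one delayed copy of the signal, scaled by gain (0.40 = 40%)."""
    delay = int(round(sample_rate * delay_ms / 1000.0))
    out = list(samples) + [0.0] * delay  # leave room for the echo tail
    for i, s in enumerate(samples):
        out[i + delay] += gain * s
    return out


def apply_chain(samples, sample_rate, chain):
    """Run the samples through each (effect, settings) pair in order."""
    for effect, settings in chain:
        samples = effect(samples, sample_rate, **settings)
    return samples


# Echo, then echo again, with the settings quoted for the Cave voice.
# The reverb that Voxal places between them is omitted from this sketch.
CAVE_LIKE_CHAIN = [
    (add_echo, {"gain": 0.40, "delay_ms": 400}),
    (add_echo, {"gain": 0.27, "delay_ms": 150}),
]
```

Because the second echo operates on the output of the first, it also repeats the first echo, which is one reason the order of effects in a chain matters.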
I was not confident that the preset voices in Voxal were well constructed. For example, I tried Voxal’s Female 3 voice on the sample file that I had used with VCSD (above). Here, it didn’t sound whiny, and I felt the output quality was better than what I’d seen in VCSD. But the output was too high-pitched, probably because the Female 3 voice — indeed, all of Voxal’s preset female voices — used pitch shifts of 30% to 50%, whereas my understanding (above) was that the appropriate shift would be more like 19%.
I was also not sure that all of the effects included in Voxal’s preset voices were helpful. For instance, when I tried their Big Guy preset, the result was muddy, presumably because it depended on Voxal’s voice pitch shift effect. That is, I suspected the voice pitch shift effect may have introduced more distortion than it was worth. Similarly, what I got from their Child preset didn’t sound like a child at all.
I tried those same voices on the Clint Eastwood excerpt. As with the AV test (above), this better-quality clip produced better-quality results. That said, the results were still not necessarily what I hoped for. The Big Guy voice was not bad, and was certainly much bigger than the Child and Female 3 voices; but the latter, unfortunately, still didn’t sound much like a child or a female.
These programs seemed to assume that a certain standard sequence of alterations would consistently produce acceptable results. My experience suggested that this assumption could be tenuous. What would work for one audio file might not work for another. The typical user might have to engage in much trial and error, using Voxal to modify a specific voice (or type of voice) of particular interest, in order to find settings that would contribute to a decent-sounding outcome.
The question, in that case, was whether it would make more sense to use Voxal than Audacity. Audacity seemed to have some advantages. Audacity was open-source, cross-platform software, developed by dozens of people and downloaded more than 100 million times. It would not be surprising if its effects would be of higher quality than those built into Voxal. Also, Audacity would allow the user to modify just a portion of a file, rather than the whole thing. Most of the effects available in Voxal were available in Audacity’s Effect menu; and it appeared that many other effects were also feasible, either by adding third-party plugins to Audacity or by combining existing effects. Audacity would allow testing of a single effect (e.g., echo), so as to generate a clearer understanding of what each effect was contributing, rather than insisting on applying multiple effects (e.g., echo, then reverb, then echo). When a certain sequence of changes did become established, Audacity did have a chain option for batch processing and effect automation — and at first glance, it seemed that, with relatively minor complexity, this option would be more powerful than anything in Voxal. In Audacity, the user could see the waveform, with its potential for clarifying what was happening, and what needed to happen.
The primary advantages of Voxal over Audacity seemed to be that Voxal could combine multiple effects into a single voice, and would remember those settings and display them clearly. It took only a couple of clicks to see a graphic indication of the settings being used in each effect comprising a Voxal voice. Then again, a user could certainly keep notes on the settings that s/he found most effective in Audacity, and those notes could contain more detail than would appear in the Voxal screen.
I had been doing a few basic kinds of voice editing for many years, mostly with Cool Edit 2000, and I had occasionally used Audacity, so I was not as intimidated by its interface as some newbies might be. But I didn’t know much about using Audacity for these particular purposes. (I was currently using Audacity 2.1.1.)
Voxal indicated that its Female voices all began with a pitch shift, so that’s where I started in Audacity. I opened a copy of the voice file I had used previously; I double-clicked on the waveform, to select the whole thing; and then I went to Audacity’s menu: Effect > Change Pitch > Percent Change > 19.0 > Preview. The preview was too brief (six seconds, by default), so I changed it to 15 seconds by going into Edit > Preferences > Playback. Now I was having second thoughts about the advice (above) suggesting a 19% bump in pitch: that didn’t seem to be enough. I tried various settings from 30% to 50%, and settled for now on 35%.
So far, my changed voice just sounded like a sped-up version of me. I wasn’t sure how to proceed, so I did a general search. A YouTube video suggested starting with an audio clip in which the male speaker made a point of sounding like a female. Too late for that — the recording was already done. But in the video’s illustration, I could see how much of a difference that could make. Others agreed: to sound like a woman, one should get into the proper state of mind (be “close and breathy” with the microphone) before making the recording, and one should use the right voice. The premise, there, was that most people have several voices. The video’s next suggestion was to remove background noise (if any) via Effect > Noise Reduction > select a piece with nothing but the background noise to be removed > Get Noise Profile > Step 2 > leave defaults unless you have a better idea > OK. Then the video recommended what I had already done: Effect > Change Pitch.
As a next step, one of Voxal’s Female voices (i.e., Female 2) then applied an equalizer, with gains of -1 dB at 100 Hz, 1 dB at 900 Hz, and -30 dB at 2500 Hz. To try that, I went into Effect > Equalization. It seemed Voxal might be in error, on the last of those three values. Instead of reducing the high tones, one source sensibly recommended using equalization to emphasize the high frequencies and reduce the lows.
According to another source located through variations on a plugin-specific search, there is a rule of thumb that says you can change a voice by as much as three musical notes (e.g., C to F). Beyond that, it starts sounding weird. Audacity’s Effect > Change Pitch did offer a Pitch option by which I could specify a change from the estimated starting pitch (D#/Eb, in my case) to some other. As it turned out, my experimental change of 35% roughly matched that recommended limit: three notes up, to G#/Ab (5.2 half-steps, against the five half-steps from C to F). But another source stayed closer to the earlier view, putting the approximate limit at 18%, beyond which the voice would lose realism.
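The percentages and half-steps in this discussion can be converted into one another, since each half-step multiplies frequency by the twelfth root of 2 (assuming equal temperament, the usual convention). A small Python sketch, with hypothetical function names:

```python
import math


def percent_to_semitones(percent_change):
    """Convert a pitch change in percent (e.g. 35 for +35%) to half-steps."""
    ratio = 1.0 + percent_change / 100.0
    if ratio <= 0:
        raise ValueError("percent change must be greater than -100")
    return 12.0 * math.log2(ratio)


def semitones_to_percent(semitones):
    """Convert a pitch change in half-steps to a percent change."""
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0


print(round(percent_to_semitones(35), 1))  # 5.2
print(round(percent_to_semitones(19), 1))  # 3.0
print(round(semitones_to_percent(5), 1))   # 33.5
```

By this arithmetic, the 19% figure works out to almost exactly three half-steps, whereas C to F is five half-steps (about 33.5%). It is possible, though this is only speculation, that the two rules of thumb differ because one counts notes of the scale and the other counts half-steps.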
At a certain point in this inquiry, I began to see why I might have been hoping for too much from Voxal and AV Voice Changer. With or without misleading advertising, there’s a difference between what we want and what is technically feasible at present.
Trying Audacity with Plug-Ins
I encountered some recommendations for specialty software (i.e., Roland VS-880EX, UltraPitch) that reportedly did a superior job with voice gender bending. Someone else recommended several Audacity plugins, for which I searched: Autotune, KeroVee, and RoVee. Another (apparently older) possibility: AutoTalent. And there were others. In this section, I look into a couple of those options.
For Auto-Tune, WikiHow said I needed Audacity VST Enabler and GSnap. To take care of that, I downloaded and installed vst-bridge-1.1.exe and GSnapWin64.zip (even though WikiHow specified the 32-bit version). The latter gave me Gsnap.dll and GVSTLicense.txt, which I copied into C:\Program Files (x86)\Audacity\Plug-ins. I had Audacity running at this point, and apparently it needed to restart before I would see Effect > GSnap. Then, according to WikiHow, I was supposed to see a dialog to “Register Effects,” which would allow me to tell Audacity to use GSnap. That didn’t happen. Instead, I went into Effect > Manage > Select GSnap (labeled as being in “New” state). But then I clicked on the State column heading, to sort that dialog by State. Doing so showed me that, actually, there were a bazillion New effects, possibly added by VST Enabler. I selected them all and clicked Enable > OK. Now the Effect menu pick gave me a long list of all those additional effects. Sadly, GSnap wasn’t among them. I killed Audacity and tried again with the 32-bit version, like I was supposed to do in the first place > Effect > Manage. That gave me just four new plug-ins, including GSnap and vst-bridge. I enabled them all. Now GSnap was on the list. So, yes, it appeared I did have to use the 32-bit version.
WikiHow said that Auto-Tune would work only with singing, not with speech, because it needed a melody to detect and adapt to. This wasn’t the original mission, but I went with it. I found a recording of me singing. It needed noise reduction, so I did that first. I also went into Effect > Amplify to reduce its volume a bit, so as to give Auto-Tune room to work. Then I went to Effect > GSnap. Next, I was supposed to click the Select a Scale button. I hoped that I was on-key with the original artist whom I was imitating, and therefore specified the scale that I got from some website, when I searched for the scale used in that song. Then WikiHow said I should choose these settings: Minimum and Maximum frequencies (40Hz and 2000Hz, respectively), Gate -80dB, Speed 1, Attack and Release (1ms). I left everything else at default values and clicked Apply and then (still in that GSnap dialog) the green Play arrow at lower left. The result was awful. Auto-Tune was all over the place. Apparently I had fooled it: it believed I was singing something that I wasn’t. Well, I took it as a compliment.
I did try GSnap with my plain speech file, but I couldn’t see that it was doing anything. It appeared, in short, that the Auto-Tune suggestion was a red herring. (Sigh.) At least I had learned something about adding plugins to Audacity, and had also acquired a bit of humility. All for free.
KeroVee and RoVee
KeroVee’s Japanese webpage, run through Google Translate, seemed to say that its sister product, RoVee, was the simpler choice for merely changing voice quality, as distinct from the videos (I found two) demonstrating the use of KeroVee for autotuning songs. Participants in one discussion seemed to say that KeroVee added pitch correction (which I took to mean auto-tuning) to RoVee, and that the latter was better for “pitchbending.” They also repeated the common observation that it was harder to feminize a male voice than the other way around.
I downloaded RoVee 1.21. Its ReadMe said, “RoVee . . . can convert voices to like male, female, toys or robotic.” The advice was, in effect, to add RoVee.dll (included in the download) to C:\Program Files (x86)\Audacity\Plug-ins — and then, as before, to go into Audacity > Effect > Manage > enable the new plugins.
I chose Effect > RoVee to open it. A video demo from 2010, and another from 2012, seemed to say that previous versions of RoVee had offered a choice of effects (e.g., MaleToFemale) that were not available in the current version. Then again, as I viewed the available controls, I saw that the Formant knob could be turned toward male or female poles — or, more effectively, I could type the desired values into the boxes under the knobs. (The best way to use the mouse with these controls was to click on the control and then drag vertically, up or down — not to try going around in a circle.) The Mix knob at the right controlled the combination of previous voice (“Dry”) and replacement voice (“Effect”). In other words, I could hear just me, just the changed voice, or various ugly mixes of both. There did not seem to be much reason to use any value other than 100 on the Mix setting.
The quality in the video demos was much better than the quality I got with my own audio clips. I tried RoVee with the Clint Eastwood clip. I found that the maximum possible values were Formant = 100, Pitch = 12, Mix = 100. But I did not find any values that would make Clint sound like anything less male than a teenage boy. On the positive side, I did find some settings that made him sound like a duck.
This was the point at which I finally got around to looking up the meanings of “timbre” and “formant,” both of which kept cropping up during this exploration. Wikipedia described timbre as essentially everything, other than pitch and loudness, that makes one sound different from another. So, for example, a guitar and a piano hitting the same note at the same volume would sound different because they would have different timbres. Formant, by contrast, seemed to be a much more technical term. It tentatively appeared that it might not even make sense to ask how timbre is different from formant. Suffice it to say that the effects achieved by various formant adjustments did not clarify what formant was and how it might be relevant.
At this point, I paused to regroup and reflect upon what I had learned so far.
My first interest, driving this inquiry, was in achieving voice anonymization. It seemed that could mean different things under different circumstances. Depending on the situation, a good general anonymization objective might be to make it very unlikely that people who don’t know you would link you with the voice they are hearing. So, on one hand, people who know you well, and know that you fiddle with voice modification, might suspect it’s you as soon as they hear a recording in which the speaker uses an unusual term, laughs in your own unique way, or otherwise sends a hint that software is unlikely to conceal. But, on the other hand, this general concept of anonymization would seem adequate, if you’re just looking to modify your voice for purposes of introducing different products in a series of corporate video voiceovers.
In pursuit of anonymization, I was particularly interested in what turned out to be the hard case of making a man sound like a woman. Apparently it would have been easier to make a woman sound like a man. Within the tools I had seen so far, the nature of the original recording was crucial: apparently both the quality and the content of the recording mattered. I had mediocre luck, attempting to convert my own voice to that of a woman, but very little success when attempting the same with the voice of Dirty Harry. It seemed the results could have been much better if Clint and I had been attempting to sound like a woman in the first place.
Within what I had reviewed, there did not seem to be a compelling reason to invest time and money in a special-purpose program like AV Voice Changer Software or Voxal. It seemed I would save money, learn more, have more options and more control, and achieve an outcome of comparable quality, by just using Audacity, perhaps with the aid of plugins, to experiment until I found the best combination of effects. With honorable mention to Auto-Tune and KeroVee plugins for music, the RoVee plugin provided a relatively simple and useful way of tweaking voices.
There was other software that I hadn’t examined. Contrary to what I had just said, I realized that some of it could prove useful. For instance, some people recommended MorphVOX Pro, about which I may have been mistaken: rather than being limited to streaming audio, it seemed it might be able to read input files and produce output files after all. For the record, I also saw one or two recommendations for Melodyne ($849).
Experience so far suggested that pitch adjustments could only change by somewhere between 18% and 35% before the voice would sound artificial. Hence, it would seem easier to anonymize a man’s voice by making him sound like someone similar to him. The advice might be, don’t try to adjust pitch beyond that acceptable range; instead, change other aspects of his voice.
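To put that range in musical terms, a percent change in frequency can be converted to semitones. This is a small sketch of the standard equal-temperament formula, assuming the 18% and 35% figures refer to percent change in frequency (as in Audacity's Change Pitch dialog); the function name is just for illustration.

```python
import math


def percent_to_semitones(percent_change):
    """Convert a percent change in frequency to equal-tempered semitones."""
    ratio = 1.0 + percent_change / 100.0
    return 12.0 * math.log2(ratio)


for pct in (18, 35):
    print(f"+{pct}% is about {percent_to_semitones(pct):.1f} semitones up")
```

On that reading, the usable upward range works out to roughly three to five semitones, well short of a full octave (which would be +100%).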
To do that, I might find value in some of the other adjustment possibilities encountered so far. Those included equalization and other standard audio effects in Audacity (e.g., reverb, echo, chorus); the “Timbre” adjustment in AV Voice Changer Software; and the Formant adjustment in RoVee. Such tweaks had not been super-helpful in changing a man’s voice to sound like a woman’s, but they might prove sufficient to convert a man’s voice into that of an unrecognizably different man. Learning about these things might also help me with the other objective stated at the outset: to make my own voice sound better.
Trying MorphVOX Pro
I downloaded and installed MorphVOX Pro 4.4.65 ($40 minus coupon, if any, sometimes offered after using and then closing the trial program; 3.9 stars from 39 raters at Softpedia) to explore its offer of a fully functional seven-day trial. It started with an opportunity for the MorphVOX Voice Doctor to optimize MorphVOX for my voice. I declined offers to sign up for something and to set all programs on my computer to use MorphVOX. Other than that, I basically went with the installation defaults. I had to tinker with Microphone Volume and selection and then, ultimately, give up: for some reason, Audacity found both of my two different headset microphones, while MorphVOX couldn’t work with either.
So now we faced the question of whether MorphVOX could import and work with a previously recorded audio file. The answer was: yes! On the menu, I went into MorphVOX > Morph a File. I selected a source file, containing a male voice downloaded from Freesound, and designated an output folder and filename. But, oops, I was supposed to designate what I wanted to do to the file, before taking those steps; now I had an output file that I didn’t want. In other words, it seemed there was no option to experiment and preview. I just had to choose my settings blind, and then produce the result.
So, OK, I tried that. I chose the Man alias, in the left panel of MorphVOX, and decided to make this particular voice more stereotypically male. I lowered Pitch Shift and Timbre Shift, and chose the Woman to Man preset on the equalizer. It worked, in some sense of the word: I had myself a real gorilla — what I would later see referred to as a “mud monster” — though unfortunately the new voice was fuzzy. But, clearly, anonymization was a success: nobody would think the original voice could have come from this creature.
The thing is, this output was just as rapid as the input, whereas — for this gorilla’s voice — it should be slower. I tried tinkering with the downloaded voice in Audacity. I started with Effect > Change Pitch > -2 semitones (i.e., an 11% drop in frequency). I could have gone more, but it made the guy unusually deep-voiced. Contrary to anonymization, that sort of voice might attract attention. Next, after some experimentation, I tried Effect > Change Speed > -10%. That, again, made the voice very deep. I undid the Change Pitch and tried Change Speed -15% by itself. That was pretty good, but still maybe too deep. I tried Change Speed -20% > Change Pitch 15%. I liked that: a bit deeper, a bit more laid-back than the original guy — but, alas, probably not different enough to count as anonymized.
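The arithmetic behind those Audacity settings can be checked with a short sketch. It assumes that Change Speed scales pitch and tempo by the same ratio (it resamples the audio), while Change Pitch alters pitch only; the function names are hypothetical.

```python
def semitones_to_percent(semitones):
    """Percent change in frequency for a shift of the given semitones."""
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0


def combined_pitch_percent(speed_percent, pitch_percent):
    """Net pitch change after Change Speed followed by Change Pitch."""
    ratio = (1.0 + speed_percent / 100.0) * (1.0 + pitch_percent / 100.0)
    return (ratio - 1.0) * 100.0


def duration_percent(speed_percent):
    """Change in clip length caused by a speed change alone."""
    return (1.0 / (1.0 + speed_percent / 100.0) - 1.0) * 100.0


print(f"-2 semitones: {semitones_to_percent(-2):.1f}% in frequency")
print(f"Speed -20% then Pitch +15%: {combined_pitch_percent(-20, 15):.1f}% net pitch")
print(f"Speed -20%: clip runs {duration_percent(-20):.0f}% longer")
```

Under those assumptions, the final combination leaves the voice about 8% lower in pitch and 25% longer in duration, which fits the "a bit deeper, a bit more laid-back" impression.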
I saved the Audacity result and ran it through MorphVOX, with everything turned off except the Timbre: Shift = 0.25, Strength = 25. I really couldn’t hear that MorphVOX had done anything. I tried again: Shift = 0.50, Strength = 75. Well, that was a different voice, but with lots of fuzz (i.e., distortion). I tried another version, going the other way: Shift = -0.75, Strength = 50. That left me with the original voice plus a couple of audible defects. I concluded that, in MorphVOX as in AV Voice Changer, a “Timbre” setting basically just meant making the voice fuzzier.
So, OK, what if I left Timbre out of it — maybe run the gorilla from Audacity through MorphVOX’s gender bender? I clicked Reset on MorphVOX’s Tweak Voice area (i.e., where the Timbre settings were), turned on Morph and Listen in its Voice Selection area, and chose its Woman alias. Oh, now I understood: when I had made those changes to Timbre, I had been changing its Man voice. I accepted the option to discard those changes. Now, in Woman voice, the Pitch and Timbre settings were all elevated by default. And, well, the results were a miracle. They had turned the gorilla into a real woman. I mean, she sounded pretty ragged around the edges, definitely had a few miles on her, but was surely still capable of screwing up some poor bastard’s world. I tried again, elevating Pitch and Timbre Shift a bit more, and thus made her a tad more feminine, without noticeably increasing distortion.
At this point, I started to prepare a video — should have thought of it earlier — to illustrate some of these changes.
I tried again, using those same elevated Woman settings, but this time starting with the original downloaded guy voice, instead of the Audacity gorilla. No, too much: now she was a fricking chipmunk. I tried again with the original guy, setting both the Pitch and Timbre Shift at 0.50. (Timbre Strength was 100 by default through all of these experiments.) No, that wasn’t enough. Now she was androgynous: sometimes a woman, sometimes not. I played around a bit more and concluded that the original guy was just inferior to the Audacity gorilla, as both a man and a woman. In other words, somehow, slowing and deepening the audio in Audacity had made the male version more likeable, and had also laid the foundation for a seemingly more genuine female via MorphVOX.
Cleaning Up a Morphed Voice
MorphVOX, like other programs examined above, added distortion especially when I used its Timbre feature. If possible, I wanted to get rid of that, along with anything else I could improve. The distortion wasn’t a matter of noise. As I could see from the waveform produced by MorphVOX, there was no appreciable background noise.
I had seen (above) that Voxal applied two different effects to its female voices — equalize or high pass filter — after shifting pitch upwards. Those were the two effects mentioned, in an Audacity discussion, as ways of reducing distortion. That discussion also warned, however, that there were limits to what a cleanup effort could achieve, once the distortion was already baked into the waveform. Voxal’s equalization had sought to suppress excessively low and high frequencies, and enhance those in the normal female voice range.
It seemed that equalization could also achieve filtering, so I focused on equalization. I went into Audacity’s Effect > Equalization > Select Curve > Bass Cut. That was too much: it cut out everything below 200Hz. I canceled out and started over with Draw Curves. I dragged the blue line (originally a horizontal line at 0dB) downwards; then I clicked in the middle of that line and dragged upwards. That gave me a bend, which I was able to modify so that it began cutting aggressively at around 150Hz. Clicking in the middle of a blue line segment would add another point of manipulation.
Unfortunately, Audacity did not display the exact values that I was modifying. I tried Graphic EQ, but that gave me a dozen data points along the curve, making it difficult to draw easily. It was also not possible to zoom in, to see a particular segment of the line in more horizontal detail, though there were helpful vertical sliders on the left side of the graph. It might have been possible to run the audio through another equalizer and save or record the output, but I didn’t explore that.
I kept playing with it, using the Preview option to hear the effects of my line-drawing. Cutting out too much bass made the female voice tinny. Ford said most vocal energy is in the range of 125 Hz to 6 kHz. Lathrop said I could cut below 50 to 100Hz. Benediktsson said that I could safely cut below 75Hz (except maybe for bass singers like Barry White), probably even 100Hz, and higher still for a female, to eliminate low-end rumble. In a very helpful BehindTheMixer post, Huff said the vocal range for singers varied from 82-329Hz for bass up to 261-1046Hz for soprano, and thus observed that his high pass filter (i.e., eliminating low frequencies) could be anywhere between 80 and 200 Hz. Wikipedia suggested that, “In telephony, the usable voice frequency band ranges from approximately 300 Hz to 3400 Hz.” In this case, I decided that using the equalizer to cut out audio below approximately 150Hz would make the voice a little more certainly female, without making it less appealing.
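For a sense of what a 150Hz cut does to low and high tones, here is a minimal sketch of a first-order high-pass filter in plain Python. This is not how Audacity's Equalization effect works internally; it is a much gentler textbook filter (about 6 dB per octave), applied to synthetic sine tones rather than to the actual clip, and all names are illustrative.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate):
    """First-order (RC-style) high-pass filter over a list of floats."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        # Pass changes in the input, let steady (low-frequency) content decay.
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sine(freq_hz, sample_rate, seconds=0.5):
    """Synthetic test tone at full scale."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


RATE = 44100
for freq in (50, 150, 1000):
    tone = sine(freq, RATE)
    kept = rms(high_pass(tone, 150.0, RATE)) / rms(tone)
    print(f"{freq} Hz tone keeps about {kept:.2f} of its level")
```

Rumble at 50Hz is cut substantially while a 1kHz tone passes almost untouched. A curve drawn in the equalizer can be far steeper than this, which is part of why an aggressive bass cut can make a voice sound tinny.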
On the other end of the spectrum, various sources indicated that singers could go to 14kHz and higher. But there were also some potential risks in higher frequencies. Benediktsson said that, if the voice had excessive sibilance (i.e., high-energy S sounds), “you can try cutting around 7 KHz. It will make the s’s less pronounced . . . . Better yet, [use] a compressor that only compresses the ‘s’ area. . . . Male sibilance is typically 3-7k Hz and female sibilance is typically 5-9k Hz so there needs to be some experimentation.” Going in just the opposite direction, as if to emphasize the subjectivity and case-by-case nature of the task, TalkinMusic said, “If you feel like you need to add more clarity and presence then make a boost in the 5kHz to 9kHz range.” Huff agreed that, depending on the situation, the frequencies above 1kHz could feature both the unwanted (i.e., sibilance, harshness) and the desirable (i.e., presence, clarity, sparkle, breathiness). For speech, I saw that Voxal’s female voices oddly applied high pass filters at 120 Hz and also 8kHz (!), and also used its equalizer to filter sounds above 2.5kHz.
I thought it might be helpful if I could see what we were talking about, in this particular recording. The Audacity manual confirmed that Audacity could display the spectral (what it called the Spectrogram) view of the recording. To do that, the instructions had me go into the Track Control Panel (i.e., the gray box at the left end of the waveform) > click on the inverted triangle, next to the truncated filename (compare top bar, “06 Audacity”), to open the dropdown menu > Spectrogram log(f). The documentation explained that the colors signified the strength of the tone, ranging from white (strongest) through light and dark red (progressively weaker) to light blue and finally dark blue (weakest). So in this case, the strongest signal was in the range of, roughly, 100Hz to 1000Hz. I could zoom in by putting the mouse on the vertical axis (i.e., where the numbers were) and left-clicking, or by right-dragging on a section of the axis to indicate what span I wished to see; and I could zoom out by right-clicking on the axis.
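A numeric counterpart to reading colors off the spectrogram is to measure how much energy a signal has at a few chosen frequencies. The sketch below uses the Goertzel algorithm on a synthetic signal (a strong 200Hz tone plus a weak 3kHz tone); the signal, sample rate, and probe frequencies are all assumptions for illustration, not measurements from the recording discussed here.

```python
import math


def goertzel_power(samples, freq_hz, sample_rate):
    """Relative power at one frequency, via the Goertzel algorithm."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


RATE = 8000  # hypothetical sample rate, one second of audio
signal = [
    math.sin(2.0 * math.pi * 200 * i / RATE)
    + 0.2 * math.sin(2.0 * math.pi * 3000 * i / RATE)
    for i in range(RATE)
]

probes = [100, 200, 500, 1000, 3000]
powers = {f: goertzel_power(signal, f, RATE) for f in probes}
strongest = max(powers, key=powers.get)
print("strongest probe frequency:", strongest, "Hz")
```

In spectrogram terms, the strongest probe would be the one drawn in the brightest color.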
That actually wasn’t terribly helpful. But with additional listening and tinkering, ultimately I decided not to impose a low pass filter (i.e., not to trim out everything above a certain frequency). For this particular clip, there didn’t seem to be a level at which such a filter would produce a better result. To the contrary, even at relatively high frequencies, the filtering reduced the voice’s femininity.
What I really needed, instead, was something that would trim the fuzz that MorphVOX had added when changing the voice’s timbre. One comment suggested that, if I had merely attempted a pitch shift in Audacity, I might have gotten less distortion by using Audacity’s Effect > Sliding Time Scale/Pitch Shift feature. Another discussion pointed out that, if what I wanted to remove had been just a single constant noise (e.g., a buzz), I could have zoomed in on the spectral display to identify it, and then used a filter to remove it. But as illustrated in a different exchange, it was possible that the distortion would be “too close to the voice tones to separate them.” That, unfortunately, seemed to be the case here. At some point, if I found myself pursuing that issue further, there was the option of posting a question, to see if anyone else had suggestions.
Aside from low- and high-pass filters, there was also the option of using the equalizer to boost midrange frequencies in this female voice. Huff offered the adage, “cut narrow and boost wide.” Benediktsson and TalkinMusic seemed to say that vocals could need reduction in the 100-350Hz range, if they were too boomy, but could need boosting if the voice was too thin; and that specific cuts in the 350-500Hz range could be helpful.
By this point, I was seeing that there were many possible tweaks to make voices sound better. For example, I saw repeated references to compression, which (from my Cool Edit work) I understood as the technique by which softer sounds could be enhanced — the technique used to make DJ voices sound so powerful. As with midrange filtering, I tinkered with compression briefly, but didn’t invest the time needed to learn more about Audacity, and to play with this particular voice.
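The basic idea of compression, as understood above, can be sketched in a few lines: peaks above a threshold are scaled down, and then everything is turned back up, so that softer sounds end up louder relative to the peaks. This is a deliberately simplified illustration that works sample by sample on linear amplitude. A real compressor (including Audacity's) follows a smoothed level with attack and release times, so the parameter names and values here are assumptions only.

```python
import math


def compress(samples, threshold=0.5, ratio=4.0, makeup_gain=1.0):
    """Reduce the part of each sample above the threshold, then apply gain."""
    out = []
    for x in samples:
        level = abs(x)
        if level > threshold:
            # Only the excess over the threshold is divided by the ratio.
            level = threshold + (level - threshold) / ratio
        out.append(math.copysign(level, x) * makeup_gain)
    return out


quiet_and_loud = [0.1, -0.1, 1.0, -1.0]  # hypothetical sample values
squeezed = compress(quiet_and_loud, threshold=0.5, ratio=4.0, makeup_gain=1.6)
print([round(s, 3) for s in squeezed])
```

With these settings the loud samples come back to about their original peak level, while the quiet samples rise from 0.1 to about 0.16. That narrowing of the gap between soft and loud is what gives a compressed voice its fuller sound.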
Based on my limited searching, reading, and experimenting, at this point the equalizer seemed to be the primary tool for enhancing the femininity of the male-to-female voice transition achieved through MorphVOX. There was no immediately obvious solution to the distortion introduced by that transition.
The tentative advice seemed to be, start with a good-quality recording (preferably involving a male actor who tries to sound like a woman, if a female voice is the desired outcome); use Audacity rather than a tool like MorphVOX or RoVee whenever possible (e.g., changing pitch and speed, by themselves, might yield a satisfactory female-to-male solution); and then, after using something like MorphVOX, explore equalization and other tools in Audacity to clean up and enhance the result.
It was now clear to me that improving voices could be a complex task, requiring much more familiarity with Audacity than I presently had, or would soon develop. Learning how to improve voices — be it my own or the female voice produced through this post’s efforts — would remain a challenge for another day.