<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="GENERATOR" CONTENT="Mozilla/4.07 [en] (X11; I; Linux 2.2.0-pre8 i686) [Netscape]">
<META NAME="Author" CONTENT="Jean-Marc Valin">
<TITLE>SpeechInput Description</TITLE>
</HEAD>
<BODY BACKGROUND="BACK8.png">
<CENTER><B><U><FONT SIZE=+4>SpeechInput Description</FONT></U></B></CENTER>
<OL>
<LI>
<A HREF="#General">General</A></LI>
<LI>
<A HREF="#TechIssues">Technical Issues</A></LI>
<LI>
<A HREF="#Sound Recording">Sound Recording</A></LI>
<LI>
<A HREF="#Front-end">Front-end</A></LI>
<LI>
<A HREF="#Recognition">Speech Recognition Unit</A></LI>
<LI>
<A HREF="#Collection">Data Collection</A></LI>
<LI>
<A HREF="#Integration">System Integration</A></LI>
<LI>
<A HREF="#TODO">TODO</A></LI>
</OL>
<H1>
<HR WIDTH="100%"><A NAME="General"></A>1. General</H1>
<H2>
1.1 Target</H2>
<UL>
<LI>
Speaker-independent, HMM-based LVCSR (real-time) recognition system that
accepts different kinds of language models, i.e.</LI>
</UL>
<UL>
<UL>
<LI>
Grammars for command utterances (desktop control)</LI>
<LI>
N-gram models for dictation</LI>
</UL>
</UL>
<H1>
<HR WIDTH="100%"><A NAME="TechIssues"></A>2. Technical Issues</H1>
<H2>
2.1 Version Management</H2>
<UL>
<LI>
CVS repository</LI>
</UL>
<H2>
2.2 Languages, Libraries, etc.</H2>
<UL>
<LI>
C++ (new draft standard) (WWW: <A HREF="http://www.cygnus.com/misc/wp/">http://www.cygnus.com/misc/wp/</A>)</LI>
<LI>
Standard Template Library (STL) (WWW: <A HREF="http://dir.yahoo.com/Computers_and_Internet/Programming_Languages/C_and_C__/C__/Class_Libraries/Standard_Template_Library__STL_/">http://dir.yahoo.com/Computers_and_Internet/Programming_Languages/C_and_C++/C++/Class_Libraries/Standard_Template_Library__STL_/</A>)</LI>
</UL>
<H2>
2.3 Programming Style/Documentation</H2>
<UL>
<LI>
<A HREF="http://www.kde.org/">KDOC</A>-like documentation</LI>
</UL>
<H2>
2.4 Configuration Files</H2>
<UL>
<LI>
"Object-Oriented" configuration files.</LI>
</UL>
<H2>
2.5 Licence/Copyright</H2>
<UL>
<LI>
Program code: GPL</LI>
<LI>
Produced data files (acoustic models, dictionaries, language models, etc.):
also GPL?</LI>
<LI>
Collected speech data must also be GPL?</LI>
<LI>
Only dictionaries and grammars can be supplied by applications as non-GPL?</LI>
</UL>
<H1>
<HR WIDTH="100%"><A NAME="Sound Recording"></A>3. Sound Recording</H1>
<H2>
3.1 Audio Sources</H2>
<UL>
<LI>
Audio data files (HMM model training)</LI>
<LI>
Audio server - microphone (applications: command and dictation)</LI>
</UL>
<H2>
3.2 Audio Server</H2>
<UL>
<LI>
Separate process using System V IPC (shared memory, semaphores)</LI>
<LI>
Get audio data directly from /dev/audio (16-bit linear, 16 kHz)</LI>
<LI>
Do speech/non-speech filtering on coarse spectrum data</LI>
<LI>
Perform end-pointing based on energy first, then do an FFT and end-pointing based
on pitch detection (a sketch of the energy step follows after this list).</LI>
<BR>(Ensure that low-energy phonemes are not cut away at utterance boundaries!)
<LI>
Option: only listen to the microphone while the user holds down a specified key
(e.g. Ctrl)?</LI>
<LI>
Option: keep the average power spectrum and the average log-power spectrum. These give
the noise estimate and the channel estimate (but require "adjustment
recordings", e.g. before data collection!)</LI>
</UL>
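<P>A possible sketch of the energy-based end-pointing step above, assuming frames of
16-bit linear samples already read from /dev/audio. The frame length, threshold and
hangover count are illustrative assumptions, not project decisions:
<PRE>
#include &lt;math.h>

// Hypothetical energy-based end-pointer; all constants are assumptions.
class EndPointer {
public:
  EndPointer() : in_speech(false), silent_frames(0) {}

  // frame: 16-bit linear samples, e.g. 160 samples = 10 ms at 16 kHz.
  // Returns true while the utterance is considered "speech".
  bool update(const short *frame, int n) {
    const double threshold_db = -35.0;   // assumed energy threshold
    const int    hangover     = 30;      // assumed hangover, in frames

    double energy = 1e-9;
    for (int i = 0; i &lt; n; i++)
      energy += (double)frame[i] * frame[i];
    double db = 10.0 * log10(energy / n / (32768.0 * 32768.0));

    if (db > threshold_db) {             // loud enough: inside speech
      in_speech = true;
      silent_frames = 0;
    } else if (in_speech &amp;&amp; ++silent_frames > hangover) {
      in_speech = false;                 // the hangover keeps low-energy phonemes
    }                                    // at utterance boundaries from being cut
    return in_speech;
  }

private:
  bool in_speech;
  int  silent_frames;
};
</PRE>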
<H2>
3.3 Audio Data</H2>
<UL>
<LI>
16-bit linear / 16 kHz</LI>
</UL>
<H1>
<HR WIDTH="100%"><A NAME="Front-end"></A>4. Front-end</H1>
<UL>
<LI>
FFT</LI>
<LI>
Mel scale</LI>
<LI>
Cepstrum (DCT on log Mel; see the sketch below)</LI>
<LI>
Delta ceps</LI>
<LI>
(cepstral mean subtraction)</LI>
<LI>
(Noise reduction / channel equalization?)</LI>
<LI>
(Neural network based on power spectrum?)</LI>
</UL>
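<P>To make the chain above concrete, here is a rough sketch of the Mel/cepstrum step
(log Mel filter-bank energies followed by a DCT), assuming the power spectrum of one
frame has already been computed by the FFT stage; the filter shapes and sizes are
simplified assumptions:
<PRE>
#include &lt;math.h>
#include &lt;vector>

// Sketch only: triangular Mel filter-bank + DCT cepstrum for one frame.
// power[k] holds the power spectrum for bins 0..N/2 (output of the FFT stage).
std::vector&lt;double> mel_cepstrum(const std::vector&lt;double> &amp;power,
                                 double sample_rate, int n_filters, int n_ceps)
{
  const int nbins = (int)power.size();
  const double mel_max = 2595.0 * log10(1.0 + (sample_rate / 2.0) / 700.0);
  const double pi = 3.14159265358979323846;

  // Log energy of each triangular Mel filter (filters evenly spaced on the Mel scale).
  std::vector&lt;double> logmel(n_filters);
  for (int m = 0; m &lt; n_filters; m++) {
    double lo  = 700.0 * (pow(10.0,  m      * mel_max / ((n_filters + 1) * 2595.0)) - 1.0);
    double mid = 700.0 * (pow(10.0, (m + 1) * mel_max / ((n_filters + 1) * 2595.0)) - 1.0);
    double hi  = 700.0 * (pow(10.0, (m + 2) * mel_max / ((n_filters + 1) * 2595.0)) - 1.0);
    double e = 1e-10;                           // floor value for numeric stability
    for (int k = 0; k &lt; nbins; k++) {
      double f = k * (sample_rate / 2.0) / (nbins - 1);
      double w = 0.0;
      if (f > lo &amp;&amp; f &lt;= mid)      w = (f - lo) / (mid - lo);
      else if (f > mid &amp;&amp; f &lt; hi)  w = (hi - f) / (hi - mid);
      e += w * power[k];
    }
    logmel[m] = log(e);
  }

  // DCT of the log Mel energies gives the cepstral coefficients.
  std::vector&lt;double> ceps(n_ceps);
  for (int i = 0; i &lt; n_ceps; i++) {
    double s = 0.0;
    for (int m = 0; m &lt; n_filters; m++)
      s += logmel[m] * cos(pi * i * (m + 0.5) / n_filters);
    ceps[i] = s;
  }
  return ceps;
}
</PRE>
<P>Cepstral mean subtraction and the delta coefficients would then be computed over the
sequence of such vectors.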
<H1>
<HR WIDTH="100%"><A NAME="Recognition"></A>5. Speech Recognition Unit</H1>
<H2>
5.1 Data Files</H2>
<UL>
<LI>
Speaker database</LI>
<LI>
Vocabulary</LI>
<LI>
Dictionary</LI>
<LI>
Language model</LI>
<UL>
<LI>
N-gram</LI>
<LI>
Grammar (grammar editor)</LI>
</UL>
</UL>
<H2>
5.2 Acoustic Model / HMM</H2>
<H4>
5.2.1 Components</H4>
<UL>
<LI>
<B>HMMGraph</B> with <B>HMMStates</B> as nodes.</LI>
<LI>
One phoneme consists of three <B>HMMStates</B>: "->(begin)->(middle)->(end)->".</LI>
<LI>
<B>HMMStateSet</B> (knows all <B>HMMStates</B>)</LI>
<UL>
<LI>
allows for an update of its components</LI>
</UL>
<LI>
<B>State</B></LI>
<BR>Subclasses:
<UL>
<LI>
<B>HMMState</B></LI>
<LI>
<B>VocabTreeState</B></LI>
</UL>
<LI>
<B>HMMStates</B> have a generic interface (sketched after this list) including:</LI>
<UL>
<LI>
unique identifier</LI>
<LI>
link to an acoustic model (e.g. mixture distribution) which allows for
frame scoring</LI>
<LI>
scoring method</LI>
<LI>
accumulation of sufficient statistics</LI>
</UL>
<LI>
Examples of <B>HMMState</B> implementations: <B>HMMStateMixture</B>, <B>HMMStateNN</B></LI>
<LI>
<B>HMMStateMixture</B>:</LI>
<UL>
<LI>
mixture coefficients</LI>
<LI>
vector of pointers to its density models</LI>
<LI>
extension: store pointers to tied resources (like common covariances or
transformations of the feature space)</LI>
</UL>
<LI>
<B>AcousticModelSet</B> knows all used acoustic models and manages their
names as well as loading/saving!</LI>
</UL>
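<P>The generic <B>HMMState</B> interface above could look roughly as follows in C++;
the class names are the ones used in this document, but the exact signatures are only
assumptions:
<PRE>
#include &lt;vector>

// Sketch of the state hierarchy; signatures are assumptions, not decisions.
class State {
public:
  virtual ~State() {}
};

class HMMState : public State {
public:
  explicit HMMState(int id) : id_(id) {}
  int id() const { return id_; }                        // unique identifier

  // Log-score of one feature frame against the linked acoustic model.
  virtual double score(const std::vector&lt;float> &amp;frame) const = 0;

  // Accumulate sufficient statistics for this frame, weighted by gamma
  // (the Baum-Welch or Viterbi target).
  virtual void accumulate(const std::vector&lt;float> &amp;frame, double gamma) = 0;

  // Re-estimate the parameters from the accumulated statistics
  // (called by HMMStateSet when it updates its components).
  virtual void update() = 0;

private:
  int id_;
};

// One possible implementation: a Gaussian mixture state.
// (Bodies omitted; only the data layout from section 5.2.1 is shown.)
class HMMStateMixture : public HMMState {
public:
  HMMStateMixture(int id) : HMMState(id) {}
  double score(const std::vector&lt;float> &amp;frame) const;
  void accumulate(const std::vector&lt;float> &amp;frame, double gamma);
  void update();
private:
  std::vector&lt;double> mixture_coefficients;
  // plus a vector of pointers to its density models and, as an extension,
  // pointers to tied resources (common covariances, feature transformations)
};
</PRE>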
<H4>
5.2.2 Data Structures Needed</H4>
<UL>
<LI>
binary decision <B>tree</B>: used for clustering - one tree per sub-polyphone</LI>
<LI>
<B>Label</B>: a label for an utterance is the alignment of its frames to
the corresponding HMM states</LI>
<LI>
<B>Database</B>: efficient indexing of audio data files and/or labels</LI>
</UL>
<H4>
5.2.3 Training</H4>
(given an utterance and its HMM graph)
<P>1. produce targets for the states (Baum-Welch, Viterbi)
<BR>2. accumulate statistics, weighted by the targets
<P><U>in other words</U>:
<BR>1. calculate the Viterbi alignment ("label") once, and store it
<BR>2. k-means initialisation of the mixture models
<BR>3. some EM iterations to fit the mixture models (separated into an accumulation
phase and an update step)
<BR>4. goto 1.
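<P>In code, the outer loop could be organised as below. Everything here is a
hypothetical stand-in for illustration; only the accumulate/update split and the step
numbering come from the list above:
<PRE>
#include &lt;vector>

// Hypothetical stand-in types (see 5.2.1 and 5.2.2 for the real components).
struct Utterance {};                    // audio data + transcription
struct Label {};                        // frame-to-state alignment
struct Database { std::vector&lt;Utterance*> utts; };
struct HMMStateSet {};                  // knows all HMMStates

// Hypothetical helpers, declared only:
Label viterbi_align(const Utterance &amp;u, HMMStateSet &amp;s);              // step 1
void  kmeans_init(const Database &amp;db, HMMStateSet &amp;s);                // step 2
void  accumulate(const Utterance &amp;u, const Label &amp;l, HMMStateSet &amp;s); // step 3a
void  update(HMMStateSet &amp;s);                                         // step 3b

void train(Database &amp;db, HMMStateSet &amp;states, int outer_iters, int em_iters)
{
  for (int it = 0; it &lt; outer_iters; it++) {
    // 1. calculate one Viterbi alignment ("label") per utterance and store it
    std::vector&lt;Label> labels;
    for (int i = 0; i &lt; (int)db.utts.size(); i++)
      labels.push_back(viterbi_align(*db.utts[i], states));

    // 2. k-means initialisation of the mixture models (first pass only)
    if (it == 0)
      kmeans_init(db, states);

    // 3. a few EM iterations, separated into an accumulation phase and an update step
    for (int em = 0; em &lt; em_iters; em++) {
      for (int i = 0; i &lt; (int)db.utts.size(); i++)
        accumulate(*db.utts[i], labels[i], states);
      update(states);
    }
    // 4. goto 1. with the improved models
  }
}
</PRE>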
<H4>
5.2.4 Extensions</H4>
<UL>
<LI>
Context-dependent models (per subphone):</LI>
<UL>
<LI>
build a distribution tree using a set of linguistic questions and an information
criterion</LI>
<LI>
the leaves contain the states</LI>
<BR>-> given phone+context the correct state is determined by running down
the tree (see the sketch after this list) ...
<UL>
<LI>
during HMM building (for training)</LI>
<LI>
during vocabulary tree building (for decoding) .... (WORD END STATES!!!)</LI>
</UL>
</UL>
<LI>
MLLR (maximum likelihood linear regression, Leggetter/Woodland 1995)</LI>
<LI>
VTLN (vocal tract length normalization)</LI>
<LI>
Matthias' shared transformations</LI>
</UL>
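<P>Finding the state for a phone in a given context then amounts to walking the question
tree from the root to a leaf. A minimal sketch, where the context and question
representations are assumptions:
<PRE>
#include &lt;string>
#include &lt;vector>

// Assumed representation of a phone in context.
struct PhoneContext {
  std::string phone;
  std::vector&lt;std::string> left, right;    // neighbouring phones
};

// One node of the binary decision tree used for clustering (one tree per sub-polyphone).
struct TreeNode {
  bool (*question)(const PhoneContext &amp;);  // linguistic question, e.g. "left neighbour a vowel?"
  TreeNode *yes, *no;                      // children; both null at a leaf
  int state_id;                            // valid at leaves only: the clustered HMM state

  // Run down the tree until a leaf is reached.
  int lookup(const PhoneContext &amp;ctx) const {
    if (yes == 0)                          // leaf contains the state
      return state_id;
    return question(ctx) ? yes->lookup(ctx) : no->lookup(ctx);
  }
};
</PRE>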
<H4>
5.2.5 Misc</H4>
<UL>
<LI>
Phoneme representation: ASCII (conversion tool for Unicode?)</LI>
<LI>
Important: ensure numeric stability (see the sketch below):</LI>
<UL>
<LI>
use floor values</LI>
<LI>
SVD instead of matrix inversion</LI>
<LI>
log values</LI>
<LI>
covariance matrices with all eigenvalues positive (positive definite)</LI>
</UL>
</UL>
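<P>Two of these points in code form: a log-domain addition that avoids underflow, and a
simple variance floor (the floor value itself is an assumption):
<PRE>
#include &lt;math.h>
#include &lt;vector>

// Add two probabilities given as log values without leaving the log domain.
inline double log_add(double log_a, double log_b) {
  if (log_a &lt; log_b) { double t = log_a; log_a = log_b; log_b = t; }
  return log_a + log(1.0 + exp(log_b - log_a));  // stays finite even when
}                                                // exp(log_b) alone would underflow

// Floor the diagonal of a covariance estimate so that no variance collapses
// to (or below) zero; the floor value is an illustrative assumption.
inline void floor_variances(std::vector&lt;double> &amp;diag_cov, double floor_val) {
  for (int i = 0; i &lt; (int)diag_cov.size(); i++)
    if (diag_cov[i] &lt; floor_val) diag_cov[i] = floor_val;
}
</PRE>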
<H2>
5.3 Decoding</H2>
<UL>
<LI>
<B>VocabTree</B></LI>
<LI>
<B>VocabTreeStates</B></LI>
<BR>facilities to support the decoder:
<UL>
<LI>
link back to the phone and word of the state</LI>
</UL>
<LI>
Viterbi beam search (sketched after this list)</LI>
<LI>
Performance enhancements:</LI>
<UL>
<LI>
never evaluate all mixture components but use <I>Gaussian Selection</I>
or <I>BBI</I> (<I>Bucket Box Intersection</I>)</LI>
<LI>
precomputation where possible!</LI>
</UL>
</UL>
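<P>A minimal sketch of a frame-synchronous Viterbi beam search over a state graph. The
graph layout and scoring interface are assumptions; "score" would call the acoustic
models of section 5.2, and word/phone back-links and traceback are omitted:
<PRE>
#include &lt;vector>

struct SearchState {
  std::vector&lt;int> next;                 // indices of successor states
  // a real VocabTreeState would also link back to its phone and word
};

const double LOG_ZERO = -1e30;

// One Viterbi pass; returns the best log score per state after the last frame.
std::vector&lt;double> viterbi_pass(const std::vector&lt;SearchState> &amp;graph,
                                 const std::vector&lt;std::vector&lt;float> > &amp;frames,
                                 double (*score)(int state, const std::vector&lt;float> &amp;frame),
                                 double beam)
{
  std::vector&lt;double> cur(graph.size(), LOG_ZERO), nxt;
  cur[0] = 0.0;                          // assumed single start state (index 0)

  for (int t = 0; t &lt; (int)frames.size(); t++) {
    // best score of the previous frame, used for beam pruning
    double best = LOG_ZERO;
    for (int s = 0; s &lt; (int)cur.size(); s++)
      if (cur[s] > best) best = cur[s];

    nxt.assign(graph.size(), LOG_ZERO);
    for (int s = 0; s &lt; (int)cur.size(); s++) {
      if (cur[s] &lt; best - beam) continue;          // pruned: outside the beam
      for (int j = 0; j &lt; (int)graph[s].next.size(); j++) {
        int d = graph[s].next[j];
        double v = cur[s] + score(d, frames[t]);   // acoustic score only (no LM here)
        if (v > nxt[d]) nxt[d] = v;                // keep the best predecessor
      }
    }
    cur.swap(nxt);
  }
  return cur;
}
</PRE>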
<H2>
5.4 Connection To Audio Server</H2>
<UL>
<LI>
first call front_end->new_token()</LI>
<LI>
then enter a loop:</LI>
<BR>1) front_end->get_new_frames()
<BR>2) score the returned frames
<BR>(the audio module sleeps if its buffer is empty!)</UL>
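<P>A hypothetical sketch of that loop; only new_token() and get_new_frames() are named
in this document, the surrounding types and the end-of-utterance convention are assumed:
<PRE>
#include &lt;vector>

typedef std::vector&lt;float> Frame;

// Assumed placeholder interfaces.
struct FrontEnd {
  void new_token();                       // start of a new utterance
  std::vector&lt;Frame> get_new_frames();    // audio module sleeps while its buffer is empty
};
struct Decoder {
  void score(const Frame &amp;f);             // frame-synchronous scoring (section 5.3)
};

void recognise_one_utterance(FrontEnd *front_end, Decoder *decoder)
{
  front_end->new_token();                 // start a new token/utterance
  for (;;) {
    std::vector&lt;Frame> frames = front_end->get_new_frames();   // 1)
    if (frames.empty())                   // assumed end-of-utterance signal
      break;
    for (int i = 0; i &lt; (int)frames.size(); i++)
      decoder->score(frames[i]);          // 2) score the returned frames
  }
}
</PRE>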
<H1>
<HR WIDTH="100%"><A NAME="Collection"></A>6. Data Collection</H1>
<UL>
<LI>
A sound recording program is needed that allows for dictated speech.</LI>
<LI>
One sentence after another.</LI>
<LI>
Language independent</LI>
<LI>
Auto-transcription of recorded utterances</LI>
<LI>
Randomly chosen sentences from a text database</LI>
<LI>
Internet-able (send tarred speaker data to "an" archive)</LI>
<LI>
Store user name, user email, description of the microphone used, gender, language,
dialect (region)</LI>
<LI>
Discuss the license policy for data files!</LI>
</UL>
<H1>
<HR WIDTH="100%"><A NAME="Integration"></A>7. System Integration</H1>
<UL>
<LI>
A daemon that acts as a "speech server" (like X?) and a library that defines
an API to communicate with the server (Xlib?) - see the hypothetical sketch after this list</LI>
<LI>
Two modules in the server connected with Sys V IPC:</LI>
<UL>
<LI>
sound recording</LI>
<LI>
front-end + decoding</LI>
</UL>
<LI>
Defining a Tcl interface to the C++ classes (via "swig") allows for developing
scripts for file management/training/clustering/distributed computing and
so on at a very high level of abstraction!</LI>
<LI>
<A HREF="http://www.kde.org/">KDE</A>/<A HREF="http://www.gnome.org/">gnome</A>
interface (via additions to dialog/menu classes?)</LI>
<LI>
Connection to existing apps (e.g. emacs)</LI>
<LI>
Send the result to the focused application, or specify the app in the utterance? (multiple
language models that change with the focus!)</LI>
<LI>
xdm (speaker verification)</LI>
</UL>
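<P>By analogy with X/Xlib, the client library might expose something like the following;
every name in this sketch is purely hypothetical, nothing here is decided:
<PRE>
#include &lt;string>

// Purely hypothetical client-side API sketch (Xlib analogy).
class SpeechConnection {
public:
  // Connect to the local speech server daemon.
  static SpeechConnection *open(const std::string &amp;server);

  // Select the active language model (grammar or N-gram) for this client,
  // e.g. when the application gains or loses the focus.
  void set_language_model(const std::string &amp;name);

  // Blocking call: wait for the next recognised utterance.
  std::string get_result();

  void close();
};
</PRE>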
<H1>
<HR WIDTH="100%"><A NAME="TODO"></A>8. TODO</H1>
<UL>
<LI>
System definition (interfaces)</LI>
<LI>
Divide the work to be done</LI>
<LI>
Create the CVS repository</LI>
<LI>
(find people willing to contribute)</LI>
</UL>
<HR WIDTH="100%">
<BR><I><A HREF="mailto:valj01@gel.usherb.ca">Jean-Marc Valin</A></I>, Universit&eacute;
de Sherbrooke
<BR>$Date: 1999-01-21 01:15:37 +0100 (Thu, 21 Jan 1999) $
<BR>&nbsp;
</BODY>
</HTML>