# Warning: Do not use this code.
Findx is shutting down. Please read https://privacore.github.io/

# Gigablast - an open source search engine
An open source web and enterprise search engine and spider/crawler.

This is a fork of the original Gigablast project available at https://github.com/gigablast/open-source-search-engine/. This version is heavily modified by Privacore and tailored for our use. It is *not* a drop-in replacement for the original Gigablast.
## Modifications by Privacore
Our aim is *not* to maintain backwards compatibility with the original Gigablast data files.

| Feature | Description |
| ------------- | ------------- |
| Multi-threading | Many improvements to multi-threading, plus general optimizations. |
| Stability | Numerous general bugfixes and major improvements in thread safety. |
| Data formats | Posdb is being changed to store the entries for a page in a single Posdb file, rather than spreading the entries across multiple files, merging the data in memory and handling delete keys at query time. A new index file will point to the file containing the newest version of a document. |
| | Spiderdb is modified to use an sqlite3 database instead of the RDB format. |
| Data file merging | Our version uses a dedicated drive for merging, instead of merging and deleting part files on-the-fly on the same data drive. We create a completely merged file on the merge drive, temporarily make GB use that file for queries, delete the original files, copy the newly merged file back to the 'production drive', switch query handling back to that drive and delete the temporary file. The merge drive must be big enough to hold at least one instance's posdb data. |
| Alerting | Start script improved to send alerts if GB crashes (and to avoid successive coredumps by staying down for analysis). |
| Trace log | Lots of options to add very detailed trace logging to different parts of the code. |
| Summaries | Improvements in search results summary generation. |
| Language detection | Google's CLD2 library integrated to improve language detection. |
| Code removed | About half of the original source has been removed, e.g. diffbot/eventguru/buzzlogic/seo specific integrations. |
| Disk space | Lots of 'junk' removed from the Posdb data files, reducing space usage significantly. This means that if you use our version with old Gigablast data files, data will not be cleaned up correctly when re-indexing a page. You will need to rebuild the Posdb data files. |
| Ranking | Ranking weights made configurable. |
| ... | and much more... |
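
The merge sequence from the "Data file merging" row can be sketched as a short shell function. This is an illustration under stated assumptions, not GB's actual implementation: the `part-*.dat` and `merged.dat` file names, the directory arguments and the `active` symlink (standing in for "the file queries are served from") are all hypothetical, and `cat` only stands in for GB's real merge of index records.

```shell
# Hypothetical sketch of the merge-drive sequence. File names and the
# "active" symlink are illustrative and not taken from the GB source.
merge_via_merge_drive() {
    prod_dir=$1    # production drive holding the part files
    merge_dir=$2   # dedicated merge drive
    active=$3      # symlink: the file that queries are currently served from

    # 1. Create a completely merged file on the merge drive.
    cat "$prod_dir"/part-*.dat > "$merge_dir/merged.dat" || return 1

    # 2. Temporarily serve queries from the merged file on the merge drive.
    ln -sf "$merge_dir/merged.dat" "$active" || return 1

    # 3. Delete the original part files.
    rm -f "$prod_dir"/part-*.dat

    # 4. Copy the newly merged file back to the production drive.
    cp "$merge_dir/merged.dat" "$prod_dir/merged.dat" || return 1

    # 5. Switch query handling back to the production drive.
    ln -sf "$prod_dir/merged.dat" "$active" || return 1

    # 6. Delete the temporary file on the merge drive.
    rm -f "$merge_dir/merged.dat"
}
```

The point of the ordering is that queries always have a complete file to read from: the originals are deleted only after the merged copy is being served, and the temporary copy is deleted only after the production copy is back in place.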
## Migrating Gigablast to our fork

| Step | Description |
| ------------- | ------------- |
| Backup! | There, you have been warned. |
| Build | git clone https://github.com/privacore/open-source-search-engine.git <br>cd open-source-search-engine <br>git submodule init <br>git submodule update <br>make -j4 <br>make dist |
| Copy | Stop your running GB instances. Copy the files contained in the new gb-[date]-[rev].tar.gz file to your GB instance 0. |
| Install | Go to your GB instance 0 and run './gb install' to copy the binary and needed files to all instances. |
| Remove files | Remove the posdb files from your collections. |
| Convert files | Convert the spiderdb files to sqlite3 format by running './gb convertspiderdb'. |
| Start | Run './gb start' from your instance 0 and you should be on your way. |
| Rebuild | Rebuild the posdb data files through the web UI. This is needed because we store less data in posdb than the original version, and GB cannot clean this 'junk' data up when re-indexing pages. |
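
The command-line steps from the table can be wrapped in a small script so they can be reviewed before anything runs. This is a minimal sketch: the `run` helper, the `DRY_RUN` variable and the function names are assumptions made for illustration, while the `git`, `make` and `./gb` commands are the ones listed in the table. Backing up, copying the tarball contents, removing the posdb files and rebuilding through the web UI remain manual.

```shell
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# "Build" step.
build_fork() {
    run git clone https://github.com/privacore/open-source-search-engine.git &&
    run cd open-source-search-engine &&
    run git submodule init &&
    run git submodule update &&
    run make -j4 &&
    run make dist
}

# "Install" step. Run from the GB instance 0 directory, after stopping GB and
# copying in the files from the gb-[date]-[rev].tar.gz archive.
install_instance0() {
    run ./gb install
}

# "Convert files" and "Start" steps. Remove the posdb files from your
# collections by hand before calling this.
convert_and_start() {
    run ./gb convertspiderdb &&
    run ./gb start
}
```

Calling `build_fork` with the default settings only prints the six build commands; setting `DRY_RUN=0` first executes them.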
## SUPPORTED PLATFORMS
### Primary:
* Ubuntu 16.04, g++ 5.4.0, Python 2.7.6

### Secondary:
* OpenSuSE 13.2, GCC 4.8.3
* OpenSuSE 42.2, GCC 6.2.1
* Fedora 25, GCC 6.3.1
## DEPENDENCIES
### Compilation
#### Ubuntu
* g++
* make
* cmake
* python
* libpcre3-dev
* libssl-dev
* libprotobuf-dev
* protobuf-compiler
* libsqlite3-dev

#### OpenSuSE
* g++
* make
* cmake
* python
* pcre-devel
* libssl-dev
* protobuf-devel
* libprotobuf13

#### Fedora
* g++
* make
* cmake
* python
* pcre-devel
* openssl-devel
* protobuf-devel
* protobuf-compiler
* sqlite-devel
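
Before building on any of the platforms above, it can help to confirm that the command-line build tools are on `PATH`. The sketch below is a convenience check, not part of the project: the function names are hypothetical, and `protoc` is assumed to be the binary provided by the protobuf compiler package. Library packages (pcre, ssl, protobuf, sqlite) are not commands and still need to be verified with your package manager.

```shell
# Print the names of the given commands that are not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the build tools common to all three compilation lists.
check_build_tools() {
    missing=$(missing_tools g++ make cmake python protoc)
    if [ -n "$missing" ]; then
        printf 'Missing build tools:\n%s\n' "$missing" >&2
        return 1
    fi
}
```

On Ubuntu, the packages from the list above can be installed in one go with `sudo apt-get install g++ make cmake python libpcre3-dev libssl-dev libprotobuf-dev protobuf-compiler libsqlite3-dev`.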
### Runtime
* Multi-instance installations require [Vagus](https://github.com/privacore/vagus) for keeping track of which instances are dead and alive.

#### Ubuntu
* libssl1.0.0
* libpcre3
* libprotobuf9v5
## RUNNING GIGABLAST
See <a href="html/faq.html">html/faq.html</a> for all administrative documentation including the quick start instructions.

Alternatively, visit http://www.gigablast.com/faq.html

## CODE ARCHITECTURE
See <a href="html/developer.html">html/developer.html</a> for all code documentation.

Alternatively, visit http://www.gigablast.com/developer.html
## SUPPORT
Privacore does not provide paid support for Gigablast. We refer you to the original project at https://github.com/gigablast/open-source-search-engine/ and its owner, Matt Wells. He offers a Pro version for purchase that includes support options.

We provide limited support for our fork, primarily for active contributors.